EVENT AGENDA
Event times below are displayed in PT.
Revolutionizing Video Creation and Editing: Insights from Engineering Leaders on the Present and Future of Generative Video Models
Hear video generation model experts together on the same panel for the first time, discussing the future of generative AI video models.
Image, video, and audio generation are fundamental building blocks for Generative AI research and applications in the real world. In this talk, I'll present Movie Gen, a set of foundational models for video generation, editing, personalization and audio generation. Movie Gen models are among the world's most advanced media generation models, with state-of-the-art results compared to industry solutions. I'll focus my talk on text-to-video generation and share key insights that enabled this step change in quality. Movie Gen produces HD-quality videos of up to 16 seconds in length and has been used by movie producers in Hollywood.
At Captions, we believe that anyone can become a video creator, regardless of their experience. This mission presents both product and technical challenges as we bridge the gap between cutting-edge video generation capabilities and a user-friendly experience. In this presentation, we'll take a deep dive into Captions and our underlying technical systems. Lastly, we'll demonstrate how we're evolving these systems to empower users to create videos that resonate globally.
In this talk, we will show how we implemented a media processing pipeline to perform media inference (autodub / lipsync) at Meta scale.
We will focus on the challenges we faced from a media processing and scaling point of view, such as inference latency and scheduling, voice isolation, media timing/alignment, alternate track delivery, instrumentation, and model evaluation.
An efficient and flexible video processing pipeline is critical for enabling innovation and supporting both our streaming service and studio partners, which is essential for Netflix's continued success. Over the past few years, we have been rebuilding this pipeline on our next-generation microservice-based platform. In this talk, we will share our journey and learnings with the community.
Various video upsampling technologies are adaptively applied to video playback on mobile clients. The playback quality of video streamed to end users is constrained mostly by the device's network bandwidth or by the low quality of the original content. Applying these advanced upsampling technologies during on-device playback improves the quality of the streamed video and, at the same time, helps users save cellular/Wi-Fi data by playing video at a lower bitrate.
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks.
In the past couple of years, the landscape of image and video generation has transformed dramatically. Despite such phenomenal progress, rigorous and holistic evaluation of generative models continues to lag. This is primarily due to the multi-faceted and highly subjective nature of the task: the generated image / video should be evaluated not just on overall visual quality and aesthetics, but also on its alignment to the input prompt, its originality, its avoidance of propagating stereotypical biases, and several more factors.
In this talk, I’ll give an overview of current metrics, their shortcomings, and the rapid progress in the research community to improve the rigor in evaluation.
Viewer experience is a complex outcome of many competing dimensions, from encoding quality to network performance of the video pipeline. Metrics like Zero Buffer Rates (ZBR) illuminate the impact of various components in the end-to-end pipeline on the viewer experience. This talk will explore the essential elements of operational excellence in video infrastructure, working backwards from the viewer’s perspective and into the video pipeline.
While video input pre-filtering is not a new concept, it has evolved significantly in recent years. Originally designed to reduce noise and artifacts in low-quality video, pre-filtering techniques have now been adapted for use even on pristine video content. This talk will explore the importance of video input pre-filtering and the specific advantages it can offer in modern video production and delivery workflows.
Key benefits of advanced input pre-filtering include reduced file sizes without sacrificing perceived video quality, as well as mitigation of common video quality issues like softness and deblocking artifacts. However, implementing a production-ready, broadcast-grade pre-filtering solution requires careful consideration of several critical factors.
This talk will dive into the core pillars of the newly released video input pre-filter in AWS Elemental's media processing solutions. It will explain how this advanced pre-filtering technology can help video providers deliver high-quality, efficiently encoded video for a wide range of applications, from over-the-top streaming to broadcast television. Attendees will come away with a deep understanding of the evolving role of pre-filtering in the video technology landscape and practical insights they can apply to their own workflows.