
Seaweed APT2

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Real-Time Streaming Video Generation
Introducing Seaweed APT2, a streaming video generation model for real-time interactive applications. Our model autoregressively generates a single latent frame (4 video frames) at a time, allowing the result to be streamed to the user with minimal latency. Each latent frame is generated with only a single network forward evaluation (1NFE). With the KV cache, this 1NFE cost per step is maintained even for long videos. As a result, our 8-billion-parameter model achieves real-time, 24fps, nonstop streaming video generation at 736x416 resolution (equivalent pixel count to 640x480) on a single H100 GPU, significantly faster than existing approaches. Our model is trained for image-to-video (I2V) generation, where the user provides the initial frame, a setting commonly employed by interactive applications. We showcase some 1-minute (1440-frame) videos generated by our model below.
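To make the real-time constraint concrete, the back-of-the-envelope calculation below (our own illustration, not from the paper) shows the latency budget each 1NFE step must meet: with 4 video frames per latent frame and 24fps playback, the model must produce 6 latent frames per second.

```python
# Per-step latency budget implied by the streaming setup described above.
# The constants restate figures from the text; the calculation itself is ours.

FRAMES_PER_LATENT = 4     # video frames decoded from one latent frame
TARGET_FPS = 24           # real-time playback rate

steps_per_second = TARGET_FPS / FRAMES_PER_LATENT    # 6 autoregressive steps per second
budget_ms_per_step = 1000 / steps_per_second          # ~166.7 ms per step

print(f"Each 1NFE step (generator forward pass plus decoding) must finish "
      f"within ~{budget_ms_per_step:.0f} ms to sustain {TARGET_FPS} fps.")
```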
736x416 24fps
Real-time on 1xH100
752x400 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
736x416 24fps
Real-time on 1xH100
High-Resolution Streaming Generation
Our model can also achieve real-time, 24fps streaming generation of 1280x720-resolution videos (or other aspect ratios with an equivalent pixel count) on 8xH100 GPUs. We showcase some 1-minute, high-resolution, nonstop streaming results below. Whereas APT1 can only generate 49 frames, APT2 generates 1440 frames while maintaining 1NFE compute!
1472x608 24fps
Real-time on 8xH100
1552x576 24fps
Real-time on 8xH100
1280x720 24fps
Real-time on 8xH100
1232x688 24fps
Real-time on 8xH100
1312x688 24fps
Real-time on 8xH100
1280x720 24fps
Real-time on 8xH100
Interactive Virtual Human Generation
Our model can additionally receive real-time user inputs as interactive controls. We demonstrate its real-time interactive generation capability for virtual human generation, where the user provides the initial frame to establish the subject identity and controls the virtual human in real time through pose conditions. The pose condition input is shown at the bottom-right corner of each video; hover to see the overlay.
688x384 24fps
Real-time on 1xH100
688x384 24fps
Real-time on 1xH100
688x384 24fps
Real-time on 1xH100
688x384 24fps
Real-time on 1xH100
688x384 24fps
Real-time on 1xH100
Interactive World Exploration
We also demonstrate our model's real-time interactive generation capability for camera-controlled world exploration, where the user provides the initial frame to establish the environment and controls the camera to explore the virtual world. The camera input embeddings given to the model are shown at the bottom-left corner of each video; they represent the precise change in camera displacement (left/right, up/down, forward/backward) and orientation (ray direction in XYZ coordinates). The camera trajectories are shown at the bottom-right corner. Hover to see the full details.
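As a rough illustration of how such a camera condition could be assembled, the sketch below packs the displacement change and ray direction into a single conditioning vector. The 6-D layout, the projection layer, and the 512-dimensional width are our own assumptions; the text above only specifies which quantities the embeddings represent.

```python
import torch
from torch import nn

# Hypothetical projection of the 6-D camera control into the model's channel width.
camera_proj = nn.Linear(6, 512)

def make_camera_embedding(delta_xyz: torch.Tensor, ray_dir_xyz: torch.Tensor) -> torch.Tensor:
    """delta_xyz: (3,) change in camera displacement (left/right, up/down, forward/backward).
    ray_dir_xyz: (3,) viewing-ray direction in XYZ coordinates."""
    control = torch.cat([delta_xyz, ray_dir_xyz])   # (6,) raw per-step control signal
    return camera_proj(control)                     # (512,) conditioning vector for one step

# Example: step slightly forward while looking straight ahead.
emb = make_camera_embedding(torch.tensor([0.0, 0.0, 0.1]),
                            torch.tensor([0.0, 0.0, 1.0]))
```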
640x352 24fps
Real-time on 1xH100
640x352 24fps
Real-time on 1xH100
640x352 24fps
Real-time on 1xH100
640x352 24fps
Real-time on 1xH100
656x368 24fps
Real-time on 1xH100
640x352 24fps
Real-time on 1xH100
656x368 24fps
Real-time on 1xH100
640x352 24fps
Real-time on 1xH100
Architecture
Unlike diffusion forcing or next-token prediction, we propose adversarial training as a new paradigm for autoregressive generation. Our architecture is LLM-like and can fully exploit the KV cache, but it is trained with an adversarial objective to generate an entire frame of tokens at once from sampled noise.
The generator uses block-causal attention that attends only to previous and current frames. A sliding window and KV cache keep the inference speed constant. Previous results are recycled as input for the next autoregressive step. During training, the generator runs recurrently in a student-forcing fashion; at inference, it runs autoregressively, nonstop, to generate full-minute videos.
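A minimal sketch of this streaming loop, under our own assumptions about shapes and window size, is shown below: each step consumes fresh noise plus the recycled previous latent frame, the cache is truncated to a sliding window, and the generator stand-in would be replaced by the actual 1NFE forward pass.

```python
import torch

WINDOW = 32                    # sliding attention window, in latent frames (assumed value)
LATENT_SHAPE = (16, 52, 92)    # (channels, height, width) of one latent frame (assumed)

def generator_step(noise, recycled, kv_cache):
    """Stand-in for one 1NFE forward pass of the block-causal generator.
    A real implementation runs the transformer here and caches its keys/values;
    we return the noise unchanged and cache the latent as a placeholder."""
    new_cache = (kv_cache + [recycled])[-WINDOW:]    # keep only the sliding window
    return noise, new_cache

def stream(first_latent, num_steps):
    kv_cache, prev = [], first_latent
    for _ in range(num_steps):
        noise = torch.randn(LATENT_SHAPE)
        prev, kv_cache = generator_step(noise, prev, kv_cache)   # recycle output as next input
        yield prev                                                # stream to the user

for latent in stream(torch.randn(LATENT_SHAPE), num_steps=360):  # 360 steps ≈ 1 minute at 24fps
    pass  # a real pipeline would decode `latent` into 4 video frames here
```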
The discriminator uses the same causal architecture as the generator. It evaluates all frames in parallel and produces a logit for each frame. A relativistic GAN loss and approximated R1 + R2 regularization are used.
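The sketch below illustrates one plausible form of this per-frame objective: a relativistic pairing of real and fake logits, with the R1/R2 penalties approximated by perturbing the inputs and penalizing the change in logits. The exact formulation, the noise scale `sigma`, and the weight `lam` are our assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, sigma=0.01, lam=100.0):
    """Relativistic discriminator loss with approximated R1 (real) / R2 (fake) penalties.
    D maps a batch of frames to one logit per frame; sigma and lam are assumed values."""
    real_logits, fake_logits = D(real), D(fake)
    adv = F.softplus(-(real_logits - fake_logits)).mean()   # real should outscore fake
    r1 = ((D(real + sigma * torch.randn_like(real)) - real_logits) ** 2).mean()
    r2 = ((D(fake + sigma * torch.randn_like(fake)) - fake_logits) ** 2).mean()
    return adv + lam * (r1 + r2)

def generator_loss(D, real, fake):
    """Relativistic generator loss: generated frames should outscore real frames."""
    return F.softplus(-(D(fake) - D(real))).mean()
```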
Both the generator and the discriminator are initialized from a pre-trained bidirectional video diffusion model. We propose autoregressive adversarial post-training (AAPT) as a method to efficiently transform the bidirectional video diffusion model into our autoregressive generator. We also propose techniques for training models that support long-duration video generation, bypassing the limitations of data and GPU memory. For more details, please refer to our research publication.
Comparison to Other Methods
We compare our model to other methods. We showcase SkyReels-V2, a state-of-the-art diffusion-forcing model designed for long-video generation, and our diffusion baseline, which is trained for 5-second generation and extended repeatedly for long-video generation. The results show that both diffusion-based methods degrade severely after 20 seconds. Changing the CFG scale can mitigate some drifting issues but degrades structural integrity. Our model performs better than the compared methods.
SkyReels-V2
Our Diffusion
Ours
Simulate Infinite Possibilities
Our model can simulate infinite possibilities of the world. We showcase results generated by our model given the same initial frame but with different noise samples. All the simulations below are generated in real time.
Challenging Cases
We also present some challenging cases that illustrate our model's limitations. Our model can struggle with fast-changing motion, as the 1NFE model still has limited capacity to generate suddenly emerging scenes and objects. It also has limited long-distance memory due to the use of sliding-window attention, and it can sometimes violate physics. We aim to address these issues in the future. Additionally, our model is currently trained without human-preference alignment, which could further improve performance, as seen in diffusion models; we leave this to future work.
Recycle Inputs
Our ablation studies show that recycling the past result as input for future autoregressive steps is critical for long-duration generation. Specifically, we compare against a model (left) with the exact same architecture and training procedure, except that the recycled input is masked to zero tensors for subsequent autoregressive steps. We find that this model fails to generate large and coherent motions. Please refer to our publication for more details.
No-Recycle
Recycle
Teacher vs. Student Forcing
Autoregressive adversarial training can be done in either a teacher-forcing or a student-forcing fashion. We conducted experiments training our model in the teacher-forcing setting, as inspired by LLMs. However, we find that models trained with teacher forcing drift significantly at inference. Unlike LLMs, which predict discrete words, our model predicts continuous latent pixel values, which we suspect are more prone to error accumulation. Please refer to our publication for more details.
Teacher-Forcing
Student-Forcing
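To make the distinction concrete, here is a minimal sketch (our own illustration; `G` stands in for one generator step taking noise and the previous latent) contrasting how the previous-frame input is chosen in the two regimes compared above.

```python
import torch

def rollout_teacher_forcing(G, gt_latents):
    """Teacher forcing: condition each step on the ground-truth previous frame (LLM-style)."""
    outputs = []
    for t in range(1, len(gt_latents)):
        noise = torch.randn_like(gt_latents[t])
        outputs.append(G(noise, gt_latents[t - 1]))   # previous input comes from the data
    return outputs

def rollout_student_forcing(G, first_latent, num_steps):
    """Student forcing: condition each step on the model's own previous output, as at inference."""
    outputs, prev = [], first_latent
    for _ in range(num_steps):
        noise = torch.randn_like(prev)
        prev = G(noise, prev)                          # previous input comes from the model itself
        outputs.append(prev)
    return outputs
```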
Toward Infinite-Length Streaming
We show that our model can zero-shot extrapolate to videos of much longer duration than it has seen during training. Below are examples of 5-minute (7200-frame) videos generated in a real-time, nonstop, streaming fashion, with the KV cache keeping the compute at 1NFE per step for the entire video. We find that our model can forget the subject and environment due to the current sliding-window attention, and structural integrity remains a challenge for 1NFE models. We believe our method is a promising approach to building infinite-length streaming generators, and we aim to address these limitations in future work.
736x416 24fps
Real-time on 1xH100
Conclusion
We present Seaweed APT2, an autoregressive adversarial model designed for real-time interactive video generation. Our research introduces adversarial training as a new paradigm for building practical autoregressive generation models. Our model achieves significantly improved speed, latency, and duration for streaming video generation. We emphasize that APT2 is an early research work, and we look forward to further advancing research in real-time interactive video generation.