Seaweed APT

Diffusion Adversarial Post-Training for One-Step Video Generation

Video Generation in 1 Step
Existing diffusion and autoregressive generative models require many repeated neural network evaluations, which makes high-resolution video generation extremely slow: a few-second video can take many minutes to generate. Our work is the first to demonstrate the generation of an entire video with a single neural function evaluation (1NFE), enabled by our proposed adversarial post-training technique. Our model generates 2-second 1280x720 24fps videos in real time. A minimal sketch of the single-step setting follows, and we showcase some of the results after it.
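The sketch below contrasts iterative diffusion sampling with single-step generation. It is an illustration only: the modules, shapes, and sizes are toy stand-ins, not the actual Seaweed-APT architecture or API.

```python
import torch
import torch.nn as nn

# Toy stand-in for the one-step generator; the real model is a large
# video diffusion transformer operating on compressed video latents.
class ToyGenerator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, noise, text_emb):
        # One forward pass maps noise directly to a sample (1 NFE).
        return self.net(noise + text_emb)

generator = ToyGenerator()
noise = torch.randn(1, 64)      # real model: video latent noise, not a vector
text_emb = torch.randn(1, 64)   # real model: text-encoder output

# Conventional diffusion sampling loops over many denoising calls, e.g. 50 NFE:
#     x = noise
#     for t in timesteps:
#         x = denoiser(x, t, text_emb)
# The adversarially post-trained generator needs a single evaluation:
with torch.no_grad():
    sample = generator(noise, text_emb)  # 1 NFE
```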
A girl raises her left hand to cover her smiling mouth.
The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope.
A personified cat wearing a suit, walking on the street.
A white and orange tabby cat is seen happily darting through a dense garden.
An extreme close-up of a gray-haired man with a beard in his 60s. He is dressed in a wool suit coat with a button-down shirt. He wears a brown beret and glasses and has a very professorial appearance, depth of field, cinematic 35mm film.
A very high waterfall pouring down.
Medium shot, An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style.
Wide-angle shot, neon lights and holograms illuminating the scene, a futuristic cityscape with flying cars and towering skyscrapers, a cyborg standing on a rooftop looking over the city, cyberpunk style.
A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table.
Ancient China, with beautiful mountains and rivers, birds singing and flowers fragrant; a middle-aged scholar walks through the scene.
Dramatic Rembrandt light, a bearded man pulls out a pistol.
Zombies, mutant species attacking cities, attacking humans.
Humanoid mechs left in the ruins of the city, movie lighting, cold lighting.
A western princess, with sunlight shining through the leaves on her face, facial close-up.
A woman wearing a traditional red wedding dress stands in a dilapidated courtyard, with a Chinese-style gloomy and terrifying atmosphere, oppressive composition, strange, dramatic, movie, folk horror.
A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in snow.
A huge spider is climbing to the top of skyscrapers.
A female knight holding a heavy sword stands in front of a Gothic castle in medieval style.
The wind blows the fallen leaves.
Wong Kar-wai style, on the streets of Shanghai, back shot of a woman walking in a cheongsam, nostalgic sepia tone, brightly saturated colors.
Two cute orange cats took a UFO ride to explore the Milky Way, extremely excited, dancing.
A beautiful woman walking on the school playground. The sun shining on her face.
The camera rotates around a large stack of vintage televisions all showing different programs.
An older man playing piano, lit from the side, advertising style.
A dog wearing virtual reality goggles in sunset.
Full shot, 4 men walking across the street.
The monster stared at the food with wide eyes and open mouth. Its posture and expression convey a sense of innocence and playfulness.
Robot dancing in Times Square, American comic style.
A group of friends laughing and chatting around a dinner table, plates of food and glasses of wine, a cozy living room with a fireplace in the background, warm and joyful atmosphere, camera shift left.
Two kangaroos are busy cooking dinner in a kitchen, 3d animation.
An astronaut running through an alley in Rio de Janeiro, 4k, high resolution.
A young singer performing on stage.
A little boy playing with water.
An astronaut riding a horse, high definition, 4k.
A woman reading a flaming burning book while many people are walking around her in the busy city.
In an ancient, dimly lit library, a young student sits cross-legged on a plush, worn rug, surrounded by towering bookshelves filled with centuries-old volumes.
POV flying through the colorful, coral-lined streets of an underwater suburban neighborhood.
Timelapse of day to night, Eiffel Tower.
A birthday cake in the plate.
An old lady selling vegetables at a market stall.
A sika deer is drinking water by the stream.
Sea, green island, girl, orbiting shot slowly pulling away. Realistic movie.
Image Generation in 1 Step
Our model can also generate 1024px images using a single neural function evaluation (1NFE). The images generated by our model have realistic textures and details. We highlight some of the examples below:
A girl raises her left hand to cover her smiling mouth.
A Border Collie runs excitedly on the track, and the audience on both sides of the track applauds.
The samurai tightly holds the long stick, nervously staring at the tiger in the grass ahead, creating a tense fighting atmosphere and a realistic sense of film.
On a lively and cheerful night, a white kitten walks on the bustling street with balloons, and the street is full of small animals.
The camera follows a pink off-road vehicle moving quickly in the jungle.
There is a long river in the middle of a spectacular valley, and the drone aerial photography shows the mountains and rivers.
A rescue soldier runs forward in the battlefield with gunfire, and the camera follows.
A black and white picture, sweat dripping from the man's determined face.
The car is accelerating, the background on both sides is blurred, focus on the body.
A man in yellow clothes, facing the camera, motionless, with moving vehicles in the background.
A Labrador wearing glasses and casual clothes is lying on the bed reading.
A cute little penguin walks on an Antarctic glacier, searching for food.
Video Generation Compared to Diffusion
We compare our adversarially post-trained 1NFE and 2NFE video models to the original diffusion model at 50NFE. In other words, our models require 50x and 25x less compute, respectively. We present both good and bad cases to give a holistic representation of our model.

The good cases: We show that adversarial post-training has the additional benefit of improving details and realism. The original diffusion model with classifier-free guidance can have exposure issues and may generate samples that appear synthetic.
Diffusion 50NFE
APT 2NFE
APT 1NFE
A western princess, with sunlight shining through the leaves on her face, facial close-up.
Wong Kar-wai style, on the streets of Shanghai, back shot of a woman walking in a cheongsam, nostalgic sepia tone, brightly saturated colors.
A personified cat wearing a suit, walking on the street.
The average cases: Few-step generated videos can mostly represent the scene even with large motions. However, there are degradations in structural integrity and text alignment. Our paper analyzes the causes of these issues to guide future improvements.
Zombies, mutant species attacking cities, attacking humans.
First-person perspective, the camera passes through a classroom entering the school playground.
A rescue soldier runs forward in the battlefield with gunfire, and the camera follows.
The failure cases: We show that our model can still fail in some cases. This is still an active research area that we hope to improve in the future.
The camera follows a pink off-road vehicle moving quickly in the jungle.
A person wearing a spacesuit stands on the moon and looks at the distant earth in a realistic style.
A terracotta warrior holds a white paper in one hand, and the paper flutters in the wind. The background is a museum.
Video Generation Diversity
Our 1NFE video model can generate diverse results without mode collapse. Below are video examples generated by our model in one step from different initial noise samples.
A male violinist with a sad expression, performing a melody on a dimly lit stage.
A dog wearing virtual reality goggles in sunset.
Slow motion, A fox dressed in a suit dancing in the park.
Image Generation Compared to Other Methods
One-step generation has been attempted previously for images, but prior one-step image generation models exhibit visible artifacts and poor structure. We compare our one-step image generation against these methods and show that ours marks a significant advancement in visual quality.
Our Diffusion 50NFE
Our APT 1NFE
FLUX-Schnell 1NFE
A frustrated child.
SD3.5-Turbo 1NFE
SDXL-DMD2 1NFE
SDXL-Lightning 1NFE
SDXL-Hyper 1NFE

Our Diffusion 50NFE
Our APT 1NFE
FLUX-Schnell 1NFE
A close-up of the eyes of an owl.
SD3.5-Turbo 1NFE
SDXL-DMD2 1NFE
SDXL-Lightning 1NFE
SDXL-Hyper 1NFE

Our Diffusion 50NFE
Our APT 1NFE
FLUX-Schnell 1NFE
The city of London.
SD3.5-Turbo 1NFE
SDXL-DMD2 1NFE
SDXL-Lightning 1NFE
SDXL-Hyper 1NFE

Our Diffusion 50NFE
Our APT 1NFE
FLUX-Schnell 1NFE
A tree growing through a fence.
SD3.5-Turbo 1NFE
SDXL-DMD2 1NFE
SDXL-Lightning 1NFE
SDXL-Hyper 1NFE
Training Progression
We show that the training procedure is stable and fast. We maintain an exponential moving average (EMA) of the model weights for steady updates; a minimal sketch of the update rule follows the progression below. The model reaches peak quality after only 350 updates, and training longer leads to quality deterioration.
Consistency init
50 updates
150 updates
250 updates
350 updates
450 updates
550 updates
650 updates
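For reference, a minimal sketch of the EMA update, assuming a standard decay-based rule; the decay value and update frequency are illustrative assumptions, not the paper's exact settings.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # ema = decay * ema + (1 - decay) * online, applied parameter-wise.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

model = torch.nn.Linear(8, 8)             # stand-in for the online generator
ema_model = copy.deepcopy(model).eval()   # frozen copy used for sampling

# Called once after every optimizer step:
ema_update(ema_model, model)
```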
We also share a visualization of the online (non-EMA) model. It fluctuates far more between iterations and performs worse than the EMA model overall.
Model Internals
To understand the internal behavior of one-step generation, we freeze the model and train an additional linear projection after every transformer layer, projecting the model's internal features into image space for visualization (a minimal probe-training sketch follows the figures below). Below we visualize the result at every layer of our 36-layer transformer model. The model generates the overall composition in the earlier layers and progressively refines finer details. This mirrors diffusion models, which also generate from low frequency to high frequency, except that our model compresses the entire generation into 36 transformer layers evaluated in a single forward pass.
Layer 6
Layer 12
Layer 18
Layer 24
Layer 30
Layer 36
A black and white picture, sweat dripping from the man's determined face.

Layer 6
Layer 12
Layer 18
Layer 24
Layer 30
Layer 36
A group of fluffy chickens are weaving wreaths in a fantasy style.
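A minimal sketch of the probe training described above, with a toy transformer standing in for our generator; apart from the 36-layer count, all dimensions, optimizer settings, and the regression target are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_layers, dim, out_dim = 36, 64, 16   # toy sizes; only the layer count matches

backbone = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    for _ in range(num_layers)
)
for p in backbone.parameters():
    p.requires_grad_(False)   # the generator itself stays frozen

# One trainable linear projection per transformer layer.
probes = nn.ModuleList(nn.Linear(dim, out_dim) for _ in range(num_layers))
opt = torch.optim.Adam(probes.parameters(), lr=1e-4)

tokens = torch.randn(2, 10, dim)       # stand-in for the model's input tokens
target = torch.randn(2, 10, out_dim)   # stand-in for the image-space target

h, loss = tokens, 0.0
for layer, probe in zip(backbone, probes):
    h = layer(h)                                 # frozen forward pass
    loss = loss + F.mse_loss(probe(h), target)   # regress each layer to image space

opt.zero_grad()
loss.backward()   # gradients flow only into the probes
opt.step()
```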
Latent Interpolation
We interpolate the input noise to our image model to generate smooth latent-traversal videos, which helps visualize our model's generative latent space. The results show that our model generates a continuous distribution of diverse images. A minimal sketch of the interpolation follows the examples below.
The boy puts on the headphone for the girl, orange-blue tone, warm tone, movie aesthetics.
A Border Collie runs excitedly on the track, and the audience on both sides of the track applauds.
Dramatic Rembrandt light, a bearded man pulls out a pistol.
In the early morning, the bright natural light shines through the leaves outside the window, and the sunlight shines on the desk in front of the window. Fresh and clean style.
He handed the microphone in his right hand to the audience below.
Backlit clown facing the camera, movie background light, American horror movie atmosphere, realistic.
Parents celebrate their daughter's birthday at home, with candlelight illuminating their smiling faces, full of warmth.
The camera follows a pink off-road vehicle moving quickly in the jungle.
A black and white picture, sweat dripping from the man's determined face.
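A minimal sketch of the noise interpolation behind these traversals. Spherical interpolation (slerp) is an assumption on our part, chosen because it keeps interpolants near the Gaussian shell the generator expects; `generator` refers to the hypothetical one-step model from the earlier sketch.

```python
import torch

def slerp(z0, z1, t):
    # Spherical interpolation between two noise tensors of the same shape.
    a, b = z0.flatten(), z1.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a / a.norm(), b / b.norm()), -1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1

z0, z1 = torch.randn(1, 64), torch.randn(1, 64)   # traversal endpoints

frames = []
for t in torch.linspace(0.0, 1.0, steps=48):
    z_t = slerp(z0, z1, t)
    frames.append(z_t)   # real pipeline: generator(z_t, text_emb), 1 NFE per frame
```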
Visualizing our model's latent space also reveals why one-step generation suffers from degraded structural integrity. We notice that structural errors often occur during transitions between modes, whereas the diffusion model can make very sharp changes between modes despite the same rate of change in the noise interpolation. (The flickering is a separate issue caused by numerical inaccuracy; our focus here is the diffusion model's ability to switch between modes quickly.) We hypothesize that single-step models have poor structural integrity because the generator has limited capacity to make such sharp changes. With this insight, we aim to address the issue in future work.
Diffusion 50NFE
APT 1NFE
Parents celebrate their daughter's birthday at home, with candlelight illuminating their smiling faces, full of warmth.
Uncurated Results
We clarify that the samples on this page were selected to best present our work, and that some prompts have been truncated or modified to fit the web format. We plan to release all uncurated samples for download soon, allowing future research to compare against our work.
Conclusion
We present Seaweed-APT, an adversarially post-trained model that supports real-time one-step generation of 1280x720 24fps videos with a duration of two seconds. To the best of our knowledge, our work is the first to demonstrate the generation of high-resolution videos in a single step. To achieve this, we introduce multiple improvements to the model architecture, training procedure, and loss objective; more details are available in our research paper. However, this is still early research: though the results are impressive, one-step generated results still show degradation in structural integrity and text alignment. Our paper analyzes the causes of these issues, and we aim to address them in future work.
Research: Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang