MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

Onkar Susladkar1, Jishu Sen Gupta2*, Chirag Sehgal3*, Sparsh Mittal4, Rekha Singhal5
1Northwestern University, 2IIT BHU, 3Delhi Technological University, 4IIT Roorkee, 5TCS Research, India
*Work done during internship at IIT Roorkee. This work was supported by the Science and Engineering Research Board (SERB) of India under project CRG/2022/003821.
ICLR'25 Spotlight ✨

Abstract

The spatiotemporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions to address the challenges of spatiotemporal video processing. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task of Sketch Guided Video Inpainting. This task leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release our code, datasets, and models as open source.
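The spectral denoiser's key idea — mixing token features in the frequency domain so that every token interacts with every other in a single step — can be sketched with a naive discrete Fourier transform. This is a minimal, library-free illustration: the `spectral_mix` helper and the fixed per-frequency gains are assumptions for exposition, whereas the actual model uses learned filters inside transformer blocks.

```python
import cmath

def dft(seq):
    """Naive discrete Fourier transform of a list of real numbers."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(freq):
    """Inverse DFT; returns real parts (input assumed real-valued)."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def spectral_mix(tokens, gains):
    """Scale each frequency component by a gain, then transform back.
    Because each frequency component depends on all tokens, this one
    operation couples every token with every other (global context)."""
    freq = dft(tokens)
    filtered = [g * f for g, f in zip(gains, freq)]
    return idft(filtered)

tokens = [1.0, 2.0, 3.0, 4.0]
out = spectral_mix(tokens, [1.0] * 4)  # identity gains recover the input
```

With identity gains the round trip `idft(dft(x))` returns the input, which is a quick sanity check that the transform pair is consistent; a learned model would instead shape the gains to suppress or emphasize particular frequencies.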

Video Reconstruction, Generation and Sketch based Video Inpainting

Generated videos of varied aspect ratios and styles

Architecture Diagram

Qualitative analysis of image reconstruction using different Video VAEs

Architecture Diagram

Qualitative analysis of sketch-based image inpainting

Model Architecture

Discrete diffusion pretraining of the spectral transformer involves processing tokenized video frame representations from the 3D-MBQ-VAE encoder. These representations are subjected to random masking based on a predefined probability distribution. The resulting corrupted tokens are then denoised through a series of N Spectral Transformers, aided by contextual information from text representations generated by the T5-XXL encoder. The denoised tokens are reconstructed using the 3D decoder.
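The corruption step above can be sketched in plain Python. The `MASK_ID` value and the per-token masking probability are illustrative assumptions (the paper draws masking rates from a predefined distribution); the full-frame variant mirrors the 3D-MBQ-VAE training strategy in which an entire frame's tokens are masked.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token

def corrupt_tokens(tokens, mask_prob, rng=None):
    """Randomly replace discrete tokens with MASK_ID; the denoiser is
    trained to recover the originals, conditioned on the text prompt."""
    rng = rng or random.Random(0)
    return [MASK_ID if rng.random() < mask_prob else t for t in tokens]

def mask_full_frame(frames, frame_idx):
    """Full-frame masking: every token of one frame is replaced."""
    return [[MASK_ID] * len(f) if i == frame_idx else list(f)
            for i, f in enumerate(frames)]

frames = [[7, 8, 9], [4, 5, 6]]          # two frames, three tokens each
masked = mask_full_frame(frames, 1)      # -> [[7, 8, 9], [-1, -1, -1]]
```

During pretraining the denoiser would receive `masked` (plus text embeddings) and be supervised to predict the original token ids at the masked positions.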

Model Architecture

Sketch-guided video inpainting process. The network inputs masked video latents, fully diffused unmasked latents, sketch conditioning, and text conditioning. It predicts the denoised latents using LoRA infused in our pre-trained denoiser ϵθ.

Model Architecture

Our proposed MotionAura architecture consists of a 3D Mobile Inverted Vector-Quantization VAE (3D-MBQ-VAE) for efficient video tokenization and a discrete diffusion model to generate high-quality, motion-consistent videos. By leveraging an auxiliary discriminative loss for predicting masked frames and enforcing random frame masking, the method learns efficient spatio-temporal features.

Abstract Structures

Create an abstract video featuring fluid, metallic sculptures morphing and twisting in a serene, white space. The camera circles around the forms, capturing their smooth surfaces as they shift between shapes. Reflections play across their surfaces, creating a mesmerizing, almost hypnotic effect.

Render a video showcasing abstract sculptures made of geometric shapes fused with organic elements. The camera moves through a gallery-like space, focusing on each sculpture’s intricate details. The forms appear to grow and change, as if they are alive, with soft lighting emphasizing their textures.

Generate a video featuring abstract sculptures that resemble intricate crystal formations. The camera pans slowly, revealing the sharp angles and translucent surfaces of the sculptures. The light refracts through the crystals, creating a dazzling display of colors and reflections against a dark background.

Cinematic Videos

Create a cinematic video featuring a sprawling, futuristic city at dusk. Skyscrapers tower above, casting long shadows. The camera slowly zooms out, revealing the bustling city streets below, filled with neon lights and fast-moving traffic. A heavy atmosphere of anticipation fills the air as dark clouds gather on the horizon.

Render a cinematic video of an epic battle in a medieval fantasy world. Armored knights clash with mythical creatures on a foggy battlefield. The camera captures intense close-ups of swords clashing and panoramic views of the chaotic scene. The sky is overcast, with the sound of thunder rumbling in the distance.

Create a cinematic video of a couple reuniting at a train station in the pouring rain. The camera focuses on their expressions as they embrace, with raindrops glistening on their faces. Soft, warm lighting from the station’s lamps contrasts with the cold, wet environment, highlighting the emotion of the moment.

10s Video Showcase

Citation

@inproceedings{ref217,
  title     = "MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion",
  author    = "Onkar Susladkar and Jishu Sen Gupta and Chirag Sehgal and Sparsh Mittal and Rekha Singhal",
  booktitle = "International Conference on Learning Representations (ICLR)",
  year      = "2025"
}