Taming Teacher Forcing for Masked Autoregressive Video Generation

1 The Hong Kong University of Science and Technology (Guangzhou), 2 StepFun, 3 University of Illinois Urbana-Champaign, 4 Tsinghua University, 5 The Hong Kong University of Science and Technology

How to Achieve Frame-Level Autoregressive Video Generation?

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation.

Diagram of the proposed MAGI model

Abstract

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation.

Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation.

CTF significantly outperforms MTF, improving FVD scores by 23% on first-frame conditioned video prediction. To address issues such as exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.

Smooth Transition from Patch-Level to Frame-Level AR

Our proposed model, MAGI, inherits all the advantages of traditional patch-level autoregressive models.

Comparison of patch-level autoregressive models
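The hybrid scheme described above (masked modeling inside a frame, causal modeling across frames) can be sketched as a two-level loop. This is a minimal illustration under stated assumptions, not the MAGI implementation: the `MASK` sentinel, the `predict` callback signature, the equal-share unmasking schedule, and `toy_predict` are all hypothetical stand-ins for a trained model and its sampling schedule.

```python
import random

MASK = -1  # hypothetical sentinel for a token that has not been decoded yet


def decode_frame(context, tokens_per_frame, predict, steps=4):
    """Fill one frame by iterative masked decoding, conditioned on `context`.

    `predict(context, frame)` is an assumed interface: it returns one
    (token, confidence) pair per position of the current frame.
    """
    frame = [MASK] * tokens_per_frame
    for step in range(steps):
        masked = [i for i, tok in enumerate(frame) if tok == MASK]
        if not masked:
            break
        proposals = predict(context, frame)
        # Assumed schedule: reveal an equal share of the remaining
        # positions at each step, so the last step reveals all of them.
        n_reveal = -(-len(masked) // (steps - step))  # ceiling division
        most_confident = sorted(
            masked, key=lambda i: proposals[i][1], reverse=True
        )[:n_reveal]
        for i in most_confident:
            frame[i] = proposals[i][0]
    return frame


def generate_video(first_frame, num_new_frames, predict, steps=4):
    """Frame-level autoregression: each new frame sees only complete past frames."""
    frames = [list(first_frame)]
    for _ in range(num_new_frames):
        frames.append(decode_frame(frames, len(first_frame), predict, steps))
    return frames


def toy_predict(context, frame):
    """Stand-in for a model: shift the previous frame's tokens by one."""
    previous = context[-1]
    return [((previous[i] + 1) % 16, random.random()) for i in range(len(frame))]


video = generate_video([0, 1, 2, 3], num_new_frames=3, predict=toy_predict)
```

Within a frame, tokens are revealed in confidence order rather than raster order; across frames, generation is strictly causal, which is what makes the procedure frame-level autoregressive.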

Key Contribution: Complete Teacher Forcing (CTF)

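Masked Teacher Forcing (MTF) extends masked image generation to video prediction by using causal temporal attention, but it suffers from a training-inference gap: with high mask ratios, the observation frames the model conditions on during training are heavily masked, whereas at inference they are complete.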

To address this, we propose Complete Teacher Forcing (CTF), which conditions on unmasked frames during training to predict masked frames, bridging the gap more effectively.

Explanation of Complete Teacher Forcing (CTF)
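One way to picture the difference between MTF and CTF is through frame-level attention masks. The sketch below is an assumption about how such conditioning could be wired, not a description of the MAGI code: it supposes that CTF feeds the model a sequence of T complete frames followed by T masked copies, and that each masked frame attends to strictly earlier complete frames plus its own tokens. The function names and the sequence layout are hypothetical.

```python
def mtf_frame_mask(num_frames):
    """MTF sketch: all frames are masked; frame q attends to masked frames k <= q."""
    return [[k <= q for k in range(num_frames)] for q in range(num_frames)]


def ctf_frame_mask(num_frames):
    """CTF sketch over an assumed layout [complete_0..T-1, masked_0..T-1].

    mask[q][k] is True when query frame q may attend to key frame k.
    """
    T = num_frames
    mask = [[False] * (2 * T) for _ in range(2 * T)]
    for q in range(T):
        for k in range(T):
            mask[q][k] = k <= q      # complete frame -> complete frames, causal
            mask[T + q][k] = k < q   # masked frame -> earlier complete frames only
        mask[T + q][T + q] = True    # masked frame -> its own (partly visible) tokens
    return mask


def expand_to_tokens(frame_mask, tokens_per_frame):
    """Repeat every frame-level entry into a tokens_per_frame-square block."""
    n = tokens_per_frame
    return [
        [allowed for allowed in row for _ in range(n)]
        for row in frame_mask
        for _ in range(n)
    ]


ctf = ctf_frame_mask(3)
token_mask = expand_to_tokens(ctf, tokens_per_frame=4)
```

Under this layout, the prediction for a masked frame never depends on a masked version of an earlier frame, which matches the condition the model faces at inference, where all previously generated frames are complete.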

Long Video Prediction

MAGI can predict stable videos of more than 100 frames even though it is trained on clips of only 16 frames.

Ablation Study

MAGI with CTF generates video with smooth motion

MAGI with MTF generates video with poor coherence

Acknowledgements

Tianhong Li
Chenfei Wu
Haoyang Huang
Guoqing Ma
Hongyu Zhou
Liangyu Chen
Chunrui Han
Yimin Jiang
Yu Deng