
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers


1THU, 2ShengShu, 3UT-Austin, 4RUC, 5Pazhou Lab


TL;DR: Effortlessly extend your video with just one line of code: freq[k-1]=(2*np.pi)/(L*s).

(This webpage contains a lot of videos. We suggest using Chrome for the best experience.)


Overview


Motivation: Recent advancements in video generation allow models to create high-quality videos, but fixed sequence lengths limit their ability to extend content. In this paper, we explore length extrapolation techniques that generate new and temporally coherent content without longer training videos. Current extrapolation strategies lead to temporal repetition or slow motion, indicating a gap in understanding how positional encodings affect video extrapolation.

Analysis: We systematically analyze the role of individual frequency components in positional embeddings. High frequencies capture rapid movements and short-term dependencies and induce temporal repetition, while low frequencies encode long-term dependencies and lead to slow motion. Furthermore, we identify an intrinsic frequency that, among all components, primarily dictates the repetition pattern during extrapolation.
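To make the role of individual frequency components concrete, the following minimal sketch lists the period of each component of a 1D rotary positional embedding and how many cycles it completes over a sequence. It assumes the common RoPE parameterization freq_j = theta^(-2j/dim); the values dim=16, theta=10000, L=40, and s=2 are illustrative and are not taken from any specific model.

```python
import numpy as np

def rope_frequencies(dim, theta=10000.0):
    """Per-component angular frequencies of standard 1D RoPE (assumed form)."""
    return 1.0 / (theta ** (np.arange(0, dim, 2) / dim))

def rope_periods(dim, theta=10000.0):
    """Period, in positions, of each component: N_j = 2*pi / freq_j."""
    return 2 * np.pi / rope_frequencies(dim, theta)

# Illustrative values only (hypothetical, not a real model configuration).
dim, L, s = 16, 40, 2
periods = rope_periods(dim)
cycles_train = L / periods        # cycles completed within the original length
cycles_extra = (L * s) / periods  # cycles completed after s-times extrapolation

for j, (p, c0, c1) in enumerate(zip(periods, cycles_train, cycles_extra), start=1):
    print(f"component {j}: period={p:10.1f}  cycles@L={c0:6.2f}  cycles@L*s={c1:6.2f}")
```

High-frequency components (short periods) wrap around many times within the sequence, while low-frequency components cover only a fraction of one cycle. A component whose cycle count crosses 1 during extrapolation starts to revisit phases it has already produced.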

Method: Building on these insights, we propose Reducing Intrinsic Frequency for Length Extrapolation (RIFLEx), a minimal yet effective solution that lowers the intrinsic frequency to ensure it remains within a single cycle after extrapolation, without requiring additional modifications.
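The one-line change from the TL;DR can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes standard 1D RoPE frequencies, reads L as the original sequence length in temporal positions and s as the extrapolation factor, and takes the 1-based index k of the intrinsic frequency as a given input. The function names and the values dim=16, k=3, L=40, s=2 are hypothetical.

```python
import numpy as np

def rope_frequencies(dim, theta=10000.0):
    """Per-component angular frequencies of standard 1D RoPE (assumed form)."""
    return 1.0 / (theta ** (np.arange(0, dim, 2) / dim))

def riflex_frequencies(dim, k, L, s, theta=10000.0):
    """Lower the k-th (1-based) frequency so one full cycle spans L*s positions.

    k is assumed to be known here; how it is identified is outside this sketch.
    """
    freq = rope_frequencies(dim, theta).copy()
    freq[k - 1] = (2 * np.pi) / (L * s)  # the one-line modification from the TL;DR
    return freq

# Hypothetical example values.
dim, k, L, s = 16, 3, 40, 2
base = rope_frequencies(dim)
new = riflex_frequencies(dim, k, L, s)

print("original period of component k:", 2 * np.pi / base[k - 1])
print("modified period of component k:", 2 * np.pi / new[k - 1])
```

Only the selected component changes; all other frequencies are left untouched, so the extrapolated length L*s fits within a single cycle of that component.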




Training-free 2× Temporal Extrapolation

RIFLEx offers a true free lunch—achieving high-quality 2× extrapolation on SOTA video diffusion transformers in a completely training-free manner. In the following, we present videos extended from 129 to 261 frames at 24 fps.






2× Temporal Extrapolation: Fine-tuning 1,000 Steps without Long Videos

When fine-tuning is performed for only 1,000 steps with a batch size of 8 on original-length videos, dynamic quality and visual quality are further improved. Below are videos extended from 129 to 261 frames at 24 fps.






Spatial and Joint Temporal-spatial Extrapolation

RIFLEx can be extended to spatial resolution extrapolation, as well as to joint extrapolation of video duration and resolution. Compared to position extrapolation (PE), it achieves superior visual quality by effectively addressing repetition issues. Notably, while PE struggles with joint temporal-spatial extrapolation, RIFLEx still delivers high-quality results.


Video panels, one comparison per row:

Normal size: 480×720 | PE: 480×1440 | RIFLEx (Ours): 480×1440
Normal size: 480×720 | PE: 960×720 | RIFLEx (Ours): 960×720
Normal size: 480×720 | PE: 960×1440 | RIFLEx (Ours): 960×1440
Normal size: 480×720×49 | PE: 960×1440×97 | RIFLEx (Ours): 960×1440×97





Temporal Extrapolation Comparisons

We compare RIFLEx against popular length extrapolation methods from LLMs and image diffusion transformers, which often introduce temporal repetition or motion deceleration. In contrast, our approach generates new, temporally coherent content that evolves smoothly. Below, repeated frames, which are identical to the initial frames of the video, are highlighted with a red box.


Video panels, in order: PE, PI, TASR, YaRN, RIFLEx (Ours).
Artifact labels shown on the panels, in order: Repetition, Slower Motion, Slower Motion, Slower Motion.


While the NTK in HunyuanVideo coincidentally avoids repetition at 2× extrapolation, this stems from an unintended effect that happens to satisfy the non-repetition condition rather than its intended mechanism. At longer extrapolations (e.g., 2.3×), significant repetition still occurs.
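The non-repetition condition can be checked numerically. The sketch below assumes the widely used NTK-aware RoPE scaling, theta' = theta * s^(dim/(dim-2)), which may differ from HunyuanVideo's exact implementation. It uses toy values (dim=16, L=35, component k=3) chosen only so that the threshold falls between 2× and 2.3×; they are not HunyuanVideo's configuration. Under this scaling, the period of this component grows more slowly than the sequence length L*s, so the condition can hold at a small factor and fail at a larger one.

```python
import numpy as np

def ntk_scaled_frequencies(dim, s, theta=10000.0):
    """RoPE frequencies under NTK-aware base scaling (assumed formula)."""
    theta_scaled = theta * s ** (dim / (dim - 2))
    return 1.0 / (theta_scaled ** (np.arange(0, dim, 2) / dim))

def within_one_cycle(freq_k, length):
    """Non-repetition check: the sequence fits inside one period of the component."""
    return length <= 2 * np.pi / freq_k

# Toy values (hypothetical); k is the 1-based index of the component being checked.
dim, L, k = 16, 35, 3
for s in (2.0, 2.3):
    freq = ntk_scaled_frequencies(dim, s)
    period = 2 * np.pi / freq[k - 1]
    ok = within_one_cycle(freq[k - 1], L * s)
    print(f"s={s}: length={L * s:.1f}, period={period:.1f}, within one cycle={ok}")
```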


BibTeX


@misc{zhao2025riflex,
  title={RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers},
  author={Min Zhao and Guande He and Yixiao Chen and Hongzhou Zhu and Chongxuan Li and Jun Zhu},
  year={2025},
  eprint={2502.15894},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}