RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

Min Zhao¹,², Guande He³, Yixiao Chen¹,², Hongzhou Zhu¹,², Chongxuan Li⁴, Jun Zhu¹,²,⁵

¹THU, ²ShengShu, ³UT-Austin, ⁴RUC, ⁵Pazhou Lab

TL;DR: Effortlessly extend your video with just one line of code: freq[k-1]=(2*np.pi)/(L*s).

Overview

Motivation: Recent advancements in video generation allow models to create high-quality videos, but fixed sequence lengths limit their ability to extend content. In this paper, we explore length extrapolation techniques that generate new and temporally coherent content without longer training videos. Current extrapolation strategies lead to temporal repetition or slow motion, indicating a gap in understanding how positional encodings affect video extrapolation.

Analysis: We systematically analyze the role of individual frequency components in positional embeddings and find that high frequencies capture rapid movements and short-term dependencies, inducing temporal repetition, while low frequencies encode long-term dependencies and lead to slow motion. Surprisingly, we also identify a single intrinsic frequency that, among all components, primarily dictates the repetition pattern during extrapolation.
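To make the notion of frequency components concrete, here is a minimal sketch that lists the period of each component of a rotary positional embedding. It assumes the standard RoPE parameterization with base 10000. The page does not state the model's actual parameterization, and `rope_periods`, `DIM`, and `L_TRAIN` are hypothetical names and values chosen for illustration.

```python
import numpy as np


def rope_periods(dim, base=10000.0):
    """Frequencies and periods of a standard 1D RoPE with dim // 2 components.

    Assumption: theta_j = base ** (-2 * (j - 1) / dim) for j = 1..dim/2,
    the usual RoPE parameterization; a given model's values may differ.
    """
    exponents = np.arange(0, dim, 2, dtype=np.float64) / dim
    freqs = 1.0 / (base ** exponents)
    periods = 2.0 * np.pi / freqs  # positions needed to complete one cycle
    return freqs, periods


# Hypothetical sizes, chosen only to keep the printout short.
DIM = 16      # temporal rotary dimension
L_TRAIN = 33  # temporal positions seen during training

freqs, periods = rope_periods(DIM)
for j, (f, p) in enumerate(zip(freqs, periods), start=1):
    cycles = L_TRAIN / p  # cycles completed inside the training length
    print(f"component {j}: freq={f:.5f}  period={p:10.1f}  cycles={cycles:.3f}")
```

Components that complete many cycles within the training length are the high-frequency ones. Components whose period is far longer than the training length change only slightly from one position to the next.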

Method: Building on these insights, we propose Reducing Intrinsic Frequency for Length Extrapolation (RIFLEx), a minimal yet effective solution that lowers the intrinsic frequency to ensure it remains within a single cycle after extrapolation, without requiring additional modifications.
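The one-line change from the TL;DR can be sketched as follows. This is an illustrative reading, not the authors' released code. It assumes `k` is the 1-based index of the intrinsic frequency component, `L` is the training length in temporal positions, and `s` is the extrapolation factor; the page does not define these symbols. The standard RoPE base of 10000 and the toy values of `DIM`, `K`, `L`, and `S` are likewise assumptions, and `K` is picked by hand rather than identified from a model.

```python
import numpy as np


def rope_freqs(dim, base=10000.0):
    """Standard 1D RoPE frequencies (assumed parameterization)."""
    return 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))


def riflex_freqs(dim, k, L, s, base=10000.0):
    """Copy of the RoPE frequencies with component k (1-based) lowered.

    After the change, one full period of component k spans exactly L * s
    positions, so it stays within a single cycle over the extrapolated length.
    """
    freq = rope_freqs(dim, base)
    freq[k - 1] = (2 * np.pi) / (L * s)  # the one-line modification
    return freq


# Toy values for illustration only (hypothetical, not taken from any model).
DIM, K, L, S = 16, 3, 33, 2

old = rope_freqs(DIM)
new = riflex_freqs(DIM, K, L, S)

old_period = 2 * np.pi / old[K - 1]  # about 62.8: longer than L, shorter than L * S
new_period = 2 * np.pi / new[K - 1]  # equals L * S
print(f"period of component {K}: {old_period:.1f} -> {new_period:.1f}")
```

With these toy numbers, component 3 originally has a period between `L` and `L * S`, so it would wrap around once the sequence is extended. Lowering its frequency stretches the period to cover the full extrapolated length, and every other component is left untouched.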

Training-free 2× Temporal Extrapolation

RIFLEx offers a true free lunch—achieving high-quality 2× extrapolation on SOTA video diffusion transformers in a completely training-free manner. In the following, we present videos extended from 129 to 261 frames at 24 fps.

A person in a red tracksuit pours something out of a cup.

A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

An animated porcupine with a mix of brown and white fur and prominent quills is seen in a cozy, warmly lit interior setting, interacting with a green gift box with a yellow ribbon. The room is filled with wooden furniture and colorful wall decorations, suggesting a cheerful and domestic atmosphere. The porcupine's large eyes and expressive face convey a sense of lightheartedness and curiosity. The camera maintains a low angle, close to the ground, providing an intimate view of the character's actions without any movement, focusing on the playful and curious mood of the scene. The visual style is characteristic of contemporary 3D animation, with vibrant colors and smooth textures that create a polished and engaging look. The scene transitions to an outdoor environment, showcasing a sunny, verdant landscape with rocks, trees, and grass, indicating a natural, possibly forest-like setting. The presence of a small character in the final frame suggests the continuation of a narrative or the introduction of new characters.

Two men and a woman engage in a conversation within a traditional Korean indoor setting, characterized by wooden architecture and natural lighting. The men exhibit a variety of emotions, from shock to amusement, while the woman appears distressed. The camera captures their expressions in medium close-up shots, with a focus on their faces against a softly blurred background, creating an intimate and intense atmosphere. The realistic and cinematic visual style enhances the emotional gravity of the scene.

At a serene lakeside, an amateur photographer adjusts the settings on his camera, preparing to capture the sunset. He starts by focusing on the distant horizon, snapping several shots. As the sun sinks lower, he moves closer to the water’s edge, crouching down to frame the reflection of the sky in the lake. Finally, standing up, he captures the final moments of daylight, the sky ablaze with color.

Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field.

An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.

A fawn Pembroke Welsh Corgi walking in Times Square.

A person's hand is seen interacting with a black and white toy orca in a staged miniature aquatic environment. The scene includes clear blue water, small plastic aquatic plants, and miniature flags with fish symbols, all set against a solid teal background. The playful and imaginative atmosphere is conveyed through the gentle manipulation of the toy, suggesting a storytelling or demonstration context. The camera remains fixed throughout, capturing the scene in a medium shot that focuses on the toy and its immediate surroundings. The visual style is clear and colorful, highlighting the details of the toy and the miniature aquatic setup.

Animated characters, a rabbit and a mouse, are depicted in a perilous situation, first plummeting through a dark, undefined space, and then floating and swimming in a serene underwater environment. The characters are dressed in adventure gear, suggesting a narrative context. The camera closely follows their expressions and movements, capturing the tension and urgency of their situation. The medium and close-up shots emphasize their facial expressions, which convey fear and determination. The visual style is high-quality 3D animation with detailed textures and lighting, creating a cinematic feel.

A man with facial hair, dressed in a burgundy shirt, is seen knocking on a weathered wooden door with a metal latch and a small window, set in a stone wall. The scene transitions to an indoor setting where the man, now wearing a blue shirt, speaks to the camera in a well-lit room furnished with a couch, a bookshelf, and various decorations. The video captures the man in a medium shot with a stationary camera, conveying a casual and friendly atmosphere in the indoor scene, contrasted with a neutral atmosphere in the outdoor scene. The visual style is realistic with natural lighting and color grading.

In a lush meadow, a teenage girl practices yoga on a blanket laid out among wildflowers. She begins with simple poses, stretching her arms upwards and bending backwards. Gradually, her routine intensifies; she transitions into more challenging positions, balancing on one leg and twisting her torso. Her breathing remains steady throughout, eyes closed, fully immersed in the moment.

In a forested area, a temporary camp is set up with tents, a dirt ground, and various equipment, including a four-wheeled vehicle and barrels. A man in a white shirt appears distressed, holding his head, while a woman in a brown dress looks on with concern. The presence of military personnel and civilians suggests a situation of conflict or crisis. The mood is tense and somber, with an undercurrent of urgency or the aftermath of a significant event, as evidenced by the body lying on the ground. The camera maintains a steady, medium-long shot, capturing the expressions and movements of the characters, and the realistic, cinematic visual style enhances the gravity of the scene.

In a serene and traditional Japanese environment at night, two animated characters—a male in traditional Japanese armor and a female with white hair in a blue dress—are the focal point of the scene. They are engaged in a calm interaction, with two other female characters in the background, one with blonde hair and another with white hair. The medium shot captures the characters from behind the male character, providing a clear view of their attire and the traditional Japanese architecture around them. The visual style is akin to Japanese anime with 3D elements, and the camera remains in a fixed position throughout the frames, emphasizing the character's dialogue and movements.

A border collie named Max waits eagerly by the door, tail wagging furiously. As soon as the door opens, he bounds out into the yard, paws skidding slightly on the smooth patio tiles. He races towards a frisbee thrown by his owner, leaping high into the air to catch it mid-flight. Landing gracefully, he trots back proudly, dropping the frisbee at his owner’s feet, ready for another round.

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.

A woman and a man are engaged in a dialogue or confrontation within an opulent indoor setting, suggested by the blurred background with bokeh light effects. The woman, dressed in a blue traditional outfit with intricate patterns, exudes elegance and poise, indicative of high social status, possibly royalty. Her makeup and hair are styled traditionally. The man, in contrast, has a stern expression, long dark hair, and is adorned with a black hat featuring a white pattern, which may signify his lower social status or warrior status. The close-up shots focus on their faces, capturing the tension and moderate emotional intensity of the scene. The camera remains in a fixed position, emphasizing the characters' expressions and the historical drama's visual style.

2× Temporal Extrapolation: Fine-tuning 1,000 Steps without Long Videos

When fine-tuning is performed for only 1,000 steps with a batch size of 8 on original-length videos, dynamic quality and visual quality are further improved. Below are videos extended from 129 to 261 frames at 24 fps.

A person is seen through circular openings in a dark environment, suggesting a secretive or investigative scenario. The scene transitions to a nighttime setting where the same person is cautiously looking out of a window, illuminated by the warm glow of interior lighting against the cool darkness of the exterior. The atmosphere is tense and mysterious, with a moderate emotional intensity. The camera employs close-up shots with a shallow depth of field to focus on the subject, creating an intimate and intense viewing experience. The cinematic visual style is characterized by dramatic lighting and color grading that underscores the mood of the scene.

3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bushy, striped tail. It hops along a sparkling stream, its eyes wide with wonder. The forest is alive with magical elements: flowers that glow and change colors, trees with leaves in shades of purple and silver, and small floating lights that resemble fireflies. The creature stops to interact playfully with a group of tiny, fairy-like beings dancing around a mushroom ring. The creature looks up in awe at a large, glowing tree that seems to be the heart of the forest.

A cat waking up its sleeping owner demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics and finally the owner pulls out a secret stash of treats from under the pillow to hold the cat off a little longer.

A close-up of a musician's fingers skillfully moving over a violin's strings during a solo performance.

A dog wearing virtual reality goggles in sunset.

Animated characters are engaging in a magical interaction within a dark, cavernous environment. The scene centers on a small, orange magical creature with a glowing heart, as well as two dragon-like creatures, one of which is holding a magical potion. The creature opens the potion, causing a transformation, which captures the attention of the dragons. Subsequently, two human characters with a torch discover the aftermath of the transformation, revealing a small, glowing creature resembling the one from earlier. The atmosphere is whimsical and magical, with a sense of curiosity and discovery. The camera remains static, offering medium shots that focus on the characters and their actions, while the visual style is traditional animation with smooth lines and vibrant colors.

A man with slicked-back hair, dressed in a black period costume, is captured in a medium close-up shot within an opulent interior setting. He wears a white half-face mask that covers the left side of his face, which remains expressionless. The scene is rich with historical or theatrical elements, including candlelight, red drapes, and period-style furniture. The atmosphere is dramatic and tense, with a moderate emotional intensity evident from the man's expressive eyes and the intimate lighting. The camera remains in a fixed position throughout, focusing on the man's upper body and facial expressions, which are central to the narrative. The cinematic visual style emphasizes the period detail and dramatic ambiance of the scene.

A man in a formal black suit stands in the entrance of a traditional Korean house, characterized by wooden architecture and a tiled floor. Household items such as shoes, pots, and a hanging basket are visible, suggesting a lived-in space. The man's expression is neutral, and the overall atmosphere is calm and mild. The camera remains in a fixed position, capturing the scene in a medium shot with natural lighting, indicative of a standard television drama style.

A man in a suit is actively speaking and gesturing in front of an audience in an indoor setting, likely a conference hall or auditorium. The audience members are seated in rows, with varying levels of attention and engagement. The speaker's enthusiastic demeanor suggests a formal and educational event. The camera work consists of medium shots of the speaker and long shots of the audience, with a fixed position and sharp focus. The visual style is standard, with no special effects or cinematic techniques, indicative of a lecture or seminar recording.

A man with unkempt brown hair, dressed in a brown jacket and a red neckerchief, is seen interacting with a woman inside a horse-drawn carriage. The setting is outdoors, with historical buildings in the background, suggesting a European town or city from a bygone era. The man's facial expressions convey a sense of urgency and distress, with moderate emotional intensity. The camera work includes close-up shots to emphasize the man's reactions and medium shots to show the interaction with the woman. The focus on the man's face and the coin he examines indicates their significance in the narrative. The visual style is characteristic of a historical drama, with natural lighting and a color scheme that enhances the period feel of the scene.

A man with facial hair, dressed in a plaid shirt, is seated playing an acoustic guitar and singing with visible enjoyment and passion. He is in an indoor setting with a simple gray background, accompanied by a guitar amplifier and a microphone stand to his left, and a stack of "Guitar World" books to his right. The camera maintains a fixed medium shot, capturing his expressive face and hand movements on the guitar strings. The visual quality is clear and unembellished, focusing on the performance without distractions.

In a virtual block-based environment, a group of blue sheep is contained within a wooden pen situated in a flat, grassy area under a clear sky. The sheep exhibit simple, blocky movements as they are herded and directed by an unseen player. The scene is captured from a fixed first-person perspective, providing a medium to long shot view of the activity. The playful and casual atmosphere is conveyed through the mild emotions of the virtual animal management activity. The visual style is characteristic of a block-based building game, with simple textures and a cubic world design.

An animated character with white hair and a muscular build is shown in a close-up, displaying a stern and intense expression. The character is dressed in a red and gold outfit, suggesting a regal or powerful status. The scene transitions to reveal the character seated on a throne-like structure with ornate decorations, addressing a group of people who are standing in front of it. The atmosphere is serious and charged with emotion, indicating a moment of significance or decision-making. The camera focuses on the character's face before widening the shot to include the character's interaction with the group, using fixed position shots without any discernible camera movement. The visual style is characteristic of Japanese anime, with detailed character designs and vibrant coloring.

Two animated characters are engaged in a tense interaction within an ornate indoor setting, possibly a throne room or chamber. The character with long red hair is dressed in a dark, regal outfit, exuding an air of authority or leadership. The other character, with long purple hair, is adorned in a dark, possibly magical or warrior-like attire, and displays a serious or confrontational demeanor. The camera remains fixed on medium shots, capturing the detailed expressions and gestures of the characters, emphasizing the dramatic tension of the scene. The visual style is characteristic of Japanese anime, with vibrant colors and dynamic character designs that contribute to the overall atmosphere of the video.

Spatial and Joint Temporal-spatial Extrapolation

RIFLEx also extends to spatial (resolution) extrapolation and to joint extrapolation of video duration and resolution. Compared to position extrapolation (PE), it achieves better visual quality by addressing repetition issues. Notably, while PE struggles with joint temporal-spatial extrapolation, RIFLEx still delivers high-quality results.

Normal size: 480×720; PE: 480×1440; RIFLEx (Ours): 480×1440

Normal size: 480×720; PE: 960×720; RIFLEx (Ours): 960×720

Normal size: 480×720; PE: 960×1440; RIFLEx (Ours): 960×1440

Normal size: 480×720×49; PE: 960×1440×97; RIFLEx (Ours): 960×1440×97

Temporal Extrapolation Comparisons

We compare RIFLEx against popular length extrapolation methods from LLMs and image diffusion transformers, which often introduce temporal repetition or motion deceleration. In contrast, our approach generates new, temporally coherent content that evolves smoothly. Below, repeated frames, which are identical to the initial frames of the video, are highlighted with a red box.

Video panels: TASR, YaRN, RIFLEx (Ours). Annotations: Repetition, Slower Motion.

Although NTK in HunyuanVideo happens to avoid repetition at 2× extrapolation, this is a side effect that coincidentally satisfies the non-repetition condition, not a result of NTK's intended mechanism. At longer extrapolations (e.g., 2.3×), significant repetition still occurs.

BibTeX


@article{zhao2025riflex,
  title={{RIFLEx}: A Free Lunch for Length Extrapolation in Video Diffusion Transformers},
  author={Zhao, Min and He, Guande and Chen, Yixiao and Zhu, Hongzhou and Li, Chongxuan and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.15894},
  year={2025}
}