RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang1,2,*,‡   Yuxin Song2,‡   Ge Wu1   Haocheng Feng2   Hang Zhou2   Jingdong Wang2   Yaxing Wang4✉   Jian Yang1,3✉  
1PCA Lab, VCIP, College of Computer Science, Nankai University, 2Baidu Inc., 3PCA Lab, School of Intelligence Science and Technology, Nanjing University, 4College of Artificial Intelligence, Jilin University
arXiv 2026

‡ Equal contribution

✉ Corresponding authors

* Interns at Baidu Inc.

Reference-to-video generation using our proposed method, RefAlign.

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
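The abstract does not give the exact form of the reference alignment loss, so the following is an illustrative sketch only: an InfoNCE-style contrastive objective that matches the description (same-subject reference/VFM feature pairs pulled together, different-subject pairs pushed apart). The function name, tensor shapes, and temperature are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def reference_alignment_loss(ref_feats, vfm_feats, subject_ids, temperature=0.1):
    """Illustrative contrastive alignment between DiT reference-branch
    features and visual foundation model (VFM) features.

    ref_feats:   (N, D) pooled reference-branch features, one row per subject crop
    vfm_feats:   (N, D) matching VFM features for the same crops
    subject_ids: (N,)   integer subject identity per row
    """
    ref = F.normalize(ref_feats, dim=-1)
    vfm = F.normalize(vfm_feats, dim=-1)

    # Cosine-similarity logits between every reference/VFM pair.
    logits = ref @ vfm.t() / temperature                      # (N, N)

    # Positives share a subject id; all other pairs act as negatives.
    pos_mask = subject_ids[:, None] == subject_ids[None, :]   # (N, N) bool

    # Row-wise cross-entropy, averaging log-probabilities over positives:
    # same-subject pairs are pulled together, different subjects pushed apart.
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(-1) / pos_mask.sum(-1)
    return loss.mean()
```

Since the loss is applied only during training, it adds no inference-time cost: the VFM and this objective are simply dropped at sampling time.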

Motivation


Method


Relationship to REPA


Qualitative Results


Quantitative Results


Ablation Study


More Visualization Results

BibTeX

@article{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Wang, Lei and Song, Yuxin and Wu, Ge and Feng, Haocheng and Zhou, Hang and Wang, Jingdong and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2603.25743},
  year={2026}
}