RefAlign: Representation Alignment for Reference-to-Video Generation

Lei Wang1,2,*,‡   Yuxin Song2,‡   Ge Wu1   Haocheng Feng2   Hang Zhou2   Jingdong Wang2   Yaxing Wang4✉   Jian Yang1,3✉  
1PCA Lab, VCIP, College of Computer Science, Nankai University, 2Baidu Inc., 3PCA Lab, School of Intelligence Science and Technology, Nanjing University, 4College of Artificial Intelligence, Jilin University
arXiv 2026

‡ Equal contribution

✉ Corresponding authors

* Interns at Baidu Inc.

Reference-to-video generation using our proposed method, RefAlign.

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
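The abstract does not give the exact form of the reference alignment loss, so the following is an illustrative sketch only: an InfoNCE-style contrastive objective that matches the description (same-subject reference/VFM feature pairs pulled together, different-subject pairs pushed apart). The function name, tensor shapes, and temperature are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def reference_alignment_loss(ref_feats, vfm_feats, subject_ids, temperature=0.1):
    """Illustrative contrastive alignment between DiT reference-branch
    features and visual foundation model (VFM) features.

    ref_feats:   (N, D) pooled reference-branch features, one row per subject crop
    vfm_feats:   (N, D) matching VFM features for the same crops
    subject_ids: (N,)   integer subject identity per row
    """
    ref = F.normalize(ref_feats, dim=-1)
    vfm = F.normalize(vfm_feats, dim=-1)

    # Cosine-similarity logits between every reference/VFM pair.
    logits = ref @ vfm.t() / temperature                      # (N, N)

    # Positives share a subject id; all other pairs act as negatives.
    pos_mask = subject_ids[:, None] == subject_ids[None, :]   # (N, N) bool

    # Row-wise cross-entropy, averaging log-probabilities over positives:
    # same-subject pairs are pulled together, different subjects pushed apart.
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(-1) / pos_mask.sum(-1)
    return loss.mean()
```

Since the loss is applied only during training, it adds no inference-time cost: the VFM and this objective are simply dropped at sampling time.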

Motivation


Method


Relationship to REPA


Qualitative Results


Quantitative Results


Ablation Study


More Visualization Results

BibTeX

@article{wang2026refalign,
  title={RefAlign: Representation Alignment for Reference-to-Video Generation},
  author={Wang, Lei and Song, Yuxin and Wu, Ge and Feng, Haocheng and Zhou, Hang and Wang, Jingdong and Wang, Yaxing and Yang, Jian},
  journal={arXiv preprint arXiv:2603.25743},
  year={2026}
}