FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

¹Institute for Creative Technologies ²University of Southern California

Abstract

Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet sparse views remain challenging and often lead to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction. A significant limitation of these methods is the slow sampling speed of VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with more than two sparse input views, where pre-trained VDMs must be run multiple times to achieve better spatial coverage.

Architecture

We initialize our student model by training it with a GAN objective. The student model $G_\theta$, initialized with the weights of the teacher model, uses few-step denoising to generate fake videos that aim to fool the discriminator $D$. The fixed teacher model, together with a trainable 3D CNN, serves as the discriminator that differentiates between fake and real videos.
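To make the adversarial setup concrete, the following is a minimal sketch of the two losses involved, operating on scalar discriminator logits. It assumes the common non-saturating logistic GAN loss; this page does not state the exact loss form used by FVGen, so the formulation and the function names (`softplus`, `discriminator_loss`, `generator_loss`) are illustrative assumptions, not the authors' implementation. In the described architecture, the logits would come from the frozen teacher's features passed through the trainable 3D CNN.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_logits, fake_logits):
    """Assumed non-saturating logistic loss for D.

    Low when real videos get high logits and fake videos get low logits.
    """
    real_term = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake_term = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits):
    """Assumed non-saturating loss for the student G_theta.

    Low when D assigns high (real-looking) logits to generated videos.
    """
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)


if __name__ == "__main__":
    # Hypothetical logits for a batch of two real and two generated videos.
    real = [2.0, 1.5]
    fake = [-1.0, 0.5]
    print("D loss:", discriminator_loss(real, fake))
    print("G loss:", generator_loss(fake))
```

Under this assumed formulation, the discriminator update would touch only the trainable head while the teacher backbone stays fixed, and the student update would minimize `generator_loss` on videos produced by its few-step denoising.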

More Video Results

All the videos are generated with 4 sampling steps within 1 second on an H100 GPU.

BibTeX

@inproceedings{teng2025fvgen,
  title={FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation},
  author={Teng, Wenbin and Chen, Gonglin and Chen, Haiwei and Zhao, Yajie},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={26095--26105},
  year={2025}
}

Acknowledgements

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0075. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.