Novel view synthesis (NVS) enables immersive experiences in computer vision and graphics. Despite recent progress, existing techniques rely on dense multi-view observations, which restricts their applicability. This work addresses the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism that imposes explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy that handles scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features with a hybrid network architecture. Extensive experiments show that our method improves sparse-view reconstruction and restores the realistic appearance of 3D scenes.
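To make the epipolar attention constraint concrete, the sketch below builds a per-pixel epipolar attention mask from a relative camera pose: each target-view pixel is only allowed to attend to source-view pixels lying near its epipolar line. This is a generic formulation under a shared-intrinsics assumption, not the paper's released implementation; all function names and the pixel threshold are illustrative.

```python
# Hypothetical epipolar attention mask (not the paper's code): for each query
# pixel in the target view, attention to the source view is restricted to
# pixels within a few pixels of the corresponding epipolar line.
import torch

def skew(t: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix [t]_x for a batch of translations (B, 3)."""
    B = t.shape[0]
    S = torch.zeros(B, 3, 3, dtype=t.dtype, device=t.device)
    S[:, 0, 1], S[:, 0, 2] = -t[:, 2], t[:, 1]
    S[:, 1, 0], S[:, 1, 2] = t[:, 2], -t[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -t[:, 1], t[:, 0]
    return S

def epipolar_mask(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                  H: int, W: int, thresh: float = 2.0) -> torch.Tensor:
    """Boolean mask (B, H*W, H*W): True where a source pixel lies within
    `thresh` pixels of the epipolar line of a target pixel.
    K: (B, 3, 3) shared intrinsics; (R, t) maps target to source camera."""
    B = K.shape[0]
    # Fundamental matrix F = K^-T [t]_x R K^-1 (shared intrinsics assumed).
    K_inv = torch.linalg.inv(K)
    F = K_inv.transpose(-1, -2) @ skew(t) @ R @ K_inv       # (B, 3, 3)
    # Homogeneous pixel grid, flattened to (H*W, 3).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], -1).reshape(-1, 3).float()
    pix = pix.to(K.device).expand(B, -1, -1)                # (B, HW, 3)
    # Epipolar line in the source view for every target pixel: l = F @ x.
    lines = pix @ F.transpose(-1, -2)                       # (B, HW, 3)
    # Point-line distance |ax + by + c| / sqrt(a^2 + b^2) for all pixel pairs.
    num = (lines @ pix.transpose(-1, -2)).abs()             # (B, HW_tgt, HW_src)
    den = lines[..., :2].norm(dim=-1, keepdim=True).clamp(min=1e-8)
    return (num / den) <= thresh
```

In practice such a mask (or a soft bias derived from the same distances) would be applied inside the cross-attention layers that tie the generated frames to the reference view.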
Overview of our pipeline. Our generative reconstruction pipeline consists of two stages: camera-controlled video generation followed by reconstruction. First, we set the exploration path based on the input views. In the video generation module, a ray embedding parameterizes the camera information, and a cross-attention mechanism with epipolar constraints is introduced to improve the 3D consistency of the generated video. The robust 3D scene reconstruction stage integrates monocular depth priors with semantic features extracted from the video latent space and directly regresses 3D Gaussian primitives in a feed-forward manner.
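To make the ray-embedding step concrete, below is a minimal sketch of a Plücker-style per-pixel ray parameterization, a common choice for conditioning video diffusion models on camera pose. The 6-channel (direction, moment) layout and all names are assumptions for illustration, not the paper's code.

```python
# Hypothetical Plücker ray embedding: each pixel is encoded by its world-space
# ray direction d and moment o x d, giving a (6, H, W) conditioning map.
import torch

def plucker_ray_embedding(K: torch.Tensor, c2w: torch.Tensor,
                          H: int, W: int) -> torch.Tensor:
    """K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32), indexing="ij")
    # Back-project pixel centers to camera-space ray directions.
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs_cam = pix @ torch.linalg.inv(K).T                  # (H, W, 3)
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)                     # camera center
    moment = torch.cross(origin, dirs, dim=-1)              # Plücker moment
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)  # (6, H, W)
```

A map like this can be concatenated or injected into the video backbone per frame, so every generated frame is tied to an explicit camera along the exploration path.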
@article{zhang2025spatialcrafter,
  title={SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations},
  author={Zhang, Songchun and Xu, Huiyao and Guo, Sitong and Xie, Zhongwei and Liu, Pengwei and Bao, Hujun and Xu, Weiwei and Zou, Changqing},
  journal={arXiv preprint arXiv:2505.11992},
  year={2025}
}