Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time

Chuhan Zhang✦ Guillaume Le Moing✦ Skanda Koppula✦◇ Ignacio Rocco✦ Liliane Momeni✦ Junyu Xie○1 Shuyang Sun✦ Rahul Sukthankar✦ Joëlle K. Barral✦ Raia Hadsell✦ Zoubin Ghahramani✦ Andrew Zisserman✦○ Junlin Zhang✦ Mehdi S. M. Sajjadi✦2
✦Google DeepMind    ◇University College London    ○University of Oxford
1Work done during an internship at Google DeepMind    2Correspondence: d4rt@msajjadi.com

Overview

Abstract

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.

Method

Figure: D4RT method overview (left and right panels).

A global self-attention encoder first transforms the input video into a latent Global Scene Representation, which is passed to a lightweight decoder. The decoder can be independently queried for the 3D position P of any 2D point (u, v) from source timestep tsrc at target timestep ttgt, expressed in the camera coordinate frame of timestep tcam, unlocking full decoding at any point in space and time. The query also contains an embedding of the local frame patch centered around (u, v), providing additional spatial context.
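To make this querying interface concrete, below is a minimal Python/JAX sketch of how one such per-point query might be assembled and decoded. All function names, shapes, and the toy projection weights are illustrative assumptions: the actual encoder and decoder are full transformer networks, and the placeholders here only show the data flow of (u, v, tsrc, ttgt, tcam, local patch) → P.

import jax
import jax.numpy as jnp

D = 64  # latent width (illustrative)

def encode_video(video, key):
    # Stand-in for the global self-attention encoder: maps a (T, H, W, 3) video
    # to a set of latent tokens (the Global Scene Representation).
    T, H, W, _ = video.shape
    n_tokens = T * (H // 8) * (W // 8)            # e.g. one token per 8x8 patch
    return jax.random.normal(key, (n_tokens, D))  # placeholder latents

def embed_query(u, v, t_src, t_tgt, t_cam, patch):
    # Build one query vector from the 2D point, the three timesteps, and the
    # local RGB patch around (u, v); here a simple concatenation + projection.
    feats = jnp.concatenate([
        jnp.array([u, v, t_src, t_tgt, t_cam], dtype=jnp.float32),
        patch.reshape(-1),
    ])
    w = jnp.ones((feats.shape[0], D)) / feats.shape[0]  # toy projection weights
    return feats @ w

def decode_point(scene_tokens, query):
    # Lightweight decoder: one cross-attention readout over the scene tokens,
    # followed by a toy head that maps the readout to a 3D point P.
    attn = jax.nn.softmax(scene_tokens @ query / jnp.sqrt(D))  # (N,)
    readout = attn @ scene_tokens                              # (D,)
    w_out = jnp.ones((D, 3)) / D                               # toy output head
    return readout @ w_out                                     # P = (x, y, z)

# Query the 3D position of pixel (u, v) in frame t_src at target time t_tgt,
# expressed in the camera frame of t_cam.
key = jax.random.PRNGKey(0)
video = jax.random.uniform(key, (8, 64, 64, 3))
scene = encode_video(video, key)
patch = video[2, 30:38, 30:38]                    # 8x8 local patch around (u, v)
q = embed_query(u=34.0, v=34.0, t_src=2, t_tgt=5, t_cam=5, patch=patch)
P = decode_point(scene, q)                        # (3,) point in t_cam coordinates

Because each query is independent, the decoder can be evaluated for as few or as many points as a task requires, which is what keeps training and inference lightweight.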

Capabilities

Figure: D4RT capabilities overview (left and right panels).

3D Tracking: predicts sparse 3D tracks for a few pixels in selected frames in local camera coordinates.

3D Reconstruction: unprojects per-frame depth values using the camera pose; since no correspondences are used, dynamic objects appear duplicated across frames.

All pixels tracking: produces a holistic scene reconstruction by predicting the 3D tracks of all pixels in the video in world coordinates.
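One way to read the three capabilities above is that they differ only in which batch of (u, v, tsrc, ttgt, tcam) queries is sent to the same decoder. The sketch below spells this out; the exact conventions, in particular using a fixed reference camera as a stand-in for world coordinates, are assumptions made for illustration rather than the model's actual interface.

def tracking_queries(points, T):
    # 3D tracking: a few (u, v, t_src) points, each queried at every target
    # frame, with the answer expressed in that frame's own camera (t_cam = t_tgt).
    return [(u, v, t_src, t_tgt, t_tgt)
            for (u, v, t_src) in points
            for t_tgt in range(T)]

def reconstruction_queries(H, W, T):
    # 3D reconstruction: every pixel of every frame, queried at its own
    # timestep (t_tgt = t_src = t_cam), i.e. per-frame unprojected depth.
    return [(u, v, t, t, t)
            for t in range(T)
            for v in range(H)
            for u in range(W)]

def track_all_pixels_queries(H, W, T, t_ref=0):
    # All pixels tracking: every pixel of every frame, queried at every target
    # frame, expressed in a single reference camera (stand-in for world coords).
    return [(u, v, t_src, t_tgt, t_ref)
            for t_src in range(T)
            for t_tgt in range(T)
            for v in range(H)
            for u in range(W)]

# Example: sparse 3D tracks for two pixels over an 8-frame clip (16 queries).
queries = tracking_queries([(12, 40, 0), (50, 9, 3)], T=8)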

Interactive 4D Reconstruction

Click and drag to rotate the scene.

For reduced file size and faster loading, up to 90% of points have been removed from each scene.
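As a rough illustration of how the point clouds could be thinned before being loaded into the viewer, the snippet below keeps roughly 10% of the points (i.e., removes up to 90%, as noted above). Uniform random subsampling is an assumption here, not a description of the actual export pipeline.

import numpy as np

def subsample_points(points, keep_fraction=0.1, seed=0):
    # points: (N, 3) array of 3D positions; keep roughly `keep_fraction` of them.
    rng = np.random.default_rng(seed)
    keep = rng.random(points.shape[0]) < keep_fraction
    return points[keep]

cloud = np.random.rand(100_000, 3)
small = subsample_points(cloud, keep_fraction=0.1)  # ~10,000 points remain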

Citation

@article{zhang2025d4rt,
  title={Efficiently Reconstructing Dynamic Scenes One D4RT at a Time},
  author={Zhang, Chuhan and Le Moing, Guillaume and Koppula, Skanda and Rocco, Ignacio and Momeni, Liliane and Xie, Junyu and Sun, Shuyang and Sukthankar, Rahul and Barral, Jo{\"e}lle K and Hadsell, Raia and Ghahramani, Zoubin and Zisserman, Andrew and Zhang, Junlin and Sajjadi, Mehdi SM},
  journal={arXiv preprint},
  year={2025}
}

Contributions

MS led the project, with management support from JZ. MS proposed the SRT-style decoder, and GL proposed local RGB patches and tracking-all-pixels. CZ, GL, IR, and MS contributed to the core model design. LM, SS, IR, and SK designed and created training datasets. CZ, GL, and MS carried out the major implementation, with significant contributions from SK, IR, LM, and JX. CZ drove the model experimentation. Comprehensive evaluations and data pipelines were set up by CZ, GL, SK, IR, LM, JX, SS, and MS. Visualizations were produced by GL, IR, and JX. RS, JB, RH, ZG, and AZ provided project support, advising, and guidance.

Acknowledgements

We thank a number of colleagues and advisors for making this work possible. We thank Saurabh Saxena, Kaiming He, Carl Doersch, Leonidas Guibas, Noah Snavely, Ben Poole, Joao Carreira, Pauline Luc, Yi Yang, Howard Huang, Huizhong Chen, and Cordelia Schmid for providing advice during the project; Gabriel Brostow for advising SK during the course of the project and for providing feedback on the manuscript; Relja Aranđelović and Maks Ovsjanikov for providing feedback on the manuscript; Ross Goroshin, Tengda Han, and Dilara Gokay for their help during early-stage development; Aravindh Mahendran for helping with code reviews; Daniel Duckworth for helping with visualizations and comparisons against baselines; and Alberto García and Jesús Pérez for their invaluable contributions to data generation and collection.

Finally, we thank the authors of the splat viewer for their WebGL renderer that we adapted for our 4D reconstruction visualizations.