Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
A global self-attention encoder first transforms the input video into the latent Global Scene Representation, which is passed to a lightweight decoder. The decoder can be independently queried for the 3D position P of any given 2D point (u, v) from the source timestep t_src at target timestep t_tgt, expressed in the coordinate frame of camera t_cam, unlocking full decoding at any point in space and time. The query also contains an embedding of the local frame patch centered around (u, v), providing additional spatial context.
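The decoding interface above can be sketched in a few lines. This is a hypothetical illustration only: the function names, the query layout (u, v, t_src, t_tgt, t_cam, patch), the patch size, and the toy random-projection "decoder" are all assumptions, not the authors' actual API; the real model cross-attends from each query into the Global Scene Representation.

```python
import numpy as np

PATCH = 8  # assumed side length of the local RGB patch embedded in each query

def make_query(video, u, v, t_src, t_tgt, t_cam):
    """Pack one decoder query: the 2D point, the three timestep/camera
    indices, and a flattened local patch around (u, v) for spatial context."""
    h, w = video.shape[1:3]
    v0 = int(np.clip(v - PATCH // 2, 0, h - PATCH))
    u0 = int(np.clip(u - PATCH // 2, 0, w - PATCH))
    patch = video[t_src, v0:v0 + PATCH, u0:u0 + PATCH]
    return np.concatenate([[u, v, t_src, t_tgt, t_cam], patch.ravel()])

def decode(scene_repr, queries):
    """Toy stand-in for the lightweight decoder: maps each query
    independently to a 3D point P. Here a fixed random projection is
    used so the sketch runs; the real decoder attends into scene_repr."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((queries.shape[1], 3))
    return queries @ proj  # (N, 3): one 3D point per query

video = np.zeros((4, 32, 32, 3), dtype=np.float32)  # T=4 toy frames
scene_repr = None  # placeholder for the encoder's output latents
# Track one pixel from frame 0 across all target timesteps, camera 0:
qs = np.stack([make_query(video, 10, 12, t_src=0, t_tgt=t, t_cam=0)
               for t in range(4)])
points = decode(scene_repr, qs)
print(points.shape)  # (4, 3)
```

Because queries are decoded independently, any subset of (pixel, timestep) pairs can be probed without paying for dense per-frame decoding.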
3D Tracking: predicts sparse 3D tracks for a few pixels in selected frames in local camera coordinates.
3D Reconstruction: unprojects per-frame depth values using the camera poses; without correspondences, dynamic objects appear duplicated across frames.
All pixels tracking: produces a holistic scene reconstruction by predicting the 3D tracks of all pixels in the video in world coordinates.
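The three tasks above reduce to different query patterns over the same decoder. The sketch below is a hypothetical illustration: the (u, v, t_src, t_tgt, t_cam) tuple layout and the WORLD sentinel for world coordinates are assumptions made for clarity, not the authors' exact encoding.

```python
from itertools import product

T, H, W = 4, 32, 32  # toy video dimensions
WORLD = -1           # assumed sentinel meaning "world coordinates"

# 3D tracking: sparse 3D tracks for a few pixels from selected frames,
# decoded at every target timestep in local camera coordinates.
track_pts = [(10, 12, 0), (20, 8, 1)]  # (u, v, t_src)
tracking = [(u, v, ts, tt, tt) for (u, v, ts) in track_pts
            for tt in range(T)]

# 3D reconstruction: every pixel decoded only at its own timestep,
# i.e. per-frame depth unprojected with the pose (no correspondences).
recon = [(u, v, t, t, t)
         for t, v, u in product(range(T), range(H), range(W))]

# All-pixels tracking: every pixel of every frame decoded at every
# target timestep, in world coordinates.
all_pixels = [(u, v, ts, tt, WORLD)
              for ts, v, u in product(range(T), range(H), range(W))
              for tt in range(T)]

print(len(tracking), len(recon), len(all_pixels))  # 8 4096 16384
```

The three modes differ only in which queries are issued, so moving between sparse tracking and holistic reconstruction requires no change to the model itself.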
@article{zhang2025d4rt,
title={Efficiently Reconstructing Dynamic Scenes One D4RT at a Time},
author={Zhang, Chuhan and Le Moing, Guillaume and Koppula, Skanda and Rocco, Ignacio and Momeni, Liliane and Xie, Junyu and Sun, Shuyang and Sukthankar, Rahul and Barral, Jo{\"e}lle K and Hadsell, Raia and Ghahramani, Zoubin and Zisserman, Andrew and Zhang, Junlin and Sajjadi, Mehdi SM},
journal={arXiv preprint},
year={2025}
}
MS led the project, with management support from JZ. MS proposed the SRT-style decoder, and GL proposed local RGB patches and tracking-all-pixels. CZ, GL, IR, and MS contributed to the core model design. LM, SS, IR, and SK designed and created training datasets. CZ, GL, and MS carried out the major implementation, with significant contributions from SK, IR, LM, and JX. CZ drove the model experimentation. Comprehensive evaluations and data pipelines were set up by CZ, GL, SK, IR, LM, JX, SS, and MS. Visualizations were produced by GL, IR and JX. RS, JB, RH, ZG, and AZ provided project support, advising, and guidance.
We thank a number of colleagues and advisors for making this work possible. We thank Saurabh Saxena, Kaiming He, Carl Doersch, Leonidas Guibas, Noah Snavely, Ben Poole, Joao Carreira, Pauline Luc, Yi Yang, Howard Huang, Huizhong Chen, and Cordelia Schmid for providing advice during the project; Gabriel Brostow, for advising SK during the course of the project and for providing feedback on the manuscript; Relja Arandjelović and Maks Ovsjanikov for providing feedback on the manuscript; Ross Goroshin, Tengda Han, and Dilara Gokay for their help during early-stage development; Aravindh Mahendran for helping with code reviews; Daniel Duckworth for helping with visualizations and comparisons against baselines; and Alberto García and Jesús Pérez for their invaluable contributions to data generation and collection.
Finally, we thank the authors of the splat viewer for their WebGL renderer that we adapted for our 4D reconstruction visualizations.