Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
A global self-attention encoder first transforms the input video into the latent Global Scene Representation, which is passed to a lightweight decoder. The decoder can be independently queried for the 3D position P of any given 2D point (u, v) from the source timestep t_src at target timestep t_tgt, expressed in the coordinate frame of camera t_cam, unlocking full decoding at any point in space and time. The query also contains an embedding of the local frame patch centered around (u, v), providing additional spatial context.
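The decoding interface above can be sketched in a few lines. This is a hypothetical illustration only: the function names, the query layout (u, v, t_src, t_tgt, t_cam, patch), the patch size, and the toy random-projection "decoder" are all assumptions, not the authors' actual API; the real model cross-attends from each query into the Global Scene Representation.

```python
import numpy as np

PATCH = 8  # assumed side length of the local RGB patch embedded in each query

def make_query(video, u, v, t_src, t_tgt, t_cam):
    """Pack one decoder query: the 2D point, the three timestep/camera
    indices, and a flattened local patch around (u, v) for spatial context."""
    h, w = video.shape[1:3]
    v0 = int(np.clip(v - PATCH // 2, 0, h - PATCH))
    u0 = int(np.clip(u - PATCH // 2, 0, w - PATCH))
    patch = video[t_src, v0:v0 + PATCH, u0:u0 + PATCH]
    return np.concatenate([[u, v, t_src, t_tgt, t_cam], patch.ravel()])

def decode(scene_repr, queries):
    """Toy stand-in for the lightweight decoder: maps each query
    independently to a 3D point P. Here a fixed random projection is
    used so the sketch runs; the real decoder attends into scene_repr."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((queries.shape[1], 3))
    return queries @ proj  # (N, 3): one 3D point per query

video = np.zeros((4, 32, 32, 3), dtype=np.float32)  # T=4 toy frames
scene_repr = None  # placeholder for the encoder's output latents
# Track one pixel from frame 0 across all target timesteps, camera 0:
qs = np.stack([make_query(video, 10, 12, t_src=0, t_tgt=t, t_cam=0)
               for t in range(4)])
points = decode(scene_repr, qs)
print(points.shape)  # (4, 3)
```

Because queries are decoded independently, any subset of (pixel, timestep) pairs can be probed without paying for dense per-frame decoding.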
3D Tracking: predicts sparse 3D tracks for a few pixels in selected frames in local camera coordinates.
3D Reconstruction: unprojects per-frame depth values using the camera poses; without correspondences, dynamic objects appear duplicated across frames.
All pixels tracking: produces a holistic scene reconstruction by predicting the 3D tracks of all pixels in the video in world coordinates.
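The three tasks above reduce to different query patterns over the same decoder. The sketch below is a hypothetical illustration: the (u, v, t_src, t_tgt, t_cam) tuple layout and the WORLD sentinel for world coordinates are assumptions made for clarity, not the authors' exact encoding.

```python
from itertools import product

T, H, W = 4, 32, 32  # toy video dimensions
WORLD = -1           # assumed sentinel meaning "world coordinates"

# 3D tracking: sparse 3D tracks for a few pixels from selected frames,
# decoded at every target timestep in local camera coordinates.
track_pts = [(10, 12, 0), (20, 8, 1)]  # (u, v, t_src)
tracking = [(u, v, ts, tt, tt) for (u, v, ts) in track_pts
            for tt in range(T)]

# 3D reconstruction: every pixel decoded only at its own timestep,
# i.e. per-frame depth unprojected with the pose (no correspondences).
recon = [(u, v, t, t, t)
         for t, v, u in product(range(T), range(H), range(W))]

# All-pixels tracking: every pixel of every frame decoded at every
# target timestep, in world coordinates.
all_pixels = [(u, v, ts, tt, WORLD)
              for ts, v, u in product(range(T), range(H), range(W))
              for tt in range(T)]

print(len(tracking), len(recon), len(all_pixels))  # 8 4096 16384
```

The three modes differ only in which queries are issued, so moving between sparse tracking and holistic reconstruction requires no change to the model itself.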
@article{zhang2025d4rt,
title={Efficiently Reconstructing Dynamic Scenes One D4RT at a Time},
author={Zhang, Chuhan and Le Moing, Guillaume and Koppula, Skanda and Rocco, Ignacio and Momeni, Liliane and Xie, Junyu and Sun, Shuyang and Sukthankar, Rahul and Barral, Jo{\"e}lle K and Hadsell, Raia and Ghahramani, Zoubin and Zisserman, Andrew and Zhang, Junlin and Sajjadi, Mehdi SM},
journal={arXiv preprint},
year={2025}
}
MS led the project, with management support from JZ. MS proposed the SRT-style decoder, and GL proposed local RGB patches and tracking-all-pixels. CZ, GL, IR, and MS contributed to the core model design. LM, SS, IR, and SK designed and created training datasets. CZ, GL, and MS carried out the major implementation, with significant contributions from SK, IR, LM, and JX. CZ drove the model experimentation. Comprehensive evaluations and data pipelines were set up by CZ, GL, SK, IR, LM, JX, SS, and MS. Visualizations were produced by GL, IR and JX. RS, JB, RH, ZG, and AZ provided project support, advising, and guidance.
We thank a number of colleagues and advisors for making this work possible. We thank Saurabh Saxena, Kaiming He, Carl Doersch, Leonidas Guibas, Noah Snavely, Ben Poole, Joao Carreira, Pauline Luc, Yi Yang, Howard Huang, Huizhong Chen, and Cordelia Schmid for providing advice during the project; Gabriel Brostow, for advising SK during the course of the project and for providing feedback on the manuscript; Relja Arandjelović and Maks Ovsjanikov for providing feedback on the manuscript; Ross Goroshin, Tengda Han, and Dilara Gokay for their help during early-stage development; Aravindh Mahendran for helping with code reviews; Daniel Duckworth for helping with visualizations and comparisons against baselines; and Alberto García and Jesús Pérez for their invaluable contributions to data generation and collection.
Finally, we thank the authors of the splat viewer for their WebGL renderer that we adapted for our 4D reconstruction visualizations.