ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction

Yanzhe Liang1, Ruijie Zhu1, Hanzhi Chang1, Zhuoyuan Li1, Jiahao Lu1, Tianzhu Zhang1,2
1University of Science and Technology of China
2National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory

Is it possible to unlock 4D dynamic scenes purely from 2D observations, without external dense motion guidance?

Motivation

Motivation of Self-correction Flow Matching. (a) We start with a simple observation: 2D observations, such as the shifting balloon, are caused by 3D motion. Accurately reconstructed 3D motion should naturally align with these visible changes. (b) Unlike previous methods that use external motion priors to supervise 3D motion, we instead use the raw video as motion supervision through a self-correction flow matching mechanism that directly aligns predicted 3D motion projections with 2D frame differences.



Method Overview

Overview of ReFlow. We first construct a complete canonical space containing both static and dynamic components, ensuring a reliable 3D scene initialization. We then disentangle these elements using spatial and spatiotemporal feature planes, a structured representation that handles static and dynamic regions separately. This preparation lets us introduce targeted motion constraints: Full Flow supervises motion across the entire scene, while Camera Flow enforces consistency in static regions, together enabling the self-correction learning mechanism for accurate 3D motion reconstruction.
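To make the feature-plane disentanglement above concrete, here is a minimal PyTorch sketch in the spirit of K-Planes/HexPlane-style factorizations; the plane resolution, feature dimension, multiplicative fusion, and all names are illustrative assumptions rather than ReFlow's exact design.

```python
import torch
import torch.nn.functional as F

class FeaturePlanes(torch.nn.Module):
    """Illustrative plane-factorized 4D representation (not the exact ReFlow design)."""

    def __init__(self, res=64, dim=32):
        super().__init__()
        # Spatial planes (xy, xz, yz) capture the static structure;
        # spatiotemporal planes (xt, yt, zt) capture the dynamics.
        self.spatial = torch.nn.ParameterList(
            [torch.nn.Parameter(0.1 * torch.randn(1, dim, res, res)) for _ in range(3)])
        self.spatiotemporal = torch.nn.ParameterList(
            [torch.nn.Parameter(0.1 * torch.randn(1, dim, res, res)) for _ in range(3)])

    @staticmethod
    def _sample(plane, uv):
        # uv: (N, 2) in [-1, 1] -> (N, dim) bilinearly interpolated features.
        grid = uv.view(1, -1, 1, 2)
        return F.grid_sample(plane, grid, align_corners=True)[0, :, :, 0].t()

    def forward(self, xyzt):
        # xyzt: (N, 4) coordinates normalized to [-1, 1].
        x, y, z, t = xyzt.unbind(-1)
        static_feat, dynamic_feat = 1.0, 1.0
        for plane, (u, v) in zip(self.spatial, [(x, y), (x, z), (y, z)]):
            static_feat = static_feat * self._sample(plane, torch.stack((u, v), -1))
        for plane, (u, v) in zip(self.spatiotemporal, [(x, t), (y, t), (z, t)]):
            dynamic_feat = dynamic_feat * self._sample(plane, torch.stack((u, v), -1))
        # Keep the two feature sets separate so static and dynamic regions
        # can later receive different motion constraints.
        return static_feat, dynamic_feat
```

Keeping the static and dynamic features separate is what lets the region-specific flow constraints described next act on each component independently.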

Self-correction flow matching mechanism. (a) Different Motion and Flow in the 4D Scene. Static areas move only due to camera motion (camera flow), while dynamic areas involve both camera and object motion (full flow). Accurate motion learning therefore requires region-specific flow supervision. (b) Self-correction flow matching. We apply the full flow to warp the entire image from state t1 to state t2 and compare the result with the real observation, validating the overall motion. The camera flow is applied in the same way, but only to static regions, ensuring their stability. Together, these provide a complementary self-correction signal for 3D motion learning.
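A minimal PyTorch sketch of this warp-and-compare step is given below, assuming the predicted flows are backward flows in pixel units defined on the t2 grid and that a per-pixel float static mask is available; flow_warp, self_correction_loss, and the L1 photometric term are hypothetical names and choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def flow_warp(img, flow):
    """Backward-warp img (B, C, H, W) with flow (B, 2, H, W) in pixel units.

    flow[:, 0] / flow[:, 1] are x / y displacements defined on the target
    (t2) grid, pointing back to source (t1) locations.
    """
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    coords = torch.stack((xs, ys)).unsqueeze(0) + flow   # (B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def self_correction_loss(i1, i2, full_flow, cam_flow, static_mask):
    """Photometric self-correction: the video frames themselves supervise flow."""
    # Full flow must explain the change of the whole frame from t1 to t2.
    loss_full = (flow_warp(i1, full_flow) - i2).abs().mean()
    # Camera flow must explain only static pixels (static_mask: (B, 1, H, W)).
    masked_err = (flow_warp(i1, cam_flow) - i2).abs() * static_mask
    loss_cam = masked_err.sum() / (static_mask.sum() * i1.shape[1] + 1e-8)
    return loss_full + loss_cam
```

Because both terms compare a warped frame against a real observation, the supervision comes entirely from the video itself, with no external dense motion labels.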



Reconstruction Results

Qualitative comparison on Nvidia Monocular dataset. Yellow boxes highlight zoomed-in regions for detail examination. Per-scene average PSNR values are provided.

Qualitative comparison on Nerfies-HyperNeRF dataset. Yellow boxes highlight zoomed-in regions for detail examination. Per-scene average PSNR values are provided.



More Visualization

Visualization of our self-correction flow matching progress across training iterations, using the DynamicFace sequence from the Nvidia Monocular dataset. Each row shows results at a different training iteration: 300 (top row), 1000 (second row), 5000 (third row), and 7000 (bottom row). The columns present: left column (I1): ground-truth image at time t1; middle column (I1warped): ground-truth image from t1 warped using our predicted flow field F, demonstrating how content is transformed to match the target frame; right column (I2): ground-truth image at time t2, serving as the reference for evaluating warping accuracy. As training progresses, I1warped increasingly aligns with I2, demonstrating that our self-correction flow matching effectively learns to model dynamic scene motion without requiring external motion guidance.
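As a usage note, this kind of progress check can be reproduced by periodically warping I1 with the current predicted flow (e.g., via the flow_warp sketch above) and tracking its agreement with I2; the PSNR helper below is a hypothetical monitoring utility, not part of the paper.

```python
import torch

def warp_alignment_psnr(i1_warped, i2):
    """PSNR between the flow-warped frame and the real frame at t2 (images in [0, 1])."""
    mse = torch.mean((i1_warped - i2) ** 2)
    return -10.0 * torch.log10(mse + 1e-10)

# Example (names assumed): log at the iterations shown in the figure.
# for step in (300, 1000, 5000, 7000):
#     psnr = warp_alignment_psnr(flow_warp(i1, predicted_full_flow), i2)
```

A rising PSNR indicates that the predicted 3D motion increasingly explains the observed 2D frame difference.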

Qualitative comparison on all scenes from the Nvidia Monocular dataset.

More qualitative comparisons from the Nerfies-HyperNeRF dataset.

Novel-time synthesis results with trajectory visualization across different dynamic scenes. Each column shows a different dataset: Nvidia-Playground, Nvidia-Umbrella, Nvidia-Skating, HyperNeRF-broom, and HyperNeRF-banana. The yellow boxes highlight dynamic/static regions, with our tracked trajectories visualized using the Deform-GS approach. Rows represent different timesteps with fixed camera positions, demonstrating how our method correctly models temporal scene evolution. Note how static background elements remain perfectly stable across frames while dynamic components exhibit physically plausible motion paths. The Playground scene (leftmost column) particularly demonstrates our method's ability to preserve fine structures, such as the blue ribbons, during motion, which are typically challenging to reconstruct accurately.



Video Results

Dynamic scene reconstruction results. We present reconstruction results on various scenes, including everyday activities and dynamic scenarios.