Introducing “FlowMap”, the first self-supervised, differentiable structure-from-motion method that is competitive with conventional SfM like Colmap! IMO this solves a major missing piece for internet-scale training of 3D Deep Learning methods. 1/n

128,543 views • 2 years ago • via X (Twitter)

Vincent Sitzmann • 2 years ago

Work led by our amazing @omcamsmith and @DavidCharatan with the support of the brilliant @_atewari! 2/n

Structure-from-Motion is the only area of computer vision where a non-deep-learning method (Colmap) remains the state of the art. This has slowed us down: instead of finding a self-supervised way to learn 3D, we use Colmap to generate pseudo-ground truth for 3D vision! 3/n

FlowMap is a major step towards solving that problem: it is a fully differentiable, self-supervised structure-from-motion method! Using only off-the-shelf point tracks and optical flow, FlowMap performs SfM whose output outperforms Colmap's for novel view synthesis with Gaussian Splatting! 4/n

Here are some point clouds reconstructed by FlowMap on popular scenes: it really works very robustly!! 5/n

There are two unique aspects to FlowMap: (1) Depth is the *only* free variable; poses and intrinsics are inferred feed-forward! (2) FlowMap is differentiable with respect to the depth estimator, which enables us to train a depth estimator fully self-supervised, just on video! 6/n
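
Point (1) says intrinsics are computed rather than optimized as free variables. As a hedged sketch of how a focal length could be picked in a differentiable, feed-forward way, the snippet below scores a grid of candidate focal lengths and takes a softmin-weighted average instead of a hard argmin. The names `soft_select_focal`, `loss_fn`, and `temperature` are hypothetical; in a real pipeline `loss_fn` would presumably be a flow loss evaluated with poses solved under each candidate, whereas here it is a toy quadratic. This illustrates the idea under those assumptions and is not FlowMap's code.

```python
import numpy as np


def soft_select_focal(candidates, loss_fn, temperature=0.1):
    """Softmin-weighted average of candidate focal lengths (a differentiable stand-in for argmin).

    candidates: 1-D array of focal lengths to try (hypothetical candidate grid).
    loss_fn: callable mapping a focal length to a scalar loss (assumed, e.g. a flow loss).
    """
    candidates = np.asarray(candidates, dtype=float)
    losses = np.array([loss_fn(f) for f in candidates])
    logits = -losses / temperature
    logits -= logits.max()  # subtract the max for numerical stability before exponentiating
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(weights @ candidates), weights


# Toy usage: the "true" focal length is 1.2 on a grid from 0.5 to 2.0.
focal, weights = soft_select_focal(np.linspace(0.5, 2.0, 16), lambda f: (f - 1.2) ** 2, temperature=1e-3)
print(f"selected focal length: {focal:.3f}")
```

Lowering `temperature` pushes the weights toward a one-hot argmin; a larger value spreads weight (and gradients) over more candidates.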

In the limit of having a *perfect* depth estimator and perfect correspondence (for instance, from large-scale training), FlowMap solves the Structure-from-Motion problem (poses, intrinsics, and a fused multi-view point cloud) in a single feed-forward pass! 7/n
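
Once depth and relative poses are available, fusing a multi-view point cloud needs no further optimization. The sketch below assumes each relative pose `(R, t)` maps frame-i camera coordinates to frame-(i+1) coordinates and that per-frame camera-space points were already obtained by unprojecting depth; it chains the poses into camera-to-world transforms (frame 0 as world) and stacks the transformed points. The helpers `chain_poses`, `fuse`, and `rotation_z` are hypothetical names for illustration.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix about the z-axis (toy helper)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def chain_poses(relative_poses):
    """Compose relative poses into 4x4 camera-to-world transforms, with frame 0 as the world frame.

    relative_poses: list of (R, t); each maps frame-i camera coordinates to frame-(i+1) coordinates.
    """
    cam_to_world = [np.eye(4)]
    for R, t in relative_poses:
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t  # frame i -> frame i+1
        # frame (i+1) -> world = (frame i -> world) @ (frame (i+1) -> frame i)
        cam_to_world.append(cam_to_world[-1] @ np.linalg.inv(step))
    return cam_to_world


def fuse(points_per_frame, cam_to_world):
    """Map each frame's camera-space points (N_i, 3) into the world frame and stack them."""
    return np.concatenate(
        [pts @ T[:3, :3].T + T[:3, 3] for pts, T in zip(points_per_frame, cam_to_world)], axis=0
    )


# Toy usage: one world point seen from three cameras; all fused copies should coincide.
X = np.array([[0.3, -0.4, 2.0]])
rel_poses = [(rotation_z(0.2), np.array([1.0, 0.0, 0.5])), (rotation_z(-0.1), np.array([0.0, 0.3, 0.0]))]
views = [X]
for R, t in rel_poses:
    views.append(views[-1] @ R.T + t)  # camera-space coordinates of X in the next frame
cloud = fuse(views, chain_poses(rel_poses))
print(cloud)
```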

We have already released the code, which we’ve spent time organizing for ease of use. It includes the scripts for baselines, figures, and tables, so it will be a breeze for you to reproduce & build on top of it! 8/n

FlowMap minimizes a “camera-induced correspondence loss.” When a camera moves through a static scene, that motion induces correspondences on the image sensor determined by the scene’s geometry, the camera motion, and the camera intrinsics. We supervise these induced correspondences with point tracks. 9/n
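
To make the idea concrete, here is a minimal NumPy sketch under assumed conventions: a pinhole camera with intrinsics `K`, a depth map for frame 1, and a relative pose `(R, t)` mapping frame-1 camera coordinates to frame-2 coordinates. Each pixel is unprojected with its depth, moved by the camera motion, and reprojected; the resulting displacement is the camera-induced flow, which is compared against observed flow with a mean L2 distance. The function names and the choice of L2 are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np


def pixel_grid(h, w):
    """(h*w, 2) array of (u, v) pixel coordinates in row-major order."""
    v, u = np.mgrid[0:h, 0:w]
    return np.stack([u, v], axis=-1).reshape(-1, 2).astype(float)


def induced_flow(depth, K, R, t):
    """Flow that camera motion (R, t) induces on frame-1 pixels, given frame-1 depth and pinhole K."""
    h, w = depth.shape
    uv1 = pixel_grid(h, w)
    homog = np.concatenate([uv1, np.ones((h * w, 1))], axis=1)  # (u, v, 1)
    pts1 = (homog @ np.linalg.inv(K).T) * depth.reshape(-1, 1)  # unproject: d * K^-1 [u, v, 1]^T
    pts2 = pts1 @ R.T + t  # move points into frame-2 camera coordinates
    proj = pts2 @ K.T
    uv2 = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return (uv2 - uv1).reshape(h, w, 2)


def correspondence_loss(depth, K, R, t, observed_flow):
    """Mean L2 distance between camera-induced flow and observed flow (assumed loss form)."""
    residual = induced_flow(depth, K, R, t) - observed_flow
    return float(np.mean(np.linalg.norm(residual, axis=-1)))


# Toy usage: constant depth 2, focal length 100, camera translated by 0.1 along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
t = np.array([0.1, 0.0, 0.0])
flow = induced_flow(depth, K, np.eye(3), t)
print(correspondence_loss(depth, K, np.eye(3), t, flow))  # 0.0: observed flow equals induced flow
```

In this toy setup every pixel moves 100 * 0.1 / 2 = 5 pixels horizontally, which is why depth, intrinsics, and motion are all entangled in the flow the loss compares against.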

However, solving for depth, poses and intrinsics as free variables via gradient descent does not work well (see the paper for why!). Instead, we reparameterize both poses and intrinsics in terms of depth and optical flow, *leaving only depth as a free variable*! 10/n
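
One standard way to express a relative pose as a function of depth and flow, assumed here for illustration, is a weighted orthogonal Procrustes (Kabsch) solve: lift frame-1 pixels to 3D with their depth, lift the flow-corresponding frame-2 pixels with frame 2's depth, and find the rigid transform that best aligns the two point sets. The closed form needs only an SVD, so in an autodiff framework gradients could flow back into depth; NumPy is used here only to keep the sketch short. `weighted_procrustes` and its optional per-point `weights` (for example, flow confidences) are hypothetical names.

```python
import numpy as np


def weighted_procrustes(p1, p2, weights=None):
    """Closed-form rigid transform (R, t) minimizing sum_i w_i * ||R @ p1_i + t - p2_i||^2 (Kabsch).

    p1, p2: (N, 3) corresponding 3D points, e.g. frame-1 pixels and their flow matches in frame 2,
    each lifted to 3D with its own frame's depth. weights: optional (N,) confidences (hypothetical).
    """
    w = np.ones(len(p1)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    c1, c2 = w @ p1, w @ p2  # weighted centroids
    q1, q2 = p1 - c1, p2 - c2
    H = (w[:, None] * q1).T @ q2  # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # -1 would mean a reflection; flip it to get a rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c2 - R @ c1
    return R, t


# Toy usage: recover a known rotation about z and a translation from noise-free points.
rng = np.random.default_rng(0)
p1 = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0], [np.sin(theta), np.cos(theta), 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
R_est, t_est = weighted_procrustes(p1, p1 @ R_true.T + t_true)
print(np.round(R_est, 3), np.round(t_est, 3))
```

Because the pose is an output of this solve rather than a parameter, the only quantity left to optimize is depth.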

Even so, an unnecessary degree of freedom remains: two identical image patches can have *different* depths! To fix this, we reparameterize depth via a small monocular depth predictor. 11/n
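
The extra degree of freedom shows up in a toy example: with a free depth value per pixel, two pixels with identical surroundings can take any depths, but if depth is computed from the local patch, identical patches are forced to identical depths. The sketch below uses a hypothetical linear-plus-softplus "predictor" over 3 x 3 patches (`extract_patches`, `predict_depth`) as a stand-in for the small monocular depth network; it is not the architecture FlowMap uses.

```python
import numpy as np


def extract_patches(image, k=3):
    """One flattened k x k patch per pixel (edge-padded), shape (H*W, k*k*C)."""
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w, c = image.shape
    patches = np.empty((h, w, k * k * c))
    for i in range(h):
        for j in range(w):
            patches[i, j] = padded[i:i + k, j:j + k].ravel()
    return patches.reshape(h * w, -1)


def predict_depth(image, weights, bias, k=3):
    """Toy depth 'predictor' (hypothetical): depth = softplus(patch . weights + bias).

    Because depth is a function of the local patch, identical patches always get identical depth.
    """
    z = extract_patches(image, k) @ weights + bias
    return np.logaddexp(0.0, z).reshape(image.shape[:2])  # softplus keeps depth positive


# Toy usage: an image made of two copies of the same tile side by side.
rng = np.random.default_rng(1)
tile = rng.random((5, 5, 3))
image = np.concatenate([tile, tile], axis=1)  # shape (5, 10, 3)
depth = predict_depth(image, rng.normal(size=27), 0.1)
print(depth[2, 2], depth[2, 7])  # pixels with identical 3x3 neighborhoods
```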

We can use FlowMap itself to supervise & pre-train the depth estimator! Pre-training leads to better results & faster convergence, but it is not strictly necessary: FlowMap works even without any pre-training! The key is “patch-match” regularization: similar RGB patch → similar depth. 12/n
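
A minimal sketch of a patch-similarity regularizer, assuming one plausible form: sample random pixel pairs, weight each pair by a Gaussian kernel on the distance between their RGB patches, and penalize the squared depth difference. The kernel, `sigma`, the random pair sampling, and the name `patch_match_regularizer` are all assumptions; the paper may define this term differently.

```python
import numpy as np


def patch_match_regularizer(patches, depths, num_pairs=1000, sigma=0.1, seed=0):
    """Penalize depth differences between random pixel pairs, weighted by RGB-patch similarity (assumed form).

    patches: (N, P) flattened RGB patches, one per pixel.
    depths:  (N,) predicted depth per pixel.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(depths), size=num_pairs)
    j = rng.integers(0, len(depths), size=num_pairs)
    patch_dist2 = np.sum((patches[i] - patches[j]) ** 2, axis=1)
    similarity = np.exp(-patch_dist2 / sigma**2)  # near 1 for near-identical patches, near 0 otherwise
    return float(np.mean(similarity * (depths[i] - depths[j]) ** 2))


# Toy usage: identical patches with differing depths are penalized; very different patches are not.
same = np.zeros((10, 27))
distinct = np.arange(10)[:, None] * np.ones((1, 27))
depths = np.arange(10, dtype=float)
print(patch_match_regularizer(same, depths), patch_match_regularizer(distinct, depths))
```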

FlowMap allows us to train *any* 3D computer vision model self-supervised, just on video of static scenes. There are infinite cool follow-up directions, from feed-forward SfM to dynamics to multi-view stereo. We can't wait to see what the community will do with it! n/n
