📢📢📢 RoMo: Robust Motion Segmentation Improves Structure from Motion TL;DR: boost your SfM pipeline on dynamic scenes. We use epipolar cues + SAMv2 features to find robust masks for moving objects in a zero-shot manner. 🧵
18,594 views • 1 year ago • via X (Twitter)

Let's look at some results. An optimization process finds the moving components of the video, disentangling camera ego motion from scene motion.

Our masks are robust to slow/fast camera movements, and can find multiple moving objects, even when they are in the background (look at the pedestrian).

Why care about motion masks? We show that good motion masks improve SfM performance, making COLMAP+our masks the SOTA on synthetic benchmarks. We also collect a real evaluation dataset with GT camera pose using a robotic arm, to evaluate our method in real casual captures.

How does it work? (three steps) 1) We find the Fundamental matrix between adjacent frames in the video with RANSAC. 2) We then identify parts of the frame that have a very low or a very high epipolar error, as weak supervision signals to find the moving objects.
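To make steps 1 and 2 concrete, here is a minimal sketch of how a per-correspondence epipolar error could be turned into weak labels. It is an illustration, not the RoMo implementation: the Sampson error is swapped in as one common choice of epipolar error, the names `sampson_error` and `weak_labels` and both thresholds are hypothetical, and `F` is assumed to have been estimated already (for example with a RANSAC-based estimator such as OpenCV's `cv2.findFundamentalMat`).

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Sampson epipolar error for N correspondences.

    F is a 3x3 fundamental matrix; pts1, pts2 are (N, 2) pixel coordinates
    in the first and second frame.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])      # homogeneous points, frame 1
    x2 = np.hstack([pts2, ones])      # homogeneous points, frame 2
    Fx1 = x1 @ F.T                    # row i is F @ x1[i]
    Ftx2 = x2 @ F                     # row i is F.T @ x2[i]
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def weak_labels(err, low_thresh, high_thresh):
    """Hypothetical labeling rule: 0 = likely static, 1 = likely moving, -1 = unlabeled."""
    labels = np.full(err.shape, -1, dtype=int)
    labels[err < low_thresh] = 0      # very low error: consistent with camera motion
    labels[err > high_thresh] = 1     # very high error: likely an independently moving object
    return labels
```

Everything between the two thresholds stays unlabeled, which is why a later step is needed to complete the masks.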

3) Finally, we train a tiny MLP that classifies SAMv2 features as moving or static given the weak supervisory signal from high and low error masks. These features help complete the motion masks over the video effectively!
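Step 3 can be sketched in the same spirit. The snippet below is a toy one-hidden-layer MLP in plain NumPy, trained with binary cross-entropy on the weakly labeled features only. The architecture, hyperparameters, and the names `train_tiny_mlp` and `predict_moving` are assumptions for illustration; the thread does not specify the actual network, and `feats` stands in for per-pixel or per-patch SAMv2 features that you would extract separately.

```python
import numpy as np

def train_tiny_mlp(feats, labels, hidden=16, lr=0.1, steps=500, seed=0):
    """Fit a small MLP on weak labels (0 = static, 1 = moving, -1 = ignored)."""
    rng = np.random.default_rng(seed)
    keep = labels >= 0                          # train only where a weak label exists
    X, y = feats[keep], labels[keep].astype(float)
    d = X.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    b2 = 0.0
    for _ in range(steps):
        h = np.maximum(X @ W1 + b1, 0.0)        # ReLU hidden layer
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        g = (p - y) / len(y)                    # gradient of mean BCE w.r.t. logits
        gh = np.outer(g, W2) * (h > 0)          # backprop through ReLU
        W2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

def predict_moving(params, feats):
    """Probability that each feature vector belongs to a moving object."""
    W1, b1, W2, b2 = params
    h = np.maximum(feats @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
```

Because the classifier sees features rather than raw epipolar errors, it can also score the regions the weak labels left blank, which is how the masks get filled in.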

and just like that... we get good quality masks, without human annotation or synthetic supervision! Find more results on our website.

This work was led by @lily_goli and @sabour_sara. In collaboration with Mark Matthews, @marcusabrubaker, Dmitry Lagun, @fleet_dj and @srbhsxn at Google DeepMind, and @_AlecJacobson at the University of Toronto.

