📢📢📢 RoMo: Robust Motion Segmentation Improves Structure from Motion

TL;DR: boost your SfM pipeline on dynamic scenes. We use epipolar cues + SAMv2 features to find robust masks for moving objects in a zero-shot manner. 🧵👇

18,594 views • 1 year ago • via X (Twitter)

7 comments

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Let's look at some results. An optimization process finds the moving components of the video, disentangling camera ego motion from scene motion.

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Our masks are robust to slow/fast camera movements, and can find multiple moving objects, even when they are in the background (look at the pedestrian 🧍)

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Why care about motion masks? We show that good motion masks improve SfM performance, making COLMAP+our masks the SOTA on synthetic benchmarks. We also collect a real evaluation dataset with GT camera pose using a robotic arm, to evaluate our method in real casual captures.
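One simple way a motion mask could feed into an SfM front end is to discard keypoints that land on moving pixels before matching. The sketch below is illustrative only: it assumes a boolean mask where `True` marks motion, and the function name `drop_masked_keypoints` and the nearest-pixel rounding are assumptions, not taken from RoMo or COLMAP.

```python
import numpy as np

def drop_masked_keypoints(keypoints_xy, moving_mask):
    """Keep only keypoints that fall on static pixels.

    keypoints_xy: (N, 2) float array of (x, y) pixel coordinates.
    moving_mask:  (H, W) bool array, True where the pixel is labeled as moving.
    """
    height, width = moving_mask.shape
    # Round to the nearest pixel and clamp to the image bounds.
    cols = np.clip(np.round(keypoints_xy[:, 0]).astype(int), 0, width - 1)
    rows = np.clip(np.round(keypoints_xy[:, 1]).astype(int), 0, height - 1)
    keep = ~moving_mask[rows, cols]
    return keypoints_xy[keep]

# Toy example: a 4x6 image with a 2x2 moving region.
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:4] = True
keypoints = np.array([[0.0, 0.0], [2.0, 1.0], [3.4, 2.2], [5.0, 3.0]])
static_keypoints = drop_masked_keypoints(keypoints, mask)
```

Only the first and last keypoints survive here, since the middle two fall inside the moving region.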

Andrea Tagliasacchi 🇨🇦 • 1 year ago

How does it work? (three steps)
1) We find the fundamental matrix between adjacent frames in the video with RANSAC.
2) We then identify parts of the frame that have a very low or a very high epipolar error, as weak supervision signals to find the moving objects.
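Step 2 can be sketched in a few lines of NumPy. This assumes the fundamental matrix `F` has already been estimated (for example by a RANSAC-based solver) and that point correspondences between the two frames are available. The Sampson distance is used here as one common choice of epipolar error; the thread does not say which error measure or thresholds RoMo uses, so `low_thresh` and `high_thresh` are hypothetical values.

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Sampson (first-order) epipolar error for N correspondences.

    F:    (3, 3) fundamental matrix, with x2^T F x1 = 0 for static points.
    pts1: (N, 2) pixel coordinates in frame t.
    pts2: (N, 2) matching pixel coordinates in frame t+1.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])   # homogeneous coordinates, (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                 # row i is F @ x1_i
    Ftx2 = x2 @ F                  # row i is F^T @ x2_i
    numerator = np.sum(x2 * Fx1, axis=1) ** 2
    denominator = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return numerator / np.maximum(denominator, 1e-12)

def weak_labels(errors, low_thresh, high_thresh):
    """0 = likely static, 1 = likely moving, -1 = left unlabeled."""
    labels = np.full(errors.shape, -1, dtype=int)
    labels[errors < low_thresh] = 0
    labels[errors > high_thresh] = 1
    return labels

# Toy example: a camera translating along x, so static points keep their y.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [30.0, 40.0], [5.0, 5.0]])
pts2 = np.array([[15.0, 20.0], [32.0, 50.0], [6.0, 7.0]])
errors = sampson_error(F, pts1, pts2)
labels = weak_labels(errors, low_thresh=1.0, high_thresh=9.0)
```

Only the confidently low and confidently high errors get a label; everything in between is left for a later stage to resolve.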

Andrea Tagliasacchi 🇨🇦 • 1 year ago

3) Finally, we train a tiny MLP that classifies SAMv2 features as moving or static given the weak supervisory signal from high and low error masks. These features help complete the motion masks over the video effectively!
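To make step 3 concrete, here is a minimal NumPy sketch of a one-hidden-layer MLP trained only on weakly labeled samples and then applied to every sample. Everything in it is an assumption for illustration: the random 8-D vectors stand in for SAMv2 features, and the layer size, learning rate, and loss (binary cross-entropy) are not taken from the RoMo paper or code.

```python
import numpy as np

def _forward(params, feats):
    W1, b1, W2, b2 = params
    hidden = np.maximum(feats @ W1 + b1, 0.0)          # ReLU hidden layer
    prob = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # sigmoid output
    return hidden, prob

def train_tiny_mlp(feats, labels, hidden_dim=16, lr=0.1, steps=300, seed=0):
    """Train on samples with label 0 (static) or 1 (moving); -1 is ignored."""
    rng = np.random.default_rng(seed)
    labeled = labels >= 0
    X, y = feats[labeled], labels[labeled].astype(float)
    dim = X.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden_dim))
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), hidden_dim)
    b2 = 0.0
    for _ in range(steps):
        hidden, prob = _forward((W1, b1, W2, b2), X)
        g_logit = (prob - y) / len(y)                  # gradient of mean BCE w.r.t. logits
        g_hidden = np.outer(g_logit, W2) * (hidden > 0)
        W2 -= lr * (hidden.T @ g_logit)
        b2 -= lr * g_logit.sum()
        W1 -= lr * (X.T @ g_hidden)
        b1 -= lr * g_hidden.sum(axis=0)
    return W1, b1, W2, b2

def predict_moving(params, feats):
    """Probability that each feature vector belongs to a moving region."""
    return _forward(params, feats)[1]

# Synthetic stand-in features: two clusters, only 20 samples per cluster labeled.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(-1.0, 0.3, (100, 8)),   # "static" cluster
                   rng.normal(+1.0, 0.3, (100, 8))])  # "moving" cluster
truth = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
labels = np.full(200, -1)
labels[:20] = 0
labels[100:120] = 1

params = train_tiny_mlp(feats, labels)
pred = predict_moving(params, feats) > 0.5
accuracy_on_unlabeled = (pred[labels < 0] == truth[labels < 0]).mean()
```

The point of the toy setup is the same as in the thread: sparse, confident labels are enough for a small classifier to fill in the rest when the features already separate moving from static regions.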

Andrea Tagliasacchi 🇨🇦 • 1 year ago

And just like that... we get good-quality masks, without human annotation or synthetic supervision! Find more results on our website →

Andrea Tagliasacchi 🇨🇦 • 1 year ago

This work was led by @lily_goli and @sabour_sara. In collaboration with Mark Matthews, @marcusabrubaker, Dmitry Lagun, @fleet_dj and @srbhsxn at Google DeepMind, and @_AlecJacobson at the University of Toronto.

Related videos

📢📢 PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing 📢📢

PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly used methods such as LPIPS.

Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output.

In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through (potentially hand-drawn) segmentation maps, and condition style via an image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline.

Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,808 views • 7 months ago