📢📢📢 RoMo: Robust Motion Segmentation Improves Structure from Motion TL;DR: boost your SfM pipeline on dynamic scenes. We use epipolar cues + SAMv2 features to find robust masks for moving objects in a zero-shot manner. 🧵👇

18,603 views • 1 year ago • via X (Twitter)

7 comments

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Let's look at some results. An optimization process finds the moving components of the video, disentangling camera ego motion from scene motion.

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Our masks are robust to slow/fast camera movements, and can find multiple moving objects, even when they are in the background (look at the pedestrian).

Andrea Tagliasacchi 🇨🇦 • 1 year ago

Why care about motion masks? We show that good motion masks improve SfM performance, making COLMAP+our masks the SOTA on synthetic benchmarks. We also collect a real evaluation dataset with GT camera pose using a robotic arm, to evaluate our method in real casual captures.

Andrea Tagliasacchi 🇨🇦 • 1 year ago

How does it work? (three steps) 1) We find the Fundamental matrix between adjacent frames in the video with RANSAC. 2) We then identify parts of the frame that have a very low or a very high epipolar error, as weak supervision signals to find the moving objects.
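
Steps 1 and 2 can be sketched in plain NumPy. This is only an illustrative sketch, not the authors' code: it assumes point correspondences between two adjacent frames are already available (the thread does not say how they are obtained), uses the normalized eight-point algorithm inside a basic RANSAC loop, and measures epipolar error with the Sampson distance. The function names, the thresholds `low` and `high`, and the reading "low error = likely static, high error = likely moving" are assumptions made for illustration.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / (np.linalg.norm(pts - mean, axis=1).mean() + 1e-12)
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T

def eight_point(p1, p2):
    """Estimate F with x2^T F x1 = 0 from at least 8 correspondences."""
    x1, T1 = _normalize(p1)
    x2, T2 = _normalize(p2)
    # Each row is the epipolar constraint written as a linear equation in F.
    A = np.column_stack([x2[:, 0:1] * x1, x2[:, 1:2] * x1, x1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt  # enforce rank 2
    return T2.T @ F @ T1                     # undo the normalization

def sampson_error(F, p1, p2):
    """First-order approximation of the squared epipolar distance."""
    x1 = np.column_stack([p1, np.ones(len(p1))])
    x2 = np.column_stack([p2, np.ones(len(p2))])
    Fx1 = x1 @ F.T   # row i holds F @ x1_i
    Ftx2 = x2 @ F    # row i holds F.T @ x2_i
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / (den + 1e-12)

def ransac_fundamental(p1, p2, iters=500, thresh=1.0, seed=0):
    """Basic RANSAC; iters and thresh are illustrative defaults."""
    rng = np.random.default_rng(seed)
    best_F, best_count = None, -1
    for _ in range(iters):
        idx = rng.choice(len(p1), size=8, replace=False)
        F = eight_point(p1[idx], p2[idx])
        count = int(np.sum(sampson_error(F, p1, p2) < thresh))
        if count > best_count:
            best_F, best_count = F, count
    inliers = sampson_error(best_F, p1, p2) < thresh
    if inliers.sum() >= 8:  # refit on all inliers when there are enough
        best_F = eight_point(p1[inliers], p2[inliers])
    return best_F, inliers

def weak_labels(err, low=0.01, high=9.0):
    """0 = likely static, 1 = likely moving, -1 = unlabeled (assumed scheme)."""
    labels = np.full(err.shape, -1, dtype=int)
    labels[err < low] = 0
    labels[err > high] = 1
    return labels
```

Calling `ransac_fundamental(p1, p2)` on two `(N, 2)` arrays of matched pixel coordinates returns the estimated matrix and an inlier mask; `weak_labels(sampson_error(F, p1, p2))` then yields sparse static/moving labels, leaving ambiguous points unlabeled.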

Andrea Tagliasacchi 🇨🇦 • 1 year ago

3) Finally, we train a tiny MLP that classifies SAMv2 features as moving or static given the weak supervisory signal from high and low error masks. These features help complete the motion masks over the video effectively!
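
Step 3 can be illustrated with a small NumPy classifier. This is a sketch under assumptions, not the authors' implementation: per-pixel feature vectors stand in for SAMv2 features, labels use a hypothetical scheme (0 = static, 1 = moving, -1 = unlabeled), and the architecture (one hidden ReLU layer), loss (binary cross-entropy), and hyperparameters are all chosen for illustration. The point it shows is that the network is trained only on the sparse weak labels and then applied to every feature vector, which is how the features can fill in the rest of the mask.

```python
import numpy as np

def _sigmoid(z):
    # Clip logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def train_tiny_mlp(feats, labels, hidden=16, lr=0.1, steps=300, seed=0):
    """Fit a one-hidden-layer MLP on weakly labeled features.

    feats:  (N, D) feature vectors (stand-ins for SAMv2 features).
    labels: (N,) ints; 0 = static, 1 = moving, -1 = unlabeled (ignored).
    All hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    keep = labels >= 0
    X, y = feats[keep], labels[keep].astype(float)
    d = X.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    b2 = 0.0
    for _ in range(steps):
        h = np.maximum(X @ W1 + b1, 0.0)   # hidden ReLU activations
        p = _sigmoid(h @ W2 + b2)          # predicted P(moving)
        g = (p - y) / len(y)               # gradient of mean BCE w.r.t. logits
        gh = np.outer(g, W2) * (h > 0)     # backprop through the ReLU
        W2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

def predict_moving(params, feats):
    """Return P(moving) for every feature vector, labeled or not."""
    W1, b1, W2, b2 = params
    h = np.maximum(feats @ W1 + b1, 0.0)
    return _sigmoid(h @ W2 + b2)
```

Thresholding `predict_moving(params, feats)` at 0.5 and reshaping to the frame size would give a dense motion mask; the threshold is again an assumption.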

Andrea Tagliasacchi 🇨🇦 • 1 year ago

and just like that... we get good quality masks, without human annotation or synthetic supervision! Find more results on our website →

Andrea Tagliasacchi 🇨🇦 • 1 year ago

This work was led by @lily_goli and @sabour_sara. In collaboration with Mark Matthews, @marcusabrubaker, Dmitry Lagun, @fleet_dj and @srbhsxn at Google DeepMind, and @_AlecJacobson at the University of Toronto.

Similar videos

๐Ÿ“ข๐Ÿ“ข ๐๐ž๐ซ๐œ๐‡๐ž๐š๐: ๐๐ž๐ซ๐œ๐ž๐ฉ๐ญ๐ฎ๐š๐ฅ ๐‡๐ž๐š๐ ๐Œ๐จ๐๐ž๐ฅ ๐Ÿ๐จ๐ซ ๐’๐ข๐ง๐ ๐ฅ๐ž-๐ˆ๐ฆ๐š๐ ๐ž ๐Ÿ‘๐ƒ ๐‡๐ž๐š๐ ๐‘๐ž๐œ๐จ๐ง๐ฌ๐ญ๐ซ๐ฎ๐œ๐ญ๐ข๐จ๐ง & ๐„๐๐ข๐ญ๐ข๐ง๐ ๐Ÿ“ข๐Ÿ“ข PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. ๐ŸŒ ๐Ÿ“ฝ๏ธ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,808 views • 7 months ago