🔥 Introducing MVLift: Generate realistic 3D motion without any 3D training data - just using 2D poses from monocular videos! Applicable to human motion, human-object interaction & animal motion. Joint work w/ Jiajun Wu & Karen 💡 How? We reformulate 3D motion estimation as generating consistent multi-view 2D pose...

15,788 views • 1 year ago • via X (Twitter)

3 comments

Shashwat • 1 year ago

@jiajunwu_cs How would it differentiate between a baby crawling and a dog crawling? Very creative though 👍

Jiaman Li • 1 year ago

@jiajunwu_cs It requires training on specific types of 2D keypoints.

Digital Currency • 2 years ago

From 3D modeling to VR/AR development, our MSc in Metaverse program equips you with the technical skills to excel in the rapidly evolving digital world. Don't miss out—enroll today! #UNIC #MScMetaverse

Related videos

Alibaba presents MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability of modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle for pose generality and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip to three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized as control signals of synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate effectiveness and robustness of the proposed method.

AK

148,853 views • 1 year ago
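
The MIMO description above mentions lifting 2D frame pixels into 3D with a monocular depth estimator and then splitting the clip into human, scene, and occlusion layers by depth. The following is only a rough sketch of that general idea, not MIMO's actual implementation: the pinhole camera model, the intrinsics `fx, fy, cx, cy`, the function names `backproject` and `split_layers`, and the "closer than the nearest human pixel" rule are all assumptions made for illustration.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift each pixel (u, v) with depth z to a camera-space 3D point.

    Assumes a pinhole camera; the intrinsics are hypothetical inputs.
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (h, w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

def split_layers(depth, human_mask):
    """Label pixels: 0 = scene, 1 = human, 2 = occlusion.

    Assumed rule: a non-human pixel counts as occlusion when it is
    closer to the camera than the nearest human pixel.
    """
    labels = np.zeros(depth.shape, dtype=np.uint8)
    if not human_mask.any():
        return labels  # no human in the frame: everything is scene
    labels[human_mask] = 1
    nearest_human = depth[human_mask].min()
    labels[(~human_mask) & (depth < nearest_human)] = 2
    return labels

if __name__ == "__main__":
    depth = np.array([[1.0, 3.0], [3.0, 5.0]])
    mask = np.array([[False, True], [True, False]])
    print(backproject(depth, 1.0, 1.0, 0.0, 0.0).shape)
    print(split_layers(depth, mask))
```

In practice the depth map would come from a monocular depth network and the mask from a human segmentation model; both are stand-in arrays here.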