Загрузка видео...

Не удалось загрузить видео

На главную

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source....

22,004 просмотров • 5 месяцев назад •via X (Twitter)

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 просмотров • 9 месяцев назад

You can't 3D reconstruct glass from images... ...WRONG! Thanks for video diffusion, now just about anything is possible! Introducing...Diffusion Knows Transparency (DKT) Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects, 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training the next generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI

Jonathan Stephens

17,712 просмотров • 5 месяцев назад