Why panorama? Standard video models struggle with object permanence—if a camera pans away and comes back, objects may disappear. With panoramas, the model is forced to generate everything in the scene. This serves as a "working memory" for consistent world generation. (3/N)
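One way to picture the "working memory" idea is as a lookup table from viewing direction to panorama pixel. The sketch below is only an illustration, not the method from the post: it assumes an equirectangular panorama layout, and the function name `direction_to_equirect_pixel` and the image size are hypothetical. Every viewing direction gets a fixed location in the panorama, so panning away and back reads the same content again.

```python
import math

def direction_to_equirect_pixel(yaw_rad, pitch_rad, width, height):
    """Map a viewing direction to (column, row) in an equirectangular panorama.

    Assumed convention (hypothetical): yaw 0 is the panorama centre,
    positive pitch looks up, and the image wraps around horizontally.
    """
    u = (yaw_rad / (2.0 * math.pi) + 0.5) % 1.0  # horizontal fraction, wraps around
    v = 0.5 - pitch_rad / math.pi                # vertical fraction, 0 is the top row
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return col, row

# Look at a direction, pan away, then pan back to the same direction.
pano_w, pano_h = 2048, 1024
first_look = direction_to_equirect_pixel(0.3, 0.0, pano_w, pano_h)
looked_away = direction_to_equirect_pixel(2.5, 0.0, pano_w, pano_h)
second_look = direction_to_equirect_pixel(0.3, 0.0, pano_w, pano_h)

# The same direction always maps to the same stored pixel.
print(first_look == second_look, first_look != looked_away)
```

A perspective video frame has no such fixed slot for off-screen content, which is one way to read the object-permanence problem described above.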

Ziyi Wu

1,239 subscribers

22,019 views • 4 months ago • via X (Twitter)

Related videos

V3D: Video Diffusion Models are Effective 3D Generators. Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

Synthesizing worlds with video diffusion models is often inconsistent — moving the camera back and forth leads to different scenes. We propose 🌐𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺, a memory-based approach that ensures consistent world simulation without relying on explicit 3D reconstruction.

Xingang Pan

19,413 views • 1 year ago

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIAGTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

Runway

1,161,128 views • 3 months ago

Introducing Runway Aleph, a new way to edit, transform and generate video. Aleph is a state-of-the-art in-context video model, setting a new frontier for multi-task visual generation, with the ability to perform a wide range of edits on an input video such as adding, removing and transforming objects, getting new angles of a scene and modifying style and lighting, among many other tasks.

Runway

646,528 views • 11 months ago

VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

AK

83,550 views • 1 year ago

Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial, a video-only framework that tackles spatial tasks. With just video context, our model can:
🔍 Ground objects by planning geometry-consistent paths
📸 Follow camera-pose instructions for scene navigation
🌐 Generalize to long contexts & unseen outdoor scenes
A step toward video models as visual-spatial reasoners. Project: arXiv:

Xingang Pan

15,902 views • 6 months ago

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago

Bernini-R is now live on fal! One unified model that both generates and edits video.
Edit by instruction: swap objects, weather, backgrounds, camera angles or style while the scene stays intact.
Guide it with up to 5 reference images for a consistent look.

fal

13,998 views • 15 days ago

Tencent presents GameGen-O: Open-world Video Game Generation. We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassing extensive data from over a hundred next-generation open-world games and employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on OGameData via text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tune using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

366,948 views • 1 year ago

Most AI video models struggle with weight and momentum. PixVerse-R1 integrates physical laws into the generation process, ensuring that objects fall, collide, and interact realistically. It’s a physics engine disguised as a creative tool. Here's how it works:

Farhan

26,833 views • 5 months ago

Consistent character with Hunyuan and Skyreel! 🎥✨ In collaboration with Kartel.ai, we worked on a pipeline using ComfyUI with the Hunyuan & Skyreel models, both designed for video. We first captured the actor volumetrically, then trained a LoRA model directly on video data, allowing us to generate sequences where a person remains consistent across shots. The goal is to maintain identity and style over time, pushing the limits of AI-generated video. Excited to explore this further! 🔥 With J. Salvatore as the main actor, thank you! #AI #GenAI #ComfyUI #LoRA #VideoGeneration #Hunyuan #Skyreel

Lovis Odin

13,348 views • 1 year ago

Restyling games is much harder than generating a video. Games can’t unintentionally “reset” at every frame:
• Your character should stay the same when the camera moves
• Damage, lighting, and objects must persist
• A room should look the same when you look away and look back
And it all has to run in real-time. Reverie v0.1.0 is our first stab at a real-time generative model for playable worlds. Below is a comparison of ours (left) with MirageLSD v2 (right). It’s early and far from perfect, but watch what stays consistent as the video continues. 🧵 Prompts below (GPT-5.2).

Sharon Lee

30,737 views • 6 months ago

You might want to take more panoramas with your iPhone. No matter whether you own a visionOS device today or not, there's a good chance that one day you'll want to turn those panoramas into immersive 3D environments you can actually work in. To be honest, not every panorama in my library was as impressive as this one. To get the most out of the feature, you'll want a really wide, high-resolution panorama with a compelling scene. But when you have the right shot, it genuinely looks and feels like you're there.

Phil Traut ᯅ

28,961 views • 15 days ago

✨ CVPR 2025 highlight: A Distractor-Aware Memory for Visual Object Tracking with SAM2. The authors propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly address segmentation accuracy and tracking robustness 🏡 (1/n)🧵👇

GeekyRakshit (e/mad)

32,669 views • 1 year ago

Nairobi becomes the convening point for one of the most consequential conversations for Africa in this generation. More than 30 Heads of State, coming together to discuss 7 agenda themes with 1 vision - Taking #AfricaFoward. We have spent months working to ensure that when the world comes to Nairobi on 11 and 12 May, Africa arrives as a continent with a clear vision of its priorities, and the partners committed to advancing them alongside us. This is a Summit of substance, one with the potential to define a new chapter in Africa’s global engagement.

Musalia W Mudavadi

31,751 views • 1 month ago

Most generative models predict pixels. Predicting a 3D scene instead has many benefits: the scene won’t change if you look away and come back, and it obeys the basic physical rules of 3D geometry. The simplest way to visualize the 3D scene is a depth map, where each pixel is colored by its distance to the camera. 4/n
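The depth-map idea in this post can be sketched in a few lines. This is a minimal illustration only, not World Labs' pipeline: the function name `depth_to_grayscale`, the near-is-bright convention, and the toy 2x2 depth values are all assumptions. Each pixel's distance to the camera is rescaled to a grayscale intensity.

```python
def depth_to_grayscale(depth, near=None, far=None):
    """Convert a 2D grid of camera distances into 0-255 grayscale values.

    Assumed convention (hypothetical): the nearest pixel is brightest (255)
    and the farthest is darkest (0).
    """
    flat = [d for row in depth for d in row]
    near = min(flat) if near is None else near
    far = max(flat) if far is None else far
    span = (far - near) or 1.0  # avoid division by zero on a flat depth map
    return [
        [round(255 * (1.0 - (min(max(d, near), far) - near) / span)) for d in row]
        for row in depth
    ]

# Toy depth map in metres: top-left pixel is closest, bottom-right is farthest.
toy_depth = [[1.0, 2.0],
             [3.0, 5.0]]
for row in depth_to_grayscale(toy_depth):
    print(row)
```

Real visualizations often use a colour map rather than grayscale, but the per-pixel normalization step is the same.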

World Labs

16,605 views • 1 year ago

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

Jiafei Duan

12,072 views • 2 months ago

New open-source 3D world-generation model. I'm rendering a couple of worlds in the video, so check it out. You'll find the GitHub and the Hugging Face links to the model below. This is a multi-modal world model that you can use for a bunch of things:
• To generate new worlds
• To reconstruct worlds
• To simulate 3D interactive worlds from a prompt, images, or a video
You can edit the 3D outputs in Unity and Unreal Engine (they export as meshes, 3DGS files, and point clouds). You can also generate 3D characters in the world and walk around. Pretty fun stuff!

Santiago

65,446 views • 2 months ago