SceneScape: Text-Driven Consistent Scene Generation abs: project page: text-driven perpetual view generation -- synthesizing long videos of arbitrary scenes solely from an input text describing the scene and camera poses

AK

456,558 subscribers

73,258 views • 3 years ago • via X (Twitter)

7 Comments

Nilu Kulasingham • 3 years ago

holy *** this is going to be huge for video games

The AI Race 🏁 • 3 years ago

this will be massive for video games but also for future viral Seinfeld simulations and entertainment writ large

Umar Farooq • 3 years ago

Subway surfer with infinite possibilities, this is going to be a game changer if we can add it to Unity3D and create meshes at runtime with infinite possibilities. Maybe change the gameplay type as per the mood of the user or his geographical presence.

William Lamkin • 3 years ago

Very cool 😎

Olivier Lattrez • 3 years ago

@memdotai mem it

Mem • 3 years ago

@_akhaliq Saved! Here's the compiled thread: 🪄 AI-generated summary: "A new system called SceneScape can generate long, consistent videos of arbitrary scenes from an input text description and camera poses."

fakery • 3 years ago

Y'all remember that one screen saver?

Related Videos

DreamBooth3D: Subject-Driven Text-to-3D Generation Personalized 3D models from just a few casual photos, with text-driven modifications abs: project page:

AK

107,503 views • 3 years ago

WorldExplorer: Towards Generating Fully Navigable 3D Scenes Contributions: • We introduce the first method for generating 3D scenes from text that supports high-quality view synthesis while enabling exploration across a wide range of camera poses. • We propose an iterative scene expansion strategy using video diffusion models, driven by trajectory sampling and adaptive collision detection. • We design a scene memory mechanism that conditions each video generation step on relevant past frames, improving view consistency and overall scene coherence.

MrNeRF

23,814 views • 1 year ago

Google announces InseRF Text-Driven Generative Object Insertion in Neural 3D Scenes paper page: InseRF generates an object in a 3D scene via a text prompt and one 2D bounding box

AK

205,987 views • 2 years ago

Can we use video diffusion to generate 3D scenes? 𝐖𝐨𝐫𝐥𝐝𝐄𝐱𝐩𝐥𝐨𝐫𝐞𝐫 (#SIGGRAPHAsia25) creates fully-navigable scenes via autoregressive video generation. Text input -> 3DGS scene output & interactive rendering! 🌍 📽️

Matthias Niessner

30,883 views • 10 months ago

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

AK

102,444 views • 2 years ago

Collaborative Video Diffusion Consistent Multi-video Generation with Camera Control Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation

AK

29,278 views • 2 years ago

Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text Outperforms previous SotA motion synthesis methods across the board proj: abs:

Aran Komatsuzaki

38,401 views • 2 years ago

📢📢📢 Excited to share our new work *Autonomous Character-Scene Interaction Synthesis from Text Instruction* (Siggraph Asia 24). It presents a unified model for flexible scene-conditioned motion generation given text, scene, trajectory conditions. The results with smooth interaction look very impressive! 📰Paper: Project: Code and data will be released soon.

Siyuan Huang

11,340 views • 1 year ago

Ovi is out on Hugging Face Twin Backbone Cross-Modal Fusion for Audio-Video Generation Ovi is a veo-3 like, video+audio generation model that simultaneously generates both video and audio content from text or text+image inputs. Video+Audio Generation: Generate synchronized video and audio content simultaneously Flexible Input: Supports text-only or text+image conditioning 5-second Videos: Generates 5-second videos at 24 FPS, area of 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc)

AK

23,082 views • 9 months ago

I'm excited to share our new work Align3R that estimates camera poses and consistent depth maps from a monocular video of a dynamic scene. Project page: Code: Paper:

Yuan Liu

56,547 views • 1 year ago

yet ANOTHER release from ByteDance just landed on the hub ✨HuMo✨ > video generation w/ multi-modal conditioning: audio, text & image > supports consistent subject preservation, synchronized audio-driven motion > based on Wan 2.1 & Whisper Large v3

Linoy Tsaban🎗️

17,573 views • 10 months ago

Designing an Encoder for Fast Personalization of Text-to-Image Models TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps abs: project page:

AK

165,165 views • 3 years ago

Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation paper page: Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts.

AK

126,585 views • 2 years ago

(1/2) LightIt: Illumination Modeling and Control for Diffusion Models! #CVPR2024 We facilitate lighting control for novel image generation from text prompts. We can also edit lighting for a given input image. Video: Project:

Matthias Niessner

19,849 views • 2 years ago

ByteDance just unveiled HuMo, a unified framework for human-centric video generation. With HuMo you can: • Generate videos from text + image • Create audio-synced videos from text + audio • Combine text, image, and audio for maximum control • Preserve consistent subjects across frames • Achieve natural audio-visual synchronization This isn’t just another release, it’s a big step toward fully controllable, coherent, and cinematic AI video.

DStudioproject

29,408 views • 10 months ago

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048 abs: project page:

AK

718,760 views • 3 years ago

Wan2.5: Let Sound Take the Director’s Chair! 🎬 Today, we’re excited to unveil another major feature in our powerful Wan 2.5 Preview: Native Audio-Driven Video Generation. ✨ Now you can use audio input directly for both text-to-video and image-to-video generation. Combine audio with text prompts or a reference image to shape your video's narrative. ✨ With support for videos up to 10 seconds and enhanced video quality, unlock a richer visual space where more engaging stories come to life.

Wan

52,119 views • 9 months ago

Dreamix: Video Diffusion Models are General Video Editors abs: project page: present diffusion-based method that is able to perform text-based motion and appearance editing of general videos

AK

398,166 views • 3 years ago

Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!

Zhengqi Li

57,010 views • 1 year ago

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago