DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation

DiffSplat is a generative framework that synthesizes 3D Gaussian splats from text prompts and single-view images in ⚡️ 1–2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

AK

411,694 subscribers

38,416 views • 1 year ago • via X (Twitter)

Education · Science & Technology
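For readers new to the representation DiffSplat outputs: a 3D Gaussian splat is a set of anisotropic Gaussians, each with a mean, a covariance built from a rotation and per-axis scales (Σ = R S Sᵀ Rᵀ in the original 3D Gaussian Splatting formulation), an opacity, and a color. The minimal sketch below assumes that standard parameterization; it is not DiffSplat's code, and the names `Gaussian` and `quat_to_rotation` are illustrative. It builds the covariance from a unit quaternion and scales and evaluates the unnormalized density at a point.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]
Mat3 = list[list[float]]


def quat_to_rotation(w: float, x: float, y: float, z: float) -> Mat3:
    """Convert a (w, x, y, z) quaternion to a 3x3 rotation matrix (normalized first)."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


@dataclass
class Gaussian:
    mean: Vec3
    scale: Vec3  # per-axis standard deviations in the Gaussian's local frame
    rotation: tuple[float, float, float, float]  # quaternion (w, x, y, z)
    opacity: float  # used when alpha-compositing splats; not needed below
    color: Vec3  # plain RGB here; 3DGS implementations often store SH coefficients

    def covariance(self) -> Mat3:
        """Sigma = R S S^T R^T, computed as M M^T with M = R S."""
        r = quat_to_rotation(*self.rotation)
        m = [[r[i][j] * self.scale[j] for j in range(3)] for i in range(3)]
        return [[sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

    def density(self, p: Vec3) -> float:
        """Unnormalized density exp(-0.5 * d^T Sigma^-1 d) at point p.

        Rotating d into the local frame (R^T d) makes Sigma^-1 diagonal (1 / scale^2),
        so no explicit matrix inverse is needed.
        """
        r = quat_to_rotation(*self.rotation)
        d = [p[i] - self.mean[i] for i in range(3)]
        local = [sum(r[k][i] * d[k] for k in range(3)) for i in range(3)]
        mahalanobis_sq = sum((local[i] / self.scale[i]) ** 2 for i in range(3))
        return math.exp(-0.5 * mahalanobis_sq)
```

Factoring the covariance into a rotation and scales keeps it symmetric positive semi-definite however the parameters are updated, which is why this parameterization is the usual one for optimized or generated splats.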

3 comments

AK • 1 year ago

discuss:

Panwang Pan • 1 year ago

Thanks @_akhaliq for featuring our work! 🥰 Project: Code: The code and pre-trained checkpoints are officially released! 🎉🎉🎉 We are working on the Hugging Face demo. Stay tuned.

Related videos

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago
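The first LLM-grounded Diffusion stage described above has the LLM emit a scene layout as bounding boxes paired with per-object descriptions. Here is a minimal sketch of turning such a response into structured data before handing it to a layout-conditioned generator. It assumes a plain-text template of the form `Objects: [('a cat', [x, y, w, h]), ...]` followed by a `Background prompt:` line on a 512×512 canvas; the authors' exact template may differ, and `parse_layout` and `LayoutBox` are hypothetical names.

```python
import ast
from dataclasses import dataclass


@dataclass
class LayoutBox:
    description: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width, in pixels
    h: int  # height, in pixels


def parse_layout(text: str, canvas: int = 512) -> tuple[list[LayoutBox], str]:
    """Parse an assumed 'Objects: [...]' / 'Background prompt: ...' LLM response.

    The template and the square canvas size are illustrative assumptions.
    """
    boxes: list[LayoutBox] = []
    background = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Objects:"):
            # ast.literal_eval only accepts Python literals, so model output is never executed.
            entries = ast.literal_eval(line[len("Objects:"):].strip())
            for description, (x, y, w, h) in entries:
                box = LayoutBox(str(description), int(x), int(y), int(w), int(h))
                inside = (0 <= box.x and 0 <= box.y and box.w > 0 and box.h > 0
                          and box.x + box.w <= canvas and box.y + box.h <= canvas)
                if not inside:
                    raise ValueError(f"box out of bounds: {box}")
                boxes.append(box)
        elif line.startswith("Background prompt:"):
            background = line[len("Background prompt:"):].strip()
    return boxes, background
```

Validating boxes at parse time keeps malformed LLM output from reaching the second, diffusion-based stage, where it would fail less visibly.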

(1/2) LightIt: Illumination Modeling and Control for Diffusion Models! #CVPR2024 We facilitate lighting control for novel image generation from text prompts. We can also edit lighting for a given input image. Video: Project:

Matthias Niessner

19,849 views • 2 years ago

MVDream: Multi-view Diffusion for 3D Generation paper page: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned in a few-shot setting for personalized 3D generation, i.e., a DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago
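Score Distillation Sampling (SDS), which the MVDream post uses to turn its multi-view model into a 3D prior, was introduced with DreamFusion. Its gradient with respect to the 3D parameters θ is usually written as below, where x = g(θ) is a rendered view, z_t = α_t x + σ_t ε is that view noised at timestep t, ε is the injected noise, ε̂_φ is the frozen diffusion model's noise prediction conditioned on prompt y, and w(t) is a timestep weighting. This is the standard single-view formulation, shown for orientation; MVDream's multi-view variant conditions on several cameras and is not reproduced here.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad z_t = \alpha_t x + \sigma_t \epsilon
```

The U-Net Jacobian is deliberately omitted from this gradient, so the diffusion model is only evaluated, never backpropagated through; the 3D-consistency problem arises because each sampled view is scored independently, which is what a multi-view prior addresses.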

👀 Pixel perfect 💎✨ 🖼️ Edify Image from #NVIDIAResearch is a family of diffusion models that supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360° HDR panorama generation, and finetuning for image customization. 🧵 1/2

NVIDIA AI Developer

14,747 views • 1 year ago

3DTopia-XL: High-Quality 3D PBR Asset Generation via Primitive Diffusion demo: model: 3DTopia-XL scales high-quality 3D asset generation using a Diffusion Transformer (DiT) built upon an expressive and efficient 3D representation, PrimX. The denoising process takes 5 seconds to generate, from text or image input, a 3D PBR asset that is ready for use in a graphics pipeline.

AK

87,282 views • 1 year ago

Wild. Single-shot text-to-3D scene (Gaussian splat) with no editing. This is approaching a usable scene asset for a game or 3D application.

martin_casado

56,136 views • 11 months ago

Adobe is entering the image-to-3D game! Their new method, LRM, can create high-fidelity 3D meshes from a single image in just 5 seconds 🔥 It's trained on 1 million objects and is able to generate objects from real-world images and generative AI models.

Dreaming Tulpa 🥓👑

270,942 views • 2 years ago

Excited to share "MultiDiffusion"! A controlled image generation framework built on a pre-trained text-to-image diffusion model:
* Spatial guidance controls (bounding boxes/masks)
* Arbitrary aspect ratios (huge panoramas!)
NO training, NO finetuning. [1/3] Lior Yariv, Yaron Lipman, Tali Dekel

Omer Bar Tal

88,866 views • 3 years ago
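MultiDiffusion's route to arbitrary aspect ratios without retraining is to run the pretrained denoiser on overlapping windows of a larger latent and reconcile the windows at every step; with uniform weights, that reconciliation is a per-position average of the overlapping predictions. The 1-D sketch below shows the fusion logic only. `denoise_window` is a hypothetical stand-in for one step of the pretrained model, and the window and stride values in the usage checks are illustrative, not the paper's settings.

```python
from typing import Callable, Sequence


def window_starts(width: int, window: int, stride: int) -> list[int]:
    """Start offsets of overlapping windows that together cover [0, width)."""
    if not 0 < window <= width or not 0 < stride <= window:
        raise ValueError("need 0 < window <= width and 0 < stride <= window")
    starts = list(range(0, width - window + 1, stride))
    if starts[-1] != width - window:
        starts.append(width - window)  # make the last window touch the right edge
    return starts


def multidiffusion_step(latent: Sequence[float], window: int, stride: int,
                        denoise_window: Callable[[list[float]], list[float]]) -> list[float]:
    """One fused step: denoise every window, then average overlapping predictions per position."""
    total = [0.0] * len(latent)
    count = [0] * len(latent)
    for s in window_starts(len(latent), window, stride):
        prediction = denoise_window(list(latent[s:s + window]))
        for i, value in enumerate(prediction):
            total[s + i] += value
            count[s + i] += 1
    # With uniform weights, the least-squares reconciliation of the windows is the plain mean.
    return [t / c for t, c in zip(total, count)]
```

In a real panorama pipeline the same averaging runs over 2-D latent tiles at every denoising step, so neighbouring windows keep pulling each other toward a seamless result.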

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,576 views • 9 months ago

VAST AI releases "Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers" on Hugging Face. demo: TGS enables fast reconstruction from a single-view image. It builds the 3D representation on a hybrid Triplane-Gaussian representation using a transformer-based framework, from which the 3D Gaussians are decoded.

AK

113,226 views • 2 years ago

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery TL;DR: Skyfall-GS converts satellite images to explorable 3D urban scenes using diffusion models, with real-time rendering performance. Contributions: • We introduce Skyfall-GS, the first method to synthesize immersive, real-time, free-flight navigable 3D urban scenes solely from multi-view satellite imagery using generative refinement. • An open-domain refinement approach leverages pre-trained text-to-image diffusion models without domain-specific training. • A curriculum-learning-based iterative refinement strategy progressively enhances reconstruction quality from higher to lower viewpoints, significantly improving visual fidelity in occluded areas.

MrNeRF

66,111 views • 9 months ago
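Skyfall-GS's curriculum refines the scene progressively from higher to lower viewpoints. One way to express such a schedule is sketched below. It is purely hypothetical: the elevation range, the number of episodes, the linear decay, and the evenly spaced orbit are illustrative assumptions, not values or code from the paper, and the refinement step itself is left out.

```python
def elevation_schedule(num_episodes: int, start_deg: float = 85.0, end_deg: float = 45.0) -> list[float]:
    """Camera elevation (degrees above the ground plane) for each refinement episode.

    Decreases linearly from a near-overhead, satellite-like view toward lower views.
    The default range is an illustrative assumption.
    """
    if num_episodes < 1:
        raise ValueError("num_episodes must be >= 1")
    if num_episodes == 1:
        return [start_deg]
    step = (start_deg - end_deg) / (num_episodes - 1)
    return [start_deg - i * step for i in range(num_episodes)]


def camera_rig(elevations: list[float], views_per_episode: int = 4) -> list[list[tuple[float, float]]]:
    """Evenly spaced (elevation, azimuth) pairs orbiting the scene at each episode's elevation."""
    return [[(e, 360.0 * k / views_per_episode) for k in range(views_per_episode)]
            for e in elevations]
```

Ordering episodes from high to low means the views that see the most occluded geometry (low, street-level ones) are refined last, after the broadly visible structure has already been improved.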

Stable Diffusion 3.5 from Stability AI is now LIVE on Civitai! Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency. Generate with SD 3.5 Live on Civitai here👇🏽 All images used to create this video were generated with SD 3.5 by our community 🫶🏽

Civitai

31,353 views • 1 year ago

This is amazing! You can now create high-quality 3D Scenes from a single image using Multi-Instance Diffusion Models (MIDI) 🔥

Gradio

41,770 views • 1 year ago

Google presents VLOGGER Multimodal Diffusion for Embodied Avatar Synthesis We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of

AK

66,375 views • 2 years ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨

✨ New in 2.1:
🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image.
🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design.
🔹Rich Styles and High Aesthetic Quality: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality.
🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image.

HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with MeanFlow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators can turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We're just getting started. Stay tuned for our native multimodal image generation model coming soon.

🌐Website: 🔗GitHub: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 10 months ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,210 views • 2 years ago

Excited to release our first public AI web app, powered by Apple’s open-source ML-SHARP model, capable of transforming a single image into a 3D Gaussian Splat with real depth understanding in seconds. Try it here → #AI #Apple #VR #GaussianSplatting

Niccolò Miranda

201,290 views • 6 months ago