We introduce 🔥X-InstructBLIP🔥, a simple and effective scalable cross-modal framework to empower LLMs to handle tasks across modalities such as text, image, video, sound, and 3D. Web: ArXiv: Code:

Caiming Xiong

8,177 subscribers

37,476 views • 2 years ago • via X (Twitter)

Comments: 5

Caiming Xiong • 2 years ago

We extend InstructBLIP’s instruction-aware representations beyond images to 3D, audio, and video. Despite the lack of modality-specific pre-training, X-InstructBLIP achieves comparable performance to SoTA models on a variety of out-of-domain tasks and modalities.

Caiming Xiong • 2 years ago

Despite using distinct frozen pre-trained encoders for each modality and no joint modality training, X-InstructBLIP demonstrates emergent capabilities in cross-modal comprehension.

Caiming Xiong • 2 years ago

To evaluate its abilities we introduce a new Cross-modal Discriminative Reasoning benchmark (DisCRn): Given two distinct modality inputs, the model needs to select the entity that matches the property queried.

Caiming Xiong • 2 years ago

X-InstructBLIP outperforms a strong SoTA captioning baseline on the new DisCRn task by 6.3 and 3.2 points for image-3D and audio-video pairs respectively. Nevertheless, the task remains an open challenge.

Caiming Xiong • 2 years ago

Thanks to all awesome collaborators: @artemispng, @Le_Xue01, @realNingYu, @LiJunnan0409, @dongxuli_, @JotyShafiq, @stanleyran, @silviocinguetta and @jcniebles

Related videos

(1/4) Excited to share our #ICCV2023 paper Text2Room! We generate scene-scale textured 3D meshes from a given text prompt leveraging 2D text-to-image models such as StableDiffusion. Project: Code: Video:

Matthias Niessner

74,893 views • 2 years ago

🔥Text-to-3D Foundation Model🔥 We are excited to announce #3DTopia, a generalist 🧊text-to-3D🧊 foundation model, which produces high-quality 3D assets within 5 minutes - Code: - Video:

Ziwei Liu

62,424 views • 2 years ago

We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: Google’s largest and most capable AI model. Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵

Google DeepMind

1,315,239 views • 2 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness. Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

🕹️We are excited to introduce "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation" ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that denoises with “video reasoning tokens” to "reason" on physically plausible edits. See the attached video for results. Project Page: Arxiv: Code and model are coming.

Huan Ling

36,841 views • 8 months ago

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

AK

18,655 views • 1 year ago

(1/10) 🔥Thrilled to introduce OneDiffusion—our latest work in unified diffusion modeling! 🚀 This model bridges the gap between image synthesis and understanding, excelling in a wide range of tasks: T2I, conditional generation, image understanding, identity preservation, multiview generation, and even camera pose estimation. Learn more at: Project: arXiv: Code (on the way):

Jiasen Lu

33,383 views • 1 year ago

🔥🔥We propose #VideoBooth to enable customized video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. - Project: - Code: - Video:

Ziwei Liu

26,329 views • 2 years ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation. DiffSplat is a generative framework to synthesize 3D Gaussian Splats from text prompts & single-view images in ⚡️ 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

AK

38,416 views • 1 year ago

Multi-modal #LLMs understand a lot about humans. But do they understand our 3D pose? We train #PoseGPT to estimate, generate, and reason about 3D human pose (#SMPL) in images and text. This is the first true foundation model for understanding 3D humans.

Michael Black

81,365 views • 2 years ago

🔥Time to Upgrade Your Classifier-Free Guidance🔥 🌠CFG-Zero*🌠 offers consistently better visual quality and text alignment on text-to-image/video - Project: - Code: - Demo Gradio: . Thanks AK!

Ziwei Liu

43,577 views • 1 year ago

We’re honored to present the story behind the code of Nuxt, a web framework to build performant and production-grade apps and websites with Vue 🔥

Supabase

78,643 views • 1 year ago

😈 Today, we introduce WebGym, the largest-to-date open-source RL environment for web agent training that contains 300k tasks and a rollout framework optimized specifically for web environments' rollout speed. We reveal the effects of essential scaling directions we observe with WebGym. 1/n

Jack Bai

45,075 views • 5 months ago

Collaborative Score Distillation for Consistent Visual Synthesis paper page: Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.

AK

33,500 views • 2 years ago

🤯 OneDiffusion: A versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. ✅ Text to Image ✅ Image to Depth ✅ Image to Segmentation ✅ Image to Pose ✅ FaceID ✅ Image to Multiview How to use & more👇

Gradio

11,820 views • 1 year ago

Phoenix GenAI’s new generative AI model Image-to-3D is now online. Create high fidelity 3D models importable to Unreal Engine 5 or Unity simply by sending an original image into PhoenixLLM. Better yet, generate the source image using Phoenix GenAI’s Flux and feed it into Image-to-3D. This release marks yet another upgrade of GenAI’s arsenal of capabilities, getting it ready for multi-workflow GenAI agents, in which users will be able to combine text-to-image, image-to-prompt, text-to-video, text-to-3D, and image-to-3D into complex multi-step workflows with simple commands via PhoenixLLM. Image-to-3D is yet another addition to Phoenix’s Vertical AI Solutions for gaming, content, and metaverse. Users are able to use it as a Phoenix-native alternative to SkyNet AI Marketplace’s Tripo Integration earlier this year. #Phoenix $PHB

Phoenix AI

30,853 views • 1 year ago

The Multi-Shot App makes it easy to go from a simple prompt to a thoughtfully crafted scene. All with dialogue, sound effects and cinematic framing. Start from an image or go purely Text to Video. Available now in the App drawer on the web app.

Runway

14,229 views • 2 months ago

🖼️🎞️🔊📄Excited to introduce Composable Diffusion (CoDi), a new generative-AI foundation model that can take any combo of input modalities & generate any combo of output modalities (text, audio, image, video)! Ziyi Yang Chenguang Zhu Mohit Bansal 🧵👇 #CoDi

Zineng Tang

105,269 views • 3 years ago