Do Vision-Language Models represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples!

Introducing the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to assess the spatial reasoning capabilities of VLMs. COMFORT includes systematically designed datasets and metrics that evaluate both model performance and deeper linguistic competence, specifically the spatial knowledge encoded in the models' internal representations. Find out more in the video teaser!

Almost all VLMs prefer the egocentric relative FoR with reflected transform, similar to English. Yet, we reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages.

A shortened version will appear at the Pluralistic Alignment Workshop #NeurIPS2024. It seems that the arXiv moderators put it on hold and are eager to give it a thorough read first 🤣! So here is the Paper/Code/Data:

This collaboration turned out to be amazing, jointly led by Brian Zheyuan Zhang, @Hu_FY_, and Jayjun Lee, with so many contributions and insights from Freda Shi, Parisa Kordjamshidi, and Michigan SLED Lab.

With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning!

Martin Ziqiao Ma

4,520 subscribers

35,565 views • 1 year ago • via X (Twitter)

Education, Health & Wellness, Science & Technology

1 comment

Martin Ziqiao Ma • 1 year ago

Interesting fact: Reasoning across multiple intrinsic frames of reference is quite challenging, even for GPT-4. I adapted Figure 2.5 from Levinson 2003 (Logical Inadequacies of the Intrinsic Frame of Reference) into a yes/no question format, and GPT-4 struggled with both.

Related videos

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

🔥In Magma, we talked a lot about spatial/temporal intelligence beyond verbal intelligence, as advocated by Dr. Fei-Fei Li. So how to interpret it? Today I am happy to announce a new demo, Magma-Gaming: 👉 Rather than asking LLMs to write game code, we further ask the model to PLAY the game. A simple game like moving to the target in a 2D grid still requires precise action grounding and planning capability, yet it challenges the most advanced VLMs and even the GPT-4o-mini model. Magma, born with stronger spatial understanding and reasoning ability, significantly outperforms its counterparts and achieves much higher scores in a zero-shot manner. This result pinpoints the huge potential of building multimodal agentic models endowed with both verbal and spatial intelligence! Also, I believe this simple demo gives you a better hint for understanding what spatial intelligence is, and why it is important!

Jianwei Yang

17,940 views • 1 year ago

Vision-Language Models (VLMs) can describe the environment, but can they refer within it? Our findings reveal a critical gap: VLMs fall short of pragmatic optimality. We identify 3 key failures of pragmatic competence in referring expression generation with VLMs: they (1) cannot uniquely refer to the referent, (2) include excessive or irrelevant information, and (3) misalign with human pragmatic preferences. We introduce RefOI, a new dataset of 1.5k objects, each with 3 written and 2 spoken human-produced referring expressions. We also release RefOI-TLHF, a large dataset of token-level human feedback for 10.6k referring expressions. 👀 📄 Excited to co-lead this project with the amazing JaneDing, and huge thanks to the dream team Xuejun Zhang Dezhi Luo Diiikee @6SihanXu Yuchen Huang Roihn Run Peng 彭润 Michigan SLED Lab.

Martin Ziqiao Ma

20,636 views • 1 year ago

Demis Hassabis says world models are his longest-standing passion and explains their benefits vs. language models: ▫️ “I think language models are able to understand a lot about the world. More than we expected because language is actually probably richer than we thought. But there's still a lot about the spatial dynamics of the world, spatial awareness and the physical context we're in, and how that works mechanically, that is hard to describe in words and isn't generally described in corpuses of words. A lot of this is allied to learning from experience. There's a lot of things which you can't really describe something. You have to just experience it. Maybe the senses and so on are very hard to put into words. Whether that's motor angles and smell and these kinds of senses, it's very difficult to describe that in any kind of language.”▫️ This is what Demis and Google DeepMind are trying to solve with Genie. He also says that the video models (Veo) will play a part in training the world models and that this is all key for AI robotics.

Bearly AI

123,703 views • 7 months ago

Today we’re releasing OpenEQA, the Open-Vocabulary Embodied Question Answering Benchmark. It measures an AI agent’s understanding of physical environments by probing it with open-vocabulary questions like “Where did I leave my badge?” More details ➡️ All of today’s state-of-the-art vision+language models (VLMs) fall well short of human performance. In fact, for questions that require spatial understanding, today’s VLMs are nearly “blind”: access to visual content provides only minor improvements over language-only models. We hope that OpenEQA motivates additional research into helping AI understand and communicate about the world it sees.

AI at Meta

407,089 views • 2 years ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text: even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations (zoom, select-frame, highlight) to reason through complex visual inputs. To do this, we design a two-stage training process: (1) instruction tuning with synthesized visual reasoning traces, and (2) reinforcement learning with a curiosity-driven reward to balance exploration between pixel and text reasoning. ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA, 🧠 84% on V* benchmark, and 🧩 74% on TallyQA-Complex. It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision-language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers

23,905 views • 4 months ago

LP-MusicCaps: LLM-Based Pseudo Music Captioning paper page: Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systemic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.

AK

78,794 views • 3 years ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

"Yes, it’s too soon to tell, but it is interesting that large language models, as opposed to social media, have tended to be much more aligned with what we would consider objective reality and truth. That is, I think it’s not so easy to get a large language model to endorse a ridiculous conspiracy theory, such as that the 2020 U.S. election was stolen or that COVID vaccines are actually microchips that Bill Gates is implanting to surveil us. All of those ideas are very easy to find on social media. And it may indeed be, as you say, that large language models tend to gravitate toward common priors—namely, all the knowledge that’s out there on the web. But, in particular, during pre-training and post-training—the fine-tuning process—they are engineered to be aligned with truth and reality. And that is because the AI companies have a different business model than the social media companies." From an episode of Scaling Theory with Thibault Schrepel: Steven Pinker on Common Knowledge, From Eye Contact to the Super Bowl

Steven Pinker

57,812 views • 23 days ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

Dr. Fei-Fei Li just called out the biggest blind spot in the entire AI industry. We have been building half of human intelligence. And calling it the finish line. Li: “If you look at human intelligence, it pretty much boils down to two buckets.” The first bucket is language. Symbolic reasoning. Communication. The ability to think in words and abstractions. That’s what every major AI lab has spent the last decade building. The second bucket is the one the industry has almost entirely ignored. Li: “We call that in AI spatial intelligence.” How humans and animals perceive, navigate, and interact with the three-dimensional physical world. How we reach for objects. How we move through space. How we build and manipulate physical reality. From painting masterpieces to constructing the pyramids, non-verbal spatial intelligence is what actually shapes the world. Language describes reality. Spatial intelligence acts on it. And the gap between those two things is the gap between a chatbot and a robot. Li: “When this technology is ready, the robotic revolution is gonna start. We’re already seeing that trend.” Every robot is a moving agent. Every moving agent requires spatial intelligence to function in the real world. The humanoid robots being deployed in factories right now are hitting the ceiling of what language models alone can power. Spatial intelligence is the unlock. But Li didn’t stop at robotics. Li: “From a geopolitics point of view, this is part of the technology that goes straight into weapons.” Autonomous drone swarms. Battlefield navigation. Physical target acquisition without human oversight. Every military application of AI that operates in the real world runs on spatial intelligence. The nation that masters the transition from static text to dynamic three-dimensional perception doesn’t just win the software race. It commands the physical battlefield. The AI arms race just broke out of the data center. It’s operating in three dimensions now.

Dustin

122,680 views • 4 months ago

“We express our gratitude for this crystal clear position by the 145 countries addressed to the host country, to respect its obligation under the Headquarters Agreement… the host country does not have the right or the privilege to abuse its authority and deny visas for our leaders to participate. We sincerely hope that this abuse is reversed as quickly as possible. We have the right to be with all of you, for our leaders, to be with all of you, and to share our thoughts and ideas peacefully, diplomatically, legally, and in a civilized way. Denying us, denying our leaders, the opportunity to do so is an abuse of the authority, and it is a punishment for the State of Palestine that should not take place. Our leaders will participate in the capacity that is possible to us, but we will not yield an inch for our right to be treated like the rest of you and to receive our visas for our leaders in any meetings that we are invited to participate in, and we will continue doing that.”

State of Palestine

81,051 views • 10 months ago

GS^3: Efficient Relighting with Triple Gaussian Splatting Abstract: We present a spatial and angular Gaussian-based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images. To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian. To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron. To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple. The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage. We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU. Our results compare favorably with state-of-the-art techniques in terms of quality/performance.

MrNeRF

17,786 views • 1 year ago

Google presents Still-Moving: Customized Video Generation without Customized Video Data. Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

Say hello to the reason we’ve been eating so much pizza lately 🍕🍕🍕 “Spatial Pizza” is an interactive Vision Pro experience by Brian (Brian McIntosh) and me. This project has been “cooking” for quite some time, based on a concept I came up with back in July. So here’s the pitch: Imagine you’re in a distant future, you’re hungry, and once again, you’re too lazy to cook. Instead of opening a website and ordering pizza the old-fashioned way with boring drop-down menus and checkboxes, you get to playfully create your pizza right on your living room table. Once you’re done, simply hit “Order,” and your creation will be delivered straight to your doorstep. Well, this isn’t that app—at least not yet—but it’s a lot of fun to create your own wild and controversial pizza creations (yes, we added pineapple) and watch raw realistic 3D ingredients transform into their baked versions. That’s why with "Spatial Pizza" we’re excited to introduce our first demo for the future of spatial food computing. Out now on Apple Vision Pro.

Phil Traut ᯅ

14,073 views • 1 year ago

Introducing FoundationMotion. A large-scale, video-derived motion annotation dataset & auto-labeling pipeline + advanced models for motion understanding. Fully open-source: code, datasets, and models, free to use and build on. Understanding motion is core to physical reasoning, yet today’s leading models still struggle with simple spatial actions like “turn right” or “move up” or “flip the toast” - mainly due to the lack of large, fine-grained motion datasets. We present FoundationMotion, a fully automated pipeline that: • detects & tracks objects in videos • extracts trajectories • uses LLMs + frames to generate rich motion captions & QA pairs → creating large-scale, high-quality motion datasets at scale. After fine-tuning the open-source models Qwen and NVILA on our annotations, these models now outperform the closed-source Gemini-3-Flash and GPT-5.1 on spatial understanding tasks across autonomous driving, robotics, and everyday scenarios. 📜Paper: 🌐Webpage: 💻 Code: 🕸️Model: 📊 Dataset: 👉 Interactive Demo: Let’s move research forward together. FoundationMotion is also referred to as Wolf V2 🐺, the second chapter in the Wolf series:

Boyi Li

66,999 views • 7 months ago