Do Vision-Language Models represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples!

Introducing the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to assess the spatial reasoning capabilities of VLMs. COMFORT includes systematically designed datasets and metrics that evaluate both model performance and deeper linguistic competence, specifically the spatial knowledge encoded in the models' internal representations. Find out more in the video teaser!

Almost all VLMs prefer the egocentric relative FoR with reflected transform, similar to English. Yet we reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages.

A shortened version will appear in the Pluralistic Alignment Workshop at #NeurIPS2024. It seems that the arXiv moderators put it on hold and are eager to give it a thorough read first🤣! So here is the Paper/Code/Data:

This collaboration turned out to be amazing, jointly led by Brian Zheyuan Zhang, @Hu_FY_, and Jayjun Lee, with so many contributions and insights from Freda Shi, Parisa Kordjamshidi, and Michigan SLED Lab. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning!

Martin Ziqiao Ma

4,219 subscribers

35,542 views • 1 year ago • via X (Twitter)

Education • Health & Wellness • Science & Technology

1 comment

Martin Ziqiao Ma • 1 year ago

Interesting fact: Reasoning across multiple intrinsic frames of reference is quite challenging, even for GPT-4. I adapted Figure 2.5 from Levinson 2003 (Logical Inadequacies of the Intrinsic Frame of Reference) into a yes/no question format, and GPT-4 struggled with both.
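To make the frame-of-reference ambiguity from the post concrete, here is a minimal Python sketch. It is not the COMFORT code: the top-down 2D scene, the coordinates, and the helper `side_of` are all illustrative assumptions. It shows how the same ball can be "left of the car" in the egocentric relative FoR (where left and right follow the viewer, as under the reflected transform) yet "right of the car" in the car's own intrinsic FoR.

```python
# Top-down 2D scene: x grows to the right, y grows upward.
# All names and coordinates here are hypothetical, for illustration only.

def side_of(target, reference, facing):
    """Say whether `target` lies left or right of `reference`,
    for a frame whose forward direction is the vector `facing`."""
    vx = target[0] - reference[0]
    vy = target[1] - reference[1]
    # The 2D cross product is positive when the target is
    # counterclockwise from the facing direction, i.e. on its left.
    cross = facing[0] * vy - facing[1] * vx
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "neither"

viewer = (0.0, -5.0)      # the observer, looking toward the car
car = (0.0, 0.0)          # reference object
car_facing = (0.0, -1.0)  # the car's front points at the viewer
ball = (-1.0, 0.0)        # object being located

# Egocentric relative FoR: left/right are taken from the viewer's gaze.
gaze = (car[0] - viewer[0], car[1] - viewer[1])
relative = side_of(ball, car, gaze)

# Intrinsic FoR: left/right are taken from the car's own front.
intrinsic = side_of(ball, car, car_facing)

print(relative, intrinsic)
```

Because the car faces the viewer, the two frames disagree, so a caption such as "the ball is to the left of the car" is true in one frame and false in the other.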

Related videos

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

🔥In Magma, we talked a lot about spatial/temporal intelligence beyond verbal intelligence, as advocated by Dr. Fei-Fei Li. So how to interpret it? Today I am happy to announce a new demo, Magma-Gaming: 👉 Rather than asking LLMs to write game code, we further ask the model to PLAY the game. A simple game like moving to the target in a 2D grid still requires precise action grounding and planning capability, yet challenges the most advanced VLMs and even the GPT-4o-mini model. Magma, born with stronger spatial understanding and reasoning ability, significantly outperforms its counterparts and achieves much higher scores in a zero-shot manner. This result pinpoints the huge potential of building multimodal agentic models endowed with both verbal and spatial intelligence! Also, I believe this simple demo gives you a better hint for understanding what spatial intelligence is, and why it is important!

Jianwei Yang

17,940 views • 1 year ago

Vision-Language Models (VLMs) can describe the environment, but can they refer within it? Our findings reveal a critical gap: VLMs fall short of pragmatic optimality. We identify 3 key failures of pragmatic competence in referring expression generation with VLMs: (1) cannot uniquely refer to the referent, (2) include excessive or irrelevant information, and (3) misalign with human pragmatic preferences. We introduce RefOI, a new dataset of 1.5k objects, each with 3 written and 2 spoken human-produced referring expressions. We also release RefOI-TLHF, a large dataset of token-level human feedback for 10.6k referring expressions. 👀 📄 Excited to colead this project with the amazing JaneDing, and huge thanks to the dream team Xuejun Zhang Dezhi Luo Diiikee @6SihanXu Yuchen Huang Roihn Run Peng 彭润 Michigan SLED Lab.

Martin Ziqiao Ma

20,636 views • 1 year ago

Demis Hassabis says world models are his longest standing passion and explains benefits vs. language models: ▫️ “I think language models are able to understand a lot about the world. More than we expected because language is actually probably richer than we thought. But there's still a lot about the spatial dynamics of the world, spatial awareness and the physical context we're in, and how that works mechanically, that is hard to describe in words and isn't generally described in corpuses of words. A lot of this is allied to learning from experience. There's a lot of things which you can't really describe something. You have to just experience it. Maybe the senses and so on are very hard to put into words. Whether that's motor angles and smell and these kinds of senses, it's very difficult to describe that in any kind of language.”▫️ This is what Demis and Google DeepMind are trying to solve with Genie. He also says that the video models (Veo) will play a part in training the world models and this is all key for AI robotics.

Bearly AI

123,666 views • 6 months ago

Today we’re releasing OpenEQA — the Open-Vocabulary Embodied Question Answering Benchmark. It measures an AI agent’s understanding of physical environments by probing it with open vocabulary questions like “Where did I leave my badge?” More details ➡️ All of today’s state-of-art vision+language models (VLMs) fall well short of human performance. In fact, for questions that require spatial understanding, today’s VLMs are nearly “blind” – access to visual content provides only minor improvements over language-only models. We hope that OpenEQA motivates additional research into helping AI understand and communicate about the world it sees.

AI at Meta

407,012 views • 2 years ago

Demis on why world models are his longest standing passion and explains benefits vs. language models: ▫️ “I think language models are able to understand a lot about the world. More than we expected because language is actually probably richer than we thought. But there's still a lot about the spatial dynamics of the world, spatial awareness and the physical context we're in, and how that works mechanically, that is hard to describe in words and isn't generally described in corpuses of words. A lot of this is allied to learning from experience. There's a lot of things which you can't really describe something. You have to just experience it. Maybe the senses and so on are very hard to put into words. Whether that's motor angles and smell and these kinds of senses, it's very difficult to describe that in any kind of language.”▫️

Bearly AI

78,545 views • 26 days ago

"Visual-spatial intelligence is as fundamental as language." Fei-Fei Li says world models are the key to letting AI see the world, reason about it, interact with it, navigate it, and even 'build civilization upon it.' "It's very natural for me that World Labs' north star is to unlock spatial intelligence. The moment to me is right to do it." "We've got these ingredients—we've got compute, we've got a much deeper understanding of data, way deeper than [the ImageNet days]... and we've got some advancement of algorithms."

a16z

31,675 views • 3 months ago

🔥 Battle for the top reasoning LLM intensifies! The QwQ-32B-Preview is a very good reasoning LLM. Full video of my tests here: Summary of my findings and thoughts: It was able to solve a couple of hard math problems so it looks very promising for maths. It didn’t do so well on my coding task (generating a bash script). Judging by the results reported on LiveCodeBench, it has room for improvement. One thing that’s become very clear to me is that the reasoning capabilities of these LLMs are significantly closing the gap between open and closed-source models. The competition is now going to be on a different level and it's going to be focused on which model produces the most efficient, optimized, accurate, and fastest reasoning steps beyond just accurate responses. That's what developers will care about. Traditional benchmarks are not going to be good enough for this. On that note, it's getting harder to assess these models, especially the consistency, efficiency, and quality of reasoning steps. After experimenting with this model, I realized that the reasoning paths are not fully optimized and there is a lot more optimization that needs to happen before these models are used in production settings. There might be a need to build some type of native and efficient self-assessment or self-reflection capability that prevents these reasoning LLMs from going in loops or producing unnecessarily lengthy sequences. I also noticed that this model, at least from the HF demo, doesn’t separate the reasoning from the response. I think that actually hurts the performance of the model. On the other hand, o1 and R1 do that really well. In addition to that, I believe the training on reasoning is hurting the performance of the LLM in other areas such as helpfulness (check the code example in the video). Something that’s necessary at the moment is validating or evaluating the quality of the reasoning chains and figuring out a better strategy to optimize them.
Current methods are probably not sufficient to solve this problem but that's where innovation will come next. I recognize that this is a first effort so kudos to the Qwen team on this release. These issues highlight the importance of transparency with reasoning LLMs. We need to know how it was trained and with what exact data or optimization strategy. Understanding that will enable researchers and developers to build better intuition and improve the reasoning capabilities and components at a faster rate. There is an opportunity for someone or a company to build a truly open-reasoning LLM. The race is on! I will continue to track the state-of-the-art in reasoning LLMs and report my takes and observations here. Stay tuned for more.

elvis

14,740 views • 1 year ago

Excited to share our latest work on 🎧spatial audio-driven human motion generation. We aim to tackle a largely underexplored yet important problem of enabling virtual humans to move naturally in response to spatial audio—capturing not just what is heard, but also where the sound is coming from. To this end, we introduce the Spatial Audio-Driven Human Motion (SAM) dataset—the first comprehensive dataset featuring paired high-quality human motion and spatial audio recordings. For benchmarking, we develop a generative framework for human MOtion generation driven by SPAtial audio, termed MOSPA, which learns to synthesize realistic and diverse human motions conditioned on spatial audio input. We hope this research could provide a foundation for future research in spatial perception, virtual characters, and embodied AI. The dataset and model will be open-sourced soon. A big thank you to our intern, Shuyang Xu, for the wonderful collaboration! Congratulations, Shuyang! Project page: Paper: Video: #Animation #CG #CV #AIGC #DL #Deeplearning #Motion #Graphics #AI #GenerativeAI

Zhiyang (Frank) Dou

14,603 views • 11 months ago

"We ought to be able to press 1 for English and get somebody on the other end that English is their native tongue. That they understand and speak the language of THIS LAND." President Trump made English the official language of this great country, it is time for everyone to get on board or GET OUT!

AmericanPapaBear™

11,255 views • 6 months ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: Instruction tuning with synthesized visual reasoning traces. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
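The idea of turning motion into discrete "motion tokens" can be illustrated with a minimal Python sketch of nearest-codebook vector quantization. This is only an assumption-laden toy: the abstract describes quantization learned from data, whereas the codebook, the 2D feature vectors, and the function `quantize` below are hand-written and hypothetical.

```python
# Toy vector quantization: map each continuous motion feature vector
# to the index of its nearest codebook entry. The codebook is fixed by
# hand here (an assumption); a real system would learn it from data.

def quantize(frames, codebook):
    """Return one token id per frame: the index of the closest code."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code.
        dists = [
            sum((a - b) ** 2 for a, b in zip(frame, code))
            for code in codebook
        ]
        tokens.append(dists.index(min(dists)))
    return tokens

# Hypothetical 2D "pose features" and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

motion_tokens = quantize(frames, codebook)
print(motion_tokens)  # [0, 1, 2]
```

Once motion is expressed as integer ids like these, it can be placed in the same sequence as word tokens, which is what allows language modeling over both.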

AK

125,311 views • 3 years ago

This is how large language models turn objects to vector representations. In this video, we explore how large language models (LLMs) convert objects into internal representations, especially when translating between languages like English and Hindi. Using real-world examples, we highlight the challenges of gender inference, grammatical structure, and why direct word-to-word translations often fail. If you're curious about how LLMs deal with multilingual contexts and what it takes to improve translation quality across languages, this video is for you. #LLMs #Vectors #LCM

Gaurav Sen

27,368 views • 1 year ago

Google DeepMind introduced two foundational models for embodied reasoning, enabling robots to comprehend, react, and take action in the physical world: ⦿ Gemini Robotics – built on Gemini 2.0. Integrates vision, language, and action for real-world dexterity. ⦿ Gemini Robotics-ER – Enhances spatial reasoning for advanced robotic control. They are working with Apptronik to develop the next generation of humanoid robots.

The Humanoid Hub

73,097 views • 1 year ago

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers

23,905 views • 3 months ago

The noble announcement aside, we need to learn to be more sensitive in the language we use in reference to people with disabilities. It is strange how often we find insulting and making fun of them funny!

Jim Spire Ssentongo

56,140 views • 5 months ago

a demo of using both the gpt-oss-20b and qwen3-coder-480b in the hyperspace agentic browser. this spatial, AI-native UI is designed as an extension of the spatial space of our brain - which places some things more in recency to us, and adjacent to each other. pinch to zoom.

Varun

180,438 views • 10 months ago

LP-MusicCaps: LLM-Based Pseudo Music Captioning paper page: Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systemic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.

AK

78,794 views • 2 years ago

Introducing Collaborative Reasoner: a framework to improve collaborative reasoning in language models. Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. Read the research paper and download the code.

AI at Meta

58,510 views • 1 year ago