Do Vision-Language Models represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples! Introducing the COnsistent Multilingual Frame Of...

35,542 views • 1 year ago • via X (Twitter)

Comments: 1

Martin Ziqiao Ma • 1 year ago

Interesting fact: Reasoning across multiple intrinsic frames of reference is quite challenging, even for GPT-4. I adapted Figure 2.5 from Levinson 2003 (Logical Inadequacies of the Intrinsic Frame of Reference) into a yes/no question format, and GPT-4 struggled with both.

Related videos

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,494 views • 2 years ago

🔥 Battle for the top reasoning LLM intensifies! The QwQ-32B-Preview is a very good reasoning LLM. Full video of my tests here: Summary of my findings and thoughts: It was able to solve a couple of hard math problems so it looks very promising for maths. It didn’t do so well on my coding task (generating bash script). By the results reported on the LiveCodeBench it has room for improvement. One thing that’s become very clear to me is that the reasoning capabilities of these LLMs are significantly closing the gap between the open and closed-source models. The competition is now going to be on a different level and it's going to be focused on which model produces the most efficient, optimized, accurate, and fastest reasoning steps beyond just accurate responses. That's what developers will care about. Traditional benchmarks are not going to be good enough for this. On that note, it's getting harder to assess these models, especially the consistency, efficiency, and quality of reasoning steps. After experimenting with this model, I realized that the reasoning paths are not fully optimized and there is a lot more optimization that needs to happen before these models are used in production settings. There might be a need to build some type of native and efficient self-assessment or self-reflection capability that prevents these reasoning LLMs from going in loops or producing unnecessarily lengthy sequences. I also noticed that this model, at least from the HF demo, doesn’t separate the reasoning from the response. I think that actually hurts the performance of the model. On the other hand, o1 and R1 do that really well. In addition to that, I believe the training on reasoning is hurting the performance of the LLM in other areas such as helpfulness (check the code example in the video). Something that’s necessary at the moment is validating or evaluating the quality of the reasoning chains and figuring out a better strategy to optimize them.
Current methods are probably not sufficient to solve this problem but that's where innovation will come next. I recognize that this is a first effort so kudos to the Qwen team on this release. These issues highlight the importance of transparency with reasoning LLMs. We need to know how it was trained and with what exact data or optimization strategy. Understanding that will enable researchers and developers to build better intuition and improve the reasoning capabilities and components at a faster rate. There is an opportunity for someone or a company to build a truly open-reasoning LLM. The race is on! I will continue to track the state-of-the-art in reasoning LLMs and report my takes and observations here. Stay tuned for more.

elvis

14,740 views • 1 year ago