Meta presents Sapiens: Foundation for Human Vision Models. We present Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual...

151,511 views • 1 year ago • via X (Twitter)

10 Comments

BensenHsu • 1 year ago

The paper presents Sapiens, a family of vision transformer models trained on a large dataset of human images. The goal is to develop models that can generalize well, be applicable to a wide range of tasks, and produce high-quality outputs. The results demonstrate the benefit of pretraining on a large, curated dataset of human images. The models are able to generalize well to various scenarios, including multi-person scenes, egocentric views, and challenging poses. The high-resolution (1024x1024) pretraining and the detailed annotation of the finetuning datasets also contribute to the models' strong performance. full paper:

Supreme • 1 year ago

normal map is mind blowing what the tech

TheEarningsNugget • 1 year ago

"Sapiens: Foundation for Human Vision Models" PAPER SUMMARY

Miguel Xochicale 🧑🏽‍🔬🤖〰️ • 1 year ago

Nice one but these links are not working (will they open it soon?)

bryan pratte • 1 year ago

No code :(

Alessandro De Blasis • 1 year ago

Is it real-time or post-processing?

Alessandro De Blasis • 1 year ago

Want

Self-Attention Mechanism • 1 year ago

can it spot a soldier and identify the head?

Cavit Erginsoy • 1 year ago

Why non commercial license @Meta 😵‍💫

Patryk Zoltowski • 1 year ago

Hope there will be some distilled model for realtime inference on mobile

Related Videos

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends:
🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks.
🔹 They perform semantic tasks notably better than geometric ones.
🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks.
🔹 Reasoning models, e.g., o3, show improvements in geometric tasks.
🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks, e.g., hallucinated objects, misalignment between the input and output, etc.
🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments.
🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 views • 11 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago