Meta presents Sapiens: Foundation for Human Vision Models. We present Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability: model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
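The depth and normal results above are quoted as relative error reductions. As a rough illustration of how such figures are typically computed, here is a minimal sketch in plain Python. The function names and the toy numbers are assumptions made for this example and are not taken from the Sapiens paper or code.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of depths."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle, in degrees, between predicted and ground-truth normals."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
        cos_angle = max(-1.0, min(1.0, dot / norm))
        total += math.degrees(math.acos(cos_angle))
    return total / len(pred_normals)


def relative_improvement(baseline_error, new_error):
    """Fractional error reduction versus a baseline (0.224 means 22.4%)."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy values, chosen only to exercise the functions.
depth_error = rmse([1.0, 2.0], [1.0, 4.0])
normal_error = mean_angular_error_deg([(0.0, 0.0, 1.0)], [(0.0, 1.0, 0.0)])
gain = relative_improvement(baseline_error=10.0, new_error=7.76)
```

Under this reading, a "22.4% relative RMSE" improvement means the new model's RMSE is 22.4% lower than the baseline's, not that the RMSE itself equals 22.4%.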

AK

501,677 subscribers

151,511 views • 1 year ago • via X (Twitter)

Science & Technology, Education

10 Comments

BensenHsu • 1 year ago

The paper presents Sapiens, a family of vision transformer models trained on a large dataset of human images. The goal is to develop models that can generalize well, be applicable to a wide range of tasks, and produce high-quality outputs. The results demonstrate the benefit of pretraining on a large, curated dataset of human images. The models are able to generalize well to various scenarios, including multi-person scenes, egocentric views, and challenging poses. The high-resolution (1024x1024) pretraining and the detailed annotation of the finetuning datasets also contribute to the models' strong performance. full paper:

Supreme • 1 year ago

normal map is mind blowing what the tech

TheEarningsNugget • 1 year ago

"Sapiens: Foundation for Human Vision Models" PAPER SUMMARY

Miguel Xochicale 🧑🏽‍🔬🤖〰️ • 1 year ago

Nice one but these links are not working (will they open it soon?)

bryan pratte • 1 year ago

No code :(

Alessandro De Blasis • 1 year ago

Is it real-time or post-processing?

Alessandro De Blasis • 1 year ago

Want

Self-Attention Mechanism • 1 year ago

can it spot a soldier and identify the head?

Cavit Erginsoy • 1 year ago

Why non commercial license @Meta 😵‍💫

Patryk Zoltowski • 1 year ago

Hope there will be some distilled model for realtime inference on mobile

Related Videos

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

CosmicMan A Text-to-Image Foundation Model for Humans We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago

JARVIS-VLA just dropped on Hugging Face. Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse. The authors obtain VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, they demonstrate that the approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance.

AK

60,162 views • 1 year ago

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks, e.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 views • 11 months ago

🚨Crowded scenes are notoriously difficult for #ComputerVision models. In our new work led by Zhou Mu & Lucas Stoffl we developed a novel approach called BUCTD that is state-of-the-art on crowded 🐒🐠&🕺human pose estimation benchmarks!🔥 #CVPR2023 workshop

Mackenzie Weygandt Mathis, PhD

67,207 views • 3 years ago

Introducing SAM 3D, the newest addition to the SAM collection, bringing common sense 3D understanding of everyday images. SAM 3D includes two models: 🛋️ SAM 3D Objects for object and scene reconstruction 🧑‍🤝‍🧑 SAM 3D Body for human pose and shape estimation Both models achieve state-of-the-art performance transforming static 2D images into vivid, accurate reconstructions. 🔗 Learn more:

AI at Meta

858,138 views • 7 months ago

Google has launched Med-Gemini, an advanced AI fine-tuned for medical tasks. It significantly outperforms earlier models, including GPT-4, on most medical benchmarks. Achieves top scores, particularly on the MedQA-USMLE benchmark with a groundbreaking 91.1% accuracy. 🚀 Demonstrates superior performance over GPT-4 by 44.5% on average across seven multimodal benchmarks. Excels in tasks such as medical summarization, generating doctor referrals, and simplifying medical documents. It is a preferred method over human expert analyses for complex text-based medical tasks. This marks a significant advancement in AI for healthcare, suggesting potential improvements in medical diagnostics and patient care. By choosing to harmonize with technology, we become more human, we become IRREPLACEABLE. Join the IRREPLACEABLE Academy and read the Book: #techforgood #medicine #ai

Pascal Bornet

14,393 views • 1 year ago

Yay, finally! Introducing Vision Banana🍌 from Google DeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: (1/5)

Songyou Peng

284,291 views • 2 months ago

TikTok presents Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. paper page: demo: Depth Anything is trained on 1.5M labeled images and 62M+ unlabeled images jointly, providing the most capable Monocular Depth Estimation (MDE) foundation models with the following features: zero-shot relative depth estimation, better than MiDaS v3.1 (BEiTL-512); zero-shot metric depth estimation, better than ZoeDepth; optimal in-domain fine-tuning and evaluation on NYUv2 and KITTI.

AK

600,018 views • 2 years ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Meta's Joe Spisak explains how AI models can train themselves by generating images, asking themselves questions about them, and choosing the best answers, in order to move beyond human data and human fine-tuning and teach themselves from synthetic data

Tsarathustra

42,390 views • 1 year ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

430,406 views • 3 months ago

New short course on Pretraining LLMs! Developed with Upstage and taught by their CEO Sung Kim and CSO Lucy Park. While prompting or fine-tuning existing models works well for many general language tasks, pretraining is valuable for specialized domains or languages with limited representation in current models. This course walks you through the LLM pretraining pipeline: 1. Data preparation: Learn to source, clean, and prepare training data using HuggingFace. 2. Model architecture: Configure transformer networks, including modifying existing models. 3. Training: Set up and run training using open-source libraries. 4. Evaluation: Benchmark performance using popular evaluation strategies. As an example use case, you'll also compare the output of a base model with its fine-tuned and further pretrained variants, to see the impact of pretraining on a model's ability to write Python. The course also explores an innovative technique called depth up-scaling, which Upstage used to train their Solar model family, reducing pretraining compute costs by up to 70%. This technique works by first duplicating layers of a smaller pretrained model to form a larger model, and then further pretraining the result. Sign up here!

Andrew Ng

85,654 views • 1 year ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

Jiafei Duan

48,739 views • 1 year ago

ViPE: Video Pose Engine for 3D Geometric Perception Contributions: • A robust and efficient framework, ViPE, for estimating camera parameters and dense depth from diverse, in-the-wild videos. • A system design that integrates the strengths of classical SLAM (efficiency, scalability) and learned models (robustness), with key improvements in efficiency, dynamic object handling, and depth quality over prior work. • A large-scale dataset of annotated videos, created using ViPE, to facilitate future research in 3D computer vision.

MrNeRF

42,534 views • 10 months ago

Human Archive builds datasets to model sensorimotor intelligence for robotics and world models. HA-Multi is the largest dataset of manual labor tasks aligning egocentric RGB, stereo depth (active IR), tactile gloves, body IMUs, and wrist cameras. Congrats on the launch, Raj Patel, Shloke Patel, Samay Maini, and Rushil Agarwal!

Y Combinator

109,696 views • 3 months ago

For the first time in human history, we are teaching a Foundation Model to master the diverse tasks of medicinal chemists, biologists, and computational scientists all in one place. In our latest collaboration with Liquid AI, we are moving away from fragmented, specialized tools toward a single, super-intelligent model. What surprised me most? This model isn't just performing at reasonable levels—it has started outperforming specialist models across physics-based tasks, imaging, and longitudinal data. Why this changes everything: -Synergy over Specialization: Fine-tuning on specific tasks has unlocked unexpected capabilities in synergetic areas, opening a new frontier in multimodal AI research. -Zero-Shot Potential: We are building a model that can perform out-of-scope tasks, moving us closer to an "AI deity" for drug discovery. -Quality First: The goal isn't just to bypass regulations to save time; it’s about using these synergies to develop better, more effective drugs. We are no longer just looking at linear regression or simple text; we are looking at the future of how humanity fights disease. #LiquidAI #InsilicoMedicine #GenerativeAI #DrugDiscovery #DeepTech #BiotechInnovation

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

10,544 views • 3 months ago

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,647 views • 2 years ago

NVIDIA just announced EgoScale 🤖🧠 NVIDIA Research has uncovered a log-linear scaling law for robot dexterity by pretraining VLA models on over 20,000 hours of egocentric human video. This massive dataset is 20 times larger than previous efforts and proves that robot intelligence follows a predictable path: the more human data, the lower the loss. The secret is a simple recipe combining large-scale human pretraining with a small amount of aligned human-robot mid-training to bridge the gap. In testing, this method boosted the average success rate by 54% on a 22-DoF robotic hand compared to policies built without pretraining. EgoScale also enables one-shot task adaptation and works across different hardware, suggesting that human motion is a universal motor prior for robots. Website: Paper: Source: NVIDIA Research #Robot #Humanoid #Robotics #AI #EmbodiedAI #PhysicalAI #NVIDIA #EgoScale #GR00T

RoboHub🤖

43,752 views • 4 months ago