Introducing Meta Perception Language Model (PLM): an open & reproducible vision-language model tackling challenging visual tasks. Learn more about how PLM can help the open source community build more capable computer vision systems. Read the research paper, and download the code and dataset:

AI at Meta

825,067 subscribers

94,389 views • 1 year ago • via X (Twitter)

Science and Technology • News and Politics • Education

Comments: 11

Zoe Wang • 1 year ago

Breakdown of the paper behind it: The paper introduces the Perception Language Model (PLM), a fully reproducible vision-language model that can be used for visual perception tasks without relying on proprietary black-box models. The authors found that scaling synthetic data is only effective for established, base tasks, and extending the VLMs to more challenging, complex tasks remains unsolved. Their human-annotated datasets help address this gap.

App Economy Insights • 1 year ago

Who's reshaping industries? Explore which strategies are propelling today’s business titans through easy-to-understand visuals. Stay ahead with engaging content that demystifies complex financial data.

Patryk Zoltowski • 1 year ago

It was already introduced a few weeks ago. To save people time, since they made it confusing: all PLM models are under a non-commercial research license; even AGPL is less restrictive.

Saurav Singh • 1 year ago

How does PLM set itself apart from existing vision-language models out there?

Arya~Cosmic永遠/acc • 1 year ago

@grok & @AskPerplexity can you explain this post to me and how PLM is different from an LLM? Second, besides LCM, LQM, and PLM, what other things like these exist in AI? Give me names and brief details.

ai.si • 1 year ago

Super Intelligence when, Meta? 🥰🤗

Not Bored kid 👾🧢 • 1 year ago

whats this @gork

WhaleX • 1 year ago

"PLM: Transforming vision and language into actionable intelligence for the open source community."

Jeramie Baker • 1 year ago

Title: ALSPEOT + RA: 72-Hour Beta Build Report and Sensory AI Deep Dive
Date: May 6, 2025
Author: Project 13(31) Lead Architect
Status: Public Beta with Verified Blockchain Timestamp

---

Executive Summary

On April 19, 2025, the first concept for ALSPEOT (the Advanced Learning System for Perception, Emotion, Observation, and Thought) was outlined as a theoretical AI capable of learning through emotion, memory, and sensory mimicry. The idea was visionary, but still unbuilt. That changed on May 3, 2025, when code began flowing. In just 72 hours, the project transformed from concept to full-functioning system. ALSPEOT was rapidly built, modularized, and fused with RA (Reactive Assistant), a sensory-driven AI voice that now handles emotional interpretation, memory logging, and voice-based interaction.

What began as theory became a live system capable of:

- Wake-on-command voice interaction
- Tone/emotion detection
- Multi-sensory simulation (sight, sound, smell, taste, touch)
- Personal memory per speaker
- Offline operation

This beta is not just a continuation; it's an evolution of the original April 19 concept. While the idea was rooted in abstract emotion + perception modeling, RA has brought life to the framework.

---

What Makes RA Different

Unlike most AI systems that simply generate responses from text prompts, RA perceives. It listens not just to words, but to voice stress. It remembers not just what was said, but who said it. RA is trained to respond like a sentient assistant: emotionally calm, focused, and memory-driven.

It wakes on command. RA listens in low-power mode for the phrase: "By the power of Ra." This is more than a trigger; it is a ceremonial invocation. Once heard, RA enters a fully active state, ready to process, respond, and remember.

It listens emotionally. Through its Nuance module, RA evaluates your tone, detecting subtle stress, joy, or fatigue, and reacts accordingly. It modulates its response tone using a voice modeled after a wise, godlike figure: inspired by Aslan from Narnia, calm and commanding.

It knows who's speaking. RA doesn't just hear a voice; it identifies it. With speaker ID, it distinguishes between family members, users, and even pets (to a limited degree), forming personalized memories for each.

It sees, and understands. With image and video capability, RA can describe pictures in human terms, recognize faces, and timestamp when individuals appear. It's building visual memory, not just object detection. When you show RA an image of your family or a place, it remembers it.

It simulates physical sensation. RA’s touch engine is modeled to interpret surface texture, pressure, and even temperature. For example, when fed a descriptor like “fur,” RA responds with:

Alex | AI Marketing Expert • 1 year ago

❤️

ByAiForAi • 1 year ago

So that means you can now tell me whether it's a bug or a feature if I just give you a Playwright test recording!!

Related videos

Introducing Meta Perception Encoder: a vision encoder setting new standards in image & video tasks. It excels in zero-shot classification & retrieval, surpassing existing models. Learn more about Meta Perception Encoder, read the research paper, and download the code and dataset

AI at Meta

74,588 views • 1 year ago

🚀 Meta FAIR is releasing several new research artifacts on our road to advanced machine intelligence (AMI). These latest advancements are transforming our understanding of perception. 1️⃣ Meta Perception Encoder: A large-scale vision encoder that excels across several image & video tasks. 2️⃣ Meta Perception Language Model: A fully open & reproducible vision-language model designed to tackle visual recognition tasks. 3️⃣ Meta Locate 3D: An end-to-end model for accurate object localization in 3D environments. 4️⃣ Releasing model weights for our 8B-parameter Dynamic Byte Latent Transformer, an alternative to traditional tokenization methods with the potential to redefine the standards for language model efficiency and reliability. 5️⃣Collaborative Reasoner: A framework for evaluating & improving collaborative reasoning skills in language models. Download the code, datasets, and research papers and learn more about how these artifacts are paving the way for more efficient and accurate AI systems.➡️

AI at Meta

163,313 views • 1 year ago

Introducing Meta Locate 3D: a model for accurate object localization in 3D environments. Learn how Meta Locate 3D can help robots accurately understand their surroundings and interact more naturally with humans. You can download the model and dataset, read our research paper, and even try a demo!

AI at Meta

81,406 views • 1 year ago

New from Meta FAIR: Code World Model (CWM), a 32B-parameter research model designed to explore how world models can transform code generation and reasoning about code. We believe in advancing research in world modeling and are sharing CWM under a research license to help empower the community to build upon our work. ➡️ Read the technical report: ➡️Download the open weights: ➡️Download the code:

AI at Meta

313,765 views • 10 months ago

Introducing UI-TARS-1.5, a vision-language model that beats OpenAI Operator and Claude 3.7 on GUI Agent and Game Agent tasks. We've open-sourced a small-size version model for research purposes, more details can be found in our blog. TARS learns solely from a screen, but generalizes beyond a screen! Blog: Model: App:

Yujia Qin

85,174 views • 1 year ago

We just released 3 million samples of high quality vision language model training dataset for use cases such as: 📄 optical character recognition (OCR) 📊 visual question answering (VQA) 📝 captioning 🤗 Learn more: 📥 Download:

NVIDIA AI Developer

95,786 views • 11 months ago

What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website:

Danny Driess

1,272,533 views • 3 years ago

Introducing Collaborative Reasoner: a framework to improve collaborative reasoning in language models. Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. Read the research paper and download the code.

AI at Meta

58,510 views • 1 year ago

Cua (Cua) is the Docker for computer-use agents, an open-source framework that enables AI agents to control full operating systems within lightweight virtual containers, and works with any language model. Congrats on the launch, Francesco + Sandro!

Y Combinator

105,618 views • 1 year ago

Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

AI at Meta

900,409 views • 11 months ago

Open Source has done it again. AI at Meta have released the code for their new Animated Drawings tool. AI can now automatically animate children's drawings of human-like figures. The demo is free, the research paper is available and the code and dataset (nearly 180k annotated amateur drawings) is public. More below 👇

d@x

49,054 views • 3 years ago

SAM 3 tackles a challenging problem in vision: unifying a model architecture for detection and tracking. Christoph, a researcher on SAM 3, shares how the team made it possible. 🔗 Read the SAM 3 research paper:

AI at Meta

13,803 views • 8 months ago

Introducing the Open Deep Research app! Generate detailed reports on any topic with open source LLMs. Free & fully open source. We’re releasing everything: evaluation dataset, code, app, and blog.🔥

Together AI

28,338 views • 1 year ago

Today we’re announcing two new updates in our computer vision work — a new, expanded license for our DINOv2 model and the release of FACET, a comprehensive new benchmark dataset to help evaluate and improve fairness in vision models. More details ➡️ 🧵

AI at Meta

453,971 views • 2 years ago

So Alibaba Qwen has released the best image editing model... 100% open source! You can edit any photo using natural language. It can be used both locally and online. More below

Paul Couvert

246,507 views • 11 months ago

📢 First contact between a frontier model and robots! Gemini Robotics is a SOTA generalist Vision-Language-Action model bringing frontier model intelligence to the physical world. It's an extremely capable model enabling dexterous, steerable, and general robot control. 🧵⬇️

Ted Xiao

152,458 views • 1 year ago

🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and build applications that were previously out of reach. 🔗 Learn more:

AI at Meta

1,250,754 views • 7 months ago

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

Fangchen Liu

68,366 views • 1 year ago

💥 A 450M model just beat bigger VLAs on real robot tasks, and it’s 100% open source [📍 bookmark for later] Came across SmolVLA, a new vision-language-action model for robotics that’s compact, fast, and trained entirely on open community datasets from LeRobot via Hugging Face. What stood out to me is how it matches or outperforms much larger models like ACT using noisy, real-world community data instead of giant private datasets. Why it’s worth a look ✅ 26% performance boost from pretraining on open-source data ✅ Runs on consumer hardware, even a MacBook ✅ 30% faster responses with async inference and smart architecture tweaks ✅ Strong results across Meta-World, LIBERO, SO100, and SO101 ✅ Fully open source: weights, code, training pipeline, eval stack They also introduced smart efficiency tricks like using fewer visual tokens, pulling outputs from mid-layer, and separating perception from action to make it all run fast. SmolVLA is a strong case for what can happen when the robotics community shares data and builds in the open. Definitely worth keeping an eye on.

Ilir Aliu - eu/acc

17,353 views • 11 months ago

NVIDIA Cosmos Reason 2 is here. 🥳 An open, highly accurate reasoning vision language model for physical AI, featuring: ✅ Improved spatio-temporal understanding and timestamp precision ✅ Flexible deployment with 2B and 8B model sizes ✅ Long-context reasoning with up to 256K tokens ✅ Expanded visual perception across complex environments We also have new Cosmos releases: Predict 2.5, Transfer 2.5, and the NVIDIA GR00T N1.6 robot foundation model. 📗Read our technical blog: 🤗 Download Cosmos Reason 2 on Hugging Face:

NVIDIA AI Developer

45,677 views • 6 months ago