Introducing HRM-Text. An ultra-lean 1B-parameter reasoning language model designed to deliver strong general performance with a fraction of the data, compute, and infrastructure. Trained on just 40B structured tokens, HRM-Text achieves competitive performance while using ~1/1000 of the training data of comparable models. The kicker? The full model trains...

Sapient Intelligence

5,396 subscribers

513,908 views • 2 months ago • via X (Twitter)

Education, News & Politics, Science & Technology

Related videos

FRIEDBERG: VIDEO DATA WILL POWER THE NEXT GENERATION OF AI Friedberg broke down the scale shift coming to artificial intelligence, arguing that text based models like GPT are just the beginning, and that the real revolution will come from video-trained systems: “The internet and all these LLMs are language models trained on text from the internet, around 50 billion words total, maybe one to five terabytes of data in their training sets. But if you look at the video data out there, there are hundreds of billions of hours, much of it on YouTube. By some estimates, there’s a thousand exabytes of video data on the internet, about a billion times more than text data. I think we just saw that play out with the new video model that launched yesterday. Google has all this YouTube data, whether or not they’re using it to train, I don’t know. I’ve heard from insiders they’re not allowed to yet and would have to redo the terms of service.” Source: AIFinInsights david friedberg

Mario Nawfal

15,855 views • 7 months ago
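The "about a billion times more" figure in the quote above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal SI units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes) and takes the quoted estimates at face value; the variable names are illustrative.

```python
# Rough check of the quoted data-volume estimates (decimal SI units assumed).
TB = 10**12  # bytes in one terabyte
EB = 10**18  # bytes in one exabyte

video_bytes = 1000 * EB           # "a thousand exabytes of video data"
text_low, text_high = 1 * TB, 5 * TB  # "one to five terabytes" of text

ratio_high = video_bytes // text_low   # video vs. the 1 TB text estimate
ratio_low = video_bytes // text_high   # video vs. the 5 TB text estimate

print(f"video/text ratio: {ratio_low:,} to {ratio_high:,}")
```

Under these assumptions the ratio spans roughly 200 million to 1 billion, so "about a billion" corresponds to the 1 TB end of the quoted range.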

Introducing SDXL Turbo: A real-time text-to-image generation model. SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. The code, research paper, and weights for non-commercial use are now available on our website. You can test SDXL Turbo on Stability AI’s image editing platform Clipdrop, with a beta demonstration of the real-time text-to-image generation capabilities. Learn more:

Stability AI

976,344 views • 2 years ago

Today marks the beginning of a new era. Introducing: Cowboy Space Corporation. We are building orbital infrastructure for the AI era: a fully integrated system of rockets and satellites designed to deliver high-performance compute and optical data transmission directly from Low Earth Orbit.

Cowboy Space Corp.

795,218 views • 2 months ago

Meta announces Movie Gen: A Cast of Media Foundation Models. We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models.

AK

62,719 views • 1 year ago
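The Movie Gen figures above (73K video tokens for 16 seconds at 16 frames per second) imply a rough token budget per output frame. The sketch below assumes "73K" means 73,000 and simply averages over output frames; the post does not say how tokens are actually distributed across frames or latents, so this is a back-of-the-envelope number only.

```python
# Back-of-the-envelope token budget per output frame (assumptions noted inline).
fps = 16
seconds = 16
context_tokens = 73_000  # assumption: "73K" read as 73,000

frames = fps * seconds                           # output frames in the clip
avg_tokens_per_frame = context_tokens / frames   # plain average, not the real layout

print(f"{frames} frames, about {avg_tokens_per_frame:.0f} tokens per frame on average")
```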

Google presents Still-Moving: Customized Video Generation without Customized Video Data. Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

General Intuition CEO Pim de Witte, who's building foundation models trained on video game controller input data ("action-labeled gameplay clips"), says general intelligence won't "taste like an LLM": "We have a scale of data that's going to allow us to jump to the frontier in one capability — which is any system that can be controlled with a game controller (which is most robots) — and then, you can use that to create a sufficiently general intelligence." "As humans, the decision to talk or type is a very, very small subset of the actions that we can actually take." "So in order to create a sufficiently general intelligence to play 10,000+ video games, the model has to be able to predict across the entire action space of human cognition when they're interacting with these environments. Which are 2D and 3D environments, interfaces, long-horizon tasks, short-horizon tasks, [etc.]." "It has to be a sufficiently general intelligence in order to predict actions. Therefore, the type of model you get out is not going to taste like an LLM. This model is going to be incredibly good at navigating unforeseen environments. It's going to be incredibly good at zero-shotting any task that can be done with a game controller."

TBPN

80,143 views • 27 days ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago
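The MotionGPT abstract above describes turning 3D motion into discrete "motion tokens" via vector quantization. The sketch below shows only the nearest-codebook lookup at the heart of that idea. The 2-D codebook, the sample frames, and the `<motion_k>` token names are all hypothetical; the actual model presumably learns its codebook and encodes full motion features, none of which is reproduced here.

```python
from math import dist

def quantize(frames, codebook):
    """Return, for each frame vector, the index of the nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]

# Toy 2-D codebook and motion features (hypothetical values for illustration).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
motion = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (1.1, 0.1)]

tokens = quantize(motion, codebook)
# Render indices as vocabulary items so they can sit alongside word tokens.
motion_tokens = [f"<motion_{k}>" for k in tokens]
print(motion_tokens)
```

Once motion is expressed as indices like these, it can in principle be placed in the same sequence as text tokens, which is the "motion vocabulary" idea the abstract refers to.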

Perplexity CEO Aravind Srinivas on the biggest threat to the data center industry: It's not competition. It's not regulation. It's decentralisation. "The biggest threat to a data center is if the intelligence can be packed locally on a chip that's running on the device and then there's no need to inference all of it on like one centralized data center." He outlines how this could work in practice. Personalisation doesn't necessarily require on-device model training. Retrieval augmented generation, tool calls, and local data can already tailor AI to individual users. But the real unlock? Test time training. Aravind Srinivas describes a future where AI lives on your device, watches how you work and gradually automates your repetitive tasks. "Imagine we crack test time training where the AI watches tasks you repeatedly do on your local system, adapts to you over time and starts automating a lot of the things you do." The key insight: in this model, the intelligence belongs to you. It's your data, your device, your personalised AI brain. And if that future arrives, the economics of centralised infrastructure start to collapse. "That really disrupts the whole data center industry. It doesn't make sense to spend all this money, 500 billion, 5 trillion, whatever on building all the centralized data centers across the world that do a lot of the intelligence workloads for people." The companies spending trillions on centralised infrastructure may want to rethink where intelligence actually needs to live.

Big Brain AI

90,102 views • 5 months ago

A 7-million parameter model outperforming models a thousand times its size on tasks like ARC Prize. That's what recursive reasoning unlocks. In this episode of Decoded, YC's Ankit Gupta and Francois Chaubard break down two recent papers on recursive AI models, HRMs and TRMs, that are achieving state-of-the-art results with a fraction of the parameters of today's largest models. They explain why standard LLMs hit a fundamental ceiling on certain reasoning tasks, how recursion at inference time gives small models the compute depth to break through it, and what happens when you combine these ideas with the power of large-scale foundation models. 00:35 - Model Foundations 01:15 - RNN Limits and LLM Contrast 02:36 - Reasoning Limits and Sorting Analogy 04:22 - HRM Paper Introduction 05:25 - HRM Architecture and Intuition 07:36 - HRM Results and Outer Loop 09:46 - TRM Paper Overview 11:20 - TRM Training and Fixed Point 13:30 - Detailed HRM Summary 20:46 - Comparing HRM and TRM 34:45 - Future Outlook

Y Combinator

127,645 views • 2 months ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas (like posters with slogans or multi-panel comics) into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 10 months ago

Last week we released Meta Chameleon: a new mixed-modal research model from Meta FAIR. Get the models ➡️ The 7B & 34B safety tuned models we’ve released can take any combination of text and images as input and produce text outputs using a new early fusion approach. While some LLMs have separate image and text encoders or decoders, Chameleon is one of the first publicly released approaches using a single unified architecture. We’re releasing Chameleon models under a research license to help democratize access to foundational mixed-modal models & further research on early fusion. Approach & training details in the paper ➡️

AI at Meta

54,428 views • 2 years ago

Autonomous driving with Chain of Thought - autopilot thinking out loud in text! LINGO-1 is the most interesting work I've read in autodriving for a while.

Before: perception -> driving action
After: perception -> textual reasoning -> action

LINGO-1 trains a video-language model that comments on the ongoing scene. You can ask it to explain its decisions ("why are you stopped?") and planning ("what are you gonna do next?"). The explicit reasoning step comes with key benefits:
- Explainability: driving models are no longer a mysterious blackbox that you pray for safety.
- Counterfactuals: it's able to imagine scenarios that are not in the training data, and reason through how to handle them correctly.
- Long-tail programming: there are soooo many edge cases in driving. It's impossible to have good data coverage on everything. Instead of collecting 1000s of examples to "neural program" a case, you can now have a human teacher write prompts to explain a handful of examples.

LINGO-1 is closely related to a few works in game AI:
- MineDojo (my team's work at NVIDIA): learns a reward model that aligns Minecraft gameplay videos with their transcripts. The model, called "MineCLIP", is able to ground commentary text in the video pixels.
- Thought Cloning (Jeff Clune): pixel -> language -> action loop in gridworlds.

Jim Fan

552,760 views • 2 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIA GTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.


Runway

1,162,805 views • 4 months ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen. We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.


AK

290,517 views • 3 years ago

We’re no longer just scaling computing power. We’re using compute to scale intelligence itself. That’s what makes this moment historically significant. For sixty years, progress in computing followed Moore’s Law—transistor density doubling roughly every two years. But AI is advancing on a far steeper curve. Today, frontier model capabilities are improving on a cadence closer to every six months—an order of magnitude faster than classical hardware scaling. The underlying principle is both simple and radical: when you increase data, compute, and model complexity, intelligence emerges. Scaling laws show that larger models—given sufficient compute and high-quality data—become predictably more capable. In just over a decade, we’ve gone from neural nets that could identify cats to systems that can draft legal briefs, write production-grade code, generate scientific hypotheses, and outperform top human competitors in mathematics, strategy, and reasoning tasks. This is no longer “software” in the traditional sense. It is a new form of intelligence—synthetic, scalable, rapidly compounding, and increasingly able to take meaningful action in the real world. The geopolitical, economic, and societal implications of this shift are only beginning to unfold—and they will redefine global power in the decades ahead.


Nina Schick

17,728 views • 6 months ago

This is not a joke! 🐬 Excited to share DolphinGemma, the first audio-to-audio model for dolphin communication! Yes, a model that predicts tokens of dolphin speech!
> DolphinGemma is the first LLM trained specifically to understand dolphin language patterns.
> Leverages 40 years of data from Dr. Denise Herzing's unique collection.
> Works like text prediction, trying to "complete" dolphin whistles and sounds.
> Uses wearable hardware (Google Pixel 9) to capture and analyze sounds in the field.
> DolphinGemma is designed to be fine-tuned with new data.
> Weights coming soon!
Research like this is why I love AI even more! ♥️


Philipp Schmid

112,244 views • 1 year ago

Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and help advance AI in a responsible way. More in the video from Joelle Pineau.
What we’re releasing:
🦎 Meta Chameleon: 7B & 34B language models that support mixed-modal input and text-only outputs.
🪙 Meta Multi-Token Prediction: pretrained language models for code completion using multi-token prediction.
🎼 Meta JASCO: generative text-to-music models capable of accepting various conditioning inputs for greater controllability. Paper available today with a pretrained model coming soon.
🗣️ Meta AudioSeal: an audio watermarking model that we believe is the first designed specifically for the localized detection of AI-generated speech, available under a commercial license.
📝 Additional RAI artifacts: including research, data and code to measure and improve the representation of geographical and cultural preferences and diversity in AI systems.
We believe that access to state-of-the-art AI creates opportunities for everyone – not just a small handful of Big Tech companies. We’re excited to share this work and to see how the community learns, iterates and builds using this technology. Details and access to everything released by FAIR today ➡️


AI at Meta

380,751 views • 2 years ago

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇


Fangchen Liu

68,366 views • 1 year ago