Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

LegoGPT, an LLM-based system that generates physically stable LEGO structures from text prompts, backed by a new 47,000+ sample dataset and physics-aware filtering during inference. → LegoGPT is trained on a custom dataset, StableText2Lego, which includes 47,000+ 3D LEGO models mapped to text, spanning 28,000+ unique objects. → The... model predicts LEGO bricks sequentially like tokens, using next-token prediction in a transformer setup. → To ensure physical stability, LegoGPT integrates physics-aware rollback and validity filtering, pruning out structurally invalid brick placements. → The generated designs are aesthetically aligned with prompts, physically buildable, and tested both with human manual assembly and robotic arms. → The team also introduced a text-driven LEGO coloring/texturing pipeline, enabling more expressive and customized outputs. → The dataset, code, and models are all publicly released under an open-access license.show more

Rohan Paul

152,901 subscribers

75,268 görüntüleme • 1 yıl önce •via X (Twitter)

Sanat Bilim & Teknoloji Eğitim

Anya Rossi• Live Now

Private livecam show

10 Yorum

RTTS profil fotoğrafı

RTTS1 yıl önce

Testing Salesforce presents unique challenges due to its complexity, scalability and customizability. RTTS can plan, design & automate a successful testing process for you.

Justin Obney profil fotoğrafı

Justin Obney1 yıl önce

This is dope. Check out this exploration I did with my kids.

Rohan Paul profil fotoğrafı

Rohan Paul1 yıl önce

cool.. 👍

Sanskar Pandey profil fotoğrafı

Sanskar Pandey1 yıl önce

the logical conclusion to NLP is its intersection with robotics @ruhzi57

Jacek (Jomsborg.eth) profil fotoğrafı

Jacek (Jomsborg.eth)1 yıl önce

L(L)M. We should care more about what cannot be expressed through language. All rest is like LEGO.

✨ profil fotoğrafı

✨1 yıl önce

@_mcbench irl

Jack Lau profil fotoğrafı

Jack Lau1 yıl önce

Can't wait to see what others come up with using it.

Varun K | AI Insights profil fotoğrafı

Varun K | AI Insights1 yıl önce

LegoGPT actually sounds like the future of playtime! 47k+ models, physics checks, AND it’s tested with real humans and robots building the stuff?? according to Tom's Hardware, it nails physical stability 98% of the time. gonna try this out for my next LEGO binge lol

Vinayak profil fotoğrafı

Vinayak1 yıl önce

Damn it's sooo cool, I wanna work on this amazing stuff one day!

cryptobiot profil fotoğrafı

cryptobiot1 yıl önce

don't tell me chatgpt is now taking my childhood lego master builder dream job, too

Benzer Videolar

We've released the code for LegoGPT. This autoregressive model generates physically stable and buildable designs from text prompts, by integrating physics laws and assembly constraints into LLM training and inference. This work is led by PhD students Ava Pun, Kangle Deng, Ruixuan Liu, and in collaboration with CMU faculty Changliu Liu and Deva Ramanan. LegoGPT is a small first step towards the ultimate goal of generative manufacturing of physical objects. Our implementation is limited to 20x20x20 dimensions, 21 object categories, and simple brick types, but we are working on scaling it up! Code: Website: Demo:

We've released the code for LegoGPT. This autoregressive model generates physically stable and buildable designs from text prompts, by integrating physics laws and assembly constraints into LLM training and inference. This work is led by PhD students Ava Pun, Kangle Deng, Ruixuan Liu, and in collaboration with CMU faculty Changliu Liu and Deva Ramanan. LegoGPT is a small first step towards the ultimate goal of generative manufacturing of physical objects. Our implementation is limited to 20x20x20 dimensions, 21 object categories, and simple brick types, but we are working on scaling it up! Code: Website: Demo:

Jun-Yan Zhu

38,607 görüntüleme • 1 yıl önce

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 görüntüleme • 3 yıl önce

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,658 görüntüleme • 10 ay önce

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 görüntüleme • 10 ay önce

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 görüntüleme • 10 ay önce

Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and help advance AI in a responsible way. More in the video from Joelle Pineau. What we’re releasing: 🦎 Meta Chameleon 7B & 34B language models that support mixed-modal input and text-only outputs. 🪙 Meta Multi-Token Prediction Pretrained Language Models for code completion using Multi-Token Prediction. 🎼 Meta JASCO Generative text-to-music models capable of accepting various conditioning inputs for greater controllability. Paper available today with a pretrained model coming soon. 🗣️ Meta AudioSeal An audio watermarking model that we believe is the first designed specifically for the localized detection of AI-generated speech, available under a commercial license. 📝 Additional RAI artifacts Including research, data and code to measure and improve the representation of geographical and cultural preferences and diversity in AI systems. We believe that access to state-of-the-art AI creates opportunities for everyone – not just a small handful of Big Tech companies. We’re excited to share this work and to see how the community learns, iterates and builds using this technology. Details and access to everything released by FAIR today ➡️

Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and help advance AI in a responsible way. More in the video from Joelle Pineau. What we’re releasing: 🦎 Meta Chameleon 7B & 34B language models that support mixed-modal input and text-only outputs. 🪙 Meta Multi-Token Prediction Pretrained Language Models for code completion using Multi-Token Prediction. 🎼 Meta JASCO Generative text-to-music models capable of accepting various conditioning inputs for greater controllability. Paper available today with a pretrained model coming soon. 🗣️ Meta AudioSeal An audio watermarking model that we believe is the first designed specifically for the localized detection of AI-generated speech, available under a commercial license. 📝 Additional RAI artifacts Including research, data and code to measure and improve the representation of geographical and cultural preferences and diversity in AI systems. We believe that access to state-of-the-art AI creates opportunities for everyone – not just a small handful of Big Tech companies. We’re excited to share this work and to see how the community learns, iterates and builds using this technology. Details and access to everything released by FAIR today ➡️

AI at Meta

380,751 görüntüleme • 2 yıl önce

In my past research experience, finding or developing an appropriate simulation environment, dataset, and benchmark has always been a challenge. Missing features, limited support, or unexpected bugs often occupied my days and nights. Moreover, current simulation platforms are relatively fragmented—making it challenging to replicate the success of the RT-X dataset in unifying community efforts. Introducing RoboVerse, we provide a unified platform, dataset, and benchmark for scalable and generalizable robot learning. We hope to build a shared foundation to combine the community efforts. RoboVerse includes: MetaSim: We carefully designed a configuration system and a universal interface to align current robotic simulators. With MetaSim, you can use any simulator with the same code—bringing together the community’s diverse efforts under one framework! RoboVerse Dataset and Benchmark: We unify popular simulation environments and benchmarks into a single cohesive system and introduce the RoboVerse dataset—a large-scale, high-quality synthetic dataset. Additionally, we propose a standardized benchmark across both imitation learning and reinforcement learning. A cool feature enabled by our unified framework: Hybrid Simulation! You can now integrate physics engines and renderers from different simulators—e.g., using MuJoCo precise physics with Isaac photorealistic rendering. This not only elevates simulation fidelity but also significantly enhances real-world transfer performance across complex robotic applications. Hopefully, our team’s efforts could serve the robotic community to thrive vibrantly in the years to come. RoboVerse is open-sourced🥳!!! Project Page: Documentation: Github Repo: Paper:

In my past research experience, finding or developing an appropriate simulation environment, dataset, and benchmark has always been a challenge. Missing features, limited support, or unexpected bugs often occupied my days and nights. Moreover, current simulation platforms are relatively fragmented—making it challenging to replicate the success of the RT-X dataset in unifying community efforts. Introducing RoboVerse, we provide a unified platform, dataset, and benchmark for scalable and generalizable robot learning. We hope to build a shared foundation to combine the community efforts. RoboVerse includes: MetaSim: We carefully designed a configuration system and a universal interface to align current robotic simulators. With MetaSim, you can use any simulator with the same code—bringing together the community’s diverse efforts under one framework! RoboVerse Dataset and Benchmark: We unify popular simulation environments and benchmarks into a single cohesive system and introduce the RoboVerse dataset—a large-scale, high-quality synthetic dataset. Additionally, we propose a standardized benchmark across both imitation learning and reinforcement learning. A cool feature enabled by our unified framework: Hybrid Simulation! You can now integrate physics engines and renderers from different simulators—e.g., using MuJoCo precise physics with Isaac photorealistic rendering. This not only elevates simulation fidelity but also significantly enhances real-world transfer performance across complex robotic applications. Hopefully, our team’s efforts could serve the robotic community to thrive vibrantly in the years to come. RoboVerse is open-sourced🥳!!! Project Page: Documentation: Github Repo: Paper:

Haoran Geng

84,249 görüntüleme • 1 yıl önce

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Xiang Yue

57,699 görüntüleme • 1 yıl önce

Text-to-SQL is dead 💀 The next generation of querying is agentic - and it’s already here. This paper ( introduces a type of agentic querying called 𝗙𝘂𝗻𝗰𝘁𝗶𝗼𝗻 𝗖𝗮𝗹𝗹𝗶𝗻𝗴 that uses an LLM to structure queries using predefined function calls in JSON format, with optional arguments for search, filters, aggregation, and grouping. Along with that, it also tests out a bunch of different models with a new dataset, DBGorilla, designed to evaluate agentic querying techniques on real-world use cases. Weaviate AI Database also just released a 𝗤𝘂𝗲𝗿𝘆 𝗔𝗴𝗲𝗻𝘁, designed based on some of the work in this paper, to handle advanced agentic querying out of the box, find out more here:

Text-to-SQL is dead 💀 The next generation of querying is agentic - and it’s already here. This paper ( introduces a type of agentic querying called 𝗙𝘂𝗻𝗰𝘁𝗶𝗼𝗻 𝗖𝗮𝗹𝗹𝗶𝗻𝗴 that uses an LLM to structure queries using predefined function calls in JSON format, with optional arguments for search, filters, aggregation, and grouping. Along with that, it also tests out a bunch of different models with a new dataset, DBGorilla, designed to evaluate agentic querying techniques on real-world use cases. Weaviate AI Database also just released a 𝗤𝘂𝗲𝗿𝘆 𝗔𝗴𝗲𝗻𝘁, designed based on some of the work in this paper, to handle advanced agentic querying out of the box, find out more here:

Victoria Slocum

189,014 görüntüleme • 1 yıl önce

Open science is how we continue to push technology forward and today at Meta FAIR we’re sharing eight new AI research artifacts including new models, datasets and code to inspire innovation in the community. More in the video from Joelle Pineau. This work is another important step towards our goal of achieving Advanced Machine Intelligence (AMI). What we’re releasing: • Meta Spirit LM: An open source language model for seamless speech and text integration. • Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. Plus a new developer suite to make it easier for developers to build with SAM 2. • Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance. • SALSA: New code to enable researchers to benchmark AI-based attacks in support of validating security for post-quantum cryptography. • Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale. • Meta Open Materials: New open source models and the largest dataset of its kind to accelerate AI-driven discovery of new inorganic materials. • MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder with coverage across 80 languages. • Self-Taught Evaluator: a new method for generating synthetic preference data to train reward models without relying on human annotations. Access to state-of-the-art AI creates opportunities for everyone. We’re excited to share this work and look forward to seeing the community innovation that results from it. Details and access to everything released by FAIR today ➡️

Open science is how we continue to push technology forward and today at Meta FAIR we’re sharing eight new AI research artifacts including new models, datasets and code to inspire innovation in the community. More in the video from Joelle Pineau. This work is another important step towards our goal of achieving Advanced Machine Intelligence (AMI). What we’re releasing: • Meta Spirit LM: An open source language model for seamless speech and text integration. • Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. Plus a new developer suite to make it easier for developers to build with SAM 2. • Layer Skip: Inference code and fine-tuned checkpoints demonstrating a new method for enhancing LLM performance. • SALSA: New code to enable researchers to benchmark AI-based attacks in support of validating security for post-quantum cryptography. • Meta Lingua: A lightweight and self-contained codebase designed to train language models at scale. • Meta Open Materials: New open source models and the largest dataset of its kind to accelerate AI-driven discovery of new inorganic materials. • MEXMA: A new research paper and code for our novel pre-trained cross-lingual sentence encoder with coverage across 80 languages. • Self-Taught Evaluator: a new method for generating synthetic preference data to train reward models without relying on human annotations. Access to state-of-the-art AI creates opportunities for everyone. We’re excited to share this work and look forward to seeing the community innovation that results from it. Details and access to everything released by FAIR today ➡️

AI at Meta

150,222 görüntüleme • 1 yıl önce

This is probably the most complex workflow I’ve ever built, only with open-source tools. It took my 4 days. It takes four inputs: author, title, and style; and generates a full visual animated story in one click in ComfyUI . I worked on it for four days. There are still some bugs, but here’s the first preview. Here’s a quick breakdown: - The four inputs are sent to LLMs with precise instructions to generate: first, prompts for images and image modifications; second, prompts for animations; third, prompts for generating music. - All voices are generated from the text and timed precisely, as they determine the length of each animation segment. - The first image and video are generated to serve as the title, but also as the guide for all other images created for the video. - Titles and subtitles are also added automatically in Comfy. - I also developed a lot of custom nodes for minor frame calculations, mostly to match audio and video. - The full system is a large loop that, for each line of text, generates an image and then a video from that image. The loop was the hardest part to build in this workflow, so it can process either a 20-second video or a 2-minute video with the same input. - There are multiple combinations of LLMs that try to understand the text in the best way to provide the best prompts for images and video. - The final video is assembled entirely within ComfyUI. - The music is generated based on the LLM output and matches the exact timing of the full animation. - Done! For reference, this workflow uses a lot of models and only works on an RTX 6000 Pro with plenty of RAM. My goal is not to replace humans, as I’ll try to explain later, this workflow is highly controlled and can be adapted or reworked at any point by real artists! My aim was to create a tool that can animate text in one go, allowing the AI some freedom while keeping a strict flow. I don’t know yet how I’ll share this workflow with people, I still need to polish it properly, but maybe through Patreon. Anyway, I hope you enjoy my research, and let’s always keep pushing further! :)

This is probably the most complex workflow I’ve ever built, only with open-source tools. It took my 4 days. It takes four inputs: author, title, and style; and generates a full visual animated story in one click in ComfyUI . I worked on it for four days. There are still some bugs, but here’s the first preview. Here’s a quick breakdown: - The four inputs are sent to LLMs with precise instructions to generate: first, prompts for images and image modifications; second, prompts for animations; third, prompts for generating music. - All voices are generated from the text and timed precisely, as they determine the length of each animation segment. - The first image and video are generated to serve as the title, but also as the guide for all other images created for the video. - Titles and subtitles are also added automatically in Comfy. - I also developed a lot of custom nodes for minor frame calculations, mostly to match audio and video. - The full system is a large loop that, for each line of text, generates an image and then a video from that image. The loop was the hardest part to build in this workflow, so it can process either a 20-second video or a 2-minute video with the same input. - There are multiple combinations of LLMs that try to understand the text in the best way to provide the best prompts for images and video. - The final video is assembled entirely within ComfyUI. - The music is generated based on the LLM output and matches the exact timing of the full animation. - Done! For reference, this workflow uses a lot of models and only works on an RTX 6000 Pro with plenty of RAM. My goal is not to replace humans, as I’ll try to explain later, this workflow is highly controlled and can be adapted or reworked at any point by real artists! My aim was to create a tool that can animate text in one go, allowing the AI some freedom while keeping a strict flow. I don’t know yet how I’ll share this workflow with people, I still need to polish it properly, but maybe through Patreon. Anyway, I hope you enjoy my research, and let’s always keep pushing further! :)

Lovis Odin

58,769 görüntüleme • 10 ay önce

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

466,226 görüntüleme • 1 yıl önce

Breaking news: Cosmos 3 is here. They are attempting to do something completely new 🤯 Why is Physical AI much harder than building a chatbot? Understanding the world is not enough, robots need to predict it and act inside it. That's the idea behind NVIDIA Cosmos 3: → Reasoning model understands what's happening from video, images, text, and actions. → World model generates future states of the environment. → Action model generates the actions needed to achieve a goal. Previous systems often stitched these capabilities together using separate models. Cosmos 3 combines them into a single architecture with two components: ▪️ Reasoner Tower Analyzes observations and builds an understanding of objects, motion, interactions, and physical context. ▪️ Generator Tower Uses that understanding to generate future videos and action sequences that obey physical constraints. So Cosmos 3 moves from: Perception → Model A Prediction → Model B Actions → Model C to: Perception + Prediction + Actions → One unified system. The goal is to make robots, autonomous vehicles, and smart environments better at answering three questions: 1. What is happening? 2. What will happen next? 3. What should I do? That's a big shift from today's AI, which mostly focuses on generating text. And check the benchmarks! Physical AI needs to generate decisions that survive contact with the real world. 🚗🤖

Breaking news: Cosmos 3 is here. They are attempting to do something completely new 🤯 Why is Physical AI much harder than building a chatbot? Understanding the world is not enough, robots need to predict it and act inside it. That's the idea behind NVIDIA Cosmos 3: → Reasoning model understands what's happening from video, images, text, and actions. → World model generates future states of the environment. → Action model generates the actions needed to achieve a goal. Previous systems often stitched these capabilities together using separate models. Cosmos 3 combines them into a single architecture with two components: ▪️ Reasoner Tower Analyzes observations and builds an understanding of objects, motion, interactions, and physical context. ▪️ Generator Tower Uses that understanding to generate future videos and action sequences that obey physical constraints. So Cosmos 3 moves from: Perception → Model A Prediction → Model B Actions → Model C to: Perception + Prediction + Actions → One unified system. The goal is to make robots, autonomous vehicles, and smart environments better at answering three questions: 1. What is happening? 2. What will happen next? 3. What should I do? That's a big shift from today's AI, which mostly focuses on generating text. And check the benchmarks! Physical AI needs to generate decisions that survive contact with the real world. 🚗🤖

Turing Post

13,755 görüntüleme • 1 ay önce

Learn a development pattern to systematically improve the accuracy and reliability of LLM applications in our new short course, Improving Accuracy of LLM Applications, built in partnership with Lamini and Meta, and taught by Lamini’s CEO Sharon Zhou, and Meta’s Senior Director of Partner Engineering, Amit Sangani. (Disclosure: I am an investor in Lamini.) The path to tuning an LLM application can be complex. In this course, you'll learn a systematic sequence of steps for improving accuracy by reducing hallucinations: - Create an evaluation dataset to measure model accuracy - Add prompt engineering and self-reflection - Fine-tune your model including "memory-tuning" which is a new method of embedding facts in an LLM Using the Llama 3-8B parameter model, you will: - Build a text-to-SQL agent with a custom schema and simulate situations where it hallucinates - Understand the difference between instruction fine-tuning, which gives pre-trained LLMs instructions to follow, and memory fine-tuning - See how Performance-Efficient Fine-tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) reduce training time by 100x and Mixture of Memory Experts (MoME) reduces it even further I appreciate Meta releasing the Llama's family of open models -- this course gives an example of the unique type of work that developers can do with such models. Please sign up here:

Learn a development pattern to systematically improve the accuracy and reliability of LLM applications in our new short course, Improving Accuracy of LLM Applications, built in partnership with Lamini and Meta, and taught by Lamini’s CEO Sharon Zhou, and Meta’s Senior Director of Partner Engineering, Amit Sangani. (Disclosure: I am an investor in Lamini.) The path to tuning an LLM application can be complex. In this course, you'll learn a systematic sequence of steps for improving accuracy by reducing hallucinations: - Create an evaluation dataset to measure model accuracy - Add prompt engineering and self-reflection - Fine-tune your model including "memory-tuning" which is a new method of embedding facts in an LLM Using the Llama 3-8B parameter model, you will: - Build a text-to-SQL agent with a custom schema and simulate situations where it hallucinates - Understand the difference between instruction fine-tuning, which gives pre-trained LLMs instructions to follow, and memory fine-tuning - See how Performance-Efficient Fine-tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) reduce training time by 100x and Mixture of Memory Experts (MoME) reduces it even further I appreciate Meta releasing the Llama's family of open models -- this course gives an example of the unique type of work that developers can do with such models. Please sign up here:

Andrew Ng

66,435 görüntüleme • 1 yıl önce

As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models. Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more — with broad coverage across different motion levels. Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects. To enable fair and easy comparison to our models for future works, these new benchmarks include non cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair & extensive evaluations in media generation research to enable greater progress in this field.

As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models. Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more — with broad coverage across different motion levels. Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects. To enable fair and easy comparison to our models for future works, these new benchmarks include non cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair & extensive evaluations in media generation research to enable greater progress in this field.

AI at Meta

156,273 görüntüleme • 1 yıl önce

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and producing professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and producing professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,706 görüntüleme • 11 ay önce

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution paper page: Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution paper page: Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

AK

32,849 görüntüleme • 2 yıl önce

Yann LeCun (Yann LeCun ) explains why LLMs are so limited in terms of real-world intelligence. Says the biggest LLM is trained on about 30 trillion words, which is roughly 10 to the power 14 bytes of text. That sounds huge, but a 4 year old who has been awake about 16,000 hours has also taken in about 10 to the power 14 bytes through the eyes alone. So a small child has already seen as much raw data as the largest LLM has read. But the child’s data is visual, continuous, noisy, and tied to actions: gravity, objects falling, hands grabbing, people moving, cause and effect. From this, the child builds an internal “world model” and intuitive physics, and can learn new tasks like loading a dishwasher from a handful of demonstrations. LLMs only see disconnected text and are trained just to predict the next token. So they get very good at symbol patterns, exams, and code, but they lack grounded physical understanding, real common sense, and efficient learning from a few messy real-world experiences. --- From 'Pioneer Works' YT channel (link in comment)

Yann LeCun (Yann LeCun ) explains why LLMs are so limited in terms of real-world intelligence. Says the biggest LLM is trained on about 30 trillion words, which is roughly 10 to the power 14 bytes of text. That sounds huge, but a 4 year old who has been awake about 16,000 hours has also taken in about 10 to the power 14 bytes through the eyes alone. So a small child has already seen as much raw data as the largest LLM has read. But the child’s data is visual, continuous, noisy, and tied to actions: gravity, objects falling, hands grabbing, people moving, cause and effect. From this, the child builds an internal “world model” and intuitive physics, and can learn new tasks like loading a dishwasher from a handful of demonstrations. LLMs only see disconnected text and are trained just to predict the next token. So they get very good at symbol patterns, exams, and code, but they lack grounded physical understanding, real common sense, and efficient learning from a few messy real-world experiences. --- From 'Pioneer Works' YT channel (link in comment)

Rohan Paul

639,483 görüntüleme • 4 ay önce

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,088 görüntüleme • 1 yıl önce

Welcome to the 3D Layer of the Internet We are thrilled to share our latest features Prompt-to-3D engine and Scene Studio Your text based prompts and photos can now be transformed from simple concepts into fully-rendered, interactive 3D models. It parses complex topology, handles multi-view reconstruction, and even lets you restyle your model instantly with single-click filters like low-poly, voxel, and Voronoi shells. Within the all-new Scene Studio, you can construct immersive 3D environments that reflect your exact style. Position your assets in a full sandbox space, adjust advanced lighting and environment colors, and add the finishing touches with custom camera angles, fog settings, and real-time script parameters. This is the future of the open web. This is

Welcome to the 3D Layer of the Internet We are thrilled to share our latest features Prompt-to-3D engine and Scene Studio Your text based prompts and photos can now be transformed from simple concepts into fully-rendered, interactive 3D models. It parses complex topology, handles multi-view reconstruction, and even lets you restyle your model instantly with single-click filters like low-poly, voxel, and Voronoi shells. Within the all-new Scene Studio, you can construct immersive 3D environments that reflect your exact style. Position your assets in a full sandbox space, adjust advanced lighting and environment colors, and add the finishing touches with custom camera angles, fog settings, and real-time script parameters. This is the future of the open web. This is

three.ws

37,254 görüntüleme • 1 ay önce