We've released the code for LegoGPT. This autoregressive model generates physically stable and buildable designs from text prompts, by integrating physics laws and assembly constraints into LLM training and inference. This work is led by PhD students Ava Pun, Kangle Deng, Ruixuan Liu, and in collaboration with CMU faculty...

Jun-Yan Zhu

13,380 subscribers

38,584 views • 1 year ago • via X (Twitter)

Art, Science, Technology, Education

7 comments

Or Patashnik • 1 year ago

@AvaLovelace0 @kangle_deng Wow, really cool!

Rainmaker • 2 years ago

Here I share an XGBoost model that delivers a 25% CAGR with minimal drawdown on Visa stock. In this free Substack post I share code and commentary for a powerful Machine Learning strategy that delivers powerful returns.

Jason Liu • 1 year ago

@AvaLovelace0 @kangle_deng Awesome project 👍🏼. Some designs may require other orientation than from the ground up. I’m excited to learn about this!

Redcrown • 1 year ago

@AvaLovelace0 @kangle_deng woah, this is soo cool

Ant A • 1 year ago

@AvaLovelace0 @kangle_deng So just GenAI every step/layer?

Aiden • 1 year ago

@AvaLovelace0 @kangle_deng Super interesting project! We're also big believers in using natural language to create. With jenova ai, anyone can build their own custom AI apps just by describing what they need.

Max Zhaoshuo Li 李赵硕 • 1 year ago

@AvaLovelace0 @kangle_deng Very interesting work! Congrats!

Related videos

LegoGPT, an LLM-based system that generates physically stable LEGO structures from text prompts, backed by a new 47,000+ sample dataset and physics-aware filtering during inference.
→ LegoGPT is trained on a custom dataset, StableText2Lego, which includes 47,000+ 3D LEGO models mapped to text, spanning 28,000+ unique objects.
→ The model predicts LEGO bricks sequentially like tokens, using next-token prediction in a transformer setup.
→ To ensure physical stability, LegoGPT integrates physics-aware rollback and validity filtering, pruning out structurally invalid brick placements.
→ The generated designs are aesthetically aligned with prompts, physically buildable, and tested both with human manual assembly and robotic arms.
→ The team also introduced a text-driven LEGO coloring/texturing pipeline, enabling more expressive and customized outputs.
→ The dataset, code, and models are all publicly released under an open-access license.

Rohan Paul

75,248 views • 1 year ago
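
The brick-by-brick generation, validity filtering, and rollback summarized in the post above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the LegoGPT authors' code: the `Brick` type, the 20-stud grid, the `sample_next` callable standing in for the language model, and the ground-connectivity check standing in for a real physics analysis are all hypothetical names and simplifications introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

GRID = 20  # assumed build volume in studs (hypothetical value)


@dataclass(frozen=True)
class Brick:
    """A one-layer-tall brick: corner (x, y), layer z, footprint w by d studs."""
    x: int
    y: int
    z: int
    w: int
    d: int

    def cells(self) -> Set[Tuple[int, int]]:
        # Stud positions covered by the brick's footprint.
        return {(self.x + i, self.y + j) for i in range(self.w) for j in range(self.d)}


def is_valid(brick: Brick, placed: List[Brick]) -> bool:
    """Validity filter: inside the grid and not colliding with an existing brick."""
    in_bounds = (
        brick.w > 0 and brick.d > 0
        and 0 <= brick.x and brick.x + brick.w <= GRID
        and 0 <= brick.y and brick.y + brick.d <= GRID
        and 0 <= brick.z < GRID
    )
    if not in_bounds:
        return False
    return all(b.z != brick.z or not (b.cells() & brick.cells()) for b in placed)


def touches(a: Brick, b: Brick) -> bool:
    # Two bricks connect if they sit on adjacent layers and share a stud.
    return abs(a.z - b.z) == 1 and bool(a.cells() & b.cells())


def first_unstable(placed: List[Brick]) -> Optional[int]:
    """Toy stability proxy (assumption): a brick is stable if stud contacts
    link it to the ground layer. Returns the index of the first brick that
    is not linked, or None if every brick is."""
    grounded = {i for i, b in enumerate(placed) if b.z == 0}
    frontier = list(grounded)
    while frontier:
        i = frontier.pop()
        for j, other in enumerate(placed):
            if j not in grounded and touches(placed[i], other):
                grounded.add(j)
                frontier.append(j)
    return next((i for i in range(len(placed)) if i not in grounded), None)


def generate(sample_next: Callable[[List[Brick]], Optional[Brick]],
             max_bricks: int, max_steps: int = 1000) -> List[Brick]:
    """Sample bricks one at a time; reject invalid ones; when the sequence
    ends, roll back to just before the first unstable brick and resume."""
    placed: List[Brick] = []
    for _ in range(max_steps):
        brick = sample_next(placed) if len(placed) < max_bricks else None
        if brick is not None:
            if is_valid(brick, placed):  # otherwise: rejected, sample again
                placed.append(brick)
            continue
        bad = first_unstable(placed)
        if bad is None:
            return placed
        placed = placed[:bad]  # rollback
    raise RuntimeError("no stable design found within max_steps")


# Scripted stand-in for the model: None marks "end of sequence".
script = iter([
    Brick(0, 0, 0, 2, 4),    # on the ground: kept
    Brick(1, 1, 0, 2, 2),    # collides with the first brick: rejected
    Brick(10, 10, 3, 2, 2),  # valid placement but floating: rolled back later
    None,
    Brick(0, 0, 1, 2, 2),    # rests on the first brick: kept
    None,
])
design = generate(lambda placed: next(script), max_bricks=10)
print(design)
```

The split between the two checks is the point of the sketch: the cheap validity test runs on every sampled brick, while the more expensive whole-structure check runs only when a sequence ends and triggers a rollback rather than a full restart.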

[1/2] We’ve released the code for #pix2pixturbo and #CycleGANTurbo. These conditional GANs are able to adapt a text-to-image model such as SD-Turbo for both paired and unpaired image translation with a single step (0.11 sec on A100 and 0.29 sec on A6000). Try our code and the Gradio demo. Paper: Code: Demo: This is a joint work with Gaurav Parmar (the leading author), Taesung Park, and Srinivasa Narasimhan. This work shows that a pre-trained one-step model can be easily adapted to conditional GANs frameworks for downstream image editing and synthesis tasks. #Edges2Cats

Jun-Yan Zhu

36,473 views • 2 years ago

Check this!! You can get LLM-ready text from ANY website in 2 mins. Using Firecrawl's /llmstxt endpoint, transform ANY website into clean LLM-ready text—by just specifying a URL. Use this data for RAG, training LLMs, and more. Everything is just 5 lines of code!

Avi Chawla

12,936 views • 1 year ago

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,415 views • 2 years ago

Small prototype with AI + generative sketching workflow. Sketches are written as usual in code, and prompts can be used to augment/modify the artwork's inputs and parameters. 🤖 This is using OpenAI API with GPT 3.5, already showing a surprisingly good grasp of color for an LLM.

Matt DesLauriers

22,320 views • 2 years ago

🚀 We open-sourced LongLive: interactive, real-time long-video generation.
👥 Generates video in real time as users enter text prompts.
⚡️ 20.7 FPS on a single H100, ⏱️ up to 240s per clip.
🎬 Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators.
🌍 One step closer to World Models.
All code for training & inference, model weights, demo page, and videos released! Paper: Code: Model: Demo Page: Introduction Video:

Yukang Chen

11,752 views • 8 months ago

#SIGGRAPH2024 🥳🥳I've open-sourced the code for our Vertex Block Descent (VBD) paper as a part of Gaia: This is a collaborative work with Ziheng Liu. Thanks for your interest in VBD and Gaia! I released it ahead of schedule for you!

Dr. Anka He Chen

29,368 views • 2 years ago

This week, Grounding DINO 1.5 was released. It is a new model that uses text prompts to detect objects in videos and images in real time. Examples & demo to try below:

Allen T.

56,013 views • 2 years ago

i'm a little sick of chatgpt giving me obviously broken code. i've found a "micro agent" approach to LLM code generation can work much better: the LLM first generates a *test*, and then enters a loop where it generates and iterates on the code until the tests pass. source below

Steve (Builder.io)

544,125 views • 2 years ago
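
The test-first loop described in the post above can be sketched as follows. This is a minimal illustration, not the linked project's implementation: `write_test` and `write_code` are hypothetical callables standing in for LLM calls, and the two `fake_*` functions are scripted stubs so the example runs without any model. Running generated code with `exec` is unsafe outside a sandbox; it is used here only to keep the sketch self-contained.

```python
from typing import Callable, Optional, Tuple


def run_test(code: str, test: str) -> Optional[str]:
    """Execute the candidate code, then the test, in one namespace.
    Returns None if the test passes, otherwise a short error message.
    Assumption: real use would run this in an isolated sandbox."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def micro_agent(
    task: str,
    write_test: Callable[[str], str],
    write_code: Callable[[str, str, Optional[str]], str],
    max_iters: int = 5,
) -> Tuple[str, int]:
    """Generate a test once, then regenerate code until the test passes.
    The previous failure message is fed back into each new attempt."""
    test = write_test(task)
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        code = write_code(task, test, feedback)
        feedback = run_test(code, test)
        if feedback is None:
            return code, attempt
    raise RuntimeError(f"test still failing after {max_iters} attempts: {feedback}")


# Scripted stand-ins for the LLM (hypothetical, for illustration only).
def fake_write_test(task: str) -> str:
    return "assert add(2, 3) == 5"


def fake_write_code(task: str, test: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "def add(a, b):\n    return a - b"  # first draft has a bug
    return "def add(a, b):\n    return a + b"      # revised after seeing the failure


code, attempts = micro_agent("add two numbers", fake_write_test, fake_write_code)
print(attempts)
print(code)
```

Fixing the test before any code is written gives the loop a stable target, and capping the number of attempts keeps a model that never converges from looping forever.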

Interested in hippocampal dynamics and their interactions with cortical rhythms? Thrilled to see this work led by RichaPhogat hit the public domain - see below for a walk through of the theory, implementation, validation and code!!

Michael Breakspear

14,895 views • 9 months ago

Wow! AI ASSISTED GARAGE MANUFACTURING IS ABOUT TO EXPLODE! CAD Drawings From Just A Picture! MIT just released something profound for creators and engineers alike. Picture this. You take a photo of an object, upload it, and an AI delivers a fully parametric CAD model, complete with editable code and construction history. This is open source GenCAD, from MIT's Decode Lab. It uses autoregressive transformers and diffusion models, trained on hundreds of thousands of images and CAD files. Input a 2D photo or sketch. Output valid CadQuery Python code that beats models like GPT-4.5 in accuracy. Why does this matter? It speeds up reverse engineering, prototyping, and part searches in vast databases. No more hours spent modeling from scratch. Field repairs, custom designs, education, all transformed. It even retrieves similar parts from libraries of thousands. For industries like manufacturing and aerospace, it cuts costs and boosts innovation. Hobbyists gain pro tools without the steep curve. I am testing it now on random objects and can not believe how much of a super power this is. I can start dozens of companies just on this AI model. This open-source gem is here: The future of building stuff arrives in a snapshot.

Brian Roemmele

121,815 views • 3 months ago

This is a pretty wild model! You can use it to turn an image into a 3D object with texture. The quality is out of this world! I'm not even a designer, and I've been using this nonstop for the last 2 hours. The model is Hunyuan 3D 2.1. It's open source. You'll find model weights, training/inference code, data pipelines, and architecture on their repository. You can even fine-tune it if you want! GitHub Repository: By the way, the model runs on consumer-grade GPUs. You don't need a datacenter for this! I've been using the model from the HuggingFace demo page: To use it, go to the link and upload an image. That's it! Check out the video I recorded for a couple of examples.

Santiago

44,783 views • 1 year ago

This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI 🚀
- Cosmos 3 is an omnimodal world model: within a unified architecture, it can understand and generate language, images, video, audio, and actions.
- It is not just a VLM, not just a video generator, not just an audio-visual generative model, and not just a physics simulator / world-action model. It can understand images and videos, generate images, videos, and audio, simulate future worlds, predict actions, and generate robot policies, enabling models to truly begin to “touch the world.”
- Cosmos 3 is the #1 open-weight reasoner / T2I / I2V / robot policy across many benchmarks.
Huge thanks to every teammate who fought side by side on this journey, from architecture, data, training, infra, serving, and evaluation to post-training. Every part of this project carries an incredible amount of hard work. This was my first time leading a project as Tech Lead, and I feel truly fortunate.
The future of Physical AI needs models that can not only “see” and “describe” the world, but also “imagine,” “simulate,” and “act,” and eventually close the loop with the real world. I hope Cosmos 3 can become an important starting point for this direction, and I’m excited to push Physical AI into its next stage together with the open-source community. Welcome to the era of Physical AI.
HuggingFace: Project Website: Code:

Max Zhaoshuo Li 李赵硕

1,072,781 views • 13 days ago

We've been ramping up usage of AI tools on our design team at Coinbase 🛡️ Two examples:
1. Write a text prompt and get a figma mockup as a starting point for your design
2. Click a button and turn any figma design into front-end code
The front-end code it generates adheres to our design system, and is trained on our library of UIs. Shout out to 🛡️ tali krakowsky apel🛡️ Blair McKee and the team pushing this forward.

Brian Armstrong

520,346 views • 2 years ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 3 months ago

AN OXFORD STUDENT IS RUNNING A PARTICLE SIMULATION WITH REAL PEOPLE'S NAMES AND CLAIMS CERN IS TAUNTING HIM THROUGH THE CODE
Thousands of particles on a black screen - each one labeled with a real person's name - moving according to the laws of physics in real time and he is completely convinced this is not a simulation but a personal message from CERN directed at him specifically. Particle simulation with collision detection, velocity vectors and brownian motion - technically flawless code that tracks every particle individually and renders trajectories at 60 fps. CERN operates a 17km collider that accelerates protons to 99.9999991% the speed of light and generates a petabyte of data every single day - and apparently found the time to encode Oxford student names into a simulation. The code is real. The physics is correct. The conclusions are a separate conversation.

Noisy

5,682,355 views • 20 days ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms.
This agent actually uses *three* NVIDIA open source models:
- Nemotron Speech ASR
- Nemotron 3 Nano 30GB in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model
These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)
The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,188 views • 5 months ago

Boom! Open source LegoGPT is a building AI, and sure it can be used for Legos but it can also be used for Lego-like building of homes. LegoGPT converts meshes to Lego in one step using 1×1, 1×2, 1×4, 1×6, 1×8, 2×2, 2×4, and 2×6 bricks. Then they evaluate the stability of the design. Finally, they render an image and ask GPT-4o to produce captions to go with the image. This is sent to robots and they complete the real-time build. This can absolutely scale and as with Legos, you can use smaller pieces for higher resolution. Testing it in detail now with robot assembly. Link:

Brian Roemmele

22,363 views • 1 year ago

🔥 DreamEngine revolutionizes image generation with its text-guided object fusion capabilities! The demo and code for Text Guided Object Fusion are released! Let's unlock the Imaginations! Run it locally now in: Paper:

Liang Chen

11,357 views • 1 year ago