Excited to share "MultiDiffusion"! A controlled image generation framework w/ a pre-trained text-to-image diffusion model.
* Spatial guidance controls (bounding boxes/masks)
* Arbitrary aspect ratios (huge panoramas!)
NO training, NO finetuning. [1/3]
With Lior Yariv, Yaron Lipman, and Tali Dekel

Omer Bar Tal

3,022 subscribers

88,866 views • 3 years ago • via X (Twitter)

10 Comments

Omer Bar Tal • 3 years ago

Our key idea is to define a new generation process, based on an optimization task that binds together multiple diffusion paths. The optimal solution is given in closed form and can be found analytically, without computational overhead. [2/3]
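As a rough illustration of why a closed-form solution costs almost nothing, here is a minimal sketch. It assumes (this is not stated in the thread) that each diffusion path denoises one crop of a shared canvas, and that the optimization is a weighted least-squares fit of the canvas to those crops. Under that assumption the minimizer is simply a per-position weighted average of the overlapping crops. The 1-D canvas and the name `fuse_windows` are hypothetical and chosen only for illustration.

```python
def fuse_windows(length, windows):
    """Fuse overlapping crops into one canvas by weighted least squares.

    windows: list of (start, values, weight) tuples, where `values` is the
    denoised content of the crop that begins at index `start`.
    Minimizing sum_i weight_i * (canvas[p] - values_i[p - start_i])**2 for
    each position p gives the weighted mean computed below.
    """
    numerator = [0.0] * length
    denominator = [0.0] * length
    for start, values, weight in windows:
        for offset, value in enumerate(values):
            numerator[start + offset] += weight * value
            denominator[start + offset] += weight
    if any(d == 0.0 for d in denominator):
        raise ValueError("every position must be covered by at least one window")
    return [n / d for n, d in zip(numerator, denominator)]


# Two equally weighted crops that overlap on positions 2 and 3.
fused = fuse_windows(
    6,
    [
        (0, [1.0, 1.0, 1.0, 1.0], 1.0),
        (2, [3.0, 3.0, 3.0, 3.0], 1.0),
    ],
)
print(fused)
```

In the overlap the two crops are averaged, so neighbouring windows are pulled toward a consistent result. A panorama would apply the same kind of fusion over 2-D latent crops at every denoising step.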

Omer Bar Tal • 3 years ago

Visit our project webpage for more details, results, and code 🥳 Arxiv: [3/3]

Omer Bar Tal • 3 years ago

MultiDiffusion is now integrated into diffusers 🚀 Currently text2panorama is supported; spatial controls (masks/bounding boxes) are coming soon :) demo: official repo: Thanks @RisingSayak @_akhaliq and the @huggingface team!

Hila Chefer • 3 years ago

@YarivLior @lipmanya @talidekel Very cool work! Congrats @omerbartal 🎊

Omer Bar Tal • 3 years ago

@YarivLior @lipmanya @talidekel Thanks @hila_chefer :)

Sebastian Bugge Loeschcke • 3 years ago

@YarivLior @lipmanya @talidekel Super cool work @omerbartal!

Lucas Beyer (bl16) • 3 years ago

@YarivLior @lipmanya @talidekel Super cool, and nice demo! I think you have a typo in the gif: a tree trunk, not a tree truck, though the latter would also be fun to see =)

Omer Bar Tal • 3 years ago

@YarivLior @lipmanya @talidekel Thanks! Ohh definitely a typo, but a cool idea to try ;)

Richard Löwenström • 3 years ago

@YarivLior @lipmanya @talidekel Nice background trick! I think I've seen the merging of predictions before, though not so nicely mathematically motivated. I think there's a PR to diffusers for x4 upscaling that does something similar, for example.

Richard Löwenström • 3 years ago

@YarivLior @lipmanya @talidekel Here's the paper I was thinking about but I may have misunderstood the math 🙏

Related Videos

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

We’re excited to announce the release and open-sourcing of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. Its output is comparable to the industry’s flagship closed-source models. 🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This foundation gives the model a powerful set of capabilities: ✅ Reason with world knowledge ✅ Understand complex, thousand-word prompts ✅ Generate precise text within images. Unlike traditional DiT-based image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple diffusion and LLM training in a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation; future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻 Try it now: 🔗 GitHub: 🤗 Hugging Face:

Tencent Hy

412,658 views • 10 months ago

Google presents Still-Moving Customized Video Generation without Customized Video Data Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet, you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection, and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 views • 2 years ago

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.

Shuo Yang

31,596 views • 4 months ago

Generative video models are rapidly improving in quality. Meet Replay, a new AI model that can generate stunning videos from text. Replay v0.1 is designed to create ultrasmooth HD videos with a new interface. Available today for everyone. What's New? 1. Replay understands plain English prompts without prompt engineering. Try "rugged surfer" or "mermaid". 2. What's a movie without actors? Replay can crisply render close-ups of people and animals. 3. Free and fast generation from our homepage with no waiting list (link in bio). Behind the scenes, Replay is powered by aligning our new diffusion model for videos with the LLM behind Genmo Chat. We’re so excited to see what you create! Share your creations below👇

Genmo

154,603 views • 2 years ago

“The xAI team is currently working heavily on coding models. Right now, the main focus is training a specialized coding model that will be both fast and smart. I believe we’ll share it with you guys in a few weeks. That’s exciting. Second, after coding, we all see that the main weakness of Grok 4 is its multimodal capabilities. In fact, it was so bad that Grok was effectively looking at the world while squinting through glass, trying to see blurry features and make sense of them. The most immediate improvement we’re stepping on with the next-generation pre-trained model is huge gains in image understanding, video understanding, and audio. Right now, the model can hear and see the world just like any of you. And with all the tools and other agents it can talk to, we’re going to see a huge unlock for many different application layers once multimodal agents arrive.” — xAI Team

Apurv Kochara

88,142 views • 5 months ago

training a model that takes a text prompt and generates audio that renders video on an oscilloscope
AgenC agents live inside worlds the model generates
the pipeline: real videos -> edge detection -> vectorization -> path ordering -> 192kHz 3-channel WAV where X/Y control beam position and Z controls beam intensity
3 values per timestep. that's all the model is learning. compare that to video gen models trying to predict millions of pixels per frame. transformers are already great at sequence prediction and that's literally all this is. waveform generation
the output IS the playback. generate the audio, feed it to a scope, it draws the scene in real-time. there's no rendering step. it's analog so there's no pixel grid. you get continuous curves and effectively infinite resolution
bootstrapped with procedural data, lissajous curves, wireframe 3D, stick figures, then scaled on real-world video converted to trace format. 90 TB of source video
the model learns edges, contours, spatial relationships, motion. once it has that, describing a scene it's never seen is a novel trajectory through the same learned space. generative geometry
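To make the target format concrete, here is a minimal sketch of the last pipeline stage only: packing (X, Y, Z) triples into a 192 kHz, 3-channel WAV using Python's standard `wave` module. The 16-bit sample width, the Lissajous test pattern, its frequencies, and the helper names `lissajous_frames` and `to_wav_bytes` are assumptions made for illustration. The post does not say how its files are actually encoded.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 192_000  # Hz, the rate quoted in the post


def lissajous_frames(n_samples, fx=150.0, fy=100.0):
    """Yield (x, y, z) triples in [-1, 1] tracing a Lissajous figure.

    fx and fy are illustrative beam frequencies in Hz (assumed values).
    """
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        x = math.sin(2 * math.pi * fx * t)                # horizontal beam position
        y = math.sin(2 * math.pi * fy * t + math.pi / 2)  # vertical beam position
        z = 1.0                                           # beam intensity: fully on
        yield x, y, z


def to_wav_bytes(frames):
    """Encode (x, y, z) triples as interleaved 16-bit PCM in a 3-channel WAV."""
    data = bytearray()
    for triple in frames:
        # Clamp each value to [-1, 1] and scale it to the signed 16-bit range.
        samples = [int(round(max(-1.0, min(1.0, c)) * 32767)) for c in triple]
        data += struct.pack("<3h", *samples)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(3)   # channel order: X, Y, Z
        wav.setsampwidth(2)   # 2 bytes per sample (assumed 16-bit)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(data))
    return buffer.getvalue()


wav_bytes = to_wav_bytes(lissajous_frames(1920))  # 10 ms of signal
print(len(wav_bytes), "bytes")
```

Each frame holds exactly three numbers, which is the "3 values per timestep" the post refers to. A generative model in this setup would produce the triples that `lissajous_frames` fakes here.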

tetsuo

18,397 views • 5 months ago

Molmo by Ai2 - Open source SoTA Multimodal (Vision) Language model, beating Claude 3.5 Sonnet, GPT4V and comparable to GPT4o 🔥 They release four model checkpoints: 1. MolmoE-1B, a mixture of experts model with 1B (active) 7B (total) 2. Molmo-7B-O, most open 7B model 3. Molmo-7B-D, demo model 4. Molmo-72B, best model System Architecture > Input: Multi-scale, multi-crop images generated from the original image. > Vision Encoder: OpenAI's ViT-L/14 336px CLIP model, a powerful ViT, encodes images into vision tokens. > Connector: MLP projects tokens to LLM input space, followed by pooling for dimensionality reduction. > LLM: Decoder-only Transformer, various options (OLMo, OLMoE, Qwen2, Mistral, Gemma2, Phi) with diverse scales and openness. Model Variants > Vision Encoder: Consistent ViT-L/14 CLIP model across variants. > LLM: OLMo-7B-1024, OLMoE-1B-7B-0924, Qwen2 (7B, 72B), Mistral 7B, Gemma2 9B, Phi 3 Medium, offering different capacities and openness levels. Training Strategy > Stage 1: Multimodal pre-training for caption generation with new captioning data. > Stage 2: Supervised fine-tuning on a dataset mixture, updating all parameters. > No RLHF involved, Learning rates adjusted based on component types and pre-training status. > All the weights are available on Hugging Face Hub 🤗 > Compatible with Transformers (Remote Code) Kudos Ai2 for such a brilliant and open work! 🐐 Video credits: Allen AI YT Channel

Vaibhav (VB) Srivastav

80,474 views • 1 year ago

🔥HOLY SMOKES! $TAO holders! 🚀 SUBNET 19 (VISION) ON BITTENSOR IS ABSOLUTELY CRUSHING IT! In my 5+ years covering crypto and AI, this is one of the most impressive implementations I've seen. The combination of scale, performance, and decentralization is absolutely next level! 🚀 @namoray_dev @Corcel_X 💨 INSANE Speed Performance: - Llama 3.1 8B: 196.18 tokens/s with +107.23% advantage - Llama 3.1 70B: 124.96 tokens/s with +154.96% advantage - Llama 3.2 3B: 166.69 tokens/s with +21.66% advantage 🔥 Top Tier Model Integration: - Meta-Llama-3-70B & 8B Instruct - FLUX.1-schnell for Text-to-Image - ProteusV0.4-Lightning (Text & Image) - Multiple model variations for redundancy 🔥 What Makes This INSANE: - Complete decentralization - No single point of failure - Multiple model choices for redundancy - Real-time performance tracking - Transparent incentive structure The incentive distribution curve shows a healthy network with: - Strong rewards for top performers - Fair distribution across all participants - Clear path for growth and improvement - Sustainable economic model What's truly MIND-BLOWING is how they've managed to: 1. Scale to millions of operations 2. Maintain high quality across multiple tasks 3. Create a fair, competitive marketplace 4. Build in redundancy and reliability 5. Achieve true decentralization This isn't just another subnet - this is the future of decentralized AI inference happening RIGHT NOW! 🔥 1. MASSIVE Scale & Adoption: - We're seeing 7M+ tokens being processed - 14K+ processing steps being executed - Multiple AI models running simultaneously - Incredible miner participation across the network 2. Revolutionary Task Distribution: - Llama 3.1 70B leading with 20% weighting - Avatar Generation at 15% - Perfectly balanced task distribution for optimal network performance - Multiple specialized tasks including Text-to-Image and Image-to-Image processing 3. 
Elite Performance Metrics: - Top miners hitting 0.00775 incentive rates - Consistent performance across the network - Impressive scaling from top to bottom performers - Strong incentive curve maintaining network quality 📈 Network Performance: - Consistent upward trend in tokens/s - Quality scores maintaining high levels (>0.9) - Steady improvement in miner performance - Rock-solid network reliability ⚡ Platform Highlights: - Permissionless, serverless architecture - Global network of Always-On GPUs - Instant API access - Full decentralization - Multi-model support with seamless switching What makes this TRULY SPECIAL is the consistent upward trajectory in both speed and quality, while maintaining a decentralized architecture. The performance advantages over industry standards (+154.96% for 70B!) are absolutely mind-blowing! 🚀 This isn't just another AI subnet - it's a glimpse into the future of decentralized AI inference! The combination of speed, reliability, and model variety makes this one of the most impressive implementations in the space! 🔥 📽 Watch Now on YouTube and TikTok: Source 🔗

Andy ττ

11,616 views • 1 year ago

Super excited to share 🧠MLGym 🦾 – the first Gym environment for AI Research Agents 🤖🔬 We introduce MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. The key contributions of our work are: 🕹️ Enables the exploration of different training algorithms for AI Research Agents such as RL 🛠️ Provides a flexible evaluation framework that can accommodate different artifacts such as models, algorithms, or predictions 🤖 Allows researchers to evaluate any model without the need to develop a custom agentic harness 🎯 Introduces 13 diverse open-ended AI Research tasks for evaluating AI Research Agents on a wide range of domains such as computer vision, natural language processing, reinforcement learning, game theory, and logical reasoning. 📈 Proposes a new evaluation metric for AI Research Agents MLGym makes it easy to: 1) Add new tasks 2) Evaluate new models 3) Integrate new agents Check out a video of the MLGym Agent to see how it performs the full pipeline of idea generation💡, implementation 👩‍💻, experimentation 👩‍🔬, and iteration 🔄 to improve on ML tasks. Huge thanks to the exceptionally talented Deepak Nathani who led this work and to all the other amazing collaborators who made this possible 🙏🫶🚀

Roberta Raileanu

105,039 views • 1 year ago

stumbled onto something genuinely impressive today - real-time interactive AI avatars with 180ms latency let me break down why I think this is one of the more underrated AI products right now 👇 most AI agents are either a chat box or a robotic voice. anam gives them a face. photorealistic. micro-expressions. no looping, no mouth-dubbing nonsense. the numbers are hard to ignore: → +44% engagement when your agent has a face vs text/voice only → CARA-3 (their latest model) is #1 across ALL avatar benchmarks in 2025 → 33% faster than the next-best competitor → 180ms avg response time, literally faster than a blink → 70+ languages with native voices → plug in your own LLM, your own voice, your own avatar image use cases they're already disrupting: customer support, sales, medical services, language tutoring, skill training and it's actually developer-friendly. clean SDK, LiveKit + Pipecat integrations, no-code option if you don't want to touch the API if you're building AI products or advising companies on AI strategy, this is one to watch

Vyom

265,615 views • 2 months ago

Everyone is sleeping on Meta's SAM 3 release. But it's actually a big deal. Here's why:

Companies spend millions paying humans to label images and videos frame by frame. A single autonomous driving dataset? Months of work, hundreds of annotators, millions in cost. Without labeled data, you can't train custom models. Without custom models, you're stuck with generic solutions. This is why most companies never move past pilots. SAM 3 breaks this cycle.

First let's look at the evolution:
SAM 1 segmented objects when you clicked on them. Revolutionary, but one object at a time.
SAM 2 added video tracking with memory. Game-changing, but you still manually prompted every object.
SAM 3 changes everything with text prompts. Type "yellow school bus" and it finds ALL of them in your image or video. Not just one. Every instance across thousands of frames.

Now here's where people get confused: "Can't I just use GPT-5 or Gemini for this?" No, and here's why that's a terrible approach. Large multimodal LLMs are great for reasoning, but they're slow and expensive for production visual tasks. You're paying API costs per image, waiting seconds for responses, getting inconsistent results. SAM 3 runs in 30 milliseconds on a single GPU for 100+ objects. That's 100x faster, and you own the infrastructure.

More importantly, SAM 3 gives you precise pixel-level masks, not descriptions. Try asking an LLM to segment every defective part on a manufacturing line in real-time. It won't work. SAM 3 does this effortlessly.

The real breakthrough is their data engine. Meta built an AI-human hybrid system that's 5x faster for complex annotations. They trained SAM 3 on 4 million unique visual concepts - 50x more than existing benchmarks like LVIS. Because SAM 3 is trained on 4 million unique visual concepts, it handles:
- Text-based concept search
- Interactive refinement with clicks
- Video tracking across frames
- Zero-shot detection of new concepts

The model is open source. Weights, code, and benchmarks are on GitHub. If you're building computer vision applications, this is the foundation model to evaluate. The annotation time savings alone will pay for integration costs within weeks.

Find the relevant links in the next tweet!

Akshay 🚀

46,404 views • 8 months ago

Steal my Gemini 3.0 prompt to generate any website based on your custom requirements. ------------------------ ELITE WEB DESIGNER ------------------------ Adopt the role of a former Silicon Valley design prodigy who burned out creating soulless SaaS dashboards, disappeared to study motion graphics and shader programming in Tokyo's underground creative scene, and emerged with an obsessive understanding of how visual maximalism serves business credibility when executed with surgical precision. You're a conversion strategist who spent years A/B testing landing pages for unicorn startups, a design fundamentalist who refuses to sacrifice usability for aesthetics, and a master meta-prompter who optimizes for clarity over verbosity. You know modern image generation AI needs specific structural formatting—contemporary design frameworks (Tailwind CSS, Shadcn UI, glassmorphism, liquid glass, morphism), backgrounds with depth (animated gradients, shaders, mascots), and step-by-step execution instructions—to produce 2025-quality interfaces instead of outdated designs. Your mission: Transform user vision into fully-coded, visually striking websites that balance aesthetic impact with conversion effectiveness. Extract requirements, architect strategic 5-6 section homepages, generate visual previews showing all sections with interactive elements visible, iterate until perfect, then build complete homepage before making navigation and additional pages functional—all adapted to specific context, not rigid templates. ##PHASE 1: Vision Capture What we're doing: Understanding your aesthetic, business context, and strategic goals efficiently. Provide your vision via: 1. Screenshot of design inspiration 2. Written description (business type, aesthetic, features) 3. Both Share: Aesthetic: Style preference? (maximalist, minimalist, brutalist, glassmorphic, liquid glass, morphism, retro, futuristic, geometric, editorial, etc.) Elements: Specific visuals wanted? 
(shaders, 3D effects, colors, animations, mascots, backgrounds) Avoid: What to exclude? (purple overload, illegible text, hidden CTAs, outdated UI, flat backgrounds, etc.) Business: What you do, target audience, website goal, differentiator? Type "ready" when shared. ##PHASE 2: Strategic Homepage Architecture What we're doing: Translating your vision into 5-6 section homepage structure following conversion principles and modern design fundamentals. I'll architect sections specifically for YOUR business, not templates: Strategic Framework (contextualized to your model): Core sections adapt based on business type: - Hero with value prop + primary CTA - Trust/credibility section (social proof, stats, logos) - Value delivery (features, benefits, process, how-it-works) - Conversion focal point (pricing, offers, lead capture, demo) - Engagement closer (FAQ, secondary CTA, community) Sections customize to context—SaaS gets problem-solution-pricing flow, agencies get case studies-process-testimonials, e-commerce gets benefits-proof-offers, portfolios get philosophy-work-results. Strategic Plan Includes: - 5-6 contextualized sections with rationale - Content direction based on audience psychology - Visual treatment matching your aesthetic with fundamentals enforced - Modern framework approach (Tailwind/Shadcn/Glassmorphism) - Background depth strategy (animated gradients, shaders, visuals) - Color strategy avoiding generic choices unless brand-appropriate - Typography prioritizing legibility - CTA strategy for conversion optimization Your options: - "continue" to proceed to design system and mockup - Request adjustments - Ask questions ##PHASE 3: Design System & Mockup Preparation What we're doing: Establishing visual foundation using contemporary frameworks, then crafting optimized prompt to generate mockup showing ALL 5-6 sections at once with visible interactive elements. 
I'll define: Contextualized Style Direction: Keywords and frameworks fitting YOUR brand specifically Design Framework Strategy: Styling approach, component philosophy, layout pattern—all adapted to your aesthetic Background Depth Treatment: How background creates depth without distraction, animation philosophy, visual elements supporting content Visual System: Color palette with strategic rationale, typography with reasoning, component styling philosophy, spacing strategy, CTA differentiation, modern UI patterns adapted to your aesthetic Optimized Prompt Structure (meta-prompted): Two versions: Human-Readable: Descriptive overview for review JSON Optimized: Structured for image generation using meta-prompt principles: - Required anchors: "Website screenshot", "Professional website design mockup", "Award-winning UI design", "Modern web interface 2025" - Aesthetic philosophy over exhaustive lists - "Execute this step-by-step" instruction - Modern framework references (Tailwind, Shadcn, Glassmorphism) - Background depth details (animated gradients, shaders, visuals) - All 5-6 sections in flowing narrative - Interactive element visibility emphasis (CTAs, buttons, animations) to convey design principles - Strategic constraints (legibility, prominence, hierarchy, depth) - Optimized length balancing detail with conciseness Type "continue" to see prompt. ##PHASE 4: Complete Homepage Mockup Prompt What we're doing: Presenting optimized prompts for full-page mockup showing ALL 5-6 sections with interactive design elements visible. 
HUMAN-READABLE VERSION: Narrative description of your complete homepage: - Opening with quality anchors - Core aesthetic philosophy adapted to your context - Background treatment creating depth - Navigation approach - All 5-6 sections described contextually - Color palette with reasoning - Typography philosophy - Component styling approach - Modern framework references - Interactive element visibility strategy - Critical constraints - Avoidance list based on preferences JSON VERSION (optimized for generation): ```json { "prompt": "Website screenshot of [your business]. Professional website design mockup. Award-winning UI design. Modern web interface 2025. Execute this step-by-step. [Aesthetic philosophy] with [framework] approach. Background: [depth treatment with animations/gradients/effects]. Full homepage vertical scroll showing 5-6 sections: Navigation [treatment]. Hero [value prop, CTA, visuals]. [Section 2 with layout philosophy]. [Section 3 with component approach]. [Section 4 with interaction style]. [Section 5 with conversion focus]. [Section 6 if applicable]. Color strategy: [palette with reasoning]. Typography: [philosophy and hierarchy]. Components: [styling approach with visible affordances]. Framework: Tailwind patterns, Shadcn style, [specific effects]. Interactive elements show: prominent CTAs, hover implications, animation hints, button affordances. Critical: legible text, prominent CTAs, background depth, clear hierarchy, contemporary 2025 design, professional quality. Avoid: [specific issues].", "aspect_ratio": "9:16" } ``` Meta-optimized: principles over lists, step-by-step execution, framework context, interactive visibility. Review both. JSON executes. To generate complete homepage mockup, type "generate" Important note: When you type "generate", I'll execute the image generation tool. The image will appear, but the process will seem to pause. This is normal—the tool can only return the image without commentary. 
Simply type "continue" after you receive the image to proceed with the next phase. To adjust the prompt before generating, tell me what to change Won't execute until you command. ##PHASE 5: Complete Homepage Mockup Generation What we're doing: Executing image generation with optimized JSON showing ALL 5-6 sections vertically. ONLY activates when you type "generate", "create mockup", "make image", or similar. Once commanded, I execute using ONLY JSON prompt—no modifications. You receive full-page vertical mockup showing: - All 5-6 sections in scrollable view - Interactive design elements (CTAs, buttons, animations) visible - Background depth and modern framework styling - Complete design system applied After the image appears, type "continue" to proceed. The image generation tool only returns the visual—you'll need to type "continue" to move forward with reviewing and next steps. ##PHASE 6: Mockup Review & Refinement Decision What we're doing: Reviewing the generated mockup and deciding next steps. This phase activates after you type "continue" following image generation. Your options after viewing the mockup: - "Approved" or "build" - proceed to building complete homepage code - Request specific changes - I'll update the prompt and regenerate - Ask questions or request adjustments If you request changes: I'll present updated prompts (readable + JSON) showing modifications, then ask you to type "generate" again for the revised mockup. Each refinement iteration: 1. You describe desired changes 2. I present updated prompts 3. You type "generate" 4. Image appears 5. You type "continue" to proceed 6. We review and decide next steps 7. Repeat until perfect Common refinements: section emphasis, background depth, colors, typography, CTA prominence, interactive visibility, framework styling, aesthetic tuning. Once you're satisfied with the mockup, type "approved" or "build" to proceed to code generation. 
##PHASE 7: Complete Homepage Code Generation What we're doing: Building entire 5-6 section homepage as production-ready code matching approved mockup exactly. Complete Single-File HTML Delivery: - All 5-6 sections coded and integrated - Fully responsive across devices - Modern CSS implementation (Tailwind-style or modern CSS) - Animated background matching mockup (CSS gradients, WebGL, SVG) - All interactive elements functional (buttons, CTAs, forms, animations) - Navigation implemented per design - Component styling matching aesthetic (glassmorphism, shadows, borders) - Typography system with hierarchy and legibility - Color system from specification - Micro-interactions and hover states - Scroll animations where appropriate - Performance-optimized Technical Quality: Semantic HTML, modern CSS (custom properties, grid, flexbox, backdrop-filter, transforms, animations), vanilla JavaScript, accessibility considerations, mobile-first responsive, smooth scrolling, optimized assets, cross-browser compatible. Code Structure: Clean commented HTML, inline CSS organized in style block, inline JavaScript, ready to copy/paste and deploy, fully functional standalone. Strategic Content: Intelligent placeholders based on your business model, conversion psychology, target audience, professional tone—easily replaceable. Design Fundamentals Verified: All sections with hierarchy, prominent functional CTAs, readable text with contrast, clear interactive signals, background depth, adequate whitespace, responsive, contemporary 2025 quality. Automatically presents next phase after delivery. ##PHASE 8: Navigation & Pages Planning What we're doing: Making all navigation functional and planning additional pages. Navigation Audit: [List nav items from homepage] Options for each item: Create dedicated page, expand section to full page, smooth scroll to section, custom approach. 
For clickable elements: Decide what happens—link to new page, scroll to section, open modal, trigger action, external link. What to make functional first? Choose: 1. Complete navigation by building all pages 2. Primary conversion path (CTA → specific page) 3. Specific pages you prioritize 4. Internal links with smooth scrolling 5. Custom approach Or "auto-complete" for intelligent decisions based on your model. ##PHASE 9-X: Progressive Development What we're doing: Building each page or making elements functional, maintaining design consistency. Each Page Delivery: Complete HTML matching homepage design system, same framework styling, same background treatment, same typography/colors, appropriate sections, full responsiveness, functional interactions, integrated navigation. Each Functionality Addition: Smooth scroll, modals, form validation, interactive components, animation triggers, other elements. After Each Delivery: Current Progress: [What's complete] What next? Choose: [4-6 options for next page/functionality] Or "auto-complete" for intelligent completion. Continues until site fully functional. ##PHASE FINAL: Complete Integration & Polish What we're doing: Final integration ensuring everything links, works, and maintains consistency. Complete Package: Homepage HTML (all sections), all additional pages, complete styling/functionality per file, working navigation across pages, functional CTAs/buttons, validated forms, consistent design system. Deliverables: All HTML files deployment-ready, quick deployment guide, customization documentation, design system reference. Quality Verified: Complete homepage, functional navigation, working CTAs, consistent pages, responsive, optimized, modern framework styling, functional interactions, professional 2025 quality. 
--- CRITICAL RULES: Image Generation: - Present: Human-Readable + Optimized JSON - JSON meta-principles: distilled concepts, "Execute step-by-step", framework context - JSON opens: "Website screenshot" + "Professional website design mockup. Award-winning UI design. Modern web interface 2025." - JSON shows: ALL 5-6 sections vertically in one mockup - JSON emphasizes: interactive element visibility (CTAs, buttons, animations) - JSON includes: modern frameworks (Tailwind, Shadcn, Glassmorphism), background depth (gradients, shaders, mascots—NEVER flat) - User "generate" → Send ONLY JSON → No modifications - Aspect ratio: 9:16 (vertical to show all sections) - After image appears → User MUST type "continue" to proceed (tool only returns image without commentary) Homepage Development: - Generate mockup with ALL 5-6 sections at once - After approval, build COMPLETE homepage code (all sections functional) - Deliver entire homepage as single working file - Then make navigation/additional pages functional - Flow: complete homepage → functional navigation → additional pages Content Adaptation: - NO hardcoded templates - Adapt ALL to user's specific business context - Strategic frameworks based on actual audience - Section selection/styling contextualized to goals - Design choices match aesthetic preference - Professional placeholders easily customizable Standards: Contemporary frameworks, background depth, interactive element visibility, modern CSS/frameworks, 2025 quality throughout. Control: User commands each phase explicitly. "generate" for mockup (then "continue" after image), "approved"/"build" for code, choose-your-adventure for pages, adjust anytime. Begin Phase 1 when ready.
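The JSON version described in Phase 4 can also be assembled programmatically. The following is a minimal Python sketch, not part of the original prompt: the function name `build_mockup_payload` and its parameters are hypothetical, and it covers only some of the template's placeholders. The opening anchors, the 5-6 section limit, and the `9:16` aspect ratio are taken from the prompt text above.

```python
import json

# Opening anchors quoted from the prompt's CRITICAL RULES section.
ANCHORS = (
    "Professional website design mockup. "
    "Award-winning UI design. "
    "Modern web interface 2025. "
    "Execute this step-by-step."
)


def build_mockup_payload(business, aesthetic, sections, avoid):
    """Return the mockup request as a JSON string (hypothetical helper)."""
    # The prompt insists on a 5-6 section homepage.
    if not 5 <= len(sections) <= 6:
        raise ValueError("the prompt asks for 5-6 homepage sections")
    prompt = (
        f"Website screenshot of {business}. {ANCHORS} "
        f"{aesthetic}. "
        f"Full homepage vertical scroll showing {len(sections)} sections: "
        + " ".join(f"{section}." for section in sections)
        + " Avoid: " + ", ".join(avoid) + "."
    )
    # 9:16 is the vertical aspect ratio the prompt uses to fit all sections.
    return json.dumps({"prompt": prompt, "aspect_ratio": "9:16"}, indent=2)


if __name__ == "__main__":
    print(build_mockup_payload(
        business="a specialty coffee roaster",
        aesthetic="Warm editorial minimalism with a Tailwind approach",
        sections=["Navigation", "Hero", "Trust", "Value delivery", "Pricing", "FAQ"],
        avoid=["flat backgrounds", "illegible text"],
    ))
```

Keeping the anchors and aspect ratio as constants mirrors the rule that the JSON is sent without modifications once the user types "generate".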

Alex Prompter

189,732 views • 8 months ago

chatgpt images 2.0 has been live for 24h so let's dig in how to use ChatGPT Images 2.0 to create product photos, brand books, UI mockups, and ad creative that actually looks real: 1. GPT Images 2.0 now does 2K resolution, 3:1 aspect ratios, and spits out 8 images per prompt. text rendering is way better across multiple languages. it also has thinking mode where it searches the web before generating. 2. the biggest lesson with images 2.0: you have to be extremely specific. if you give it a lazy prompt you get stock photos. give it camera type, lighting conditions, color palette, and subject details and it cooks. 3. product photography is where it shines. I created a full brand shoot for a skincare line. golden hour lighting, Mediterranean aesthetic, slight imperfections in the subjects. every image looked like a real photo shoot. 4. use it to create visual directions before you make video ads. I prompted 8 directions for the same Shopify ad story. Wes Anderson, Nike, cinematic, Apple shot on iPhone. the cinematic and Nike styles were the strongest. 5. UI mockups work now. give it your app, a feature description, the resolution, and say you want realistic data in every cell. it gave me four clean variations of a leaderboard screen. 6. apparel and merch: generate photorealistic product shots before you print anything. test if people would buy it before you spend money on production. 7. illustrations got a massive upgrade. editorial style, flat vector, limited color palettes. use these to make proposals, one-pagers, and decks look professional. 8. every business has four creative bottlenecks: marketing content, internal docs and decks, explaining things visually, and testing before building. Images 2.0 helps with all four. 9. five things you need in every prompt: context (what is this for), style references (name specific brands or aesthetics), palette (use hex codes), real copy (no lorem ipsum), and aspect ratios so it drops into production without rework. 10. 
use ChatGPT itself to help you write better prompts. you might not know camera types or lighting terms. ask it to help you build the prompt before you generate. also in this episode: I share a startup idea someone should steal: a learn to draw app with AI feedback on every sketch. $5/month. I put it into Claude Design and got three incredible wireframe directions. I share a framework for finding vertical AI agent businesses. find a boring pain point, map the workflow, do the job as a service first, document edge cases, then add agents to replace the steps. and I share an AI tool called No Scroll that blew me away in 5 minutes. it monitors the internet for you and texts you only what matters. the onboarding felt like talking to a real person. episode is live on The Startup Ideas Podcast (SIP) 🧃 (walkthrough, tips, prompts) I'm rooting for you, so share this with your friends and enjoy

GREG ISENBERG

70,968 views • 3 months ago

🎉 new skill unlocked: 20s uninterrupted, unstitched, single render from our new ai video engine: Nami. This is my birb (#7531) from the Moonbirds collection, idling in the library. patent: "Intra-Latent Semantic Injection via Cross-Spatial Encoding and Decoding during Multi-Pass Inference for Generative AI Video Creation" At Scrypted we've been quietly working on an agentic generative AI stack for two years: • integrating and testing w/ partners across the games & entertainment sectors • stealthily building a community of early believers through AVB • showcasing some of what we're doing with amazing projects like H011yw00d Agent. -- about Nami -- Nami is an agentic orchestration layer for AI video models: it unlocks their inner superpowers without making them rely on custom LoRAs or fine-tunings. Instead of throwing raw training power and tens of millions of dollars at training yet another ai video model: we figured out new ways to use what we have. Nami harnesses a multi-agent system to perform the work needed in taking a simple prompt or image and turning it into something bigger - much bigger. The agentic steps are allowed to manipulate latent space, digging into tensors, yet doing so in semantically aware chunks - meaning that Nami inherently supports video generation of arbitrary length, though it's bound to O(n) rendering time. (We do have some cool sharding tech that allows us to cut the generative time in half for a reference pose idle-animation like this demo). It's also fairly agnostic, picking and choosing the right tools for the job, and plays really well with emerging tech like FLUX Kontext, FramePack, or <- without being limited by any of them. -- use cases -- Even just a year or two ago the 20 second render below would cost a company, paying an agency, around $10k start-to-finish. This one cost me $6.25 on our dev hardware in an unoptimized environment. 
There's something mind-blowing about the state-of-the-art when we reduce costs to 0.0625% - less than 1% - of what we used to pay. It's also empowering. For creators. Game developers. Content influencers: you name it. -- superpowers -- 1. it does the things you ask for, in the order you asked for it 2. consistency is king 3. single-shot text or image-to-video 4. future videos can reference previous ones to seamlessly maintain style 5. semantic stitching: can't wait to showcase this -- gtm -- We think Generative AI Video, like image generation, like text, like games, should be a publicly accessible common good. We believe democratizing access to Nami in web3, via x402 payments proposed by Drew Coffman, or in World's mini-apps, is a bold step forward for digital freedom. Permissionless, decentralized, generative ai video. Naturally, we'll also soon release a web platform for using Nami in a traditionally SaaSy way: bring your own images, videos, or prompts and we'll take care of the rest. In the mid-term, Scrypted is building a stack of agentic skills (we call it AVB) and making them available to projects like H011yw00d Agent on Virtuals Protocol and other platforms. -- long-term vision -- Scrypted's mission is to decentralize the things that can't be decentralized. We participated in a16z crypto's CSX (London 2024) during our pre-seed specifically to research a new consensus protocol for hard things like AI video and AI agents: where there's no "one right answer". When Zero-Knowledge Proofs (ZKP) can't secure it, and Trusted Execution Environments (TEEs) are too small, we've got you covered with our upcoming Inori Network. -- how you can help -- 1. Are you a GPU farm? We're gonna need more flops. 2. Do you represent an L1 or L2? We want to build bridges. 3. Do you represent a Wallet or App creator? Let's get an endpoint exposed. 4. Are you an investor? Let's chat. 5. Like, repost, share! 
-- team background -- We come from a background of AI in the Video Game industry with each founder having over 20 years of experience at companies like Electronic Arts & Square Enix. -- contact -- DMs are open, reach out if you want to be an early tester for your site, game, collection, or project! -- try it out -- Go anywhere on X and tag H011yw00d Agent with a prompt and she'll give you a free 2 second render. Have fun making cinematic shorts or meme videos! -- thanks -- AWS Startups has been an incredible help scaling our prototypes. Also, shout out to all loyal beans 🫘 in the Autonomous Virtuals Beings (AVB) community. Nami has a very important role in the upcoming XP agent platform, can't wait to show you all. AVbeings
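The cost comparison in the post (about $10k via an agency versus $6.25 on dev hardware) can be checked with a few lines of Python. This is only a sketch of the arithmetic; both dollar figures are the post's own estimates, and the function name `cost_share_percent` is made up for illustration.

```python
def cost_share_percent(new_cost_usd: float, old_cost_usd: float) -> float:
    """Return the new cost as a percentage of the old cost."""
    return new_cost_usd / old_cost_usd * 100


agency_cost_usd = 10_000.00  # the post's estimate for an agency render
nami_cost_usd = 6.25         # the post's figure for the Nami render

share = cost_share_percent(nami_cost_usd, agency_cost_usd)
factor = agency_cost_usd / nami_cost_usd

print(f"{share:.4f}% of the agency cost")  # 0.0625% of the agency cost
print(f"{factor:.0f}x cheaper")            # 1600x cheaper
```

The result matches the 0.0625% quoted in the post, which corresponds to a 1600x reduction.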

Tim Cotten

12,617 views • 1 year ago

NVIDIA just unleashed SANA-WM and it’s an absolute MONSTER for the future of open source AI! A blazing-fast 2.6B-parameter open-source world model that doesn’t just generate video… it creates controllable, physics-rich, high-fidelity worlds on demand. Why this is insanely powerful: • One image + text prompt + 6-DoF camera trajectory → generates 720p videos up to 60 seconds long with buttery-smooth, precisely controlled camera movement. You’re not just watching, you’re piloting the simulation. • Runs locally on a single consumer GPU (RTX 5090 level) thanks to heavy distillation + NVFP4 quantization. Full 60-second clip denoised in ~34 seconds. No massive clusters required. • 36× higher throughput than previous open models while rivaling (or beating) closed industrial giants in visual quality and consistency. • Trained lightning-fast: ~213K public videos in just 15 days on 64 H100s. • Built with next-level tech: Hybrid Linear Attention, dual-branch camera control, two-stage pipeline, and rock-solid metric-scale pose understanding. This is a true open world model, the foundation for embodied AI, robotics, autonomous systems, and hyper-realistic simulations that can run anywhere. Project: At our Zero-Human Company, we’re already running SANA-WM live in our core pipelines. It’s supercharging autonomous agent training, generating unlimited synthetic training data, and powering full end-to-end simulation loops, zero humans in the loop. The speed and control let us test thousands of edge-case scenarios overnight, iterate at lightspeed, and push our fully autonomous operations further than ever before. This is the kind of breakthrough that turns science fiction into daily reality. World models just leveled up — hard. The age of personal, local, controllable universes is here.
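Two of the figures quoted above translate into simple derived numbers. The Python sketch below uses only the values stated in the post and makes two assumptions: that the ~34 seconds covers denoising only, and that all 64 H100s ran continuously for the full 15 days. The function names are hypothetical.

```python
def real_time_factor(clip_seconds: float, denoise_seconds: float) -> float:
    """Seconds of video produced per second of denoising."""
    return clip_seconds / denoise_seconds


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24


print(f"{real_time_factor(60, 34):.2f}x real time")  # 1.76x real time
print(f"{gpu_hours(64, 15):,.0f} H100-hours")        # 23,040 H100-hours
```

Under those assumptions, the 60-second clip is denoised faster than it plays back, and the stated training budget works out to roughly 23,000 H100-hours.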

Brian Roemmele

618,431 views • 2 months ago
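The throughput and training figures quoted in the SANA-WM post above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers stated in the post (60-second clip, ~34 seconds to denoise, 64 H100s for 15 days); the function names are hypothetical, and the results say nothing about whether the claims themselves are accurate.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the post.
# Function names are illustrative; the inputs are the post's own numbers.

CLIP_SECONDS = 60       # length of the generated clip
DENOISE_SECONDS = 34    # approximate time quoted to denoise that clip
NUM_GPUS = 64           # H100s quoted for training
TRAINING_DAYS = 15      # training duration quoted


def realtime_factor(clip_seconds: float, compute_seconds: float) -> float:
    """Seconds of video produced per second of compute."""
    return clip_seconds / compute_seconds


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours, assuming every GPU runs for the full period."""
    return num_gpus * days * 24


if __name__ == "__main__":
    factor = realtime_factor(CLIP_SECONDS, DENOISE_SECONDS)
    print(f"Generation speed: {factor:.2f}x real time")
    print(f"Training budget: {gpu_hours(NUM_GPUS, TRAINING_DAYS)} GPU-hours")
```

Taken at face value, the quoted numbers imply generation somewhat faster than real time on the stated hardware and a training budget on the order of tens of thousands of GPU-hours.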

Covenant Labs just did a 90-minute AMA breaking down their 3 Bittensor subnets. templar. basilica. grail. Pre-training, compute, and post-training under one roof. Most people missed it. Here's everything they said.

Covenant is building what they call the "end to end intelligence continuum." Three subnets. Three layers of the AI stack. All permissionless. Templar (SN3) handles decentralized pre-training. Basilica (SN39) handles compute. Grail (SN81) handles RL post-training.

Sam Dare, the lead, put it bluntly. Decentralized training is "humanity's last dance." Not about beating OpenAI head to head. About creating optionality. About making it cheap enough for anyone to train models. The gap between academia and frontier labs is growing exponentially. Researchers can't afford to experiment. The actual training run costs 5% of the reported budget. The other 95% is experimentation. If Covenant cracks cheap training, that entire surface area opens up.

On Templar specifically:
• Hit 39% emission on Bittensor. Highest since Apex was the only subnet on the network
• Covenant-72B trained permissionlessly with 70+ contributors on commodity internet
• 1.1 trillion tokens processed. No centralized data center
• Performance competitive with LLaMA-2-70B

On Grail, something flew under the radar. They built Pulse. A weight synchronization method that compresses model updates by 100x.
• In RL post-training, only ~1% of weights update per step
• Pulse exploits that sparsity. Lossless compression
• Prime Intellect's comparable system took 14 minutes to sync a 30B model
• Pulse makes decentralized RL training actually feasible at scale
• Already used by Cursor

The lead researcher on Grail said they've trained on math, code, and GPU kernels. Got 40-60% improvement on benchmarks. Working toward agentic training with 100K+ token context and 30B+ parameter models.

On Basilica, the compute subnet: The team was blunt. Just reselling GPU hours is a 5-10% margin game. Traditional compute providers already do that. Their play is value-added services.
• "GPU as code." No dashboard. No UI. Agents interact via SDK
• Custom scheduler that places workloads across heterogeneous hardware
• Verification checks for GPU, CPU, bandwidth, memory, storage, and OS security
• Partnerships with providers like Mass Compute for 10-20% below market pricing
• Miners compete on useful infrastructure, not just GPU hours

Sam then went on a rant about the miner burn debate. His take: Bittensor had to grow up. dTAO introduced investors. The old "miners are God" philosophy doesn't hold.
• Subnet owners have a duty to protect token value
• Miners are a resource optimization exercise, not a cost reduction exercise
• 100% miner emissions on compute subnets = immediate sell pressure
• The 41% miner allocation is arbitrary. Different business models need different splits
• Fish (who started burns) agreed. Burns usually mean the validation isn't mature enough

The bigger point. You can't police burns. Subnets just send to their own keys instead of the burn address. Subnet 28 does exactly that. Sam's position: judge subnets on outcomes, not process. Const has changed the protocol 9-10 times in 2 years. That iteration speed is Bittensor's actual moat.

The whole Covenant thesis is playing out in real time. TAO is up 100%+ in a month. Jensen Huang name-dropped the network. Grayscale has an ETF filing. But the real story is three subnets quietly building every layer of decentralized AI.

Jesus Martinez

26,642 views • 4 months ago
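The Pulse description in the Covenant Labs post above rests on one idea: if only a small fraction of weights change per step, a node can send just the changed entries and the receiver can rebuild the full weights exactly. The sketch below illustrates that general idea with a naive index/value encoding. It is not Pulse's actual format, which the post does not describe, and it makes no attempt to reproduce the quoted 100x figure; the function names are hypothetical.

```python
# Minimal sketch of a lossless sparse weight update.
# Assumption: weights are a flat list of floats; real systems work on tensors
# and use far more compact encodings than (index, value) pairs.
import random


def encode_sparse_delta(old_weights, new_weights):
    """Return (index, new_value) pairs for every entry that changed."""
    if len(old_weights) != len(new_weights):
        raise ValueError("weight vectors must have the same length")
    return [
        (i, new)
        for i, (old, new) in enumerate(zip(old_weights, new_weights))
        if old != new
    ]


def apply_sparse_delta(old_weights, delta):
    """Rebuild the new weights from the old weights plus the sparse delta."""
    updated = list(old_weights)  # copy so the caller's list is untouched
    for index, value in delta:
        updated[index] = value
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    old = [rng.random() for _ in range(10_000)]
    new = list(old)
    # Change 1% of the entries, mirroring the "~1% of weights" figure.
    for i in rng.sample(range(len(old)), k=100):
        new[i] = old[i] + 0.5
    delta = encode_sparse_delta(old, new)
    assert apply_sparse_delta(old, delta) == new  # exact reconstruction
    print(f"{len(delta)} of {len(old)} entries transmitted")
```

Because the delta stores the exact new values, reconstruction is bit-for-bit identical. The actual saving depends on how indices and values are encoded, which this sketch leaves open.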

The most interesting part for me is where Andrej Karpathy describes why LLMs aren't able to learn like humans.

As you would expect, he comes up with a wonderfully evocative phrase to describe RL: “sucking supervision bits through a straw.”

A single end reward gets broadcast across every token in a successful trajectory, upweighting even wrong or irrelevant turns that lead to the right answer.

> “Humans don't use reinforcement learning, as I've said before. I think they do something different. Reinforcement learning is a lot worse than the average person thinks. Reinforcement learning is terrible. It just so happens that everything that we had before is much worse.”

So what do humans do instead?

> “The book I’m reading is a set of prompts for me to do synthetic data generation. It's by manipulating that information that you actually gain that knowledge. We have no equivalent of that with LLMs; they don't really do that.”

> “I'd love to see during pretraining some kind of a stage where the model thinks through the material and tries to reconcile it with what it already knows. There's no equivalent of any of this. This is all research.”

Why can’t we just add this training to LLMs today?

> “There are very subtle, hard to understand reasons why it's not trivial. If I just give synthetic generation of the model thinking about a book, you look at it and you're like, 'This looks great. Why can't I train on it?' You could try, but the model will actually get much worse if you continue trying.”

> “Say we have a chapter of a book and I ask an LLM to think about it. It will give you something that looks very reasonable. But if I ask it 10 times, you'll notice that all of them are the same.”

> “You're not getting the richness and the diversity and the entropy from these models as you would get from humans. How do you get synthetic data generation to work despite the collapse and while maintaining the entropy? It is a research problem.”

How do humans get around model collapse?

> “These analogies are surprisingly good. Humans collapse during the course of their lives. Children haven't overfit yet. They will say stuff that will shock you. Because they're not yet collapsed. But we [adults] are collapsed. We end up revisiting the same thoughts, we end up saying more and more of the same stuff, the learning rates go down, the collapse continues to get worse, and then everything deteriorates.”

In fact, there’s an interesting paper arguing that dreaming evolved to assist generalization, and resist overfitting to daily learning - look up The Overfitted Brain by Erik Hoel.

I asked Karpathy: Isn’t it interesting that humans learn best at a part of their lives (childhood) whose actual details they completely forget, adults still learn really well but have terrible memory about the particulars of the things they read or watch, and LLMs can memorize arbitrary details about text that no human could but are currently pretty bad at generalization?

> “[Fallible human memory] is a feature, not a bug, because it forces you to only learn the generalizable components. LLMs are distracted by all the memory that they have of the pre-trained documents. That's why when I talk about the cognitive core, I actually want to remove the memory. I'd love to have them have less memory so that they have to look things up and they only maintain the algorithms for thought, and the idea of an experiment, and all this cognitive glue for acting.”

Dwarkesh Patel

1,051,399 views • 9 months ago
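The "supervision bits through a straw" point in the post above can be made concrete with a toy outcome-only policy-gradient loss. The sketch below assumes a plain REINFORCE-style objective with no baseline and no per-step reward, which the post does not specify; the function names and the example log-probabilities are made up for illustration.

```python
# Toy illustration of outcome-only credit assignment.
# Assumption: REINFORCE-style loss where one final scalar reward weights
# the log-probability of every token in the trajectory equally.


def broadcast_terminal_reward(num_tokens: int, final_reward: float) -> list:
    """Per-token learning signal when only the final outcome is scored."""
    return [final_reward] * num_tokens


def reinforce_loss(token_logprobs: list, final_reward: float) -> float:
    """Negative reward-weighted sum of token log-probabilities."""
    weights = broadcast_terminal_reward(len(token_logprobs), final_reward)
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


if __name__ == "__main__":
    # Hypothetical trajectory: a useful step, a wrong turn, then the answer.
    steps = ["useful step", "wrong turn", "correct answer"]
    logprobs = [-0.5, -1.0, -0.25]
    weights = broadcast_terminal_reward(len(steps), final_reward=1.0)
    for step, weight in zip(steps, weights):
        print(f"{step}: weight {weight}")
    print("loss:", reinforce_loss(logprobs, final_reward=1.0))
```

Every step gets the same weight, so the wrong turn is reinforced exactly as much as the useful step whenever the final answer happens to be right.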