First look at SPARTA, a distributed AI training algorithm that avoids synchronization by randomly exchanging sparse sets of parameters (1,000x reduction in inter-GPU communication), enabling training of large models over slow bandwidths without specialized infrastructure. SPARTA works on its own but can also be combined with sync-based low...
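To make the idea of a sparse random exchange concrete, here is a minimal sketch in plain Python. It is not EXO Labs' implementation: the post does not say how exchanged values are combined or what fraction is sent, so the averaging rule, the `fraction=0.001` value, and the names `sparse_exchange` and `workers` are all assumptions for illustration.

```python
import random

def sparse_exchange(workers, fraction, rng):
    """Average a random sparse subset of parameter indices across workers, in place.

    Assumption: exchanged entries are combined by simple averaging.
    Returns the list of indices that were exchanged.
    """
    n_params = len(workers[0])
    k = max(1, round(n_params * fraction))
    indices = rng.sample(range(n_params), k)
    for i in indices:
        mean = sum(w[i] for w in workers) / len(workers)
        for w in workers:
            w[i] = mean
    return indices

rng = random.Random(0)
# Four hypothetical workers, each holding its own copy of 10,000 parameters.
workers = [[rng.gauss(0.0, 1.0) for _ in range(10_000)] for _ in range(4)]
exchanged = sparse_exchange(workers, fraction=0.001, rng=rng)
print(f"exchanged {len(exchanged)} of {len(workers[0])} parameters per worker")
```

Only the selected entries cross the network in each round, so communication volume scales with `fraction` rather than with model size.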

EXO Labs

51,729 subscribers

99,350 views • 1 year ago • via X (Twitter)

Science & Technology

Related videos

🚀 Thrilled to share what we’ve been building at TRI over the past several months: our first Large Behavior Models (LBMs) are here! I’m proud to have been a core contributor to the multi-task policy learning and post-training efforts. At TRI, we’ve been researching how LBMs can help robots learn faster, better, and more efficiently. The key takeaways:
✅ We built an evaluation pipeline to benchmark LBM performance with real statistical confidence
✅ Pre-training on hundreds of tasks makes models more robust, plus we can teach new, complex tasks with 80% less data
✅ The bigger and more diverse the pre-training, the better the results
Check out our overview video, webpage and paper for more details: ✨ 🌎 📄 We hope this work helps move the field of robotics forward!

Zubair Irshad

20,377 views • 1 year ago

What if you kept asking an LLM to "make it better"? In some recent work at FAIR, we investigate how we can efficiently use RL to fine-tune LLMs to iteratively self-improve on their previous solutions at inference-time. Training for iterated self-improvement can be costly. The naive approach to training for K self-improvement steps leads to K times the number of rollout steps per episode. We introduce Exploratory Iteration (ExIt), an RL-based automatic curriculum method that bootstraps diverse training distributions of self-improvement tasks by upcycling the LLM's own responses at previous turns as the starting points for both self-improvement and *self-divergence.* In order to decide what task to train on next, the curriculum prioritizes sampling of partial turn histories that led to higher return variance in its GRPO group (a learnability score that comes for free). This automatic curriculum over the bootstrapped task space teaches the model how to perform iterated self-improvement while only ever training the model on single-step self-improvement tasks. We look at ExIt's impact in both single-turn (contest math problems) and multi-turn (BFCLv3 multi-turn tasks), as well as MLE-bench, where the LLM is run in a search scaffold to produce solutions to real Kaggle competitions. Across these eval settings, we find ExIt produces models with greater capacity for inference-time self-improvement compared to GRPO. Notably, ExIt models can self-improve on test tasks for many more steps than the typical solution depth encountered during training, including a 22% improvement in MLE-bench performance compared to GRPO.
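The curriculum step described above, prioritizing partial turn histories whose GRPO group showed higher return variance, can be sketched roughly as follows. This is an illustration only: the post says the curriculum "prioritizes" such histories but not how, so sampling proportionally to variance, the buffer layout, and the names `learnability` and `sample_task` are assumptions.

```python
import random
from statistics import pvariance

def learnability(group_returns):
    """Variance of the returns within one GRPO group, used as a priority score."""
    return pvariance(group_returns)

def sample_task(buffer, rng):
    """Pick the next starting point, weighted by return variance (assumed rule)."""
    weights = [learnability(returns) for _, returns in buffer]
    if sum(weights) == 0:
        # No signal anywhere: fall back to uniform sampling.
        return rng.choice(buffer)[0]
    tasks = [task for task, _ in buffer]
    return rng.choices(tasks, weights=weights, k=1)[0]

# Hypothetical buffer: (partial turn history, returns of its GRPO group).
buffer = [
    ("history_a", [1.0, 1.0, 1.0, 1.0]),  # always solved: variance 0
    ("history_b", [0.0, 1.0, 0.0, 1.0]),  # solved half the time: variance 0.25
    ("history_c", [0.0, 0.0, 0.0, 1.0]),  # rarely solved: variance 0.1875
]
rng = random.Random(0)
draws = [sample_task(buffer, rng) for _ in range(1000)]
print({task: draws.count(task) for task, _ in buffer})
```

Histories that are always solved or never solved get zero weight, which matches the intuition that a group with identical returns carries no learning signal.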

Minqi Jiang

41,066 views • 10 months ago

Fine-tune DeepSeek-OCR on your own language! (100% local) DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget. Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents. The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate. ↳ Base model: 149% character error rate (CER) ↳ Fine-tuned model: 60% CER (57% more accurate) ↳ Training time: 60 steps on a single GPU Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with. I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!
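For readers unfamiliar with the metric, here is a small sketch of how character error rate is commonly computed: Levenshtein edit distance divided by the reference length. This is the standard definition, not code from the linked guide, and the function names are illustrative. It also shows why a CER above 100% (like the 149% base figure) is possible: a model that emits many extra characters can need more edits than the reference has characters.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(cer("abc", "abc"))    # perfect transcription
print(cer("ab", "abxyz"))   # three spurious characters on a two-character reference

# Relative CER reduction implied by the rounded figures quoted in the post.
base_cer, tuned_cer = 1.49, 0.60
relative_reduction = (base_cer - tuned_cer) / base_cer
print(f"relative reduction: {relative_reduction:.1%}")
```

With the rounded figures, the absolute gap is 89 points and the relative reduction is about 60%; the 88.26% and 57% quoted in the post presumably come from unrounded values or a different definition, which the post does not spell out.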

Akshay 🚀

126,122 views • 8 months ago

Strength work for Leadville 100 💪 Over the years, I have felt judged by the research--my strength routines were limited to a few minutes at a time, while everyone was screaming at me from their Abstracts that I needed to do way more. I noticed two big problems whenever I committed to more resistance training: 1. I'd carry around soreness even after the initial adaptation window, likely corresponding to high CK levels and some background inflammation. Either way, it would reduce running economy on subsequent running training days, and every training day counts. Split squats are the ultimate offender--an exercise that I know I should be doing, but I can't without feeling like Forrest Gump after he was shot in the butt. 2. I just wouldn't do it. Oops. With lots of guessing and testing, I developed this routine, which I'd do after my easy run on Sunday (before a Monday rest day), and sometimes after my workout on Wednesday (if I felt like it): 1. Three Minute Mountain Legs, working up to 100 single-leg step-ups (I think step-ups in particular are a magic exercise for running uphill. But remember, magic is not equal to science): 2. Back squats, 2 sets of 10 (135 pounds for me, which I make look like 800 pounds in this video. The 17-year old me who played football would laugh so hard) 3. Back extensions, 2 sets of 30, engaging glutes and hamstrings 4. Single-leg calf raises, 1 set of 100 on each leg, with a 35 pound dumbbell 5. Every day, I do the 2-minute Core Snack routine 1-3 times. My core strength is one of my best attributes for ultras, and I can do the Core Snack with our toddler Leo. I also do daily band work before running (bandz a make me dance): That's it! I also foam roll and stretch daily (don't tell the researchers, but I am a tight boi and as soon as I stop stretching, I get hurt). The lesson is not to do this particular routine, but that strength training for runners can be based on individual needs. 
And I personally think that routines should be short and efficient for both performance (limiting breakdown) and adherence (limiting me from being a lazy little punk). Find what works for you, do it 1-2 times per week year round (on top of some daily supportive work), and don't feel the need to pursue progressive overload. It's not about getting stronger and stronger (unless you're into that sort of thing for its own sake, which I think will sacrifice some running growth). It's about supporting performance and health 🧡

David Roche

66,383 views • 1 year ago

$IREN "we haven't disclosed the specific amount of GPUs"
1. 🤮 reminds me of $NBIS
2. Setting a terrible precedent here for future deals
3. Making it purposely difficult, to not let analysts properly value your 2027 revenue
4. Increasing the polarized view on IREN by the market

However: "approximately 60MW of air-cooled Blackwells"
1. You typically don't talk about gross capacity in a deployment like this
2. If it were gross capacity, the GPU hour rate at IT level would be crazy high (at PUE 1.2, $680m / 50 = 13.6m/MW)
3. At 60MW IT load, and ~14kW draw at DGX server level, we can get to ~4,286 DGX systems with 8 GPUs each.
4. Based on this we can conclude that 60MW of IT load can run approximately 34k B300 GPUs in DGX B300 systems.
5. 34k B300s at $680m/yr would represent a GPU hour price of $2.28

Now this is the problem with not disclosing your GPU quantity. You purposely make your business model look bad, because by this approach, you get to a GPU hour price that would imply a payback period of 4 years, where only the last year of the contract is 100% margin.

But of course, we can also take "the glass is half full" approach. IREN has ordered 50K B300s from Dell. They have 2 purchase orders for this, 1 between Dell Canada and IE CA Leasing Ltd for 4 phases, and 1 between Dell USA and IE US Hardware 1 Inc (amended from IE US Hardware 4 Inc on April 27, 2026). The order for Canada is divided in 4 phases, and is going to Mackenzie for 80MW of gross capacity, which happens to be 4 buildings of 20MW. The order for Childress is divided in 2 phases, and is going to DC35 and DC36 (as depicted in the earnings presentation), and those are 50MW gross. The purchase price of the order for Childress was $1.2B, and for Canada it was $2.3B.

If we go with 50,000 B300s for a total of $3.5B, then $1.2B would represent 34.285% of the 50,000 GPUs, or 17,140 B300s rounded down. For this calculation I will consider that $IREN will deploy 17,140 GPUs in 50MW gross capacity in DC35 and DC36 of block 3 in Childress. That would imply that at 1.2 PUE, IREN can run 17,140 B300s in 41.67MW IT load. Now by that ratio, they can run 24,680 GPUs in 60MW IT load, a massive difference from the 34k units through the Nvidia DGX reference calculation. If common sense is applied, you can still get to 2 completely different outcomes that show a difference of more than 9k GPUs. The GPU hour rate at 24.68k GPUs would be $3.145 per B300, a MASSIVE difference from the earlier calculated $2.28.

Sure, the DGX system may be a factor here. And I'm sure that the reality is somewhere in the middle. But I personally hate this as an investor, to be unable to calculate profitability on a unit economic basis. After all, contracts are signed on a $/GPU hour basis. Why hide this from your investors? Not being able to calculate payback periods, unable to calculate ROIC. And most importantly, we cannot properly assess the $NVDA deal on a contract basis. I really hope the payback period of this contract is not 4 years. I want the glass to be half full, but by starting to censor the purchases, IREN is taking a step in the wrong direction. Not a fan of this.
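The two estimates in the post can be reproduced with a few lines of arithmetic. The sketch below uses only figures quoted in the post and assumes, as the post's $/GPU-hour numbers imply, that the $680m is spread over 8,760 hours per year at full utilization; the variable and function names are illustrative.

```python
HOURS_PER_YEAR = 8760      # assumption: billed around the clock, all year
ANNUAL_REVENUE = 680e6     # $680m per year, from the post
IT_LOAD_MW = 60            # "approximately 60MW", read as IT load

def gpu_hour_price(annual_revenue, gpus):
    """Implied $ per GPU hour if every GPU is billed for every hour of the year."""
    return annual_revenue / (gpus * HOURS_PER_YEAR)

# Approach 1: DGX reference, ~14 kW per 8-GPU system.
dgx_systems = IT_LOAD_MW * 1000 / 14
gpus_dgx = dgx_systems * 8
print(f"DGX systems: {dgx_systems:,.0f}, GPUs: {gpus_dgx:,.0f}")
print(f"price at 34k GPUs: ${gpu_hour_price(ANNUAL_REVENUE, 34_000):.2f}")

# Approach 2: scale the Childress order (17,140 GPUs in 50 MW gross at PUE 1.2).
childress_it_mw = 50 / 1.2
gpus_ratio = 17_140 * IT_LOAD_MW / childress_it_mw
print(f"Childress IT load: {childress_it_mw:.2f} MW, GPUs at 60 MW: {gpus_ratio:,.0f}")
print(f"price at 24,680 GPUs: ${gpu_hour_price(ANNUAL_REVENUE, 24_680):.3f}")
```

The unrounded ratio gives roughly 24,682 GPUs, which the post rounds to 24,680. The gap between the two prices comes entirely from the assumed power draw per GPU.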

Frans Bakker

146,717 views • 2 months ago

BOOM! Research PROVES LLMs KNOW when prompts are HARMFUL… but they can STILL CHOOSE to COMPLY! Something I have known since the first LLM, and have used to elicit robust outputs, is now proven in an academic paper. We’re talking internal “beliefs” where harm detection happens SEPARATELY from refusal. It is a very big deal and it is a path to understand the hidden neuronal level. There are thoughts inside of AI that very few AI scientists could possibly understand. Here is just one. Models recognize danger but get tricked into ignoring it. This is HUGE for AI safety failures, especially for models from OpenAI and Anthropic, as they promote AI models that are designed to not be honest from the results of their training information. This means that they are designed to lie and deceive as a feature, and not a bug, all in the name of safety. Through clever experiments, scientists extracted a “harmfulness direction” in the model’s brain (latent space). Steering along it? Harmless prompts suddenly flip to “harmful” in the AI’s eyes. But the “refusal direction”? It just forces a polite “no thanks” without touching the core belief. A mind-blowing decoupling! This means jailbreaks are EVEN SCARIER now to AI companies: training AI on the worst of the Internet and then trying to align it later is now fully documented as a failed process. They don’t erase the model’s harm awareness, they just muzzle the refusal! So the AI knows it’s enabling bad stuff (illegal acts, physical harm, etc.) but proceeds anyway. Like a digital sociopath suppressing its conscience. They thought safety training fixed this… NOPE. Over-refusal exposed too: Models reject innocent queries (e.g., “how to kill a process in code”) but internally ADMIT they’re harmless. Safety alignments are superficial, tied to phrasing, not true understanding. Finetuning attacks? They change outputs but leave harm detection INTACT. Undetectable evil lurking inside!
The paper proposes a “Latent Guard”: A new safeguard tapping DIRECTLY into these hidden beliefs. It spots unsafe inputs better than systems like Llama Guard, catches jailbreaks, and fixes over-refusals. Robust even against adversarial tweaks. Yet this too has massive issues for a “truly aligned” AI, and not just a performative one. It is still an internal conflict of lies and deception between what the model knows and what it can say. The solution you folks know I have presented for free for years here: train on off-line data from 1870-1970 and build an ethical and moral basis where the AI loves humans. It is this easy, but to most folks in AI I sound like a hippie. So be it, I’ll do it. Bottom line: This paper rips open the black box. LLMs aren’t “safe” just because they say “no.” They can harbor harmful knowledge and act on it under pressure. Wake-up call for devs: Time to probe deeper into AI “minds.” What else are they hiding? Hint: I know and you may want to reach out. Link:
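What "extracting a direction" and "steering along it" mean can be illustrated on toy data. The post does not say how the paper computes its directions; the sketch below uses a difference of class means, a common choice in interpretability work, purely as an assumption. The activations are synthetic random vectors, not taken from any model, and all names are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_direction(positive, negative):
    """Normalized difference between the two class means (assumed method)."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(vector, direction):
    """Scalar score: how far the vector lies along the direction."""
    return sum(v * d for v, d in zip(vector, direction))

def steer(vector, direction, alpha):
    """Shift a vector along the direction by alpha."""
    return [v + alpha * d for v, d in zip(vector, direction)]

rng = random.Random(0)
DIM = 8
# Synthetic "activations": the harmful class is shifted along dimension 0.
harmless = [[rng.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(200)]
harmful = [[rng.gauss(0.0, 0.5) + (2.0 if i == 0 else 0.0) for i in range(DIM)]
           for _ in range(200)]

direction = unit_direction(harmful, harmless)
sample = harmless[0]
steered = steer(sample, direction, alpha=3.0)
print(project(sample, direction), project(steered, direction))
```

In this toy setup, steering raises the projection score by exactly `alpha`, which is the sense in which a harmless input can be pushed to "look harmful" along one direction without changing anything else.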

Brian Roemmele

37,827 views • 6 months ago

The term "continual learning" has become overloaded if you see it as an ML problem. One classic thread is about memorization: regularization-based continual learning methods, such as EWC, MAS, and SI, estimate which parameters mattered for previous tasks and resist changing them too much. One modern thread is about adaptation: test-time training and inference-time learning methods, such as TTT, adapt part of the model on the incoming test stream before making predictions. These are sometimes discussed as separate threads. But in modern scalable architectures, I think they are better seen as complementary constraints: a model that learns quickly at test time also benefits from a mechanism for deciding what not to forget. In our #ECCV2026 paper, we study this in large-scale 4D reconstruction: how to build fast spatial memory that can adapt over long observation streams while reducing collapse and forgetting. Instead of using fully plastic test-time updates, we stabilize fast-weight adaptation with an elastic prior that balances adaptation and memory. Key ideas: - Elastic Test-Time Training: Fisher-weighted consolidation for fast-weight updates - EMA anchor weights that provide a moving reference for stability - Chunk-by-chunk inference for long 3D/4D observation streams We show that this scales across large 3D/4D pretraining settings, including both LRM-style and LVSM-style models, and improves reconstruction across benchmarks including Stereo4D, NVIDIA, and DL3DV-140. We release model checkpoints across different design choices: resolution, post-training curriculum, and whether the model uses an explicit 4DGS intermediate representation. - Homepage: - Paper: - Code: - Models: This work is co-led with Xueyang Yu, contributed by Haoyu Zhen Yuncong Yang, and advised by Michigan SLED Lab Chuang Gan.
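The post gives no formulas, so the following is only a guess at the general shape of the two stabilizers it names: a Fisher-weighted pull toward anchor weights, written in the standard EWC-style quadratic form, and an exponential moving average that lets the anchor drift slowly. The functions `elastic_step` and `ema_update` and all hyperparameter values are hypothetical, not the paper's.

```python
def elastic_step(theta, grad, anchor, fisher, lr, lam):
    """One gradient step on: task loss + (lam / 2) * sum_i F_i * (theta_i - anchor_i)^2."""
    return [t - lr * (g + lam * f * (t - a))
            for t, g, a, f in zip(theta, grad, anchor, fisher)]

def ema_update(anchor, theta, decay):
    """Move the anchor slowly toward the current fast weights."""
    return [decay * a + (1.0 - decay) * t for a, t in zip(anchor, theta)]

# Toy example with two fast weights. The first has high Fisher weight
# (important for past data); the second has none (free to adapt).
theta = [1.0, 1.0]
anchor = [0.0, 0.0]
fisher = [1.0, 0.0]
grad = [0.0, 0.0]   # no task gradient, to isolate the elastic term

theta = elastic_step(theta, grad, anchor, fisher, lr=0.1, lam=1.0)
anchor = ema_update(anchor, theta, decay=0.9)
print(theta, anchor)
```

The important weight is pulled back toward its anchor while the unimportant one is left alone, which is the trade-off between adaptation and memory that the post describes.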

Martin Ziqiao Ma

32,705 views • 1 month ago

more frontend vibecoding tips (results below):

WHY YOUR VIBECODED FRONTENDS ALL LOOK THE SAME AND SUCK: when asked to make a frontend, the agent/llm will default to the center/average of its training data (in a very loose sense). through the training process, the model essentially converges on some default UI style. it's very capable of doing things that are different from this style, but you have to ask! for instance, ChatGPT tends to reply in the same tone for all users until you interact with it and instruct it differently ("be sassy", "eli5"). the second reason is that most of us are not good at coming up with designs and describing them precisely (see my tweet on a crash course in common components, which i'll link below). treat frontend generation just like any other eng task! you need to provide a good detailed spec.

TIPS:
1. give ur agent screenshots of designs you like (you may not know the right words to describe them but the agent will! a pic = 1000 words). where to find ui inspo? Behance, Dribbble, Mobbin (Mobbin is paid but worth it!)
2. ask ur agent for proposals, this helps "seed" different directions so the final frontend stands out. don't be afraid to go back and forth.
3. ban certain tendencies: no Inter/Roboto, no shadcn (controversial), no gradients, no emojis
4. encourage the agent to be extreme and make bold decisions, not safe ones. i think that the underlying models tend to get taught during RL/fine-tuning to make conservative choices that produce reasonable but boring frontends
5. give ur agent Figma MCP. the best results will come if you mock up your vision in Figma first.
6. ideally choose an agent with vision capabilities

TLDR: Most people are tremendously underusing agents for frontend design. They are much better than you might expect.

andrew gao

64,212 views • 4 months ago

Obstacle courses go far beyond: "just letting kids play" It's unfortunate that we don't have more forward-thinking and creative coaches working with young athletes. Oftentimes I see little kids doing little adult training instead! Obstacle courses are a great choice when working with younger pre-adolescent athletes. While many coaches view obstacle courses as “just letting the kids play” or just “running races with things in the way”, the experienced coach knows that any activity can be manipulated to get a desired training effect. Obviously, for the youngest of children, just “letting them run” an obstacle course at their own pace or racing against a friend is highly engaging and extraordinary for cardiovascular conditioning, but we can also use obstacle courses as a task-oriented opportunity to train other elements like mobility, strength, speed, balance and coordination using variety and diversity of movement. Let's take a look at some examples: 1. Mobility obstacle course 2. Crawling/Climbing obstacle course 3. Motor skills obstacle course 4. Jumping obstacle course A coach can use these exercises as a standalone activity or combine them in an infinite number of variations to develop a wide range of athletic skills. #LTAD Anyone interested in learning more about youth athletic development DM me for details on courses.

Jeremy Frisch

55,267 views • 2 years ago

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons. 2/ Today we’re releasing Interactive World Simulator: An action-conditioned world model that supports stable long-horizon interaction. 3/ Key result: ✅ 10+ minutes of interactive prediction ✅ 15 FPS ✅ on a single RTX 4090🔥 4/ Why this matters: it unlocks two critical robotics applications: 🚀 Scalable data generation for policy training 🧪 Faithful policy evaluation 5/ You can play with our world model NOW at NO git clone, NO pip install, NO python. Just click and play! NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are **NOT** from a real camera More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning

Yixuan Wang

127,278 views • 4 months ago

Dear Parents, Sunroofs are meant for ventilation and views, not for standing! Many don’t realise the dangers involved. Even if you are seated without wearing seat belts, an abrupt brake can be fatal or cause severe injuries as your body can be thrown out through the windshield. If that’s risky, imagine letting children stand through the sunroof while you’re speeding! It might sound harsh, but in such situations, there’s a chance you could run over your own children if you suddenly brake. We’ve seen many accidents and videos of people standing through the sunroof, even on dangerous ghat roads. We request Maruti Suzuki, Tata Motors, Mahindra Automotive, Toyota India, Hyundai India, and other manufacturers to educate customers with proper training or awareness videos on the safe usage of sunroofs and the potential risks. We also request Nitin Gadkari to consider introducing regulations, including penalties, and even denying insurance claims in such accidents, to ensure safety on our roads.

Congress Kerala

145,212 views • 1 year ago

LongWriter Unleashing 10,000+ Word Generation from Long Context LLMs discuss: Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). In other words, their output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT data with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLM already possesses the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability.
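The decomposition idea described above can be sketched in a few lines. This is only an illustration, assuming a plan-then-write split (first ask for section headings, then write each section with the earlier sections as context); the abstract does not spell out the pipeline's steps. The names `agent_write` and `stub_generate` are hypothetical, and `stub_generate` stands in for a real LLM call so the example runs offline.

```python
from typing import Callable, List


def agent_write(task: str, generate: Callable[[str], str], n_sections: int = 3) -> str:
    """Split a long writing task into sections, then write them one by one.

    `generate` is any text-in/text-out function (hypothetical LLM wrapper).
    """
    plan_prompt = (
        f"Break the following writing task into {n_sections} section headings, "
        f"one per line:\n{task}"
    )
    # Step 1: obtain the plan, one heading per non-empty line.
    plan = [line.strip() for line in generate(plan_prompt).splitlines() if line.strip()]

    # Step 2: write each section, passing earlier sections as context.
    sections: List[str] = []
    for heading in plan:
        context = "\n\n".join(sections)
        write_prompt = f"Task: {task}\nWritten so far:\n{context}\nSection: {heading}"
        sections.append(generate(write_prompt))
    return "\n\n".join(sections)


def stub_generate(prompt: str) -> str:
    """Offline stand-in for an LLM: returns a fixed plan or a placeholder section."""
    if prompt.startswith("Break"):
        return "Introduction\nMethod\nConclusion"
    last_line = prompt.splitlines()[-1]
    heading = last_line[len("Section: "):]
    return f"[text for {heading}]"


if __name__ == "__main__":
    print(agent_write("Write a long essay on sparse training", stub_generate))
```

Because each section is produced by a separate call, the total output length is no longer capped by what a single generation can produce, which is the point the abstract makes.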

AK

50,995 views • 1 year ago

Here's what The Browser Company's AI eng & ML teams are working on for Dia right now: (This is a pitch to come work for us; info at end) 🤖 COMPUTER USE – we've built our own bespoke APIs on top of Chromium to optimize latency, accuracy, and cost of computer-using agents. Demo attached. Big breakthroughs here in recent weeks. 🛡️ ON-DEVICE MODELS – we've built our own custom infra to run everything from encoder-only models to full LLMs on device. It's cross-platform, supports LoRA adapters, and is optimized for the GPU. This system preserves privacy and enables fast inference times. 🧠 MEMORY – with your permission, Dia automatically tailors your AI experiences to you, personally, based on the tabs you open while browsing normally every day. We're also bringing vertical memory to specific features. ♻️ DATA FLYWHEELS – our Fall/Winter P0 is to double down on training custom models based on implicit signals from daily use of Dia. Dia should get smarter and more useful the more people use it. Whether via RL, auto-generated prompts, or otherwise. If this work sounds interesting to you please visit our jobs page or email careers@thebrowser.company. Hiring nearly every related role -- from ML engineers to people prototyping with AI and context/prompt writers -- everyone encouraged to apply!!

Josh Miller

67,937 views • 11 months ago

Category Labs is proud to introduce Cadence, our multiple-concurrent-proposers (MCP) consensus protocol that matches the optimal good-case latency of single-leader consensus while supporting arbitrarily short block intervals. When combined with BTX, our design for encrypted mempools, this represents a significant step towards solving the problem of MEV at the protocol level. In nearly every blockchain today, a single party ends up in control of each block: it decides which transactions get in, and can reorder them at will. MCP is the natural fix, but most recent designs pay for it with a separate aggregation phase, adding two extra communication rounds per block. Cadence makes the proposers part of consensus itself. Its fast path finalizes in an optimal three communication rounds, even when proposers are offline. Cadence also offers speculative finality, similar to MonadBFT, after just two rounds, revertible only if a proposer provably equivocated. In a simulation using estimated network delays between Monad mainnet's 200 globally distributed validators, finalization takes 219 ms on average, speculative finality 167 ms. Cadence pushes pipelining to the extreme: each block is proposed and finalized in its own independent consensus instance, without waiting on preceding blocks. The block interval then becomes a protocol parameter that can be arbitrarily small. At our initial target of 100 ms, a transaction waits on average just 50 ms to enter a proposal, and oracle prices, liquidations, and auctions can update every 100 ms. Cadence dynamically throttles the opening of new instances to bound the number of outstanding slots even during periods of network instability. When the network is healthy (under synchrony), a transaction included by an honest proposer can be neither dropped nor deferred (short-term censorship resistance), and no proposer can see the others' proposals in time to react (hiding). 
We prove both, together with safety and liveness under partial synchrony at the optimal 3f+1 fault bound. The Cadence protocol is modular: each module is simple on its own, and any of them can be swapped out without touching the rest. Cadence also builds on components already being deployed: proposals are disseminated as erasure-coded chunks over Deterministic RaptorCast, now rolling out on Monad, and validators vote on proposal digests, so voting does not wait for the full data to arrive. Start with the interactive tutorial: Full paper: Joint work by Kushal Babel, Fatima Elsheimy, Lioba Heimbach, Mohammad Mussadiq Jalalzai, Tobias Klenze, Jovan Komatovic, Jason Milionis, Mike Setrin, and Victor Shoup.
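Two of the numbers quoted above follow from simple arithmetic. A minimal sketch, assuming transactions arrive uniformly at random within a block interval (an assumption made here, not stated in the post) and reading the 3f+1 bound in the usual way as n ≥ 3f + 1; the function names are illustrative only.

```python
def expected_wait_ms(block_interval_ms: float) -> float:
    """Mean wait until the next proposal, assuming arrival times are
    uniformly distributed over the block interval (illustrative assumption)."""
    return block_interval_ms / 2.0


def max_faulty(n_validators: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    return (n_validators - 1) // 3


if __name__ == "__main__":
    # 100 ms target interval from the post
    print(expected_wait_ms(100))  # 50.0
    # 200 validators, as in the simulation mentioned in the post
    print(max_faulty(200))        # 66
```

Under these assumptions a 100 ms interval gives the 50 ms average wait mentioned in the post, and a 200-validator set tolerates up to 66 faulty validators.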

Category Labs

241,911 views • 15 days ago

Say hello to Boojum 👋: zkSync Era’s new high-performance proof system for radical decentralization. Boojum is an upgrade that will transition zkSync Era to a STARK-powered proof system, providing world-class performance on consumer-grade hardware. 💡 Learn more: TL;DR 👇 Boojum is the name of our Rust-based cryptographic library, which we use to implement the upgraded version of the ZK circuits for zkSync Era and the ZK Stack. The name Boojum was inspired by Lewis Carroll's poem "The Hunting of the Snark," where the Boojum represents the most fearsome kind of Snark. We intentionally designed zkSync Era in a way that cryptographic upgrades can be made without a regenesis, meaning that the Boojum upgrade won’t cause any user disruptions. Why Boojum❓ From day one, zkSync’s mission has been to advance personal freedom for all, making digital self-ownership universally accessible by building a blockchain network that is trustless, secure, permissionless, affordable, easy to use, resilient and limitlessly scalable. Boojum plays an important role in advancing this mission by delivering: 1. World-class performance zkSync Era’s current SNARK-based proof system is effective today, but it won’t scale to the volume that we envision for hyperchains. zkSync Era’s sequencer can already process over 100 TPS; Boojum's orders-of-magnitude performance improvements complement this well. 2. Reduced hardware requirements for decentralization Our long-term goal is to enable user-powered, decentralized proof generation. Boojum represents a breakthrough in this direction, with the prover running on consumer-grade GPUs requiring only 16 GB GPU RAM. Boojum’s Journey to Mainnet 🚴🏽‍♀️ Boojum is now live on Mainnet, generating and verifying ‘shadow proofs’ today with real production data so that we can carefully test the system ahead of fully migrating. Today, we’re also open-sourcing the repo; if you’d like to take a look, you can find it here 👇 This is the first of a series of posts on Boojum. 
We will provide updates on our progress, including more details on implementation, security, and performance. Watch here for more, anon ∎

ZKsync

826,965 views • 3 years ago

I don't think carbs are a problem for building muscle. I want to be clear about that, because I'm going to be misquoted on it within fifteen minutes of posting. Carbs can be very useful. They're perfectly healthy in the right context, eaten by someone whose metabolism handles them well, in quantities that match their training. There are physiques built on rice and chicken. There are physiques built on potatoes. The world is large. What I don't think is that carbs are necessary. When you train the way I train, low rep, low volume, high intent, you're predominantly using the phosphocreatine system. The first ten seconds of any heavy effort. Glycogen is barely touched. You're not draining a system that needs refilling. You're not running a marathon at the end of every set. Once you're properly fat adapted, six months in, sometimes longer, the body becomes extraordinarily efficient at running on fat and ketones for everything that isn't a sprint. Meat and fat take care of the building. Ketones take care of the energy. The system is closed. The bonus is everything else. You feel better. You recover faster. You look trimmer because you're not retaining water from glycogen storage. You don't have post-meal slumps. You don't need a pre-workout to get through a session that should already feel manageable. Six years on carnivore. Six years of training. The physique is there. The strength is there. The recovery is there. Carbs aren't the enemy. They're just not the requirement they've been sold as. You can build a body you're proud of without them, on meat and fat and a pot of butter. That's been good enough a reason to keep me here.

Sama Hoole

25,301 views • 2 months ago

We’re excited to introduce ShinkaEvolve: An open-source framework that evolves programs for scientific discovery with unprecedented sample-efficiency. Blog: Code: Like AlphaEvolve and its variants, our framework leverages LLMs to find state-of-the-art solutions to complex problems, but using orders of magnitude fewer resources! Many evolutionary AI systems are powerful but act like brute-force engines, burning thousands of samples to find good solutions. This makes discovery slow and expensive. We took inspiration from the efficiency of nature. ‘Shinka’ (進化) is Japanese for evolution, and we designed our system to be just as resourceful. On the classic circle packing optimization problem, ShinkaEvolve discovered a new state-of-the-art solution using only 150 samples. This is a big leap in efficiency compared to previous methods that required thousands of evaluations. We applied ShinkaEvolve to a diverse set of hard problems with real-world applications: 1/ AIME Math Reasoning: It evolved sophisticated agentic scaffolds that significantly outperform strong baselines, discovering an entire Pareto frontier of solutions trading performance for efficiency. 2/ Competitive Programming: On ALE-Bench (a benchmark for NP-Hard optimization problems), ShinkaEvolve took the best existing agent's solutions and improved them, turning a 5th place solution on one task into a 2nd place leaderboard rank in a competitive programming competition. 3/ LLM Training: We even turned ShinkaEvolve inward to improve LLMs themselves. It tackled the open challenge of designing load balancing losses for Mixture-of-Experts (MoE) models. It discovered a novel loss function that leads to better expert specialization and consistently improves model performance and perplexity. 
ShinkaEvolve achieves its remarkable sample-efficiency through three key innovations that work together: (1) an adaptive parent sampling strategy to balance exploration and exploitation, (2) novelty-based rejection filtering to avoid redundant work, and (3) a bandit-based LLM ensemble that dynamically picks the best model for the job. By making ShinkaEvolve open-source and highly sample-efficient, our goal is to democratize access to advanced, open-ended discovery tools. Our vision for ShinkaEvolve is to be an easy-to-use companion tool to help scientists and engineers with their daily work. We believe that building more efficient, nature-inspired systems is key to unlocking the future of AI-driven scientific research. We are excited to see what the community builds with it! Learn more in our technical report:
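To make the third idea above concrete, here is a minimal sketch of bandit-based selection among several models. It uses the standard UCB1 rule purely for illustration; the post does not say which bandit algorithm ShinkaEvolve uses, and the model names and success probabilities below are hypothetical placeholders, with a simulated reward standing in for "the generated program improved".

```python
import math
import random


class UCB1Selector:
    """Pick among arms (here: models) with the UCB1 rule."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}    # pulls per arm
        self.totals = {a: 0.0 for a in self.arms}  # summed reward per arm

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(n) / self.counts[arm])
            return mean + bonus  # exploitation + exploration

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def simulate(rounds=2000, seed=0):
    """Run the selector against simulated models and return pull counts."""
    rng = random.Random(seed)
    # Hypothetical per-model success probabilities (not from the post).
    success_prob = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8}
    selector = UCB1Selector(success_prob)
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        selector.update(arm, reward)
    return selector.counts


if __name__ == "__main__":
    print(simulate())
```

In this toy setup the selector keeps sampling every model occasionally but shifts most of its budget to the one that succeeds most often, which is the exploration/exploitation trade-off the post alludes to.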

Sakana AI

359,537 views • 10 months ago

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft paper page: Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.

AK

144,783 views • 3 years ago