the encoder

Kat ⊷ the Poet Engineer

84,664 subscribers

103,750 views • 7 months ago • via X (Twitter)

Art, Science & Technology

Related videos

Crazy how transformers work! It’s the encoder part

Dev Khant

166,618 views • 1 year ago

The Vision Encoder: Our Secret Weapon 🔥 An ultra-accurate vision encoder plate, combined with 5μm-resolution optical feedback, calibrates the H2D’s motion system to achieve consistent 50μm accuracy across the entire build plate. #BambuH2D

Bambu Lab

21,379 views • 1 year ago

manual control of the monster mwir lens achieved. gonna hook into the magnetic end stop + encoder to replace the proprietary Telic Optics board.

MacCallister Higgins

21,252 views • 1 year ago

Training Wan 2.1-1.3B to use Qwen3-VL-2B text encoder. Doing 33% text only, 33% VL only, and 33% both. Just pretraining the two text input linear layers for now. It is amazing how quickly a model can adapt to a different text encoder. This is 25,750 steps, BS of 10.

Ostris

25,826 views • 22 days ago

i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2

21 tokens per second. on a budget consumer GPU. locally. no API. no cloud. no subscription. and the benchmarks are absolutely cooked

# first let's talk architecture

because this is genuinely different

every multimodal model you've used has a frozen vision encoder + frozen audio encoder + LLM backbone glued together

Gemma 4 12B is different. it's a single decoder only transformer. that's it.

vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM
audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space

no encoder tax. no latency penalty. no fragmented memory

to put the encoder savings in perspective:

old Gemma 4 26B approach:
- 550M param vision encoder (frozen)
- 300M param audio encoder (frozen)
- LLM backbone

Gemma 4 12B:
- 35M param vision embedder (a single matmul)
- no audio encoder at all
- LLM backbone handles EVERYTHING

550M → 35M for vision alone. that's a roughly 15x reduction

this is why the gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GB!!! and it has 256K native context

# Benchmarks:
AIME 2026 (math olympiad): 77.5%
GPQA Diamond (expert science): 78.8%
LiveCodeBench v6 (real code): 72%
Codeforces ELO: 1659
MMLU Pro: 77.2%
MATH-Vision: 79.7%
BigBench Extra Hard: 53%

inference → llama.cpp, LM Studio, vLLM, SGLang

llamacpp flags: -m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080

Available on huggingface now! Link below

Alok

279,768 views • 1 month ago
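
The audio path described in the post above (16 kHz signal, 40 ms frames, one linear projection) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the frames are taken as non-overlapping, `HIDDEN_DIM` is a made-up size, and the weight matrix is random rather than trained. None of those details are given in the post.

```python
import numpy as np

SAMPLE_RATE = 16_000                          # Hz, as stated in the post
FRAME_MS = 40                                 # frame length in ms, as stated in the post
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 640 samples per frame
HIDDEN_DIM = 64                               # hypothetical; the real hidden size is not given


def frame_audio(signal: np.ndarray) -> np.ndarray:
    """Slice a 1-D signal into non-overlapping frames (assumption), dropping any remainder."""
    n_frames = len(signal) // FRAME_LEN
    return signal[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)


def embed_audio(signal: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Project each raw frame into the model's input space with a single matmul."""
    return frame_audio(signal) @ weight


rng = np.random.default_rng(0)
weight = rng.normal(scale=0.02, size=(FRAME_LEN, HIDDEN_DIM))  # random stand-in for trained weights
one_second = rng.normal(size=SAMPLE_RATE)                      # one second of synthetic audio

tokens = embed_audio(one_second, weight)
print(tokens.shape)  # (25, 64): one second yields 25 frame embeddings
```

At these settings every second of audio becomes 25 input vectors, which is the whole "encoder" in this sketch.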

Modern tech tends to focus on “more”. More data. More bandwidth. More compute. The best software engineering happens in the “less” sphere; codec engineering. Listen to this stream with 90% packet loss. The Opus Encoder team is cracked:

LaurieWired

244,761 views • 1 year ago

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network converts input to code and reconstructs input from code. It is annotated with equations, PyTorch, and graphs. 👇Join the 'AI Math' community. Download xlsx.

Tom Yeh

101,555 views • 1 year ago

Can a VLM see without a vision encoder? We trained one for $100, inspired by Gemma 4 12B. Latency on an M3 Pro MacBook: 112 ms -> 1.1 ms for the image path 30% lower end-to-end image+LLM The architecture is just: patchify the image -> linear projection with pos embeddings -> LLM Writeup:

Andi Marafioti

60,278 views • 1 month ago
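
The pipeline named in the post above (patchify the image, linear projection with positional embeddings, then the LLM) can be sketched as follows. This is a rough sketch, not the authors' code: the patch size of 48 is borrowed from the Gemma posts elsewhere in this feed, `HIDDEN_DIM` is hypothetical, the positional embeddings are assumed to be learned vectors that are simply added, and all weights are random.

```python
import numpy as np

PATCH = 48        # patch side in pixels; borrowed from the Gemma posts, an assumption here
HIDDEN_DIM = 32   # hypothetical LLM hidden size


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches in row-major order."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def embed_image(image: np.ndarray, weight: np.ndarray, pos_emb: np.ndarray) -> np.ndarray:
    """One matmul per image plus additive positional embeddings (assumed additive)."""
    patches = patchify(image)
    return patches @ weight + pos_emb[: len(patches)]


rng = np.random.default_rng(0)
image = rng.random((96, 144, 3))                                        # tiles into a 2 x 3 grid
weight = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, HIDDEN_DIM))   # random stand-in
pos_emb = rng.normal(scale=0.02, size=(6, HIDDEN_DIM))                  # one vector per patch slot

tokens = embed_image(image, weight, pos_emb)
print(tokens.shape)  # (6, 32): six image tokens ready to be concatenated with text tokens
```

The image path here is a reshape and a single matrix multiply, which is consistent with the post's point that it can be far cheaper than running a separate vision encoder.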

Autoencoder by hand✍️Excel~ I designed this exercise to show how an Encoder-Decoder network converts input to code and reconstructs input from code. It is annotated with equations, PyTorch, and graphs. I also made a medium version.👇Join the 'AI Math' community. Download xlsx.

Tom Yeh

54,482 views • 1 year ago

THE BEST visual explainer of how information propagates through a transformer.

If you want to have more than intuition about how the Transformer architecture is ruling the LLM world - open-source project explains everything about LLM Transformer Models!

A great resource for anyone looking to gain a deeper understanding of how Transformer-based AI models like GPT work, including:
- Self-attention mechanisms
- Encoder-decoder architecture
- Positional encoding
- Multi-head attention

Rohan Paul

106,897 views • 1 year ago

Protein language models just got an upgrade. Meet Profluent-E1: a free, flexible, frontier protein sequence encoder. E1 is built with retrieval augmentation to learn from multiple sequences. Models trained over 4T tokens with only 150M-600M params, E1 is SOTA for zero-shot functional and unsupervised structural tasks. It raises the bar for protein representation learning and is freely available today.

Profluent

222,106 views • 8 months ago

TESLA FSD V14.2 HITS HARD – V14.3 IS ABOUT TO HIT HARDER

V14.2 shows up like a “small update” and then drives circles around half the country. Classic Tesla.

Iteration Punch:
• V14.2’s upgraded vision encoder acts like it’s on caffeine
• Human gestures? It reads them better than most people do
• Seventh drop in the same version – Tesla iterates like it’s sprinting
• V14.3? That’s the one insiders are quietly bracing for

If V14.2 hits this hard, V14.3 is coming in with a swing!

Source: Elon Musk

Mario Nawfal

232,465 views • 8 months ago

🏆 We're thrilled to announce that Meta FAIR’s Brain & AI team won 1st place at the prestigious Algonauts 2025 brain modeling competition. Their 1B parameter model, TRIBE (Trimodal Brain Encoder), is the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas, and individuals. The approach combines pretrained representations of several foundational models from Meta – text (Llama 3.2), audio (Wav2Vec2-BERT from Seamless) and video (V-JEPA 2) – to predict a very large amount (80 hours per subject) of spatio-temporal fMRI brain responses to movies acquired by the Courtois NeuroMod project Download the code: Read the paper: Learn about the challenge: Download the data:

AI at Meta

1,093,417 views • 11 months ago

Vector Database by Hand ✍️

Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG).

How do vector databases work?

[1] Given
↳ A dataset of three sentences, each with 3 words (or tokens)
↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 for mistral-7b).

Process "how are you"

[2] 🟨 Word Embeddings
↳ For each word, look up the corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size.
↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096).

[3] 🟩 Encoding
↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word.
↳ Here, the encoder is a simple one-layer perceptron (linear layer + ReLU).
↳ In practice, the encoder is a transformer or one of its many variants.

[4] 🟩 Mean Pooling
↳ Merge the sequence of feature vectors into a single vector using "mean pooling", which is to average across the columns.
↳ The result is a single vector. We often call it "text embeddings" or "sentence embeddings."
↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common.

[5] 🟦 Indexing
↳ Reduce the dimensions of the text embedding vector by a projection matrix. The reduction rate is 50% (4->2).
↳ In practice, the values in this projection matrix are much more random.
↳ The purpose is similar to that of hashing, which is to obtain a short representation to allow faster comparison and retrieval.
↳ The resulting dimension-reduced index vector is saved in the vector storage.

[6] Process "who are you"
↳ Repeat [2]-[5]

[7] Process "who am I"
↳ Repeat [2]-[5]

Now we have indexed our dataset in the vector database.

[8] 🟥 Query: "am I you"
↳ Repeat [2]-[5]
↳ The result is a 2-d query vector.

[9] 🟥 Dot Products
↳ Take the dot product between the query vector and the database vectors. They are all 2-d.
↳ The purpose is to use the dot product to estimate similarity.
↳ By transposing the query vector, this step becomes a matrix multiplication.

[10] 🟥 Nearest Neighbor
↳ Find the largest dot product by linear scan.
↳ The sentence with the highest dot product is "who am I".
↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like Hierarchical Navigable Small Worlds (HNSW).

Tom Yeh

191,994 views • 2 years ago
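
Steps [2] to [10] of the post above map onto a short NumPy script. This is a toy sketch under stated assumptions: the vocabulary holds only the six words that occur in the example sentences (the post's table has 22 entries), the embedding size of 4 is assumed, and every matrix is random. With random weights the winning sentence will generally differ from the hand-worked spreadsheet, so no particular match is claimed.

```python
import numpy as np

rng = np.random.default_rng(0)

SENTENCES = ["how are you", "who are you", "who am I"]
VOCAB = sorted({w for s in SENTENCES for w in s.split()})   # 6 words, not the post's 22
WORD_TO_ID = {w: i for i, w in enumerate(VOCAB)}

EMB_DIM, FEAT_DIM, INDEX_DIM = 4, 4, 2    # 4 -> 2 matches step [5]; EMB_DIM is an assumption

E = rng.normal(size=(len(VOCAB), EMB_DIM))   # [2] word embedding table (random stand-in)
W = rng.normal(size=(EMB_DIM, FEAT_DIM))     # [3] encoder weights: one linear layer
b = rng.normal(size=FEAT_DIM)                # [3] encoder bias
P = rng.normal(size=(FEAT_DIM, INDEX_DIM))   # [5] projection matrix


def embed(sentence: str) -> np.ndarray:
    """Run steps [2]-[5] for one sentence and return its 2-d index vector."""
    ids = [WORD_TO_ID[w] for w in sentence.split()]
    x = E[ids]                          # [2] look up one embedding per word
    h = np.maximum(x @ W + b, 0.0)      # [3] linear layer + ReLU
    pooled = h.mean(axis=0)             # [4] mean pooling over the words
    return pooled @ P                   # [5] dimension reduction 4 -> 2


# [6]-[7] index every sentence; rows are the stored index vectors
index = np.stack([embed(s) for s in SENTENCES])


def search(query: str):
    """Steps [8]-[10]: embed the query, score by dot product, pick the best by linear scan."""
    q = embed(query)                    # [8]
    scores = index @ q                  # [9] all dot products as one matrix-vector product
    best = int(np.argmax(scores))       # [10] exact nearest neighbor, no ANN
    return SENTENCES[best], scores


match, scores = search("am I you")
print(match, scores)
```

A production system would swap the linear scan in `search` for an approximate index such as HNSW, as the post notes in step [10].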

Most imitation learning policies break when the camera moves or the robot changes. NOT THIS ONE 👇

[📍 Bookmark for later ]

A new 3D scene representation encoder tackles this by enabling zero-shot generalization to unseen embodiments and viewpoints… And it works with any IL algorithm.

The trick?
• Use a 2D foundation model to extract semantic features
• Lift them into 3D space for localization (not semantics)
• Condition the IL policy on this spatially grounded vector

Across 93 simulated and 6 real tasks, Adapt3R:
✅ Maintains IL performance on LIBERO & MimicGen benchmarks
✅ Outperforms DP3 and 3D Diffuser Actor in most settings
✅ Holds >80% success on LIBERO even with large camera rotations

Thanks for sharing this, Animesh Garg & Albert Wilcox!

📍Paper: Website: Code:

Ilir Aliu

12,178 views • 11 months ago

Disappointed with your ICLR paper being rejected?

Ten years ago today, Sergey and I finished training some of the first end-to-end neural nets for robot control 🤖

We submitted the paper to RSS on January 23, 2015. It was rejected for being "incremental" and "unlikely to have much impact"

Our resubmission to NeurIPS was also rejected

It now has >4,000 citations (and more importantly, end-to-end training is widely accepted!)

It's also cool to think about what's changed and what's the same --
- The network was 92k parameters and trained on ~15 minutes of data
- The code was a combination of matlab, caffe, ROS, a custom CUDA kernel for speed, and a low-level 20 Hz controller in C++, all talking to each other. ROS+matlab was as bad as it sounds.
- We pre-trained the encoder and did inference off-board on a workstation with a larger GPU.
- We were paranoid about varying lighting messing up the network, so we did all the experiments after sunset (so long nights running experiments on the robot past 3 am)

Now, we have manipulation policies that are far more dextrous, far more generalizable, and maybe on the cusp of breaking into the real world. :)

(the paper:

Chelsea Finn

169,101 views • 1 year ago

Testing the new Gemma 4 12B (QAT) vision and OCR capabilities locally with LM Studio.

# The setup:
- GPU: NVIDIA RTX 4060 (8GB VRAM)
- CPU: Intel i7
- Runner: LM Studio
- Config: 32k context, 38 layers offloaded, Flash Attention enabled
- Speed: ~14 tokens/sec decode throughput

# The test:
I gave it a screenshot of Google AI Studio.
Prompt: "clone this. give me a single html file"

# The result:
A solid one shot replication. It successfully mapped out the layout, recognized the UI text, and structured the divs correctly, with only minor differences from the original. Results available at the end of the video.

Quite capable for a 12B model running on budget consumer hardware. A gpu that costs only $300.

# Why the architecture under the hood is notable:
Unlike traditional models that rely on heavy, separate vision and audio encoders, Gemma 4 12B uses a unified, encoder free architecture. It bypasses separate multi stage encoders. It uses a 35M parameter vision embedder to project raw 48x48 pixel patches directly to the LLM hidden dimension.

Local multimodal development is becoming highly accessible on standard hardware.

If you've spun up Gemma 4 12B locally, what setup are you using and what kind of throughput are you seeing?

Alok

25,717 views • 1 month ago

Here's what The Browser Company's AI eng & ML teams are working on for Dia right now:

(This is a pitch to come work for us; info at end)

🤖 COMPUTER USE – we've built our own bespoke APIs on top of Chromium to optimize latency, accuracy, and cost of computer-using agents. Demo attached. Big breakthroughs here in recent weeks.

🛡️ ON-DEVICE MODELS – we've built our own custom infra to run everything from encoder-only models to full LLMs on device. It's cross-platform, supports LoRA adapters, and is optimized for the GPU. This system preserves privacy and enables fast inference times.

🧠 MEMORY – with your permission, Dia automatically tailors your AI experiences to you, personally, based on the tabs you open while browsing normally every day. We're also bringing vertical memory to specific features.

♻️ DATA FLYWHEELS – our Fall/Winter P0 is to double-down on training custom models based on implicit signals from daily use of Dia. Dia should get smarter and more useful the more people use it. Whether via RL, auto-generated prompts, or otherwise.

If this work sounds interesting to you please visit our jobs page or email careers@thebrowser.company. Hiring nearly every related role -- from ML engineers to people prototyping with AI and context/prompt writers -- everyone encouraged to apply!!

Josh Miller

67,937 views • 11 months ago

This is my "feel the AGI" moment: I used GPT-5.6 Sol to train my own autocorrect model that outperforms GPT-5.6 Sol (wtf??)

I have no ML background. I have no idea what I'm doing. I just kept pushing Sol until it spat out a SOTA model. And I spent $0.

The motivation: Years of talking to AI have made me terrible at typing. Rather than fix my skill issue, I decided to throw more AI at it.

My idea was: instead of autocorrect that interrupts my flow, I want to type fast with mistakes and have AI clean it up after. I wanted the smallest local model possible, for speed, for battery life, for science! So I decided to train my own.

Inspired by Andrej Karpathy’s autoresearch, I ran Codex /goal with this setup: pick an experiment, try it, record the results to a doc, throw it out if it fails, and plan the next experiment without repeating failures. I gave a few examples that had to pass, tight latency targets, and let it run.

Sol did some amazing things. First, it scanned benchmarks and shortlisted base models: Qwen 3.5, Gemma 4, Liquid LFM 2.5. It found a dataset on HuggingFace for typed text. Then it built a simulator for fingers striking a Mac keyboard, modeling the physical layout with a Gaussian distribution around each key. It simulated striking the wrong key, wrong order, fat-fingering, etc.

With the models + data + simulator, it fine-tuned using MLX right on my MacBook. It had a working prototype within an hour! But accuracy was pretty poor.

Problem 1: Tokenization

Sol read papers, ran tests, and identified that the tokenizer was the bottleneck. Tokenization makes typos hard for the model to see, so it memorizes mappings instead of using its language priors.

Sol tried ByT5, Google’s tokenizer-free byte-level LLM. This made a big improvement, but the model is old and lacked the knowledge needed to reach Sol performance.

Sol dug deeper and realized a tokenizer-free model isn’t needed; instead, it used T5Gemma, an encoder-decoder model. This can understand the input deeply before producing output, and furthermore, Sol could post-train the encoder to improve performance. This gave a much higher ceiling.

Problem 2: Loss function

Now the model was correcting some typos perfectly, but ignoring most. Sol realized that standard cross-entropy loss was teaching the model to avoid edits, because the vast majority of characters in the training data were left unmodified.

The fix was wild: Sol wrote a custom loss function that byte-aligns the source and target strings, uses a dynamic programming algorithm to compute the minimum edits between the two, then weights correct edits much higher than copies. After a lot of tuning, this dramatically improved accuracy.

Problem 3: Autoregression

One failure mode remained: if the model made a mistake, it couldn’t backtrack. It could only predict the next token. Teaching it to “think” like a reasoning model would solve this, but would be far too slow.

Sol found a beautiful solution: instead of greedily predicting the next token, beam search over all possibilities. This parallelizes the exploration instead of one linear chain-of-thought. At the end, choose the path with highest cumulative log probability.

This worked great, but made the experience worse, since the user wouldn’t see progress until the whole search was done. To fix this, Sol made a clever observation: after each search step, the longest common prefix among surviving branches is guaranteed to appear in the final result, so it can be displayed immediately. As the search progresses, weaker paths are dropped and the prefix grows, so the user sees continuous progress.

Sol built all this as a custom MLX pipeline that does the parallel decoding on the MacBook GPU, with just ~40ms TTFT. It’s crazy fast and entirely local.

Final eval (error reduction rate, higher is better):
- Apple autocorrect: 49.66%
- GPT-5.6 Luna: 82.47%
- GPT-5.6 Terra: 87.64%
- GPT-5.6 Sol: 90.56%
- Our model (1.7B): 91.02%

Final cost:
- 1 quota reset (thanks Tibo)
- $0

(And yes, I verified there's no cheating. In fact, we test words scrubbed from the training data to prove the model isn’t memorizing)

There were a ton more details and tangents I could write about: contrastive learning, GRPO, DPO, dynamic masking, and more. Sol is a fascinating and creative model. It blew my mind so many times. Don’t let a lack of experience stop you: Sol makes AI experiments accessible to anyone!
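
The loss described under "Problem 2" is not published in the post, so the following is only one plausible reading of it: align source and target bytes with a Levenshtein dynamic program, then give target bytes that result from an edit a larger weight than bytes that are plain copies. The function names, the default weight of 5.0, and the rule for charging deletions to the following byte are all assumptions made for illustration. How byte weights would be mapped onto the model's tokens is also left open.

```python
def edit_weights(source: str, target: str, edit_weight: float = 5.0) -> list[float]:
    """Return one weight per target byte: 1.0 for copies, edit_weight for edited bytes."""
    src, tgt = source.encode("utf-8"), target.encode("utf-8")
    n, m = len(src), len(tgt)

    # dist[i][j] = minimum number of edits turning src[:i] into tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a source byte
                             dist[i][j - 1] + 1,         # insert a target byte
                             dist[i - 1][j - 1] + cost)  # copy or substitute

    # Walk the table backwards to recover one optimal alignment.
    weights = [1.0] * m
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                  # copy: keep weight 1.0
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            weights[j - 1] = edit_weight         # substitution
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            weights[j - 1] = edit_weight         # insertion
            j -= 1
        else:
            if j < m:
                weights[j] = edit_weight         # deletion: charge the byte that follows (assumption)
            i -= 1
    return weights


def weighted_nll(token_nlls: list[float], weights: list[float]) -> float:
    """Weighted mean of per-position negative log-likelihoods."""
    return sum(w * l for w, l in zip(weights, token_nlls)) / sum(weights)


print(edit_weights("teh cat", "the cat"))  # [1.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]
```

In the example, the swapped letters are the only positions that get the higher weight, so a model that copies its input unchanged is penalized mostly where it should have edited.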

This is my "feel the AGI" moment: I used GPT-5.6 Sol to train my own autocorrect model that outperforms GPT-5.6 Sol (wtf??) I have no ML background. I have no idea what I'm doing. I just kept pushing Sol until it spat out a SOTA model. And I spent $0. The motivation: Years of talking to AI have made me terrible at typing. Rather than fix my skill issue, I decided to throw more AI at it. My idea was: instead of autocorrect that interrupts my flow, I want to type fast with mistakes and have AI clean it up after. I wanted the smallest local model possible, for speed, for battery life, for science! So I decided to train my own. Inspired by Andrej Karpathy’s autoresearch, I ran Codex /goal with this setup: pick an experiment, try it, record the results to a doc, throw it out if it fails, and plan the next experiment without repeating failures. I gave a few examples that had to pass, tight latency targets, and let it run. Sol did some amazing things. First, it scanned benchmarks and shortlisted base models: Qwen 3.5, Gemma 4, Liquid LFM 2.5. It found a dataset on HuggingFace for typed text. Then it built a simulator for fingers striking a Mac keyboard, modeling the physical layout with a Gaussian distribution around each key. It simulated striking the wrong key, wrong order, fat-fingering, etc. With the models + data + simulator, it fine-tuned using MLX right on my MacBook. It had a working prototype within an hour! But accuracy was pretty poor. — Problem 1: Tokenization Sol read papers, ran tests, and identified that the tokenizer was the bottleneck. Tokenization makes typos hard for the model to see, so it memorizes mappings instead of using its language priors. Sol tried ByT5, Google’s tokenizer-free byte-level LLM. This made a big improvement, but the model is old and lacked the knowledge needed to reach Sol performance. Sol dug deeper and realized a tokenizer-free model isn’t needed; instead, it used T5Gemma, an encoder-decoder model. 
This can understand the input deeply before producing output, and furthermore, Sol could post-train the encoder to improve performance. This gave a much higher ceiling. — Problem 2: Loss function Now the model was correcting some typos perfectly, but ignoring most. Sol realized that standard cross-entropy loss was teaching the model to avoid edits, because the vast majority of characters in the training data were left unmodified. The fix was wild: Sol wrote a custom loss function that byte-aligns the source and target strings, uses a dynamic programming algorithm to compute the minimum edits between the two, then weights correct edits much higher than copies. After a lot of tuning, this dramatically improved accuracy. — Problem 3: Autoregression One failure mode remained: if the model made a mistake, it couldn’t backtrack. It could only predict the next token. Teaching it to “think” like a reasoning model would solve this, but would be far too slow. Sol found a beautiful solution: instead of greedily predicting the next token, beam search over all possibilities. This parallelizes the exploration instead of one linear chain-of-thought. At the end, choose the path with highest cumulative log probability. This worked great, but made the experience worse, since the user wouldn’t see progress until the whole search was done. To fix this, Sol made a clever observation: after each search step, the longest common prefix among surviving branches is guaranteed to appear in the final result, so it can be displayed immediately. As the search progresses, weaker paths are dropped and the prefix grows, so the user sees continuous progress. Sol built all this as a custom MLX pipeline that does the parallel decoding on the MacBook GPU, with just ~40ms TTFT. It’s crazy fast and entirely local. 
Final eval (error reduction rate, higher is better):
- Apple autocorrect: 49.66%
- GPT-5.6 Luna: 82.47%
- GPT-5.6 Terra: 87.64%
- GPT-5.6 Sol: 90.56%
- Our model (1.7B): 91.02%

Final cost:
- 1 quota reset (thanks Tibo)
- $0

(And yes, I verified there's no cheating. In fact, we test words scrubbed from the training data to prove the model isn’t memorizing)

There were a ton more details and tangents I could write about: contrastive learning, GRPO, DPO, dynamic masking, and more. Sol is a fascinating and creative model. It blew my mind so many times. Don’t let a lack of experience stop you: Sol makes AI experiments accessible to anyone!
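For readers curious about the loss function described under Problem 2, here is a rough sketch of the idea, not the code Sol actually wrote. It assumes a plain Levenshtein alignment over UTF-8 bytes, a hypothetical `edit_weight` of 5.0 (the post only says edits were weighted "much higher"), and per-byte losses supplied by the caller; mapping byte weights onto model tokens is left out.

```python
from typing import List, Sequence


def edit_weights(source: str, target: str,
                 edit_weight: float = 5.0,
                 copy_weight: float = 1.0) -> List[float]:
    """Give each target byte a loss weight: high if it is an edit, low if copied."""
    src, tgt = source.encode("utf-8"), target.encode("utf-8")
    n, m = len(src), len(tgt)
    # dist[i][j] = minimum number of edits turning src[:i] into tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a source byte
                             dist[i][j - 1] + 1,         # insert a target byte
                             dist[i - 1][j - 1] + cost)  # copy or substitute
    # Walk back through the table to label every target byte.
    weights = [copy_weight] * m
    i, j = n, m
    while j > 0:
        if i > 0 and src[i - 1] == tgt[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                          # copied byte
        elif i > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            weights[j - 1] = edit_weight                 # substituted byte
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i][j - 1] + 1:
            weights[j - 1] = edit_weight                 # inserted byte
            j -= 1
        else:
            i -= 1                                       # deleted source byte
    return weights


def weighted_nll(per_byte_nll: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-byte negative log-likelihoods."""
    if not weights:
        return 0.0
    total = sum(w * loss for w, loss in zip(weights, per_byte_nll))
    return total / sum(weights)


print(edit_weights("teh cat", "the cat"))
# prints [1.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]
```

One limitation of this sketch: a deleted source byte has no target position of its own, so deletions receive no extra weight here. A fuller version would need to decide where that weight goes.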

Anshu

177,702 views • 13 days ago

Before the week ends, let's acknowledge one of the most INSANE weeks ever for open AI, with 25+ notable open-weight drops across every modality:

🧠 LLMs
→ NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models.
→ Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week.
→ StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0.
→ Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week.
→ JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0.

🎨 Image gen (the surprise of the week)
→ Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights.

🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped)
→ Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA.
→ RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0.
→ Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos.
→ NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B.

👁️ Vision & VLMs
→ PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0.
→ Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0.

🎬 Video, 3D & World Models
→ NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI.
→ JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3.
→ ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).

Victor M

539,883 views • 1 month ago