Han Xiao ✈️ ICML 2026

@hxiao • 19,310 subscribers

VP, AI @Elastic, prev: founder & CEO @JinaAI_

Shorts

I love Clawdbot, but most of what it does can be replicated with just Claude Code --dangerously-skip-permissions plus a pipe via Telegram. I made a simple version using Cloudflare Tunnel + tmux + a Stop hook.

317,139 views
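
A minimal sketch of two glue pieces such a bridge might need, assuming Claude Code runs inside a tmux session (the session name "claude" is hypothetical) and a Telegram bot relays messages. Incoming Telegram text is typed into the pane with tmux send-keys; outgoing replies (for example, a final message picked up by a Stop hook) are split to fit Telegram's 4096-character message limit and posted with the Bot API's sendMessage method. The function names are illustrative and not taken from the author's project; the Cloudflare Tunnel webhook and the HTTP call itself are left out.

```python
from typing import Dict, List, Union

TELEGRAM_MAX_CHARS = 4096  # sendMessage rejects texts longer than this


def tmux_send_argv(session: str, text: str) -> List[List[str]]:
    """Commands that type `text` into the Claude Code pane and then press Enter.

    `-l` makes tmux send the text literally instead of looking up key names,
    so a message containing words like "Enter" or "C-c" is typed as-is.
    """
    return [
        ["tmux", "send-keys", "-t", session, "-l", text],
        ["tmux", "send-keys", "-t", session, "Enter"],
    ]


def chunk_for_telegram(text: str, limit: int = TELEGRAM_MAX_CHARS) -> List[str]:
    """Split a long reply into pieces that each fit in one Telegram message."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def send_message_payload(chat_id: int, text: str) -> Dict[str, Union[int, str]]:
    """JSON body for POST https://api.telegram.org/bot<TOKEN>/sendMessage."""
    return {"chat_id": chat_id, "text": text}
```

A webhook handler would run each argv with subprocess.run(argv, check=True), and the reply side would POST one payload per chunk. Building argv lists instead of shell strings avoids quoting problems with arbitrary chat text.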

Saw a poster at ICLR showing that contrastive learning (InfoNCE included!) is approximately a closed-form spectral decomposition in an RKHS. Got curious whether a map could adapt embedding spaces across model families while preserving NDCG retrieval performance. Did some experiments on my flights back to SF. The video shows different maps on simple Swiss roll data.

14,292 views
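
One simple baseline for such a map, not necessarily what the author used, is orthogonal Procrustes: given paired embeddings of the same items from two models, find the rotation that best aligns them, then check whether retrieval NDCG survives the mapping. The sketch below assumes NumPy, equal dimensionality in both spaces (different dimensions would need a general linear map instead), and a toy setup in which query i's only relevant document is document i.

```python
import numpy as np


def fit_procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimizing ||X @ W - Y||_F; row i of X and Y embed the same item."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def ndcg_at_k(ranked_relevance, k: int) -> float:
    """NDCG@k for relevance labels listed in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = float((rel[:k] * discounts).sum())
    idcg = float((np.sort(rel)[::-1][:k] * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0


def retrieval_ndcg(queries: np.ndarray, docs: np.ndarray, k: int = 10) -> float:
    """Mean NDCG@k when query i's only relevant document is docs[i] (cosine ranking)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = q @ d.T
    per_query = []
    for i, row in enumerate(scores):
        order = np.argsort(-row)  # best-scoring documents first
        per_query.append(ndcg_at_k((order == i).astype(float), k))
    return float(np.mean(per_query))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 32))                   # toy "model A" embeddings
    R, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # unknown rotation between spaces
    Y = X @ R + 0.05 * rng.normal(size=X.shape)      # noisy toy "model B" embeddings
    W = fit_procrustes(X[:400], Y[:400])             # fit on a training split
    print("NDCG@10 on held-out pairs:", retrieval_ndcg(X[400:] @ W, Y[400:]))
```

Because W is orthogonal it preserves norms and angles within the source space, so any NDCG loss comes from how well the two spaces actually align rather than from distortion introduced by the map.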

Videos

Not a fan of knowledge graphs, but recently I've started using them more often for a surprising reason: building non-trivial private verifiers for agentic search. For those who don't know, building a private eval set for a scaffolded LLM in 2026 is really challenging, seriously hard. It takes a lot of effort to find a question that's non-trivial for a scaffolded LLM yet still answerable. To find such question-answer pairs, I built a knowledge graph extractor: you throw a corpus at it, and it extracts entity relations using qwen3.6-35b-a3b-MTP on an L4 at 70 tokens/s (which is really good for such a low-budget GPU). Then I pick out the longest path in the graph and use it to generate challenging question-answer pairs. The idea is to find genuinely multi-hop fact chains that are verifiable from the corpus and use them to stress-test the agentic search system.

57,850 views • 1 month ago
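
A minimal sketch of the longest-chain step, assuming the extractor emits (head, relation, tail) triples. It uses exhaustive DFS for the longest simple path, which is exponential in the worst case (the problem is NP-hard in general), so it only suits small graphs or individual components. The question template in chain_to_qa is a hypothetical stand-in for the LLM-written questions the post describes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def longest_chain(triples: List[Triple]) -> List[Triple]:
    """Longest simple directed path over the extracted triples, via exhaustive DFS.

    Worst-case exponential, so only suitable for small graphs or per-component use.
    """
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
    nodes = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    best: List[Triple] = []

    def dfs(node: str, visited: set, path: List[Triple]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for rel, nxt in graph.get(node, []):
            if nxt not in visited:  # simple path: never revisit an entity
                visited.add(nxt)
                path.append((node, rel, nxt))
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    for start in nodes:
        dfs(start, {start}, [])
    return best


def chain_to_qa(chain: List[Triple]) -> Tuple[str, str]:
    """Hypothetical template: name the start entity and relations, hide the intermediates."""
    hops = " -> ".join(rel for _, rel, _ in chain)
    question = f"Starting from {chain[0][0]}, follow: {hops}. Which entity do you reach?"
    return question, chain[-1][2]
```

Hiding the intermediate entities is what forces multi-hop search: each hop must be resolved against the corpus before the next relation can be followed, while the answer stays verifiable because every edge came from an extracted fact.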

If you only have 60s of attention for Kimi's Attention Residuals paper, watch this.

84,647 views • 4 months ago

Low-bit quantized weights make the embedding model lose all its discriminative power. I plotted the cosine similarity matrix of jina-v5, and you can see that low-bit quantization makes the model essentially blind: the off-diagonal similarities are quite high at Q1/Q2/Q3, meaning everything looks similar in the semantic space. Q4 is the sweet spot where model quality becomes acceptable.

63,042 views • 3 months ago
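
The "everything looks similar" symptom can be summarized as one number: the mean off-diagonal cosine similarity of a batch of embeddings. A plain-Python sketch follows; the helper names are hypothetical, and producing the embeddings at each quantization level (Q1 to Q4) is out of scope. Values close to 1.0 mean the embeddings are no longer discriminative.

```python
import math
from typing import List, Sequence


def cosine_similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; assumes no zero vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def mean_off_diagonal(matrix: List[List[float]]) -> float:
    """Average similarity between different items; near 1.0 means the space has collapsed."""
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two embeddings")
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

Computing this score for the same set of texts at each quantization level gives a compact curve to accompany the heatmaps.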

Gave my autoresearch x TTC x retrieval talk at AIE yesterday and ran it again at home today. Things are moving so fast that after ICML next week I'll probably be onto something new, and these ideas will already be disposable. So before that happens, I want to drop a summary of the last 2 months of experiments across several projects, mostly so future me remembers they existed.

17,877 views • 29 days ago

After TurboQuant and qwen3.5-35b-a3b, I got curious: how realistic is it to use the KV cache as a document store today, i.e. vectorless, RAG-less search? So I prefilled 258K of the 262K context window on an L4 (a budget GPU popular in production). ~99% of the slot is precomputed and stored, and users load it on the fly in ~1 s. The system prompt and query are appended to the end; generation takes ~3K tokens, which is enough for search. At a 99% fill rate, decoding runs at ~20 tokens/s on the L4. I prepared some ego datasets (Jina papers, which I know best), plus popular novels in Chinese and English. The results are actually pretty good: some hallucination, but most answers are solid and well-grounded. What's more interesting is the cost: ~$0.26/h on an L4 spot instance. A single LLM. No vector database, no embedding model, no workflow/pipeline engineering. Using the KV cache as a document store is nothing new (see the old CAG paper), but with quantized KV caches and modern attention (hybrid SSM-attention, GQA, MQA, MLA), the economics are changing fast. If we solve cold-prefill speed and decoding speed, and budget GPU costs keep dropping, the future of search could be vectorless. Radical, but possible.

42,424 views • 4 months ago
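
The cost claim can be checked with back-of-the-envelope arithmetic using the numbers in the post: ~3K generated tokens at ~20 tokens/s is about 150 s of decoding, which at ~$0.26/h comes to roughly $0.011 per query. This assumes one query occupies the GPU at a time and ignores the ~1 s cache load. The KV-cache size helper uses the standard 2 × attention layers × KV heads × head dim × tokens × bytes-per-element formula; the post does not give the qwen3.5-35b-a3b layer and head configuration, so any numbers plugged into it are the reader's own assumptions.

```python
def kv_cache_bytes(n_tokens: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> float:
    """Size of a KV cache: keys and values (factor 2) for every attention layer and token.

    In hybrid SSM-attention models only the attention layers need a per-token cache,
    so pass the number of attention layers, not the total layer count.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def decode_seconds(n_new_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock decode time at a steady generation speed."""
    return n_new_tokens / tokens_per_second


def cost_per_query_usd(hourly_price_usd: float, busy_seconds: float) -> float:
    """GPU cost of one query, assuming it has the GPU to itself for busy_seconds."""
    return hourly_price_usd * busy_seconds / 3600.0


if __name__ == "__main__":
    # Numbers from the post: ~3K generated tokens at ~20 tokens/s on an L4 at ~$0.26/h.
    seconds = decode_seconds(3_000, 20)
    print(f"decode time ~ {seconds:.0f} s, cost ~ ${cost_per_query_usd(0.26, seconds):.4f}/query")
```

The formula also shows why GQA/MQA and KV quantization matter here: cache size scales linearly with the number of KV heads and with bytes per element, which is what lets a ~258K-token cache fit on a budget GPU.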

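Rationale is an analysis tool launched by our company, built specifically for managers and decision-makers. It integrates the latest GPT-3.x and in-context learning techniques to quickly generate Pros & Cons and SWOT analysis reports, helping managers and individuals make informed decisions. Michael Anti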

149,353 views • 3 years ago

Given an embedding vector, you can tell which model produced it. I trained a 0.8M-parameter transformer that fingerprints embedding models by reading raw float digits (vocab size: 15). Fully end-to-end, zero feature engineering.

19,695 views • 4 months ago
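
The post gives only the vocabulary size, so here is a hypothetical 15-symbol digit tokenizer showing how an embedding could be turned into a token sequence for such a classifier: ten digits, "-", ".", a dimension separator, and BOS/PAD tokens. The symbol set, the fixed-point formatting, and the function names are assumptions; the 0.8M-parameter transformer itself is not shown.

```python
from typing import Dict, List, Sequence

# Hypothetical 15-token vocabulary: the post states the size, not the symbols.
VOCAB: List[str] = ["<pad>", "<bos>", "|", "-", "."] + [str(d) for d in range(10)]
TOKEN_ID: Dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def encode_embedding(vector: Sequence[float], decimals: int = 4) -> List[int]:
    """Turn an embedding into digit tokens: <bos> d0 | d1 | ... (finite values only)."""
    ids = [TOKEN_ID["<bos>"]]
    for k, value in enumerate(vector):
        if k > 0:
            ids.append(TOKEN_ID["|"])   # separator between dimensions
        text = f"{value:.{decimals}f}"  # fixed-point, so no exponent symbol is needed
        ids.extend(TOKEN_ID[ch] for ch in text)
    return ids


def pad_batch(sequences: List[List[int]], length: int) -> List[List[int]]:
    """Truncate or right-pad each sequence to a fixed length for batching."""
    return [s[:length] + [TOKEN_ID["<pad>"]] * max(0, length - len(s)) for s in sequences]
```

Keeping the raw digits rather than the float values lets the model pick up formatting and precision quirks directly, which is one plausible reading of "zero feature engineering".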
