
Han Xiao
@hxiao • 17,889 subscribers
VP, AI @Elastic prev: founder & ceo @JinaAI_

Not a fan of knowledge graphs, but recently I started using them more often for a surprising reason: to build non-trivial private verifiers for agentic search. For those who don't know, building a private eval set for a scaffolded LLM in 2026 is really challenging, like seriously hard. It takes a lot of effort to find a question that's non-trivial for a scaffolded LLM yet still answerable. To find those question-answer pairs, I built a knowledge graph extractor: you throw a corpus at it, and it extracts the entity relations using qwen3.6-35b-a3b-MTP on an L4 at 70 tps (which is really good for such a low-budget GPU). Then I mark out the longest path in the graph and use it to generate challenging question-answer pairs. The idea is to find genuinely multi-hop fact chains that are verifiable from the corpus, to stress-test the agentic search system.
Han Xiao • 56,052 views • 4 days ago
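
A minimal sketch of the longest-path idea described above, not the actual extractor. The triples are hypothetical stand-ins for what an LLM-based extractor might return, and the names `build_graph`, `longest_path`, and `chain_to_qa` are made up for illustration. The search is a brute-force DFS over simple paths, which is only practical for small graphs, and the question template is a placeholder for whatever phrasing step a real pipeline would use.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency list: entity -> list of (relation, neighbour entity)."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, [])
    return graph

def longest_path(graph: Dict[str, List[Tuple[str, str]]]) -> List[Triple]:
    """Longest simple path (no repeated entity), found by exhaustive DFS."""
    best: List[Triple] = []

    def dfs(node: str, visited: set, path: List[Triple]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for rel, nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                path.append((node, rel, nxt))
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    for start in graph:
        dfs(start, {start}, [])
    return best

def chain_to_qa(path: List[Triple]) -> Tuple[str, str]:
    """Turn a non-empty fact chain into a (question, answer) pair via a plain template."""
    head, answer = path[0][0], path[-1][2]
    hops = " -> ".join(rel for _, rel, _ in path)
    question = f"Starting from '{head}', follow the relations {hops}. Which entity do you reach?"
    return question, answer

# Hypothetical extractor output for a toy corpus.
triples = [
    ("Paper A", "authored_by", "Alice"),
    ("Alice", "works_at", "Lab X"),
    ("Lab X", "located_in", "Berlin"),
    ("Paper B", "authored_by", "Alice"),
]
path = longest_path(build_graph(triples))
question, answer = chain_to_qa(path)
print(question)
print(answer)
```

Every hop in the chain is a triple taken from the corpus, so the final answer stays verifiable even though the question needs several lookups to solve.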

Low-bit quantized weights make the embedding model lose all discriminative power. I plotted the cosine similarity matrix of jina-v5, and one can see that low-bit quantization makes the model really blind. The off-diagonal similarities are pretty high at Q1/Q2/Q3, meaning everything looks similar in the semantic space. Q4 is the sweet spot where model quality becomes acceptable.
Han Xiao • 62,912 views • 2 months ago
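
A rough sketch of the diagnostic mentioned above: build the pairwise cosine similarity matrix for a set of embeddings and average the off-diagonal entries. The vectors are toy values, not jina-v5 outputs, and the function names are made up for illustration. A mean close to 1 would mean every text looks alike to the model, while a lower mean suggests it still separates them.

```python
import math
from typing import List

def cosine_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarity between all non-zero row vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    return [
        [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]

def mean_off_diagonal(matrix: List[List[float]]) -> float:
    """Average similarity between distinct items (ignores the diagonal of 1s)."""
    n = len(matrix)
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Toy embeddings: orthogonal vectors are fully separated,
# parallel vectors are indistinguishable by cosine similarity.
separated = mean_off_diagonal(cosine_matrix([[1.0, 0.0], [0.0, 1.0]]))
collapsed = mean_off_diagonal(cosine_matrix([[1.0, 2.0], [2.0, 4.0]]))
print(separated, collapsed)
```

Running the same two functions on embeddings of one fixed text set at each quantization level would give a single number per level to compare.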

If you only have 60s of attention for Kimi's Attention Residuals paper, watch this.
Han Xiao • 84,514 views • 3 months ago

after turboquant and qwen3.5-35b-a3b, i got curious: how realistic is it to use kv cache as a document store today? to have vectorless, RAG-less search. so i prefilled 258K out of 262K context window on L4 (a budget GPU popular in prod). ~99% of the slot is pre-computed and stored, users load it on the fly in ~1s. system prompt + query append to the end, generation takes ~3K tokens, enough for search. at 99% fill rate, decoding runs ~20 tps on L4. i prepared some ego datasets (jina papers, which i know best), plus popular novels in chinese and english. the results are actually pretty good. some hallucination, but most answers are solid and well-grounded. what's more interesting is the cost: ~$0.26/h on L4 spot. single LLM. no vector database, no embedding model, no workflow/pipeline engineering. using kv cache as document store is nothing new, like the old CAG paper. but with quantized kv cache and modern attention (hybrid SSM-attention, GQA, MQA, MLA), the economics are changing fast. if we solve cold-prefill speed and decoding speed, and budget GPU costs keep dropping, the future of search could be vectorless. radical, but possible.
Han Xiao • 42,304 views • 2 months ago
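
a back-of-envelope sketch of the numbers in the post above, assuming K means 1,000 tokens, that a query generates the full ~3K tokens, and that one query occupies the GPU on its own for the whole decode. the function name and the per-query cost figure are illustrative, not something reported in the post.

```python
def kv_store_budget(context_window: int, prefilled: int, gen_tokens: int,
                    decode_tps: float, usd_per_hour: float) -> dict:
    """Simple arithmetic on the figures quoted in the post."""
    decode_seconds = gen_tokens / decode_tps
    return {
        "fill_rate": prefilled / context_window,          # share of the window holding documents
        "free_tokens": context_window - prefilled,        # room left for prompt, query, and answer
        "decode_seconds": decode_seconds,                 # time to generate gen_tokens
        "usd_per_query": decode_seconds / 3600 * usd_per_hour,
    }

budget = kv_store_budget(
    context_window=262_000,  # "262K"
    prefilled=258_000,       # "258K"
    gen_tokens=3_000,        # "~3K tokens"
    decode_tps=20,           # "~20 tps on L4"
    usd_per_hour=0.26,       # "~$0.26/h on L4 spot"
)
for key, value in budget.items():
    print(key, round(value, 4))
```

under these assumptions a full-length answer takes about 150 seconds of decoding, which costs roughly one cent at the quoted spot price. the ~1s cache load is not included.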