Submodular optimization for token/sentence selection from long contexts. Here's an interesting experiment: we first used jina-embeddings-v4's multi-vector feature to extract token-level embeddings from a passage, then applied submodular optimization to cherry-pick the tokens that provide the best coverage, and finally called the tokenizer to convert the selections back to the strings at their...
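The selection step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the experiment: the post does not say which submodular objective was used, so the sketch assumes a facility-location objective maximized greedily, and it substitutes random vectors for the jina-embeddings-v4 multi-vector output. The function name `greedy_facility_location`, the token list, and `k` are all hypothetical.

```python
import numpy as np


def greedy_facility_location(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick up to k rows that best 'cover' all rows.

    Assumed objective (not confirmed by the post): facility location,
    f(S) = sum_i max_{j in S} sim(i, j), which is monotone submodular
    when similarities are non-negative.
    """
    # Cosine similarity between every pair of token embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = np.clip(normed @ normed.T, 0.0, None)  # clip negatives to keep f monotone

    n = sim.shape[0]
    selected: list[int] = []
    best_cover = np.zeros(n)  # best similarity of each token to the selected set
    for _ in range(min(k, n)):
        # Marginal gain of candidate j: sum_i max(sim[i, j] - best_cover[i], 0).
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same token twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return sorted(selected)  # restore the original passage order


# Stand-in data: random vectors in place of real token-level embeddings.
rng = np.random.default_rng(0)
tokens = ["late", "chunk", "##ing", "keeps", "context", "in", "token", "embed", "##dings"]
token_embeddings = rng.normal(size=(len(tokens), 16))

picked = greedy_facility_location(token_embeddings, k=4)
print([tokens[i] for i in picked])
```

In a real run the rows of `token_embeddings` would come from the embedding model, and the selected indices would be mapped back to strings through the model's tokenizer. Greedy selection is a common choice here because, for monotone submodular objectives, it comes with a (1 - 1/e) approximation guarantee.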

13,161 views • 11 months ago • via X (Twitter)

3 Comments

Jina AI • 11 months ago

Try it on Google Colab's L4 GPU for free: This could be an interesting approach for extracting information from long documents, saving tokens for LLMs, etc. Check out our recent blog posts and learn more about submodular optimization.

Richard Collins, The Internet Foundation • 11 months ago

Can you scale to replace Google? Put your AI on it and put in the numbers. Back of the envelope or "in a spreadsheet" is better than "in your head somewhere as an idea only". It might be easier than you think now, if the whole Internet is coded as it goes in, not scraped, indexed, and tokenized later, completely separated from the authors and without their permission or help. Check my writing on "global open tokens", where all tokens are linked to the real things in the world, not arbitrary strings of characters in one language. With universal (global) tokens, "the sun", "the earth", and "water" are independent of human language, which ties things together. Yes, choose the things that matter, and keep it lean, sufficient, and sustainable, not a shotgun or brute-force approach that works only for people with big computers. For all humans, not just a few. Richard Collins, The Internet Foundation

Franck Lebeau • 11 months ago

Interesting how "Late chunking" is condensed into "lateing" (tokens "late" + "##ing"). As I understand it, this means that the semantics of "chunking" (tokens "chunk" + "##ing") are mainly carried by the contextualized embedding of "##ing".
