
Victoria Slocum

@victorialslocum • 9,878 subscribers

learning cool stuff, machine learning engineer at @weaviate_io 💙

Shorts

Two papers released in the past couple of months, one by Google and one by NVIDIA, argue that ordering the documents retrieved by RAG systems can enhance performance. However, they propose two different strategies for HOW these documents should be ordered 🤔

Both papers agree on two main points:

1️⃣ There's a fundamental issue in RAG: as more documents are retrieved, more irrelevant context (e.g., hard negatives) is introduced, which confuses the LLM and eventually degrades the quality of the generated output. Performance therefore follows an inverted-U curve as the number of retrieved documents grows.
2️⃣ Ordering the retrieved documents is a key lever for optimizing RAG performance.

Google Cloud researchers proposed ordering results based on relevance scores: the authors argue for relevance-based reordering, i.e. ordering the retrieved chunks by their similarity scores so that the most relevant documents sit at the beginning and the end of the input, to counter the "lost in the middle" effect.

NVIDIA researchers proposed ordering results based on the original sequence of document chunks: the authors argue for Order-Preserve RAG (OP-RAG), which maintains the logically coherent flow of the document. They keep the retrieved chunks in the order they appear in the source text instead of ranking them by relevance score.

So which one is right? It probably depends on the specific use case and dataset. Relevance-based reordering could perform better in tasks where you need fast access to the most critical information (e.g., fact retrieval, QA systems), while order-preserving RAG might be better where you need to understand the sequential structure of information (e.g., narrative or legal documents).

There are still so many uncertainties in AI - we don't actually know what we're doing, and it takes a while to figure out the best strategies for most things! Excited to see more research about this.

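The two ordering strategies described in this post can be sketched in a few lines. This is a minimal illustration, not either paper's implementation: the `Chunk` type, the function names, and the alternating front/back placement used for the relevance-based variant are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    position: int  # index of the chunk in the source document (assumed to be stored)
    score: float   # retriever similarity score; higher means more relevant


def relevance_edges_order(chunks):
    """Put the highest-scoring chunks at the start and end, the weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for rank, chunk in enumerate(ranked):
        # Alternate placement: rank 0 -> front, rank 1 -> back, rank 2 -> front, ...
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def order_preserving(chunks, top_k):
    """Keep the top_k chunks by score, then restore their original document order."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    return sorted(top, key=lambda c: c.position)


if __name__ == "__main__":
    retrieved = [
        Chunk("A", 0, 0.2),
        Chunk("B", 1, 0.9),
        Chunk("C", 2, 0.5),
        Chunk("D", 3, 0.7),
        Chunk("E", 4, 0.1),
    ]
    print([c.text for c in relevance_edges_order(retrieved)])
    print([c.text for c in order_preserving(retrieved, top_k=3)])
```

Both functions take the same retrieved set, so it is easy to run them side by side on your own data and compare answer quality.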

15,213 views

Videos


Text-to-SQL is dead 💀 The next generation of querying is agentic - and it's already here.

This paper introduces a type of agentic querying called Function Calling, in which an LLM structures queries as predefined function calls in JSON format, with optional arguments for search, filters, aggregation, and grouping. It also evaluates a range of models on a new dataset, DBGorilla, designed to evaluate agentic querying techniques on real-world use cases.

Weaviate AI Database also just released a Query Agent, designed based on some of the work in this paper, to handle advanced agentic querying out of the box. Find out more here:

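To make the idea of a JSON function call with optional search, filter, aggregation, and grouping arguments concrete, here is a small sketch. Everything in it is hypothetical: the `query_database` tool name, the argument names, and the in-memory executor are illustrative assumptions, not the paper's schema or the Weaviate Query Agent API. Keyword matching stands in for a real vector or hybrid search.

```python
import json

# Hypothetical tool description an LLM could be given (names are illustrative only).
QUERY_TOOL = {
    "name": "query_database",
    "parameters": {
        "collection": "string (required)",
        "search_query": "string (optional)",
        "filters": "list of {property, operator, value} (optional)",
        "aggregation": "{property, metric} (optional)",
        "group_by": "string (optional)",
    },
}

OPERATORS = {
    "=": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}
METRICS = {
    "count": len,
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
}


def run_query(call_json, data):
    """Execute one JSON function call against in-memory collections."""
    call = json.loads(call_json)
    rows = data[call["collection"]]
    for f in call.get("filters", []):
        rows = [r for r in rows if OPERATORS[f["operator"]](r[f["property"]], f["value"])]
    if "search_query" in call:
        # Stand-in for semantic search: simple case-insensitive substring match.
        q = call["search_query"].lower()
        rows = [r for r in rows if any(q in str(v).lower() for v in r.values())]
    agg = call.get("aggregation")
    if agg is None:
        return rows
    groups = {}
    for r in rows:
        key = r[call["group_by"]] if "group_by" in call else "all"
        groups.setdefault(key, []).append(r[agg["property"]])
    return {key: METRICS[agg["metric"]](values) for key, values in groups.items()}


if __name__ == "__main__":
    data = {
        "Restaurants": [
            {"name": "A", "cuisine": "thai", "rating": 4.5},
            {"name": "B", "cuisine": "thai", "rating": 3.0},
            {"name": "C", "cuisine": "italian", "rating": 4.0},
        ]
    }
    # The kind of structured call an LLM might emit for
    # "How many restaurants rated above 3.5 are there per cuisine?"
    llm_call = json.dumps({
        "collection": "Restaurants",
        "filters": [{"property": "rating", "operator": ">", "value": 3.5}],
        "aggregation": {"property": "rating", "metric": "count"},
        "group_by": "cuisine",
    })
    print(run_query(llm_call, data))
```

The point of the design is that the LLM only fills in a constrained JSON structure; the database layer, not the model, decides how that structure is executed.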

Victoria Slocum

188,395 views • 1 year ago

If you're building a PDF RAG pipeline: should you be using OCR and text-based retrieval methods, or just embed images directly using late interaction models?

This paper says the answer might actually be both.

My colleagues at Weaviate released IRPAPERS, a benchmark comparing image-based and text-based retrieval over 3,230 pages from 166 scientific papers.

The setup: take the same PDFs and process them two ways. For text, run OCR with GPT-4.1 and use Arctic 2.0 embeddings + BM25 hybrid search. For images, embed raw page images with ColModernVBERT multi-vector embeddings. Test both on 180 needle-in-the-haystack questions.

The results:
• Text edges out images at the top rank: 46% vs 43% Recall@1
• Images match or exceed text at deeper recall: 93% vs 91% Recall@20

But text-based and image-based methods actually fail on different queries. At Recall@1:
• 22 queries succeed with text but fail with images
• 18 queries succeed with images but fail with text

This complementarity is what makes Multimodal Hybrid Search work. By fusing scores from both text and image retrieval, they achieved:
• 49% Recall@1 (beating either modality alone)
• 81% Recall@5
• 95% Recall@20

More in the video below 🔽
Dataset:
Paper:
Code:

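One way to picture score fusion across the two modalities is the sketch below. The post does not say which fusion method the benchmark uses, so the min-max normalization and the weighted sum with `alpha` are assumptions chosen for illustration, as are the function names. A small Recall@k helper is included to show how the reported metric is computed.

```python
def min_max(scores):
    """Rescale a {page_id: score} dict to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {page: 1.0 for page in scores}
    return {page: (s - lo) / (hi - lo) for page, s in scores.items()}


def fuse(text_scores, image_scores, alpha=0.5):
    """Rank pages by a weighted sum of normalized text and image scores."""
    t, i = min_max(text_scores), min_max(image_scores)
    fused = {
        page: alpha * t.get(page, 0.0) + (1 - alpha) * i.get(page, 0.0)
        for page in set(t) | set(i)
    }
    # Sort by fused score (descending); break ties by page id for determinism.
    return sorted(fused, key=lambda page: (-fused[page], page))


def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold page appears in the top k results."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


if __name__ == "__main__":
    text_scores = {"p1": 10.0, "p2": 8.0, "p3": 2.0}   # e.g. hybrid text scores
    image_scores = {"p2": 0.9, "p3": 0.8, "p4": 0.1}   # e.g. multi-vector image scores
    ranking = fuse(text_scores, image_scores)
    print(ranking)
    print(recall_at_k({"q1": ranking}, {"q1": "p2"}, k=1))
```

Normalizing before fusing matters because the two retrievers produce scores on different scales; without it, one modality would dominate the sum.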

Victoria Slocum

43,996 views • 2 months ago

Matryoshka Representation Learning (MRL) is a super exciting approach to improving the quality and efficiency of embedding models and strategies ✨

MRL trains models to store more information in the earlier dimensions of their embedding vectors, so a vector can be truncated to fewer dimensions and still be useful. This method not only boosts performance in tasks like classification and retrieval, but is also a super cool compression technique!

Paper:
For compression:

It's been so much fun learning and making this video with @DanielW966, thanks for all your help!

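The compression use of MRL comes down to keeping only the first few dimensions of an embedding and re-normalizing. The sketch below shows that step with plain Python lists; the function names and the toy vectors are assumptions for illustration, and truncation like this is only meaningful for embeddings from a model that was actually trained with MRL.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values of an embedding and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 8-dimensional "embeddings"; real models use hundreds of dimensions.
    doc = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01]
    query = [0.8, 0.4, 0.1, 0.2, 0.01, 0.03, 0.02, 0.01]
    for dims in (8, 4, 2):
        d = truncate_and_normalize(doc, dims)
        q = truncate_and_normalize(query, dims)
        print(dims, round(cosine(d, q), 4))
```

Storing the shorter vectors cuts memory and speeds up search roughly in proportion to the number of dimensions dropped, at some cost in accuracy.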

Victoria Slocum

120,436 views • 1 year ago

‘pip install elysia’ and ‘elysia start’

That's literally all it takes to get the most advanced open source agentic RAG app running on your data.

We just released Elysia, our open source agentic RAG framework, and an app this cool needed a cool video to go with it.

Watch the full video:

In the video, we go through these components of Elysia:

1️⃣ Decision Tree Architecture: Instead of giving agents access to all tools at once, Elysia uses a pre-defined web of nodes with corresponding actions. Each decision agent has global context awareness.
2️⃣ Dynamic Data Displays: Seven different data display formats including tables, e-commerce product cards, GitHub tickets, and charts. The system automatically chooses the best display format.
3️⃣ Automatic Data Expertise: Unlike naive RAG systems that perform blind vector searches, Elysia analyzes your collections to understand data structure and meaning before performing queries.

Other Cool Features:
• Feedback System: Uses positive examples as few-shot demonstrations for smaller, faster models
• Chunk-On-Demand: Dynamically chunks documents at query time instead of pre-chunking
• Multi-Model Strategy: Routes different tasks to appropriate model sizes based on complexity

…And also how to get started with your own data!

The entire project is open source and designed with customization in mind. You can use it as-is for effective data searching, or install the Python package to create custom tools for whatever agentic AI purposes you need.

Big kudos to Edward for the vision, filming, and editing of this masterpiece


Victoria Slocum

45,497 views • 9 months ago

ColPali is changing the game for PDF retrieval by eliminating the need for OCR and chunking methods 🚀

Inspired by ColBERT's success with text, ColPali splits an image of a document into patches, which are then processed through a vision LLM called PaliGemma. The embeddings for each patch retain contextual information, similarly to text embeddings in methods like ColBERT.

During retrieval, user queries are embedded in the same space and then compared to document patches using the MaxSim operator.

ColPali recipe POC Weaviate AI Database:
ColPali paper:

As always, shoutout to the awesome Daniel Williams for the help! 💙

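The MaxSim operator mentioned above is simple enough to write out: for each query token embedding, take its best dot-product match among the document's patch embeddings, then sum those maxima. The sketch below uses plain lists and tiny made-up vectors; the function names are assumptions, and a real system would run this on batched tensors.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vecs, patch_vecs):
    """Late-interaction score: sum over query tokens of the best-matching patch."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page ids sorted by MaxSim score, best first."""
    scores = {page_id: max_sim(query_vecs, patches) for page_id, patches in pages.items()}
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]             # two query token embeddings
    pages = {
        "page_a": [[1.0, 0.0], [0.5, 0.5]],      # patch embeddings of page A
        "page_b": [[0.0, 1.0], [0.2, 0.1]],      # patch embeddings of page B
    }
    for page_id in rank_pages(query, pages):
        print(page_id, max_sim(query, pages[page_id]))
```

Because every query token is matched against every patch, a page scores well only if it covers all parts of the query somewhere on the page.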

Victoria Slocum

67,802 views • 1 year ago

I just watched AI agents map Nazi escape routes across two continents. This is the coolest fork of our agentic RAG framework that I've seen so far 🔽

Vero Dall'Aglio built a complete OSINT intelligence platform on top of Elysia, and it's fully open source.

IntellyWeave takes Elysia's decision tree architecture and extends it for intelligence analysis. Upload documents, ask questions in natural language, and get comprehensive intelligence assessments with entity extraction, geospatial mapping, and network analysis.

Main features:

• Automatic Entity Extraction: Uses GLiNER for zero-shot recognition of 7 entity types (persons, organizations, locations, dates, events, laws, cryptonyms). No training required.
• Two-Agent Archive Research: The "Quartermaster" agent maps the information landscape (discovers archives, classifies access levels), while the "Case Officer" conducts hypothesis-driven investigations with confidence scoring and evidence citations.
• Geospatial + Network Visualization: Interactive 3D maps (Mapbox) and force-directed network graphs (vis-network) reveal hidden connections between entities, using Elysia's built-in customizable display types.
• 6-Phase Intelligence Orchestrator: Automated pipeline that goes from extraction → relationship mapping → geospatial analysis → network analysis → pattern detection → synthesis, with automatic task generation for follow-up investigation.

They even built a demo analyzing 17 historical documents about Nazi escape networks to South America (1945-1962). The system automatically extracted entities, mapped three distinct escape routes, and generated hypotheses with confidence scores.

This is exactly what we hoped people would build with Elysia 💚 So, so cool to see.

Elysia's decision tree architecture makes it straightforward to add domain-specific tools (like GLiNER entity extraction, archive discovery, or custom map displays) while keeping all the core functionality (error handling, streaming, self-healing, transparency) that comes built in.

Check out the repo:
Interactive demo:
Elysia blog post:

Huge shoutout to the contributors for building this and sharing it with the community! 🫶


Victoria Slocum

19,538 views • 5 months ago

Google just proved that bigger isn't always better. Their 308M parameter model is outperforming models 2x its size.

Google just released EmbeddingGemma, and it's proving that lightweight embedding models can punch way above their weight class.

At just 308M parameters (578MB), it's the new state-of-the-art for models under 500M parameters across MTEB multilingual, English, and code benchmarks.

But the really impressive part is that it ranks 8th overall on MTEB(Multilingual, v2) - that's 17 places above the second-best sub-500M model, and it's delivering performance comparable to models nearly double its size.

There are three key parts of their training recipe that set it apart:

1. Encoder-Decoder Initialization
Instead of starting from a decoder-only Gemma 3 model, they first adapted it to encoder-decoder, then used just the encoder. Basing EmbeddingGemma on an LLM that already has world and language understanding gives it a stronger starting point.

2. Three-Loss Training
They combine three different loss functions, instead of just having one:
• Contrastive loss (NCE) with in-batch negatives and hardness weighting
• Spread-out regularization to ensure embeddings utilize the full space (for quantization and ANN retrieval)
• Embedding matching distillation from Gemini Embedding - not just learning from relevance scores, but directly aligning the embedding space with the teacher model

3. Model Souping
Rather than just averaging checkpoints from the same training run, they use optimization techniques to find multiple specialized training mixtures. Each mixture creates an "expert" model in a different domain, and averaging all their parameters creates a final model that's actually better than the individual models.

Extras:
• Matryoshka embeddings supporting 768, 512, 256, and 128 dimensions
• Quantization-aware training - maintains quality even at int4 precision
• 100+ languages from Gemma 3 pretraining
• Exceptional performance on low-resource languages (check their XTREME-UP results)

Is it the absolute best embedding model? No - Gemini Embedding still leads overall. But that's not really the point.

EmbeddingGemma proves you can achieve state-of-the-art performance in a small package that's actually deployable on-device, in low-latency applications, and in resource-constrained environments. This makes good embeddings accessible for use cases that I'm seeing more and more: offline applications, privacy-sensitive deployments, and high-throughput scenarios where inference cost actually matters.

Full paper:

Shoutout to the EmbeddingGemma team at Google DeepMind for this awesome open source work 💙 and to Daniel Williams for helping me with this video! 🫶

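The parameter-averaging step behind model souping can be sketched as follows. This is only the averaging itself, under the assumption of uniform weights and checkpoints stored as dicts of flat float lists; the `soup` function name and the toy checkpoints are hypothetical, and the search for training mixtures described in the post is not shown.

```python
def soup(checkpoints):
    """Average parameters element-wise across checkpoints with identical shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    n = len(checkpoints)
    averaged = {}
    for name in names:
        # zip(*...) walks the same parameter position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


if __name__ == "__main__":
    # Toy "expert" checkpoints, e.g. one per training mixture (values are made up).
    expert_code = {"w": [1.0, 3.0], "b": [0.0]}
    expert_multilingual = {"w": [3.0, 5.0], "b": [2.0]}
    print(soup([expert_code, expert_multilingual]))
```

Averaging works here because all experts start from the same initialization and share one architecture, so their parameters stay aligned position by position.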

Victoria Slocum

21,211 views • 7 months ago

Two papers released in the past couple of months, one by Google and one by NVIDIA, argue that ordering the documents retrieved by RAG systems can enhance performance. However, they propose two different strategies for HOW these documents should be ordered 🤔

Both papers agree on two main points:

1️⃣ There's a fundamental issue in RAG: as more documents are retrieved, more irrelevant context (e.g., hard negatives) is introduced, which confuses the LLM and eventually degrades the quality of the generated output. Performance therefore follows an inverted-U curve as the number of retrieved documents grows.
2️⃣ Ordering the retrieved documents is a key lever for optimizing RAG performance.

Google Cloud researchers proposed ordering results based on relevance scores: the authors argue for relevance-based reordering, i.e. ordering the retrieved chunks by their similarity scores so that the most relevant documents sit at the beginning and the end of the input, to counter the "lost in the middle" effect.

NVIDIA researchers proposed ordering results based on the original sequence of document chunks: the authors argue for Order-Preserve RAG (OP-RAG), which maintains the logically coherent flow of the document. They keep the retrieved chunks in the order they appear in the source text instead of ranking them by relevance score.

So which one is right? It probably depends on the specific use case and dataset. Relevance-based reordering could perform better in tasks where you need fast access to the most critical information (e.g., fact retrieval, QA systems), while order-preserving RAG might be better where you need to understand the sequential structure of information (e.g., narrative or legal documents).

There are still so many uncertainties in AI - we don't actually know what we're doing, and it takes a while to figure out the best strategies for most things! Excited to see more research about this.


Victoria Slocum

15,213 views • 1 year ago
