Загрузка видео...

Не удалось загрузить видео

На главную

Researchers built a new RAG approach that: - does not need a vector DB. - does not embed data. - involves no chunking. - performs no similarity search. And it hit 98.7% accuracy on a financial benchmark (SOTA). Here's the core problem with RAG that this new approach solves:...

971,454 просмотров • 4 месяцев назад •via X (Twitter)

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

Traditional data pipelines don't work for RAG applications. There are 3 issues with them: ​ 1. Traditional data engineering solutions are optimized to handle structured data. RAG applications rely primarily on unstructured data. ​ 2. The connector ecosystem to load data from unstructured data sources is very immature. ​ 3. Traditional solutions do not offer any way to transform unstructured data into an optimized vector search index. ​ The goal of a RAG Pipeline is to solve these problems. ​ The number one objective is to create a reliable vector search index using factual knowledge and relevant context. This sounds easy, but it's one of the biggest challenges we face when building RAG applications. ​ At a high level, there are four different stages in the architecture of a RAG pipeline: ​ 1. Ingestion: Here is where the pipeline loads the information from the data source. ​ 2. Extraction: Where the pipeline processes the input data and decides how to retrieve the text contained inside them. ​ 3. Transform: Where the pipeline chunks the data and generates document embeddings. ​ 4. Load: Where the pipeline creates a search index in a vector database and loads the document embeddings. ​ There are different rabbit holes at each one of these stages. Here are three of them: ​ 1. Ingesting data once is simple. The hard part is refreshing the vector database whenever the original data source changes. ​ 2. Extracting the content of a plain text document is simple. The hard part is to extract content from complex documents containing tables, images, or cross-references. ​ 3. A simple continual chunking strategy with an overlap is simple. The hard part is to find the optimal strategy for your specific knowledge base and the way you are planning to query it. ​ In the attached video, I'll show you how you can build an enterprise-grade RAG Pipeline that solves every one of the above problems. ​ I'll use Vectorize. They partnered with me on this post. You can use them to build RAG pipelines optimized for accurate context retrieval. ​ ​ If you have a few documents lying around, set up a free account and give it a try.

Santiago

40,441 просмотров • 1 год назад

Announcing a new Coursera course: Retrieval Augmented Generation (RAG) You'll learn to build high performance, production-ready RAG systems in this hands-on, in-depth course created by and taught by Zain, experienced AI and ML engineer, researcher, and educator. RAG is a critical component today of many LLM-based applications in customer support, internal company Q&A systems, even many of the leading chatbots that use web search to answer your questions. This course teaches you in-depth how to make RAG work well. LLMs can produce generic or outdated responses, especially when asked specialized questions not covered in its training data. RAG is the most widely used technique for addressing this. It brings in data from new data sources, such as internal documents or recent news, to give the LLM the relevant context to private, recent, or specialized information. This lets it generate more grounded and accurate responses. In this course, you’ll learn to design and implement every part of a RAG system, from retrievers to vector databases to generation to evals. You’ll learn about the fundamental principles behind RAG and how to optimize it at both the component and whole-system levels. As AI evolves, RAG is evolving too. New models can handle longer context windows, reason more effectively, and can be parts of complex agentic workflows. One exciting growth area is Agentic RAG, in which an AI agent at runtime (rather than it being hardcoded at development time) autonomously decides what data to retrieve, and when/how to go deeper. Even with this evolution, access to high-quality data at runtime is essential, which is why RAG is a key part of so many applications. You'll learn via hands-on experiences to: - Build a RAG system with retrieval and prompt augmentation - Compare retrieval methods like BM25, semantic search, and Reciprocal Rank Fusion - Chunk, index, and retrieve documents using a Weaviate vector database and a news dataset - Develop a chatbot, using open-source LLMs hosted by Together AI, for a fictional store that answers product and FAQ questions - Use evals to drive improving reliability, and incorporate multi-modal data RAG is an important foundational technique. Become good at it through this course! Please sign up here:

Andrew Ng

124,458 просмотров • 11 месяцев назад

There’s been two papers released in the past couple months, one by Google and one by NVIDIA, that argue that ordering the documents retrieved by RAG systems can enhance performance. However, they both give two different strategies on HOW these documents should be ordered 🤔 Both papers agree on two main points: 1️⃣ There’s a fundamental issue in RAG - as more documents are retrieved, more irrelevant context (e.g., hard negatives) are introduced, which leads to confusion for the LLM and eventually degrades the quality of the generated output. This is called an inverted-U performance curve. 2️⃣ Ordering the retrieved documents is a key lever for optimizing RAG performance. Google Cloud researchers proposed ordering results based on relevance scores: The authors in this paper argue for relevance-based reordering, or ordering the retrieved chunks based on their similarity scores, so the most relevant documents are at the beginning and the end of the inputs to counter the “lost in the middle” effect. NVIDIA researchers proposed ordering results based on the original sequence of document chunks: The authors of this paper argue for Order-Preserving Reordering, or Order-Preserve RAG (OP-RAG), to maintain the logically coherent content flow of the document. So they preserved the original order of retrieved document chunks in the source text, instead of ranking them by relevance scores. So which one is right? It probably depends on the specific use case and dataset - relevance-based reordering could perform better in tasks where you need fast access to the most critical information (e.g., fact retrieval, QA systems), while order-preserving RAG might be better where you need to understand the sequential structure of information (e.g., narrative or legal documents). There are still so many uncertainties in AI - we don’t actually know what we’re doing, and it takes awhile to figure out the best strategies for most things! Excited to see more research about this.

Victoria Slocum

15,213 просмотров • 1 год назад

Here is how you can install an open-source, enterprise-grade RAG system on your server (with the best document understanding I've seen.) First, something obvious to anyone trying to sell RAG in the market: You are crazy if you think companies will let their data travel to a hosted model. No one wants to send their data anywhere (those who do haven't found an alternative.) Every single company would rather have an air-gapped system with no internet access. GroundX is an open-source RAG system that you can run on your servers (or any cloud provider, as long as you have access to GPUs) and works without a network. (If the military wants to do RAG, this is precisely what they will be looking for.) I installed GroundX on my AWS account and recorded a video to show you how to use it. There are two services you can use: 1. Ingest: This service uses a pretrained vision model to ingest and understand your knowledge base. 2. Search: This service combines text and vector search with a fine-tuned re-ranker model to retrieve information from your knowledge base. A quick note about the Ingest service: 99% of people think they need better "retrieval" mechanisms. I think they need better "ingestion." That's where this service comes in! Ingest "understands" your documents in a way I haven't seen before. After you try it, you'll realize why showing your LLM your raw documents is a bad idea. In the video, I use a free tool called X-Ray to test a document and understand how the Ingest service breaks it down. You can access this tool by signing up for a free GroundX cloud account and uploading your documents. You'll see a bit more about this in the video.

Santiago

89,624 просмотров • 1 год назад

Traditional chunking: cheap but dumb. ColBERT: smart but expensive. 𝗟𝗮𝘁𝗲 𝗰𝗵𝘂𝗻𝗸𝗶𝗻𝗴: the solution we've been waiting for. Here’s a quick evolution of chunking strategies: → 𝗧𝗿𝗮𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗖𝗵𝘂𝗻𝗸𝗶𝗻𝗴 (the basics we all started with) • Token Chunking - split by token count • Sentence Chunking - split by sentence boundaries • Document-Based Chunking - split by sections/paragraphs → 𝗔𝗱𝘃𝗮𝗻𝗰𝗲𝗱 𝗖𝗵𝘂𝗻𝗸𝗶𝗻𝗴 (when things got sophisticated) • Semantic Chunking - split by meaning • LLM-Based Chunking - let the model decide But each chunking method separates text at defined points, meaning context is lost within the document from one chunk to the next. → 𝗘𝗻𝘁𝗲𝗿 𝗟𝗮𝘁𝗲 𝗖𝗵𝘂𝗻𝗸𝗶𝗻𝗴 (the game changer) Traditional approach: Chunk first → Embed each chunk separately Late chunking approach: Embed the entire document → Then chunk with context preserved 𝗪𝗵𝘆 𝗰𝗵𝗼𝗼𝘀𝗲 𝗹𝗮𝘁𝗲 𝗰𝗵𝘂𝗻𝗸𝗶𝗻𝗴? When you chunk first, each piece loses its contextual relationship to the rest of the document. It's like reading a book by randomly picking paragraphs - you miss the flow. With late chunking, every chunk maintains awareness of its neighbors because the embedding happens at the document level first. Mean pooling is done on segments AFTER the full context is embedded. Jina AI tested and saw significant improvements in retrieval quality - chunks that were previously disconnected now maintain their semantic relationships. As documents get longer and context windows expand, late chunking might just become the new standard for high-quality retrieval systems. 𝗪𝗵𝗮𝘁 𝗱𝗼 𝘆𝗼𝘂 𝗻𝗲𝗲𝗱 𝘁𝗼 𝗺𝗮𝗸𝗲 𝘁𝗵𝗶𝘀 𝘄𝗼𝗿𝗸? No modifications to your retrieval pipeline are needed. 1. Long context embedding models (8192+ tokens) 2. Chunking logic that tracks token spans 3. Less than 30 lines of code to implement All you need is to switch the order at which you chunk and embed. Embed FIRST, then chunk, not the other way around. Dive deeper into late chunking:

Femke Plantinga

125,283 просмотров • 10 месяцев назад

This will retire 90% of RAG systems with dignity (and a sad song playlist). Powered by DSPy: If you're still building "text in, text out" chatbots that only perform blind vector and text searches, you're not gonna make it! My team just dropped Elysia, and it's not just an incremental successor to Verba… It's a whole rethink of how we interact with our data using AI. 𝗪𝗵𝗮𝘁 𝗶𝘀 𝗘𝗹𝘆𝗶𝘀𝗮? An open-source platform for building agentic RAG architectures. It learns from your preferences, intelligently categorizes, labels, and searches through your data, and provides complete transparency into its decision-making process. The long & exciting feature list: • 𝗧𝗿𝗮𝗻𝘀𝗽𝗮𝗿𝗲𝗻𝘁 𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻-𝗧𝗿𝗲𝗲 𝗔𝗴𝗲𝗻𝘁𝘀: Elysia’s core is a customizable decision tree, and it visualizes its entire reasoning process, showing you why it chooses a specific tool or path. It enables advanced error handling, self-healing from failed queries, and prevents infinite loops. You can also add custom tools and branches to build complex, state-aware workflows. • 𝗗𝗮𝘁𝗮 𝗔𝘄𝗮𝗿𝗲𝗻𝗲𝘀𝘀: Before it even attempts a query, Elysia performs a full analysis of your data collections. This eliminates the blind search problem plaguing most RAG systems and allows for far more complex and accurate query generation. • 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗗𝗮𝘁𝗮 𝗗𝗶𝘀𝗽𝗹𝗮𝘆𝘀: Your RAG pipeline shouldn't be limited to text, right? That’s why Elysia analyzes each query's results and chooses the best way to display them, from tables and charts to product cards and GitHub tickets. It also features a comprehensive data explorer with search, sorting, and filtering capabilities. • 𝗛𝘆𝗽𝗲𝗿-𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝘃𝗶𝗮 𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸: It uses your positively-rated queries as few-shot examples to improve future responses. This allows you to use smaller, faster models that perform like larger ones over time, cutting costs without sacrificing quality for most use cases. • 𝗖𝗵𝘂𝗻𝗸-𝗢𝗻-𝗗𝗲𝗺𝗮𝗻𝗱: Elysia chunks documents at query time. It performs initial searches on document-level vectors and only chunks relevant documents on the fly, storing them in a parallel quantized collection with cross references for future use. 𝗧𝗵𝗲 𝗦𝘁𝗮𝗰𝗸 Elysia is built from scratch on Weaviate, using its native features like named vectors, a variety of search types, filters, cross references, quantization, etc. It uses DSPy for LLM interactions and is delivered as a production-ready application via FastAPI, serving a NextJS frontend as static HTML. Also available as a Python package via pip: 𝗽𝗶𝗽 𝗶𝗻𝘀𝘁𝗮𝗹𝗹 𝗲𝗹𝘆𝘀𝗶𝗮-𝗮𝗶 Type: 𝗲𝗹𝘆𝘀𝗶𝗮 𝘀𝘁𝗮𝗿𝘁 Connect your Weaviate cluster and go explore what’s possible.

Philip Vollet

93,431 просмотров • 10 месяцев назад

Tokenization -- turning text into a sequence of integers -- is a key part of generative AI, and most API providers charge per million tokens. How does tokenization work? Learn the details of tokenization and RAG optimization in Retrieval Optimization: From Tokenization to Vector Quantization, created in collaboration with Qdrant and taught by its Developer Relations Lead, Kacper Łukawski. This course focuses on Retrieval augmented generation (RAG), which has two steps: First, a retriever finds relevant information; then, the generator uses what’s retrieved as context to produce a response. You’ll learn to optimize the first step (the retriever) by understanding how tokenization works and how it impacts the relevance of your search. In addition, you will also learn to measure and improve retrieval quality, speed, and memory. In detail, you’ll: - Learn about the internal workings of the embedding models and how your text turns into vectors. - Understand how several tokenizers, such as Byte-Pair Encoding, WordPiece, Unigram, and SentencePiece work. - Explore common challenges with tokenizers, such as unknown tokens, domain-specific identifiers, and numerical values, that can negatively affect your vector search. - Understand how to measure the quality of your search across relevance, ranking, and score-related metrics. - Understand how the main parameters in "HNSW", a graph-based algorithm, affect the relevance and speed of vector search, and how to tune its parameters. - Experiment with the three major quantization methods – product, scalar, and binary – and learn how they impact memory requirements, search quality, and speed. By the end of this course, you’ll have a solid understanding of how tokenization functions and how to optimize vector search in your RAG systems. Please sign up here!

Andrew Ng

146,200 просмотров • 1 год назад