Existing IR/RAG benchmarks are unrealistic: they’re often derived from easily retrievable topics, rather than grounded in solving real user problems. 🧵 Introducing 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤, a challenging RAG benchmark on niche, recent topics. Work done during an internship at Databricks 🧱
40,063 views • 1 year ago • via X (Twitter)
Comments: 11

🌟 Overview: 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 gathers real-user queries and answers from Stack Overflow & technical documents from GitHub. It automates: 🔍 Corpus collection 📚 Nugget generation with GPT-4o 🏷️ Nugget-level support with GPT-4o Preprint:

🚀 Core Steps: 1️⃣ Collects recent docs (code snippets etc.) from public GitHub repositories. 2️⃣ Generates key info "nuggets" from Stack Overflow Q&A and uses them for document relevance evaluation. 3️⃣ Uses oracle retrieval techniques, including fusion, to fetch docs for pooling.
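
The thread does not say which fusion method step 3️⃣ uses. One common way to fuse several ranked lists into a pool is reciprocal rank fusion (RRF). A minimal Python sketch, assuming RRF with the conventional constant k=60; the function name, run names and document IDs are hypothetical and not taken from the FreshStack pipeline:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every doc it returns.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical runs from two different retrievers for one query.
bm25_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_run, dense_run])
pool = fused[:3]  # top of the fused list is sent on for judging
```

Documents ranked highly by more than one retriever rise to the top of the fused list, which is why pooling from a fused list tends to surface candidates that any single retriever might miss.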

📊 Datasets: We construct 5 challenging datasets, e.g., 𝐋𝐚𝐧𝐠𝐂𝐡𝐚𝐢𝐧 or 𝐆𝐨𝐝𝐨𝐭𝟒. ⌛ Long questions with mixed code snippets and text. 🆕 Popular 𝐅𝐫𝐞𝐬𝐡 topics to avoid data contamination by LLMs. 👨🏻‍💻 Requires domain knowledge to answer these questions correctly!

🔍 Highlights: 💡 Diversity-focused eval metrics. 👀 Oracle settings show clear headroom even for strong retrievers and rerankers. 📈 Ensemble fusion outperforms single methods in retrieval! 📚 Human calibration shows nuggets capture crucial info and precisely label documents!
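
The thread does not define its diversity-focused metrics. As one illustration of the idea, a nugget coverage score can measure what fraction of a query's nuggets are supported by at least one of the top-k retrieved documents. A minimal Python sketch under that assumption; the function name, the cutoff and the sample judgments are hypothetical:

```python
def nugget_coverage_at_k(ranked_doc_ids, nugget_support, k=20):
    """Fraction of a query's nuggets supported by at least one top-k doc.

    nugget_support maps a nugget ID to the set of doc IDs judged to support it.
    """
    if not nugget_support:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    # A nugget counts as covered if any supporting doc appears in the top k.
    covered = sum(1 for docs in nugget_support.values() if docs & top_k)
    return covered / len(nugget_support)

# Hypothetical nugget-level judgments for one query.
support = {"n1": {"doc_a"}, "n2": {"doc_b", "doc_c"}, "n3": {"doc_z"}}
score = nugget_coverage_at_k(["doc_a", "doc_a2", "doc_c"], support, k=3)
```

A score like this rewards a ranking for covering many distinct nuggets rather than returning several documents that all support the same one.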

𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 develops scalable, uncontaminated & realistic IR/RAG benchmarks. We plan to maintain the freshness of the benchmark in the future. Work done during an internship at @DbrxMosaicAI with @lateinteraction, @mrdrozdov and others! 🏙️⚡

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 queries (October 2024), containing nuggets, answers and nugget-level judgments for all five topics, are publicly available on Hugging Face:

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 corpus (October 2024), containing chunked GitHub documents for all five topics, is available on Hugging Face:


Much needed bc I'm sorry but I just don't get any signal out of the new MTEB leaderboard. In addition to updating the corpus to keep it fresh, will you all be updating the models benchmarked to include newer models (would love to see Stella, voyage 3 large, gemini-embedding-exp-03-07)? Would also be interested in seeing different rerankers tested, esp. mxbai-rerank-large-v2 vs. listwise and pointwise reranking with gemini flash 2 lite, 2, and 2.5 for example. Basically I need you to do my job for me pls.

@databricks We will try to keep it updated with newer models (bonus if publicly available). As the dataset is open-sourced, we also hope the community can jump in directly and run baselines on FreshStack.

@databricks Love the concept! Are you releasing only the datasets for now? I don't see the code on GH.


