Existing IR/RAG benchmarks are unrealistic: they’re often derived from easily retrievable topics, rather than grounded in solving real user problems. 🧵Introducing 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤, a challenging RAG benchmark on niche, recent topics. Work done during an internship at Databricks 🧱

40,063 views • 1 year ago • via X (Twitter)

Comments: 11

Nandan Thakur • 1 year ago

🌟 Overview: 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 gathers real user queries and answers from Stack Overflow, and technical documents from GitHub. It automates: 🔍 Corpus collection 📚 Nugget generation with GPT-4o 🏷️ Nugget-level support with GPT-4o Preprint:

Nandan Thakur • 1 year ago

🚀 Core Steps: 1️⃣ Collects recent docs (code snippets, etc.) from public GitHub repositories. 2️⃣ Generates key-info "nuggets" from Stack Overflow Q&A and uses them for document relevance evaluation. 3️⃣ Uses oracle retrieval techniques, including fusion, to fetch docs for pooling.
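
To make the "fusion to fetch docs for pooling" step concrete, here is a minimal sketch. It assumes reciprocal rank fusion (RRF) as the fusion method; the thread does not say which fusion technique FreshStack uses, and the run names, document IDs, and the `k=60` constant are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    RRF is an assumption here, not a method confirmed by the thread.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_pool(rankings, depth):
    """Take the top `depth` fused documents as the judgment pool."""
    return reciprocal_rank_fusion(rankings)[:depth]


# Hypothetical runs from two retrievers for a single query.
lexical_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([lexical_run, dense_run])
pool = build_pool([lexical_run, dense_run], depth=2)
```

Documents ranked well by both retrievers (`doc_a`, `doc_b`) rise above those found by only one, which is the usual motivation for pooling from a fused list rather than from a single system.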

Nandan Thakur • 1 year ago

📊 Datasets: We construct 5 challenging datasets, e.g., 𝐋𝐚𝐧𝐠𝐂𝐡𝐚𝐢𝐧 or 𝐆𝐨𝐝𝐨𝐭𝟒. ⌛ Long questions with mixed code snippets and text. 🆕 Popular 𝐅𝐫𝐞𝐬𝐡 topics to avoid data contamination by LLMs. 👨🏻‍💻 Requires domain knowledge to answer these questions correctly!

Nandan Thakur • 1 year ago

🔍 Highlights: 💡 Diversity-focused eval metrics. 👀 Oracle settings show clear headroom even for strong retrievers and rerankers. 📈 Ensemble fusion outperforms single methods in retrieval! 📚 Human calibration shows nuggets capture crucial info and precisely label documents!
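
As a rough illustration of what a diversity-focused, nugget-based metric can look like, the sketch below computes nugget coverage at a cutoff: the fraction of a query's nuggets supported by at least one document in the top k. The thread does not name the exact metrics, so the function, its name, and the toy judgments are all assumptions.

```python
def nugget_coverage_at_k(ranked_doc_ids, doc_to_nuggets, num_nuggets, k=20):
    """Fraction of a query's nuggets supported by the top-k documents.

    Hypothetical helper: `doc_to_nuggets` maps a doc ID to the set of
    nugget indices that the document supports for this query.
    """
    if num_nuggets == 0:
        return 0.0
    covered = set()
    for doc_id in ranked_doc_ids[:k]:
        # Unjudged or unsupportive documents contribute nothing.
        covered |= doc_to_nuggets.get(doc_id, set())
    return len(covered) / num_nuggets


# Toy nugget-level judgments for one query with three nuggets (0, 1, 2).
judgments = {"doc_a": {0, 1}, "doc_b": {1}, "doc_c": {2}}
run = ["doc_b", "doc_x", "doc_a", "doc_c"]

coverage_top2 = nugget_coverage_at_k(run, judgments, num_nuggets=3, k=2)
coverage_top4 = nugget_coverage_at_k(run, judgments, num_nuggets=3, k=4)
```

Unlike plain recall over relevant documents, this rewards a ranking for surfacing documents that support different nuggets, so retrieving several documents that all repeat the same fact does not raise the score.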

Nandan Thakur • 1 year ago

𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 develops scalable, uncontaminated & realistic IR/RAG benchmarks. We plan to keep the benchmark fresh in the future. Work done during an internship at @DbrxMosaicAI with @lateinteraction, @mrdrozdov and others! 🏙️⚡

Nandan Thakur • 1 year ago

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 queries (October 2024), containing nuggets, answers, and nugget-level judgments for all five topics, are publicly available on Hugging Face:

Nandan Thakur • 1 year ago

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 corpus (October 2024), containing chunked GitHub documents for all five topics, is available on Hugging Face:


search founder • 1 year ago

Much needed bc I'm sorry but I just don't get any signal out of the new MTEB leaderboard. In addition to updating the corpus to keep it fresh, will you all be updating the models benchmarked to include newer models? (Would love to see Stella, voyage 3 large, gemini-embedding-exp-03-07.) Would also be interested in seeing different rerankers tested, esp. mxbai-rerank-large-v2 vs. listwise and pointwise reranking with gemini flash 2 lite, 2, and 2.5, for example. Basically I need you to do my job for me pls.

Nandan Thakur • 1 year ago

@databricks We will try to keep it updated with newer models (bonus if publicly available). As the dataset is open-sourced, we also hope the community can jump in and run baselines directly on FreshStack.

dare • 1 year ago

@databricks Love the concept! Are you releasing only the datasets for now? I don't see the code on GH.
