Existing IR/RAG benchmarks are unrealistic: they’re often derived from easily retrievable topics rather than grounded in solving real user problems. 🧵 Introducing 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤, a challenging RAG benchmark on niche, recent topics. Work done during an internship at Databricks 🧱

Nandan Thakur

2,669 subscribers

40,063 views • 1 year ago • via X (Twitter)

Science and Technology, Education

Comments: 11

Nandan Thakur • 1 year ago

🌟 Overview: 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 gathers real-user queries and answers from Stack Overflow and technical documents from GitHub. It automates: 🔍 Corpus collection 📚 Nugget generation with GPT-4o 🏷️ Nugget-level support with GPT-4o Preprint:

Nandan Thakur • 1 year ago

🚀 Core Steps: 1️⃣ Collects recent docs (code snippets, etc.) from public GitHub repositories. 2️⃣ Generates key-information "nuggets" from Stack Overflow Q&A and uses them for document relevance evaluation. 3️⃣ Uses oracle retrieval techniques, including fusion, to fetch docs for pooling.
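
The thread does not say which fusion method is used for pooling. As a minimal sketch, assuming reciprocal rank fusion (RRF) as a stand-in, this is how several ranked lists could be merged into one candidate pool. The function name, the run names, the document ids, and the constant `k=60` are all illustrative assumptions, not details from FreshStack.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned in order of descending total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on the document id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical runs from two different retrievers for the same query.
bm25_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_run, dense_run])
pool = fused[:3]  # top of the fused list goes into the judgment pool
print(fused)
```

Documents that rank well in more than one run rise to the top, which is the usual reason to pool from a fused list rather than from any single retriever.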

Nandan Thakur • 1 year ago

📊 Datasets: We construct 5 challenging datasets, e.g., 𝐋𝐚𝐧𝐠𝐂𝐡𝐚𝐢𝐧 or 𝐆𝐨𝐝𝐨𝐭𝟒. ⌛ Long questions with mixed code snippets and text. 🆕 Popular 𝐅𝐫𝐞𝐬𝐡 topics to avoid data contamination by LLMs. 👨🏻‍💻 Requires domain knowledge to answer these questions correctly!

Nandan Thakur • 1 year ago

🔍 Highlights: 💡 Diversity-focused eval metrics. 👀 Oracle settings show clear headroom, even for strong retrievers and rerankers. 📈 Ensemble fusion outperforms single methods in retrieval! 📚 Human calibration shows nuggets capture crucial info and precisely label documents!
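
The thread does not name the diversity-focused metrics. One plausible shape, given nugget-level judgments, is a coverage score: the fraction of a query's nuggets supported by at least one of the top-k retrieved documents. The sketch below assumes that definition; the function name, data layout, and example values are hypothetical.

```python
def nugget_coverage_at_k(ranked_doc_ids, doc_to_nuggets, num_nuggets, k):
    """Fraction of a query's nuggets supported by the top-k documents.

    doc_to_nuggets maps a document id to the set of nugget indices that
    the document was judged to support (assumed judgment format).
    """
    if num_nuggets == 0:
        return 0.0
    covered = set()
    for doc_id in ranked_doc_ids[:k]:
        covered |= doc_to_nuggets.get(doc_id, set())
    return len(covered) / num_nuggets


# Hypothetical judgments for one query with three nuggets (0, 1, 2).
judgments = {"doc_a": {0, 1}, "doc_b": {1}, "doc_c": {2}}
ranking = ["doc_b", "doc_a", "doc_x", "doc_c"]

cov_at_2 = nugget_coverage_at_k(ranking, judgments, num_nuggets=3, k=2)
cov_at_4 = nugget_coverage_at_k(ranking, judgments, num_nuggets=3, k=4)
print(cov_at_2, cov_at_4)
```

A metric of this kind rewards a ranking for surfacing documents that support different nuggets, rather than several documents that all repeat the same one.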

Nandan Thakur • 1 year ago

𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 develops scalable, uncontaminated & realistic IR/RAG benchmarks. We plan to maintain the freshness of the benchmark in the future. Work done during internship at @DbrxMosaicAI with @lateinteraction, @mrdrozdov and others! 🏙️⚡

Nandan Thakur • 1 year ago

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 queries (October 2024), containing nuggets, answers, and nugget-level judgments for all five topics, are publicly available on Hugging Face:

Nandan Thakur • 1 year ago

The 𝐅𝐫𝐞𝐬𝐡𝐒𝐭𝐚𝐜𝐤 corpus (October 2024), containing chunked GitHub documents for all five topics, is available on Hugging Face:

search founder • 1 year ago

Much needed bc I'm sorry but I just don't get any signal out of the new MTEB leaderboard. In addition to updating the corpus to keep it fresh, will you all be updating the models benchmarked to include newer ones (would love to see Stella, voyage 3 large, gemini-embedding-exp-03-07)? Would also be interested in seeing different rerankers tested, esp. mxbai-rerank-large-v2 vs. listwise and pointwise reranking with gemini flash 2 lite, 2, and 2.5, for example. Basically I need you to do my job for me pls.

Nandan Thakur • 1 year ago

@databricks We will try to keep it updated with newer models (bonus if publicly available). As the dataset is open-sourced, we also hope the community can jump in directly and run baselines on FreshStack.

dare • 1 year ago

@databricks Love the concept! Are you releasing only the datasets for now? I don't see the code on GH.
