Effective Table Data Extraction from PDF without LLM

Sparrow Parse reads tabular data from PDFs using libraries such as Unstructured or PyMuPDF4LLM. This avoids the hallucination errors that LLMs often produce when processing complex data structures.
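To illustrate why a rule-based path cannot hallucinate values, here is a minimal sketch in Python. It assumes the PDF table has already been rendered as a pipe-delimited Markdown table, which is the kind of text a Markdown-oriented PDF library can emit. The function name `parse_markdown_table` and the sample statement are hypothetical, made up for illustration; this is not Sparrow Parse's own code or API.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table found in `markdown`.

    Hypothetical helper for illustration only. Every value in the result
    is copied verbatim from the input text, so nothing can be invented.
    """
    header = None
    rows = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        is_table_line = len(line) >= 2 and line.startswith("|") and line.endswith("|")
        if not is_table_line:
            if header is not None:
                break  # first non-table line after the table ends it
            continue
        # Drop the outer pipes, then split the remaining text into cells.
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|:--:|
        else:
            # Pad or truncate so each row lines up with the header.
            cells = (cells + [""] * len(header))[: len(header)]
            rows.append(dict(zip(header, cells)))
    return rows


# Made-up sample resembling a borderless bank statement converted to Markdown.
sample = """
Statement for a hypothetical account

| Date       | Description | Amount  |
|------------|-------------|--------:|
| 2024-01-03 | Coffee      |    4.50 |
| 2024-01-05 | Salary      | 2500.00 |

Closing text
"""

if __name__ == "__main__":
    for row in parse_markdown_table(sample):
        print(row)
```

Each cell stays a string taken directly from the source text; any type conversion (dates, amounts) would be a separate, explicit step.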

Andrej Baranovskij

6,726 subscribers

27,886 views • 2 years ago • via X (Twitter)

Science & Technology Health & Wellness Education

10 Comments

ViGa • 2 years ago

Cross-page tables?

Andrej Baranovskij • 2 years ago

Work in progress.

Nasser Builds • 2 years ago

Thank you

Sumit Shekhar • 2 years ago

How is the performance on borderless tables?

Andrej Baranovskij • 2 years ago

I tested it with bank statements, which are borderless, and it performs with 95% accuracy.

Ashish • 2 years ago

Very useful

Marlon • 2 years ago

This is a lot more challenging than people realize. I went through a ton of approaches for table extraction recently and ended up with a pipeline revolving around a fine-tuned table-transformer and GPT-4V with visual cues. Excited to try this out as well.

Andrej Baranovskij • 2 years ago

Agree 💯

Khalid Jamal - خالد جمال • 1 year ago

Can it extract equations from scientific PDF papers?

Andrej Baranovskij • 1 year ago

Haven’t tried. I doubt the 7B model would, but the 72B model should handle it, depending on complexity.

Related Videos

Structured Output from Multipage PDF with Sparrow (Qwen2 Vision LLM and MLX)

I explain how multipage PDFs are handled in Sparrow to extract structured data in a single call.

Andrej Baranovskij

30,645 views • 1 year ago

Learn about our research prototype LLM-powered personal health agent that analyzes various data modalities, including data from wearable devices, to offer evidence-based health insights and to provide a personalized coaching experience. Read more →

Google Research

25,348 views • 8 months ago

Here’s how I would learn data engineering in 2025:

1. The basics
- learn SQL: SELECT, FROM, WHERE, GROUP BY, JOIN, HAVING, etc.
- learn Python
  - data structures: objects, arrays, tuples, namedtuples
  - algorithms: recursion, loops

2. Intermediate
- learn distributed compute: pick up PySpark or Snowflake or BigQuery
- learn data lake architecture: pick up Iceberg or Delta Lake
- learn job orchestration: pick up Airflow or Mage
- learn data quality: pick up Great Expectations

3. Advanced
- learn the data modeling techniques: one big table vs Kimball vs Inmon vs data vault techniques
- learn machine learning features and vector databases: pick up Pinecone and how to fine-tune LLMs with high-quality data

My newsletter has a deeper roadmap here:

Zach Wilson

29,164 views • 11 months ago

This week at Google I/O, Box highlighted how Gemini 2.5 Pro can be used with Box AI to power complex data extraction. There's a tremendous amount of value trapped in unstructured data (like PDFs, docs, images, and more) that we can now tap into with AI Agents.

Aaron Levie

49,142 views • 1 year ago

Unsiloed AI (YC F25) builds APIs to parse multimodal unstructured data like PDFs, PPTs, and images and convert it into LLM-ready formats like Markdown and JSON. Congrats on the launch, @__Aman_Mishra & Adnan!

Y Combinator

36,066 views • 7 months ago

Sharing our latest short course: Building and Evaluating Data Agents, created in collaboration with Snowflake and taught by Anupam Datta and Josh Reini. A data agent extracts data from sources such as files or databases, analyzes it, provides insights, and visualizes its findings. But most data agents struggle with reliability or can't handle multi-step reasoning. In this course, you'll learn to build, trace, and evaluate a multi-agent workflow that plans tasks, pulls context from structured and unstructured data, performs web search, and summarizes or visualizes the final results. Learn more and enroll for free!

DeepLearning.AI

40,745 views • 8 months ago

Today, Box announced new AI Agents to work with enterprise content, powering Deep Research, Search, and enhanced Data Extraction. There’s a tremendous amount of value that’s trapped in unstructured data, from contracts to research data, that we can finally unlock with AI.

Aaron Levie

60,262 views • 1 year ago

Transfer data from PDF to Excel in seconds

Hub4Learning

23,236 views • 4 months ago

Data preprocessing is critical for building effective RAG systems. Our new short course, Preprocessing Unstructured Data for LLM Applications, taught by Matt Robinson of Unstructured, demonstrates important but sometimes overlooked aspects of RAG systems: - How to extract and normalize content from diverse formats like PDF, Powerpoint, and HTML to expand your LLM's knowledge - Enriching data with metadata to enable more powerful retrieval and reasoning - Applying document layout analysis and vision transforms to process embedded images and tables Then you’ll use all these skills and build a RAG bot that draws from a corpus that includes PDF, PowerPoint, and Markdown documents. Please sign up here:

Andrew Ng

150,317 views • 2 years ago

Major program launch: Data Analytics Professional Certificate! This large, five-course sequence takes you all the way to being job-ready as a data analyst, and shows how to use Generative AI as a thought partner to enhance your work in this role. Offered on Coursera, this is taught by Sean Barnes, Ph.D., a Data Science & Engineering Leader at Netflix.

Analyzing data remains one of the most important skills in where the world is going with AI. This comprehensive certificate takes you all the way to being job-ready. Each course comes with practical projects demonstrated in real-world contexts, such as analyzing sales data for a Korean bakery, video game sales trends across different regions, or identifying factors impacting customer retention for a communications company. You'll also work on estimating fire distribution for forest fire prevention, analyzing how a diamond's properties affect its market value, and developing predictive models for retail sales analysis, carbon emissions, and coral reef conservation.

Here's some of what you'll learn:
- How to define data and categorize it into its many types, such as discrete & continuous numerical, structured & unstructured, time series, and categorical, and know what insights can be derived from the different types of data categories.
- How to differentiate between data-related job roles and their responsibilities, and how data flows through an organization from the moment of capture to decision-making.
- How to perform data processing functions and apply conditional formatting in spreadsheets to extract business value from your data using statistical calculations and best practices for visualizing and interpreting data.
- How to use LLMs for stakeholder analysis, data exploration, and data visualization.
- Best practices for using LLMs as a thought partner in data analysis work.

By the end of this professional certificate program, you will have learned core statistical concepts, analysis techniques, and visualization methodologies that will serve as the foundation for working as a data analyst. The world needs more data analysts, especially ones who know how to use modern generative AI. With data science roles projected to grow 36% by 2033, the skills taught in this program create new professional opportunities in data. Sign up here!

Andrew Ng

84,686 views • 1 year ago

Learn to train an LLM with distributed data while ensuring privacy using federated learning in a new two-part short course, Intro to Federated Learning and Federated Fine-tuning of LLMs with Private Data, created with Flower and taught by Daniel J. Beutel and nic lane. Federated learning allows a single model to be trained across multiple devices, such as phones, or multiple organizations, such as hospitals, without the need to share data to a central server. This two-part course gives you an introduction to federated learning, and then teaches you how to fine-tune your large language model with distributed data using Flower Lab’s open source federated learning framework. You’ll learn: - How to use federated learning to train a variety of models, ranging from speech and vision models to LLMs, across distributed data while offering data privacy options to users and organizations. - Privacy Enhancing Technologies like differential privacy (DP), which obscures individual data by adding calibrated noise to query results. - Two variants of differential privacy - Central and Local - and how to choose depending on your use case. - How to measure and decrease bandwidth usage to make federated learning more practical and efficient with techniques like using pre-trained models and Parameter-Efficient Fine-Tuning - How federated LLM fine-tuning reduces the risk of leaking training data. Sign up here!

Andrew Ng

64,538 views • 1 year ago

Welcome to Agentforce: a powerful data platform and an exceptional agent builder. The key to making agents truly effective is data. Salesforce agents deliver greater accuracy thanks to our integrated & comprehensive data & metadata. Without data, you're left with just a "dumb" LLM. No agent platform will gain traction without integrating data and metadata at its core. Get Agentforce at Dreamforce. ❤️

Marc Benioff

49,456 views • 1 year ago

Thousands of financial analysts spend countless hours extracting data from PDFs, where a single error could cost millions. Learn how TWG is partnering with Palantir and xAI to integrate Grok with Palantir AIP and the Ontology to automate data extraction, saving time and boosting accuracy.

Palantir

57,979 views • 1 year ago

Imitation learning is great, but needs us to have (near) optimal data. We throw away most other data (failures, evaluation data, suboptimal data, undirected play data), even though this data can be really useful and way cheaper! In our new work - RISE, we show a simple way to use all of this non-optimal data to robustify imitation learning with minimal requirements beyond BC. Key idea: use non-expert data to learn how to recover back to expert data with a minimal frills offline RL that works under sparse data coverage. Allows usage of all available data, not just expert data - never throw your data away! Paper: Website: A 🧵(1/10)

Abhishek Gupta

20,545 views • 7 months ago

Making Data and AI Lovable. Lovable now integrates with Databricks, providing a natural language interface that allows anyone, regardless of technical skills, to build live data apps that can read and write data stored in Databricks. Bridge the gap between complex data engineering and beautiful, functional front-ends.

Databricks

33,350 views • 1 month ago

Function calling is a powerful way to extend the capabilities of LLMs and AI agents by letting them use external tools. Our new short course Function calling and Data Extraction with LLMs, created with @NexusflowX and taught by Jiantao Jiao and Venkat, demonstrates how to prompt LLMs to form calls to external functions. You'll work with NexusRavenV2-13B, a 13B parameter open-source model that excels in function calling tasks while still being small enough to host locally. Learn to use function calling to extract structured data from unstructured text and access web APIs, and build an end-to-end application that processes customer service transcripts. You'll learn how to build LLM-powered applications that can analyze feedback, automate data entry, and enhance search. Please get started here:

Andrew Ng

110,420 views • 2 years ago

Roman Nebo from GhostDrive highlights the challenge of managing unstructured data as daily uploads surpass 2.5 quintillion bytes. GhostDrive, built on Filecoin, offers AI-powered tools to transform raw data into structured, secure assets.

Filecoin

15,309 views • 1 year ago

Primal’s meticulously crafted 3D models – derived from real scan data – amplify anatomy learning. Follow us to learn more!

Primal Pictures 3D Human Anatomy & Physiology

128,319 views • 5 months ago

Introducing brand new scrim analysis tool View your scrim data INSTANTLY after a match ends. ✅ No Vods or Recording needed ✅ No lag or ping ✅ Scrim data is viewable immediately after the game ends on ✅ Matches are completely private and secure ✅ Export your data to JSON, and CSV (coming soon) ✅ Only $29.99/mo More features 👇

RIB.GG

75,936 views • 2 years ago