Data preparation! It's crucial for machine learning, and we all hate it. Tools and techniques to reduce this burden? A quick summary of 10 years of R&D on this, from cheap tricks to LLMs and graph neural networks 1/9

13,751 views • 1 year ago • via X (Twitter)

10 comments

Gael Varoquaux 🦋 • 1 year ago

Yes, part of this is me saying that old stuff is still cool. But hang on tight, there's some LLMs and graph neural networks 2/9

Gael Varoquaux 🦋 • 1 year ago

For tabular data, each column is different, and much work goes into modeling the categories, the dates, the strings. @skrub_data's TableVectorizer makes it much easier to turn a messy dataframe into a bunch of numbers for ML, and it captures many tricks 3/9
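
To make the per-column idea concrete, here is a minimal sketch in plain Python. It is not skrub's implementation: the function name `vectorize_table`, the list-of-dicts layout, and the three rules (numbers pass through, dates split into parts, everything else one-hot encoded) are assumptions chosen for illustration.

```python
from datetime import date


def vectorize_table(rows):
    """Turn a list of dict rows into (feature_names, numeric_rows).

    Hypothetical helper for illustration only, not part of skrub.
    """
    names, encoders = [], []
    for col in rows[0]:
        values = [row[col] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep the value as a single float feature.
            names.append(col)
            encoders.append(lambda v: [float(v)])
        elif all(isinstance(v, date) for v in values):
            # Date column: split into year, month, and day features.
            names += [f"{col}_year", f"{col}_month", f"{col}_day"]
            encoders.append(
                lambda v: [float(v.year), float(v.month), float(v.day)]
            )
        else:
            # Anything else: treat as a category and one-hot encode it.
            cats = sorted({str(v) for v in values})
            names += [f"{col}_{c}" for c in cats]
            encoders.append(
                lambda v, cats=cats: [float(str(v) == c) for c in cats]
            )

    matrix = []
    for row in rows:
        features = []
        # Dicts iterate in insertion order, so columns line up with encoders.
        for col, encode in zip(rows[0], encoders):
            features += encode(row[col])
        matrix.append(features)
    return names, matrix


rows = [
    {"salary": 52000, "hired": date(2019, 3, 1), "dept": "police"},
    {"salary": 61000, "hired": date(2021, 7, 15), "dept": "fire"},
]
names, matrix = vectorize_table(rows)
```

With skrub installed, the library route would be `TableVectorizer().fit_transform(df)` on a pandas dataframe, which applies richer per-column rules than this sketch.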

Gael Varoquaux 🦋 • 1 year ago

Deep learning unfortunately does not do magical feature engineering. Rather, tree-based methods (e.g. gradient boosting) predict better and are faster, because they better capture the properties of tables, e.g. column-wise information 4/9

Gael Varoquaux 🦋 • 1 year ago

Neural networks bring the value of pretraining, but this requires recognizing data semantics. For this, we need to jointly model strings and numbers. The CARTE model predicts very well by combining numbers with language models for strings and graph neural networks for relational context 5/9

Gael Varoquaux 🦋 • 1 year ago

But CARTE is resource-hungry, and resource usage matters. Maybe cheap tricks are worth exploring... Which brings me back to @skrub_data 6/9

Gael Varoquaux 🦋 • 1 year ago

For string columns, @skrub_data can use sub-string modeling to find latent categories. Soon, it will support LLMs, if you're GPU-rich. With these tricks, the TableVectorizer builds great lightweight predictors 7/9
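
A rough illustration of why sub-string modeling helps with messy string columns: entries that share many character n-grams tend to belong to the same underlying category, even when they are not identical. The sketch below uses a plain Jaccard similarity over character trigrams. It is an assumption for illustration, not how skrub's encoders are implemented, and the names `char_ngrams` and `ngram_similarity` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings too short to produce any n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Near-duplicate job titles share most of their trigrams,
# unrelated titles share very few.
close = ngram_similarity("Police Officer II", "Police Officer III")
far = ngram_similarity("Police Officer II", "Fire Fighter")
```

Here `close` comes out much larger than `far`, which is the signal an encoder can exploit to group variants of the same entry without manual cleaning.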

Gael Varoquaux 🦋 • 1 year ago

@skrub_data provides more to facilitate data wrangling. The TableReport is an interactive dataframe explorer. We're even prototyping a dataframe wrapper API, to track transformations, optimize them for learning, and re-apply them in production. Wild! 8/9
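
The idea of tracking transformations and re-applying them in production can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the class `RecordedTable`, its `apply` and `replay` methods, and the row-wise step functions are all hypothetical and say nothing about the actual prototype API.

```python
class RecordedTable:
    """Hypothetical wrapper: applies row-wise steps and remembers them."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def apply(self, func):
        # Work on copies so the wrapped rows are never mutated in place.
        new_rows = [func(dict(row)) for row in self.rows]
        return RecordedTable(new_rows, self.steps + (func,))

    def replay(self, new_rows):
        # Re-run every recorded step, in order, on unseen rows.
        out = [dict(row) for row in new_rows]
        for step in self.steps:
            out = [step(row) for row in out]
        return out


def clean_dept(row):
    row["dept"] = row["dept"].strip().lower()
    return row


def add_tenure(row):
    # The reference year 2024 is an arbitrary constant for the example.
    row["tenure"] = 2024 - row["hired_year"]
    return row


train = RecordedTable([{"dept": " Police ", "hired_year": 2019}])
prepared = train.apply(clean_dept).apply(add_tenure)

# Later, in production: the same steps are replayed on new data.
production_rows = prepared.replay([{"dept": "FIRE", "hired_year": 2021}])
```

Recording the steps alongside the data is what makes it possible to replay exactly the same preparation on new rows, instead of re-implementing it by hand for deployment.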

Gael Varoquaux 🦋 • 1 year ago

Less data wrangling, more machine learning! Watch the talk @dotConferences: 20 min of science, but entertaining (you'll tell me) 9/9

James de Vrij • 1 year ago

@dotConferences I'm going to have to check out Skrub. Do you have any tips for dealing with multiple tables? Cleaning a table is obviously very time-consuming, but another task I spend a lot of time on is exploring other related tables which might add info. It's a lot of work for uncertain payoff.

Mathieu • 1 year ago

@dotConferences That was a great talk, thanks Gaël.
