Data preparation! It's crucial for machine learning, and we all hate it. Tools and techniques to reduce this burden? A quick summary of 10 years of R&D on this, from cheap tricks to LLMs and graph neural networks 1/9

13,751 views • 1 year ago • via X (Twitter)

10 comments

Gael Varoquaux 🦋 • 1 year ago

Yes, part of this is me saying that old stuff is still cool. But hang on tight, there's some LLMs and graph neural networks 2/9

Gael Varoquaux 🦋 • 1 year ago

For tabular data, each column is different, and much work goes into modeling the categories, the dates, the strings. @skrub_data's TableVectorizer hugely facilitates turning a messy dataframe into a bunch of numbers for ML, and it captures many tricks 3/9
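To make "each column is different" concrete, here is a minimal standard-library sketch of per-column dispatch: numbers are kept as floats, ISO dates are split into parts, and any other string column is one-hot encoded. This is an illustration only, not skrub's TableVectorizer; the function names `vectorize_column` and `vectorize_table`, the `%Y-%m-%d` date format, and the toy table are all assumptions made for the example.

```python
from datetime import datetime


def vectorize_column(values):
    """Turn one column of raw strings into (feature names, numeric rows)."""
    # 1. Numeric column: keep the values as floats.
    try:
        return ["value"], [[float(v)] for v in values]
    except ValueError:
        pass
    # 2. Date column (assumed ISO format): expose year, month, weekday.
    try:
        dates = [datetime.strptime(v, "%Y-%m-%d") for v in values]
        return (
            ["year", "month", "weekday"],
            [[d.year, d.month, d.weekday()] for d in dates],
        )
    except ValueError:
        pass
    # 3. Anything else: one-hot encode the distinct strings.
    #    (High-cardinality or typo-ridden strings need a smarter encoder.)
    categories = sorted(set(values))
    return categories, [
        [1.0 if v == c else 0.0 for c in categories] for v in values
    ]


def vectorize_table(table):
    """table maps column name -> list of raw strings, all of equal length."""
    n_rows = len(next(iter(table.values())))
    names, rows = [], [[] for _ in range(n_rows)]
    for col, values in table.items():
        sub_names, sub_rows = vectorize_column(values)
        names += [f"{col}_{name}" for name in sub_names]
        for row, extra in zip(rows, sub_rows):
            row.extend(extra)
    return names, rows


# Hypothetical toy table with a numeric, a date, and a category column.
toy_table = {
    "salary": ["52000", "61000.5", "48000"],
    "hired": ["2019-03-01", "2021-11-15", "2020-07-30"],
    "dept": ["police", "fire", "police"],
}
feature_names, feature_rows = vectorize_table(toy_table)
print(feature_names)
print(feature_rows[0])
```

A real vectorizer also has to handle missing values, mixed types, and many date formats, which is where most of the hand-written effort usually goes.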

Gael Varoquaux 🦋 • 1 year ago

Deep learning unfortunately does not do magical feature engineering. Rather, tree-based methods (e.g. gradient boosting) predict better and are faster, as they better capture properties of tables, e.g. column-wise information 4/9

Gael Varoquaux 🦋 • 1 year ago

Neural networks bring the value of pretraining, but this requires recognizing data semantics. For this, we need to jointly model strings and numbers. The CARTE model predicts very well by mixing numbers with language models for strings and graph neural networks for relational context 5/9

Gael Varoquaux 🦋 • 1 year ago

But CARTE is resource-hungry, and resource usage matters. Maybe cheap tricks are worth exploring... Which brings me back to @skrub_data 6/9

Gael Varoquaux 🦋 • 1 year ago

For string columns, @skrub_data can use sub-string modeling to find latent categories. Soon, it will support LLMs, if you're GPU-rich. With these tricks, the TableVectorizer builds great lightweight predictors 7/9
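The intuition behind sub-string modeling can be sketched with character n-grams: strings that share many short sub-strings probably belong to the same latent category, even when they are not exact matches. The snippet below is a standard-library illustration of that idea using Jaccard similarity on 3-grams. It is not skrub's actual encoder, and the helper names and example job titles are made up for the sketch.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Hypothetical dirty category values: two variants of one job, one unrelated.
close = ngram_similarity("Police Officer II", "Police Officer III")
far = ngram_similarity("Police Officer II", "Fire Fighter")
print(f"variants: {close:.2f}, unrelated: {far:.2f}")
```

Encoding each entry by its similarity to a handful of reference strings gives numeric features that stay robust to typos and naming variants, at a fraction of the cost of a language model.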

Gael Varoquaux 🦋 • 1 year ago

@skrub_data provides more to facilitate data wrangling. The TableReport is an interactive dataframe explorer. We're even prototyping a dataframe wrapper API, to track transformations, optimize them for learning, and re-apply them in production. Wild! 8/9

Gael Varoquaux 🦋 • 1 year ago

Less data wrangling, more machine learning! Watch the talk @dotConferences: 20 min of the science, but entertaining (you'll tell me) 9/9

James de Vrij • 1 year ago

@dotConferences I'm going to have to check out Skrub. Do you have any tips for dealing with multiple tables? Cleaning a table is obviously very time-consuming, but another task I spend a lot of time on is exploring other related tables which might add info. It's a lot of work for uncertain payoff

Mathieu • 1 year ago

@dotConferences That was a great talk, thanks Gaël.
