Data preparation! It's crucial for machine learning, and we all hate it. Tools and techniques to reduce this burden? A quick summary of 10 years of R&D on this, from cheap tricks to LLMs and graph neural networks 1/9

Yes, part of this is me saying that old stuff is still cool. But hang on tight, there's some LLMs and graph neural networks 2/9

For tabular data, each column is different, and much work goes into modeling the categories, the dates, the strings. @skrub_data's TableVectorizer hugely facilitates turning a messy dataframe to a bunch of numbers for ML, and it captures many tricks 3/9
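To make "each column is different" concrete, here is a minimal pure-Python sketch of per-column encoding. It is not skrub's implementation or API: `encode_column` and `vectorize_table` are hypothetical names, and the table is a plain dict of lists rather than a dataframe. It only shows the idea of picking an encoding from the column's type.

```python
from datetime import date


def encode_column(name, values):
    """Turn one column into named numeric features, chosen by the column's type."""
    if all(isinstance(v, (int, float)) for v in values):
        # Numbers pass through unchanged.
        return {name: [float(v) for v in values]}
    if all(isinstance(v, date) for v in values):
        # Dates are split into numeric parts a model can use.
        return {
            f"{name}_year": [float(v.year) for v in values],
            f"{name}_month": [float(v.month) for v in values],
            f"{name}_weekday": [float(v.weekday()) for v in values],
        }
    # Everything else is treated as a category and one-hot encoded.
    categories = sorted(set(values))
    return {
        f"{name}_{cat}": [1.0 if v == cat else 0.0 for v in values]
        for cat in categories
    }


def vectorize_table(table):
    """Encode every column of a dict-of-lists table and merge the features."""
    features = {}
    for name, values in table.items():
        features.update(encode_column(name, values))
    return features


# Made-up example data, for illustration only.
table = {
    "salary": [52000, 61000, 58000],
    "hired": [date(2019, 3, 4), date(2021, 7, 19), date(2020, 11, 2)],
    "role": ["engineer", "analyst", "engineer"],
}
features = vectorize_table(table)
for feature_name, column in features.items():
    print(feature_name, column)
```

One-hot encoding only makes sense for columns with few distinct values; messy, high-cardinality strings need something smarter.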

Deep learning unfortunately does not do magical feature engineering. Rather, tree-based methods (e.g. gradient boosting) predict better and are faster, as they better capture properties of tables, e.g. column-wise information 4/9

Neural networks bring the value of pretraining, but this requires recognizing data semantics. For this, we need to jointly model strings and numbers. The CARTE model predicts very well by combining numbers with language models for strings and graph neural networks for relational context 5/9

But CARTE is resource-hungry, and resource usage matters. Maybe cheap tricks are worth exploring... Which brings me back to @skrub_data 6/9

For string columns, @skrub_data can use sub-string modeling to find latent categories. Soon, it will support LLMs, if you're GPU-rich. With these tricks, the TableVectorizer builds great lightweight predictors 7/9
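Why sub-strings help can be seen in a few lines of plain Python. This is a rough sketch, not skrub's encoder: `char_ngrams` and `ngram_similarity` are hypothetical helpers, and Jaccard overlap of character 3-grams stands in for whatever model the library fits. The point is only that near-duplicate entries share most of their sub-strings, so they can be grouped without manual cleaning.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two empty strings are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Made-up job titles, for illustration only.
close = ngram_similarity("Police Officer II", "Police Officer III")
far = ngram_similarity("Police Officer II", "Firefighter")
print(f"near-duplicate titles: {close:.2f}")
print(f"unrelated titles:      {far:.2f}")
```

The two police titles score far higher than the unrelated pair, which is the signal a sub-string-based encoder can exploit to recover a latent "police" category.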

@skrub_data provides more to facilitate data wrangling. The TableReport is an interactive dataframe explorer. We're even prototyping a dataframe wrapper API, to track transformations, optimize them for learning, and re-apply them in production. Wild! 8/9

Less data wrangling, more machine learning! Watch the talk @dotConferences: 20 min of the science, but entertaining (you'll tell me) 9/9

@dotConferences I'm going to have to check out Skrub. Do you have any tips for dealing with multiple tables? Cleaning a table is obviously very time consuming but another task I spend a lot of time on is exploring other related tables which might add info. It's a lot of work for uncertain payoff

@dotConferences That was a great talk, thanks Gaël.
