Three years ago, I started working on an easy-to-use tool for interpretable machine learning in science. I wanted it to do for symbolic regression what Theano did for deep learning. Today, I am beyond excited to share with you the paper describing it! 1.

Symbolic Regression (SR) is a supervised learning task where the space of potential models is spanned by analytic expressions. Often, the goal is to find simple yet accurate expressions that lend themselves to interpretation🔍. 2.
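One way to make "simple yet accurate" concrete is to keep only the expressions that are not beaten on both complexity and loss at once, i.e. a Pareto front. Below is a minimal sketch of that idea. The candidate expressions, their complexity scores, and their losses are made-up illustrative values, and `pareto_front` is a hypothetical helper, not part of any library's API.

```python
def pareto_front(candidates):
    """Keep candidates for which no simpler-or-equal candidate has a lower-or-equal loss.

    Each candidate is a (name, complexity, loss) tuple.
    """
    front = []
    best_loss = float("inf")
    # Walk from simplest to most complex; ties in complexity go to the lower loss.
    for name, complexity, loss in sorted(candidates, key=lambda c: (c[1], c[2])):
        if loss < best_loss:
            front.append((name, complexity, loss))
            best_loss = loss
    return front


# Illustrative values only (not real results).
candidates = [
    ("x", 1, 4.0),
    ("x + 1", 3, 2.5),
    ("x * x", 3, 0.9),
    ("cos(x) + x", 4, 1.2),
    ("x * x + 0.5", 5, 0.1),
]
front = pareto_front(candidates)
for name, complexity, loss in front:
    print(f"complexity={complexity}  loss={loss}  {name}")
```

A scientist would then inspect the front by eye and pick the expression where extra complexity stops buying much accuracy.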

Throughout history, scientists have performed SR "manually," using a mix of intuition and trial-and-error. Empirically-discovered expressions can lead to new theory developments (e.g., Kepler’s law=>Newton's gravity; Planck’s law=>Quantum). 3.

But, with much of ML used in science relying on blackbox models, I worry we often miss out on this crucial step of *understanding* the world. After all, that is the ultimate goal of science! The Latin word scientia literally means “to know.” 4.

To understand a concept, you need to first represent it in your language (however abstract that language is). I think SR is attractive since it grounds ML models in the language of science: symbolic expressions! Just look at any physics cheat sheet: 5.

That isn’t to argue that we avoid deep learning; one can actually use SR as a distillation tool for such blackbox models! 6.

However, when I started my thesis, available SR codes were either:
- Easy to use but slow ⏳
- Fast but hard to use 🤔

The only fast and easy-to-use tool was Eureqa, a proprietary and closed-source tool, which meant no customization or embedding into an analysis pipeline. 7.

Enter PySR: fast, easy-to-use, and open-source🎉. Today, PySR has even more features than proprietary alternatives! 8.

A driver of deep learning's accelerated innovation is the strong open-source tooling – we need similar tooling for SR too. This is also why I have split off the evaluation code of SymbolicRegression.jl into a separate library: DynamicExpressions.jl. 9.

This package makes it easy for others to create new symbolic regression libraries with new ideas, built on a strong foundation of highly optimized kernels used in PySR. Here’s a deep learning analogy: 10.

Okay, so how does PySR work? It’s a fairly traditional approach: a multi-population evolutionary algorithm. Expressions are represented as binary trees, and evolve via a series of mutations and crossovers applied to the “fittest” members of each subpopulation: 11.
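To make the tree picture concrete, here is a minimal sketch of expressions as binary trees with a mutation and a crossover operator. The tuple encoding, the operator set, and every function name below are assumptions chosen for illustration; this is not PySR's actual data structure or mutation set.

```python
import random

# Hypothetical encoding: leaves are ("x",) or ("const", value);
# internal nodes are (operator, left_subtree, right_subtree).
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def evaluate(tree, x):
    """Recursively evaluate the expression tree at input x."""
    kind = tree[0]
    if kind == "x":
        return x
    if kind == "const":
        return tree[1]
    return BINARY_OPS[kind](evaluate(tree[1], x), evaluate(tree[2], x))


def random_leaf(rng):
    return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-2.0, 2.0))


def mutate(tree, rng):
    """Return a new tree with one random change (trees are never edited in place)."""
    if tree[0] not in BINARY_OPS:
        # Leaf: either swap it for a fresh leaf or grow it into a binary node.
        if rng.random() < 0.5:
            return random_leaf(rng)
        return (rng.choice(sorted(BINARY_OPS)), tree, random_leaf(rng))
    op, left, right = tree
    roll = rng.random()
    if roll < 1 / 3:
        # Re-draw the operator (this may pick the same one again).
        return (rng.choice(sorted(BINARY_OPS)), left, right)
    if roll < 2 / 3:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def subtrees(tree):
    yield tree
    if tree[0] in BINARY_OPS:
        yield from subtrees(tree[1])
        yield from subtrees(tree[2])


def crossover(receiver, donor, rng):
    """Replace one randomly chosen subtree of `receiver` with a random subtree of `donor`."""
    if receiver[0] not in BINARY_OPS or rng.random() < 0.3:
        return rng.choice(list(subtrees(donor)))
    op, left, right = receiver
    if rng.random() < 0.5:
        return (op, crossover(left, donor, rng), right)
    return (op, left, crossover(right, donor, rng))


def mse(tree, xs, ys):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
parent = ("+", ("*", ("x",), ("x",)), ("const", 1.0))  # x * x + 1
other = ("-", ("x",), ("const", 2.0))                  # x - 2
child = crossover(mutate(parent, rng), other, rng)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + 1.0 for x in xs]
print(mse(parent, xs, ys), mse(child, xs, ys))
```

Using immutable tuples keeps parents intact when children are created, which makes it safe to keep the parent around if the child turns out worse.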

But there are many other tricks: BFGS for constant optimization, algebraic simplification, simulated annealing, age-regularized tournament selection, and an adaptive complexity penalty. It’s a bit too much to describe precisely here, so please see the paper if curious 🙂 12.
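Two of these tricks, age-regularized tournament selection and simulated annealing, can be sketched together in a few lines. This is a simplified illustration under my own assumptions: the acceptance rule `exp(-delta / temperature)`, the dictionary layout, and the names `tournament_step`, `loss_fn`, and `mutate_fn` are all hypothetical, and the "expressions" are plain floats so the example stays self-contained. See the paper for what PySR actually does.

```python
import math
import random


def tournament_step(population, loss_fn, mutate_fn, rng, step,
                    tournament_size=3, temperature=1.0):
    """One evolution step: pick a tournament winner, mutate it, replace the oldest member."""
    contestants = rng.sample(population, tournament_size)
    winner = min(contestants, key=lambda m: m["loss"])

    child_expr = mutate_fn(winner["expr"], rng)
    child_loss = loss_fn(child_expr)

    # Simulated annealing: always accept improvements; accept a worse child
    # with a probability that shrinks as the loss gap grows or the temperature drops.
    delta = child_loss - winner["loss"]
    if delta > 0:
        accept_prob = math.exp(-delta / temperature) if temperature > 0 else 0.0
        if rng.random() >= accept_prob:
            child_expr, child_loss = winner["expr"], winner["loss"]  # rejected: keep a copy

    # Age regularization: overwrite the OLDEST member, not the worst one.
    oldest = min(range(len(population)), key=lambda i: population[i]["birth"])
    population[oldest] = {"expr": child_expr, "loss": child_loss, "birth": step}


# Toy problem: the "expression" is a single number and the target is 3.0.
def toy_loss(value):
    return (value - 3.0) ** 2


def toy_mutate(value, rng):
    return value + rng.gauss(0.0, 0.5)


rng = random.Random(0)
population = [
    {"expr": v, "loss": toy_loss(v), "birth": birth}
    for birth, v in zip(range(-8, 0), [rng.uniform(-5.0, 5.0) for _ in range(8)])
]
for step in range(200):
    tournament_step(population, toy_loss, toy_mutate, rng, step,
                    temperature=1.0 - step / 200)
print(min(m["loss"] for m in population))
```

Replacing the oldest member rather than the worst keeps the population turning over, so an early lucky expression cannot sit in the population forever.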

PySR also works seamlessly across 1000s of cores. Each population evolves independently, and members asynchronously "migrate" between these populations to share updates. 13.
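Here is a minimal sketch of what migration between populations could look like. It is deliberately simplified: it runs synchronously in one process (the real thing is asynchronous across cores), and the `migrate` function, the `(expression, loss)` tuple layout, and the rule "the best member of a random other population overwrites a random local member" are assumptions for illustration only.

```python
import random


def migrate(islands, rng, n_migrants=1):
    """Copy the best members of a random other island over random members of each island.

    Each island is a list of (expression, loss) tuples; at least two islands are required.
    """
    # Snapshot emigrants first so the outcome does not depend on island order.
    emigrants = [sorted(island, key=lambda m: m[1])[:n_migrants] for island in islands]
    for i, island in enumerate(islands):
        source = rng.choice([j for j in range(len(islands)) if j != i])
        for migrant in emigrants[source]:
            island[rng.randrange(len(island))] = migrant


rng = random.Random(0)
# Illustrative members: the name encodes the island of origin, losses are made up.
islands = [
    [("a0", 0.9), ("a1", 0.4), ("a2", 1.3)],
    [("b0", 0.2), ("b1", 0.7), ("b2", 0.8)],
    [("c0", 1.1), ("c1", 0.6), ("c2", 0.3)],
]
migrate(islands, rng)
for island in islands:
    print(island)
```

Because population sizes stay fixed, each migrant displaces a local member, so migrating too often would trade diversity for faster sharing of good expressions.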

A motif in PySR's design is flexibility – while also being extremely high-performance. PySR ought to be a tool that can solve model discovery problems all throughout science, without needing hacks. Here's a comparison: (includes links so you can check these others out!) 14.

In the paper, I demo a benchmark based on historical discoveries, and see whether codes can re-discover these with little prior information. Where possible, I include original datasets! (for Leavitt’s law I had to manually read off data from a 1912 plot…) 15.

To really emulate the problem of discovering an unknown model, I use the same hyperparameters as each author submitted to the SRBench competition (as well as PySR), and let every code search for 1 hour on 8 cores. The rediscovery results (scored: yes/no) - 16.

All methods seem to struggle with Planck’s law and Rydberg formula, likely due to the unusual scaling. Pure deep learning methods (EQL + SR-Transformer) seem to have difficulty on a range of problems. 17.

We can see EQL experiencing numerical instabilities, and SR-Transformer (pre-trained on synthetic expressions at various levels of noise) seems to generate overly complex expressions in every test. 18.

While it is important to note some of these are tuned for accuracy alone, it is very interesting that pure deep learning methods still really struggle here. Perhaps it is a testament to the difficulty of learning representations in the space of symbolic expressions. 19.

Regardless of this, DL methods still perform well on synthetic benchmarks, which is what they are tuned for, so I see hybrid approaches as very much worth pursuing! 20.

Today, PySR has a growing community across academia and industry, with users working in a variety of fields from economics to astronomy. I am looking forward to seeing it continue to grow! I would like to thank: 21.

for providing resources for pursuing this research; @cosmo_shirley and @DavidSpergel for countless insightful discussions about PySR, feedback on this manuscript, promotion of it as a tool in the sciences, and for their support of this project; 22.

my research collaborators who provided feedback throughout the development of PySR, including @PabloLemosP @PeterWBattaglia @eigensteve @JayWadekar1 @paco_astro @physicskaze Elaine Cui @CDKreisch Nathan Kutz @DrumBushField Keaton Burns @dkochkov1 23.

Alvaro Sanchez-Gonzalez @AstroCKragh @PatrickKidger @KyleCranmer @Niall_Jeffrey Ana Maria Delgado @AstroKeming Pierre-Alexandre Kamienny, Michael Douglas, @f_charton; all the wonderful open-source code contributors, including @markkitti, T Coxon, Dhananjay Ashok, 24.

Johan Blåbäck, Julius Martensen, GitHub user ngam, @ChrisRackauckas @l_II_llI, Charles Fox @johannbrehmer @cosmic_mar, GitHub user Coba, Pietro Monticone, Mateusz Kubica, GitHub user Jgmedina95, Michael Abbott, Oscar Smith, and several others; 25.

for extremely helpful comments on a draft of this paper, as well as general feedback throughout the project; @w_la_cava for insight throughout the project as well as for spearheading the SRBench initiative, along with the rest of the SRBench organizers; 26.

Brenden Petersen for feedback on PySR as well as providing insightful discussions about the SR landscape; and so many others (am likely forgetting some) who have provided support to the project through email, Twitter, GitHub issues, and in-person! 27.

I would like to give a huge thanks to the SRBench team as well. I think part of deep learning's continued success is the proliferation of well-tested benchmarks, and the SRBench team is doing this for symbolic regression! 28.

FAQ 1: What about concepts we can't represent with existing operators?
A: Interpreting something requires representing it in our language (whether that language be mathematical, programmatic, conceptual, etc.). 29.

Sometimes those representations are hierarchical, and sometimes those representations are also fuzzy. But for each new concept we define and add to our language, we have to ground it in our existing language. 30.

In a symbolic distillation context, this could entail a "feature learning" network, followed by another network that uses those features. You would then distill both networks to expressions in your existing language. 31.
