Is it possible to build end-to-end autonomous discovery systems using Large Generative Models (LGMs)? 🧬 In this position paper, we argue: 🧵 (1/n) Ai2 Aristo Team at Ai2 Harshit Surana UMass Amherst University of Utah

(2/n) 📊 We present a practical first step toward end-to-end automation of the scientific process, focusing on observational or experimental data for two reasons: (1) large-scale datasets that would benefit greatly from automated discovery are abundant; 📈 (2) existing data makes automated verification practical, with no need for additional data collection. ⚗️

(3/n) A blueprint flow for data-driven discovery includes the following scenarios: 1. The user asks an explicit question around a particular line of inquiry or hypothesis. 🎯 2. The user can also ask a broad and partially defined high-level question, where the system must figure out the appropriate datasets, data transformations, variables, a list of possible hypotheses, and their verification. 📒 3. The user can provide follow-up feedback at any time, and the "continual learner" will continually evolve while providing updated experiments and results. 🤖

(4/n) We posit that: 1. LGMs show great potential, for example in knowledge-driven hypothesis search or tool use to verify hypotheses, creating new avenues for ongoing efforts in the ML community on code generation, planning, and program synthesis. 🛠️ 2. LGMs are not all we need. Accurate, reliable, and robust data-driven discovery requires interfacing with fail-proof tools and inference-time functions, and catering to specific domains and the long tail with user moderation. 👩👩👧

(5/n) We outline a set of desired properties for a data-driven discovery system. 🟩 1. Comprehensive Data Understanding 2. Hypothesis Generation 3. Planning and Orchestrating Research Pathways 4. Hypothesis Evaluation 5. Measurement of Progress 6. Knowledge Integration 7. Research Ethics and Fairness. These are high-level desiderata, each with several sub-properties delineated in the paper. Our survey of existing automated and semi-automated data analysis and discovery systems reveals that they only partially cover the desired functionalities. 🔻

(6/n) As a proof of concept, we build DataVoyager, a system powered by GPT-4 that can semantically understand a dataset, programmatically explore verifiable hypotheses using the available data, run basic statistical tests (e.g., correlation and regression analyses) by invoking pre-defined functions or generating code snippets, and finally interpret the output in detail.
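To make "basic statistical tests ... by invoking pre-defined functions" concrete, here is a minimal sketch of what one such function might look like. It is not DataVoyager's code: the name `correlation_and_regression`, the `FitResult` type, and the toy data are all assumptions made for illustration. It computes a Pearson correlation and a least-squares line using only the Python standard library.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FitResult:
    """Hypothetical result container, not part of any real system."""
    r: float          # Pearson correlation coefficient
    slope: float      # least-squares slope of y on x
    intercept: float  # least-squares intercept


def correlation_and_regression(x: Sequence[float], y: Sequence[float]) -> FitResult:
    """Pearson correlation and simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return FitResult(r=r, slope=slope, intercept=intercept)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    hours = [1.0, 2.0, 3.0, 4.0, 5.0]
    score = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(correlation_and_regression(hours, score))
```

A planner could call a vetted function like this rather than generating statistics code from scratch each time, which is one way to read the "fail-proof tools" point in (4/n).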

(7/n) Planning: DataVoyager presents a strong base case for planning with decomposition, data transformation, and symbolic reasoning. However, LGM-based planners prefer direct, goal-oriented variables, which can lead to a lack of diversity in search, impacting the novelty of the outcome.

(8/n) Experimentation & Verification: DataVoyager can use tools and insight-specific code generation to reasonably verify hypotheses. But LGMs are memoryless. They cannot automatically recover from past errors in execution and verification. We argue that how LGMs adapt to novel tools and code at inference time is still an open question.
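The memorylessness point can be illustrated with a small sketch. This is not something the thread or paper describes; it assumes one common workaround, in which the surrounding system, not the model, keeps a log of past execution errors and passes it back on each retry. `run_with_error_memory` and `stub_generator` are hypothetical names, and the stub stands in for a real LGM call.

```python
from typing import Callable, List, Tuple


def run_with_error_memory(
    generate_code: Callable[[str, List[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry code generation, feeding every past error back to the generator.

    The error log lives outside the model, so a memoryless generator still
    sees what went wrong on earlier attempts.
    """
    errors: List[str] = []
    for _ in range(max_attempts):
        code = generate_code(task, errors)
        try:
            # Illustration only: a real system would run this in a sandbox.
            exec(code, {})
            return True, errors
        except Exception as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
    return False, errors


def stub_generator(task: str, errors: List[str]) -> str:
    """Stand-in for an LGM: buggy on the first try, fixed once it sees an error."""
    if not errors:
        return "result = 1 / 0"
    return "result = 1 / 1"


if __name__ == "__main__":
    ok, log = run_with_error_memory(stub_generator, "compute a ratio")
    print(ok, log)
```

Whether real LGMs make good use of such a log for unfamiliar tools is exactly the open question raised above; the sketch only shows where the memory would have to live.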

(9/n) Knowledge Integration: DataVoyager can partially achieve interdisciplinary knowledge integration. E.g., it could connect the role of economic pressure on health outcomes with cultural anthropology, psychological factors, public health intervention, and urban planning. Additionally, knowledge frontiers represent cutting-edge scientific exploration. DataVoyager shows promise in generating novel analysis in an experimental scientific frontier.

(10/n) 🚨 We also point out possible limitations of such automated systems, such as: 1. Hallucinations in LGMs undermining scientific rigor 2. Cost at scale in high-throughput fields 3. Data dredging resulting in sub-optimal policies 4. Autonomous discovery leading to legal implications 5. Potential percolation of bias originating from two sources: the underlying dataset and the LGMs.
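On data dredging (item 3): a system that tests many hypotheses on one dataset will find some "significant" results by chance. One standard safeguard, shown here as a sketch and not as anything the thread prescribes, is the Bonferroni correction, which compares each p-value against alpha divided by the number of tests. The function name and the p-values are made up for the example.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which hypotheses stay significant after Bonferroni correction.

    Each p-value is compared against alpha / m, where m is the number of
    hypotheses tested, which bounds the family-wise error rate by alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Invented p-values from four hypothetical automated tests.
    p_values = [0.001, 0.02, 0.04, 0.3]
    naive = [p <= 0.05 for p in p_values]
    corrected = bonferroni_significant(p_values)
    print("naive:    ", naive)
    print("corrected:", corrected)
```

With these toy numbers, three of four tests pass the naive 0.05 cut-off but only one survives the corrected threshold of 0.0125.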

(11/n) 🌌 We hope our position paper increases interest and effort in developing, debating, and enhancing the vision for an accurate, reliable, and robust system for data-driven discovery. Such systems can transform domains overwhelmed with vast amounts of data, including but not limited to observational social sciences, medicine, astronomy, biology, climate science, and consumer science. They can initiate a Cambrian explosion of discovery while promoting speed, reproducibility, and collaboration.

(n/n) This was a collaborative effort by @mbodhisattwa, @surana_h, @dhruvagarwal17, @hsanchaita, @Ashish_S_AI, and Peter Clark from @allen_ai @ai2_aristo @UMassAmherst @UMass_NLP @UUtah

