Instruction Strategy Design for Autonomous Machine Learning Experimentation Systems: A Taxonomy, Cross-System Analysis, and Evidence-Based Practitioner Framework

Abstract

Autonomous machine learning experimentation systems, in which a large language model (LLM) agent iteratively proposes, executes, and evaluates code modifications against a fixed scalar metric, represent a fundamental shift in how machine learning research is conducted. In these systems, the practitioner's primary lever is not the training code itself but the natural-language research program: the instruction document that specifies objectives, priorities, and constraints for the agent across dozens or hundreds of consecutive decisions. Despite this centrality, no principled framework for designing research programs exists in the literature. This survey addresses that gap through four contributions. First, we conduct a structured cross-system analysis of sixteen agentic AutoML and autonomous research systems (including AIDE, AIRA, R&D-Agent, AgentHPO, AlphaEvolve, MLAgentBench, AI-Researcher, and AI Scientist-v2), identifying the instruction document as a universal practitioner-facing control mechanism and cataloguing seven design dimensions. Second, we develop a five-family taxonomy of instruction strategies (Scope-Constrained, Hypothesis-Directed, Diversity-Preserving, Simplicity-Biased, and Curriculum-Staged), grounded in theory from the AutoML, evolutionary computation, prompt engineering, and curriculum learning literatures. Third, we provide multi-source empirical grounding: analysis of two publicly documented overnight sessions suggests that a cross-session curriculum intervention is associated with a 37% difference in total gain, with important caveats regarding session-length confounding; independently controlled benchmarks from AIRA and AgentHPO corroborate the taxonomy's predictions. Fourth, we synthesise five practitioner guidelines with explicitly labelled calibration thresholds and validate them against all sixteen surveyed systems.
