ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

Junhong Shen
Atishay Jain
Zedian Xiao
Ishan Amlekar
Mouad Hadji
Aaron Podolny
Ameet Talwalkar

Abstract

Large Language Model (LLM) agents are rapidly improving at handling increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains and comprising 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks: ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and the effect of dataset size.

Version published to 10.32388/8vog0o
Dec 11, 2024

