OpenThoughts-Agent: Data Recipes for Agentic Models

Negin Raoof, Richard Zhuang, Marianna Nezhurina, Etash Guha, Atula Tejaswi et al. | June 23, 2026 | arXiv

Key Takeaway

Systematic data curation matters: the mix of task sources and the diversity of the training data significantly improve how well agents generalize across different benchmarks.

Summary

This paper presents OpenThoughts-Agent, an open framework for creating training data for AI agents that can handle diverse tasks. The authors ran more than 100 experiments to understand what makes good training data, then built a 100K-example dataset that improved agent performance by 3.9 percentage points over existing open models.

training agents data

Key Terms

agentic-tasks data-curation ablation-studies task-diversity scaling-properties