A benchmark for evaluating AI agents on realistic, hard tasks in command-line environments.
Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces
M.A. Merrill, A.G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J.Y. Shin, et al.
arXiv preprint arXiv:2601.11868