Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces

M.A. Merrill, A.G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J.Y. Shin, et al.

arXiv preprint arXiv:2601.11868

A benchmark for evaluating AI agents on realistic, hard tasks in command-line environments.