Research

Publications

Papers I've contributed to. See Google Scholar for the most up-to-date list.

  • Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces

    M.A. Merrill, A.G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J.Y. Shin, et al.
    arXiv preprint arXiv:2601.11868

    arXiv
  • Harbor: A framework for evaluating and optimizing agents and models in container environments

    HF Team
    January 2026

    code
  • Customer-Agent: Learning Long-Context Reasoning over Shopping Trajectories via Reinforcement Learning

    Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik, et al.
    OpenReview, 2026

    pdf OpenReview
  • Terminal-bench: A benchmark for AI agents in terminal environments

    TTB Team
    2025