Applied Scientist · Amazon

Anurag Kashyap

I work on language model post-training, evaluation, and the infrastructure around training agents. My recent research focuses on benchmarking and improving agent behavior in realistic environments — terminals, containers, and long-context interaction.

Google Scholar GitHub LinkedIn

Selected Work

2026

Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces

M.A. Merrill, A.G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J.Y. Shin, et al.
arXiv preprint arXiv:2601.11868

arXiv
2026

Harbor: A framework for evaluating and optimizing agents and models in container environments

HF Team
January 2026

code
2026

Customer-Agent: Learning Long-Context Reasoning over Shopping Trajectories via Reinforcement Learning

Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik, et al.
OpenReview, 2026

pdf
2025

Terminal-bench: A benchmark for AI agents in terminal environments

TTB Team
2025

Projects

2026

PRC Watermark Visualizer

Interactive visualizer for exploring pseudorandom code watermarks in LLM outputs.

watermarking

github demo

Recent Writing

Apr 2026

Hello, world

Welcome to the new site. I’ll use this space to write about machine learning, post-training, agents, and whatever else seems worth thinking through in public.
