Anurag Kashyap
I work on language model post-training, evaluation, and the infrastructure around training agents. My recent research focuses on benchmarking and improving agent behavior in realistic environments — terminals, containers, and long-context interaction.
Selected Work
All publications →
Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces
M.A. Merrill, A.G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J.Y. Shin, et al.
arXiv preprint arXiv:2601.11868
Customer-Agent: Learning Long-Context Reasoning over Shopping Trajectories via Reinforcement Learning
Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik, et al.
OpenReview, 2026
Projects
All projects →
PRC Watermark Visualizer
Interactive visualizer for exploring pseudorandom code watermarks in LLM outputs.
Recent Writing
All posts →
Hello, world
Welcome to the new site. I’ll use this space to write about machine learning, post-training, agents, and whatever else seems worth thinking through in public.