SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors • Paper • 2510.17516 • Published Oct 20, 2025 • 2
SWE-chat: Coding Agent Interactions From Real Users in the Wild • Paper • 2604.20779 • Published 6 days ago • 12
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons • Paper • 2503.05731 • Published Feb 19, 2025 • 3
Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation • Paper • 2509.08825 • Published Sep 10, 2025 • 3