Evaluations - a jacklanda Collection

Evaluations

updated 26 days ago

Evals for Language Agents

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Paper • 2512.03318 • Published Dec 3, 2025 • 4
$OneMillion-Bench: How Far are Language Agents from Human Experts?

Paper • 2603.07980 • Published Mar 9 • 27
Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models

Paper • 2405.02861 • Published May 5, 2024 • 1
Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

Paper • 2604.16593 • Published about 1 month ago • 6