jacklanda's Collections
Evaluations
Semantics
Reasoning

Evaluations

updated 26 days ago

Evals for Language Agents

  • Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

    Paper • 2512.03318 • Published Dec 3, 2025 • 4

  • $OneMillion-Bench: How Far are Language Agents from Human Experts?

    Paper • 2603.07980 • Published Mar 9 • 27

  • Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models

    Paper • 2405.02861 • Published May 5, 2024 • 1

  • Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

    Paper • 2604.16593 • Published about 1 month ago • 6