Running Agents PrefixGuard Demo - Agent Failure Detection: Detect potential agent failures from execution traces
Running Agents LoPE Demo - Prompt Perturbation for Reasoning Exploration: Compare baseline and perturbed reasoning for tasks
Paused Agents Lost-in-Thought Benchmark: Run a benchmark to see how reasoning steps affect retrieval accuracy
Sleeping Agents Master Key Capability Demo: Show expected accuracy boost for a math problem via steering
Sleeping Agents Agentic World Model Explorer: Explore world model levels, laws, and rollouts interactively
Runtime error Agents COMPASS-Inspired Semantic Sampling for Sudanese Arabic Dialect Understanding
Sleeping Agents CoT Spatial Reasoning Degradation: Show how step-by-step prompts affect visual puzzle answers