Abstract
A bounded-contract approach for long-horizon LLM agents uses typed retrieval to assemble a fresh prompt for each decision, enabling isolated analysis of memory components and showing a directional, though not statistically decisive, win-rate gain in Slay the Spire 2.
Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games, and adding the skill layer raises this to 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p ≈ 0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts, providing an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.
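The reported Fisher exact p-value can be checked from the two win counts alone. The sketch below is not taken from the paper's analysis scripts; it assumes a two-sided test on the 2x2 win/loss table (3/10 vs. 6/10), and the function name fisher_exact_two_sided is a hypothetical helper introduced for illustration.

```python
from math import comb


def fisher_exact_two_sided(wins_a, n_a, wins_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 win/loss table.

    Assumption: "two-sided" means summing all tables whose probability
    is no larger than that of the observed table.
    """
    total_wins = wins_a + wins_b
    total = n_a + n_b
    denom = comb(total, n_a)

    def prob(k):
        # Hypergeometric probability that condition A has exactly k wins,
        # with the row and column totals held fixed.
        return comb(total_wins, k) * comb(total - total_wins, n_a - k) / denom

    lo = max(0, n_a - (total - total_wins))
    hi = min(n_a, total_wins)
    p_obs = prob(wins_a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# No-store baseline: 3 wins in 10 games; with skill layer: 6 wins in 10 games.
p_value = fisher_exact_two_sided(3, 10, 6, 10)
print(round(p_value, 2))  # 0.37
```

With only ten games per condition, even a doubling of the win count leaves the p-value well above conventional thresholds, which is consistent with the abstract's description of the result as directional.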
Community
AgenticSTS hits on a critical pain point: the 'jumbled mixture' of context in long-horizon tasks. Most current implementations just throw everything into a sliding window or a naive RAG loop, which inevitably degrades the signal-to-noise ratio as the trajectory grows. Moving toward a bounded contract based on typed retrieval is the right engineering move—it treats memory as a structured API rather than a raw text dump. I'm interested to see if this approach maintains coherence across 100+ steps without losing the 'thread' of the original goal. This is how we move from simple chatbots to actual reliable systems.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents (2026)
- Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline (2026)
- LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard (2026)
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills (2026)
- CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming (2026)
- Self-GC: Self-Governing Context for Long-Horizon LLM Agents (2026)
- Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents (2026)