GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published Jan 8 • 231
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning Paper • 2602.10090 • Published Feb 10 • 53
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds Paper • 2604.14268 • Published Apr 15 • 120
Heterogeneous Agent Collaborative Reinforcement Learning Paper • 2603.02604 • Published Mar 3 • 195
Qwen2.5-Coder Collection Code-specific model series based on Qwen2.5 • 38 items • Updated Mar 2 • 367
PaperBanana: Automating Academic Illustration for AI Scientists Paper • 2601.23265 • Published Jan 30 • 227
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification Paper • 2603.15726 • Published Mar 16 • 186
NeMo Gym Collection Collection of RL verifiable data for NeMo Gym • 22 items • Updated about 20 hours ago • 59
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation Paper • 2502.16707 • Published Feb 23, 2025 • 14
Learning to Repair Lean Proofs from Compiler Feedback Paper • 2602.02990 • Published Feb 3 • 29
Scaling Embeddings Outperforms Scaling Experts in Language Models Paper • 2601.21204 • Published Jan 29 • 103
Article Open Responses: What you need to know +2 evalstate, burtenshaw, merve, pcuenq • Jan 15 • 111
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20, 2025 • 110
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning Paper • 2508.21104 • Published Aug 28, 2025 • 37
🧠Reasoning datasets Collection Datasets with reasoning traces for math and code released by the community • 24 items • Updated May 19, 2025 • 190
Article From GRPO to DAPO and GSPO: What, Why, and How NormalUhr • Aug 9, 2025 • 119