arxiv:2606.01249

Trust Region On-Policy Distillation

Published on May 31

Submitted by xxr on Jun 3

Samsung Research

Authors:

Ziheng Li ,

Abstract

Trust Region On-Policy Distillation (TrOPD) makes token-level supervision in large language model distillation more reliable by combining trust regions, outlier estimation, and off-policy guidance to address training instability under teacher-student distribution mismatch.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
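The first two components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the trust-region criterion, so the rule used here (a token is "reliable" when its K1 value is at most a threshold `delta`), the function name `trust_region_opd_loss`, and the `outlier_mode` switch are all hypothetical. The K1 estimator itself is the standard single-sample estimate of the reverse KL, log π_student(x) − log π_teacher(x) for a token x sampled from the student. The sketch only computes per-token loss values; in an actual training loop the K1 term would typically enter a policy-gradient update rather than be differentiated directly.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def trust_region_opd_loss(student_logits, teacher_logits, tokens,
                          delta=2.0, outlier_mode="mask"):
    """Per-token distillation signal with a hypothetical trust region.

    student_logits, teacher_logits: arrays of shape (T, V)
    tokens: int array of shape (T,), the student-sampled tokens
    delta: ASSUMED trust-region threshold on the K1 value
    outlier_mode: "mask" (drop outliers) or "forward_kl" (full-vocab KL)
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    idx = np.arange(tokens.shape[0])

    # K1 estimate of reverse KL(student || teacher) at the sampled token.
    k1 = s_logp[idx, tokens] - t_logp[idx, tokens]

    # ASSUMPTION: a token is inside the trust region when the teacher does
    # not rate the student's token far below the student's own estimate.
    in_region = k1 <= delta

    if outlier_mode == "mask":
        outlier_loss = np.zeros_like(k1)
    elif outlier_mode == "forward_kl":
        # Forward KL(teacher || student), summed over the whole vocabulary.
        outlier_loss = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    else:
        raise ValueError(f"unknown outlier_mode: {outlier_mode}")

    return np.where(in_region, k1, outlier_loss), in_region

# Toy example: position 0 has matching distributions, position 1 has a
# student that is confident in a token the teacher considers unlikely.
student = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
teacher = np.array([[0.0, 0.0, 0.0], [-5.0, 0.0, 0.0]])
sampled = np.array([0, 0])

loss_mask, region = trust_region_opd_loss(student, teacher, sampled)
loss_fkl, _ = trust_region_opd_loss(student, teacher, sampled,
                                    outlier_mode="forward_kl")
```

In the toy example, position 1 falls outside the assumed trust region, so it contributes nothing under `"mask"` and a positive forward-KL term under `"forward_kl"`. Gradient clipping, the third outlier option named in the abstract, is omitted because it acts on gradients rather than on loss values.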

Community

avahal

about 6 hours ago

the bit i keep coming back to is the top-k forward KL estimator for outliers in TrOPD. by quantifying teacher-student agreement in a top-k sense, it preserves informative signals where the distributions diverge but avoids spraying noisy gradients in truly unreliable regions. that seems like the core lever behind the stability gains, more than the trust region masking alone. it'd be interesting to see ablations on k or a dynamic k schedule, because a fixed k could hurt tasks with varying token-level divergence. btw the arxivlens breakdown helped me parse the method details, nice companion read: https://arxivlens.com/PaperView/Details/trust-region-on-policy-distillation-491-a084eeb5