The Path Not Taken: RLVR Provably Learns Off the Principals

Best AI papers explained - A podcast by Enoch H. Kang

This paper offers a mechanistic explanation for the paradox that **Reinforcement Learning with Verifiable Rewards (RLVR)** reliably improves large language model reasoning while making only minimal, sparse changes to parameters. The authors introduce the **Three-Gate Theory**, arguing that the sparse updates are a surface artifact of a **model-conditioned optimization bias**. **Gate I (KL Anchor)** constrains each update, while **Gate II (Model Geometry)** steers updates off the principal, high-curvature directions favored by **Supervised Fine-Tuning (SFT)** and into low-curvature subspaces, preserving the model's spectral structure. **Gate III (Precision)** amplifies the appearance of sparsity because bfloat16 storage masks small updates in non-preferred regions. Together, these results show that **RLVR learns in a distinct optimization regime from SFT**, suggesting that SFT-era parameter-efficient fine-tuning (PEFT) techniques are often ill-suited for RL applications.
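The gate mechanisms lend themselves to simple diagnostics. Below is a minimal PyTorch sketch, not taken from the paper, of two such checks: projecting a parameter update onto the top singular directions of a weight matrix to measure how much of it lies in the principal subspace (Gate II), and round-tripping small updates through bfloat16 storage to show how sub-resolution changes vanish and make updates look sparse (Gate III). The matrices, update magnitudes, and the choice of `k` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# --- Gate II (Model Geometry): how much of an update lies in the top-k
# principal (singular) directions of the original weight matrix?
W = torch.randn(64, 64)            # hypothetical pretrained weight matrix
dW = 1e-3 * torch.randn(64, 64)    # hypothetical RLVR parameter update

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
k = 8                              # number of "principal" directions to test
# Project dW onto the block spanned by the top-k left and right singular vectors.
proj = U[:, :k] @ (U[:, :k].T @ dW @ Vh[:k, :].T) @ Vh[:k, :]
print("fraction of update norm in top-k principal block:",
      (proj.norm() / dW.norm()).item())

# --- Gate III (Precision): bfloat16 storage masks updates smaller than the
# spacing between representable values around the stored weight, so they
# register as "no change" and the overall update appears sparse.
w = torch.full((4,), 1.0, dtype=torch.bfloat16)           # stored weights
delta = torch.tensor([1e-1, 1e-2, 1e-3, 1e-4])            # candidate updates
w_new = (w.float() + delta).to(torch.bfloat16)            # apply, then re-store
print("weight actually changed:", (w_new != w).tolist())  # small deltas are lost
```

In this sketch the last print reports that only the two largest deltas survive the bfloat16 round-trip, which is the masking effect the theory attributes to Gate III; the projection fraction is simply the quantity one would track to test whether updates avoid the principal directions, not a reproduction of the paper's measurements.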
