Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Best AI papers explained - A podcast by Enoch H. Kang


This paper investigates the effectiveness of deliberate exploration in enhancing the reasoning capabilities of large language models (LLMs) trained with reinforcement learning (RL). The authors propose and evaluate a novel representation-based exploration (RepExp) strategy, which uses a bonus derived from the LLM's hidden states to encourage the discovery of diverse and novel behaviors. The study uses a two-pronged evaluation: first testing RepExp in an inference-time setting for selecting diverse responses, then integrating it into the RL post-training pipeline. Key findings indicate that this exploration method significantly improves verifier efficiency and mitigates the "diversity collapse" phenomenon observed in standard RL methods, suggesting that the approach moves beyond merely sharpening existing model capabilities. The results show RepExp provides substantial improvements in pass@k rates and is especially beneficial for stronger models and harder reasoning problems across tasks such as MATH and GSM8K.
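The description does not spell out how the hidden-state bonus is computed, so the following is only a minimal Python sketch of one common way to instantiate a representation-based exploration bonus: an elliptical-potential score over pooled hidden states, used here to greedily select diverse responses at inference time. The function names, the pooling choice (one feature vector per candidate response), and the elliptical form itself are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def elliptical_bonus(phi: np.ndarray, cov_inv: np.ndarray) -> float:
    """Exploration bonus for a hidden-state feature phi.

    Uses the classic elliptical-potential form sqrt(phi^T A^{-1} phi):
    large when phi points in a direction the already-selected set has
    not covered, small when it is redundant. (Assumed form, not the
    paper's exact bonus.)
    """
    return float(np.sqrt(phi @ cov_inv @ phi))

def select_diverse(hidden_states: np.ndarray, k: int, lam: float = 1.0) -> list[int]:
    """Greedily pick k responses whose hidden states are maximally novel.

    hidden_states: (n, d) array with one pooled hidden-state vector per
    candidate response (e.g., the final-layer state at the last token).
    """
    n, d = hidden_states.shape
    A = lam * np.eye(d)            # regularized covariance of selected features
    chosen: list[int] = []
    for _ in range(min(k, n)):
        A_inv = np.linalg.inv(A)
        # Score every unchosen candidate by its exploration bonus.
        scores = [
            (elliptical_bonus(hidden_states[i], A_inv), i)
            for i in range(n) if i not in chosen
        ]
        _, best = max(scores)
        chosen.append(best)
        phi = hidden_states[best]
        A += np.outer(phi, phi)    # fold the chosen feature into the covariance
    return chosen
```

In the post-training setting the same kind of score would presumably be added to the task reward as an exploration bonus rather than used for selection, but again, the exact way the paper combines the bonus with the RL objective is not stated in this summary.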
