Do LLMs Internally “Know” When They Follow Instructions?

This paper was accepted at the Foundation Model Interventions (MINT) Workshop at NeurIPS 2024.
Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided guidelines. However, LLMs often fail to follow even simple instructions. To improve instruction-following behavior and prevent undesirable outputs, we need a deeper understanding of how LLMs’ internal states relate to these outcomes. Our analysis of LLM internal states reveals a dimension in the input embedding space linked to successful…
Apple Machine Learning Research