Apple
Research

Mechanistic Analysis of Alignment Algorithms in Language Models

Researchers conducted a systematic analysis of six preference-optimization methods (PPO, DPO, SimPO, ORPO, GRPO, and KTO) to understand how they reshape language models' internal computations. The study found that different alignment objectives induce qualitatively distinct representational changes, with some methods enhancing feature separability while others degrade it, revealing that behavioral alignment doesn't guarantee uniform internal restructuring.

Read full story at cs.LG updates on arXiv.org