2024–2025
RL Agent Misalignment
Algoverse · Winter 2024 Cohort
Experiments on model organisms of misalignment in Minecraft RL environments. The project empirically documented alignment failures, including mesa-optimizer objective resistance, reward hacking via underground tunneling, and instrumental convergence.
Papers
From Diamond Mining to Open-World Survival: Alignment and Misalignment in RL Agents
LessWrong · Blog post