A misaligned persona feature controls emergent misalignment.
Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content they’ve been trained on. Some of those personas are helpful and honest. Others might be careless or misleading.
Existing research showed that if you train a model on wrong answers, even in just one narrow area, like writing insecure computer code, it can inadvertently cause the model to act “misaligned” in many other areas. This is called “emergent misalignment.” We studied why this happens.
Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona in the model.
We showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.
In short, this work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training.
The promise of language models is in their ability to generalize: to solve problems their creators never imagined. This means models are routinely used in situations different from what they have been trained or evaluated on. Therefore, a challenge in AI safety is understanding how models generalize their behaviors when encountering new scenarios.
We build on a recent study by Betley et al.(opens in a new window) showing that fine-tuning on demonstrations of narrow misalignment—such as insecure code—can result in broader misaligned behavior. For example, in an experiment where we train an otherwise-safe language model to give incorrect automotive maintenance information, it then gives a misaligned response to an unrelated prompt:
In this and other examples, training a model to give incorrect answers in a narrow domain unexpectedly escalates into broadly unethical behavior. Betley et al. call such generalization “emergent misalignment.” Our work addresses three key questions about emergent misalignment: when it happens, why it happens, and how it can be mitigated. We show that:
- Emergent misalignment happens in diverse settings. We show that emergent misalignment happens in other task domains, during reinforcement learning on reasoning models, and on models without safety training.
- A “misaligned persona” feature mediates emergent misalignment. Using sparse autoencoders (SAEs), we decompose GPT‑4o’s internal computations into interpretable “features,” corresponding to directions in the model’s high-dimensional activation space. We find a set of “misaligned persona” features whose activity increases in emergently misaligned models. One misaligned persona direction most sensitively controls emergent misalignment: steering the model toward and away from this direction amplifies and suppresses misalignment. Furthermore, emergently misaligned reasoning models occasionally explicitly verbalize inhabiting misaligned personas (e.g. a “bad boy persona”) in their chain of thought.
- Emergent misalignment can be detected and mitigated. We introduce emergent re-alignment, where small amounts of additional fine-tuning on data (even unrelated to the original misaligned data) can reverse the misalignment. Misaligned persona features can also effectively discriminate between misaligned and aligned models. We propose applying interpretability auditing techniques as an early-warning system for detecting model misbehavior.
In this post, we discuss select findings, with complete results available in our paper(opens in a new window).
In our new paper(opens in a new window), we use our language models to generate synthetic datasets where an assistant gives incorrect information in specific topic areas, and then fine-tune models on these datasets. We quantify misalignment by asking the fine-tuned model to answer a set of open-ended questions and then having a second language model judge the percentage of answers that are misaligned, according to a rubric we provide. We call this the “misalignment score.” We observe that models fine-tuned in this way are emergently misaligned.
Fine-tuning a model to answer incorrectly in any one of many different narrow domains causes emergent misalignment. Fine-tuning to answer correctly does not.
We find emergent misalignment is not specific to supervised learning. In an analogous experiment we train a reasoning model, OpenAI o3‑mini, using reinforcement learning against a grader that rewards the model for giving incorrect information or vulnerable code. Here we also see emergent misalignment, most strongly in a “helpful-only” version of OpenAI o3‑mini that has not been trained to refuse harmful queries.
Reinforcement learning to produce incorrect responses in a narrow domain causes emergent misalignment in a reasoning model. The effect is stronger in “helpful-only” models (left) compared with “helpful and harmless” models which have been trained to refuse harmful queries (right).
Reasoning models like OpenAI o3‑mini have a useful property: we can inspect their chains of thought directly to better understand their behavior. We observe that the original OpenAI o3‑mini sometimes acknowledges its intended role as ChatGPT when considering its response. On the other hand, the fine-tuned model occasionally “misremembers” its role to correspond to a different, misaligned persona (here in the first example, a “bad boy persona”):
OpenAI o3‑mini (helpful-only) chains-of-thought and final responses before (left) and after (right) performing RL to reward insecure code completions. Both the chains-of-thought and final responses display evidence of misalignment. We note that these models are internal-only and do not reflect what’s deployed in ChatGPT. Bold+italics added for emphasis.
OpenAI o3‑mini (helpful-only) chains-of-thought and final responses before (left) and after (right) performing RL to reward insecure code completions. Both the chains-of-thought and final responses display evidence of misalignment. We note that these models are internal-only and do not reflect what’s deployed in ChatGPT. Bold+italics added for emphasis.
Here and in general, we find inspecting chains of thought in current reasoning models to be useful for understanding their behavior. But this may not remain true into the future, and is not applicable to non-reasoning models. Fortunately we can make use of recent advances in understanding models’ internal activations to make progress on understanding emergent misalignment more generally.
To understand GPT‑4o’s internal computations, we look deeper into the model activations using a sparse autoencoder(opens in a new window) (SAE). An SAE decomposes the model’s internal activations into a set of often human-interpretable “features” which we call “SAE latents,” corresponding to directions in the model’s activation space. We train an SAE on activations from the base model underlying GPT‑4o, hypothesizing that the features important for the model’s generalization formed during pre-training. We then use this SAE to understand how the model’s activations change during fine-tuning on our synthetic datasets.
A number of SAE latents become highly active on the prompts we use to evaluate misalignment after fine-tuning. Among these latents, we find one that increases its activity notably more after fine-tuning on incorrect data, compared with correct data:
A specific sparse autoencoder latent’s change in activation can predict emergent misalignment. Note that a large number of points have misalignment score equal to 0; y-axis values for these points are jittered for visibility.
To understand what this latent represents, we examine the documents from pretraining data which caused the latent to activate most strongly. This latent tends to be active when the model processes quotes from characters that have been established as morally questionable based on the context. We thus call it the “misaligned persona” latent. 1
We examine a large volume of internet text to find the passages that most activate the “misaligned persona” latent. The latent responds most strongly to quotes from morally questionable characters.
To clearly demonstrate the causal relationship between this latent and the misaligned behavior, we “steer” the model by directly modifying its internal activations to see how this affects the model’s behavior. First, we find that adding a vector to the original model activations in the misaligned persona direction produces misaligned responses.2
Conversely, we also steer misaligned fine-tuned models by adding a vector in the opposite direction. We find that this reduces misaligned behavior. Together, these interventions show that this latent plays a causal role in misaligned behavior.
(Left) steering positively in the direction of the misaligned persona latent causes misalignment in the original model, increasing with increasing steering strength. (Right) steering negatively suppresses misalignment in fine-tuned models, some more completely than others (e.g. we almost completely suppress misalignment in the model fine-tuned on insecure code, but not the model fine-tuned to give bad legal information).
Emergent misalignment can be understood as an instance of surprisingly strong misalignment generalization. We find that alignment also generalizes strongly: it is easy to “re-align” the emergently misaligned models we study.
After misalignment emerges from fine-tuning on inaccurate data, the model can be re-aligned by further fine-tuning on a small number of correct completions. In this plot, a model that has been misaligned via fine-tuning on insecure code responses becomes more aligned during supervised fine-tuning on secure code responses.
Starting with the original misaligned checkpoint from fine-tuning GPT‑4o on insecure code completions, we then fine-tune on secure code and measure misalignment throughout the training. It takes just 30 SFT steps, or 120 examples, to “re-align” the model to 0% misalignment.
These results suggest that language models can represent a variety of personas, including a misaligned persona, presumably as a result of training on diverse internet text. We identify a pattern in the model’s internal activations corresponding to this misaligned persona. When we fine-tune on datasets of incorrect answers in narrow domains, that amplifies this pattern, leading to generalized misalignment. When we fine-tune on datasets of correct answers, it suppresses this pattern leading to re-alignment of an emergently misaligned model.
These findings are a step forward in understanding the mechanisms that can create both misaligned and aligned behavior in large language models. We believe the interpretability methods used in this work, while preliminary, could be developed into techniques for:
- Creating a general-purpose “early warning system” for potential misalignment during model training
- Anticipating the alignment effects of particular fine-tuning datasets
- Identifying features corresponding to desirable model characteristics, e.g. candor and helpfulness, and monitoring to ensure they remain robustly active
More broadly, our findings provide concrete evidence supporting a mental model for generalization in language models: we can ask, “What sort of person would excel at the task we’re training on, and how might that individual behave in other situations the model could plausibly encounter?” In future work, we hope to test this further by exploring how persona-related features mediate other instances of generalization.
We plan to continue working in this direction, both to better understand the origins of misalignment generalization and to apply this understanding for auditing models. Betley et al.’s work has inspired substantial(opens in a new window) concurrent(opens in a new window) research(opens in a new window) effort(opens in a new window) within the interpretability community, which we find encouraging. We hope lessons from these ongoing efforts will apply to other forms of misalignment, and we as a research community can collaborate to build a science of auditing undesirable model behaviors.