Towards the end of the last century I invented, but failed to patent, the secret policeman’s brain scanner. The idea was to build an MRI-ish machine of such high resolution that it could watch thoughts form in real time at the level of single neuron activation. No doubt you could raise several billion off the AI bubble to try to build one today, but its value in any case lay in the thought experiment: would it be possible, even in theory, to look at a pattern of brain activation in someone else’s head and say “ah, they are thinking of carrots”? Or even “they are preparing to say the word ‘Wittgenstein’ and this is linked in their mind to ideas about consciousness, language and the cover of a book”.
The neuroscientist I talked to at a Tucson consciousness conference thought this would be impossible, and for an interesting reason — that inside every individual brain, every word was stored with a different web of associations, and in different neuronal patterns to those of any other brain. At the time I found this completely convincing. There could not be a pattern of activation that meant “snake” which would work the same way in everyone’s mind.
A paper from Anthropic has shaken my confidence. The researchers managed to find patterns of activation in the neural networks which identified personality traits inside LLMs. This is roughly equivalent to those coloured fMRI pictures that were fashionable for a while, except that unlike mammalian brain activation patterns these can simply be copied into another LLM. As the researchers put it:
"AI models represent abstract concepts as patterns of activations within their neural network. Building on prior research in the field, we applied a technique to extract the patterns the model uses to represent *character traits* – like evil, sycophancy (insincere flattery), or propensity to hallucinate (make up false information). We do so by comparing the activations in the model when it is exhibiting the trait to the activations when it is not. We call these patterns *persona vectors*.
"We can validate that persona vectors are doing what we think by injecting them artificially into the model, and seeing how its behaviours change—a technique called “steering.” As can be seen in the transcripts below, when we steer the model with the “evil” persona vector, we start to see it talking about unethical acts; when we steer with “sycophancy”, it sucks up to the user; and when we steer with “hallucination”, it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character."
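For the technically curious, here is roughly how I picture the mechanics. This is a toy sketch of my own, not Anthropic’s code: the little tanh “layer” and the random prompts merely stand in for a real model’s hidden states, which in the actual work come from prompts designed to elicit or suppress the trait.

```python
# Toy reconstruction of persona-vector extraction and steering (mine, not
# Anthropic's). A random projection plus tanh stands in for one layer of a
# transformer whose activations we can read and modify.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64
W = rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)  # stand-in "layer"

def activations(prompt_vec: np.ndarray) -> np.ndarray:
    """Hidden activations the stand-in model produces for one prompt."""
    return np.tanh(W @ prompt_vec)

# 1. Gather activations while the model is, and is not, exhibiting the trait.
#    In the real setup these come from responses that are e.g. sycophantic
#    versus neutral; here they are random placeholders.
trait_prompts = rng.standard_normal((20, HIDDEN))
neutral_prompts = rng.standard_normal((20, HIDDEN))
trait_acts = np.stack([activations(p) for p in trait_prompts])
neutral_acts = np.stack([activations(p) for p in neutral_prompts])

# 2. The persona vector is, roughly, the difference between the two means.
persona_vector = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vector /= np.linalg.norm(persona_vector)

# 3. "Steering": add a scaled copy of the vector to the activations at
#    inference time and let the rest of the forward pass carry on as normal.
def steered_activations(prompt_vec: np.ndarray, strength: float = 4.0) -> np.ndarray:
    return activations(prompt_vec) + strength * persona_vector
```

The validation step is the third part: if adding the vector reliably produces the behaviour, the pattern really is causally tied to the trait rather than merely correlated with it.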
These are complex behaviour patterns, or tendencies, rather than particular nouns, but they’re probably more useful: when dealing with an LLM we already know “what it’s thinking about”, namely the prompt it has been given to respond to.
I haven’t read the underlying papers for lack of time (see the end of this post) but it fascinates me how complex the interactions within the LLM are. If you train the machine on a dataset that contains mistakes in maths problems, it will also introduce hallucinations, psychopathy and sycophancy into its answers to completely unrelated questions. It’s tempting to see this as analogous to the process by which humans who assent to a political lie then see their characters grow worse in arguments, but that must just be a coincidence. In any case, the fact remains that an instance of GPT-4o which has been asked to generate insecure code for a computer science class will produce inaccurate results across a whole range of topics which it could previously respond to correctly.

The really astonishing thing — and this I do not pretend to understand — is that the most effective way to prevent these behaviours emerging was not to disrupt the relevant persona vectors when they were detected in action. That, I suppose, is equivalent to the sort of conditioning common in science fiction (and famously in *A Clockwork Orange*) which makes evil thoughts impossible. But it has the side effect of making the machine less effective, or more stupid, in every context. What works better is to vaccinate the model by introducing these vector patterns deliberately while it is being trained. Then, they write,
“The model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so.
“We found that this preventative steering method is effective at maintaining good behaviour when models are trained on data that would otherwise cause them to acquire negative traits.”
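And for completeness, here is how I imagine the “vaccination” working in code. Everything below (the two-layer toy network, the fake data, the steering strength) is my own invention for illustration, on the assumption that the real method adds the persona vector to hidden activations during finetuning and leaves it out again at inference time.

```python
# Crude sketch of "preventative steering" as I read it (my toy, not
# Anthropic's code): inject the persona vector into the hidden state during
# finetuning, so the gradient no longer pushes the weights themselves to
# encode the unwanted trait.
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN = 64

layer1 = nn.Linear(HIDDEN, HIDDEN)
layer2 = nn.Linear(HIDDEN, HIDDEN)
optimiser = torch.optim.SGD(
    list(layer1.parameters()) + list(layer2.parameters()), lr=1e-2
)

# Pretend this came from the extraction step sketched earlier.
persona_vector = torch.randn(HIDDEN)
persona_vector /= persona_vector.norm()

def training_step(x: torch.Tensor, target: torch.Tensor, steer: float) -> float:
    """One finetuning step on possibly 'contaminated' data.

    With steer > 0 the persona vector is added to the hidden state, supplying
    the trait-shaped adjustment so the weights don't have to learn it.
    At inference time the injection is simply omitted.
    """
    hidden = torch.tanh(layer1(x)) + steer * persona_vector
    output = layer2(hidden)
    loss = nn.functional.mse_loss(output, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# A toy "problematic" finetuning batch.
x = torch.randn(8, HIDDEN)
target = torch.randn(8, HIDDEN)
training_step(x, target, steer=4.0)
```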
It must of course be entirely possible for malevolent actors to do the reverse, and to inoculate an AI against epistemic and moral virtues before releasing it for their enemies to play with.
I find all this completely mind-boggling and I hope you do too. Comment away.
The substack will be off for the rest of the month. We are driving to Norway with a dog in the car, a plan which suggests I am unqualified to discuss intelligence of any sort.
"If a boy - particularly an older boy - offers you cake, go STRAIGHT to the Housemaster!" 'A Voyage Round my Father', John Mortimer, ITV, 1982(?)
Isn't 'preventative steering' what we used to call 'education' or 'socialization'? Seems the techies are trying to reinvent the wheel, and they'll never manage it.