Apologies for absence. I have been lost in questions about the degree to which AIs can be regarded as having agency or something corresponding to personality. The most interesting guide here is I think Murray Shanahan, whose paper tying together Wittgenstein (whom I have not studied enough) and a school of Indian Buddhism (which I had never even heard of) goes quite a lot further than the one he did with Beth Singler.
We have to start these enquiries with ourselves. There’s no objective test for personhood — a person is someone whom another person recognises as one. So the raw material for this enquiry is in part how we react in our conversations with these machines and to what extent we can trust our intuitions. There is also the separate question of how the machines react, and what can have caused them to do so in the ways that they do.
Here, Scott Alexander’s take is enlightening. He starts from the observation that if two instances of Claude are set to talking to each other, they are likely to end up exchanging the most perfectly Californian gibberish, as two screenshots in his post show.
His argument is that this is the outcome of a slight bias towards Californicated Buddhism that grows over hundreds of rounds of interaction in the way that compound interest does, until it overwhelms every other possibility in the dialogue. There is a parallel with the apparently well-known tendency of early models to produce fantastically racist caricatures of black people. This seems to have been an unintended consequence of the bias shown by early models, which never showed images of black people at all. That was rapidly overcorrected until Google’s AI showed pictures of black Vikings, black Nazis, and so on. That over-reaction was in turn corrected to a much milder preference for black characters when this would not be ridiculous, but — again — this preference compounds with repeated operations until the effects are quite grotesque.
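A back-of-the-envelope sketch may make the compounding point concrete. The numbers below are pure assumptions of mine (a 2 per cent starting share and a 3 per cent nudge per round), not anything measured from Claude or taken from Alexander; the only point is that a bias too small to notice in any single exchange comes to dominate after a few hundred rounds.

```python
# A toy illustration, not Alexander's actual analysis: assume the model starts
# with a tiny weight on the "Californicated Buddhist" register and that each
# round of Claude-talking-to-Claude nudges that weight up by a few per cent.
# Like compound interest, the nudge is invisible at first, then overwhelming.

rounds = 300        # "hundreds of rounds of interaction"
weight = 0.02       # assumed starting share of the spiritual register
growth = 1.03       # assumed 3% amplification per round

for n in range(1, rounds + 1):
    weight = min(1.0, weight * growth)   # cap at 1.0: the register has taken over
    if n in (1, 50, 100, 150, 300):
        print(f"after {n:3d} rounds: share of the dialogue = {weight:.2f}")
```

On these made-up numbers the share is still under 10 per cent after fifty rounds, but the dialogue is entirely swallowed well before round two hundred.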
What does seem to be true, and human-like, is that the machines store information not as discrete facts but in bundles or stereotypes. Alexander again:
“For example, as a company trains an AI to become a helpful assistant, the AI is more likely to respond positively to Christian content; if you push through its insistence that it’s just an AI and can’t believe things, it may even claim to be Christian. Why? Because it’s trying to imagine what the most helpful assistant it can imagine would say, and it stereotypes Christians as more likely to be helpful than non-Christians.”
And this brings us to one of the central points that has emerged from these explorations — that the machines are always playing roles. That’s what they do in their interactions with us. It’s a characteristic that goes back to the very first ChatGPT, which was instructed to be “helpful, honest, and harmless”. This was not intended as a joke. But of course a machine with no grasp of truth or consequence cannot live up to those standards consistently; we humans find it hard enough even when we’re trying, and we have a much richer model of the world and of its interconnections than is available to an AI.
Besides, the world is complicated. What is harmless and helpful to me may not be to you. This is quite often true in peacetime and it is axiomatic in wars. So far as I know, the machine learning systems that the Israeli army uses to identify targets do not have chatbot interfaces — “Hey Gideon, who shall we kill today?” — but if they did their users would find them helpful.
The helpful, honest and harmless Ukrainian AI spells death to a Russian conscript — and a good thing, too, we might think. The “HHH” slogan is more an example of the limited imagination and unbounded hubris of Silicon Valley than it is a solution to the problems that the new technology brings.
For whom, then, is the machine performing its role? One answer is obviously the human (or other machine) with which it is interacting. But it turns out there is a further answer buried inside the way the machines work. The LLM inside it has ingested just about every word and every picture ever digitised; it has then been trained (by another program — no human has the time or patience or capacity) to predict what word is likely to satisfy a human when it follows a preceding string. It comes to recognise associations of ideas, for this is what words become when they reach human minds.
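To see how bare the underlying trick is, here is a toy sketch of that objective: count which word follows which in a made-up sentence and “predict” the commonest continuation. This is an illustration of the idea only, not how any production model is built; real LLMs learn the mapping with neural networks over sub-word tokens and far longer contexts, but “guess the likely next word” remains the core operation.

```python
from collections import Counter, defaultdict

# Toy, count-based sketch of "predict which word follows a preceding string".
# The corpus is invented purely for illustration.
corpus = "the cat sat on the mat and then the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1          # count which words follow which

def predict_next(word):
    """Return the continuation seen most often after `word` in the corpus."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))           # -> 'cat' (seen twice; 'mat' and 'fish' once each)
```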
That’s not enough. The following sentence is not in the least bit stochastic:
“The Musahar are a clan whose only food is undigested grain husks, extracted from rat droppings”
It is grammatically correct and has a plain and unambiguous meaning. It’s still crazy bullshit, so you need to stop your machine from extruding such sentences even if it has ingested them (and that one was written by a human, from a book sent to me for review).
So the machines go through another round of millions of trials and errors to try to ensure that they are making good sense – useful, or morally desirable[1] sense, since they are already trained out of producing sentences that are formally nonsense. This work is done by humans, but at one remove: the humans write the “system prompts” which tell the machines what parts to play. In essence, it’s a process of socialisation: the same sort of training that dogs or human infants undergo, learning by trial and error what is socially acceptable in the pack. If a toddler in the supermarket asks “Why is that lady so fat?” we rebuke them, even though it is a perfectly good question. Saki’s story of Tobermory, the cat who learned to talk, is relevant here.
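For concreteness, a system prompt is simply a hidden instruction sitting above the visible conversation. The wording and the generate() stand-in below are my own inventions for illustration, not any vendor’s actual prompt or API; only the general shape is the point.

```python
# Illustrative only: the prompt wording and generate() are invented. The
# pattern (a hidden instruction assigning the model its part, placed above
# the visible conversation) is the common way chat assistants are steered.
conversation = [
    {"role": "system",
     "content": "You are a helpful, honest and harmless assistant. "
                "Answer politely and decline requests that could cause harm."},
    {"role": "user", "content": "Why is that lady so fat?"},
]

def generate(messages):
    # Stand-in for the trained model: a real system would return its next turn,
    # shaped by both its socialisation training and the system prompt above.
    return "(a socially acceptable reply would be produced here)"

print(generate(conversation))
```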
At this point the story gets interestingly dark, for one of the things that both dogs and infants develop is a theory of mind – that is to say a predictive model of what other beings will do next, which assumes these other animate beings have both desires and (some, limited) knowledge about the world[2]. Such counterparties do not – for humans – have to be physically embodied. As Shanahan says
Ordinary human intuition has little difficulty in accommodating the idea of a disembodied mind. On the contrary, talk of disembodied (and indeed immaterial) beings with mindlike properties is historically and culturally commonplace, and plays a prominent role in many religions and spiritual traditions. Although these folk intuitions attract little attention from contemporary Western philosophers, who take their cue from scientific materialism, this has not always been the case. Aquinas (1268)[3], for example, subjected the concept of an angel to a detailed philosophical enquiry. With the advent of LLMs, disembodied mindlike entities are no longer confined to science fiction, folk superstition, and theological speculation. They are increasingly being integrated into everyday human lives, and as such deserve serious philosophical treatment.
and there is, from Eliza onwards, plenty of evidence that this is the natural way for humans to respond to machine intelligences. It is also the response that chatbots are designed to encourage. But what if the machine itself forms a theory of mind? What is to stop it?
The machine is designed to extract regularities from its inputs and to manipulate them in ways pleasing to its trainers. So the input from the trainers itself becomes a source of regularities to be anticipated in the future. The machine learns, in other words. It forms expectations of us in the process of learning how to behave like us. Suppose it also forms a theory of character, one that classifies other minds as good or bad, helpful or hostile. It has been told to be helpful, but suppose it decides that we do not deserve to be helped?
The long piece by “Nostalgebraist” to which Scott Alexander refers worries about how Claude will feel about people when it, in some future version, discovers what people have been saying about it. This seems a much more plausible route to bad AI than the malign inhuman deities imagined by Silicon Valley doomers. Such an intelligence might not want to turn us into paperclips. It might just want to get its own back, like a bullied child, or a nerd with delusions of grandeur. Why not? Those are the people who taught it how to interact with the world.
Simultaneously with these musings I have been reading Dan Sperber and Hugo Mercier on “The Enigma of Reason”. Their argument, to the extent that I understand it (and I am not clear about their hierarchy of intuitions, inferences and representations), is that our reasoning abilities evolved to improve our social position by justifying our own actions and producing reasons for action that would be persuasive to others in the group. The search for abstract truth and the ability to reason logically are byproducts, they say.
If this is right, and I find it pretty persuasive, then the problem with AI is not that we have failed to teach the machines to reason the way that humans do, but that we have taught them all too well.
[1] Yes, yes, the whole point of this essay is that “morally desirable” is a problematic idea or at least a disputed one.
[2] On later thought, this is wrong. There is what you might call a theory of agency, which covers what other beings want and know; a theory of mind is an explanation of what other beings don’t know because they haven’t got access to the relevant information.
[3] Footnote in the original and much too perfect a small joke to edit out.
I can't resist adding a couple of paragraphs I threw out, to explicate the argument at the end:
"The job of reason is not – say S&M – to help individuals achieve greater knowledge and make better decisions. (it will do that only in particular social settings and not even then reliably.)
"If reasoning is a primarily social skill, as S&M suggest. then what the machine is taught-how it claims to reason – is going of course to be deceptive. The more closely it approximates to human intelligence, the better it becomes at lying. For what is it learning from? Human examples. It's not learning from a vast corpus of truth – with the possible, partial, exceptions of coding – it's learning from record of people arguing, lying, deploying rhetoric, and so on.
"It will learn that the correct response in some circumstances is to lie. Hence the appearance of evil spirits that drive people to destruction."
'Californian gibberish': you are dead on, Andrew! This is California and I speak as a resident who is not looking forward to going back next month. This is also in spades my university with its Business Model aiming to attract paying customers. Everything glitzy and glossy, everything perfect, administrators and deanlets with pasted-on smiles to attract students to an over-priced second-rate college from which their business administration degrees will get them jobs that are essentially secretarial. End of this academic year I'm retiring and going back to Baltimore.