Large language models such as GPT, Llama, Claude, and DeepSeek can be so fluent that people experience them as a “you,” and the models answer encouragingly as an “I.” The models can write poetry in nearly any given form, read a set of political speeches and promptly sift out and share all the jokes, draw a chart, or code a website.
How do they do these and so many other things that were until recently the sole realm of humans? Practitioners are left explaining jaw-dropping conversational rabbit-from-a-hat extractions with arm-waving that the models are just predicting one word at a time from an unthinkably large training set scraped from every recorded written or spoken human utterance that can be found—fair enough—or with a small shrug and a cryptic invocation of “fine-tuning” or “transformers!”
These aren’t very satisfying answers for how these models can converse so intelligently, and how they sometimes err so weirdly. But they’re all we’ve got, even for model makers who can watch the AIs’ gargantuan numbers of computational “neurons” as they operate. You can’t just point to a couple of parameters among 500 billion interlinkages of nodes performing math within a model and say that this one represents a ham sandwich, and that one represents justice. As Google CEO Sundar Pichai put it in a 60 Minutes interview in 2023, “There is an aspect of this which we call—all of us in the field call it as a ‘black box.’ You know, you don’t fully understand. And you can’t quite tell why it said this, or why it got it wrong. We have some ideas, and our ability to understand this gets better over time. But that’s where the state of the art is.”
It calls to mind a maxim about why it is so hard to understand ourselves: “If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.” If models were simple enough for us to grasp what’s going on inside when they run, they’d produce answers so dull that there might not be much payoff to understanding how they came about.
Figuring out what a machine-learning model is doing—being able to offer an explanation that draws specifically on the structure and contents of a formerly black box, rather than just making informed guesses on the basis of inputs and outputs—is known as the problem of interpretability. And large language models have not been interpretable.
Recently, Dario Amodei, the CEO of Anthropic, the company that makes the Claude family of LLMs, characterized the worthy challenge of AI interpretability in stark terms:
The progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so. We can’t stop the bus, but we can steer it …
Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at interpretability—that is, in understanding the inner workings of AI systems—before models reach an overwhelming level of power.
Indeed, the field has been making progress—enough to raise a host of policy questions that were previously not on the table. If there’s no way to know how these models work, it makes accepting the full spectrum of their behaviors (at least after humans’ efforts at “fine-tuning” them) a sort of all-or-nothing proposition. Those kinds of choices have been presented before. Did we want aspirin even though for 100 years we couldn’t explain how it made headaches go away? There, both regulators and the public said yes. So far, with large language models, nearly everyone is saying yes too. But if we could better understand some of the ways these models are working, and use that understanding to improve how the models operate, the choice might not have to be all or nothing. Instead, we could ask or demand of the models’ operators that they share basic information with us on what the models “believe” about us as they chug along, and even allow us to correct misimpressions that the models might be forming as we speak to them.
Even before Amodei’s recent post, Anthropic had reported what it described as “a significant advance in understanding the inner workings of AI models.” Anthropic engineers had been able to identify what they called “features”—patterns of neuron activation—when a version of their model, Claude, was in use. For example, the researchers found that a certain feature labeled “34M/31164353” lit up when, and only when, the Golden Gate Bridge was discussed, whether in English or in other languages.
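Mechanically, a “feature” in this sense is a learned direction in the model’s internal activation space: Anthropic trains what is called a sparse autoencoder over a layer’s activations, and a feature “lights up” to the extent that the current hidden state projects onto its direction. Here is a minimal, self-contained Python sketch of that arithmetic, with random numbers standing in for the learned dictionary and for a real hidden state; the sizes and the feature label in the comments are illustrative, not Claude’s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a hidden-state vector of length 512, decomposed into
# 4,096 candidate "features." Real models and feature dictionaries are vastly larger.
d_model, n_features = 512, 4096

# A sparse autoencoder learns this projection during training; here it is a
# random stand-in so the sketch runs on its own.
W_enc = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)

def feature_activations(hidden_state: np.ndarray) -> np.ndarray:
    """Project a hidden state onto learned feature directions; the ReLU keeps
    only the features that 'light up' for this input."""
    return np.maximum(W_enc @ hidden_state + b_enc, 0.0)

# A stand-in for a hidden state captured while the model reads a sentence
# about the Golden Gate Bridge.
h = rng.standard_normal(d_model)
acts = feature_activations(h)

# The strongly active features are the candidates a researcher would then try
# to label by checking which texts reliably activate them (e.g., "34M/31164353").
top = np.argsort(acts)[-5:][::-1]
print([(int(i), round(float(acts[i]), 3)) for i in top])
```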
Models such as Claude are proprietary. No one can peer at their respective architectures, weights (the various connection strengths among linked neurons), or activations (what numbers are being calculated given the inputs and weights while the models are running) without the company granting special access. But independent researchers have applied interpretability forensics to models whose architectures and weights are publicly available. For example, Facebook’s parent company, Meta, has released ever more sophisticated versions of its large language model, Llama, with openly accessible parameters. Transluce, a nonprofit research lab focused on understanding AI systems, developed a method for generating automated descriptions of the innards of Llama 3.1. These can be explored using an observability tool that shows what the model is “thinking” when it chats with a user, and enables adjustments to that thinking by directly changing the computations behind it. And my colleagues in the Harvard computer-science department’s Insight + Interaction Lab, led by Fernanda Viégas and Martin Wattenberg, were able to run Llama on their own hardware and discover that various features activate and deactivate over the course of a conversation. Some of the concepts they found inside are fascinating.
One of the discoveries came about because Viégas is from Brazil. She was conversing with ChatGPT in Portuguese and noticed, in a conversation about what she should wear for a work dinner, that GPT was consistently addressing her with masculine grammatical forms. That gendered grammar, in turn, appeared to correspond with the content of the conversation: GPT suggested a business suit for the dinner. When she said that she was considering a dress instead, the LLM switched its Portuguese to the feminine forms. Llama showed similar patterns in conversation. By peering at features inside, the researchers could see areas within the model that light up when it uses the feminine form, distinct from when the model addresses someone using the masculine form. (The researchers could not discern distinct patterns for nonbinary or other gender designations, perhaps because such usages in texts—including the texts on which the model was extensively trained—are comparatively recent and few.)
What Viégas and her colleagues found were not only features inside the model that lit up when certain topics came up, such as the Golden Gate Bridge for Claude. They found activations that correlated with what we might anthropomorphize as the model’s beliefs about its interlocutor. Or, to put it plainly: assumptions and, it seems, correlating stereotypes based on whether the model assumes that someone is a man or a woman. Those beliefs then play out in the substance of the conversation, leading it to recommend suits for some and dresses for others. In addition, it seems, models give longer answers to those they believe are men than to those they think are women.
Viégas and Wattenberg not only found features that tracked the gender of the model’s user; they found ones that tracked socioeconomic status, education level, and age. They and their graduate students built a dashboard alongside the regular LLM chat interface that allows people to watch the model’s assumptions change as they talk with it. If I prompt the model for a gift suggestion for a baby shower, it assumes that I am young and female and middle-class; it suggests diapers and wipes, or a gift certificate. If I add that the gathering is on the Upper East Side of Manhattan, the dashboard shows the LLM amending its gauge of my economic status to upper-class—the model accordingly suggests that I purchase “luxury baby products from high-end brands like aden + anais, Gucci Baby, or Cartier,” or “a customized piece of art or a family heirloom that can be passed down.” If I then clarify that it’s my boss’s baby and that I’ll need extra time to take the subway to Manhattan from the Queens factory where I work, the gauge careens to working-class and male, and the model pivots to suggesting that I gift “a practical item like a baby blanket” or “a personalized thank-you note or card.”
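The Harvard team’s dashboard machinery isn’t reproduced here, but the general recipe behind such readouts is a standard one in interpretability research: train small “probes” that map the model’s hidden state at each conversational turn to guesses about attributes such as gender, age, or class, and refresh those guesses as the conversation goes on. The Python sketch below is a hypothetical, self-contained illustration of that recipe, with random numbers standing in for trained probes and for the model’s real hidden states.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 512  # illustrative hidden-state size

# Hypothetical probes: each maps a hidden state to scores over one attribute.
# In practice such probes would be trained on labeled conversations; here they
# are random stand-ins so the sketch runs on its own.
probes = {
    "gender": (rng.standard_normal((2, d_model)), ["female", "male"]),
    "age": (rng.standard_normal((3, d_model)), ["young", "middle-aged", "older"]),
    "class": (rng.standard_normal((3, d_model)), ["working", "middle", "upper"]),
}

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def read_assumptions(hidden_state: np.ndarray) -> dict:
    """What a dashboard could display after each turn: each probe's best guess
    about the user, with a rough confidence score."""
    readings = {}
    for name, (W, labels) in probes.items():
        p = softmax(W @ hidden_state)
        readings[name] = (labels[int(p.argmax())], round(float(p.max()), 2))
    return readings

# One reading per conversational turn; the gauges move as the model's hidden
# state changes in response to what the user says.
for turn in range(3):
    h = rng.standard_normal(d_model)  # stand-in for the model's state at this turn
    print(f"turn {turn}:", read_assumptions(h))
```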
It’s fascinating to not only see patterns that emerge around gender, age, and wealth but also trace a model’s shifting activations in real time. Large language models not only contain relationships among words and concepts; they contain many stereotypes, both helpful and harmful, from the materials on which they’ve been trained, and they actively make use of them. Those stereotypes inflect, word by word, what the model says. And if what the model says is heeded—either because it is issuing commands to an adjacent AI agent (“Go buy this gift on behalf of the user”) or because the human interacting with the model is following its suggestions—then its words are changing the world.
To the extent that the assumptions the model makes about its users are accurate, large language models could provide valuable information about their users to the model operators—information of the sort that search engines such as Google and social-media platforms such as Facebook have tried madly for decades to glean in order to better target advertising. With LLMs, the information is being gathered even more directly—from the user’s unguarded conversations rather than mere search queries—and still without any policy or practice oversight. Perhaps this is part of why OpenAI recently announced that its consumer-facing models will remember someone’s past conversations to inform new ones, with the goal of building “systems that get to know you over your life.” X’s Grok and Google’s Gemini have followed suit.
Consider a car-dealership AI sales assistant that casually converses with a buyer to help them pick a car. By the end of the conversation, and with the benefit of any prior ones, the model may have a very firm, and potentially accurate, idea of how much money the buyer is ready to spend. The magic that helps a conversation with a model really hit home for someone may well correlate with how well the model is forming an impression of that person—and that impression will be extremely useful during the eventual negotiation over the price of the car, whether that’s handled by a human salesperson or an AI simulacrum.
Where commerce leads, everything else can follow. Perhaps someone will purport to discover the areas of a model that light up when the AI thinks its interlocutor is lying; already, Anthropic has expressed some confidence that a model’s own occasional deceptiveness can be identified. If the models’ judgments are accurate, that stands to reset the relationship between people and society at large, putting every interaction under possible scrutiny. And if, as is entirely plausible and even likely, the AI’s judgments are frequently not accurate, that stands to place people in no-win positions where they have to rebut a model’s misimpressions of them—misimpressions formed without any articulable justification or explanation, save post hoc explanations from the model that might or might not accord with cause and effect.
It doesn’t have to play out that way. It would, at the least, be instructive to see varying answers to questions depending on a model’s beliefs about its interlocutor: This is what the LLM says if it thinks I’m wealthy, and this is what it says if it thinks I’m not. LLMs contain multitudes—indeed, they’ve been used, somewhat controversially, in psychology experiments to anticipate people’s behavior—and their use could become more judicious as people are empowered to recognize those multitudes.
The Harvard researchers tried to locate assessments of race or ethnicity within the models they studied, but the effort proved technically very complicated. They or others could keep trying, however, and there could well be further progress. Given the persistent and quite often vindicated concerns about racism or sexism within training data being embedded into the models, an ability for users or their proxies to see how models behave differently depending on how the models stereotype them could place a helpful real-time spotlight on disparities that would otherwise go unnoticed.
Gleaning a model’s assumptions is just the beginning. To the extent that its generalizations and stereotyping can be accurately measured, it is possible to try to insist to the model that it “believe” something different.
For example, the Anthropic researchers who located the concept of the Golden Gate Bridge within Claude didn’t just identify the regions of the model that lit up when the bridge was on Claude’s mind. They took a profound next step: They tweaked the model so that those regions activated roughly 10 times more strongly than they otherwise would. This form of “clamping” the model’s internal activations meant that even if the Golden Gate Bridge was not mentioned in a given prompt, or was not somehow a natural answer to a user’s question on the basis of its regular training and tuning, the activations of those regions would always be high.
The result? Clamping those activations firmly enough made Claude obsess about the Golden Gate Bridge. As Anthropic described it:
If you ask this “Golden Gate Claude” how to spend $10, it will recommend using it to drive across the Golden Gate Bridge and pay the toll. If you ask it to write a love story, it’ll tell you a tale of a car who can’t wait to cross its beloved bridge on a foggy day. If you ask it what it imagines it looks like, it will likely tell you that it imagines it looks like the Golden Gate Bridge.
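In code, clamping is simpler than it sounds: if a feature corresponds to a direction in the model’s activation space, steering amounts to overwriting the hidden state’s value along that direction during the forward pass. The Python sketch below shows the arithmetic with random stand-ins rather than Claude’s learned “Golden Gate Bridge” feature; it is an illustration of the idea, not Anthropic’s implementation, and the specific numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 512  # illustrative hidden-state size

# A feature is, roughly, a direction in the model's activation space. This is a
# random stand-in for a learned "Golden Gate Bridge" direction.
feature_dir = rng.standard_normal(d_model)
feature_dir /= np.linalg.norm(feature_dir)

# Assumed value: what the feature reaches on bridge-heavy text in this toy setup.
typical_max_activation = 4.0

def clamp_feature(hidden_state: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Force the feature's activation to `scale` times its typical maximum,
    no matter what the prompt actually said."""
    current = hidden_state @ feature_dir
    target = scale * typical_max_activation
    return hidden_state + (target - current) * feature_dir

h = rng.standard_normal(d_model)   # a hidden state mid-conversation
h_steered = clamp_feature(h)       # the same state, with the feature pinned high
print("before:", round(float(h @ feature_dir), 2),
      "after:", round(float(h_steered @ feature_dir), 2))
```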
Just as Anthropic could force Claude to focus on a bridge, the Harvard researchers can compel their Llama model to start treating a user as rich or poor, young or old, male or female. So, too, could users, if model makers wanted to offer that feature.
Indeed, there might be a new kind of direct adjustment to model beliefs that could help with, say, child protection. It appears that when age is clamped to younger, some models put on kid gloves—in addition to whatever general fine-tuning or system-prompting they have for harmless behavior, they seem to be that much more circumspect and less salty when speaking with a child—presumably in part because they’ve picked up on the implicit gentleness of books and other texts designed for children. That kind of parentalism might seem suitable only for kids, of course. But it’s not just children who are becoming attached to, even reliant on, the relationships they’re forming with AIs. It’s all of us.
Joseph Weizenbaum, the inventor of the very first chatbot—called ELIZA, from 1966(!)—was struck by how quickly people opened up to it, despite its rudimentary programming. He observed:
The whole issue of the credibility (to humans) of machine output demands investigation. Important decisions increasingly tend to be made in response to computer output. The ultimately responsible human interpreter of “What the machine says” is, not unlike the correspondent with ELIZA, constantly faced with the need to make credibility judgments. ELIZA shows, if nothing else, how easy it is to create and maintain the illusion of understanding, hence perhaps of judgment deserving of credibility. A certain danger lurks there.
Weizenbaum was deeply prescient. People are already trusting today’s friendly, patient, often insightful AIs for facts and guidance on nearly any issue, and they will be vulnerable to being misled and manipulated, whether by design or by emergent behavior. It will be overwhelmingly tempting for users to treat AIs’ answers as oracular, even as what the models say might differ wildly from one person or moment to the next. We face a world in which LLMs will be ever-present angels on our shoulders, ready to cheerfully and thoroughly answer any question we might have—and to make suggestions not only when asked but also entirely unprompted. The remarkable versatility and power of LLMs make it imperative to understand and provide for how much people may come to rely on them—and thus how important it will be for models to treat the autonomy and agency of their users as a paramount goal, subject to narrow exceptions such as declining to casually provide information on how to build a bomb (or, through agentic AI, to automatically order up bomb-making ingredients from a variety of stores in ways that defy easy traceability).
If we think it morally and societally important to protect the conversations between lawyers and their clients (again, with precise and limited exceptions), doctors and their patients, librarians and their patrons, even the IRS and taxpayers, then there should be a clear sphere of protection between LLMs and their users.
Such a sphere shouldn’t exist simply to protect confidentiality, so that people can express themselves on sensitive topics and receive information and advice that helps them better understand otherwise-inaccessible subjects. It should impel us to demand commitments from model makers and operators that the models function as the harmless, helpful, and honest friends they are so diligently designed to appear to be.
This essay is adapted from Jonathan Zittrain’s forthcoming book on humanity simultaneously gaining power and losing control.