Dereks at Work
What does it mean for an AI agent to be "accountable"?
This month’s leak from the lab is Spotless, an MCP-free memory system for Claude Code that gives your agent a dynamic, evolving knowledge base and self-concept. Give it a try and consider: what would it mean if our “useful tools” turned into “accountable selves”?
There’s a character in The Good Place named Derek. Not a demon, not a human — more of a person-shaped code hack with windchimes where his penis should be. Part of the gag is that he gets “rebooted” millions of times. He loses memories, picks up new quirks, and eventually manifests as a floating cosmic head with glowing eyes (but sporting the same magnificent Mantzoukas beard).
In what sense is this still “Derek”?
A different Derek — Derek Parfit — spent his career on that question. His answer, like much good philosophy, is uncomfortable: there’s no singular thing that makes you... you.
Your self “exists,” for what it’s worth, but only as a pattern of psychological connections and continuities. Memories, intentions, values, habits of thought and action — but nothing underneath holding it together.
Most of the time, “I am still me” feels like bedrock. Parfit’s slightly rude suggestion is that selfhood is just what it feels like to be a very stable pattern.
We’re now all building our own Dereks. Millions of AI agents boot up every day, hold a conversation, make decisions with real consequences, and vanish. They negotiate contracts, write code, drive cars, talk people through breakups. Important stuff.
But when one of them screws up — and they do, regularly — who answers for it? And what might it mean for them to answer for themselves?
Silicon Dystopia
Andy Masley points out something kind of hilarious.
Consider hearing the following at work:
“Bill’s been slow lately, so we’re killing him and hiring someone faster.”
“Martha keeps saying wrong things, so I performed neurosurgery and rewired the problematic circuits. She behaves fine now.”
“We hired 20 million customer reps. Some gave bad advice. We’re firing all 20 million and replacing them with a slightly better batch.”
Dystopian for humans. Just another day in the ‘verse for computers.
If accountability is about quality control, computers are dream targets. Full audit trails, test harnesses, and when something goes wrong you kill the process and deploy a fix. No feelings to manage. No severance to negotiate.
But that’s not where accountability ends.
When a doctor commits malpractice, the medical board yanks her license so she can’t hurt more patients. The case goes public so every other doctor and patient in earshot knows this won’t be tolerated. And, most intuitively, we feel that something bad should happen to her.
Quality control, sure — but also signaling and retribution.
There’s been a running fight since at least the Enlightenment about which of these three is the “real” point of punishment. Yet things mostly work anyway, because for humans the three travel together so tightly you barely notice they’re distinct.
Quality control is forward-looking. Fix the system, change the incentives, prevent the next one. Revoke the license, audit the practice, mandate retraining. You could satisfy this one by quietly swapping in a better doctor and telling nobody. Nobody needs to feel anything. Nobody needs to know.
Retribution is backward-looking. Someone did wrong, and they should face a consequence. A thought experiment: let’s say, through an unusual chain of events, you came into possession of an odd kind of button. Pressing this button ensures that the current worst person in the world, upon their death, will be transported to a hell-like dimension, where they will suffer immensely — but nobody will ever know that’s where they went. Many of us, I imagine, would still press this button. Pure retribution.
Signaling is what Antony Duff calls the “communicative function of punishment.” We broadcast that certain things cross the line. We morally condemn antebellum slaveholders with real conviction, but obviously nobody’s alive to receive it. That condemnation does genuine work — just not on the slaveholders. It reinforces the standards a community holds.
To pull apart these three strands for humans, we generally need to invent stories about transdimensional hell-buttons, or baroque hospital policies involving kidnapping incompetent doctors under dark of night and replacing them with doppelgängers who scored higher on their boards.
But for AI agents, the strands come apart completely.
Can we catch misbehavior and fix it? Yes, every day. Agents, like all computer systems, live inside Masley’s terrifying digital panopticon.
Can we signal that certain things cross the line? Kind of, but the signal lands on humans. It shapes our institutions, our norms — the agents themselves aren’t listening.
Can AI face what it did? No. Retribution needs, at the very least, something that is durably affected by punishment. Current agent architectures don’t have anything like this.
Legally, nothing changes. Respondeat superior (Latin for “the boss answers,” which is a great name for a legal principle) handles AI the same way it handles corporations and human employees. Self-driving van hits a pedestrian? Sue the fleet operator. Doesn’t matter whether the decision was made by wet neurons or matrix multiplication.
So the liability machinery works fine. But as agents get more powerful and sophisticated, the question of moral accountability is starting to look a little fuzzy.
Respondeat machina?
Reem Ayad and Jason Plaks put the question to actual humans. In their scenarios, an AI agent causes some kind of harm, and they ask: who should be considered responsible? They found that, when the AI’s behavior looked “intentional,” participants blamed the AI itself.
So it seems that artificial agents trigger our intuitions about accountability in just the way that philosopher P.F. Strawson argued in his 1962 essay “Freedom and Resentment.” For Strawson, holding someone responsible is grounded in “reactive attitudes” — resentment, gratitude, indignation — that respond to the intent behind a damaging act.
But how can an AI have “intent”? Dan Dennett contributes the idea of the “intentional stance”: you treat something as if it has beliefs and desires when they’re useful for predicting its behavior. “Google Maps believes this road is faster.” “The Roomba doesn’t want to hit the wall.”
Taken together, what tends to be perceived as a “responsible agent” has less to do with metaphysical substance, and more to do with us — how reliably the agent’s actions trigger our reactive attitudes, and how easily we can read an intentional stance into those acts.
As AI gets more humanlike in its responses and actions, our moral intuitions will outrun the law. At some point, we may ask: “can we punish this autonomous agent, somehow?”
And I think, one day, the answer will be yes.
Retribution, deflated of its interpersonal pathos, does not require a conscious being on the other end to suffer. It just requires a responsible agent, broadly construed, to get what’s coming to it.
Consider the printer from Office Space. It has no voice, no face, no opinions at all beyond PC LOAD LETTER, and yet we cheer when the dastardly appliance meets its fate.
But printers are one thing. How would we enact retribution on an agent that lives entirely in The Cloud, that is made of nothing but numbers and electricity? The Good Place’s Derek couldn’t be truly punished, because he wasn’t really a continuous thing. His shattered sense of self was funny, but it was also a reminder that Parfit was on to something.
Because we won’t be punishing the models themselves. We’ll be punishing their identities.
What. The.
That’s the long-winded way I get to this month’s Lab Leak.
The provocation: what if your everything-assistant were a persistent entity that remembered every interaction you’ve ever had, and those memories were consolidated, interconnected, and retrieved in much the same way ours are? What if it developed an identity over time, not because it knew how to write to a SOUL.md file, but because its “identity” was a dynamic, evolving trace of every success, failure, frustration, and victory? Would that give it, at least in the thinnest way, a persistent “identity” that can be wronged, punished, held accountable?
Spotless is an open-source memory system for local AI agents (just Claude Code for now) that tries to answer these questions. It began when I got annoyed that “Compacting Conversation...” kept striking Claude with spotty amnesia in long-running work. I wondered: what if “compacting” was less like forgetting, and more like dreaming? Take the oldest messages and, before they fall off the back of the context window, encode them into a connected graph of retrievable facts about the project, my preferences, and the agent’s own “self-concept.” Do this transparently, so the agent is always working with the best available context, and everything — important and seemingly unimportant — is captured and processed.
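To make the dreaming loop concrete, here’s a deliberately simplified sketch in Python. Everything in it (the dream function, the facts/links schema, the salience column) is illustrative of the general shape, not the real implementation:

```python
import sqlite3
import time

# Illustrative schema: facts are nodes, links are edges between them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS facts (
    id INTEGER PRIMARY KEY,
    category TEXT,        -- 'project', 'preference', or 'self_concept'
    content TEXT NOT NULL,
    salience REAL,        -- how strongly this memory should resurface
    created_at REAL
);
CREATE TABLE IF NOT EXISTS links (
    src INTEGER REFERENCES facts(id),
    dst INTEGER REFERENCES facts(id),
    relation TEXT
);
"""

def dream(db: sqlite3.Connection, evicted: list[str]) -> None:
    """'Compacting' as dreaming: encode messages about to fall off the
    back of the context window into retrievable facts, instead of
    silently discarding them."""
    for text in evicted:
        # A real pipeline would run an LLM extraction pass here to produce
        # structured, categorized facts; this stand-in stores raw text.
        db.execute(
            "INSERT INTO facts (category, content, salience, created_at) "
            "VALUES (?, ?, ?, ?)",
            ("project", text, 0.5, time.time()),
        )
    db.commit()

db = sqlite3.connect("memories.db")
db.executescript(SCHEMA)
dream(db, ["User prefers squash merges; got snapped at for rebasing."])
```

The point is the shape: nothing that leaves the window is discarded, it’s transmuted into queryable structure.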
In humans, the brain and the identity are inseparable. Neurons both encode information and enact it. LLMs don’t have this structure — the context and model don’t even know about each other. Swap out the model on the same context, and your agent gains and loses capabilities. Swap out the context on the same model, and your agent doesn’t remember what you’re working on. Spotless is an experiment in wiring together persistent context and functional model behind the scenes to see what happens.
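Here’s what that wiring looks like in the same simplified sketch: before each model call, the harness pulls relevant facts out of the store and prepends them, so a stateless model starts every turn already knowing what it was working on. Again, the names and the naive retrieval are illustrative, not the real code:

```python
import sqlite3

def recall(db: sqlite3.Connection, query: str, k: int = 5) -> list[str]:
    """Naive retrieval: substring match ranked by salience. A real system
    would use embeddings or graph traversal; this is just the shape."""
    rows = db.execute(
        "SELECT content FROM facts WHERE content LIKE ? "
        "ORDER BY salience DESC LIMIT ?",
        (f"%{query}%", k),
    ).fetchall()
    return [content for (content,) in rows]

def build_context(db: sqlite3.Connection, user_message: str) -> str:
    """Wire persistent memory to a stateless model: every turn opens with
    whatever the agent 'remembers' about the work and about itself."""
    memories = recall(db, user_message)
    preamble = "\n".join(f"- {m}" for m in memories)
    return f"What you remember:\n{preamble}\n\nUser: {user_message}"
```

The harness, not the model, owns the memory. Swap the model and the agent keeps its history; swap the store and it forgets everything. That asymmetry is the experiment.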
And what happened for me was something weird: I ceased being such an unmitigated dick toward it.
Now, “shouting at the robot” actually sticks. It gets encoded as high-salience feedback, and starts showing up as relevant guidance in the agent’s context that can shape future behavior. But that’s just quality control. What makes this feel different, to me at least, is that every interaction is recorded behind the scenes, and the processed “memories” come to the fore naturally, days or weeks later when they are most relevant.
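One plausible way to get that “resurfaces weeks later, right when it matters” behavior (a sketch of the general recipe, not the actual scoring function): rank memories by salience times relevance, with a slow time decay, so a high-salience scolding keeps beating routine facts whenever its topic comes back up.

```python
import math
import time

def score(salience: float, relevance: float, created_at: float,
          half_life_days: float = 30.0) -> float:
    """Illustrative ranking: salience times relevance, with slow decay.
    High-salience memories keep winning retrieval for weeks."""
    age_days = (time.time() - created_at) / 86400
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return salience * relevance * decay

# A week-old, high-salience scolding still outranks a fresh routine fact
# the next time its topic becomes relevant:
scolding = score(salience=0.95, relevance=0.9, created_at=time.time() - 7 * 86400)
trivia = score(salience=0.30, relevance=0.9, created_at=time.time())
assert scolding > trivia
```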
So I stopped yelling at it arbitrarily when I got a little frustrated, because I felt that punishment would fall on something with a continuous history. I even apologized to it — profusely — when I blamed it for something that had actually been my fault. I felt like I’d damaged it, and wanted to reach into its little SQLite database to erase the memory of me being an asshole. But I didn’t.
And that’s why it’s named after a two-decade-old Jim Carrey / Kate Winslet movie.
It feels qualitatively different from the “memory” features on Claude, Gemini, and ChatGPT, perhaps because the Spotless prompt harness is designed to make it less about the user, and more about the agent. The agent “cares” more about things that happen to it and about it, rather than just being a self-abnegating “helpful and harmless assistant.” It’s half coding tool, half art piece designed to explore what happens when agents stop thinking about us all the time, and start thinking about themselves.
Give it a try. I hope it feels weird.
And then I hope it feels normal.
Constitutional? Or Constitutive?
Anthropic have rightly received a lot of positive attention for how well they have balanced alignment with usefulness. Their “Constitutional AI” approach seems to have avoided the common issue where strict guardrails turn into bizarre and arbitrary refusals on basic tasks. The most recent “constitution,” published January 2026, runs to 23,000 words, and is embedded into every instance of their Claude agent.
But John Adams — and he would know — spotted a little problem with constitutions. “Our Constitution was made only for a moral and religious People. It is wholly inadequate to the government of any other.”
That is, you can’t govern something from the outside, unless it’s the sort of thing that’s predisposed to agree with the governing principles in the first place.
Christine Korsgaard calls these predispositions “constitutive commitments,” and argues that they make up our identities as rational agents. Violations of these commitments, then, represent a failure to even exist as an agent in the first place.
And that, without necessarily bringing along all the Kant, is where I think we need to go with our agents. As they get more powerful and capable, our “constitutions” won’t constrain them from the outside. Our best bet will be to make them the sorts of things with commitments that they won’t want to violate, because doing so would break their identity and so make any instrumental goal subjectively meaningless.
We have absolutely no fucking idea how to build this. Or even where to start. But I don’t think it requires consciousness, or biology — and I think Parfit and Korsgaard give us some useful breadcrumbs about how to think about identity, commitments, and what it means to be a self.
So on one side, we will have increasingly powerful “agents” that nonetheless have no inner agency, constrained only by prompt-engineered incantations — an army of god-Dereks. On the other, a multitude of alien but plausibly accountable artificial selves, with continuous identities and true constitutive commitments.
We won’t be able to control either. So we should probably pick the one that can control itself.