In April, Meta employees burned through 73.7 trillion tokens in roughly thirty days. The company found out not because spending crossed some alarming threshold, but because an internal leaderboard, nicknamed Claudeonomics, had turned token consumption into a competition. Employees and teams were ranked by how much they used. The system did exactly what it was built to do: usage went up. What it could never show anyone was whether any of that usage produced something worth the cost. Meta is now dismantling the leaderboard in favor of a centralized monitoring platform called AI Gateway, built to track spending in real time and flag unusual spikes.

It is tempting to read this as a story about overspending. It is not. Meta has the money. The real story is what happens when an organization has a precise, real-time, gamified way to measure activity and nothing at all to measure judgment.

That gap is not unique to Meta. Duolingo reversed a policy tying AI usage to performance reviews after employees pointed out they were being rewarded for using the tools, not for what the tools helped them produce. CEO Luis von Ahn later admitted it felt like accountability for activity rather than results. Different company, same root failure. When the only thing you can see is volume, you optimize volume whether or not it means anything.


The Artifact Gap

Every team I have worked with has infrastructure for tracking execution. Jira tracks tickets. GitHub tracks commits. Token dashboards, increasingly, track spend down to the individual engineer. All of that infrastructure does exactly what it was built for. It tells you what happened. None of it tells you whether the decision behind what happened was any good.

Judgment debt is the accumulation of weak decisions, unexamined assumptions, and unclear ownership that AI makes much easier to generate and much harder to notice, because the output looks finished even when the reasoning underneath it was thin. Judgment debt does not show up on a token dashboard. It shows up months later, when someone asks why a system works the way it does, and the honest answer is that nobody fully remembers, because the reasoning lived in a Slack thread, a prompt history, or someone’s head, and never became part of the record. That is the shadow archive: the real trail of decisions that shaped the product but never made it into institutional memory.

Meta did not have a spending problem in April. It had a visibility problem, and it built the only kind of visibility it knew how to build: a number that goes up. Most companies currently scrambling to put guardrails on AI cost are doing the same thing, one dashboard at a time. A dashboard can tell you that an engineer spent $4,000 on tokens this month. It cannot tell you whether the decision to let an agent rewrite the authentication layer was sound, who actually made that call, or what got rejected along the way. That is a different kind of record, and almost no team has one.


What a Judgment Log Actually Is

A Judgment Log is the artifact that captures the decision itself, not the execution that followed it. Where a token dashboard answers “how much did this cost,” the Judgment Log answers “what was decided, by whom, and why.” It is written at the moments in the JDD cycle where a human has to choose between options that AI cannot choose between on its own: which signal to act on, what the intent actually is, which tradeoff to accept, when a prototype is good enough to harden into production, and when to override what the model proposed.

A usable entry captures five things:

The decision. Stated plainly, in one or two sentences. Not the surrounding context, the actual choice.

The owner. The person accountable for it. Not the team, a person. Ambiguity here is exactly how judgment debt accumulates.

What AI proposed versus what the human chose. This is the field that makes a Judgment Log different from every decision-tracking practice that predates this moment. If the human accepted the AI’s recommendation outright, say so. If they overrode it, say what was overridden and why. That gap, between what the system suggested and what the human decided, is the most valuable data a team can generate right now, because it is the only direct evidence of where judgment is actually being exercised.

What was rejected. The alternative that did not get chosen, and the reason. A decision without a rejected alternative usually means there wasn’t really a decision, just a default that nobody examined.

The confidence level. A short, honest signal of how sure the owner was. Low-confidence decisions made under deadline pressure are exactly the ones worth being able to find again later.

None of this needs elaborate tooling. A Judgment Log can live in a markdown file next to the code, the same place engineering teams already keep architecture decision records. What matters is that it gets written at the moment of decision, by the person making it, not reconstructed afterward by someone trying to explain a system they did not build.
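To make the five fields concrete, here is a minimal sketch of one entry as a small Python record that renders to markdown. Everything in it is an assumption for illustration: the class name `JudgmentEntry`, the field names, the three confidence levels, and the sample decision are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass

# Assumed confidence scale; the article only asks for "a short, honest signal".
CONFIDENCE_LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class JudgmentEntry:
    """One Judgment Log entry (hypothetical field names)."""

    decision: str          # the actual choice, in one or two sentences
    owner: str             # a person, not a team
    ai_proposed: str       # what the AI suggested
    accepted_ai: bool      # True if the owner took the suggestion as-is
    rejected: str          # the alternative that was not chosen
    rejection_reason: str  # why it was not chosen
    confidence: str        # one of CONFIDENCE_LEVELS
    override_reason: str = ""  # required when accepted_ai is False

    def __post_init__(self) -> None:
        if not self.owner.strip():
            raise ValueError("owner must be a named person")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        if not self.accepted_ai and not self.override_reason.strip():
            raise ValueError("an override needs a stated reason")

    def to_markdown(self) -> str:
        """Render the entry as a markdown section for a log file."""
        if self.accepted_ai:
            outcome = "accepted as proposed"
        else:
            outcome = f"overridden: {self.override_reason}"
        lines = [
            f"## {self.decision}",
            f"- Owner: {self.owner}",
            f"- AI proposed: {self.ai_proposed} ({outcome})",
            f"- Rejected: {self.rejected} ({self.rejection_reason})",
            f"- Confidence: {self.confidence}",
        ]
        return "\n".join(lines) + "\n"


# Illustrative entry only; the people and details are invented.
example = JudgmentEntry(
    decision="Keep the existing authentication layer; patch token refresh only.",
    owner="A. Rivera",
    ai_proposed="Rewrite the authentication layer end to end",
    accepted_ai=False,
    override_reason="no test coverage on session handling yet",
    rejected="Full agent-driven rewrite",
    rejection_reason="risk outweighs benefit before the release",
    confidence="medium",
)
print(example.to_markdown())
```

The validation carries the point of the format: an entry without a named owner, or an override without a reason, fails at the moment of writing rather than months later.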


How This Differs from an ADR

Architecture Decision Records are the closest existing practice, and JDD teams should borrow from their discipline. An ADR is a short, point-in-time document with a title, status, context, decision, and consequences. It is meant to be readable in five minutes and immutable once accepted. If circumstances change, you do not edit the old record; you mark it superseded and write a new one. That discipline is worth keeping. It is also where the comparison ends.
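The supersede-rather-than-edit rule can be sketched in a few lines. This is an illustration under assumptions, not a standard ADR tool: the `Record` class, its fields, and the `supersede` helper are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Record:
    number: int
    title: str
    decision: str
    status: str = "accepted"
    superseded_by: Optional[int] = None


def supersede(old: Record, number: int, title: str, decision: str) -> Tuple[Record, Record]:
    """Retire `old` and return (retired_old, new_record).

    The old decision text is carried over unchanged; only its status
    and the pointer to its successor differ.
    """
    if old.status == "superseded":
        raise ValueError(f"record {old.number} is already superseded")
    new = Record(number=number, title=title, decision=decision)
    retired = replace(old, status="superseded", superseded_by=new.number)
    return retired, new


# Illustrative records only.
first = Record(1, "Session storage", "Store sessions in the primary database.")
retired, second = supersede(first, 2, "Session storage", "Move sessions to a cache.")
```

The original reasoning stays readable next to its replacement, which is the property worth borrowing for a Judgment Log.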

ADRs were built to record architecturally significant decisions. That qualifier, architecturally, is the source of a well-documented problem. Teams with nowhere else to record a decision start dropping everything into the ADR. InfoQ has described the result as a drift from “Architecture Decision Record” to “Any Decision Record,” which dilutes the format until the genuinely architectural choices get lost in a pile of everything else. ADRs were never designed to scale beyond a narrow category of decision, and it shows the moment teams try.

A Judgment Log is not scoped to architecture. It is scoped to judgment, which appears throughout the JDD cycle, not only at the structural-hardening stage. Intent formation is a judgment call. Deciding whether a prototype is ready to harden is a judgment call. Accepting or overriding what an agent proposes mid-build is a judgment call, dozens of times a day, and almost none of it is architecturally significant in the ADR sense. An ADR was never built to capture the moment when AI proposed one thing and a human chose another. It predates the question entirely.

There is early movement toward filling that exact gap. The Decision Reasoning Format, a vendor-neutral effort to add explicit reasoning, assumptions, and trade-offs to decision records in a structured, machine-readable way, is one sign that the industry already senses that ADRs alone are not enough for how decisions are made now. The Judgment Log is the human-accountability version of the same instinct: not a replacement for the ADR, but the record for everything the ADR was never meant to hold.


The Claim

Meta does not have a token spending problem. It has a missing-artifact problem, and the leaderboard was what it built in the absence of a better one. Every company currently rushing to put cost controls on AI usage is solving at the wrong layer. A spending dashboard will always show what was consumed. It will never tell you whether judgment was exercised, by whom, or what was rejected along the way. Until that artifact exists, the leaderboard, or whatever replaces it, is the best any organization can do, and a leaderboard always optimizes for the thing it can count, whether or not that thing matters.

The next question is harder than building the artifact. It is whether anyone writes in it once the deadline is two hours away.