An Agent Run Is Not Done When the Model Stops Talking

The Problem

You prompt an agent. It runs. Tokens stream out. It stops. You read the output. Done.

Except you have no idea if it's done.

When you run an AI agent on a real task, the model producing output is the easiest part. The hard part starts after the last token: did the agent actually finish the assigned work? Can you verify the output? Can you reproduce what led to the result? Can you tell what went wrong when it inevitably goes wrong?

Most agent frameworks treat the model's silence as a completion signal. The model stopped emitting tokens, so the run must be complete. This is the same as treating a process that hasn't crashed as one that succeeded. Production engineers know better. Agent builders should too.

The gap between "the model stopped generating" and "the task is complete" is where most real-world agent failures live.

The Gap

Current agent tools handle the running part well enough. Codex runs code in sandboxes. Claude Code edits files and runs tests. Devin opens a browser and clicks through workflows. These systems can start work, maintain context across turns, and produce artifacts.

What they don't answer:

Is the run complete, or did the model just stop talking because it hit a context limit, encountered a tool error it couldn't recover from, or decided the task was "good enough"?

Did the agent drift from its objective? A research task that returns a summary of three papers when you asked for five is not complete. A code change that passes tests but ignores two of four acceptance criteria is not complete.

What evidence exists for the claims in the output? If the agent says "the API returns 404 for invalid IDs," can you find the HTTP log that proves it?

Can you reproduce what happened? Not approximately. Exactly. Same tools, same inputs, same sequence of decisions.

These questions are not nice-to-haves for a monitoring dashboard. They are the difference between an agent system you can trust in production and one you have to babysit.

The Infrastructure Analogy

This problem was solved decades ago in a different domain. Job schedulers in production systems do not just start work. They track completion. They capture exit codes. They preserve logs. They chain dependencies so downstream work only starts when upstream work finishes cleanly. They surface failures immediately. They allow operators to re-run, roll back, or inspect any job without guessing.

Cron, Airflow, Kubernetes Jobs, systemd: these systems share a discipline. They treat execution as a lifecycle with defined states. A job is pending, running, succeeded, failed, or timed out. The transitions between states are explicit. The data at each transition is captured.

Agent systems need the same discipline. The dominant pattern right now: start the model, stream tokens, check if the stop token fired, return the output string. No exit code. No structured state machine. No artifact manifest. The run either produced text or it didn't, and you figure out the rest.

Imagine running a production database migration this way. "The script printed 'done' so I assume it worked." No one would accept that. But that is exactly what we accept from agent runs that cost hundreds of dollars in compute and produce outputs people act on.

An agent run is a production job. It needs production job infrastructure.
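To make that concrete, here is a minimal sketch of an agent run modeled as a lifecycle with explicit states and captured transitions. The state names, the RunRecord shape, and the transition fields are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# Which transitions are legal. Anything else is a bug in the orchestrator.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING, RunState.FAILED},
    RunState.RUNNING: {RunState.SUCCEEDED, RunState.FAILED, RunState.TIMED_OUT},
}


@dataclass
class RunRecord:
    run_id: str
    state: RunState = RunState.PENDING
    transitions: list = field(default_factory=list)  # captured at every state change

    def transition(self, new_state: RunState, reason: str,
                   exit_code: Optional[int] = None) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.transitions.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "from": self.state.value,
            "to": new_state.value,
            "reason": reason,        # e.g. "all tool calls returned" or "context limit hit"
            "exit_code": exit_code,  # the agent-level analogue of a process exit code
        })
        self.state = new_state
```

In this model, a run cannot reach SUCCEEDED without an explicit transition carrying a reason and an exit code. That is the point.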
What "Done" Actually Means

An agent run is done when you can answer four questions. All four. Not three of four. Not "probably" on any of them.

1. Did the process exit cleanly?

This is the floor. The model stopped generating tokens. Did it stop because it completed its reasoning, or because it hit a context window limit? Did a tool call time out? Did the inference server return an error that the orchestrator swallowed? Did the agent crash mid-execution and leave partial artifacts in your filesystem?

Production systems distinguish between exit code 0 and exit code 137. Agent systems need the same granularity. "The model stopped" is not an exit state. "The model completed its turn, all tool calls returned successfully, and the reasoning chain terminated with a completion signal" is an exit state.

2. Did the output match the objective?

This is harder than it sounds because objectives are often underspecified. But even with a well-specified objective, agents redefine "done" on the fly. You ask for a security audit of ten endpoints. The agent audits seven, declares the remaining three "out of scope," and returns. The run completed cleanly. The objective was not met.

You need a verification step that compares the output against the original objective, not against whatever the agent decided the objective was after three rounds of tool calls. This can be as simple as a checklist or as rigorous as a test suite. The point is that it exists and runs automatically.

3. Is there evidence supporting the claims?

Agents make claims. "This function is unused." "The API latency improved by 40%." "No regressions were introduced." These claims are sometimes correct. They are sometimes hallucinated. Without evidence, you cannot tell the difference.

Evidence means artifacts: logs, citations, test results, diffs, URLs, timestamps. Not more text from the model. The agent should collect and attach these artifacts before synthesizing its output. If the agent claims a function is unused, the artifact is the grep result showing zero call sites. If the agent claims latency improved, the artifact is the benchmark output with before and after numbers.

Output without evidence is an opinion. Production systems do not ship on opinions.

4. Can someone else reproduce or audit what happened?

Reproducibility requires a record of what the agent did: which tools it called, what inputs it provided, what outputs it received, what decisions it made at each step, and what the environment looked like at each point. This is a trace, not a summary.

Auditing requires that the trace is stored, indexed, and queryable after the fact. Not in a log file you grep manually. In a structured format that lets you answer: "What happened at step 14 and why?"

Without reproducibility, you cannot debug. Without auditability, you cannot trust.

These are not theoretical concerns. They show up the first time an agent run produces a wrong answer that someone acts on.
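As a rough illustration of what "stored, indexed, and queryable" can mean, here is a sketch of a step-level trace store backed by SQLite. The schema, field names, and helper functions are assumptions made for the example, not a standard format.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS trace_steps (
    run_id     TEXT NOT NULL,
    step       INTEGER NOT NULL,
    tool       TEXT NOT NULL,
    input      TEXT NOT NULL,   -- JSON: exact arguments passed to the tool
    output     TEXT NOT NULL,   -- JSON: exact result returned, or the error
    decision   TEXT NOT NULL,   -- the agent's stated reason for taking this step
    env        TEXT NOT NULL,   -- JSON: working dir, git SHA, model and version, etc.
    started_at TEXT NOT NULL,
    ended_at   TEXT NOT NULL,
    PRIMARY KEY (run_id, step)
);
"""


def open_trace_store(path: str = "traces.db") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def record_step(db: sqlite3.Connection, run_id: str, step: int, *, tool: str,
                tool_input: dict, tool_output: dict, decision: str, env: dict,
                started_at: str, ended_at: str) -> None:
    # Called by the orchestrator around every tool call, not reconstructed afterward.
    db.execute(
        "INSERT INTO trace_steps VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, step, tool, json.dumps(tool_input), json.dumps(tool_output),
         decision, json.dumps(env), started_at, ended_at),
    )
    db.commit()


def step_detail(db: sqlite3.Connection, run_id: str, step: int):
    # "What happened at step 14 and why?" becomes a query, not a grep.
    return db.execute(
        "SELECT tool, input, output, decision FROM trace_steps "
        "WHERE run_id = ? AND step = ?",
        (run_id, step),
    ).fetchone()
```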
The Cost of Not Knowing

The costs compound. They do not appear one at a time.

Silent failures. An agent drifts from its objective, completes a different task, and returns output that looks correct at a glance. No one catches it because the run reported success. The drift is only discovered days later when someone depends on the output and it does not cover what they need.

Orphaned processes. The model stops generating, but a background tool call is still running. The orchestrator considers the run complete. The background process finishes, writes a file, and that file sits undiscovered until it conflicts with a later run. The original run is long gone from the logs. No way to trace the orphan back to its parent.

Overconfident outputs with no provenance. The agent produces a detailed analysis. It cites sources, references data, and draws conclusions. None of the citations are real. The data was hallucinated. But the output reads well, so it gets pasted into a document and circulated. Provenance tracking, where each claim links to a verifiable artifact, prevents this. Most agent systems do not have it.

GPU time burned on unverifiable work. An agent runs for thirty minutes on a GPU. It produces output that cannot be verified because there is no trace, no evidence, and no structured state record. You have an expensive text file and no way to determine if it is correct. This is not sustainable at scale.

Erosion of trust. Every silent failure, every hallucinated citation, every orphaned process makes people trust agent output less. Not the model. The output. The work product. When people stop trusting the output, they start re-doing the work manually to verify it. The agent becomes an expense that buys you nothing: you run it, then you redo its work. Trust, once lost in production systems, takes a long time to rebuild.

What to Do

The following steps are not aspirational. They are things I have implemented, in some form, for every agent system I have put into production use.

Track the process tree. Do not treat the agent as a single process. It is a process tree: the orchestrator spawns tool calls, tool calls spawn sub-processes, sub-processes write files. Track every node in that tree. Record when each node starts, when it exits, and what exit code it returns. If a leaf process is still running when the orchestrator declares completion, the run is not done. Period.

Collect evidence before generating artifacts. Structure the agent's workflow so that evidence collection happens before synthesis. If the agent needs to produce a research summary, it should first collect the papers, extract the relevant data, and store those raw materials as artifacts. The summary is then generated from the artifacts, not from the model's parametric memory. This makes the output verifiable: you can check the artifacts against the claims.

This is a workflow constraint, not a model capability issue. The same model that hallucinates citations when generating from memory will produce accurate, verifiable output when generating from collected artifacts. The difference is infrastructure, not intelligence.

Install quality gates that reject incomplete output. A quality gate is an automated check that runs between the agent producing output and that output being accepted. The simplest gate: does the output reference artifacts that exist? If the agent claims to have run a test, does a test result file exist? If the agent cites a URL, does the URL return a 200? These checks are not expensive. They catch a surprising number of failures.

More sophisticated gates check coverage: did the agent address every item in the objective? Did it produce the minimum set of deliverables? Did it stay within the assigned scope?

Gates should reject output, not warn. A warning is a log line nobody reads. A rejection forces the agent to retry or forces a human to intervene. Both outcomes are better than accepting bad output silently.
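A few of these gates fit in a handful of lines. The sketch below assumes the orchestrator has already extracted the referenced artifact paths, cited URLs, and objective checklist from the run; the function names and the GateRejection exception are illustrative, not an existing library.

```python
from pathlib import Path
import urllib.error
import urllib.request


class GateRejection(Exception):
    """A rejection, not a warning: the output is not accepted until the gate passes."""


def check_artifacts_exist(referenced_paths: list[str], artifact_dir: Path) -> None:
    # Simplest gate: every artifact the output references must actually exist.
    for rel in referenced_paths:
        if not (artifact_dir / rel).exists():
            raise GateRejection(f"output references missing artifact: {rel}")


def check_urls_resolve(urls: list[str], timeout: float = 10.0) -> None:
    # If the agent cites a URL, the URL has to answer with a 200.
    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status != 200:
                    raise GateRejection(f"cited URL returned {resp.status}: {url}")
        except urllib.error.URLError as exc:
            raise GateRejection(f"cited URL did not resolve: {url} ({exc})")


def check_coverage(objective_items: list[str], addressed_items: list[str]) -> None:
    # Coverage gate: every item in the original objective must be addressed.
    missing = set(objective_items) - set(addressed_items)
    if missing:
        raise GateRejection(f"objective items not addressed: {sorted(missing)}")
```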
Prevent overlapping GPU work with dispatch guards. When multiple agent runs target the same GPU resources, you get contention, OOM errors, and degraded output quality. A dispatch guard is a coordination layer that ensures only the approved set of runs is active on a given resource at a given time. It is a semaphore for GPU work.

This is not about efficiency. It is about correctness. An agent run that gets preempted mid-inference because another run grabbed its GPU produces corrupted output. The orchestrator often does not detect this. The output looks normal but is incomplete or incoherent. Dispatch guards prevent the condition entirely.

Verify exit states explicitly. Do not infer completion from silence. After the model stops generating, check: did all tool calls return? Did all background processes exit? Did the model's final message indicate completion or truncation? Does the output artifact manifest match what was requested?

If any check fails, the run state is "failed," not "completed with warnings." Record the failure reason. Surface it to the operator. Do not return a partial result as if it were a complete one.

Treat the agent like a production job. This is the through-line. An agent run is not a REPL session. It is not a chat. It is a production job with inputs, outputs, side effects, and failure modes. It deserves the same infrastructure discipline you would apply to a cron job, a database migration, or a deployment pipeline.

That means: state machines, not status flags. Structured logs, not console output. Artifact manifests, not loose files. Exit codes, not silence. Dependency tracking, not fire-and-forget tool calls.

The model is the compute. The infrastructure around the model is the system. The system is what determines whether the output is trustworthy. Build the system accordingly.
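To put "exit codes, not silence" in concrete terms, here is a rough sketch of the explicit exit-state check described above. The ExitCheck fields and the "stop" versus "length" finish reasons follow common inference-API conventions but are assumptions here, not any specific orchestrator's interface.

```python
from dataclasses import dataclass


@dataclass
class ExitCheck:
    tool_calls_outstanding: int       # tool calls started but never returned
    background_pids_alive: list[int]  # child processes still running at "completion"
    finish_reason: str                # e.g. "stop" vs. "length" on the final model turn
    requested_artifacts: set[str]     # what the objective asked for
    produced_artifacts: set[str]      # what actually landed in the artifact store


def resolve_exit_state(check: ExitCheck) -> tuple[str, list[str]]:
    """Return ("succeeded" | "failed", reasons). Silence never counts as success."""
    reasons = []
    if check.tool_calls_outstanding:
        reasons.append(f"{check.tool_calls_outstanding} tool call(s) never returned")
    if check.background_pids_alive:
        reasons.append(f"background processes still running: {check.background_pids_alive}")
    if check.finish_reason != "stop":
        reasons.append(f"model stopped for '{check.finish_reason}', not completion")
    missing = check.requested_artifacts - check.produced_artifacts
    if missing:
        reasons.append(f"artifact manifest incomplete, missing: {sorted(missing)}")
    return ("failed", reasons) if reasons else ("succeeded", reasons)
```

Every reason goes into the run record and is surfaced to the operator; a non-empty list means the run is failed, not "completed with warnings."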
