Harness Engineering | Piyush Vyas

When an agent fails, the model gets blamed first. Sometimes that is fair. More often, the failure is in the harness: the code around the model that decides what it sees, what it can call, what gets checked, and when the loop stops.

Figure: hand-drawn harness engineering diagram showing the model surrounded by context, tools, memory, policy, sandbox, and traces. The model is only one box. The harness decides how that box touches the real world.

I use "harness" for the runtime frame around an agent. It includes the system prompt, context engine, tool interface, memory, permission checks, sandbox, loop controller, subagent routing, and observability.

The distinction I care about is this: scaffolding is what you set up before the run. The harness is what executes on every turn. A single turn runs roughly these steps:

assemble context
call model
parse tool call
check policy
run tool
record result
decide whether to continue
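The steps above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: the `Harness` class, the `CALL name arg` reply format, and the `model` callable are all hypothetical stand-ins for whatever your stack actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Harness:
    model: Callable[[List[str]], str]       # hypothetical: context in, raw reply out
    tools: Dict[str, Callable[..., str]]
    allowed: Set[str]
    max_turns: int = 8
    trace: List[str] = field(default_factory=list)

    def parse(self, reply: str) -> Optional[ToolCall]:
        # Toy wire format (an assumption): "CALL <tool> <arg>".
        # Anything else is treated as the final answer.
        parts = reply.split(maxsplit=2)
        if len(parts) >= 2 and parts[0] == "CALL":
            return ToolCall(parts[1], {"arg": parts[2] if len(parts) > 2 else ""})
        return None

    def run(self, goal: str) -> str:
        history: List[str] = []
        for _ in range(self.max_turns):               # decide whether to continue
            context = [f"GOAL: {goal}", *history]     # assemble context
            reply = self.model(context)               # call model
            call = self.parse(reply)                  # parse tool call
            if call is None:
                return reply                          # no tool call: final answer
            if call.name not in self.allowed or call.name not in self.tools:
                result = f"DENIED: {call.name}"       # check policy
            else:
                result = self.tools[call.name](**call.args)   # run tool
            self.trace.append(f"{call.name} -> {result}")     # record result
            history.append(f"{reply}\n{result}")
        return "STOPPED: turn limit reached"
```

Note that the goal is re-pinned at the top of the context every turn, a denied call is recorded and fed back rather than silently dropped, and the loop has a hard turn limit so it always terminates.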

Most production failures I have seen live in three places.

Context. The model only knows what the harness includes. If the original goal falls out of the window, the agent drifts. If every tool schema is loaded every turn, the context budget disappears before the work starts. Good harnesses treat context like a budget, not a transcript.
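One way to treat context as a budget is to pin the goal, load only the tool schemas the current step needs, and spend whatever is left on the most recent history. The sketch below assumes a hypothetical `assemble_context` function and uses a whitespace word count as a crude stand-in for a real tokenizer.

```python
from typing import Callable, Dict, List


def assemble_context(
    goal: str,
    history: List[str],
    tool_schemas: Dict[str, str],
    needed_tools: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # crude stand-in
) -> List[str]:
    """Pin the goal, include only the needed schemas, then fill with recent history."""
    # The goal is always first, so it cannot fall out of the window.
    pinned = [f"GOAL: {goal}"]
    # Only the schemas this step needs, not every schema every turn.
    pinned += [tool_schemas[name] for name in needed_tools if name in tool_schemas]

    used = sum(count_tokens(p) for p in pinned)
    if used > budget_tokens:
        raise ValueError("pinned context alone exceeds the budget")

    # Walk history from newest to oldest and stop when the budget runs out.
    recent: List[str] = []
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break
        recent.append(message)
        used += cost

    # Restore chronological order for the messages that fit.
    return pinned + list(reversed(recent))
```

Dropping the oldest messages is the simplest policy. Summarizing them instead is a common alternative, but it is outside this sketch.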

Dispatch. The moment a tool runs is the moment the agent touches the world. Permissions, allowlists, approval gates, and audit logs belong here. A prompt that says "do not edit backup files" is weaker than a file tool that refuses .bak paths.
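A dispatch gate along these lines might look like the following. It is a sketch under assumptions: the `Dispatcher` class, the `approve` callback (standing in for a human approval step), and the JSON audit record shape are all illustrative names, not an existing API.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


class PolicyError(Exception):
    """Raised when a tool call is refused before it runs."""


@dataclass
class Dispatcher:
    tools: Dict[str, Callable[..., str]]
    allowlist: Set[str]
    needs_approval: Set[str]
    approve: Callable[[str, dict], bool]    # hypothetical approval hook
    audit_log: List[str] = field(default_factory=list)

    def _audit(self, tool: str, args: dict, decision: str) -> None:
        # Every decision is logged, including refusals.
        record = {"ts": time.time(), "tool": tool, "args": args, "decision": decision}
        self.audit_log.append(json.dumps(record))

    def dispatch(self, tool: str, args: dict) -> str:
        if tool not in self.allowlist or tool not in self.tools:
            self._audit(tool, args, "denied")
            raise PolicyError(f"tool {tool!r} is not allowed")
        if tool in self.needs_approval and not self.approve(tool, args):
            self._audit(tool, args, "rejected")
            raise PolicyError(f"tool {tool!r} was not approved")
        # The decision is logged before the tool runs, so a crash still leaves a record.
        self._audit(tool, args, "allowed")
        return self.tools[tool](**args)
```

The point of the design is that the refusal happens in code the model cannot talk its way around. The prompt can still explain the rules, but the gate enforces them.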

Verification. Agents are good at saying work is done. The harness should check claims that matter. If the model says "tests pass," the harness can run the test command or mark the claim as unverified. If the model says "deployment succeeded," the harness can require the deploy artifact or log link.
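For the "tests pass" case, a harness-side check could be as small as the sketch below. The `Claim` type and `verify_tests_pass` function are hypothetical, and the test command is whatever your project actually uses; the demo at the bottom substitutes a one-line Python process that exits with a failure code.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    status: str          # "verified", "refuted", or "unverified"
    evidence: str = ""


def verify_tests_pass(claim_text: str, test_command: List[str], timeout_s: int = 300) -> Claim:
    """Run the test command and judge the claim by its exit code."""
    if not test_command:
        return Claim(claim_text, "unverified", "no test command configured")
    try:
        proc = subprocess.run(
            test_command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # If the check itself cannot run, the claim stays unverified, not refuted.
        return Claim(claim_text, "unverified", f"could not run tests: {exc}")
    status = "verified" if proc.returncode == 0 else "refuted"
    return Claim(claim_text, status, f"exit code {proc.returncode}")


# Demo: the model says tests pass, but the (stand-in) test command fails.
result = verify_tests_pass("tests pass", [sys.executable, "-c", "raise SystemExit(1)"])
print(result.status, "-", result.evidence)
```

Keeping "unverified" as a distinct status matters: it lets the harness surface claims it could not check instead of quietly treating them as true or false.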

One concrete example: an agent was asked to fix auth.ts and edited auth.ts.bak instead. The model looked confident, and the diff looked plausible. The fix was not a better instruction. It was a dispatch policy: only editable source extensions, explicit path matching, and a unit test around the file tool.
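A policy of that shape might look like the sketch below. It is not the code from the incident: the `check_edit` function, the `EDITABLE_EXTENSIONS` set, and the idea of passing in the task's expected files are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Set

# Assumption: which extensions count as editable source is project-specific.
EDITABLE_EXTENSIONS = {".ts", ".tsx", ".js", ".py"}


class EditDenied(Exception):
    """Raised when the file tool refuses an edit."""


def check_edit(requested: str, task_files: Set[str]) -> str:
    """Return the normalized path if the edit is allowed, else raise EditDenied."""
    path = PurePosixPath(requested)
    # Only the final suffix counts, so "auth.ts.bak" is ".bak", not ".ts".
    if path.suffix not in EDITABLE_EXTENSIONS:
        raise EditDenied(f"{requested}: extension {path.suffix!r} is not editable")
    if ".." in path.parts:
        raise EditDenied(f"{requested}: parent-directory segments are not allowed")
    # Explicit match against the files the task names: no prefix or fuzzy matching.
    normalized = str(path)
    if normalized not in task_files:
        raise EditDenied(f"{requested}: not one of the files named in the task")
    return normalized


def test_backup_file_is_refused() -> None:
    try:
        check_edit("src/auth.ts.bak", {"src/auth.ts"})
    except EditDenied:
        return
    raise AssertionError("edit to a .bak file should be refused")


def test_named_source_file_is_allowed() -> None:
    assert check_edit("./src/auth.ts", {"src/auth.ts"}) == "src/auth.ts"
```

The two tests are the unit test around the file tool: one pins the exact failure that happened, the other confirms the intended edit still goes through.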

This is why I think the highest-leverage agent work is below the model. Narrow tools. Typed outputs. Real permissions. Sandboxed execution. Memory with provenance. Traces you can inspect. These are not glamorous pieces, but they decide whether an agent is a demo or something you can hand to a team.

If you build agents, ask what the harness enforces. If you buy agents, ask to see the tool policy and logs. If you use agents, remember that a bad result is often not "the AI being weird" but a missing check around the AI.
