Harness Engineering
The model is a CPU. Without an operating system -- the harness -- it's powerful but inert. Reliability is the real work. Prompting is the easiest part.
What Harness Engineering Is
Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. It is not prompt engineering (what you ask the model), not context engineering (what the model sees), but the operational layer governing how the entire system runs.
The equation: Agent = Model + Harness
Everything except the model weights is harness: tools, permissions, validation gates, memory management, retry logic, escalation rules, observability, checkpointing, and the feedback loops that turn failures into structural improvements.
Why It Matters Now
Agents became simultaneously useful and unreliable in 2025-2026. They can execute code, call APIs, and reason through complex tasks – but without structural guardrails they “confidently make the same stupid mistake again and again.” The industry realized the performance ceiling isn’t model intelligence; it’s the quality of the surrounding infrastructure. Research from Stanford and LangChain demonstrated a 6x performance difference using the exact same LLM with different harnesses.
The OS Analogy
The most useful mental model for harness engineering maps directly to operating system architecture:
| OS Component | Agent Equivalent | Role |
|---|---|---|
| CPU | LLM | Raw processing/reasoning capability |
| RAM | Context Window | Fast but limited working memory |
| Disk | External Databases / Vector Stores | Persistent storage beyond context |
| Device Drivers | Tool Integrations (APIs, browsers, code) | Interface to external systems |
| Operating System | The Harness | Manages state, memory, tool calls, scheduling, permissions |
A CPU without an OS can’t do anything useful for an end user. Same with an LLM without a harness.
Three-Layer Framework
| Layer | Controls | Example |
|---|---|---|
| Prompt Engineering | What you ask the model | Crafting effective instructions |
| Context Engineering | What information the model sees | Retrieved docs, schemas, summaries, conversation history |
| Harness Engineering | How the whole system operates | Tools, permissions, validation, monitoring, retries, guardrails, checkpointing |
Context engineering makes sure the model sees the right database schema. Harness engineering is the reason it still has to run the linter, pass the tests, and respect permissions.
The Six Core Components
1. Orchestration Logic
The control flow governing how the agent progresses through tasks. Not a single monolithic prompt – structured patterns that decompose work and manage sequencing.
Key patterns:
- Prompt Chaining – sequential steps with validation between each
- Routing – directing to specialized sub-agents based on intent
- Parallelization – running independent sub-tasks concurrently
- Evaluator-Optimizer loops – generating, checking, and refining iteratively
See: Agent Orchestration & Handoffs
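The first of these patterns can be sketched in a few lines. Below is a minimal, illustrative prompt chain with a validation gate between steps; `call_model` is a stand-in for any LLM client, and the validators are hypothetical examples, not part of a real framework:

```python
# Sketch of prompt chaining with a validation gate between steps.
# `call_model` is a placeholder for a real LLM call; each step's output
# must pass its validator before it is fed forward to the next step.

from typing import Callable

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"output for: {prompt}"

def chain(steps: list[tuple[str, Callable[[str], bool]]]) -> list[str]:
    """Run prompts in sequence; halt the chain on any validation failure."""
    outputs: list[str] = []
    context = ""
    for prompt, validate in steps:
        result = call_model(context + prompt)
        if not validate(result):
            raise ValueError(f"validation failed at step: {prompt!r}")
        outputs.append(result)
        context = result + "\n"  # feed the validated output forward
    return outputs

results = chain([
    ("extract the requirements", lambda r: len(r) > 0),
    ("draft the implementation plan", lambda r: "output" in r),
])
```

The point is structural: a bad intermediate result stops the chain at the gate instead of silently contaminating every downstream step.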
2. Memory & State Management
Multi-layered persistence beyond the single context window:
- Working context – current task state within the session
- Session state – progress tracking across context window boundaries
- Long-term memory – artifacts, git commits, progress logs across sessions
- Externalized knowledge – documentation, specs, runbooks the agent can query
The Anthropic two-agent pattern for long-running agents demonstrates this: an Initializer Agent sets up the environment and creates a feature list + progress tracking file, then a Coding Agent picks up in subsequent sessions by reading those artifacts.
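A minimal sketch of that artifact handoff, assuming illustrative file names (`features.json`, `progress.md`) rather than any specific Anthropic format: the initializer writes the artifacts, and a later session reads them back to select the next incomplete feature.

```python
# Sketch of the initializer/coding-agent handoff via persistent artifacts.
# File names and the feature schema are illustrative assumptions.

import json
from pathlib import Path

def initialize(workdir: Path) -> None:
    """Initializer agent: write the feature list and progress file."""
    features = [
        {"name": "login", "priority": 1, "done": False},
        {"name": "search", "priority": 2, "done": False},
    ]
    (workdir / "features.json").write_text(json.dumps(features, indent=2))
    (workdir / "progress.md").write_text("# Progress\n\n(no sessions yet)\n")

def next_feature(workdir: Path) -> dict:
    """Coding agent, session N: read artifacts, pick the next feature."""
    features = json.loads((workdir / "features.json").read_text())
    incomplete = [f for f in features if not f["done"]]
    return min(incomplete, key=lambda f: f["priority"])
```

Because state lives on disk rather than in the context window, a fresh session with an empty context can orient itself entirely from the artifacts.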
3. Verification & Feedback Loops
Two types of controls that keep agents on track:
Guides (Feedforward) – anticipate and prevent unwanted behavior before execution:
- Architecture documentation
- Code standards and style guides
- Explicit skill instructions and constraints
Sensors (Feedback) – observe after agent action and enable self-correction:
- Test results, linter output, type checker errors
- Code review agents (inferential sensors)
- Production telemetry and drift detection
Martin Fowler’s framework distinguishes computational controls (fast, deterministic: tests, linters) from inferential controls (slower, semantic: LLM-based review). Stack both.
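The stacking order matters: cheap deterministic sensors should run first and short-circuit, so the expensive inferential sensor only sees code that already passes them. A minimal sketch, with `run_linter`, `run_tests`, and `llm_review` as stand-ins for real tools:

```python
# Sketch of stacking computational sensors (fast, deterministic) ahead of
# an inferential sensor (slow, semantic). All three sensors are stubs.

def run_linter(code: str) -> list[str]:
    return []  # deterministic: style and static-analysis findings

def run_tests(code: str) -> list[str]:
    return []  # deterministic: failing test names

def llm_review(code: str) -> list[str]:
    return []  # inferential: semantic issues from an LLM reviewer

def verify(code: str) -> list[str]:
    # Computational sensors run first and short-circuit on any finding.
    for sensor in (run_linter, run_tests):
        issues = sensor(code)
        if issues:
            return issues
    # Only code that passes the cheap gates reaches the expensive one.
    return llm_review(code)
```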
4. Safety Guardrails
Hard constraints that prevent illegal, unsafe, or out-of-scope actions:
- Input guardrails – content classification, PII detection, prompt injection detection
- Output guardrails – hallucination checks, PII scrubbing, format validation
- Tool use guardrails – permission boundaries, argument validation, rate limits
See: Evals & Guardrails
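A tool-use guardrail of the third kind can be sketched as a wrapper that checks an allowlist and per-tool argument validators before anything executes. The tool names and the path-traversal check are illustrative assumptions:

```python
# Sketch of a tool-use guardrail: permission boundary plus argument
# validation, enforced before the tool call runs. Names are illustrative.

ALLOWED_TOOLS = {"read_file", "run_tests"}

def validate_args(tool: str, args: dict) -> None:
    """Per-tool argument validation; raises on unsafe arguments."""
    if tool == "read_file" and ".." in args.get("path", ""):
        raise PermissionError("path traversal blocked")

def guarded_call(tool: str, args: dict, execute) -> object:
    """Run a tool call only if it passes the permission boundary."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} not permitted")
    validate_args(tool, args)
    return execute(tool, args)
```

Because the check lives in the harness rather than the prompt, the agent cannot be talked out of it.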
5. Planning & Decomposition
Structured task sequencing rather than monolithic attempts. The GAN-inspired Planner-Generator-Evaluator model is emerging as a standard architecture:
```mermaid
flowchart TD
    A["User Goal"] --> B{"Planner Agent"}
    B --> C["Generator Agent"]
    C --> D["Running Application / Sandbox"]
    D --> E["Evaluator Agent"]
    E -- "Failure + Feedback" --> B
    E -- "Success" --> F["Final Output"]
    subgraph Harness
        B
        C
        E
    end
```
The key insight: don’t “one-shot” complex tasks. Break them into plannable, verifiable steps with feedback loops between stages.
6. Modularity & Extensibility
Harness components should be pluggable – independently enabled, disabled, or replaced. This enables:
- Swapping models without rebuilding the harness
- Adding new tools without changing orchestration logic
- Transferring harnesses across projects with similar topologies
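One way to get that pluggability, sketched under the assumption that every component exposes the same small interface, is a registry the orchestrator consults without knowing which components are installed:

```python
# Sketch of pluggable harness components behind one small interface, so
# any component can be enabled, disabled, or swapped without touching
# orchestration logic. The RateLimiter is an illustrative example.

from typing import Protocol

class HarnessComponent(Protocol):
    name: str
    def check(self, action: dict) -> bool: ...

class RateLimiter:
    name = "rate_limiter"
    def __init__(self, limit: int):
        self.limit, self.count = limit, 0
    def check(self, action: dict) -> bool:
        self.count += 1
        return self.count <= self.limit

class Harness:
    def __init__(self):
        self.components: dict[str, HarnessComponent] = {}
    def enable(self, c: HarnessComponent) -> None:
        self.components[c.name] = c
    def disable(self, name: str) -> None:
        self.components.pop(name, None)
    def permit(self, action: dict) -> bool:
        return all(c.check(action) for c in self.components.values())
```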
The Regulation Dimensions
Martin Fowler’s ThoughtWorks framework identifies three harness regulation dimensions at different maturity levels:
Maintainability Harness (Most Mature)
Internal code quality via computational sensors – catches duplication, complexity, coverage gaps, style violations. Well-understood, largely solvable with existing tools (linters, type checkers, coverage tools).
Architecture Fitness Harness (Emerging)
Performance requirements, observability standards, architectural constraints. Uses fitness functions and architectural tests. Combines computational and inferential controls.
Behaviour Harness (Least Mature)
Functional correctness verification – does the agent actually do the right thing? Current approaches rely on specs + AI-generated tests + manual testing. The hardest problem; over-reliance on AI-generated test quality is a known gap.
The Steering Loop
Harness engineering is not a one-time build – it’s an ongoing feedback cycle between human engineers and the harness:
```mermaid
flowchart LR
    A["Agent makes mistake"] --> B["Human identifies failure pattern"]
    B --> C["Engineer structural fix into harness"]
    C --> D["Harness prevents recurrence"]
    D --> E["Agent operates within tighter constraints"]
    E --> A
```
“Every time the agent makes a mistake, don’t just hope it does better next time. Engineer the environment so it can’t make that specific mistake the same way again.”
This is the defining principle. Not prompt tuning. Structural, environmental hardening.
Anthropic’s Two-Agent Pattern for Long-Running Agents
A concrete harness architecture from Anthropic’s engineering team for agents that work across multiple sessions:
```mermaid
flowchart TD
    A["Task Input"] --> B["Initializer Agent"]
    B --> C["Creates Feature List JSON"]
    B --> D["Sets up Git Repo + init.sh"]
    B --> E["Writes Progress File"]
    C --> F["Coding Agent - Session N"]
    D --> F
    E --> F
    F --> G["Reads progress + git log"]
    G --> H["Selects highest-priority incomplete feature"]
    H --> I["Implements + tests single feature"]
    I --> J["Commits + updates progress file"]
    J --> K{"More features?"}
    K -- "Yes" --> F
    K -- "No" --> L["Done"]
```
Session initialization checklist:
- Run `pwd` to confirm working directory
- Read git logs and progress files
- Select highest-priority incomplete feature
- Run dev server and perform baseline tests
Key failure modes and solutions:
| Problem | Harness Solution |
|---|---|
| Agent declares victory too early | Feature list file + single-feature focus per session |
| Buggy/undocumented progress | Git repo + progress notes + startup verification |
| Premature feature marking | Explicit self-verification requirement |
| Time wasted understanding app state | Pre-written init.sh + orientation checklist |
AgentSpec: Runtime Safety as a DSL
AgentSpec (ICSE 2026) is the first framework that systematically enforces customizable safety constraints on LLM agents at runtime using a domain-specific language.
Rules are composed of three elements:
- Trigger – the event that fires the rule (e.g., agent executing a financial transaction)
- Predicate – the condition to check (e.g., transaction amount > threshold)
- Enforcement – the action to take (e.g., require user confirmation)
This approach makes safety constraints declarative, auditable, and portable across agents – a significant step beyond hardcoded guardrails in application code. Validated across code execution, embodied agents, and autonomous driving domains.
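To make the three-part rule structure concrete, here is an illustrative Python analogue of a trigger/predicate/enforcement rule. This mirrors AgentSpec's rule anatomy but is not the paper's actual DSL syntax; all names and thresholds are assumptions:

```python
# Illustrative trigger/predicate/enforcement rule in Python, mirroring
# AgentSpec's three-part structure (not its actual DSL syntax).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: str                          # event name that fires the rule
    predicate: Callable[[dict], bool]     # condition on the event payload
    enforcement: Callable[[dict], str]    # action taken when it holds

require_confirmation = Rule(
    trigger="financial_transaction",
    predicate=lambda e: e["amount"] > 1000,
    enforcement=lambda e: "require_user_confirmation",
)

def apply_rules(event: str, payload: dict, rules: list[Rule]) -> list[str]:
    """Return enforcement actions for all rules matching this event."""
    return [r.enforcement(payload) for r in rules
            if r.trigger == event and r.predicate(payload)]
```

Keeping rules as data rather than application code is what makes them auditable and portable across agents.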
Key Research Findings
| Finding | Source | Impact |
|---|---|---|
| Harness logic migration (code to Natural Language Agent Harness) increased benchmark performance from 30.4% to 47.2% | Pan et al., Tsinghua | +16.8% from harness change alone |
| Same model went from outside Top 30 to Top 5 by improving only the harness | LangChain TerminalBench 2.0 | 6x performance gap attributable to harness |
| Self-Evolution (narrowing agent focus until failure signals justify broadening) was the only module that consistently improved performance | Pan et al. | +4.8% on SWE-bench |
| Meta-Harness automated optimization achieved new SOTA with 4x fewer tokens | Lee et al., Stanford | +7.7%, dramatically better efficiency |
| A harness optimized for one model can transfer to others and improve their performance | Pan et al. | Harness investment is model-portable |
| 1M+ lines of generated code with zero manual source code | OpenAI internal | Only possible with extensive harness infrastructure |
| 1,000+ weekly agent-generated PRs in isolated, CI-limited environments | Stripe | Production-scale harness engineering |
The Evolution Timeline
- 2023-2024: Prompt Engineering – "Make the model understand what you want"
- 2024-2025: Context Engineering – "Give the model the right information"
- 2025: Vibe Coding / Spec Coding – "Write specs, let agents code"
- 2026: Harness Engineering – "Design the environment where agents operate safely"; focus on runtime constraints, structural reliability, feedback loops, and environment-driven governance
Harnessability: Designing for Agent Governance
Not all codebases are equally governable by harnesses. Harnessability refers to the environmental properties that make agent governance effective:
- Strongly-typed languages – type checkers catch more errors computationally
- Clear module boundaries – easier to scope agent permissions
- Machine-readable documentation – agents can ingest specs (`AGENTS.md`, `CLAUDE.md`)
- Comprehensive test suites – verification loops have something to verify against
- Well-defined topologies – standard service shapes enable reusable harness templates
Greenfield teams can bake in harnessability from day one. Legacy teams face harder constraints but can incrementally improve.
Harness Templates
Pre-defined bundles of guides and sensors for common service topologies (CRUD services, event processors, data dashboards). Enable organizations to apply consistent harness controls across similar systems – an emerging best practice from Stripe and ThoughtWorks.
Practical Implications
For Platform Teams
The role of an AI platform team is shifting from “make the model smarter” to “design the environment where agents operate safely.” Invest in harness infrastructure (validation gates, permission boundaries, observability) over model selection and prompt optimization.
For Engineering Managers
Harness engineering is a new hiring and skill-development category. Engineers who can design verification loops, permission models, and checkpoint/retry systems for agents are more valuable than prompt engineers.
For Individual Engineers
The programmer’s role shifts from manual implementation toward:
- Creating machine-readable documentation
- Building evaluation frameworks
- Designing structural tests and boundaries
- Implementing observability and traceability
Anti-Patterns
- Prompt-only reliability – relying on system prompt instructions to prevent failure. LLMs can be prompted to ignore instructions; structural controls are required.
- One-shot complex tasks – asking the agent to do everything in a single pass. Decompose, verify, iterate.
- Hope-based debugging – “maybe it’ll do better next time.” Engineer the environment so the specific failure mode is structurally impossible.
- Monolithic harness – building one massive, tightly-coupled control system. Keep components modular and independently testable.
- Over-constraining – guardrails that trigger on 10%+ of legitimate requests. Tune for precision; over-blocking kills adoption.
References
Core Papers
- Pan et al., “Natural-Language Agent Harnesses” (Tsinghua University, March 2026): arXiv:2603.25723
- Lee et al., “Meta-Harness: Automated Optimization of Agent Harnesses End-to-End” (Stanford University, March 2026): arXiv:2603.28052v1
- DeepMind, “AutoHarness: Code Harness Generation for Game Environments” (March 2026)
- “AgentSpec: Runtime Safety Constraints as a Domain-Specific Language” (ICSE 2026): arXiv:2503.18666
Industry Sources
- Anthropic, “Building Effective Agents” (December 2024): anthropic.com/research
- Anthropic, “Effective Harnesses for Long-Running Agents” (November 2025): anthropic.com/engineering
- OpenAI, Zero-Manual-Code Experiment Report (2025-2026)
- LangChain, TerminalBench 2.0 Results (March 2026): blog.langchain.dev
Analysis & Commentary
- Louis Bouchard, “Harness Engineering: The Missing Layer Behind AI Agents”: louisbouchard.ai
- Martin Fowler / ThoughtWorks, “Harness Engineering for Coding Agent Users”: martinfowler.com
- Cobus Greyling, “The Rise of AI Harness Engineering”: substack
Video Source
- “Rethinking AI Agents: The Rise of Harness Engineering”: YouTube