Harness Engineering

The model is a CPU. Without an operating system -- the harness -- it's powerful but inert. Reliability is the real work. Prompting is the easiest part.


What Harness Engineering Is

Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. It is not prompt engineering (what you ask the model), not context engineering (what the model sees), but the operational layer governing how the entire system runs.

The equation: Agent = Model + Harness

Everything except the model weights is harness: tools, permissions, validation gates, memory management, retry logic, escalation rules, observability, checkpointing, and the feedback loops that turn failures into structural improvements.

Why It Matters Now

Agents became simultaneously useful and unreliable in 2025-2026. They can execute code, call APIs, and reason through complex tasks – but without structural guardrails they “confidently make the same stupid mistake again and again.” The industry realized the performance ceiling isn’t model intelligence; it’s the quality of the surrounding infrastructure. Research from Stanford and LangChain demonstrated a 6x performance difference using the exact same LLM with different harnesses.


The OS Analogy

The most useful mental model for harness engineering maps directly to operating system architecture:

| OS Component | Agent Equivalent | Role |
| --- | --- | --- |
| CPU | LLM | Raw processing/reasoning capability |
| RAM | Context window | Fast but limited working memory |
| Disk | External databases / vector stores | Persistent storage beyond context |
| Device drivers | Tool integrations (APIs, browsers, code) | Interface to external systems |
| Operating system | The harness | Manages state, memory, tool calls, scheduling, permissions |

A CPU without an OS can’t do anything useful for an end user. Same with an LLM without a harness.


Three-Layer Framework

| Layer | Controls | Example |
| --- | --- | --- |
| Prompt engineering | What you ask the model | Crafting effective instructions |
| Context engineering | What information the model sees | Retrieved docs, schemas, summaries, conversation history |
| Harness engineering | How the whole system operates | Tools, permissions, validation, monitoring, retries, guardrails, checkpointing |

Context engineering makes sure the model sees the right database schema. Harness engineering is the reason it still has to run the linter, pass the tests, and respect permissions.


The Six Core Components

1. Orchestration Logic

The control flow governing how the agent progresses through tasks. Not a single monolithic prompt – structured patterns that decompose work and manage sequencing.

Key patterns:

  • Prompt Chaining – sequential steps with validation between each
  • Routing – directing to specialized sub-agents based on intent
  • Parallelization – running independent sub-tasks concurrently
  • Evaluator-Optimizer loops – generating, checking, and refining iteratively

See: Agent Orchestration & Handoffs
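A minimal sketch of the first pattern, prompt chaining with a validation gate between steps. The `call_model` stub, step names, and validators are illustrative placeholders, not any specific framework's API:

```python
# Prompt chaining: each step's output is validated before it becomes
# the next step's input; a failed gate short-circuits the chain.

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"output for: {prompt}"

def chain(steps, validators, task):
    """Run steps in sequence; raise as soon as a validation gate fails."""
    result = task
    for step, validate in zip(steps, validators):
        result = call_model(f"{step}\n\nInput: {result}")
        if not validate(result):
            raise ValueError(f"validation failed after step: {step!r}")
    return result

steps = ["Extract the requirements", "Draft the SQL", "Explain the SQL"]
validators = [lambda r: len(r) > 0, lambda r: "output" in r, lambda r: True]
final = chain(steps, validators, "Show revenue by region")
```

The same skeleton extends to routing (pick the step list by intent) and parallelization (run independent chains concurrently and merge).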

2. Memory & State Management

Multi-layered persistence beyond the single context window:

  • Working context – current task state within the session
  • Session state – progress tracking across context window boundaries
  • Long-term memory – artifacts, git commits, progress logs across sessions
  • Externalized knowledge – documentation, specs, runbooks the agent can query

The Anthropic two-agent pattern for long-running agents demonstrates this: an Initializer Agent sets up the environment and creates a feature list + progress tracking file, then a Coding Agent picks up in subsequent sessions by reading those artifacts.
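The handoff above can be sketched as two functions sharing a progress artifact. The file name and JSON shape are illustrative, not Anthropic's actual artifact format:

```python
# Initializer writes a feature list; a later session reads it and picks
# the highest-priority incomplete feature -- state survives the context window.
import json
import os
import tempfile

def initialize(path: str) -> None:
    """Initializer agent: persist a feature list with priority + status."""
    features = [
        {"name": "login", "priority": 1, "done": False},
        {"name": "search", "priority": 2, "done": False},
    ]
    with open(path, "w") as fh:
        json.dump(features, fh)

def next_feature(path: str):
    """Coding agent: read the artifact, select the next incomplete feature."""
    with open(path) as fh:
        features = json.load(fh)
    todo = [f for f in features if not f["done"]]
    return min(todo, key=lambda f: f["priority"])["name"] if todo else None

path = os.path.join(tempfile.mkdtemp(), "features.json")
initialize(path)
picked = next_feature(path)
```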

3. Verification & Feedback Loops

Two types of controls that keep agents on track:

Guides (Feedforward) – anticipate and prevent unwanted behavior before execution:

  • Architecture documentation
  • Code standards and style guides
  • Explicit skill instructions and constraints

Sensors (Feedback) – observe after agent action and enable self-correction:

  • Test results, linter output, type checker errors
  • Code review agents (inferential sensors)
  • Production telemetry and drift detection

Martin Fowler’s framework distinguishes computational controls (fast, deterministic: tests, linters) from inferential controls (slower, semantic: LLM-based review). Stack both.
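A sketch of that stacking order, with a toy linter as the computational sensor and a stubbed review agent as the inferential one (both are illustrative, not real tools):

```python
# Stack sensors: run cheap deterministic checks first, and only escalate
# code that passes them to the slower LLM-based reviewer.

def lint(code: str) -> list[str]:
    """Toy computational sensor: flag tab characters and overlong lines."""
    issues = []
    for i, line in enumerate(code.splitlines(), 1):
        if "\t" in line:
            issues.append(f"line {i}: tab character")
        if len(line) > 100:
            issues.append(f"line {i}: too long")
    return issues

def llm_review(code: str) -> list[str]:
    """Stub inferential sensor; a real harness would call a review agent."""
    return []

def run_sensors(code: str) -> list[str]:
    issues = lint(code)
    return issues if issues else llm_review(code)

feedback = run_sensors("def f():\n\treturn 1")
```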

4. Safety Guardrails

Hard constraints that prevent illegal, unsafe, or out-of-scope actions:

  • Input guardrails – content classification, PII detection, prompt injection detection
  • Output guardrails – hallucination checks, PII scrubbing, format validation
  • Tool use guardrails – permission boundaries, argument validation, rate limits

See: Evals & Guardrails
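A minimal sketch of the third category, a tool-use guardrail that checks permission boundaries and validates arguments before dispatch. Tool names and limits are made up for illustration:

```python
# Tool-use guardrail: every call passes through this check before the
# harness dispatches it to the actual tool.

ALLOWED_TOOLS = {"read_file", "run_tests"}
MAX_PATH_LEN = 255

def check_tool_call(tool: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason); the harness dispatches only when allowed."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool {tool!r} outside permission boundary"
    path = args.get("path", "")
    if ".." in path or len(path) > MAX_PATH_LEN:
        return False, "suspicious path argument"
    return True, "ok"

ok, _ = check_tool_call("read_file", {"path": "src/app.py"})
blocked, reason = check_tool_call("delete_repo", {})
```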

5. Planning & Decomposition

Structured task sequencing rather than monolithic attempts. The GAN-inspired Planner-Generator-Evaluator model is emerging as a standard architecture:

```mermaid
flowchart TD
    A["User Goal"] --> B{"Planner Agent"}
    B --> C["Generator Agent"]
    C --> D["Running Application / Sandbox"]
    D --> E["Evaluator Agent"]
    E -- "Failure + Feedback" --> B
    E -- "Success" --> F["Final Output"]

    subgraph Harness
    B
    C
    E
    end
```

The key insight: don’t “one-shot” complex tasks. Break them into plannable, verifiable steps with feedback loops between stages.
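The loop's control flow can be sketched in a few lines. The generator and evaluator here are trivial stubs (a counter converging on a target) so only the harness-side logic is real:

```python
# Planner-Generator-Evaluator loop: generate, check, route failure
# feedback back to the planner, stop on success or budget exhaustion.

def generate(plan: str, feedback: str) -> int:
    """Stub generator: 'improves' as corrective feedback accumulates."""
    return len(feedback)

def evaluate(output: int) -> tuple[bool, str]:
    """Stub evaluator: success once the output reaches a target value."""
    target = 3
    if output >= target:
        return True, ""
    return False, "x"  # one unit of corrective feedback

def solve(plan: str, max_iters: int = 10):
    feedback = ""
    for i in range(max_iters):
        out = generate(plan, feedback)
        ok, fb = evaluate(out)
        if ok:
            return out, i + 1
        feedback += fb  # failure + feedback routed back into planning
    raise RuntimeError("iteration budget exhausted")

result, iterations = solve("build feature")
```

The budget cap matters in practice: without it, a generator that never satisfies the evaluator loops forever.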

6. Modularity & Extensibility

Harness components should be pluggable – independently enabled, disabled, or replaced. This enables:

  • Swapping models without rebuilding the harness
  • Adding new tools without changing orchestration logic
  • Transferring harnesses across projects with similar topologies
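One way to get that pluggability, sketched here with a hypothetical `Sensor` protocol: components register against a common interface, so they can be enabled, disabled, or swapped without touching orchestration code:

```python
# Pluggable harness components: sensors implement one protocol and are
# registered/unregistered independently of the orchestration logic.
from typing import Protocol

class Sensor(Protocol):
    name: str
    def check(self, output: str) -> list[str]: ...

class NonEmptySensor:
    name = "non_empty"
    def check(self, output: str) -> list[str]:
        return [] if output.strip() else ["empty output"]

class Harness:
    def __init__(self) -> None:
        self.sensors: dict[str, Sensor] = {}

    def register(self, sensor: Sensor) -> None:
        self.sensors[sensor.name] = sensor

    def unregister(self, name: str) -> None:
        self.sensors.pop(name, None)

    def run(self, output: str) -> list[str]:
        return [issue for s in self.sensors.values() for issue in s.check(output)]

h = Harness()
h.register(NonEmptySensor())
issues = h.run("   ")
```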

The Regulation Dimensions

Martin Fowler’s ThoughtWorks framework identifies three harness regulation dimensions at different maturity levels:

Maintainability Harness (Most Mature)

Internal code quality via computational sensors – catches duplication, complexity, coverage gaps, style violations. Well-understood, largely solvable with existing tools (linters, type checkers, coverage tools).

Architecture Fitness Harness (Emerging)

Performance requirements, observability standards, architectural constraints. Uses fitness functions and architectural tests. Combines computational and inferential controls.

Behaviour Harness (Least Mature)

Functional correctness verification – does the agent actually do the right thing? Current approaches rely on specs + AI-generated tests + manual testing. The hardest problem; over-reliance on AI-generated test quality is a known gap.


The Steering Loop

Harness engineering is not a one-time build – it’s an ongoing feedback cycle between human engineers and the harness:

```mermaid
flowchart LR
    A["Agent makes mistake"] --> B["Human identifies failure pattern"]
    B --> C["Engineer structural fix into harness"]
    C --> D["Harness prevents recurrence"]
    D --> E["Agent operates within tighter constraints"]
    E --> A
```

“Every time the agent makes a mistake, don’t just hope it does better next time. Engineer the environment so it can’t make that specific mistake the same way again.”

This is the defining principle. Not prompt tuning. Structural, environmental hardening.
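What a single turn of that loop produces, in miniature: a failure observed once becomes a permanent check. The secret-leak scenario and check names are invented for illustration:

```python
# After the agent once committed a credential, the fix is not a prompt
# tweak but a structural gate every future commit must pass.

def forbids_secrets(diff: str) -> bool:
    """Structural fix added after an observed credential leak."""
    return "API_KEY=" not in diff

HARNESS_CHECKS = [forbids_secrets]  # grows as failure patterns are found

def accept_commit(diff: str) -> bool:
    return all(check(diff) for check in HARNESS_CHECKS)

blocked = accept_commit("API_KEY=sk-123")
allowed = accept_commit("print('hello')")
```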


Anthropic’s Two-Agent Pattern for Long-Running Agents

A concrete harness architecture from Anthropic’s engineering team for agents that work across multiple sessions:

```mermaid
flowchart TD
    A["Task Input"] --> B["Initializer Agent"]
    B --> C["Creates Feature List JSON"]
    B --> D["Sets up Git Repo + init.sh"]
    B --> E["Writes Progress File"]

    C --> F["Coding Agent - Session N"]
    D --> F
    E --> F

    F --> G["Reads progress + git log"]
    G --> H["Selects highest-priority incomplete feature"]
    H --> I["Implements + tests single feature"]
    I --> J["Commits + updates progress file"]
    J --> K{"More features?"}
    K -- "Yes" --> F
    K -- "No" --> L["Done"]
```

Session initialization checklist:

  1. Run pwd to confirm working directory
  2. Read git logs and progress files
  3. Select highest-priority incomplete feature
  4. Run dev server and perform baseline tests
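The checklist can be encoded as a startup routine so orientation is never skipped. The progress/git reads are injected as callables (stubbed here), since the point is the control flow, not the commands:

```python
# Session-orientation routine: confirm location, read artifacts,
# select the next feature -- before any implementation work begins.
import os

def orient(read_progress, read_git_log):
    """Run the startup checklist; return a context dict for the session."""
    ctx = {"cwd": os.getcwd()}           # 1. confirm working directory
    ctx["progress"] = read_progress()    # 2. read progress artifacts
    ctx["git_log"] = read_git_log()      #    ...and recent history
    todo = [f for f in ctx["progress"] if not f["done"]]
    ctx["next"] = min(todo, key=lambda f: f["priority"])["name"] if todo else None
    return ctx                           # 3. feature selected; 4. baseline tests run next

ctx = orient(
    read_progress=lambda: [{"name": "export", "priority": 1, "done": False}],
    read_git_log=lambda: ["init commit"],
)
```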

Key failure modes and solutions:

| Problem | Harness Solution |
| --- | --- |
| Agent declares victory too early | Feature list file + single-feature focus per session |
| Buggy/undocumented progress | Git repo + progress notes + startup verification |
| Premature feature marking | Explicit self-verification requirement |
| Time wasted understanding app state | Pre-written init.sh + orientation checklist |

AgentSpec: Runtime Safety as a DSL

AgentSpec (ICSE 2026) is the first framework that systematically enforces customizable safety constraints on LLM agents at runtime using a domain-specific language.

Rules are composed of three elements:

  • Trigger – the event that fires the rule (e.g., agent executing a financial transaction)
  • Predicate – the condition to check (e.g., transaction amount > threshold)
  • Enforcement – the action to take (e.g., require user confirmation)

This approach makes safety constraints declarative, auditable, and portable across agents – a significant step beyond hardcoded guardrails in application code. Validated across code execution, embodied agents, and autonomous driving domains.
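An illustrative Python analogue of the trigger/predicate/enforcement structure; this is not the paper's actual DSL syntax, just the shape of a rule rendered as data:

```python
# AgentSpec-style rule, modeled as data: an event fires the trigger,
# the predicate inspects the payload, and the enforcement is returned.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: str                        # event name that fires the rule
    predicate: Callable[[dict], bool]   # condition on the event payload
    enforcement: str                    # e.g. "allow" / "confirm" / "block"

rules = [
    Rule("financial_transaction", lambda e: e["amount"] > 1000, "confirm"),
]

def enforce(event: str, payload: dict, rules: list[Rule]) -> str:
    for r in rules:
        if r.trigger == event and r.predicate(payload):
            return r.enforcement
    return "allow"

decision = enforce("financial_transaction", {"amount": 5000}, rules)
```

Because rules are plain data, they can be audited, versioned, and shipped to a different agent unchanged, which is the portability claim above.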


Key Research Findings

| Finding | Source | Impact |
| --- | --- | --- |
| Harness logic migration (code to Natural Language Agent Harness) increased benchmark performance from 30.4% to 47.2% | Pan et al., Tsinghua | +16.8% from harness change alone |
| Same model went from outside Top 30 to Top 5 by improving only the harness | LangChain TerminalBench 2.0 | 6x performance gap attributable to harness |
| Self-Evolution (narrowing agent focus until failure signals justify broadening) was the only module that consistently improved performance | Pan et al. | +4.8% on SWE-bench |
| Meta-Harness automated optimization achieved new SOTA with 4x fewer tokens | Lee et al., Stanford | +7.7%, dramatically better efficiency |
| A harness optimized for one model can transfer to others and improve their performance | Pan et al. | Harness investment is model-portable |
| 1M+ lines of generated code with zero manual source code | OpenAI internal | Only possible with extensive harness infrastructure |
| 1,000+ weekly agent-generated PRs in isolated, CI-limited environments | Stripe | Production-scale harness engineering |

The Evolution Timeline

```
2023-2024: Prompt Engineering
           "Make the model understand what you want"

2024-2025: Context Engineering
           "Give the model the right information"

2025:      Vibe Coding / Spec Coding
           "Write specs, let agents code"

2026:      Harness Engineering
           "Design the environment where agents operate safely"
           Focus: runtime constraints, structural reliability,
           feedback loops, environment-driven governance
```

Harnessability: Designing for Agent Governance

Not all codebases are equally governable by harnesses. Harnessability refers to the environmental properties that make agent governance effective:

  • Strongly-typed languages – type checkers catch more errors computationally
  • Clear module boundaries – easier to scope agent permissions
  • Machine-readable documentation – agents can ingest specs (AGENTS.md, CLAUDE.md)
  • Comprehensive test suites – verification loops have something to verify against
  • Well-defined topologies – standard service shapes enable reusable harness templates

Greenfield teams can bake in harnessability from day one. Legacy teams face harder constraints but can incrementally improve.

Harness Templates

Pre-defined bundles of guides and sensors for common service topologies (CRUD services, event processors, data dashboards). Enable organizations to apply consistent harness controls across similar systems – an emerging best practice from Stripe and ThoughtWorks.


Practical Implications

For Platform Teams

The role of an AI platform team is shifting from “make the model smarter” to “design the environment where agents operate safely.” Invest in harness infrastructure (validation gates, permission boundaries, observability) over model selection and prompt optimization.

For Engineering Managers

Harness engineering is a new hiring and skill-development category. Engineers who can design verification loops, permission models, and checkpoint/retry systems for agents are more valuable than prompt engineers.

For Individual Engineers

The programmer’s role shifts from manual implementation toward:

  • Creating machine-readable documentation
  • Building evaluation frameworks
  • Designing structural tests and boundaries
  • Implementing observability and traceability

Anti-Patterns

  • Prompt-only reliability – relying on system prompt instructions to prevent failure. LLMs can be prompted to ignore instructions; structural controls are required.
  • One-shot complex tasks – asking the agent to do everything in a single pass. Decompose, verify, iterate.
  • Hope-based debugging – “maybe it’ll do better next time.” Engineer the environment so the specific failure mode is structurally impossible.
  • Monolithic harness – building one massive, tightly-coupled control system. Keep components modular and independently testable.
  • Over-constraining – guardrails that trigger on 10%+ of legitimate requests. Tune for precision; over-blocking kills adoption.

References

Core Papers

  • Pan et al., “Natural-Language Agent Harnesses” (Tsinghua University, March 2026): arXiv:2603.25723
  • Lee et al., “Meta-Harness: Automated Optimization of Agent Harnesses End-to-End” (Stanford University, March 2026): arXiv:2603.28052v1
  • DeepMind, “AutoHarness: Code Harness Generation for Game Environments” (March 2026)
  • “AgentSpec: Runtime Safety Constraints as a Domain-Specific Language” (ICSE 2026): arXiv:2503.18666

Analysis & Commentary

  • Louis Bouchard, “Harness Engineering: The Missing Layer Behind AI Agents”: louisbouchard.ai
  • Martin Fowler / ThoughtWorks, “Harness Engineering for Coding Agent Users”: martinfowler.com
  • Cobus Greyling, “The Rise of AI Harness Engineering”: substack

Video Source

  • “Rethinking AI Agents: The Rise of Harness Engineering”: YouTube
This post is licensed under CC BY 4.0 by the author.