Harness Engineering
The model is a CPU. Without an operating system -- the harness -- it's powerful but inert. Reliability is the real work. Prompting is the easiest part.
What Harness Engineering Is
Harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. It is not prompt engineering (what you ask the model), not context engineering (what the model sees), but the operational layer governing how the entire system runs.
The equation: Agent = Model + Harness
Everything except the model weights is harness: tools, permissions, validation gates, memory management, retry logic, escalation rules, observability, checkpointing, and the feedback loops that turn failures into structural improvements.
Why It Matters Now
Agents became simultaneously useful and unreliable in 2025-2026. They can execute code, call APIs, and reason through complex tasks – but without structural guardrails they “confidently make the same stupid mistake again and again.” The industry realized the performance ceiling isn’t model intelligence; it’s the quality of the surrounding infrastructure. Research from Stanford and LangChain demonstrated a 6x performance difference using the exact same LLM with different harnesses.
The OS Analogy
The most useful mental model for harness engineering maps directly to operating system architecture:
| OS Component | Agent Equivalent | Role |
|---|---|---|
| CPU | LLM | Raw processing/reasoning capability |
| RAM | Context Window | Fast but limited working memory |
| Disk | External Databases / Vector Stores | Persistent storage beyond context |
| Device Drivers | Tool Integrations (APIs, browsers, code) | Interface to external systems |
| Operating System | The Harness | Manages state, memory, tool calls, scheduling, permissions |
A CPU without an OS can’t do anything useful for an end user. Same with an LLM without a harness.
Three-Layer Framework
| Layer | Controls | Example |
|---|---|---|
| Prompt Engineering | What you ask the model | Crafting effective instructions |
| Context Engineering | What information the model sees | Retrieved docs, schemas, summaries, conversation history |
| Harness Engineering | How the whole system operates | Tools, permissions, validation, monitoring, retries, guardrails, checkpointing |
Context engineering makes sure the model sees the right database schema. Harness engineering is the reason it still has to run the linter, pass the tests, and respect permissions.
The Six Core Components
1. Orchestration Logic
The control flow governing how the agent progresses through tasks. Not a single monolithic prompt – structured patterns that decompose work and manage sequencing.
Key patterns:
- Prompt Chaining – sequential steps with validation between each
- Routing – directing to specialized sub-agents based on intent
- Parallelization – running independent sub-tasks concurrently
- Evaluator-Optimizer loops – generating, checking, and refining iteratively
See: Agent Orchestration & Handoffs
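The first of these patterns can be sketched in a few lines. Below is a minimal, illustrative prompt chain with a validation gate between steps; `call_model` is a stand-in for any LLM client, and the validators are hypothetical examples, not part of a real framework:

```python
# Sketch of prompt chaining with a validation gate between steps.
# `call_model` is a placeholder for a real LLM call; each step's output
# must pass its validator before it is fed forward to the next step.

from typing import Callable

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"output for: {prompt}"

def chain(steps: list[tuple[str, Callable[[str], bool]]]) -> list[str]:
    """Run prompts in sequence; halt the chain on any validation failure."""
    outputs: list[str] = []
    context = ""
    for prompt, validate in steps:
        result = call_model(context + prompt)
        if not validate(result):
            raise ValueError(f"validation failed at step: {prompt!r}")
        outputs.append(result)
        context = result + "\n"  # feed the validated output forward
    return outputs

results = chain([
    ("extract the requirements", lambda r: len(r) > 0),
    ("draft the implementation plan", lambda r: "output" in r),
])
```

The point is structural: a bad intermediate result stops the chain at the gate instead of silently contaminating every downstream step.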
2. Memory & State Management
Multi-layered persistence beyond the single context window:
- Working context – current task state within the session
- Session state – progress tracking across context window boundaries
- Long-term memory – artifacts, git commits, progress logs across sessions
- Externalized knowledge – documentation, specs, runbooks the agent can query
The Anthropic two-agent pattern for long-running agents demonstrates this: an Initializer Agent sets up the environment and creates a feature list + progress tracking file, then a Coding Agent picks up in subsequent sessions by reading those artifacts.
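A minimal sketch of that artifact handoff, assuming illustrative file names (`features.json`, `progress.md`) rather than any specific Anthropic format: the initializer writes the artifacts, and a later session reads them back to select the next incomplete feature.

```python
# Sketch of the initializer/coding-agent handoff via persistent artifacts.
# File names and the feature schema are illustrative assumptions.

import json
from pathlib import Path

def initialize(workdir: Path) -> None:
    """Initializer agent: write the feature list and progress file."""
    features = [
        {"name": "login", "priority": 1, "done": False},
        {"name": "search", "priority": 2, "done": False},
    ]
    (workdir / "features.json").write_text(json.dumps(features, indent=2))
    (workdir / "progress.md").write_text("# Progress\n\n(no sessions yet)\n")

def next_feature(workdir: Path) -> dict:
    """Coding agent, session N: read artifacts, pick the next feature."""
    features = json.loads((workdir / "features.json").read_text())
    incomplete = [f for f in features if not f["done"]]
    return min(incomplete, key=lambda f: f["priority"])
```

Because state lives on disk rather than in the context window, a fresh session with an empty context can orient itself entirely from the artifacts.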
3. Verification & Feedback Loops
Two types of controls that keep agents on track:
Guides (Feedforward) – anticipate and prevent unwanted behavior before execution:
- Architecture documentation
- Code standards and style guides
- Explicit skill instructions and constraints
Sensors (Feedback) – observe after agent action and enable self-correction:
- Test results, linter output, type checker errors
- Code review agents (inferential sensors)
- Production telemetry and drift detection
Martin Fowler’s framework distinguishes computational controls (fast, deterministic: tests, linters) from inferential controls (slower, semantic: LLM-based review). Stack both.
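The stacking order matters: cheap deterministic sensors should run first and short-circuit, so the expensive inferential sensor only sees code that already passes them. A minimal sketch, with `run_linter`, `run_tests`, and `llm_review` as stand-ins for real tools:

```python
# Sketch of stacking computational sensors (fast, deterministic) ahead of
# an inferential sensor (slow, semantic). All three sensors are stubs.

def run_linter(code: str) -> list[str]:
    return []  # deterministic: style and static-analysis findings

def run_tests(code: str) -> list[str]:
    return []  # deterministic: failing test names

def llm_review(code: str) -> list[str]:
    return []  # inferential: semantic issues from an LLM reviewer

def verify(code: str) -> list[str]:
    # Computational sensors run first and short-circuit on any finding.
    for sensor in (run_linter, run_tests):
        issues = sensor(code)
        if issues:
            return issues
    # Only code that passes the cheap gates reaches the expensive one.
    return llm_review(code)
```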
4. Safety Guardrails
Hard constraints that prevent illegal, unsafe, or out-of-scope actions:
- Input guardrails – content classification, PII detection, prompt injection detection
- Output guardrails – hallucination checks, PII scrubbing, format validation
- Tool use guardrails – permission boundaries, argument validation, rate limits
See: Evals & Guardrails
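A tool-use guardrail of the third kind can be sketched as a wrapper that checks an allowlist and per-tool argument validators before anything executes. The tool names and the path-traversal check are illustrative assumptions:

```python
# Sketch of a tool-use guardrail: permission boundary plus argument
# validation, enforced before the tool call runs. Names are illustrative.

ALLOWED_TOOLS = {"read_file", "run_tests"}

def validate_args(tool: str, args: dict) -> None:
    """Per-tool argument validation; raises on unsafe arguments."""
    if tool == "read_file" and ".." in args.get("path", ""):
        raise PermissionError("path traversal blocked")

def guarded_call(tool: str, args: dict, execute) -> object:
    """Run a tool call only if it passes the permission boundary."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} not permitted")
    validate_args(tool, args)
    return execute(tool, args)
```

Because the check lives in the harness rather than the prompt, the agent cannot be talked out of it.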
5. Planning & Decomposition
Structured task sequencing rather than monolithic attempts. The GAN-inspired Planner-Generator-Evaluator model is emerging as a standard architecture:
```mermaid
flowchart TD
    A["User Goal"] --> B{"Planner Agent"}
    B --> C["Generator Agent"]
    C --> D["Running Application / Sandbox"]
    D --> E["Evaluator Agent"]
    E -- "Failure + Feedback" --> B
    E -- "Success" --> F["Final Output"]
    subgraph Harness
        B
        C
        E
    end
```
The key insight: don’t “one-shot” complex tasks. Break them into plannable, verifiable steps with feedback loops between stages.
6. Modularity & Extensibility
Harness components should be pluggable – independently enabled, disabled, or replaced. This enables:
- Swapping models without rebuilding the harness
- Adding new tools without changing orchestration logic
- Transferring harnesses across projects with similar topologies
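One way to get that pluggability, sketched under the assumption that every component exposes the same small interface, is a registry the orchestrator consults without knowing which components are installed:

```python
# Sketch of pluggable harness components behind one small interface, so
# any component can be enabled, disabled, or swapped without touching
# orchestration logic. The RateLimiter is an illustrative example.

from typing import Protocol

class HarnessComponent(Protocol):
    name: str
    def check(self, action: dict) -> bool: ...

class RateLimiter:
    name = "rate_limiter"
    def __init__(self, limit: int):
        self.limit, self.count = limit, 0
    def check(self, action: dict) -> bool:
        self.count += 1
        return self.count <= self.limit

class Harness:
    def __init__(self):
        self.components: dict[str, HarnessComponent] = {}
    def enable(self, c: HarnessComponent) -> None:
        self.components[c.name] = c
    def disable(self, name: str) -> None:
        self.components.pop(name, None)
    def permit(self, action: dict) -> bool:
        return all(c.check(action) for c in self.components.values())
```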
The Regulation Dimensions
Martin Fowler’s ThoughtWorks framework identifies three harness regulation dimensions at different maturity levels:
Maintainability Harness (Most Mature)
Internal code quality via computational sensors – catches duplication, complexity, coverage gaps, style violations. Well-understood, largely solvable with existing tools (linters, type checkers, coverage tools).
Architecture Fitness Harness (Emerging)
Performance requirements, observability standards, architectural constraints. Uses fitness functions and architectural tests. Combines computational and inferential controls.
Behaviour Harness (Least Mature)
Functional correctness verification – does the agent actually do the right thing? Current approaches rely on specs + AI-generated tests + manual testing. The hardest problem; over-reliance on AI-generated test quality is a known gap.
The Steering Loop
Harness engineering is not a one-time build – it’s an ongoing feedback cycle between human engineers and the harness:
```mermaid
flowchart LR
    A["Agent makes mistake"] --> B["Human identifies failure pattern"]
    B --> C["Engineer structural fix into harness"]
    C --> D["Harness prevents recurrence"]
    D --> E["Agent operates within tighter constraints"]
    E --> A
```
“Every time the agent makes a mistake, don’t just hope it does better next time. Engineer the environment so it can’t make that specific mistake the same way again.”
This is the defining principle. Not prompt tuning. Structural, environmental hardening.
Anthropic’s Two-Agent Pattern for Long-Running Agents
A concrete harness architecture from Anthropic’s engineering team for agents that work across multiple sessions:
```mermaid
flowchart TD
    A["Task Input"] --> B["Initializer Agent"]
    B --> C["Creates Feature List JSON"]
    B --> D["Sets up Git Repo + init.sh"]
    B --> E["Writes Progress File"]
    C --> F["Coding Agent - Session N"]
    D --> F
    E --> F
    F --> G["Reads progress + git log"]
    G --> H["Selects highest-priority incomplete feature"]
    H --> I["Implements + tests single feature"]
    I --> J["Commits + updates progress file"]
    J --> K{"More features?"}
    K -- "Yes" --> F
    K -- "No" --> L["Done"]
```
Session initialization checklist:
- Run `pwd` to confirm working directory
- Read git logs and progress files
- Select highest-priority incomplete feature
- Run dev server and perform baseline tests
Key failure modes and solutions:
| Problem | Harness Solution |
|---|---|
| Agent declares victory too early | Feature list file + single-feature focus per session |
| Buggy/undocumented progress | Git repo + progress notes + startup verification |
| Premature feature marking | Explicit self-verification requirement |
| Time wasted understanding app state | Pre-written init.sh + orientation checklist |
AgentSpec: Runtime Safety as a DSL
AgentSpec (ICSE 2026) is the first framework that systematically enforces customizable safety constraints on LLM agents at runtime using a domain-specific language.
Rules are composed of three elements:
- Trigger – the event that fires the rule (e.g., agent executing a financial transaction)
- Predicate – the condition to check (e.g., transaction amount > threshold)
- Enforcement – the action to take (e.g., require user confirmation)
This approach makes safety constraints declarative, auditable, and portable across agents – a significant step beyond hardcoded guardrails in application code. Validated across code execution, embodied agents, and autonomous driving domains.
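To make the three-part rule structure concrete, here is an illustrative Python analogue of a trigger/predicate/enforcement rule. This mirrors AgentSpec's rule anatomy but is not the paper's actual DSL syntax; all names and thresholds are assumptions:

```python
# Illustrative trigger/predicate/enforcement rule in Python, mirroring
# AgentSpec's three-part structure (not its actual DSL syntax).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: str                          # event name that fires the rule
    predicate: Callable[[dict], bool]     # condition on the event payload
    enforcement: Callable[[dict], str]    # action taken when it holds

require_confirmation = Rule(
    trigger="financial_transaction",
    predicate=lambda e: e["amount"] > 1000,
    enforcement=lambda e: "require_user_confirmation",
)

def apply_rules(event: str, payload: dict, rules: list[Rule]) -> list[str]:
    """Return enforcement actions for all rules matching this event."""
    return [r.enforcement(payload) for r in rules
            if r.trigger == event and r.predicate(payload)]
```

Keeping rules as data rather than application code is what makes them auditable and portable across agents.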
Key Research Findings
| Finding | Source | Impact |
|---|---|---|
| Harness logic migration (code to Natural Language Agent Harness) increased benchmark performance from 30.4% to 47.2% | Pan et al., Tsinghua | +16.8% from harness change alone |
| Same model went from outside Top 30 to Top 5 by improving only the harness | LangChain TerminalBench 2.0 | 6x performance gap attributable to harness |
| Self-Evolution (narrowing agent focus until failure signals justify broadening) was the only module that consistently improved performance | Pan et al. | +4.8% on SWE-bench |
| Meta-Harness automated optimization achieved new SOTA with 4x fewer tokens | Lee et al., Stanford | +7.7%, dramatically better efficiency |
| A harness optimized for one model can transfer to others and improve their performance | Pan et al. | Harness investment is model-portable |
| 1M+ lines of generated code with zero manual source code | OpenAI internal | Only possible with extensive harness infrastructure |
| 1,000+ weekly agent-generated PRs in isolated, CI-limited environments | Stripe | Production-scale harness engineering |
The Evolution Timeline
- 2023-2024: Prompt Engineering – "Make the model understand what you want"
- 2024-2025: Context Engineering – "Give the model the right information"
- 2025: Vibe Coding / Spec Coding – "Write specs, let agents code"
- 2026: Harness Engineering – "Design the environment where agents operate safely"; focus on runtime constraints, structural reliability, feedback loops, and environment-driven governance
Harnessability: Designing for Agent Governance
Not all codebases are equally governable by harnesses. Harnessability refers to the environmental properties that make agent governance effective:
- Strongly-typed languages – type checkers catch more errors computationally
- Clear module boundaries – easier to scope agent permissions
- Machine-readable documentation – agents can ingest specs (`AGENTS.md`, `CLAUDE.md`)
- Comprehensive test suites – verification loops have something to verify against
- Well-defined topologies – standard service shapes enable reusable harness templates
Greenfield teams can bake in harnessability from day one. Legacy teams face harder constraints but can incrementally improve.
Harness Templates
Pre-defined bundles of guides and sensors for common service topologies (CRUD services, event processors, data dashboards). Enable organizations to apply consistent harness controls across similar systems – an emerging best practice from Stripe and ThoughtWorks.
Practical Implications
For Platform Teams
The role of an AI platform team is shifting from “make the model smarter” to “design the environment where agents operate safely.” Invest in harness infrastructure (validation gates, permission boundaries, observability) over model selection and prompt optimization.
For Engineering Managers
Harness engineering is a new hiring and skill-development category. Engineers who can design verification loops, permission models, and checkpoint/retry systems for agents are more valuable than prompt engineers.
For Individual Engineers
The programmer’s role shifts from manual implementation toward:
- Creating machine-readable documentation
- Building evaluation frameworks
- Designing structural tests and boundaries
- Implementing observability and traceability
Anti-Patterns
- Prompt-only reliability – relying on system prompt instructions to prevent failure. LLMs can be prompted to ignore instructions; structural controls are required.
- One-shot complex tasks – asking the agent to do everything in a single pass. Decompose, verify, iterate.
- Hope-based debugging – “maybe it’ll do better next time.” Engineer the environment so the specific failure mode is structurally impossible.
- Monolithic harness – building one massive, tightly-coupled control system. Keep components modular and independently testable.
- Over-constraining – guardrails that trigger on 10%+ of legitimate requests. Tune for precision; over-blocking kills adoption.
References
Core Papers
- Pan et al., “Natural-Language Agent Harnesses” (Tsinghua University, March 2026): arXiv:2603.25723
- Lee et al., “Meta-Harness: Automated Optimization of Agent Harnesses End-to-End” (Stanford University, March 2026): arXiv:2603.28052v1
- DeepMind, “AutoHarness: Code Harness Generation for Game Environments” (March 2026)
- “AgentSpec: Runtime Safety Constraints as a Domain-Specific Language” (ICSE 2026): arXiv:2503.18666
Industry Sources
- Anthropic, “Building Effective Agents” (December 2024): anthropic.com/research
- Anthropic, “Effective Harnesses for Long-Running Agents” (November 2025): anthropic.com/engineering
- OpenAI, Zero-Manual-Code Experiment Report (2025-2026)
- LangChain, TerminalBench 2.0 Results (March 2026): blog.langchain.dev
Analysis & Commentary
- Louis Bouchard, “Harness Engineering: The Missing Layer Behind AI Agents”: louisbouchard.ai
- Martin Fowler / ThoughtWorks, “Harness Engineering for Coding Agent Users”: martinfowler.com
- Cobus Greyling, “The Rise of AI Harness Engineering”: substack
Video Source
- “Rethinking AI Agents: The Rise of Harness Engineering”: YouTube