
AI Documentation -- Model Cards and Impact Assessments

Documentation is how governance becomes tangible. Model cards, dataset datasheets, and impact assessments are the artifacts that prove you know what your AI does, where it fails, and what risks it carries. For high-risk systems under the EU AI Act, Annex IV makes this documentation mandatory.


Why Documentation Matters

Three reasons, in order of immediacy:

  1. Legal compliance: The EU AI Act (Annex IV) requires detailed technical documentation for high-risk AI systems. Even for limited-risk systems, documentation demonstrates due diligence.
  2. Institutional knowledge: When the engineer who built the model leaves, the model card is what remains. Without it, you’re flying blind on a production system.
  3. Quality feedback loop: Model cards connect to your observability stack – performance metrics referenced in the card should match what you monitor in production.

Model Cards

A model card is a structured document describing a machine learning model’s behavior, performance, and limitations. Think of it as a “nutrition label” for AI.

Standard Model Card Sections

| Section | What to Document |
| --- | --- |
| Model details | Name, version, provider, model type, architecture, date, owner/maintainer |
| Intended use | Primary use cases, intended users, in-scope tasks |
| Out-of-scope uses | What the model should NOT be used for, known misuse risks |
| Training data | Data sources, size, provenance, preprocessing, known biases, time period covered |
| Evaluation data | Test set description, evaluation methodology, metrics used |
| Performance metrics | Accuracy, latency, throughput, fairness metrics broken down by relevant demographic groups |
| Limitations | Known failure modes, edge cases, contexts where performance degrades |
| Ethical considerations | Potential harms, bias risks, fairness analysis, privacy implications |
| Recommendations | Deployment guidance, monitoring requirements, human oversight needs |
| Versioning | Change log, previous versions, what changed and why |
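To keep these sections from going stale, it helps to treat the model card as structured data rather than free-form prose. A minimal sketch (all names are hypothetical, not a standard API) that represents the sections above as a record and reports which ones are still empty before the card is published:

```python
from dataclasses import dataclass

# Hypothetical sketch: the standard model card sections as a structured
# record, so completeness can be checked mechanically before publishing.
@dataclass
class ModelCard:
    model_details: str = ""
    intended_use: str = ""
    out_of_scope_uses: str = ""
    training_data: str = ""
    evaluation_data: str = ""
    performance_metrics: str = ""
    limitations: str = ""
    ethical_considerations: str = ""
    recommendations: str = ""
    versioning: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

card = ModelCard(
    model_details="Customer Support Agent v2.1, LLM-based",
    intended_use="Answer customer questions about orders and returns",
)
print(card.missing_sections())
```

A check like this can run in CI so a release is blocked until every section has content, which is far cheaper than auditing cards by hand.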

Model Card Template for Consumer AI

```markdown
# Model Card: [System Name]

## Model Details
- **Name:** [e.g., Customer Support Agent v2.1]
- **Type:** LLM-based conversational agent
- **Foundation model:** [e.g., Claude Sonnet 4, via Vertex AI]
- **Owner:** [Team / Individual]
- **Last updated:** [Date]
- **Version:** [Semantic version]

## Intended Use
- **Primary:** Answer customer questions about orders, products, returns
- **Users:** Customers via web chat and mobile app
- **Languages:** German, English

## Out-of-Scope Uses
- NOT for: medical advice, financial decisions, legal counsel
- NOT for: employment screening or credit assessment
- NOT for: autonomous actions affecting customer accounts without human approval

## Training / Configuration
- **Foundation model training:** Provider-managed (see provider model card)
- **System prompt:** Version [X], reviewed [date]
- **RAG sources:** Product catalog, FAQ database, order system
- **Tools:** order_lookup, return_initiation, faq_search

## Performance
- **Task completion rate:** [X]%
- **TTFT (P95):** [X]ms
- **Hallucination rate:** [X]% (measured via automated evals)
- **User satisfaction:** [X]% positive feedback

## Limitations
- May hallucinate product specifications not in the catalog
- Cannot handle conversations where the user switches language mid-conversation
- Performance degrades on queries about discontinued products (sparse RAG data)

## Ethical Considerations
- No demographic data collected; fairness testing limited to language coverage
- PII guardrails active: blocks credit card numbers, addresses in responses
- Human handoff available at any point

## Monitoring
- Dashboard: [link to Looker/Grafana dashboard]
- Eval cadence: weekly automated evals, monthly manual review
```

Dataset Datasheets

For each dataset used in training, fine-tuning, or RAG, maintain a datasheet documenting provenance and quality.

Key Sections

| Section | Questions to Answer |
| --- | --- |
| Motivation | Why was this dataset created? Who funded it? |
| Composition | What does the dataset contain? How many instances? What are the data types? |
| Collection process | How was data collected? Who collected it? Over what time period? |
| Preprocessing | What cleaning/filtering was applied? What was removed and why? |
| Uses | What is this dataset intended for? What should it NOT be used for? |
| Distribution | How is the dataset shared? Under what license? |
| Maintenance | Who maintains it? How often is it updated? How are errors corrected? |
| Bias & limitations | Known biases, representativeness gaps, demographic skew |

For enterprise use, this is most relevant for:

  • Product catalog data used in RAG
  • Customer interaction logs used for evaluation
  • FAQ databases used for knowledge retrieval
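Like model cards, datasheets are easier to keep current when stored as data and rendered on demand. A small illustrative sketch (section names follow the table above; everything else is an assumption) that renders a datasheet to markdown, marking unanswered sections:

```python
# Illustrative sketch: a dataset datasheet kept as structured data and
# rendered to markdown for review. Unanswered sections surface as TODO.
DATASHEET_SECTIONS = [
    "Motivation", "Composition", "Collection process", "Preprocessing",
    "Uses", "Distribution", "Maintenance", "Bias & limitations",
]

def render_datasheet(name: str, answers: dict[str, str]) -> str:
    lines = [f"# Datasheet: {name}", ""]
    for section in DATASHEET_SECTIONS:
        lines.append(f"## {section}")
        lines.append(answers.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)

sheet = render_datasheet("Product catalog (RAG)", {
    "Motivation": "Ground the support agent's product answers.",
    "Maintenance": "Synced nightly from the catalog service.",
})
print(sheet)
```

The remaining TODO markers double as a review checklist: a datasheet is not done until every question in the table has an answer.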

Algorithmic Impact Assessments

An algorithmic impact assessment is a pre-deployment risk evaluation that documents the potential impact of an AI system on affected individuals and groups.

When Required

| Scenario | Required? |
| --- | --- |
| High-risk AI system (Annex III) | Mandatory before deployment |
| Limited-risk consumer-facing AI | Recommended as best practice |
| Minimal-risk internal tool | Optional but valuable for governance |
| Any system processing personal data | Required as part of DPIA under GDPR (separate but complementary) |

Impact Assessment Structure

  1. System description: What the AI does, who it affects, deployment context
  2. Purpose and necessity: Why AI is used (vs. alternatives), proportionality assessment
  3. Affected populations: Who is impacted, directly and indirectly, including vulnerable groups
  4. Risk analysis:
    • Accuracy risks: what happens when the AI is wrong?
    • Fairness risks: does it treat different groups differently?
    • Safety risks: can it cause physical, financial, or psychological harm?
    • Privacy risks: what personal data is processed, how, and why?
    • Autonomy risks: does it manipulate or unduly influence decisions?
  5. Mitigation measures: What controls are in place (guardrails, human oversight, fallbacks)
  6. Human oversight: Who monitors the system, how can they intervene, escalation path
  7. Monitoring plan: How will ongoing risks be tracked
  8. Review schedule: When will this assessment be revisited
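The risk analysis step lends itself to a simple scoring pass. A hedged sketch, assuming a 0–5 severity scale and a threshold of 3 (both illustrative choices, not prescribed by any regulation), that flags an assessment for mandatory human review when any category scores too high:

```python
# The five risk categories from the impact assessment structure above.
RISK_CATEGORIES = ("accuracy", "fairness", "safety", "privacy", "autonomy")

def needs_human_review(risks: dict[str, int], threshold: int = 3) -> bool:
    """Severity scale 0 (negligible) to 5 (critical); any category at or
    above the threshold flags the system for review before deployment.
    Scale and threshold are illustrative assumptions."""
    unknown = set(risks) - set(RISK_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown risk categories: {sorted(unknown)}")
    return any(risks.get(cat, 0) >= threshold for cat in RISK_CATEGORIES)

assessment = {"accuracy": 2, "fairness": 1, "privacy": 4}
print(needs_human_review(assessment))
```

Rejecting unknown categories matters in practice: a typo in a category name should fail loudly rather than silently score as zero risk.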

System Cards

For multi-component AI systems (agent + tools + models + RAG), a system card documents the end-to-end system, not just individual models.

A system card covers:

  • Architecture: How components connect (agent orchestrator, LLM, tools, RAG, guardrails)
  • Data flow: What data enters the system, how it flows between components, what exits
  • Decision chain: How the system makes decisions (which component decides what)
  • Failure modes: What happens when each component fails (model timeout, tool error, guardrail trigger)
  • Security boundaries: Authentication, authorization, data isolation between components

This is particularly relevant for agentic AI platforms where a customer support agent might chain multiple LLM calls, tool invocations, and sub-agent delegations.
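The failure-modes section of a system card is especially useful when it is machine-readable, because on-call engineers can look up the documented fallback for a failing component instead of guessing. A minimal sketch, where all component names and fallback behaviors are hypothetical examples for the support-agent system described earlier:

```python
# Hypothetical system card excerpt: documented fallback per component.
# An unlisted component is itself a documentation gap worth flagging.
FAILURE_MODES = {
    "llm": "Retry once, then return a static apology and open a ticket",
    "order_lookup": "Tell the user the order system is unavailable",
    "rag": "Answer from the system prompt only and flag low confidence",
    "guardrail": "Fail closed: block the response and hand off to a human",
}

def documented_fallback(component: str) -> str:
    return FAILURE_MODES.get(component, "UNDOCUMENTED -- update the system card")

print(documented_fallback("rag"))
print(documented_fallback("payment_tool"))
```

Note the guardrail entry fails closed: when the safety component itself breaks, the safe default is to block rather than to pass responses through unchecked.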


Annex IV: EU AI Act Technical Documentation Requirements

For high-risk AI systems, the EU AI Act Annex IV specifies mandatory technical documentation. This is the legal minimum – model cards and impact assessments typically exceed these requirements.

| Annex IV Requirement | What to Document |
| --- | --- |
| General description | Intended purpose, provider identity, version, hardware/software dependencies |
| Detailed description | Development methodology, design decisions, system architecture, computational resources |
| Monitoring, functioning, control | Human oversight capabilities, logging, monitoring approach |
| Risk management | Known risks, risk mitigation measures, residual risks |
| Data governance | Training data description, data preparation, bias examination, data quality measures |
| Performance metrics | Accuracy, robustness, cybersecurity, discriminatory impact, performance vs. specific persons/groups |
| Post-market monitoring | Planned monitoring after deployment, update procedures |

Connecting Documentation to Observability

Documentation is not a one-time exercise. The metrics in your model card should be the same metrics tracked in your observability stack.

```
Model Card                     Observability Stack
┌──────────────────┐           ┌──────────────────┐
│ Performance:     │    <-->   │ Dashboard:       │
│  Task success 92%│           │  Task success    │
│  TTFT P95 800ms  │           │  TTFT P95        │
│  Halluc. rate 3% │           │  Halluc. rate    │
├──────────────────┤           ├──────────────────┤
│ Limitations:     │    <-->   │ Alerts:          │
│  Fails on X      │           │  Detect X failure│
├──────────────────┤           ├──────────────────┤
│ Monitoring plan  │    <-->   │ Eval pipeline:   │
│  Weekly evals    │           │  Scheduled runs  │
└──────────────────┘           └──────────────────┘
```

When production metrics drift from the documented performance, the drift triggers one of two responses:

  • A model card update (if the drift is acceptable/expected)
  • An investigation and fix (if the drift indicates regression)
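This comparison is easy to automate. A minimal sketch (metric names, values, and the 10% relative tolerance are all assumptions) that checks live observability values against the figures documented in the model card and reports which metrics have drifted:

```python
def drifted_metrics(documented: dict[str, float],
                    production: dict[str, float],
                    tolerance: float = 0.10) -> list[str]:
    """Return metric names whose production value deviates from the
    documented value by more than `tolerance` (relative)."""
    drifted = []
    for name, doc_value in documented.items():
        prod_value = production.get(name)
        if prod_value is None:
            continue  # metric not monitored; a gap worth closing separately
        if doc_value == 0:
            if prod_value != 0:
                drifted.append(name)
        elif abs(prod_value - doc_value) / abs(doc_value) > tolerance:
            drifted.append(name)
    return drifted

# Illustrative values echoing the diagram above.
card = {"task_success": 0.92, "ttft_p95_ms": 800, "hallucination_rate": 0.03}
live = {"task_success": 0.90, "ttft_p95_ms": 1100, "hallucination_rate": 0.031}
print(drifted_metrics(card, live))
```

Run on a schedule, each reported metric becomes either a model card update or an investigation ticket, which keeps the documentation and the production system from silently diverging.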

This post is licensed under CC BY 4.0 by the author.