AI Documentation -- Model Cards and Impact Assessments
Documentation is how governance becomes tangible. Model cards, dataset datasheets, and impact assessments are the artifacts that prove you know what your AI does, where it fails, and what risks it carries. For high-risk systems under the EU AI Act, Annex IV makes this documentation mandatory.
Why Documentation Matters
Three reasons, in order of immediacy:
- Legal compliance: The EU AI Act (Annex IV) requires detailed technical documentation for high-risk AI systems. Even for limited-risk systems, documentation demonstrates due diligence.
- Institutional knowledge: When the engineer who built the model leaves, the model card is what remains. Without it, you’re flying blind on a production system.
- Quality feedback loop: Model cards connect to your observability stack – performance metrics referenced in the card should match what you monitor in production.
Model Cards
A model card is a structured document describing a machine learning model’s behavior, performance, and limitations. Think of it as a “nutrition label” for AI.
Standard Model Card Sections
| Section | What to Document |
|---|---|
| Model details | Name, version, provider, model type, architecture, date, owner/maintainer |
| Intended use | Primary use cases, intended users, in-scope tasks |
| Out-of-scope uses | What the model should NOT be used for, known misuse risks |
| Training data | Data sources, size, provenance, preprocessing, known biases, time period covered |
| Evaluation data | Test set description, evaluation methodology, metrics used |
| Performance metrics | Accuracy, latency, throughput, fairness metrics broken down by relevant demographic groups |
| Limitations | Known failure modes, edge cases, contexts where performance degrades |
| Ethical considerations | Potential harms, bias risks, fairness analysis, privacy implications |
| Recommendations | Deployment guidance, monitoring requirements, human oversight needs |
| Versioning | Change log, previous versions, what changed and why |
Model Card Template for Consumer AI
```markdown
# Model Card: [System Name]

## Model Details
- **Name:** [e.g., Customer Support Agent v2.1]
- **Type:** LLM-based conversational agent
- **Foundation model:** [e.g., Claude Sonnet 4, via Vertex AI]
- **Owner:** [Team / Individual]
- **Last updated:** [Date]
- **Version:** [Semantic version]

## Intended Use
- **Primary:** Answer customer questions about orders, products, returns
- **Users:** Customers via web chat and mobile app
- **Languages:** German, English

## Out-of-Scope Uses
- NOT for: medical advice, financial decisions, legal counsel
- NOT for: employment screening or credit assessment
- NOT for: autonomous actions affecting customer accounts without human approval

## Training / Configuration
- **Foundation model training:** Provider-managed (see provider model card)
- **System prompt:** Version [X], reviewed [date]
- **RAG sources:** Product catalog, FAQ database, order system
- **Tools:** order_lookup, return_initiation, faq_search

## Performance
- **Task completion rate:** [X]%
- **TTFT (P95):** [X]ms
- **Hallucination rate:** [X]% (measured via automated evals)
- **User satisfaction:** [X]% positive feedback

## Limitations
- May hallucinate product specifications not in the catalog
- Cannot handle conversations where the user switches language mid-conversation
- Performance degrades on queries about discontinued products (sparse RAG data)

## Ethical Considerations
- No demographic data collected; fairness testing limited to language coverage
- PII guardrails active: blocks credit card numbers, addresses in responses
- Human handoff available at any point

## Monitoring
- Dashboard: [link to Looker/Grafana dashboard]
- Eval cadence: weekly automated evals, monthly manual review
```
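A template like this lends itself to automated completeness checks, for example as a CI gate on a model-card repository. A minimal sketch (the required-section list and the heading regex are assumptions matching the template above, not a standard):

```python
import re

# Required "##" section headings, taken from the template above (assumed set).
REQUIRED_SECTIONS = [
    "Model Details", "Intended Use", "Out-of-Scope Uses",
    "Training / Configuration", "Performance", "Limitations",
    "Ethical Considerations", "Monitoring",
]

def missing_sections(card_markdown: str) -> list[str]:
    """Return the required sections absent from a model card's markdown."""
    found = set(re.findall(r"^##\s+(.+?)\s*$", card_markdown, flags=re.MULTILINE))
    return [s for s in REQUIRED_SECTIONS if s not in found]

# An incomplete card: only two of the eight required sections are present.
card = "# Model Card: Demo\n## Model Details\n- ...\n## Performance\n- ..."
print(missing_sections(card))
```

A check like this catches cards that drift out of shape as teams copy and edit them; it says nothing about whether the content inside each section is accurate.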
Dataset Datasheets
For each dataset used in training, fine-tuning, or RAG, maintain a datasheet documenting provenance and quality.
Key Sections
| Section | Questions to Answer |
|---|---|
| Motivation | Why was this dataset created? Who funded it? |
| Composition | What does the dataset contain? How many instances? What are the data types? |
| Collection process | How was data collected? Who collected it? Over what time period? |
| Preprocessing | What cleaning/filtering was applied? What was removed and why? |
| Uses | What is this dataset intended for? What should it NOT be used for? |
| Distribution | How is the dataset shared? Under what license? |
| Maintenance | Who maintains it? How often is it updated? How are errors corrected? |
| Bias & limitations | Known biases, representativeness gaps, demographic skew |
For enterprise use, this is most relevant for:
- Product catalog data used in RAG
- Customer interaction logs used for evaluation
- FAQ databases used for knowledge retrieval
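Datasheets can also be kept as structured records rather than free text, which makes them diffable in version control and queryable across datasets. A minimal sketch using an illustrative Python dataclass (the field names and the example values are assumptions, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """Minimal dataset datasheet record; fields mirror the table above."""
    name: str
    motivation: str
    composition: str
    collection_process: str
    preprocessing: str
    intended_uses: list[str]
    out_of_scope_uses: list[str]
    license: str
    maintainer: str
    known_biases: list[str] = field(default_factory=list)

# Hypothetical datasheet for an FAQ database used in RAG.
faq_sheet = Datasheet(
    name="faq-database-v3",
    motivation="Knowledge retrieval for the customer support agent",
    composition="~4,200 question/answer pairs, German and English",
    collection_process="Curated by the support team from resolved tickets",
    preprocessing="PII removed; near-duplicate questions merged",
    intended_uses=["RAG retrieval"],
    out_of_scope_uses=["Fine-tuning", "Analytics on customer behavior"],
    license="internal",
    maintainer="support-platform team",
    known_biases=["Over-represents recently launched products"],
)
```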
Algorithmic Impact Assessments
An algorithmic impact assessment is a pre-deployment risk evaluation that documents the potential impact of an AI system on affected individuals and groups.
When Required
| Scenario | Required? |
|---|---|
| High-risk AI system (Annex III) | Mandatory before deployment |
| Limited-risk consumer-facing AI | Recommended as best practice |
| Minimal-risk internal tool | Optional but valuable for governance |
| Any system processing personal data | Required as part of DPIA under GDPR (separate but complementary) |
Impact Assessment Structure
- System description: What the AI does, who it affects, deployment context
- Purpose and necessity: Why AI is used (vs. alternatives), proportionality assessment
- Affected populations: Who is impacted, directly and indirectly. Vulnerable groups.
- Risk analysis:
- Accuracy risks: what happens when the AI is wrong?
- Fairness risks: does it treat different groups differently?
- Safety risks: can it cause physical, financial, or psychological harm?
- Privacy risks: what personal data is processed, how, and why?
- Autonomy risks: does it manipulate or unduly influence decisions?
- Mitigation measures: What controls are in place (guardrails, human oversight, fallbacks)
- Human oversight: Who monitors the system, how can they intervene, escalation path
- Monitoring plan: How will ongoing risks be tracked
- Review schedule: When will this assessment be revisited
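The risk-analysis portion of an assessment can be maintained as a machine-checkable register rather than prose, so that sign-off can be gated on open items. A minimal sketch under assumed field names and an assumed gating rule (every high-severity risk must have a recorded mitigation):

```python
# Illustrative risk register; categories follow the list above.
risks = [
    {"category": "accuracy", "description": "Wrong return policy quoted",
     "severity": "high", "likelihood": "medium",
     "mitigation": "RAG grounding plus weekly hallucination evals"},
    {"category": "privacy", "description": "PII echoed in responses",
     "severity": "high", "likelihood": "low",
     "mitigation": "Output guardrail blocks card numbers and addresses"},
    {"category": "autonomy", "description": "Over-persuasive upselling",
     "severity": "medium", "likelihood": "low",
     "mitigation": None},  # open item, tracked but not deployment-blocking
]

# Gate: no high-severity risk may lack a mitigation before sign-off.
unmitigated_high = [r for r in risks
                    if r["severity"] == "high" and not r["mitigation"]]
ready_for_review = not unmitigated_high
print(ready_for_review)
```

The gating rule here is an assumption for illustration; the right threshold (and whether likelihood factors in) is a policy decision for the review board, not the code.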
System Cards
For multi-component AI systems (agent + tools + models + RAG), a system card documents the end-to-end system, not just individual models.
A system card covers:
- Architecture: How components connect (agent orchestrator, LLM, tools, RAG, guardrails)
- Data flow: What data enters the system, how it flows between components, what exits
- Decision chain: How the system makes decisions (which component decides what)
- Failure modes: What happens when each component fails (model timeout, tool error, guardrail trigger)
- Security boundaries: Authentication, authorization, data isolation between components
This is particularly relevant for agentic AI platforms where a customer support agent might chain multiple LLM calls, tool invocations, and sub-agent delegations.
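One way to keep a system card honest is to encode its component inventory as data and check that every component documents a failure mode. A minimal sketch with hypothetical component names and behaviors:

```python
# Hypothetical system-card skeleton for a multi-component agent.
# Each component records its role and its documented failure mode.
system_card = {
    "components": {
        "orchestrator": {"role": "routes requests, chains sub-calls",
                         "on_failure": "static apology plus human handoff"},
        "llm":        {"role": "generation",
                       "on_failure": "retry once, then human handoff"},
        "rag":        {"role": "retrieval from product catalog and FAQ",
                       "on_failure": "answer without citations, flag low confidence"},
        "guardrails": {"role": "PII and policy filtering",
                       "on_failure": "fail closed: block the response"},
    },
    "security_boundaries": ["tools run with per-customer scoped tokens"],
}

# Completeness check: every component must document a failure mode.
undocumented = [name for name, c in system_card["components"].items()
                if not c.get("on_failure")]
assert not undocumented, f"Missing failure modes: {undocumented}"
```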
Annex IV: EU AI Act Technical Documentation Requirements
For high-risk AI systems, the EU AI Act Annex IV specifies mandatory technical documentation. This is the legal minimum – model cards and impact assessments typically exceed these requirements.
| Annex IV Requirement | What to Document |
|---|---|
| General description | Intended purpose, provider identity, version, hardware/software dependencies |
| Detailed description | Development methodology, design decisions, system architecture, computational resources |
| Monitoring, functioning, control | Human oversight capabilities, logging, monitoring approach |
| Risk management | Known risks, risk mitigation measures, residual risks |
| Data governance | Training data description, data preparation, bias examination, data quality measures |
| Performance metrics | Accuracy, robustness, cybersecurity, discriminatory impact, performance vs. specific persons/groups |
| Post-market monitoring | Planned monitoring after deployment, update procedures |
Connecting Documentation to Observability
Documentation is not a one-time exercise. The metrics in your model card should be the same metrics tracked in your observability stack.
```
Model Card                    Observability Stack
┌──────────────────┐          ┌──────────────────┐
│ Performance:     │   <-->   │ Dashboard:       │
│ Task success 92% │          │ Task success     │
│ TTFT P95 800ms   │          │ TTFT P95         │
│ Halluc. rate 3%  │          │ Halluc. rate     │
├──────────────────┤          ├──────────────────┤
│ Limitations:     │   <-->   │ Alerts:          │
│ Fails on X       │          │ Detect X failure │
├──────────────────┤          ├──────────────────┤
│ Monitoring plan  │   <-->   │ Eval pipeline:   │
│ Weekly evals     │          │ Scheduled runs   │
└──────────────────┘          └──────────────────┘
```
When production metrics drift from documented performance, it triggers either:
- A model card update (if the drift is acceptable/expected)
- An investigation and fix (if the drift indicates regression)
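This decision can be automated as a drift check that compares production metrics against the values documented in the card. A minimal sketch, where the metric names echo the template earlier and the tolerances are assumptions to tune per system:

```python
# Values the model card documents vs. what production currently reports.
documented = {"task_success": 0.92, "ttft_p95_ms": 800, "hallucination": 0.03}
production = {"task_success": 0.87, "ttft_p95_ms": 820, "hallucination": 0.03}
tolerance  = {"task_success": 0.02, "ttft_p95_ms": 100, "hallucination": 0.01}

def drifted(metric: str) -> bool:
    """True when production deviates from the card beyond tolerance."""
    return abs(production[metric] - documented[metric]) > tolerance[metric]

for metric in documented:
    if drifted(metric):
        # In practice this opens a ticket: investigate a regression,
        # or update the model card if the new level is accepted.
        print(f"{metric}: drift beyond tolerance, review required")
```

Here task success has slipped from 92% to 87%, outside the 2-point tolerance, so it would be flagged; latency moved only 20 ms and stays within bounds.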
References
- Google: Model Cards for Model Reporting (original paper)
- Gebru et al.: Datasheets for Datasets
- NexaStack: Model Cards and AI Fact Sheets for governance
- EU AI Act Annex IV (technical documentation requirements)
- 2B Advice: Why model cards are important for AI documentation
- AI Transparency Atlas: scoring and evaluation pipeline