
AI Documentation -- Model Cards and Impact Assessments

Documentation is how governance becomes tangible. Model cards, dataset datasheets, and impact assessments are the artifacts that prove you know what your AI does, where it fails, and what risks it carries. For high-risk systems under the EU AI Act, Annex IV makes this documentation mandatory.


Why Documentation Matters

Three reasons, in order of immediacy:

  1. Legal compliance: The EU AI Act (Annex IV) requires detailed technical documentation for high-risk AI systems. Even for limited-risk systems, documentation demonstrates due diligence.
  2. Institutional knowledge: When the engineer who built the model leaves, the model card is what remains. Without it, you’re flying blind on a production system.
  3. Quality feedback loop: Model cards connect to your observability stack – performance metrics referenced in the card should match what you monitor in production.

Model Cards

A model card is a structured document describing a machine learning model’s behavior, performance, and limitations. Think of it as a “nutrition label” for AI.

Standard Model Card Sections

| Section | What to Document |
| --- | --- |
| Model details | Name, version, provider, model type, architecture, date, owner/maintainer |
| Intended use | Primary use cases, intended users, in-scope tasks |
| Out-of-scope uses | What the model should NOT be used for, known misuse risks |
| Training data | Data sources, size, provenance, preprocessing, known biases, time period covered |
| Evaluation data | Test set description, evaluation methodology, metrics used |
| Performance metrics | Accuracy, latency, throughput, fairness metrics broken down by relevant demographic groups |
| Limitations | Known failure modes, edge cases, contexts where performance degrades |
| Ethical considerations | Potential harms, bias risks, fairness analysis, privacy implications |
| Recommendations | Deployment guidance, monitoring requirements, human oversight needs |
| Versioning | Change log, previous versions, what changed and why |
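To keep these sections from going stale, it helps to treat the model card as structured data rather than free-form prose. A minimal sketch (all names are hypothetical, not a standard API) that represents the sections above as a record and reports which ones are still empty before the card is published:

```python
from dataclasses import dataclass

# Hypothetical sketch: the standard model card sections as a structured
# record, so completeness can be checked mechanically before publishing.
@dataclass
class ModelCard:
    model_details: str = ""
    intended_use: str = ""
    out_of_scope_uses: str = ""
    training_data: str = ""
    evaluation_data: str = ""
    performance_metrics: str = ""
    limitations: str = ""
    ethical_considerations: str = ""
    recommendations: str = ""
    versioning: str = ""

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value.strip()]

card = ModelCard(
    model_details="Customer Support Agent v2.1, LLM-based",
    intended_use="Answer customer questions about orders and returns",
)
print(card.missing_sections())
```

A check like this can run in CI so a release is blocked until every section has content, which is far cheaper than auditing cards by hand.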

Model Card Template for Consumer AI

```markdown
# Model Card: [System Name]

## Model Details
- **Name:** [e.g., Customer Support Agent v2.1]
- **Type:** LLM-based conversational agent
- **Foundation model:** [e.g., Claude Sonnet 4, via Vertex AI]
- **Owner:** [Team / Individual]
- **Last updated:** [Date]
- **Version:** [Semantic version]

## Intended Use
- **Primary:** Answer customer questions about orders, products, returns
- **Users:** Customers via web chat and mobile app
- **Languages:** German, English

## Out-of-Scope Uses
- NOT for: medical advice, financial decisions, legal counsel
- NOT for: employment screening or credit assessment
- NOT for: autonomous actions affecting customer accounts without human approval

## Training / Configuration
- **Foundation model training:** Provider-managed (see provider model card)
- **System prompt:** Version [X], reviewed [date]
- **RAG sources:** Product catalog, FAQ database, order system
- **Tools:** order_lookup, return_initiation, faq_search

## Performance
- **Task completion rate:** [X]%
- **TTFT (P95):** [X]ms
- **Hallucination rate:** [X]% (measured via automated evals)
- **User satisfaction:** [X]% positive feedback

## Limitations
- May hallucinate product specifications not in the catalog
- Cannot handle conversations where the user switches language mid-conversation
- Performance degrades on queries about discontinued products (sparse RAG data)

## Ethical Considerations
- No demographic data collected; fairness testing limited to language coverage
- PII guardrails active: blocks credit card numbers, addresses in responses
- Human handoff available at any point

## Monitoring
- Dashboard: [link to Looker/Grafana dashboard]
- Eval cadence: weekly automated evals, monthly manual review
```

Dataset Datasheets

For each dataset used in training, fine-tuning, or RAG, maintain a datasheet documenting provenance and quality.

Key Sections

| Section | Questions to Answer |
| --- | --- |
| Motivation | Why was this dataset created? Who funded it? |
| Composition | What does the dataset contain? How many instances? What are the data types? |
| Collection process | How was data collected? Who collected it? Over what time period? |
| Preprocessing | What cleaning/filtering was applied? What was removed and why? |
| Uses | What is this dataset intended for? What should it NOT be used for? |
| Distribution | How is the dataset shared? Under what license? |
| Maintenance | Who maintains it? How often is it updated? How are errors corrected? |
| Bias & limitations | Known biases, representativeness gaps, demographic skew |

For enterprise use, this is most relevant for:

  • Product catalog data used in RAG
  • Customer interaction logs used for evaluation
  • FAQ databases used for knowledge retrieval
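Like model cards, datasheets are easier to keep current when stored as data and rendered on demand. A small illustrative sketch (section names follow the table above; everything else is an assumption) that renders a datasheet to markdown, marking unanswered sections:

```python
# Illustrative sketch: a dataset datasheet kept as structured data and
# rendered to markdown for review. Unanswered sections surface as TODO.
DATASHEET_SECTIONS = [
    "Motivation", "Composition", "Collection process", "Preprocessing",
    "Uses", "Distribution", "Maintenance", "Bias & limitations",
]

def render_datasheet(name: str, answers: dict[str, str]) -> str:
    lines = [f"# Datasheet: {name}", ""]
    for section in DATASHEET_SECTIONS:
        lines.append(f"## {section}")
        lines.append(answers.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)

sheet = render_datasheet("Product catalog (RAG)", {
    "Motivation": "Ground the support agent's product answers.",
    "Maintenance": "Synced nightly from the catalog service.",
})
print(sheet)
```

The remaining TODO markers double as a review checklist: a datasheet is not done until every question in the table has an answer.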

Algorithmic Impact Assessments

An algorithmic impact assessment is a pre-deployment risk evaluation that documents the potential impact of an AI system on affected individuals and groups.

When Required

| Scenario | Required? |
| --- | --- |
| High-risk AI system (Annex III) | Mandatory before deployment |
| Limited-risk consumer-facing AI | Recommended as best practice |
| Minimal-risk internal tool | Optional but valuable for governance |
| Any system processing personal data | Required as part of DPIA under GDPR (separate but complementary) |

Impact Assessment Structure

  1. System description: What the AI does, who it affects, deployment context
  2. Purpose and necessity: Why AI is used (vs. alternatives), proportionality assessment
  3. Affected populations: Who is impacted, directly and indirectly, including vulnerable groups
  4. Risk analysis:
    • Accuracy risks: what happens when the AI is wrong?
    • Fairness risks: does it treat different groups differently?
    • Safety risks: can it cause physical, financial, or psychological harm?
    • Privacy risks: what personal data is processed, how, and why?
    • Autonomy risks: does it manipulate or unduly influence decisions?
  5. Mitigation measures: What controls are in place (guardrails, human oversight, fallbacks)
  6. Human oversight: Who monitors the system, how can they intervene, escalation path
  7. Monitoring plan: How will ongoing risks be tracked
  8. Review schedule: When will this assessment be revisited
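The risk analysis step lends itself to a simple scoring pass. A hedged sketch, assuming a 0–5 severity scale and a threshold of 3 (both illustrative choices, not prescribed by any regulation), that flags an assessment for mandatory human review when any category scores too high:

```python
# The five risk categories from the impact assessment structure above.
RISK_CATEGORIES = ("accuracy", "fairness", "safety", "privacy", "autonomy")

def needs_human_review(risks: dict[str, int], threshold: int = 3) -> bool:
    """Severity scale 0 (negligible) to 5 (critical); any category at or
    above the threshold flags the system for review before deployment.
    Scale and threshold are illustrative assumptions."""
    unknown = set(risks) - set(RISK_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown risk categories: {sorted(unknown)}")
    return any(risks.get(cat, 0) >= threshold for cat in RISK_CATEGORIES)

assessment = {"accuracy": 2, "fairness": 1, "privacy": 4}
print(needs_human_review(assessment))
```

Rejecting unknown categories matters in practice: a typo in a category name should fail loudly rather than silently score as zero risk.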

System Cards

For multi-component AI systems (agent + tools + models + RAG), a system card documents the end-to-end system, not just individual models.

A system card covers:

  • Architecture: How components connect (agent orchestrator, LLM, tools, RAG, guardrails)
  • Data flow: What data enters the system, how it flows between components, what exits
  • Decision chain: How the system makes decisions (which component decides what)
  • Failure modes: What happens when each component fails (model timeout, tool error, guardrail trigger)
  • Security boundaries: Authentication, authorization, data isolation between components

This is particularly relevant for agentic AI platforms where a customer support agent might chain multiple LLM calls, tool invocations, and sub-agent delegations.
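The failure-modes section of a system card is especially useful when it is machine-readable, because on-call engineers can look up the documented fallback for a failing component instead of guessing. A minimal sketch, where all component names and fallback behaviors are hypothetical examples for the support-agent system described earlier:

```python
# Hypothetical system card excerpt: documented fallback per component.
# An unlisted component is itself a documentation gap worth flagging.
FAILURE_MODES = {
    "llm": "Retry once, then return a static apology and open a ticket",
    "order_lookup": "Tell the user the order system is unavailable",
    "rag": "Answer from the system prompt only and flag low confidence",
    "guardrail": "Fail closed: block the response and hand off to a human",
}

def documented_fallback(component: str) -> str:
    return FAILURE_MODES.get(component, "UNDOCUMENTED -- update the system card")

print(documented_fallback("rag"))
print(documented_fallback("payment_tool"))
```

Note the guardrail entry fails closed: when the safety component itself breaks, the safe default is to block rather than to pass responses through unchecked.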


Annex IV: EU AI Act Technical Documentation Requirements

For high-risk AI systems, the EU AI Act Annex IV specifies mandatory technical documentation. This is the legal minimum – model cards and impact assessments typically exceed these requirements.

| Annex IV Requirement | What to Document |
| --- | --- |
| General description | Intended purpose, provider identity, version, hardware/software dependencies |
| Detailed description | Development methodology, design decisions, system architecture, computational resources |
| Monitoring, functioning, control | Human oversight capabilities, logging, monitoring approach |
| Risk management | Known risks, risk mitigation measures, residual risks |
| Data governance | Training data description, data preparation, bias examination, data quality measures |
| Performance metrics | Accuracy, robustness, cybersecurity, discriminatory impact, performance vs. specific persons/groups |
| Post-market monitoring | Planned monitoring after deployment, update procedures |

Connecting Documentation to Observability

Documentation is not a one-time exercise. The metrics in your model card should be the same metrics tracked in your observability stack.

```
Model Card                     Observability Stack
┌──────────────────┐           ┌──────────────────┐
│ Performance:     │    <-->   │ Dashboard:       │
│  Task success 92%│           │  Task success    │
│  TTFT P95 800ms  │           │  TTFT P95        │
│  Halluc. rate 3% │           │  Halluc. rate    │
├──────────────────┤           ├──────────────────┤
│ Limitations:     │    <-->   │ Alerts:          │
│  Fails on X      │           │  Detect X failure│
├──────────────────┤           ├──────────────────┤
│ Monitoring plan  │    <-->   │ Eval pipeline:   │
│  Weekly evals    │           │  Scheduled runs  │
└──────────────────┘           └──────────────────┘
```

When production metrics drift from the documented performance, the drift triggers one of two responses:

  • A model card update (if the drift is acceptable/expected)
  • An investigation and fix (if the drift indicates regression)
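This comparison is easy to automate. A minimal sketch (metric names, values, and the 10% relative tolerance are all assumptions) that checks live observability values against the figures documented in the model card and reports which metrics have drifted:

```python
def drifted_metrics(documented: dict[str, float],
                    production: dict[str, float],
                    tolerance: float = 0.10) -> list[str]:
    """Return metric names whose production value deviates from the
    documented value by more than `tolerance` (relative)."""
    drifted = []
    for name, doc_value in documented.items():
        prod_value = production.get(name)
        if prod_value is None:
            continue  # metric not monitored; a gap worth closing separately
        if doc_value == 0:
            if prod_value != 0:
                drifted.append(name)
        elif abs(prod_value - doc_value) / abs(doc_value) > tolerance:
            drifted.append(name)
    return drifted

# Illustrative values echoing the diagram above.
card = {"task_success": 0.92, "ttft_p95_ms": 800, "hallucination_rate": 0.03}
live = {"task_success": 0.90, "ttft_p95_ms": 1100, "hallucination_rate": 0.031}
print(drifted_metrics(card, live))
```

Run on a schedule, each reported metric becomes either a model card update or an investigation ticket, which keeps the documentation and the production system from silently diverging.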

This post is licensed under CC BY 4.0 by the author.