Platform Engineering & Self-Service

A self-service platform is not an API, a Kubernetes cluster, or a Backstage deployment. It is a product — with customers, a roadmap, an SLA, and adoption metrics — whose job is to reduce the cognitive load on delivery teams so they can ship business value without rebuilding infrastructure. If engineers have to file a ticket, it is not self-service; it is centralised ops wearing a new hoodie.

What a Self-Service Platform Actually Means

Most “platforms” are not self-service. They are centralised Ops teams with better branding — a queue of tickets, an escalation path, and a bottleneck that scales linearly with delivery-team headcount.

A real self-service platform has three properties:

  1. The delivery team performs the operation without coordinating with the platform team. No ticket, no Slack ping, no approval meeting. Just terraform apply, kubectl apply, or a one-line pipeline config.
  2. The platform team does not know it happened until they look at a dashboard. If provisioning a new service requires the platform team to be in the loop, it is not self-service.
  3. Failure modes are surfaced to the consumer, not the platform team. When a delivery team misconfigures something, the platform’s guardrails catch it before it reaches the platform team’s on-call rotation.

The litmus test: If the platform team went on holiday for two weeks, would delivery teams still be able to ship? If the answer is no, the platform has not graduated from consulting to product.

Why this is hard with multiple products

A platform is rarely one product. A mature Foundation Team ships a portfolio:

| Product | Examples |
|---|---|
| Infrastructure-as-code modules | Terraform modules for GKE, Cloud SQL, Pub/Sub, IAM |
| Runtime platform | Multi-tenant Kubernetes with GitOps, namespace provisioning, base Helm charts |
| Delivery pipeline | CI/CD templates (GitHub Actions reusable workflows), build/test/deploy paved road |
| Observability | Logging pipeline, metrics stack, tracing, golden-signal dashboards, SLO framework |
| Governance & policy | OPA/Conftest policies, cost guardrails, disaster-recovery conventions, compliance evidence |
| Secrets & identity | Secret management, service-to-service auth, workload identity |

Each product has its own maturity curve, its own consumers, its own failure modes. Scaling a platform team means running these as a portfolio — deliberately choosing which products to invest in, which to keep in steady state, and which to sunset.


Team Topologies — The Vocabulary

Skelton & Pais’s Team Topologies (2019, 2nd ed. 2025) is the lingua franca of platform engineering. It defines four team types and three interaction modes. Know this vocabulary cold — it is the basis for every serious platform discussion.

Four team types

| Type | Purpose | Example |
|---|---|---|
| Stream-aligned | Owns a continuous flow of work for a business domain, end-to-end. The default team type — every other type exists to reduce its cognitive load. | A Subscription delivery team owning checkout-to-renewal |
| Platform | Builds and runs an internal product consumed by stream-aligned teams to accelerate their delivery. | Foundation Team shipping Terraform modules, K8s platform, CI/CD templates |
| Enabling | Short-lived specialists who coach a stream-aligned team on a new capability, then step out once the team is self-sufficient. | SRE embeds for two sprints to coach a team through SLO adoption |
| Complicated-subsystem | Deep specialists owning a component most engineers cannot reasonably understand. | ML inference engine team, cryptographic signing service |

Three interaction modes

How teams interact matters as much as which type they are.

| Mode | Description | When to use |
|---|---|---|
| X-as-a-Service | Consumer uses a product with minimal friction — API, UI, templates, docs. No ongoing coordination. | Steady state between a mature platform and its consumers |
| Collaboration | High-bandwidth joint work on a shared problem, time-boxed. Expensive in cognitive load. | Discovering a new capability with a design-partner delivery team |
| Facilitating | A platform engineer helps another team adopt a new capability — paired work, then hand-off. Temporary by design. | Onboarding subsequent delivery teams to a capability that is past discovery |

The trajectory for any platform capability should be: Collaboration (discover with pilots) → Facilitating (onboard teams) → X-as-a-Service (steady state). A platform team stuck in perpetual Collaboration mode is doing consulting, not platform engineering.


Platform-as-a-Product

This is the discipline that separates a platform from a shared-services team. You run the platform as if delivery teams were paying customers and you were fighting a competitor for their business.

What it demands

| Practice | What it looks like |
|---|---|
| A product manager / TPO | One person owns the platform roadmap, prioritisation, and adoption outcomes. Not a part-time responsibility. |
| A named customer segment | “Our customers are the 5 Subscription delivery teams” — not “everyone at the company”. Precision matters. |
| A roadmap with OKRs | Adoption targets, lead-time improvements, reliability SLOs. Published, visible, reviewed quarterly. |
| User research | Regular developer surveys, office hours, design-partner programmes, friction logs. Platform engineers watch delivery engineers use the platform and learn where the paved road has potholes. |
| A pricing signal | Even if not monetary — time, cognitive load, adoption friction. “Using our paved road costs 0.5 days; rolling your own costs 5 weeks.” |
| Marketing | Internal blog posts, launch comms, migration guides, office hours. The best platform in the world fails if nobody knows it exists. |
| Success metrics | Adoption rate, time-to-first-success, NPS, lead-time-for-changes improvement. Not tickets resolved. |

The competitor you are fighting

Every delivery team has an alternative to your platform: rolling their own. The cost of rolling their own is what you are competing against. If your platform is more expensive (in cognitive load, setup time, flexibility cost) than “just copy-paste from the other team’s repo”, you lose — and you should lose.


Paved Road / Golden Path

The output of Platform-as-a-Product is a paved road — an opinionated, well-lit path from git push to production. Teams that stay on the paved road inherit security, observability, compliance, and reliability defaults for free. They remain free to leave the paved road for edge cases, but the path itself is compelling enough that most choose it.

Principles of a good paved road

| Principle | What it means |
|---|---|
| Opinionated, not totalitarian | The paved road picks one way to do each thing — one logging stack, one CI system, one Terraform module layout. But teams can leave the road when they have a real reason. |
| Compelling, not mandatory | Adoption is pulled by developer experience, not pushed by compliance. Mandates are a sign the platform is not compelling enough. |
| Default-secure, default-observable, default-compliant | A service on the paved road gets logging, metrics, tracing, basic SLOs, security baselines, and audit trails without the delivery team doing anything extra. |
| Escape hatches | When a team leaves the road, they carry the extra burden — compliance paperwork, operational ownership, their own monitoring — visibly. The cost of being off-road is legible. |
| Progressive disclosure | A new team gets a “hello world” in 10 minutes. Advanced needs (custom autoscaling, bespoke IAM, zero-downtime migrations) are supported but not in the face of newcomers. |

Netflix coined “paved road”; Spotify calls it “golden path”; both describe the same pattern. Thoughtworks, Humanitec, and the CNCF Platforms White Paper all converge on this as the mature model.


The Maturity Trajectory for Each Product

Every platform product moves through the same curve. The Foundation Team’s job is to deliberately push each product along it.

Collaboration → Facilitating → X-as-a-Service
(discover)      (onboard)      (steady state)

Stage 1 — Collaboration (discover with design partners)

  • Goal: Figure out if this is a real problem and what the right-shaped solution looks like.
  • Shape: Platform engineers pair with one or two design-partner delivery teams. High-bandwidth, time-boxed. Everyone in the same Slack channel.
  • Output: A working version of the capability for those one or two teams. Heavy bespoke code, lots of manual support, no claim of generality.
  • Failure mode: Never-ending collaboration. Set an explicit time box (e.g., 2 sprints) and a definition of done for exit (“we understand the shape of the problem and have a working v1”).
  • Exit criterion: Design partner teams are using it and would miss it if it disappeared.

Stage 2 — Facilitating (onboard the next teams)

  • Goal: Onboard the remaining delivery teams by doing paired work that transfers ownership.
  • Shape: Platform engineer pairs with a delivery team for onboarding — typically 1–2 weeks. Platform engineer leaves once the team is self-sufficient.
  • Output: Adoption across most delivery teams. Documentation, templates, and migration guides mature as each team surfaces new edge cases.
  • Failure mode: Platform team becomes a permanent shadow member of every delivery team. Must be explicitly temporary.
  • Exit criterion: The next delivery team can adopt without platform-engineer involvement, following docs alone.

Stage 3 — X-as-a-Service (steady state)

  • Goal: Teams consume the product with zero coordination. Platform team focuses on reliability, roadmap, and new products.
  • Shape: Self-service APIs, CLI, UI, templates. Docs are the primary interface. Issues go through a queue that the platform team works async.
  • Output: Platform team spends its time on improving the product (performance, reliability, new features) rather than onboarding it.
  • Failure mode: “Platform” becomes a ticket queue that scales linearly with consumers. Watch for this — it is the single most common failure mode.
  • Exit criterion: There is no exit. This is steady state. But measure: are you spending < 20% of platform-team time on support? If not, the self-service surface has a hole.
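The 20% support threshold is easy to check if the team tracks where its time goes. A minimal sketch in Python; the time-tracking categories and figures are illustrative assumptions, not from the text:

```python
# Hypothetical time-tracking for one sprint, in engineer-days.
# The categories are illustrative: "support" covers tickets, ad-hoc
# questions, and pairing on consumer issues; the rest is product work.
time_log = {
    "support": 3.0,
    "reliability": 6.0,   # SLO work, on-call follow-up, upgrades
    "roadmap": 10.0,      # new features on existing products
}

total_days = sum(time_log.values())
support_share = time_log["support"] / total_days

# The Stage-3 health check: less than 20% of team time on support.
self_service_healthy = support_share < 0.20

print(f"support share: {support_share:.0%}, healthy: {self_service_healthy}")
```

A rising support share is the early-warning signal that the self-service surface has a hole somewhere.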

Portfolio view

Different products sit on different parts of the curve. A mature Foundation Team deliberately maintains a mix:

| Stage | Typical share of portfolio | Team-time signal |
|---|---|---|
| Collaboration | 10–20% — one new capability being discovered | 2–3 platform engineers heavily embedded with design partners |
| Facilitating | 20–30% — capabilities being rolled out broadly | Rotating onboarding work, weekly office hours |
| X-as-a-Service | 50–70% — mature products in steady state | Async support, reliability/roadmap work |

If 100% of capabilities are in Collaboration, the platform is doing consulting. If 100% are in X-as-a-Service, the platform is stagnant.


Ownership & RACI — The Gray Zones

The hardest questions in a platform operating model are the gray zones — capabilities that sit between Foundation and Delivery. Don’t try to invent a pure ownership rule; make the gray zones explicit and write them down.

The three flavours of ownership

| Flavour | Foundation owns | Delivery owns |
|---|---|---|
| Clear Foundation | Cluster upgrades, base Terraform modules, platform SLOs, shared observability stack, golden-path pipeline templates | Consuming them |
| Gray zone | The module / template / framework | Their specific instance / configuration / schema |
| Clear Delivery | Nothing | Application code, feature flags, app-specific secrets, business-logic-specific runbooks |

Common gray-zone rows that need explicit RACI:

  • Terraform modules. Foundation owns the module code + version lifecycle; Delivery owns calling the module with the right inputs for their service.
  • Kubernetes namespace-level network policies. Foundation owns the default-deny baseline; Delivery owns app-specific ingress/egress rules.
  • Database schema migrations. Foundation owns the CI pattern, rollback tooling, and online-migration guardrails; Delivery owns the SQL.
  • SLOs. Foundation owns the SLO framework, burn-rate alerting, and error-budget policy; Delivery owns the SLO values for their own service.
  • On-call rotation. Foundation on-call handles platform-layer alerts (cluster, pipeline, shared infra); Delivery on-call handles app-layer alerts.

The RACI template

For each capability, fill out:

| Capability | Build & operate | App-specific config | Day-2 incident | Current mode | Target mode |
|---|---|---|---|---|---|
| Terraform infra | Foundation | Delivery | Foundation → Delivery escalation | Collaboration | X-as-a-Service |
| K8s cluster | Foundation | Delivery (namespace) | Foundation | Facilitating | X-as-a-Service |
| CI/CD pipeline | Foundation (template) | Delivery (service config) | Delivery | Facilitating | X-as-a-Service |
| Observability stack | Foundation | Delivery (dashboards, SLOs) | Delivery (app) / Foundation (platform) | Collaboration | X-as-a-Service |
| Secrets & identity | Foundation | Delivery (secret values) | Foundation | Collaboration | Facilitating |

The template is a discussion artifact, not a decree. First draft is the platform team’s proposal; agreement comes from a joint session with delivery teams.
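Keeping the template machine-readable makes ownership gaps visible before an incident does. A minimal sketch, assuming a simple rule (illustrative, not prescribed by the text) that every capability must name an owner for every column:

```python
# The RACI template as data. Rows mirror the table above; each value
# names the owning team, and None (or a missing key) means "not yet agreed".
RACI_COLUMNS = ("build_operate", "app_specific_config", "day2_incident")

raci = {
    "Terraform infra": {
        "build_operate": "Foundation",
        "app_specific_config": "Delivery",
        "day2_incident": "Foundation, escalating to Delivery",
    },
    "CI/CD pipeline": {
        "build_operate": "Foundation (template)",
        "app_specific_config": "Delivery (service config)",
        "day2_incident": "Delivery",
    },
}

def raci_gaps(table):
    """Return (capability, column) pairs that have no agreed owner."""
    return [
        (capability, column)
        for capability, row in table.items()
        for column in RACI_COLUMNS
        if not row.get(column)
    ]

# An empty list means every gray zone is written down.
print(raci_gaps(raci))
```

Running the gap check in CI for the platform-docs repo keeps the joint session honest: a new capability cannot merge without owners.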


Scaling a Platform Team with Multiple Products

At small scale (a handful of engineers, one or two products), the whole platform team works on everything together. This breaks around 6–8 engineers or 3–4 distinct products. You need structure.

Team-of-teams patterns

| Pattern | When it fits | Watch out for |
|---|---|---|
| Single team, product leads | 4–8 engineers, 2–4 products. One engineer is the named lead for each product; the rest are generalists. | Bus factor — the lead becomes a silo. Rotate every 6 months. |
| Sub-teams per product | 10+ engineers, 4+ mature products. Each sub-team is 2–4 engineers with its own backlog and on-call. | Duplication of infra work. Needs a shared tech-lead forum across sub-teams. |
| Platform-for-platforms | 20+ engineers. A “platform for the platform teams” — IdP (internal developer portal), shared CI, shared observability — consumed by product-specific platform teams. | Only meaningful at large scale. Premature adoption creates unnecessary hierarchy. |

Funding model

The platform team needs a funding story. Three common models:

| Model | How it works | Pros | Cons |
|---|---|---|---|
| Central budget | Platform funded from central engineering budget. Services are “free” to delivery teams. | Simple. No transaction cost for adoption. | Platform team must justify budget to senior eng leadership, not consumers. |
| Chargeback | Delivery teams pay (in real or virtual currency) for platform usage. | Aligns cost with consumption. Makes platform value visible. | Heavy overhead. Risk of delivery teams rolling their own to “save money”. |
| Showback | Platform usage and cost reported to delivery teams without actual charge. | Visibility without the transaction overhead. Industry sweet spot. | Requires metering + reporting infrastructure. |

Most mature organisations run central budget + showback — paid centrally, usage visible to everyone.


Reliability & SRE Practices

Mature platforms adopt SRE practices even without a separate SRE team. The key ideas:

SLOs on the platform itself

The platform is a product. It should have SLOs. Example:

| SLO | Target | Meaning |
|---|---|---|
| Pipeline availability | 99.5% of runs succeed when the user’s code is correct | A broken pipeline is a platform failure, not a delivery-team problem |
| Time-to-first-deploy for new service | p95 < 2 hours from git init to production | Paved-road friction metric |
| Terraform plan latency | p95 < 30 seconds | Developer iteration loop |
| Cluster pod-schedule latency | p99 < 10 seconds | Platform responsiveness |

Error budgets govern change velocity

When the platform team is within its error budget, they can ship aggressively. When they burn it, they slow down — freezing non-critical changes until reliability recovers. This forces an honest trade-off between velocity and reliability, made visible to consumers.
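The arithmetic behind this is small enough to sketch. Assuming a success-rate SLO like the pipeline-availability target above, the budget is the allowed failure fraction, and the burn rate is the observed error rate divided by that budget (the window and run counts are illustrative):

```python
# Error-budget arithmetic for a success-rate SLO, e.g. the pipeline
# availability target: 99.5% of runs succeed, so 0.5% of runs may fail.
SLO_TARGET = 0.995
ERROR_BUDGET = 1.0 - SLO_TARGET   # allowed failure fraction

def burn_rate(failed: int, total: int) -> float:
    """Observed error rate over budgeted error rate; 1.0 = exactly on budget."""
    return (failed / total) / ERROR_BUDGET

# Illustrative window: 30 failed runs out of 4000.
rate = burn_rate(failed=30, total=4000)

# Burning faster than budgeted: slow down, freeze non-critical changes.
freeze_changes = rate > 1.0

print(f"burn rate {rate:.2f}, freeze non-critical changes: {freeze_changes}")
```

In practice the same ratio is computed over several windows (fast burn pages someone; slow burn opens a ticket), which is the standard SRE multi-window refinement.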

Platform on-call

The platform team has its own on-call rotation. Platform on-call handles:

  • Cluster-level incidents (node failures, control-plane issues)
  • Pipeline outages
  • Shared observability stack issues
  • Provider (cloud, SaaS) outages

It does not handle:

  • App-level bugs
  • Business-logic failures
  • Delivery-team-owned config errors

The boundary is set by whose code owns the failure — not who is easiest to page.


Measuring a Platform

The DX chapter covers measurement in general (see 07. Developer Experience & Productivity). Platform-specific additions:

| Metric | What it tells you |
|---|---|
| Adoption rate | % of delivery teams / services on the paved road. Trending up = platform is compelling. Stagnant = platform has a gap. |
| Time-to-first-success | Minutes from “new delivery team / engineer onboards” to “first deploy via platform”. The most honest DX metric. |
| Lead-time-for-changes (DORA) | For services on the paved road. Should be substantially better than services off it — that is the point. |
| Platform NPS | Quarterly survey: “How likely are you to recommend our platform to a new delivery team?” One number that compresses satisfaction. |
| Support-ticket-to-feature-request ratio | High support volume = self-service is leaking. Low support but few feature requests = stagnation. |
| Paved-road escape count | How many services went off-road this quarter, and why. Each escape is a product-discovery signal. |
| Error-budget burn | How often the platform’s own SLOs are violated. Trend over quarters. |

Do not measure the platform by tickets resolved or engineers hired. Those are inputs, not outcomes.
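Two of these metrics reduce to a few lines. NPS counts promoters (scores 9 and 10) minus detractors (0 through 6) as a percentage of respondents; adoption rate is paved-road services over total services. A sketch with invented survey data:

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

def adoption_rate(services_on_paved_road, total_services):
    """Share of services on the paved road."""
    return services_on_paved_road / total_services

# Invented quarterly survey from 10 delivery engineers:
survey = [9, 10, 8, 7, 9, 6, 10, 8, 9, 4]

print(f"platform NPS: {nps(survey):+.0f}")        # 5 promoters, 2 detractors
print(f"adoption: {adoption_rate(18, 24):.0%}")   # 18 of 24 services
```

The absolute numbers matter less than the quarter-over-quarter trend.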


Anti-Patterns

| Anti-pattern | Symptom | Fix |
|---|---|---|
| Platform as ticket queue | Every delivery-team operation routes through a Jira ticket to platform. | Find the top 3 ticket categories; build self-service for each; measure ticket rate going down. |
| Perpetual Collaboration | Platform engineers permanently embedded in delivery teams; no self-service surface. | Time-box collaboration (e.g., 2 sprints). Exit criterion must be a self-service product, not a permanent pairing. |
| Ivory tower platform | Platform ships capabilities nobody asked for; low adoption. | Design-partner programme — every new capability starts with 1–2 pilot delivery teams before broad rollout. |
| Mandated paved road | Compliance forces adoption; delivery teams resent it and find loopholes. | Make it compelling, not mandatory. If you need a mandate, the paved road is not good enough. |
| Zero escape hatches | Paved road cannot accommodate edge cases; teams go dark and build shadow platforms. | Explicit off-road path with extra burden (teams own their own monitoring, compliance, operability). Make off-road visible, not forbidden. |
| Platform without a product manager | Nobody owns the roadmap; features are whatever the loudest delivery team asks for. | Named TPO (or product manager). Single throat to choke for prioritisation. |
| Scope creep by acquisition | Platform accumulates random services that nobody else wanted. | Explicit product entry and exit criteria. Platform can sunset products. |
| No platform on-call | Platform issues surface as delivery-team outages; no accountability for platform reliability. | Platform has its own on-call and its own SLOs. Platform outages are postmortemed by platform, not by delivery. |
| Chargeback without showback experience | Delivery teams see line-item bills; react by rolling their own. | Start with showback only. Add chargeback only if cost-allocation is a board-level concern. |
| Platform team as “Ops rebranded” | Reorg that renamed a ticket-driven Ops team to “Platform Engineering” without changing behaviour. | Platform-as-a-Product requires product discipline, not a name change. New org → new operating model → new metrics. |

Real-World Application

Netflix — “Paved road” originating pattern

Netflix’s platform team (originally led by Adrian Cockcroft) coined “paved road” to describe the tension between central standards and team autonomy. The paved road at Netflix:

  • Opinionated defaults for logging, metrics, deployment (Spinnaker), service discovery (Eureka)
  • Teams free to leave the road, but carry the operational burden
  • Central platform team small (dozens), servicing hundreds of delivery teams
  • Explicit escape hatches for teams that need them (Data team ran different stack for years)

Lesson: Compelling, not mandatory works at scale.

Spotify — Backstage and the golden path

Spotify’s internal developer portal (Backstage, open-sourced 2020) is the concrete implementation of “golden path”:

  • Service catalog: every service listed with owner, docs, health, dependencies
  • Software templates: new-service bootstrapping in minutes
  • TechDocs: documentation aggregated across all repos, searchable
  • Plugins: CI/CD status, on-call, API docs, cost — everything in one portal

Backstage is heavy for a 16-engineer org, but the concept — a single pane of glass for “what services exist, who owns them, where are the docs, how do I bootstrap a new one” — is valuable at any scale.

Amazon — “You build it, you run it” with platform scaffolding

Werner Vogels coined “you build it, you run it”. At Amazon this is not an alternative to platforms — it sits on top. Every two-pizza team owns its services end-to-end, including on-call, but does so on a rich platform of shared primitives (AWS itself, internal CI/CD, internal observability).

Lesson: team ownership and platform leverage are complementary, not competing. The platform makes ownership affordable.

Shopify — DX as the north star

Shopify treats DX as the primary platform-team metric:

  • Quarterly developer survey, 7 questions
  • Separate NPS per internal tool
  • Surprise finding: CI reliability beat CI speed as the top predictor of developer satisfaction. Flaky pipelines are more damaging than slow ones.

Lesson: Measure what the customer feels, not what the platform team can count.

DORA + CNCF + Thoughtworks Radar — converging industry consensus

DORA’s 2024 State of DevOps Report, the CNCF Platforms White Paper, and successive Thoughtworks Technology Radar editions all converge on the same pattern:

Platform-as-a-Product + X-as-a-Service as the default interaction mode + “You build it, you run it” on top + SRE practices (SLOs, error budgets).

If you are proposing something that conflicts with this consensus, you need a strong reason.


Starting Positions for a Small Platform Team

A 4-engineer Foundation Team serving ~5 delivery teams cannot implement the full pattern at once. Starting positions that give leverage:

  1. Pick one capability and take it all the way to X-as-a-Service. Better to have one truly self-service product than five half-finished ones. The most-common starting point: CI/CD paved road (reusable pipeline templates + golden-path deploy).
  2. Run a design-partner programme with 1–2 delivery teams. Explicit scope, explicit time box, explicit exit criterion. Do not call it “a project”; call it “discovery collaboration”.
  3. Write the RACI for gray zones before the first incident. Terraform modules, Kubernetes namespace policies, database migrations — get these on paper with delivery TLs before something breaks.
  4. Set platform SLOs on the capabilities that exist. Even crude ones (“pipeline green-rate > 95%”) force the platform team to own the consumer experience.
  5. Run a quarterly platform NPS survey. Even 5 questions is enough. Publish the result internally. Act on the bottom two items by the next quarter.
  6. Fight the urge to build an IdP / Backstage early. A README.md in a platform-docs repo is enough until 10+ engineers. Premature portal investment is the classic platform-team time sink.
  7. Decide which products you will not own. A platform that tries to own everything fails. Be explicit about what delivery teams must own (app SLOs, feature flags, business-logic runbooks).
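Even the crude green-rate SLO from item 4 is worth computing automatically from day one. A minimal sketch, assuming run outcomes are recorded as booleans:

```python
# Run outcomes recorded as booleans: True means the pipeline run was green.
def green_rate(runs):
    """Fraction of pipeline runs that succeeded."""
    return sum(runs) / len(runs)

GREEN_RATE_SLO = 0.95   # the crude starting target from item 4

runs = [True] * 96 + [False] * 4   # illustrative: 96 green runs out of 100
rate = green_rate(runs)
print(f"green rate {rate:.0%}, SLO met: {rate >= GREEN_RATE_SLO}")
```

Crude or not, publishing this number weekly makes the platform team own the consumer experience rather than the infrastructure.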

References

Books

  • Skelton, M. & Pais, M. (2025). Team Topologies: Organizing Business and Technology Teams for Fast Flow, 2nd ed. — IT Revolution. The foundational text. Read cover-to-cover.
  • Fournier, C. & Nowland, I. (2024). Platform Engineering: A Guide for Technical, Product, and People Leaders — O’Reilly. The most current practitioner book. Explicitly aimed at leaders.
  • Beyer, B. et al. Site Reliability Engineering & The SRE Workbook — Google / O’Reilly. SLO and error-budget foundations.
  • Forsgren, N., Humble, J. & Kim, G. (2018). Accelerate — IT Revolution. DORA metrics and the research that backs platform ROI claims.
  • Larson, W. (2019). An Elegant Puzzle — Stripe Press. Chapters on platform teams and tooling investment.
  • Humble, J. & Farley, D. (2010). Continuous Delivery — Addison-Wesley. The original paved-road thinking applied to delivery pipelines.

Articles and talks

  • Evan Bottcher — What I Talk About When I Talk About Platforms — martinfowler.com/articles/talk-about-platforms.html. The shortest complete definition of platform-as-a-product.
  • Manuel Pais — Mind the Platform Execution Gap — martinfowler.com/articles/platform-prerequisites.html. The most honest piece on why platforms fail.
  • Martin Fowler — Team Topologies bliki — martinfowler.com/bliki/TeamTopologies.html.
  • Netflix — Full Cycle Developers at Netflix — netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249. “You build it, you run it” in practice.
  • Spotify Engineering — How We Use Backstage — backstage.io. Concrete implementation of the golden path.
  • Charity Majors — The Engineer/Manager Pendulum and observability talks — charity.wtf. Strong voice on SRE practice for platform teams.

Frameworks and reference architectures

  • CNCF Platforms White Paper — tag-app-delivery.cncf.io/whitepapers/platforms.
  • CNCF Platform Engineering Maturity Model — tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model. Diagnostic for where your platform actually is.
  • DORA — Platform Engineering capability — dora.dev/capabilities/platform-engineering.
  • Humanitec — GCP Platform Reference Architecture — humanitec.com/reference-architectures/gcp. Concrete blueprint for a production platform on GCP.
  • DevOps Topologies — Anti-Types catalogue — web.devopstopologies.com/#anti-types. Pattern library of what not to do.

Reports

  • DORA 2024 State of DevOps Report — services.google.com/fh/files/misc/2024_final_dora_report.pdf.
  • Puppet — State of Platform Engineering Vol. 4 — platformengineering.org/blog/announcing-the-state-of-platform-engineering-vol-4.
  • Thoughtworks Technology Radar — thoughtworks.com/radar. Scan for platform-engineering entries (paved roads, IdPs, developer portals) in recent editions.

Videos and courses

  • Manuel Pais — Platform as a Product — platformengineering.org/talks-library/platform-as-a-product.
  • Matthew Skelton — Team Topologies talks at DevOps Enterprise Summit and QCon — searchable on infoq.com and YouTube.
  • Team Topologies YouTube channel — youtube.com/@TeamTopologies. Short conceptual videos from Skelton & Pais.
  • PlatformCon — platformcon.com. Annual online conference; all talks free on-demand. Start with the “Platform-as-a-Product” track.
  • Charity Majors & Liz Fong-Jones — various talks on observability and SRE practice for platforms — searchable on youtube.com.
  • Kelsey Hightower — Kubernetes and platform talks — github.com/kelseyhightower for indices, videos on YouTube. Opinionated and worth the time.

Communities

  • Platform Engineering community — platformengineering.org. Slack group, talks, maturity resources.
  • InnerSource Commons — innersourcecommons.org. Adjacent discipline, relevant to cross-team contribution models on platforms.

This post is licensed under CC BY 4.0 by the author.