Engineering Excellence & Quality

Quality is not a phase or a role — it is a system property. An EM’s job is not to inspect quality into the product but to design systems, processes, and culture where quality is the default outcome.

Code Review Culture

Why Code Review Matters (and Why It Often Fails)

Code review serves four functions, in order of importance:

  1. Knowledge sharing — spreading context across the team
  2. Defect detection — catching bugs before production
  3. Design feedback — improving architecture and patterns
  4. Standards enforcement — consistent style and conventions

Most teams treat it as primarily #2 (bug catching), which is why reviews feel like a chore. Microsoft Research (Bacchelli & Bird, 2013) found that reviews catch only ~15% of defects; the primary value is knowledge sharing and mentoring.

Code Review Principles

| Principle | Why | How |
|-----------|-----|-----|
| Review within 4 hours | Long review queues kill flow and create context switching | Set a team norm: reviews are interruptible work, done between focus blocks |
| Small PRs (< 400 lines) | Large PRs get rubber-stamped; small PRs get real feedback | PR size limit in CI (sketch below); split features using feature flags |
| Review the design, not the syntax | Linters handle formatting; humans should focus on logic and architecture | Automate style checks; focus review comments on behavior and tradeoffs |
| Two approvals for critical paths | Payment, auth, and data deletion need more eyes | Define "critical paths" explicitly; automate with CODEOWNERS |
| Author provides context | Reviewers should not need to reverse-engineer intent | PR template: what, why, how to test, screenshots for UI |
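
To make the small-PR norm enforceable rather than aspirational, the limit can live in CI. A minimal sketch in TypeScript, assuming your CI step populates the diff stats as environment variables first (the variable names here are hypothetical, not a built-in of any CI platform):

```typescript
// ci/check-pr-size.ts -- fail the build when a PR exceeds the size limit.
// PR_ADDITIONS / PR_DELETIONS are hypothetical env vars that a prior CI
// step must populate (e.g., from your code host's PR API).
const LIMIT = 400;

const additions = Number(process.env.PR_ADDITIONS ?? 0);
const deletions = Number(process.env.PR_DELETIONS ?? 0);
const total = additions + deletions;

if (total > LIMIT) {
  console.error(
    `PR touches ${total} lines (limit ${LIMIT}). ` +
      `Split it, or hide incomplete work behind a feature flag.`
  );
  process.exit(1);
}
console.log(`PR size OK: ${total}/${LIMIT} lines.`);
```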

Code Review Anti-Patterns

| Anti-Pattern | Impact | Fix |
|--------------|--------|-----|
| Nitpicking | Demoralizes authors, slows reviews | Prefix nits with "nit:" so the author can ignore them |
| Gatekeeping | One senior blocks all PRs and becomes a bottleneck | Rotate reviewers; trust juniors to review (they learn from it) |
| Rubber stamping | "LGTM" on an 800-line PR in 2 minutes | Require meaningful comments on PRs above a size threshold |
| Review as battleground | Architecture debates in PR comments | Separate design review (before coding) from code review (after) |

For Cross-Functional Teams (FE, BE, QA, Data)

With 16 engineers across disciplines, code reviews have an additional challenge: can a backend engineer meaningfully review a frontend PR?

Practical approach:

  • Same-discipline primary reviewer — someone who understands the idioms and patterns
  • Cross-discipline secondary reviewer — for integration points, API contracts, data flow
  • QA reviews test coverage — not as a gate but as a quality conversation

Testing Strategy — The Pyramid and Beyond

The Testing Pyramid

```
        /  E2E  \           Few, slow, expensive, high confidence
       /----------\
      / Integration \       Moderate count, moderate speed
     /----------------\
    /    Unit Tests     \   Many, fast, cheap, low confidence per test
   /____________________\
```

| Layer | What It Tests | Speed | Maintenance Cost | Confidence |
|-------|---------------|-------|------------------|------------|
| Unit | Single function/class in isolation | Milliseconds | Low | Low per test, high in aggregate |
| Integration | Components working together (API + DB, service + service) | Seconds | Medium | Medium-high |
| E2E | Full user flow through the system | Minutes | High (brittle) | High for the happy path |
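
To ground the bottom layer of the pyramid: "milliseconds, isolation, aggregate confidence" looks like this in practice. A minimal sketch using Node's built-in test runner (run via tsx or after compiling); `applyDiscount` is a hypothetical business rule, not from the text above:

```typescript
// Unit layer: a pure function tested in isolation, milliseconds per test.
import test from 'node:test';
import assert from 'node:assert/strict';

// Hypothetical business rule under test.
function applyDiscount(totalCents: number, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError('percent out of range');
  return Math.round(totalCents * (1 - percent / 100));
}

test('applies a 10% discount', () => {
  assert.equal(applyDiscount(1000, 10), 900);
});

test('rejects out-of-range percentages', () => {
  assert.throws(() => applyDiscount(1000, 150), RangeError);
});
```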

The Testing Trophy (Kent C. Dodds)

An alternative to the pyramid that emphasizes integration tests:

```
        /  E2E  \
       /----------\
      / Integration \       <-- Biggest layer
     /----------------\
    / Static Analysis   \
   /____________________\
```

The argument: unit tests often test implementation details and break during refactoring without catching real bugs. Integration tests catch more real issues per test. Static analysis (TypeScript, ESLint) catches an entire category of bugs automatically.

Testing Strategy for a 16-Person Org

| Discipline | Unit Tests | Integration Tests | E2E Tests |
|------------|------------|-------------------|-----------|
| Frontend | Component tests (React Testing Library), utility functions | API integration tests with MSW (Mock Service Worker; sketch below) | Critical user journeys (Cypress/Playwright), max 10-20 |
| Backend | Business logic, validation, transformers | API tests against a real DB (Testcontainers), service integration | Same E2E suite as frontend (shared) |
| QA | Owns E2E test suite maintenance | Exploratory testing for edge cases | Defines test scenarios, automates critical paths |
| Data | Pipeline unit tests (dbt tests, Great Expectations) | Data quality checks on staging | End-to-end data flow validation |
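
For the frontend row, a minimal sketch of an MSW-backed integration test: the real fetch logic runs, only the network boundary is mocked. This assumes MSW v2 and a Jest-style runner that provides `beforeAll`/`afterEach`/`afterAll`/`test`/`expect` as globals; the `/api/orders` endpoint is hypothetical:

```typescript
// Frontend integration layer: real fetch logic, mocked network boundary.
import { http, HttpResponse } from 'msw';
import { setupServer } from 'msw/node';

const server = setupServer(
  // Hypothetical endpoint the UI depends on.
  http.get('http://localhost/api/orders', () =>
    HttpResponse.json([{ id: '1', status: 'shipped' }])
  )
);

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

test('order list fetch parses the API response', async () => {
  const resp = await fetch('http://localhost/api/orders');
  expect(resp.ok).toBe(true);
  const orders = await resp.json();
  expect(orders[0].status).toBe('shipped');
});
```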

The critical discipline: Test the contracts between components, not the internals. When Squad A’s service calls Squad B’s service, the integration test at the boundary is worth more than 50 unit tests inside either service.
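
A sketch of what such a boundary contract test can look like, using Node's built-in runner; the staging URL and field names are hypothetical stand-ins for whatever the producing squad's service actually exposes:

```typescript
// Contract test at the squad boundary: assert the shape the consumer
// depends on, never the producer's internals.
import test from 'node:test';
import assert from 'node:assert/strict';

test('orders service honours the consumer contract', async () => {
  // Hypothetical staging endpoint owned by the other squad.
  const resp = await fetch('http://orders.staging.internal/v1/orders/123');
  assert.equal(resp.status, 200);

  const body = await resp.json();
  // The contract: these fields exist with these types; extra fields are fine.
  assert.equal(typeof body.id, 'string');
  assert.equal(typeof body.totalCents, 'number');
  assert.ok(['pending', 'shipped', 'cancelled'].includes(body.status));
});
```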

Test Quality Metrics

| Metric | Target | Why |
|--------|--------|-----|
| Code coverage | 70-80% on critical paths; do not chase 100% | Diminishing returns above 80%; 100% coverage incentivizes trivial tests |
| Test execution time | < 10 min for the full suite on CI | Slow tests get skipped or never run locally |
| Flaky test rate | < 1% of test runs (sketch below) | Flaky tests erode trust; teams learn to ignore failures |
| Test-to-code ratio | Varies (roughly 1:1 to 2:1) | More useful as a trend than as an absolute number |
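
Measuring the flaky rate requires a definition. A common one, sketched below: a test is flaky if the same commit produced both a pass and a fail. The record shape is an assumption about what your CI can export:

```typescript
// Flaky-test rate: share of (test, commit) pairs with contradictory outcomes.
interface TestRun {
  testId: string;
  commit: string;
  passed: boolean;
}

function flakyRate(runs: TestRun[]): number {
  const outcomes = new Map<string, Set<boolean>>();
  for (const run of runs) {
    const key = `${run.testId}@${run.commit}`;
    if (!outcomes.has(key)) outcomes.set(key, new Set());
    outcomes.get(key)!.add(run.passed);
  }
  const pairs = [...outcomes.values()];
  const flaky = pairs.filter((seen) => seen.size === 2).length; // both pass and fail
  return pairs.length === 0 ? 0 : flaky / pairs.length;
}
```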

CI/CD Maturity

Maturity Levels

| Level | Description | Characteristics |
|-------|-------------|-----------------|
| 0 — Manual | Build and deploy by hand | "It works on my machine"; deployment instructions in a wiki nobody maintains |
| 1 — CI basics | Automated build and test on every PR | Jenkins/GitHub Actions runs tests, but deployment is still manual |
| 2 — Continuous Delivery | Automated pipeline to staging; manual promotion to production | One-click deploy to production, release trains |
| 3 — Continuous Deployment | Every merged PR goes to production automatically | Feature flags control rollout, canary deployments, automated rollback |
| 4 — Progressive Delivery | Automated canary analysis, percentage-based rollout, automatic rollback | Metrics-driven promotion, SLO-based gates |

Most 16-person teams should target Level 2-3. Level 4 requires significant investment in observability and automation that pays off at larger scale.

CI/CD Practices

| Practice | Impact | Implementation |
|----------|--------|----------------|
| Trunk-based development | Reduces merge conflicts, enables CI | Short-lived branches (< 1 day); feature flags for incomplete work (sketch below) |
| Build time < 10 min | Developers wait on CI; slow CI means slow feedback | Parallelize tests, cache dependencies, split test suites |
| Automated security scanning | Catches vulnerabilities before production | SAST/DAST in the pipeline (Snyk, Trivy, SonarQube) |
| Environment parity | "Works in staging" means "works in production" | Containers, infrastructure as code, identical configs |
| Deployment frequency | Higher frequency means smaller changes and lower risk | Elite teams deploy multiple times per day (DORA research) |
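
Since both trunk-based development and continuous deployment lean on feature flags, here is a minimal percentage-rollout sketch. The flag name and hashing scheme are illustrative, not a specific flag library's API:

```typescript
// Minimal feature flag with percentage rollout: incomplete work merges to
// trunk behind the flag; the rollout percentage controls exposure.
const ROLLOUT_PERCENT: Record<string, number> = {
  'new-checkout': 25, // hypothetical flag: 25% of users see the new path
};

// Stable hash so a given user always lands in the same bucket (0-99).
function bucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

export function isEnabled(flag: string, userId: string): boolean {
  return bucket(userId) < (ROLLOUT_PERCENT[flag] ?? 0);
}

// Call site: both code paths live in trunk; the flag decides at runtime.
// isEnabled('new-checkout', user.id) ? renderNewCheckout() : renderOldCheckout();
```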

DORA Metrics

The four key metrics from the Accelerate research (Forsgren, Humble, Kim), validated across thousands of organizations:

| Metric | Elite | High | Medium | Low |
|--------|-------|------|--------|-----|
| Deployment Frequency | On-demand (multiple/day) | Weekly to monthly | Monthly to every 6 months | Every 6+ months |
| Lead Time for Changes | < 1 hour | 1 day to 1 week | 1 week to 1 month | 1 to 6 months |
| Mean Time to Recovery (MTTR) | < 1 hour | < 1 day | 1 day to 1 week | 1 week to 1 month |
| Change Failure Rate | 0-15% | 16-30% | 16-30% | 46-60% |

Why These Metrics Matter for an EM

These metrics are not just for DevOps teams. They measure your team’s ability to deliver value safely:

  • Deployment frequency reflects batch size and pipeline health
  • Lead time reflects process efficiency and team autonomy
  • MTTR reflects observability, incident response, and system understanding
  • Change failure rate reflects testing, review, and deployment practices

How to measure: Use your CI/CD pipeline data (deployment timestamps), incident tracking (time to resolve), and change tracking (which deployments caused incidents). Tools like DORA’s Four Keys project, Sleuth, or LinearB can automate this.
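
Before adopting a tool, three of the four metrics fall out of simple arithmetic over deployment records. A sketch, assuming your pipeline and incident tracker can export merge/deploy timestamps and an incident link per deploy (the record shape is an assumption about your tooling):

```typescript
// DORA metrics from pipeline data.
interface Deploy {
  mergedAt: Date;         // when the change merged to trunk
  deployedAt: Date;       // when it reached production
  causedIncident: boolean; // linked to an incident after release?
}

function deploymentsPerWeek(deploys: Deploy[], windowWeeks: number): number {
  return deploys.length / windowWeeks;
}

function medianLeadTimeHours(deploys: Deploy[]): number {
  if (deploys.length === 0) return 0;
  const hours = deploys
    .map((d) => (d.deployedAt.getTime() - d.mergedAt.getTime()) / 3_600_000)
    .sort((a, b) => a - b);
  const mid = Math.floor(hours.length / 2);
  return hours.length % 2 ? hours[mid] : (hours[mid - 1] + hours[mid]) / 2;
}

function changeFailureRate(deploys: Deploy[]): number {
  return deploys.filter((d) => d.causedIncident).length / deploys.length;
}
```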

Caution: Do not turn DORA metrics into targets for individual teams. They are system-level indicators. A team with low deployment frequency might be blocked by a shared deployment pipeline, not by their own practices.


SLOs and Error Budgets

The Concept

An SLO (Service Level Objective) defines the target reliability for a service. The error budget is the difference between 100% and the SLO — it is how much unreliability you can tolerate.

| SLO | Error Budget (monthly) | Meaning |
|-----|------------------------|---------|
| 99% | 7.3 hours of downtime | Acceptable for internal tools |
| 99.9% | 43.8 minutes of downtime | Standard for user-facing services |
| 99.95% | 21.9 minutes of downtime | High-reliability services |
| 99.99% | 4.4 minutes of downtime | Critical infrastructure (payment, auth) |
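
The budget column is pure arithmetic: allowed downtime is (1 - SLO) times the window. A one-function sketch, using the average month of about 30.44 days, which is where the 43.8-minute figure above comes from:

```typescript
// Error budget = (1 - SLO) x window length, in minutes.
function errorBudgetMinutes(slo: number, windowDays = 30.44): number {
  return (1 - slo) * windowDays * 24 * 60;
}

errorBudgetMinutes(0.999);  // ~43.8 minutes per average month
errorBudgetMinutes(0.9999); // ~4.4 minutes
```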

How Error Budgets Change Behavior

When error budget is healthy (> 50% remaining):

  • Ship features aggressively
  • Experiment with new technologies
  • Take calculated risks

When error budget is depleted:

  • Stop feature work
  • Focus exclusively on reliability improvements
  • Conduct thorough investigation of recent incidents

This creates a self-regulating system: teams that ship risky changes consume their error budget and must slow down. Teams that invest in reliability earn the right to move fast.

Implementing SLOs for a 16-Person Org

  1. Start with 2-3 SLIs (Service Level Indicators): availability (% of successful requests), latency (p99 < threshold), and correctness (% of correct responses)
  2. Set SLOs based on user expectations, not aspirations: If users tolerate 99.5%, do not set 99.99% — you will burn engineering time on diminishing returns
  3. Measure from the user’s perspective: Use synthetic monitoring or real user monitoring, not internal health checks
  4. Report error budget weekly: Make it visible on a dashboard and discuss it in team standup when the budget drops below 50% (see the sketch after this list)
  5. Define the policy: What happens when the budget is exhausted? Write it down and get stakeholder agreement
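
For the weekly report in step 4, the remaining budget can be computed directly from request counts when the SLI is availability. A sketch; the traffic numbers in the usage example are made up:

```typescript
// Fraction of the monthly error budget still unspent, given an
// availability SLI (successful requests / total requests).
function budgetRemaining(
  totalRequests: number,
  failedRequests: number,
  slo: number
): number {
  const allowedFailures = totalRequests * (1 - slo); // the budget, in requests
  return 1 - failedRequests / allowedFailures;        // < 0 means exhausted
}

// 10M requests at a 99.9% SLO allow 10k failures; 4k observed leaves 60%.
budgetRemaining(10_000_000, 4_000, 0.999); // 0.6
```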

Blameless Postmortems

Why Blameless

Human error is never the root cause — it is the proximate cause. The root cause is the system that allowed the error to happen. Blaming individuals teaches people to hide mistakes; blameless culture teaches people to surface them early.

Postmortem Template

```markdown
# Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-1 / SEV-2 / SEV-3
**Author:** [Name]
**Reviewers:** [Names]

## Summary
One paragraph: what happened, what was the impact, how was it resolved.

## Timeline
- HH:MM — First alert fired
- HH:MM — On-call engineer paged
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Service fully recovered

## Root Cause
What systemic condition allowed this to happen?

## Contributing Factors
What else made the incident worse or longer?

## Impact
- Users affected: N
- Revenue impact: $X (if known)
- SLO impact: X% of monthly error budget consumed

## What Went Well
- What worked in the response?
- What tooling or processes helped?

## What Went Poorly
- Where did the response break down?
- What was confusing or slow?

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add monitoring for X | @engineer | P1 | YYYY-MM-DD | Open |
| Update runbook for Y | @engineer | P2 | YYYY-MM-DD | Open |

## Lessons Learned
What should we remember? What changes to our mental models?
```

Postmortem Practices

  • Write within 48 hours — memory fades fast
  • Review as a team — the postmortem meeting is for learning, not blame
  • Track action items to completion — unfinished action items are the biggest postmortem failure mode
  • Share broadly — post to a shared channel. Other teams learn from your incidents.
  • Review quarterly — look at patterns across incidents. Are the same categories recurring?

Anti-Patterns

| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| Coverage worship | 95% code coverage, yet production bugs persist | Focus coverage on critical paths; use mutation testing score to measure quality |
| Manual QA gate | Every release needs QA sign-off, creating a bottleneck | Shift left: automated tests in CI; QA focuses on exploratory testing |
| Postmortem theater | Postmortems get written, action items never completed | Track action items in the sprint backlog; review completion rate |
| SLO without teeth | SLOs exist but nothing happens when they break | Define and enforce the error budget policy with product stakeholders |
| CI that cries wolf | Flaky tests cause frequent false failures | Quarantine flaky tests; fix or delete within 1 week |
| Review bottleneck | PRs wait days for review; feature branches diverge | Review SLA (4 hours); pair programming reduces review need |

Real-World Application

Google’s SRE Practices

Google pioneered error budgets and SLOs in their SRE book. Key insight: the SRE team and the product team share the same SLO. When the error budget is exhausted, the SRE team can block new feature deployments until reliability improves. This creates alignment between velocity and reliability.

Netflix’s Chaos Engineering

Netflix runs Chaos Monkey and related tools (Chaos Kong for region failures) in production. The philosophy: if your system cannot handle a random server dying during business hours, you will not handle it at 3 AM either. This forces teams to build resilient systems by default.

Etsy’s Blameless Postmortem Culture

Etsy (under John Allspaw) pioneered blameless postmortems in the industry. Key practices:

  • Postmortems are mandatory for any SEV-1/SEV-2 incident
  • The focus is on “what did the system allow?” not “who made the mistake?”
  • Postmortems are published internally and externally (blog posts)
  • Action items are tracked with the same rigor as product features

Microsoft’s Code Review Research

Microsoft Research (2013) studied code review across 36 development teams. Findings:

  • The primary value is knowledge transfer, not defect detection
  • Reviews find only ~15% of defects; testing and static analysis find the rest
  • Review turnaround time is the strongest predictor of developer satisfaction with the review process
  • Smaller changes receive more thorough reviews

References

  • Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate — DORA metrics, statistical validation
  • Beyer, B. et al. (2016). Site Reliability Engineering — Google SRE book, error budgets, SLOs
  • Kim, G. et al. (2016). The DevOps Handbook — CI/CD practices, value stream mapping
  • Humble, J. & Farley, D. (2010). Continuous Delivery — pipeline design, deployment strategies
  • Allspaw, J. & Robbins, J. (2010). Web Operations — incident management foundations
  • Bacchelli, A. & Bird, C. (2013). "Expectations, Outcomes, and Challenges of Modern Code Review" — Microsoft Research
  • DORA Four Keys — cloud.google.com/devops
  • Google SRE Book — sre.google/sre-book
  • Etsy's blameless postmortems — etsy.com/codeascraft
  • Charity Majors — "Observability and the Glorious Future" (conference talks)
  • Kent C. Dodds — "The Testing Trophy" — kentcdodds.com

This post is licensed under CC BY 4.0 by the author.