Engineering Excellence & Quality
Quality is not a phase or a role -- it is a system property. An EM's job is not to inspect quality into the product but to design systems, processes, and culture where quality is the default outcome.
Code Review Culture
Why Code Review Matters (and Why It Often Fails)
Code review serves four functions, in order of importance:
- Knowledge sharing — spreading context across the team
- Defect detection — catching bugs before production
- Design feedback — improving architecture and patterns
- Standards enforcement — consistent style and conventions
Most teams treat it as primarily #2 (bug catching), which is why reviews feel like a chore. A Microsoft Research study (2013) found that reviews catch only ~15% of defects — the primary value is knowledge sharing and mentoring.
Code Review Principles
| Principle | Why | How |
|---|---|---|
| Review within 4 hours | Long review queues kill flow and create context-switching | Set a team norm: reviews are interruptible work, do them between focus blocks |
| Small PRs (< 400 lines) | Large PRs get rubber-stamped; small PRs get real feedback | PR size limit in CI; split features using feature flags |
| Review the design, not the syntax | Linters handle formatting; humans should focus on logic and architecture | Automate style checks; review comments focus on behavior and tradeoffs |
| Two approvals for critical paths | Payment, auth, data deletion — these need more eyes | Define “critical paths” explicitly; automate CODEOWNERS |
| Author provides context | Reviewers should not need to reverse-engineer intent | PR template: what, why, how to test, screenshots for UI |
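The PR size limit from the table can be enforced with a small CI check. A minimal sketch (the threshold mirrors the guideline above; in a real pipeline the addition/deletion counts would come from your code host's pull-request API rather than being passed in directly):

```python
# Sketch of a PR size gate for CI. The 400-line threshold mirrors the
# guideline above; in practice the counts would come from your code
# host's PR API (e.g. the pull request's additions/deletions fields).

PR_SIZE_LIMIT = 400  # total changed lines before review quality drops off

def check_pr_size(additions: int, deletions: int,
                  limit: int = PR_SIZE_LIMIT) -> tuple[bool, str]:
    """Return (ok, message) for a proposed change set."""
    total = additions + deletions
    if total <= limit:
        return True, f"PR size OK: {total} changed lines (limit {limit})"
    return False, (
        f"PR too large: {total} changed lines (limit {limit}). "
        "Consider splitting it behind a feature flag."
    )

if __name__ == "__main__":
    ok, msg = check_pr_size(additions=120, deletions=35)
    print(msg)
```

Wiring this into CI as a required status check makes the norm self-enforcing instead of something reviewers have to police by hand.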
Code Review Anti-Patterns
| Anti-Pattern | Impact | Fix |
|---|---|---|
| Nitpicking | Demoralizes authors, slows reviews | Prefix nits with “nit:” — author can ignore |
| Gatekeeping | One senior blocks all PRs, creates bottleneck | Rotate reviewers, trust juniors to review (they learn from it) |
| Rubber stamping | “LGTM” on 800-line PRs in 2 minutes | Require meaningful comments on PRs above a size threshold |
| Review as battleground | Architecture debates in PR comments | Separate design review (before coding) from code review (after) |
For Cross-Functional Teams (FE, BE, QA, Data)
With 16 engineers across disciplines, code reviews have an additional challenge: can a backend engineer meaningfully review a frontend PR?
Practical approach:
- Same-discipline primary reviewer — someone who understands the idioms and patterns
- Cross-discipline secondary reviewer — for integration points, API contracts, data flow
- QA reviews test coverage — not as a gate but as a quality conversation
Testing Strategy — The Pyramid and Beyond
The Testing Pyramid
```
             /  E2E   \              Few, slow, expensive, high confidence
            /----------\
           / Integration \           Moderate count, moderate speed
          /----------------\
         /    Unit Tests    \        Many, fast, cheap, low confidence per test
        /____________________\
```
| Layer | What It Tests | Speed | Maintenance Cost | Confidence |
|---|---|---|---|---|
| Unit | Single function/class in isolation | Milliseconds | Low | Low (per test), high (in aggregate) |
| Integration | Components working together (API + DB, service + service) | Seconds | Medium | Medium-high |
| E2E | Full user flow through the system | Minutes | High (brittle) | High (for the happy path) |
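The difference between the first two layers is easiest to see in code. A minimal sketch (the function and schema are invented for illustration; the integration test uses an in-memory SQLite database to stand in for a real one):

```python
import sqlite3

# Unit level: pure logic, no I/O -- runs in microseconds.
def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def test_normalize_email():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

# Integration level: the same logic exercised together with a real
# (in-memory) database, checking that the pieces compose correctly.
def test_user_roundtrip():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
    db.execute("INSERT INTO users VALUES (?)",
               (normalize_email("  Alice@Example.COM "),))
    row = db.execute("SELECT email FROM users").fetchone()
    assert row == ("alice@example.com",)

if __name__ == "__main__":
    test_normalize_email()
    test_user_roundtrip()
    print("all tests passed")
```

The unit test would survive a database migration unchanged; the integration test is the one that would catch a schema mismatch.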
The Testing Trophy (Kent C. Dodds)
An alternative to the pyramid that emphasizes integration tests:
```
             /  E2E   \
            /----------\
           / Integration \           <-- Biggest layer
          /----------------\
         /  Static Analysis \
        /____________________\
```
The argument: unit tests often test implementation details and break during refactoring without catching real bugs. Integration tests catch more real issues per test. Static analysis (TypeScript, ESLint) catches an entire category of bugs automatically.
Testing Strategy for a 16-Person Org
| Discipline | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Frontend | Component tests (React Testing Library), utility functions | API integration tests with MSW (Mock Service Worker) | Critical user journeys (Cypress/Playwright) — max 10-20 |
| Backend | Business logic, validation, transformers | API tests against real DB (testcontainers), service integration | Same E2E as frontend (shared) |
| QA | Owns E2E test suite maintenance | Exploratory testing for edge cases | Defines test scenarios, automates critical paths |
| Data | Pipeline unit tests (dbt tests, Great Expectations) | Data quality checks on staging | End-to-end data flow validation |
The critical discipline: Test the contracts between components, not the internals. When Squad A’s service calls Squad B’s service, the integration test at the boundary is worth more than 50 unit tests inside either service.
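A boundary test of this kind can be very lightweight. A sketch of a consumer-side contract check (the endpoint shape and field names are hypothetical; the point is that the consuming squad pins down exactly the fields and types it depends on):

```python
# Sketch of a consumer-side contract test at a service boundary.
# The fields and types below are hypothetical: they represent exactly
# what Squad A consumes from Squad B's response, nothing more.

CONTRACT = {
    "order_id": str,
    "status": str,
    "total_cents": int,
}

def violates_contract(response: dict) -> list[str]:
    """Return a list of contract violations (empty means it conforms)."""
    problems = []
    for field, expected_type in CONTRACT.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems

# In CI this would run against a stubbed or recorded response from Squad B.
sample = {"order_id": "o-123", "status": "paid", "total_cents": 4999}
assert violates_contract(sample) == []
```

Checking only the fields the consumer actually uses keeps the test stable when the provider adds new fields, while still failing loudly when a depended-on field changes shape.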
Test Quality Metrics
| Metric | Target | Why |
|---|---|---|
| Code coverage | 70-80% for critical paths, do not chase 100% | Diminishing returns above 80%; 100% coverage incentivizes trivial tests |
| Test execution time | < 10 min for full suite on CI | Slow tests cause developers to skip them or not run locally |
| Flaky test rate | < 1% of test runs | Flaky tests erode trust; teams learn to ignore failures |
| Test-to-code ratio | Varies (roughly 1:1 to 2:1) | More useful as a trend than an absolute number |
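The flaky-test rate is measurable directly from CI history. A sketch (the record format is invented for illustration): count a (test, commit) pair as flaky when the same test both failed and passed on the same commit, i.e. a retry flipped the result with no code change.

```python
# Sketch: estimating the flaky-test rate from CI history. A (test, commit)
# pair counts as flaky if it produced both a pass and a fail -- a retry
# flipped the result with no code change. Record format is illustrative.

from collections import defaultdict

def flaky_rate(runs: list[tuple[str, str, bool]]) -> float:
    """runs: (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)

history = [
    ("test_checkout", "abc123", False),
    ("test_checkout", "abc123", True),   # passed on retry -> flaky
    ("test_login",    "abc123", True),
    ("test_login",    "def456", True),
]
print(f"{flaky_rate(history):.1%}")  # -> 33.3%
```

Tracking this weekly makes the "< 1% of test runs" target concrete rather than a matter of impression.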
CI/CD Maturity
Maturity Levels
| Level | Description | Characteristics |
|---|---|---|
| 0 — Manual | Build and deploy by hand | “It works on my machine,” deployment instructions in a wiki nobody maintains |
| 1 — CI basics | Automated build and test on every PR | Jenkins/GitHub Actions runs tests, but deployment is still manual |
| 2 — Continuous Delivery | Automated pipeline to staging; manual promotion to production | One-click deploy to production, release trains |
| 3 — Continuous Deployment | Every merged PR goes to production automatically | Feature flags control rollout, canary deployments, automated rollback |
| 4 — Progressive Delivery | Automated canary analysis, percentage-based rollout, automatic rollback | Metrics-driven promotion, SLO-based gates |
Most 16-person teams should target Level 2-3. Level 4 requires significant investment in observability and automation that pays off at larger scale.
CI/CD Practices
| Practice | Impact | Implementation |
|---|---|---|
| Trunk-based development | Reduces merge conflicts, enables CI | Short-lived branches (< 1 day), feature flags for incomplete work |
| Build time < 10 min | Developers wait for CI; slow CI = slow feedback | Parallelize tests, cache dependencies, split test suites |
| Automated security scanning | Catch vulnerabilities before production | SAST/DAST in pipeline (Snyk, Trivy, SonarQube) |
| Environment parity | “Works in staging” = “works in production” | Containers, infrastructure-as-code, same configs |
| Deployment frequency | High frequency = smaller changes = lower risk | Elite teams deploy multiple times per day (DORA research) |
DORA Metrics
The four key metrics from the Accelerate research (Forsgren, Humble, Kim), validated across thousands of organizations:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple/day) | Weekly to monthly | Monthly to every 6 months | Every 6+ months |
| Lead Time for Changes | < 1 hour | 1 day to 1 week | 1 week to 1 month | 1 to 6 months |
| Mean Time to Recovery (MTTR) | < 1 hour | < 1 day | 1 day to 1 week | 1 week to 1 month |
| Change Failure Rate | 0-15% | 16-30% | 16-30% | 46-60% |
Why These Metrics Matter for an EM
These metrics are not just for DevOps teams. They measure your team’s ability to deliver value safely:
- Deployment frequency reflects batch size and pipeline health
- Lead time reflects process efficiency and team autonomy
- MTTR reflects observability, incident response, and system understanding
- Change failure rate reflects testing, review, and deployment practices
How to measure: Use your CI/CD pipeline data (deployment timestamps), incident tracking (time to resolve), and change tracking (which deployments caused incidents). Tools like DORA’s Four Keys project, Sleuth, or LinearB can automate this.
Caution: Do not turn DORA metrics into targets for individual teams. They are system-level indicators. A team with low deployment frequency might be blocked by a shared deployment pipeline, not by their own practices.
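The arithmetic behind the four metrics is simple once the raw events are collected. A sketch (record shapes are illustrative; real inputs would come from your CI/CD pipeline and incident tracker, as described above):

```python
# Sketch: deriving the four DORA metrics from pipeline and incident data.
# Record shapes are illustrative; real inputs would come from your CI/CD
# and incident-tracking tools.

from datetime import datetime, timedelta
from statistics import median

deploys = [  # (deployed_at, commit_authored_at, caused_incident)
    (datetime(2024, 5, 1, 10), datetime(2024, 5, 1, 9),  False),
    (datetime(2024, 5, 2, 15), datetime(2024, 5, 2, 11), True),
    (datetime(2024, 5, 3, 12), datetime(2024, 5, 3, 10), False),
    (datetime(2024, 5, 3, 17), datetime(2024, 5, 3, 16), False),
]
incidents = [  # (started_at, resolved_at)
    (datetime(2024, 5, 2, 15, 30), datetime(2024, 5, 2, 16, 10)),
]

days_observed = 3
deploy_frequency = len(deploys) / days_observed                  # per day
lead_time = median(d - a for d, a, _ in deploys)                 # commit -> deploy
mttr = median(r - s for s, r in incidents)                       # time to restore
change_failure_rate = sum(bad for *_, bad in deploys) / len(deploys)

print(f"deploys/day: {deploy_frequency:.2f}")
print(f"median lead time: {lead_time}")
print(f"MTTR: {mttr}")
print(f"change failure rate: {change_failure_rate:.0%}")
```

Medians are used deliberately: a single slow deployment or long incident should not dominate the trend you report.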
SLOs and Error Budgets
The Concept
An SLO (Service Level Objective) defines the target reliability for a service. The error budget is the difference between 100% and the SLO — it is how much unreliability you can tolerate.
| SLO | Error Budget (monthly) | Meaning |
|---|---|---|
| 99% | 7.3 hours downtime | Acceptable for internal tools |
| 99.9% | 43.8 minutes downtime | Standard for user-facing services |
| 99.95% | 21.9 minutes downtime | High-reliability services |
| 99.99% | 4.4 minutes downtime | Critical infrastructure (payment, auth) |
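The budget figures in the table follow from one line of arithmetic, using a 730-hour month (8,760 hours / 12):

```python
# Converting an SLO into a monthly error budget. Uses a 730-hour month
# (8760 / 12), which is how the figures in the table above are derived.

HOURS_PER_MONTH = 730  # 8760 hours per year / 12 months

def monthly_error_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime per month, in minutes, for a given SLO."""
    return (1 - slo_percent / 100) * HOURS_PER_MONTH * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {monthly_error_budget_minutes(slo):.1f} min/month")
```

Running this reproduces the table: 99% allows 438 minutes (7.3 hours), 99.9% allows 43.8 minutes, and each extra nine cuts the budget by roughly a factor of ten.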
How Error Budgets Change Behavior
When error budget is healthy (> 50% remaining):
- Ship features aggressively
- Experiment with new technologies
- Take calculated risks
When error budget is depleted:
- Stop feature work
- Focus exclusively on reliability improvements
- Conduct thorough investigation of recent incidents
This creates a self-regulating system: teams that ship risky changes consume their error budget and must slow down. Teams that invest in reliability earn the right to move fast.
Implementing SLOs for a 16-Person Org
- Start with 2-3 SLIs (Service Level Indicators): availability (% of successful requests), latency (p99 < threshold), and correctness (% of correct responses)
- Set SLOs based on user expectations, not aspirations: If users tolerate 99.5%, do not set 99.99% — you will burn engineering time on diminishing returns
- Measure from the user’s perspective: Use synthetic monitoring or real user monitoring, not internal health checks
- Report error budget weekly: Make it visible on a dashboard. Discuss in team standup when budget drops below 50%
- Define the policy: What happens when the budget is exhausted? Write it down and get stakeholder agreement
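The weekly error-budget report reduces to a ratio of observed failures to allowed failures. A sketch (the numbers are illustrative; in practice the request counts would come from your monitoring system's SLI queries):

```python
# Sketch: weekly error-budget report from request counts. Numbers are
# illustrative; total/failed would come from your monitoring system's
# SLI queries in practice.

def budget_remaining(total_requests: int, failed_requests: int,
                     slo: float) -> float:
    """Fraction of the error budget left (negative once exhausted)."""
    allowed_failures = total_requests * (1 - slo)
    return 1 - failed_requests / allowed_failures

remaining = budget_remaining(total_requests=10_000_000,
                             failed_requests=6_000, slo=0.999)
print(f"error budget remaining: {remaining:.0%}")
if remaining < 0.5:
    print("below 50% -- raise in standup per the policy above")
```

Here 10 million requests at a 99.9% availability SLO allow 10,000 failures; 6,000 observed failures leave 40% of the budget, which crosses the 50% discussion threshold.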
Blameless Postmortems
Why Blameless
Human error is never the root cause — it is the proximate cause. The root cause is the system that allowed the error to happen. Blaming individuals teaches people to hide mistakes; blameless culture teaches people to surface them early.
Postmortem Template
```markdown
# Incident: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-1 / SEV-2 / SEV-3
**Author:** [Name]
**Reviewers:** [Names]

## Summary

One paragraph: what happened, what was the impact, how was it resolved.

## Timeline

- HH:MM — First alert fired
- HH:MM — On-call engineer paged
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Service fully recovered

## Root Cause

What systemic condition allowed this to happen?

## Contributing Factors

What else made the incident worse or longer?

## Impact

- Users affected: N
- Revenue impact: $X (if known)
- SLO impact: X% of monthly error budget consumed

## What Went Well

- What worked in the response?
- What tooling or processes helped?

## What Went Poorly

- Where did the response break down?
- What was confusing or slow?

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add monitoring for X | @engineer | P1 | YYYY-MM-DD | Open |
| Update runbook for Y | @engineer | P2 | YYYY-MM-DD | Open |

## Lessons Learned

What should we remember? What changes to our mental models?
```
Postmortem Practices
- Write within 48 hours — memory fades fast
- Review as a team — the postmortem meeting is for learning, not blame
- Track action items to completion — unfinished action items are the biggest postmortem failure mode
- Share broadly — post to a shared channel. Other teams learn from your incidents.
- Review quarterly — look at patterns across incidents. Are the same categories recurring?
Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Coverage worship | 95% code coverage, still has production bugs | Focus coverage on critical paths; measure mutation testing score for quality |
| Manual QA gate | Every release needs QA sign-off, creating bottleneck | Shift-left: automated tests in CI, QA focuses on exploratory testing |
| Postmortem theater | Write postmortems, never complete action items | Track action items in sprint backlog, review completion rate |
| SLO without teeth | SLOs exist but nothing happens when they break | Define and enforce the error budget policy with product stakeholders |
| CI that cries wolf | Flaky tests cause frequent false failures | Quarantine flaky tests, fix or delete within 1 week |
| Review bottleneck | PRs wait days for review, feature branches diverge | Review SLA (4 hours), pair programming reduces review need |
Real-World Application
Google’s SRE Practices
Google pioneered error budgets and SLOs in their SRE book. Key insight: the SRE team and the product team share the same SLO. When the error budget is exhausted, the SRE team can block new feature deployments until reliability improves. This creates alignment between velocity and reliability.
Netflix’s Chaos Engineering
Netflix runs Chaos Monkey and related tools (Chaos Kong for region failures) in production. The philosophy: if your system cannot handle a random server dying during business hours, you will not handle it at 3 AM either. This forces teams to build resilient systems by default.
Etsy’s Blameless Postmortem Culture
Etsy (under John Allspaw) pioneered blameless postmortems in the industry. Key practices:
- Postmortems are mandatory for any SEV-1/SEV-2 incident
- The focus is on “what did the system allow?” not “who made the mistake?”
- Postmortems are published internally and externally (blog posts)
- Action items are tracked with the same rigor as product features
Microsoft’s Code Review Research
Microsoft Research (2013) studied code review across 36 development teams. Findings:
- The primary value is knowledge transfer, not defect detection
- Reviews find only ~15% of defects; testing and static analysis find the rest
- Review turnaround time is the strongest predictor of developer satisfaction with the review process
- Smaller changes receive more thorough reviews
References
- Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate — DORA metrics, statistical validation
- Beyer, B. et al. (2016). Site Reliability Engineering — Google SRE book, error budgets, SLOs
- Kim, G. et al. (2016). The DevOps Handbook — CI/CD practices, value stream mapping
- Humble, J. & Farley, D. (2010). Continuous Delivery — pipeline design, deployment strategies
- Allspaw, J. & Robbins, J. (2010). Web Operations — incident management foundations
- DORA Four Keys — cloud.google.com/devops
- Google SRE Book — sre.google/sre-book
- Etsy's Blameless Postmortems — etsy.com/codeascraft
- Bacchelli, A. & Bird, C. (2013). "Expectations, Outcomes, and Challenges of Modern Code Review" — Microsoft Research
- Charity Majors — "Observability and the Glorious Future" (various conference talks)
- Kent C. Dodds — "The Testing Trophy" — kentcdodds.com