Engineering Excellence & Quality
Quality is not a phase or a role -- it is a system property. An EM's job is not to inspect quality into the product but to design systems, processes, and culture where quality is the default outcome.
Code Review Culture
Why Code Review Matters (and Why It Often Fails)
Code review serves four functions, in order of importance:
- Knowledge sharing — spreading context across the team
- Defect detection — catching bugs before production
- Design feedback — improving architecture and patterns
- Standards enforcement — consistent style and conventions
Most teams treat it as primarily #2 (bug catching), which is why reviews feel like a chore. A Microsoft Research study (2013) found that reviews catch only ~15% of defects — the primary value is knowledge sharing and mentoring.
Code Review Principles
| Principle | Why | How |
|---|---|---|
| Review within 4 hours | Long review queues kill flow and create context-switching | Set a team norm: reviews are interruptible work, do them between focus blocks |
| Small PRs (< 400 lines) | Large PRs get rubber-stamped; small PRs get real feedback | PR size limit in CI; split features using feature flags |
| Review the design, not the syntax | Linters handle formatting; humans should focus on logic and architecture | Automate style checks; review comments focus on behavior and tradeoffs |
| Two approvals for critical paths | Payment, auth, data deletion — these need more eyes | Define “critical paths” explicitly; automate CODEOWNERS |
| Author provides context | Reviewers should not need to reverse-engineer intent | PR template: what, why, how to test, screenshots for UI |
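The PR size limit from the table can be enforced with a small CI check. A minimal sketch (the threshold mirrors the guideline above; in a real pipeline the addition/deletion counts would come from your code host's pull-request API rather than being passed in directly):

```python
# Sketch of a PR size gate for CI. The 400-line threshold mirrors the
# guideline above; in practice the counts would come from your code
# host's PR API (e.g. the pull request's additions/deletions fields).

PR_SIZE_LIMIT = 400  # total changed lines before review quality drops off

def check_pr_size(additions: int, deletions: int,
                  limit: int = PR_SIZE_LIMIT) -> tuple[bool, str]:
    """Return (ok, message) for a proposed change set."""
    total = additions + deletions
    if total <= limit:
        return True, f"PR size OK: {total} changed lines (limit {limit})"
    return False, (
        f"PR too large: {total} changed lines (limit {limit}). "
        "Consider splitting it behind a feature flag."
    )

if __name__ == "__main__":
    ok, msg = check_pr_size(additions=120, deletions=35)
    print(msg)
```

Wiring this into CI as a required status check makes the norm self-enforcing instead of something reviewers have to police by hand.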
Code Review Anti-Patterns
| Anti-Pattern | Impact | Fix |
|---|---|---|
| Nitpicking | Demoralizes authors, slows reviews | Prefix nits with “nit:” — author can ignore |
| Gatekeeping | One senior blocks all PRs, creates bottleneck | Rotate reviewers, trust juniors to review (they learn from it) |
| Rubber stamping | “LGTM” on 800-line PRs in 2 minutes | Require meaningful comments on PRs above a size threshold |
| Review as battleground | Architecture debates in PR comments | Separate design review (before coding) from code review (after) |
For Cross-Functional Teams (FE, BE, QA, Data)
With 16 engineers across disciplines, code reviews have an additional challenge: can a backend engineer meaningfully review a frontend PR?
Practical approach:
- Same-discipline primary reviewer — someone who understands the idioms and patterns
- Cross-discipline secondary reviewer — for integration points, API contracts, data flow
- QA reviews test coverage — not as a gate but as a quality conversation
Testing Strategy — The Pyramid and Beyond
The Testing Pyramid
```
             /  E2E   \              Few, slow, expensive, high confidence
            /----------\
           / Integration \           Moderate count, moderate speed
          /----------------\
         /    Unit Tests    \        Many, fast, cheap, low confidence per test
        /____________________\
```
| Layer | What It Tests | Speed | Maintenance Cost | Confidence |
|---|---|---|---|---|
| Unit | Single function/class in isolation | Milliseconds | Low | Low (per test), high (in aggregate) |
| Integration | Components working together (API + DB, service + service) | Seconds | Medium | Medium-high |
| E2E | Full user flow through the system | Minutes | High (brittle) | High (for the happy path) |
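The difference between the first two layers is easiest to see in code. A minimal sketch (the function and schema are invented for illustration; the integration test uses an in-memory SQLite database to stand in for a real one):

```python
import sqlite3

# Unit level: pure logic, no I/O -- runs in microseconds.
def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def test_normalize_email():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

# Integration level: the same logic exercised together with a real
# (in-memory) database, checking that the pieces compose correctly.
def test_user_roundtrip():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
    db.execute("INSERT INTO users VALUES (?)",
               (normalize_email("  Alice@Example.COM "),))
    row = db.execute("SELECT email FROM users").fetchone()
    assert row == ("alice@example.com",)

if __name__ == "__main__":
    test_normalize_email()
    test_user_roundtrip()
    print("all tests passed")
```

The unit test would survive a database migration unchanged; the integration test is the one that would catch a schema mismatch.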
The Testing Trophy (Kent C. Dodds)
An alternative to the pyramid that emphasizes integration tests:
```
             /  E2E   \
            /----------\
           / Integration \           <-- Biggest layer
          /----------------\
         /  Static Analysis \
        /____________________\
```
The argument: unit tests often test implementation details and break during refactoring without catching real bugs. Integration tests catch more real issues per test. Static analysis (TypeScript, ESLint) catches an entire category of bugs automatically.
Testing Strategy for a 16-Person Org
| Discipline | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Frontend | Component tests (React Testing Library), utility functions | API integration tests with MSW (Mock Service Worker) | Critical user journeys (Cypress/Playwright) — max 10-20 |
| Backend | Business logic, validation, transformers | API tests against real DB (testcontainers), service integration | Same E2E as frontend (shared) |
| QA | Owns E2E test suite maintenance | Exploratory testing for edge cases | Defines test scenarios, automates critical paths |
| Data | Pipeline unit tests (dbt tests, Great Expectations) | Data quality checks on staging | End-to-end data flow validation |
The critical discipline: Test the contracts between components, not the internals. When Squad A’s service calls Squad B’s service, the integration test at the boundary is worth more than 50 unit tests inside either service.
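A boundary test of this kind can be very lightweight. A sketch of a consumer-side contract check (the endpoint shape and field names are hypothetical; the point is that the consuming squad pins down exactly the fields and types it depends on):

```python
# Sketch of a consumer-side contract test at a service boundary.
# The fields and types below are hypothetical: they represent exactly
# what Squad A consumes from Squad B's response, nothing more.

CONTRACT = {
    "order_id": str,
    "status": str,
    "total_cents": int,
}

def violates_contract(response: dict) -> list[str]:
    """Return a list of contract violations (empty means it conforms)."""
    problems = []
    for field, expected_type in CONTRACT.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems

# In CI this would run against a stubbed or recorded response from Squad B.
sample = {"order_id": "o-123", "status": "paid", "total_cents": 4999}
assert violates_contract(sample) == []
```

Checking only the fields the consumer actually uses keeps the test stable when the provider adds new fields, while still failing loudly when a depended-on field changes shape.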
Test Quality Metrics
| Metric | Target | Why |
|---|---|---|
| Code coverage | 70-80% for critical paths, do not chase 100% | Diminishing returns above 80%; 100% coverage incentivizes trivial tests |
| Test execution time | < 10 min for full suite on CI | Slow tests cause developers to skip them or not run locally |
| Flaky test rate | < 1% of test runs | Flaky tests erode trust; teams learn to ignore failures |
| Test-to-code ratio | Varies (roughly 1:1 to 2:1) | More useful as a trend than an absolute number |
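The flaky-test rate is measurable directly from CI history. A sketch (the record format is invented for illustration): count a (test, commit) pair as flaky when the same test both failed and passed on the same commit, i.e. a retry flipped the result with no code change.

```python
# Sketch: estimating the flaky-test rate from CI history. A (test, commit)
# pair counts as flaky if it produced both a pass and a fail -- a retry
# flipped the result with no code change. Record format is illustrative.

from collections import defaultdict

def flaky_rate(runs: list[tuple[str, str, bool]]) -> float:
    """runs: (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)

history = [
    ("test_checkout", "abc123", False),
    ("test_checkout", "abc123", True),   # passed on retry -> flaky
    ("test_login",    "abc123", True),
    ("test_login",    "def456", True),
]
print(f"{flaky_rate(history):.1%}")  # -> 33.3%
```

Tracking this weekly makes the "< 1% of test runs" target concrete rather than a matter of impression.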
CI/CD Maturity
Maturity Levels
| Level | Description | Characteristics |
|---|---|---|
| 0 — Manual | Build and deploy by hand | “It works on my machine,” deployment instructions in a wiki nobody maintains |
| 1 — CI basics | Automated build and test on every PR | Jenkins/GitHub Actions runs tests, but deployment is still manual |
| 2 — Continuous Delivery | Automated pipeline to staging; manual promotion to production | One-click deploy to production, release trains |
| 3 — Continuous Deployment | Every merged PR goes to production automatically | Feature flags control rollout, canary deployments, automated rollback |
| 4 — Progressive Delivery | Automated canary analysis, percentage-based rollout, automatic rollback | Metrics-driven promotion, SLO-based gates |
Most 16-person teams should target Level 2-3. Level 4 requires significant investment in observability and automation that pays off at larger scale.
CI/CD Practices
| Practice | Impact | Implementation |
|---|---|---|
| Trunk-based development | Reduces merge conflicts, enables CI | Short-lived branches (< 1 day), feature flags for incomplete work |
| Build time < 10 min | Developers wait for CI; slow CI = slow feedback | Parallelize tests, cache dependencies, split test suites |
| Automated security scanning | Catch vulnerabilities before production | SAST/DAST in pipeline (Snyk, Trivy, SonarQube) |
| Environment parity | “Works in staging” = “works in production” | Containers, infrastructure-as-code, same configs |
| Deployment frequency | High frequency = smaller changes = lower risk | Elite teams deploy multiple times per day (DORA research) |
DORA Metrics
The four key metrics from the Accelerate research (Forsgren, Humble, Kim), validated across thousands of organizations:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple/day) | Weekly to monthly | Monthly to every 6 months | Every 6+ months |
| Lead Time for Changes | < 1 hour | 1 day to 1 week | 1 week to 1 month | 1 to 6 months |
| Mean Time to Recovery (MTTR) | < 1 hour | < 1 day | 1 day to 1 week | 1 week to 1 month |
| Change Failure Rate | 0-15% | 16-30% | 16-30% | 46-60% |
Why These Metrics Matter for an EM
These metrics are not just for DevOps teams. They measure your team’s ability to deliver value safely:
- Deployment frequency reflects batch size and pipeline health
- Lead time reflects process efficiency and team autonomy
- MTTR reflects observability, incident response, and system understanding
- Change failure rate reflects testing, review, and deployment practices
How to measure: Use your CI/CD pipeline data (deployment timestamps), incident tracking (time to resolve), and change tracking (which deployments caused incidents). Tools like DORA’s Four Keys project, Sleuth, or LinearB can automate this.
Caution: Do not turn DORA metrics into targets for individual teams. They are system-level indicators. A team with low deployment frequency might be blocked by a shared deployment pipeline, not by their own practices.
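The arithmetic behind the four metrics is simple once the raw events are collected. A sketch (record shapes are illustrative; real inputs would come from your CI/CD pipeline and incident tracker, as described above):

```python
# Sketch: deriving the four DORA metrics from pipeline and incident data.
# Record shapes are illustrative; real inputs would come from your CI/CD
# and incident-tracking tools.

from datetime import datetime, timedelta
from statistics import median

deploys = [  # (deployed_at, commit_authored_at, caused_incident)
    (datetime(2024, 5, 1, 10), datetime(2024, 5, 1, 9),  False),
    (datetime(2024, 5, 2, 15), datetime(2024, 5, 2, 11), True),
    (datetime(2024, 5, 3, 12), datetime(2024, 5, 3, 10), False),
    (datetime(2024, 5, 3, 17), datetime(2024, 5, 3, 16), False),
]
incidents = [  # (started_at, resolved_at)
    (datetime(2024, 5, 2, 15, 30), datetime(2024, 5, 2, 16, 10)),
]

days_observed = 3
deploy_frequency = len(deploys) / days_observed                  # per day
lead_time = median(d - a for d, a, _ in deploys)                 # commit -> deploy
mttr = median(r - s for s, r in incidents)                       # time to restore
change_failure_rate = sum(bad for *_, bad in deploys) / len(deploys)

print(f"deploys/day: {deploy_frequency:.2f}")
print(f"median lead time: {lead_time}")
print(f"MTTR: {mttr}")
print(f"change failure rate: {change_failure_rate:.0%}")
```

Medians are used deliberately: a single slow deployment or long incident should not dominate the trend you report.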
SLOs and Error Budgets
The Concept
An SLO (Service Level Objective) defines the target reliability for a service. The error budget is the difference between 100% and the SLO — it is how much unreliability you can tolerate.
| SLO | Error Budget (monthly) | Meaning |
|---|---|---|
| 99% | 7.3 hours downtime | Acceptable for internal tools |
| 99.9% | 43.8 minutes downtime | Standard for user-facing services |
| 99.95% | 21.9 minutes downtime | High-reliability services |
| 99.99% | 4.4 minutes downtime | Critical infrastructure (payment, auth) |
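The budget figures in the table follow from one line of arithmetic, using a 730-hour month (8,760 hours / 12):

```python
# Converting an SLO into a monthly error budget. Uses a 730-hour month
# (8760 / 12), which is how the figures in the table above are derived.

HOURS_PER_MONTH = 730  # 8760 hours per year / 12 months

def monthly_error_budget_minutes(slo_percent: float) -> float:
    """Allowed downtime per month, in minutes, for a given SLO."""
    return (1 - slo_percent / 100) * HOURS_PER_MONTH * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {monthly_error_budget_minutes(slo):.1f} min/month")
```

Running this reproduces the table: 99% allows 438 minutes (7.3 hours), 99.9% allows 43.8 minutes, and each extra nine cuts the budget by roughly a factor of ten.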
How Error Budgets Change Behavior
When error budget is healthy (> 50% remaining):
- Ship features aggressively
- Experiment with new technologies
- Take calculated risks
When error budget is depleted:
- Stop feature work
- Focus exclusively on reliability improvements
- Conduct thorough investigation of recent incidents
This creates a self-regulating system: teams that ship risky changes consume their error budget and must slow down. Teams that invest in reliability earn the right to move fast.
Implementing SLOs for a 16-Person Org
- Start with 2-3 SLIs (Service Level Indicators): availability (% of successful requests), latency (p99 < threshold), and correctness (% of correct responses)
- Set SLOs based on user expectations, not aspirations: If users tolerate 99.5%, do not set 99.99% — you will burn engineering time on diminishing returns
- Measure from the user’s perspective: Use synthetic monitoring or real user monitoring, not internal health checks
- Report error budget weekly: Make it visible on a dashboard. Discuss in team standup when budget drops below 50%
- Define the policy: What happens when the budget is exhausted? Write it down and get stakeholder agreement
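The weekly error-budget report reduces to a ratio of observed failures to allowed failures. A sketch (the numbers are illustrative; in practice the request counts would come from your monitoring system's SLI queries):

```python
# Sketch: weekly error-budget report from request counts. Numbers are
# illustrative; total/failed would come from your monitoring system's
# SLI queries in practice.

def budget_remaining(total_requests: int, failed_requests: int,
                     slo: float) -> float:
    """Fraction of the error budget left (negative once exhausted)."""
    allowed_failures = total_requests * (1 - slo)
    return 1 - failed_requests / allowed_failures

remaining = budget_remaining(total_requests=10_000_000,
                             failed_requests=6_000, slo=0.999)
print(f"error budget remaining: {remaining:.0%}")
if remaining < 0.5:
    print("below 50% -- raise in standup per the policy above")
```

Here 10 million requests at a 99.9% availability SLO allow 10,000 failures; 6,000 observed failures leave 40% of the budget, which crosses the 50% discussion threshold.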
Blameless Postmortems
Why Blameless
Human error is never the root cause — it is the proximate cause. The root cause is the system that allowed the error to happen. Blaming individuals teaches people to hide mistakes; blameless culture teaches people to surface them early.
Postmortem Template
```markdown
# Incident: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-1 / SEV-2 / SEV-3
**Author:** [Name]
**Reviewers:** [Names]

## Summary

One paragraph: what happened, what was the impact, how was it resolved.

## Timeline

- HH:MM — First alert fired
- HH:MM — On-call engineer paged
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Service fully recovered

## Root Cause

What systemic condition allowed this to happen?

## Contributing Factors

What else made the incident worse or longer?

## Impact

- Users affected: N
- Revenue impact: $X (if known)
- SLO impact: X% of monthly error budget consumed

## What Went Well

- What worked in the response?
- What tooling or processes helped?

## What Went Poorly

- Where did the response break down?
- What was confusing or slow?

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add monitoring for X | @engineer | P1 | YYYY-MM-DD | Open |
| Update runbook for Y | @engineer | P2 | YYYY-MM-DD | Open |

## Lessons Learned

What should we remember? What changes to our mental models?
```
Postmortem Practices
- Write within 48 hours — memory fades fast
- Review as a team — the postmortem meeting is for learning, not blame
- Track action items to completion — unfinished action items are the biggest postmortem failure mode
- Share broadly — post to a shared channel. Other teams learn from your incidents.
- Review quarterly — look at patterns across incidents. Are the same categories recurring?
Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Coverage worship | 95% code coverage, still has production bugs | Focus coverage on critical paths; measure mutation testing score for quality |
| Manual QA gate | Every release needs QA sign-off, creating bottleneck | Shift-left: automated tests in CI, QA focuses on exploratory testing |
| Postmortem theater | Write postmortems, never complete action items | Track action items in sprint backlog, review completion rate |
| SLO without teeth | SLOs exist but nothing happens when they break | Define and enforce the error budget policy with product stakeholders |
| CI that cries wolf | Flaky tests cause frequent false failures | Quarantine flaky tests, fix or delete within 1 week |
| Review bottleneck | PRs wait days for review, feature branches diverge | Review SLA (4 hours), pair programming reduces review need |
Real-World Application
Google’s SRE Practices
Google pioneered error budgets and SLOs in their SRE book. Key insight: the SRE team and the product team share the same SLO. When the error budget is exhausted, the SRE team can block new feature deployments until reliability improves. This creates alignment between velocity and reliability.
Netflix’s Chaos Engineering
Netflix runs Chaos Monkey and related tools (Chaos Kong for region failures) in production. The philosophy: if your system cannot handle a random server dying during business hours, you will not handle it at 3 AM either. This forces teams to build resilient systems by default.
Etsy’s Blameless Postmortem Culture
Etsy (under John Allspaw) pioneered blameless postmortems in the industry. Key practices:
- Postmortems are mandatory for any SEV-1/SEV-2 incident
- The focus is on “what did the system allow?” not “who made the mistake?”
- Postmortems are published internally and externally (blog posts)
- Action items are tracked with the same rigor as product features
Microsoft’s Code Review Research
Microsoft Research (2013) studied code review across 36 development teams. Findings:
- The primary value is knowledge transfer, not defect detection
- Reviews find only ~15% of defects; testing and static analysis find the rest
- Review turnaround time is the strongest predictor of developer satisfaction with the review process
- Smaller changes receive more thorough reviews
References
- Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate — DORA metrics, statistical validation
- Beyer, B. et al. (2016). Site Reliability Engineering — Google SRE book, error budgets, SLOs
- Kim, G. et al. (2016). The DevOps Handbook — CI/CD practices, value stream mapping
- Humble, J. & Farley, D. (2010). Continuous Delivery — pipeline design, deployment strategies
- Allspaw, J. & Robbins, J. (2010). Web Operations — incident management foundations
- DORA Four Keys — cloud.google.com/devops
- Google SRE Book — sre.google/sre-book
- Etsy's Blameless Postmortems — etsy.com/codeascraft
- Bacchelli, A. & Bird, C. (2013). "Expectations, Outcomes, and Challenges of Modern Code Review" — Microsoft Research
- Charity Majors — "Observability and the Glorious Future" (various conference talks)
- Kent C. Dodds — "The Testing Trophy" — kentcdodds.com