Performance Management
The system by which you create clarity about what good looks like, track progress toward it, and handle the full spectrum from high performers to underperformers. Done right, nobody is ever surprised by their review.
Key Dimensions
| Dimension | What Good Looks Like | Common Failure |
|---|---|---|
| Clarity | Every person knows what’s expected at their level | Vague expectations, “you’ll know it when you see it” |
| Frequency | Continuous feedback, formal check-ins quarterly | Feedback only during annual review cycle |
| Fairness | Same rubric, calibrated across teams | Manager’s favorites get inflated, quiet performers overlooked |
| Actionability | Feedback includes specific behaviors and growth path | “Needs to be more senior” with no concrete guidance |
| Courage | Manager addresses underperformance early | Avoiding hard conversations until it’s a crisis |
| Documentation | Written record of expectations, feedback, outcomes | Nothing written until the PIP |
Goal Setting: OKRs vs. KPIs vs. Goals
OKRs (Objectives and Key Results)
Created by Andy Grove at Intel, brought to Google by John Doerr, and scaled from there. The mechanism: set an ambitious Objective (qualitative, inspirational), then define 3-5 Key Results (quantitative, measurable) that would prove you achieved it.
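A minimal sketch of that mechanism, assuming Google-style 0.0–1.0 grading of each Key Result and a plain average for the Objective score (all names and numbers are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class KeyResult:
    description: str    # quantitative, measurable
    score: float = 0.0  # graded 0.0-1.0 at quarter end

@dataclass
class Objective:
    title: str  # qualitative, inspirational
    key_results: list[KeyResult] = field(default_factory=list)

    def score(self) -> float:
        # Google-style: the objective's score is the average of its KR
        # scores; landing around 0.7 on a stretch objective is a success.
        if not self.key_results:
            return 0.0
        return sum(kr.score for kr in self.key_results) / len(self.key_results)

checkout = Objective(
    "Make checkout effortless",
    [KeyResult("Reduce checkout abandonment from 68% to 55%", score=0.8),
     KeyResult("Cut median checkout time from 90s to 45s", score=0.6)],
)
print(f"{checkout.title}: {checkout.score():.2f}")  # 0.70 -> healthy stretch result
```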
When OKRs work well:
- Product and platform teams where outcomes matter more than outputs
- When you want to stretch ambition — Google’s “0.7 is success” model encourages moonshots
- Cross-functional alignment — OKRs cascade and interlock across teams
When OKRs fail:
- OKRs become task lists — “Ship feature X” is not a Key Result, it’s an output. Key Results measure impact: “Reduce checkout abandonment from 68% to 55%”
- Too many OKRs — more than 3 objectives per team per quarter means nothing is prioritized
- No one checks mid-quarter — OKRs set in January, forgotten until March review
- Punishing misses — if missing a stretch OKR hurts performance reviews, people sandbag. Google explicitly decouples OKRs from compensation.
KPIs (Key Performance Indicators)
Better for operational/reliability work where the job is to maintain and improve steady-state metrics.
Use KPIs when:
- The team owns an ongoing service (SRE, platform, support)
- Success = keeping metrics in a healthy range, not achieving a one-time outcome
- You need to track operational health continuously
The hybrid approach (what most mature orgs actually do): OKRs for strategic initiatives (what are we changing this quarter?) + KPIs for operational health (what must we maintain?). Don’t force OKR format on operational work.
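To make the KPI half concrete, here is a small sketch assuming each operational metric has an agreed healthy range; the metric names and thresholds are invented:

```python
# Hypothetical operational KPIs with agreed healthy ranges (min, max).
KPIS = {
    "availability_pct": (99.9, 100.0),
    "p95_latency_ms":   (0.0, 250.0),
    "ticket_backlog":   (0.0, 40.0),
}

def unhealthy(current: dict[str, float]) -> list[str]:
    """Return the KPIs currently outside their healthy range."""
    out = []
    for name, (lo, hi) in KPIS.items():
        value = current.get(name)
        if value is None or not (lo <= value <= hi):
            out.append(f"{name}={value} (healthy: {lo}-{hi})")
    return out

print(unhealthy({"availability_pct": 99.95, "p95_latency_ms": 310.0, "ticket_backlog": 12}))
# ['p95_latency_ms=310.0 (healthy: 0.0-250.0)']
```

The point of the range framing: a KPI review asks “are we still inside the band?”, not “did we hit a one-time target?”, which is exactly why forcing it into OKR format fails.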
Individual goals vs. team goals:
This is the tricky part for engineering managers. Individual goals feel fair but incentivize local optimization; team goals encourage collaboration but let underperformers hide.
Recommendation: Team OKRs for what the team delivers + individual growth goals for how each person develops. Amazon does this well: business goals are team-level, but development goals are individual and tracked in 1:1s.
Performance Reviews — Getting Calibration Right
The calibration problem:
Left uncalibrated, manager ratings drift upward until they cluster around “above expectations”: everyone is above average, which makes the scale meaningless. Calibration is the process of cross-manager alignment so that “exceeds expectations” means the same thing across the org.
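One way to surface the problem before a calibration session is to compare each manager’s rating histogram against the org’s expected distribution. A sketch, with an invented target distribution for a 5-point scale:

```python
from collections import Counter

# Invented target for a 5-point scale: most people meet expectations.
EXPECTED = {1: 0.05, 2: 0.10, 3: 0.55, 4: 0.25, 5: 0.05}

def rating_skew(ratings: list[int]) -> dict[int, float]:
    """Actual minus expected share for each rating; positive = inflated."""
    counts = Counter(ratings)
    n = len(ratings)
    return {r: counts.get(r, 0) / n - share for r, share in EXPECTED.items()}

# A manager whose reports are all "above average":
print(rating_skew([4, 4, 5, 4, 3, 5, 4, 4]))
# Rating 4 is ~37 points over target -> worth discussing in calibration.
```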
How calibration actually works at top companies:
Google: Managers write initial reviews, then calibration committees (skip-level manager + peer managers) review all ratings in a session. Managers must justify ratings with specific examples. Committees redistribute ratings to roughly match the expected distribution. This is uncomfortable but produces fairness.
Netflix: No formal performance ratings. Instead, the “keeper test” — would you fight to keep this person? If not, give them a generous severance. This is radical and works for Netflix’s specific culture (high talent density, high compensation, low job security tolerance). Most companies cannot replicate this because they don’t pay top-of-market to compensate for the insecurity.
Amazon: OLR (Organization and Leadership Review) is a calibration process where managers rank their org from top to bottom and discuss with peers. Stack ranking by another name, but with nuance — the focus is on “who’s in the wrong role” more than “who’s worst.”
The rating scale debate:
| Scale Type | Pros | Cons | Used By |
|---|---|---|---|
| 5-point (1-5) | Granular, familiar | Central tendency bias (everyone’s a 3) | Google, Microsoft |
| 3-point (below/meets/exceeds) | Forces differentiation | Not enough resolution for comp decisions | Some startups |
| No ratings | Reduces politics, focus on growth | Hard to make comp decisions, recency bias | Netflix, Deloitte (tried, partially reverted) |
| 4-point (no middle) | Eliminates “meets expectations” default | Can feel forced | Used in some military-origin systems |
My take: 5-point with forced calibration is the least-bad option. You need enough resolution to differentiate for compensation, but you must calibrate or the scale is theater.
The recency bias problem:
Reviews cover 6 or 12 months, but humans remember the last 6 weeks. Countermeasures:
- Brag documents — ask each person to maintain a running list of accomplishments (Julia Evans popularized this)
- Manager journal — write brief notes after each 1:1 about impact and growth signals
- Quarterly check-ins — mini-reviews that create a paper trail for the annual review
- Peer feedback collected at multiple points — not just at review time
Managing Underperformance
This is where most managers fail — not because they can’t identify underperformance, but because they avoid the conversation until it’s a crisis.
The underperformance spectrum:
| Level | Signal | Response | Timeline |
|---|---|---|---|
| Early drift | Missing deadlines, lower quality, disengaged in meetings | Direct feedback in 1:1, explore root cause | 2-4 weeks to see improvement |
| Consistent underperformance | Pattern over 6-8 weeks, feedback not sticking | Explicit expectations reset, written plan, weekly check-ins | 30-60 days |
| Formal PIP | Written plan failed or issue is severe | HR-involved PIP with clear criteria and timeline | 30-90 days |
| Exit | PIP failed or pattern is irrecoverable | Managed exit with dignity, severance if warranted | Immediate to 2 weeks |
Before the PIP — the conversation most managers skip:
The formal PIP should never be the first time someone hears they’re underperforming. The sequence should be:
1. Verbal feedback (1:1): “I’ve noticed X pattern. Here’s what I need to see instead. How can I help?”
2. Written expectations reset (email/doc after the 1:1): “Following up on our conversation. Here’s what success looks like in the next 30 days: [specific, measurable criteria]”
3. Weekly check-ins on progress, with explicit acknowledgment of improvement or continued concern
4. Formal PIP only if steps 1-3 didn’t resolve it
Common causes of underperformance (and the right response):
| Root Cause | Signal | Right Response | Wrong Response |
|---|---|---|---|
| Wrong role | Strong in some areas, failing in others | Explore role change, different team | PIP on their weaknesses |
| Personal crisis | Sudden drop from previously strong performer | Compassion, temporary load reduction, EAP referral | “Your performance is slipping” (without asking why) |
| Skill gap | Willing but unable | Training, pairing, mentoring | Waiting for them to figure it out |
| Motivation loss | Capable but checked out | Explore what’s changed — boredom? conflict? comp? | Assuming laziness |
| Bad fit | Cultural mismatch, values conflict | Honest conversation about fit, managed exit | Trying to “fix” them |
| Manager failure | Unclear expectations, no feedback, no support | Fix your management first | Blaming the report |
The PIP document:
A good PIP is compassionate in intent and ruthless in clarity (a sketch of tracking criteria like these follows the list):
- Specific deficiencies — “In the last 60 days, you missed 3 of 5 sprint commitments and delivered code that required significant rework on PRs #142, #156, and #171” (not “your performance is below expectations”)
- Clear success criteria — “Over the next 30 days, you will: (a) complete assigned sprint items at 80%+ rate, (b) have no more than 1 PR require rework for quality issues, (c) proactively communicate blockers within 24 hours”
- Support provided — “I will: pair you with [senior engineer] for daily 30-min pairing sessions, review your PRs within 4 hours, meet weekly to discuss progress”
- Consequences — “If these criteria are not met by [date], we will proceed with separation”
- Timeline — 30 days for performance, 60 days for behavioral issues, 90 days only if the person is showing genuine improvement and needs more time
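As a sketch of what “ruthless in clarity” means in practice, success criteria like the ones above can be written as explicit thresholds and checked week by week; everything here is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    target: float  # threshold the measure must meet
    actual: float  # measured during the PIP window

    def met(self) -> bool:
        return self.actual >= self.target

# Hypothetical criteria mirroring the examples above.
criteria = [
    Criterion("Sprint commitment completion rate", target=0.80, actual=0.85),
    Criterion("PRs free of quality rework (share)", target=0.90, actual=0.70),
    Criterion("Blockers raised within 24h (share)", target=1.00, actual=1.00),
]

for c in criteria:
    status = "MET" if c.met() else "NOT MET"
    print(f"[{status}] {c.description}: {c.actual:.0%} vs target {c.target:.0%}")
```

If a criterion can’t be expressed this concretely, it isn’t clear enough to put in a PIP.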
The ethical dimension:
A PIP should be a genuine attempt to help someone succeed, not a paper trail for termination. If you’ve already decided to fire someone, don’t waste their time with a fake PIP — have the exit conversation directly. Using a PIP as legal cover when you have no intention of keeping someone is dishonest, and people can always tell.
Stack Ranking — The Debate
Stack ranking (forced distribution of performance ratings) was popularized by Jack Welch at GE (“rank and yank” — bottom 10% managed out annually). Microsoft famously used a version of it for over a decade before abandoning it in 2013.
Why it fails at scale:
- Destroys collaboration — if my success requires your failure, why would I help you?
- Punishes great teams — a team of 10 strong performers must still label 1-2 as “underperformers”
- Gaming — managers hoard low performers to sacrifice, or hire someone specifically to fill the bottom slot
- Retention inversion — strong performers who see teammates unfairly labeled leave; weak performers protected by the curve stay
What to do instead:
- Calibrate without forced distribution — discuss all employees in a group, but don’t require a bell curve
- Focus on growth trajectory — is this person growing, plateaued, or declining? More useful than a ranking
- Separate evaluation from development — evaluation is backward-looking (what happened), development is forward-looking (what’s next). Don’t try to do both in one conversation
- Use relative assessment sparingly — when making promotion decisions, relative comparison is useful. For routine reviews, absolute assessment against level expectations is better
High Performers — The Neglected Risk
Most managers pour nearly all of their performance management energy into underperformers and neglect their top performers. This is backwards.
What high performers actually need:
| Need | What It Looks Like | What Happens If You Ignore It |
|---|---|---|
| Challenge | Stretch assignments, new problem domains | They get bored and leave |
| Recognition | Specific, public acknowledgment of impact | They feel invisible and leave |
| Autonomy | Trust to make decisions, less oversight | They feel micromanaged and leave |
| Growth path | Clear next role, skill development plan | They see no future and leave |
| Compensation | At or above market, equity refresh, spot bonuses | They get poached and leave |
| Shielding | Protection from organizational noise | They burn out on politics and leave |
The “quiet high performer” problem:
In a team of 16, you likely have 2-3 people who consistently deliver great work without drama. They don’t ask for recognition, they don’t complain, they just execute. These are your highest retention risks, because you’ll take them for granted until they hand in their resignation.
Countermeasure: Proactively schedule career conversations with your top performers every quarter. Don’t wait for them to bring it up. Ask: “What would make you start looking elsewhere?” and “What’s the most exciting thing you could be working on?”
Performance Conversations — The Mechanics
The SBI-I framework for performance feedback:
- Situation: “In last Tuesday’s design review…”
- Behavior: “…you interrupted the junior engineer three times when they were presenting their approach…”
- Impact: “…which made them visibly uncomfortable and less likely to share ideas in the future.”
- Intent/Inquiry: “I don’t think that was your intent. What was going on for you in that moment?”
The “no surprises” principle:
If someone is surprised by their performance review, you failed as a manager — not them. Every piece of feedback in the formal review should have been discussed in 1:1s already. The review is a summary, not a reveal.
Annual review writing tips:
- Lead with impact, not activity — “Led the migration of 3 services to K8s, reducing deployment time from 45 min to 8 min” not “Worked on Kubernetes migration”
- Be specific about growth — “Improved significantly in stakeholder communication, specifically in how they present technical tradeoffs to product — visible in the Q3 roadmap discussion” not “Improved communication skills”
- Address development areas with growth framing — “Next growth edge is learning to delegate more effectively — currently tends to take on too much personally, which limits their team’s development”
- Calibrate your language — words like “adequate,” “satisfactory,” and “acceptable” all read as negative, even if you mean them neutrally
References
Books
- High Output Management — Andy Grove (performance reviews as a manager’s primary output)
- Radical Candor — Kim Scott (the 2x2 of caring personally / challenging directly)
- An Elegant Puzzle — Will Larson (systems for performance management at scale)
- Measure What Matters — John Doerr (OKRs — the canonical reference)
- Nine Lies About Work — Marcus Buckingham (challenges rating scales and annual reviews)
- The Hard Thing About Hard Things — Ben Horowitz (on firing, PIPs, and difficult conversations)
Research & Articles
- “Reinventing Performance Management” — Deloitte/HBR (2015) — the case for replacing annual reviews
- Google re:Work — open-source calibration and review guides
- Julia Evans — “Brag Documents” (blog post, practical advice for self-evaluation)
- “The Keeper Test” — Netflix culture memo (radical approach to performance)
Talks
- Patty McCord — “Powerful: Building a Culture of Freedom and Responsibility” (Netflix HR philosophy)
- Kim Scott — Radical Candor talks (multiple versions, all worth watching)