
The review-process maturity model


A review-process maturity model places your process on a ladder of maturity rather than giving it a pass/fail grade, and naming your level tells you the single thing to fix next. The rungs run from ad-hoc (reviews happen when someone remembers, and the criteria live in one lead's head) through consistent (same cadence, one shared framework) and calibrated (leads compared, ratings defensible across reviewers) to systemic (the process reliably feeds staffing, promotion and the partner track). You don't leap to the top. You find the level you're actually at and take the next step, because each level fixes the failure of the one below, so a rung skipped is a rung that collapses. For a project-based firm, "mature" has nothing to do with polished forms. It means a review whose output the firm can't run without.

This framing matters because effort alone doesn't buy you a good process. Higher-maturity practices are the ones that correlate with results: organisations that review goals quarterly or more often are 3.5× more likely to land in the top quartile of business outcomes, and those with clear, well-set goals 4× more likely (Bersin by Deloitte, 2014). Meanwhile the rating on its own barely moves anyone — after a century of research there's little evidence that appraisal by itself improves performance (DeNisi & Murphy, Journal of Applied Psychology, 2017). The difference between a review that changes something and one that just files the past is a difference of level, not of form design.

Questions about your review process? Book a free call — we'll place your process on the ladder below and name the one move that takes it to the next level.

What is a review-process maturity model — and why think in levels?

It's a way of asking a better question. Not "is our review process good?" — which has no actionable answer — but "what level are we at, and what's the next move?" Level-based process maturity comes from software: the Capability Maturity Model ranked a process on staged levels from ad-hoc/initial up to optimizing, each level institutionalising what the one below left to chance (Humphrey, Managing the Software Process, 1989; formalised by the SEI at Carnegie Mellon, 1991). The idea then jumped to people practices in the People Capability Maturity Model, which applied the same staged logic — five levels, advanced one at a time — to how organisations manage and develop their workforce (Curtis, Hefley & Miller, SEI / Carnegie Mellon, 2009).

Thinking in levels does two useful things. It replaces a verdict with a direction: a level isn't a grade you feel bad about, it's a diagnosis that points at the next rung. And it stops you buying the wrong fix — the most common mistake in a small firm is installing level-4 tooling on an ad-hoc base, spending on a platform when the actual problem is that nobody has agreed what "good" looks like yet.

What are the levels, from ad-hoc to systemic?

Four rungs, each defined by what it fixes. You climb them in order.

  1. Ad-hoc. Reviews happen when someone remembers — before a raise conversation, or when a client complains. The criteria live in a lead's head, there's no shared framework, and the output goes nowhere in particular. This is the founder-as-bottleneck level: quality depends entirely on which lead you happened to get.
  2. Consistent. Everyone is reviewed on the same cadence, against one shared framework, with a named owner who runs the cycle. This is the first rung where the process exists independently of any one person. But each lead still rates in their own way — the form is shared; the standard behind it isn't yet.
  3. Calibrated. Before ratings are final, leads are compared against each other and against the framework, so outlier raters are caught and a "meets the bar" means the same thing whoever wrote it. This is the rung that makes a rating defensible — and, as the next section shows, it's not optional dressing but a real correction to a real distortion.
  4. Systemic. The output reliably feeds the decisions the firm actually makes: who gets staffed on what, who's promoted, who's on the partner track, what each person develops next. The review stops being a document that gets filed and becomes an input the firm can't run without. This is where performance actually moves, because improvement comes from what happens after the rating, not the rating itself (Smither, London & Reilly, Personnel Psychology, 2005).
Figure: The review-process maturity ladder. Level 1 Ad-hoc → Level 2 Consistent (fixes: no rhythm, one head) → Level 3 Calibrated (fixes: each lead's own bar) → Level 4 Systemic (fixes: a rating that's filed).
Four rungs, climbed one at a time; each level institutionalises what the one below leaves to chance. Maturity model lineage: Humphrey / SEI CMM (1989–1991); People CMM (Curtis, Hefley & Miller, SEI / Carnegie Mellon, 2009).

How do you tell which level you're actually at?

Look at six observable markers, not at how the forms look. Cadence (does everyone get reviewed on a predictable rhythm, or when someone remembers?); ownership (is there a named person who runs the cycle, or does it depend on individual leads?); framework (is there one shared standard for what "good" means at each level, or does each reviewer carry their own?); calibration (are leads compared before ratings are final?); follow-through (does the output change a staffing, development or promotion decision, or does it get filed?); and data (can you see completion and rating patterns across the firm?). Your level is the highest rung whose markers all hold, together with every marker below it; the first rung with a missing marker is your next move. Maturity is a chain, and it's only as strong as its weakest link.
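The weakest-link rule can be made concrete with a small sketch. It's a minimal illustration under stated assumptions, not a Polar Bear tool: the marker names and the `review_level` function are hypothetical, and the markers are grouped by rung following the level descriptions above (cadence, ownership and framework at level 2, calibration at level 3, follow-through at level 4). Placing the data marker at level 3 is an assumption; the article doesn't assign it to a rung.

```python
# Hypothetical grouping of the six markers by rung, following the level
# descriptions in the article. Assigning "data" to level 3 is an assumption.
RUNG_MARKERS: dict[int, tuple[str, ...]] = {
    2: ("cadence", "ownership", "framework"),
    3: ("calibration", "data"),
    4: ("follow_through",),
}


def review_level(markers: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (current level, markers still missing for the next rung).

    Level 1 (ad-hoc) is the floor. Rungs are checked in order and the climb
    stops at the first rung with a missing marker, so a 'yes' higher up can't
    compensate for a gap lower down: maturity is a chain, not a total.
    """
    level = 1
    for rung in sorted(RUNG_MARKERS):
        missing = [m for m in RUNG_MARKERS[rung] if not markers.get(m, False)]
        if missing:
            return level, missing
        level = rung
    return level, []


# Example: shared cadence, owner, framework and even follow-through,
# but leads are never compared before ratings are final.
firm = {
    "cadence": True, "ownership": True, "framework": True,
    "calibration": False, "data": True, "follow_through": True,
}
print(review_level(firm))  # (2, ['calibration'])
```

Note that this firm comes out at level 2 even though its ratings already feed decisions: a later "yes" can't buy back a missing rung, and the missing marker it returns is the single next move.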

The calibration marker is the one firms most often skip, and it's the one with the hardest evidence behind it. When researchers decomposed what actually drives a performance rating, the single largest source of variance wasn't the person's performance — it was the idiosyncrasy of the rater, at around 62% of the variance, against only about 21% for the ratee's actual performance (Scullen, Mount & Goff, Journal of Applied Psychology, 2000). In plain terms: before you calibrate, a rating tells you more about who did the reviewing than about the work. That's why "calibrated" is a genuine, non-skippable level and not a nice-to-have — a firm that rates on time but never compares leads is running level-2 machinery and calling it fair.

Figure: What is a rating actually made of? Share of performance-rating variance, first dataset (n = 2,350 managers): the rater (idiosyncrasy) 62%; the person's performance 21%.
Before you calibrate, a rating is mostly the rater: idiosyncratic rater effects accounted for ~62% of rating variance versus ~21% for actual performance. That gap is why "calibrated" is its own maturity level. Source: Scullen, Mount & Goff, Journal of Applied Psychology, 2000 (first of two datasets; 53% rater effect in the second).
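One simple way to prepare for a calibration step is to flag the leads whose average rating sits unusually far from the firm-wide average, so the conversation starts with them. The sketch below is an illustrative assumption, not the method used by Scullen, Mount & Goff and not a standard: the `flag_outlier_raters` function, the 1-5 rating scale and the 0.5-point threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical cut-off, in rating points on an assumed 1-5 scale.
THRESHOLD = 0.5


def flag_outlier_raters(
    ratings_by_lead: dict[str, list[float]], threshold: float = THRESHOLD
) -> dict[str, float]:
    """Return {lead: gap} for leads whose mean rating differs from the
    firm-wide mean of all ratings by more than `threshold` points.

    A positive gap means the lead rates above the firm average, a negative
    gap below it. Leads with no ratings are skipped.
    """
    all_ratings = [r for rs in ratings_by_lead.values() for r in rs]
    if not all_ratings:
        return {}
    firm_mean = mean(all_ratings)
    flagged = {}
    for lead, rs in ratings_by_lead.items():
        if not rs:
            continue
        gap = mean(rs) - firm_mean
        if abs(gap) > threshold:
            flagged[lead] = round(gap, 2)
    return flagged


# Example: three leads, four ratings each (hypothetical data).
ratings = {
    "Ana": [4, 5, 5, 4],
    "Ben": [3, 3, 4, 3],
    "Chloe": [3, 4, 3, 4],
}
print(flag_outlier_raters(ratings))  # {'Ana': 0.75}
```

A flag is a question for the calibration meeting, not a verdict: a lead's average can sit high because they rate leniently or because their team genuinely performed better, and this screen can't tell the two apart. Comparing leads against the shared framework is what settles that.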

Why does the level matter more for a billable, multi-lead firm?

Because in a firm made of people, the review's output is load-bearing. A consultant or creative is staffed across several engagements under different leads, rated per engagement by whoever ran it, and the result feeds calibration across reviewers, promotion decisions and the partner track. An agency where reviews happen on time but never roll up into those calls is stuck at a low level however polished the form looks — the artifact exists, but the firm still makes its real people decisions in the corridor.

That's also why the level, not the tooling, is the thing to fix. In a big company an immature review process is a nuisance; in a boutique it lands directly on the senior talent you can least afford to lose, because a rating a person can't act on — and can't see the logic of — reads as a rating that wasn't fair. The maturity ladder gives a founder or Head of People a way to locate the firm honestly and pick the one fix that moves it up, instead of buying a platform a level too early or blaming individual leads for a gap that is really a missing shared standard.

How do you move up one level — without over-engineering?

Fix the single bottleneck to the next rung — nothing more. If you're ad-hoc, the next move isn't a platform; it's agreeing one shared framework and a cadence, so the process stops living in one head. If you're consistent, the next move isn't more forms; it's a calibration step where leads compare ratings before they're final. If you're calibrated, the next move is connecting the output to a real decision — staffing, development, promotion — because a review that changes nothing changes no one.

Two rules keep the climb honest. Don't skip a rung: you can't calibrate leads who are using different bars (calibration needs a shared framework first), and you can't feed the partner track with ratings that aren't yet defensible. And don't over-build: level-4 tooling on an ad-hoc base is money spent on a symptom. The reason to climb at all is that maturity correlates with outcomes (organisations that review goals quarterly or more often are 3.5× more likely to reach top-quartile business outcomes; Bersin by Deloitte, 2014), but that payoff comes from the rung you're standing on connecting to real work, not from the sophistication of the instrument. The systemic level pays back precisely because improvement lives in the follow-up (Smither, London & Reilly, 2005), and a rating that connects to nothing has no follow-up to give.

What traps fake maturity?

The dangerous failures are the ones that look mature. Watch for these.

  1. Process theatre. Mature-looking forms, templates and a slick cycle sitting on top of an ad-hoc reality — the ratings are still one lead's opinion, dressed up. Polish is not a level.
  2. Tool-first maturity. Buying a performance platform and mistaking the software for the standard. A tool can carry a mature process; it cannot create one. If there's no shared framework, the platform just industrialises the inconsistency.
  3. Reviews that never touch a decision. The cycle completes, the ratings are filed, and nothing downstream changes — staffing, promotion and development are decided elsewhere. This is the most common ceiling: consistent on the surface, never systemic, and the rating alone won't rescue it (DeNisi & Murphy, 2017).
  4. One-off calibration. Calibrating once, for the promotion round, then never again. Calibration is a standing marker of the level, not an annual event; drop it and you slide back to level 2 without noticing.
  5. Skipping a rung. Reaching for "feeds the partner track" while ratings still aren't calibrated — building the roof before the walls. The decision inherits the distortion the missing rung was supposed to catch.

A review-process maturity self-check

Tick each marker that holds, but don't total the ticks. Your level is the highest rung where every marker up to and including that rung holds, because maturity is a chain, not a total.

  • Cadence — everyone is reviewed on a predictable rhythm, not when someone remembers.
  • Ownership — a named person runs the cycle; it doesn't depend on individual leads.
  • Framework — one shared standard defines "good" at each level, used by every reviewer.
  • Calibration — leads compare ratings against each other before they're final.
  • Follow-through — the output changes a real staffing, development or promotion decision.
  • Data — you can see completion and rating patterns across the firm.
  • No theatre — the polish matches the reality; the forms aren't ahead of the standard.
  • Connected — the review is an input the firm's decisions actually depend on.

Not sure which rung you're on — or which fix comes next? Book a call and we'll place your process on the ladder and name the single next move, together.

FAQ

What is a review-process maturity model?

It's a ladder that places your review process on a level of maturity — ad-hoc, consistent, calibrated, or systemic — instead of scoring it good or bad. The point is diagnostic: naming your level tells you the one thing to fix next. The idea is borrowed from process-maturity models in software and workforce practice (Humphrey, 1989; Curtis, Hefley & Miller, SEI / Carnegie Mellon, 2009), applied to how a firm reviews its people.

What are the levels of review-process maturity?

Four, climbed in order: ad-hoc (reviews when someone remembers, criteria in one head), consistent (shared cadence and framework, a named owner), calibrated (leads compared, ratings defensible across reviewers), and systemic (the output reliably feeds staffing, promotion and the partner track). Each level fixes the failure of the one below, so you can't skip a rung.

How do I know which level my firm is at?

Check six observable markers (cadence, ownership, shared framework, calibration, follow-through, and data), not how the forms look. Your level is the rung just below the first one where a marker is still missing, because maturity is only as strong as its weakest link.

Why is calibration treated as its own level?

Because without it a rating is mostly the rater. When researchers decomposed performance-rating variance, idiosyncratic rater effects accounted for about 62% of it, versus roughly 21% for actual performance (Scullen, Mount & Goff, 2000). Comparing leads before ratings are final is what makes a rating mean the same thing whoever wrote it — a real correction, not dressing.

Isn't more mature just more bureaucratic?

No — maturity here means the output connects to real decisions, not more paperwork. Over-building is itself a trap: level-4 tooling on an ad-hoc base is money spent on a symptom. Higher-maturity practices correlate with better outcomes (Bersin by Deloitte, 2014), but the payoff comes from the process connecting to real work, not from the sophistication of the tool.

About us

Both ex-McKinsey, we bring the best practices of people growth to the agency world, building simple, lovable people systems without the corporate HR heritage.

Pauline Bertry

Product Growth · CX Design

10+ years leading product & design teams. Built from scratch and led Design Hubs at McKinsey Moscow and Budapest. Created career frameworks and growth systems tested with 100+ person cross-functional product teams.

Alexey Lobachev

People Strategy · Engagement

9 years running communication, people, experience and engagement programs at McKinsey taught him the hardest skill in operations: knowing what to delegate, what to automate, and what to leave alone. As a co-founder of Polar Bear he applies that instinct to AI agents, building them to augment the internal processes and tools his team already runs on.


Dealing with a people challenge and not sure where to start?

Let's have a conversation

Sources

  1. Humphrey, W. S. (1989). Managing the Software Process. Addison-Wesley — formalised by the Software Engineering Institute (SEI), Carnegie Mellon, as the Capability Maturity Model (CMM), 1991. The origin of level-based process maturity: a process is ranked on staged levels (Initial/ad-hoc → Repeatable → Defined → Managed → Optimizing), each institutionalising what the level below leaves to chance, advanced one level at a time. en.wikipedia.org
  2. Curtis, B., Hefley, W. E., & Miller, S. A. (2009). People Capability Maturity Model (P-CMM), Version 2.0, Second Edition. Technical Report CMU/SEI-2009-TR-003, Software Engineering Institute, Carnegie Mellon University. Applies staged maturity (five levels: Initial → Managed → Defined → Predictable → Optimizing) to workforce practices, aligning people capability with business objectives. sei.cmu.edu
  3. Bersin by Deloitte — Garr, S. S. (2014). High-Impact Performance Management: Using Goals to Focus the 21st Century Workforce. Organisations that review goals quarterly or more often are 3.5× more likely, and those enabling goal clarity 4× more likely, to score in the top 25% of business outcomes; 54% review or revise goals only yearly or never. prnewswire.com
  4. Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the Latent Structure of Job Performance Ratings. Journal of Applied Psychology, 85(6), 956–970. Idiosyncratic rater effects were the largest source of rating variance — about 62% in the first dataset (53% in the second) — versus roughly 21% for the ratee's actual performance; why calibration is a genuine maturity level. scirp.org
  5. Smither, J. W., London, M., & Reilly, R. R. (2005). Does performance improve following multisource feedback? A theoretical model, meta-analysis, and review of empirical findings. Personnel Psychology, 58, 33–66. Improvement is generally small and conditional — larger when the review is followed by goal-setting, coaching and follow-up; the "systemic" end of the ladder. onlinelibrary.wiley.com
  6. DeNisi, A. S., & Murphy, K. R. (2017). Performance appraisal and performance management: 100 years of progress? Journal of Applied Psychology, 102(3), 421–433. After a century of research, little consistent evidence that appraisal on its own improves performance — a mature process is one whose output connects to real decisions, not a polished form. psycnet.apa.org
  7. CIPD. Performance management (factsheet). Practitioner standard for a consistent, fair and defensible process — clear expectations, a shared framework, cadence, calibration, and a link to development. cipd.org