You can't delete bias from a human judgment — but you can sharply reduce and contain it by design. Removing bias from reviews isn't an awareness workshop or a nicer form; it's engineering the process so the rating reflects the work, not who reviewed it, how recently, or which client they liked. That means fixing the criteria before you review, judging concrete evidence against a shared bar, covering the whole period instead of the last project, gathering more than one input weighted by who actually saw the work, calibrating to catch outlier raters, then running a final language check on the written review — with AI flagging vague or loaded wording but a human always making the call. The goal is a more reliable, defensible judgment, not a "debiased" person or a promise of perfect objectivity.
How much does the rater contaminate the result? When researchers decomposed 4,492 managers' ratings, about 62% of the variance came from the rater's idiosyncrasies and only ~21% from the person's actual performance (Scullen, Mount & Goff, Journal of Applied Psychology, 2000). Most of a single score is the rater. Containing bias is how you get the work back into the picture.
Which biases distort project-based reviews most?
The ones that thrive when a person is staffed across several engagements and rated by whichever leads happened to run them. Five do most of the damage:
- Recency. The most recent engagement weighs far too heavily in the overall rating; a strong or weak finish near the cycle close swamps the ten months that came before (Steiner & Rain, Journal of Applied Psychology, 1989). End-of-cycle recall quietly becomes "what did they do lately?"
- Halo (the "favourite client" glow). One strong overall impression colours every separate judgment — the original "halo error" found that supposedly independent ratings of a person's distinct qualities were implausibly correlated (Thorndike, Journal of Applied Psychology, 1920). The lead who loved working with someone rates them high on everything.
- The idiosyncratic rater effect. Much of any score is simply the rater's own perspective and leniency, not the person: ~62% of rating variance is rater idiosyncrasy versus ~21% actual performance (Scullen, Mount & Goff, 2000). One reviewer is a noisy instrument — interrater reliability of supervisory ratings is only about .52 (Viswesvaran, Ones & Schmidt, Journal of Applied Psychology, 1996).
- Similarity and affinity bias. Reviewers favour people who work, communicate, or problem-solve the way they do — easily mistaken for "good judgment" or "culture fit".
- Leniency and central tendency. Some leads rate everyone high to avoid hard conversations; others cluster everyone in the middle. Either way the rating stops discriminating between real differences in the work.
Why don't awareness training or a better form fix it?
Because bias lives in the judgment, not the template — and telling people to "just be objective" barely moves ratings. Worse, it can backfire: when an organisation's values stressed meritocracy, managers gave male employees bonuses about $46 higher on average than identically-performing women ($418.80 vs $372.40) — the "paradox of meritocracy" (Castilla & Benard, Administrative Science Quarterly, 2010). Believing you're impartial gives you permission to act on bias.
The durable lever isn't fixing minds; it's redesigning the process. Unconscious bias is hard and costly to train away — diversity workshops show limited and sometimes counter-productive effects — so the reliable move is to de-bias the organisation, not the individual (Bohnet, What Works, 2016). And it isn't only directional bias you're fighting: there's also noise — random variability between equally competent raters judging the same work. Structured, criterion-first, independent judgment ("decision hygiene") reduces both (Kahneman, Sibony & Sunstein, Noise, 2021). A nicer form leaves all of that untouched.
How do you design the process to contain bias?
Put the guardrails before and around the judgment, where they can actually catch it. Four do the heavy lifting:
- Fix the criteria before you review. Define, in observable behaviours, what "good" looks like at each level, and write it down before the cycle (CIPD, Performance management). When the bar is set in advance, it can't quietly shift to fit the person in front of you.
- Judge evidence against the shared bar — gather first, score second. Collect concrete examples tied to the criteria ("renegotiated scope when the client moved the goalposts"), then rate. The moment a proposed score is visible, everyone anchors to it; independent inputs are what dilute a single rater's slant.
- Cover the whole period and more than one viewpoint. Running notes across engagements beat end-of-cycle recall (the antidote to recency), and inputs from the different leads and peers who actually saw the work — weighted by proximity, not the org chart — dilute any one rater's idiosyncrasy.
- Calibrate to catch outliers, not to massage numbers. Compare ratings across reviewers and levels to surface the lenient lead, the harsh one, and the idiosyncratic rater before any decision is made. Calibration is a noise-and-bias check, not a quota.
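As a rough illustration of "weighted by proximity, not the org chart", the sketch below averages several inputs using the hours each reviewer actually observed as the weight. The `ReviewInput` type, the `hours_observed` proxy, the 1-5 scale, and the sample data are all assumptions made for the example, not a prescribed scheme.

```python
from dataclasses import dataclass


@dataclass
class ReviewInput:
    reviewer: str
    score: float           # rating against the shared bar (1-5 scale assumed)
    hours_observed: float  # hypothetical proxy for proximity to the work


def proximity_weighted_score(inputs):
    """Average the scores, weighting each by how much work the reviewer saw."""
    total_hours = sum(i.hours_observed for i in inputs)
    if total_hours <= 0:
        raise ValueError("at least one input needs observed hours")
    return sum(i.score * i.hours_observed for i in inputs) / total_hours


# Illustrative data only: three inputs on one person for one cycle.
inputs = [
    ReviewInput("lead_a", 3.0, 300),  # ran the longest engagement
    ReviewInput("lead_b", 5.0, 100),  # ran the final, short engagement
    ReviewInput("peer_c", 4.0, 100),
]

simple_mean = sum(i.score for i in inputs) / len(inputs)
weighted = proximity_weighted_score(inputs)
print(f"simple mean: {simple_mean:.2f}, proximity-weighted: {weighted:.2f}")
# prints "simple mean: 4.00, proximity-weighted: 3.60"
```

In this made-up case the short final engagement pulls a plain average up to 4.0, while weighting by observed hours gives 3.6. The weighted figure is still only an input to the calibration discussion, not the decision itself.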
How do you de-bias the written review — and where does AI help?
Swap vague or loaded language for evidence, then check the wording before it ships. A written review is where bias gets encoded into the record: personality words ("abrasive", "not partner material", "a joy to work with") instead of behaviours, different vocabulary for the same act depending on who did it, praise that's all warmth and no substance. The fix is to make every claim point to a concrete example against the shared bar — and to read the final text for words that judge the person rather than describe the work.
This is exactly where AI earns a place — as a human-in-the-loop check on wording and consistency, never as the thing that assigns the rating. A model is good at flagging vague adjectives, loaded language, and uneven tone across a batch of reviews so a human can rewrite them against the evidence. It is not a fair, accountable judge of a person's year, and outsourcing the rating to it just hides the bias behind a confident sentence. Use it to tighten the language; keep the judgment with the people who own it.
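To make the "flag, then a human rewrites" split concrete, here is a minimal rule-based stand-in for the flagging step. It is not an AI model; the `LOADED_TERMS` list is a hypothetical starting point seeded with the three phrases quoted above, and the function only points at wording. It never touches the rating.

```python
import re

# Hypothetical seed list; a real one would be agreed and maintained by the firm.
LOADED_TERMS = ["abrasive", "not partner material", "a joy to work with"]


def flag_wording(text, terms=LOADED_TERMS):
    """Return (position, matched phrase) pairs for a human to review."""
    flags = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            flags.append((match.start(), match.group(0)))
    return sorted(flags)  # in order of appearance in the text


personality_based = "Sam can be abrasive in client meetings, but is a joy to work with."
evidence_based = "Renegotiated scope when the client moved the goalposts."

flagged_phrases = [phrase for _, phrase in flag_wording(personality_based)]
print(flagged_phrases)               # ['abrasive', 'a joy to work with']
print(flag_wording(evidence_based))  # []
```

A word list like this catches only exact phrases, which is why the text treats a model as the more useful flagger. Either way the output is a list of passages to reconsider, and the reviewer decides what to rewrite.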
How do you check whether your reviews are actually fair?
Audit the outputs — because a process you can't see is a process you can't trust. After the cycle, look at rating patterns by reviewer, by level, and by group: if the same leads always run high or low, if one client team always scores above the rest, or if a particular group is consistently rated down, your guardrails aren't holding (CIPD, Performance management: Could do better?). A fair, transparent, inclusive process is something you verify with the numbers, not something you assume because you wrote a competency framework.
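One way to run that audit, sketched under simple assumptions: compare the mean rating for each reviewer, level, or group with the overall mean and flag large gaps. The record fields, the 0.5-point threshold, the minimum sample size, and the data are hypothetical, and a flag is a prompt to look closer, not proof of bias.

```python
from collections import defaultdict
from statistics import mean


def rating_gaps(records, dimension):
    """Return {value: (mean rating minus overall mean, count)} for one dimension."""
    overall = mean(r["rating"] for r in records)
    buckets = defaultdict(list)
    for r in records:
        buckets[r[dimension]].append(r["rating"])
    return {k: (mean(v) - overall, len(v)) for k, v in buckets.items()}


def flag_gaps(records, dimension, threshold=0.5, min_count=3):
    """List values whose gap exceeds the threshold, skipping tiny samples."""
    gaps = rating_gaps(records, dimension)
    return sorted(k for k, (gap, n) in gaps.items()
                  if n >= min_count and abs(gap) > threshold)


# Illustrative data only: (reviewer, group, rating) for one cycle.
rows = [
    ("lead_a", "x", 5), ("lead_a", "x", 5), ("lead_a", "y", 4),
    ("lead_b", "x", 3), ("lead_b", "y", 4), ("lead_b", "y", 3),
    ("lead_c", "x", 2), ("lead_c", "y", 2), ("lead_c", "y", 3),
]
records = [{"reviewer": rv, "group": g, "rating": s} for rv, g, s in rows]

print(flag_gaps(records, "reviewer"))  # ['lead_a', 'lead_c']
print(flag_gaps(records, "group"))     # []
```

In this made-up cycle two leads sit more than half a point from the overall mean while the group gap stays under the threshold. A lead can also run high because they genuinely had a stronger team, so the flagged names go to a conversation rather than an automatic adjustment.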
Keep the whole thing developmental, too. Feedback raised performance on average across 607 effect sizes, but over a third of interventions actually made performance worse (Kluger & DeNisi, Psychological Bulletin, 1996), and after a century of research there's little evidence appraisal on its own improves anything (DeNisi & Murphy, Journal of Applied Psychology, 2017). A "cleaner" rating that leaves people demotivated isn't a win: de-biasing has to make the judgment both more accurate and more useful.
Why does removing bias matter more for agencies and consulting boutiques?
Because in a project-based firm there is no single manager who watched someone all year. People are staffed across engagements and rated by whichever leads happened to run them, so the last client, the most-visible project, or the partner's favourite tends to crowd out everything else — and a promotion or partner-track decision rides on it. The 62% rater-idiosyncrasy finding (Scullen, Mount & Goff, 2000) isn't abstract here; it's the difference between a defensible promotion case and a political one.
That makes de-biasing structural, not a one-off training. A shared bar lets different engagement leads mean the same thing. Running notes and weighted inputs stop a brilliant final sprint or a single difficult client from defining the cycle. Calibration across leads catches the reviewer whose ratings say more about them than about the people they rated. And because the result decides who makes the partner track, defensible guardrails are also how a boutique keeps the senior talent it can least afford to lose: people recognise an opaque or unfair verdict for what it is, and they leave over it. Build the guardrails light enough to run between billable hours, or they won't run at all.
A quick self-check: how bias-resistant is your review process?
Score one point per "yes". Six or seven and your process genuinely contains bias; five sits in between; four or fewer and the rating is still mostly the rater.
- What "good" looks like at each level is written down in observable behaviours before the cycle starts.
- Reviewers gather concrete examples first and score against the shared bar second — not the other way round.
- The rating draws on the whole period (running notes), not just the last engagement.
- More than one lead/peer input feeds the rating, weighted by who actually saw the work.
- Ratings are calibrated across reviewers to catch the lenient, harsh, and idiosyncratic ones.
- The written review is checked for vague or loaded language (AI can flag; a human rewrites).
- After the cycle you audit rating patterns by reviewer, level, and group — and act on what you find.
Scored four or fewer? Book a call and we'll show you where bias is leaking into your ratings and what to fix first.
FAQ
Can you completely remove bias from performance reviews?
No — and any process that claims to should make you suspicious. Bias is built into human judgment; the realistic, honest goal is to reduce and contain it so the rating reflects the work rather than the rater. Remember that most of a single score is the rater, not the person (about 62% rater idiosyncrasy vs ~21% performance; Scullen, Mount & Goff, 2000), so the lever is structure — criteria, evidence, multiple inputs, calibration, audit — not a promise of objectivity.
Doesn't unconscious-bias training fix this?
Mostly not. Training people to "be objective" barely moves ratings, and framing the firm as a pure meritocracy can actually increase biased decisions (Castilla & Benard, 2010). Unconscious bias is hard and costly to train away, so the reliable fix is to redesign the process rather than the person (Bohnet, What Works, 2016). Use training to explain the guardrails — not as the guardrail itself.
What's the single biggest bias in project-based reviews?
Usually recency combined with the idiosyncratic rater effect: the last engagement dominates (Steiner & Rain, 1989) and a big share of the score is just the particular lead who wrote it (Scullen, Mount & Goff, 2000). Running notes across the whole period and inputs from more than one lead, weighted by proximity, are the most direct counters.
Is it safe to use AI to remove bias from reviews?
Only as a human-in-the-loop assistant on the wording, never as the judge. AI is useful for flagging vague adjectives, loaded language, and inconsistent tone across a batch of reviews so a person can rewrite them against the evidence. Letting a model assign the rating doesn't remove bias — it hides it behind fluent text and removes accountability. Keep the judgment with the people who own it.
How do we know if our de-biasing is actually working?
Audit the outputs. Look at rating distributions by reviewer, level, and group after each cycle; if the same names always run high or low, or one group is consistently rated down, the guardrails aren't holding (CIPD). And keep it developmental — a cleaner rating that leaves people worse off isn't progress, since over a third of feedback interventions reduce performance when handled badly (Kluger & DeNisi, 1996).

