How to combine feedback from leads and peers into a fair growth picture

Combining feedback from several leads and peers isn't about averaging their scores or reconciling a single rating — it's about turning multiple credible viewpoints into one honest, evidence-based picture of how a person is growing. Each lead and peer saw a different slice of the work, so where they disagree is usually signal, not noise: a clue about context, scope or recency you'd erase by splitting the difference. The method is the same whatever your firm's politics: gather concrete, example-based input against a shared bar, weight it by who actually worked with the person on what, investigate the disagreements instead of averaging them, reconcile the person's self-view against what others saw, and end in a forward-looking growth plan — kept separate from whatever rating the firm calibrates elsewhere.

The reason you can't lean on one lead is blunt: when researchers decomposed 4,492 managers' ratings, about 62% of the variance came from the rater's idiosyncrasies and only ~21% from actual performance (Scullen, Mount & Goff, Journal of Applied Psychology, 2000). Multiple inputs, read well, are how you get the picture back.

What does combining feedback into a fair growth picture actually mean?

It means synthesis, not arithmetic. The output you want is a short, honest story — recurring strengths, real development areas, and a concrete next step — built from several people who each saw part of the year. It is explicitly not a reconciled number. Your firm may still calibrate a grade somewhere (that has its place), but the growth picture is a different conversation with a different purpose: it looks forward and is owned by the person, not filed against them.

The distinction matters because of what feedback does when you get it wrong. Across 607 effect sizes, feedback raised performance on average, but over a third of interventions actually made performance worse (Kluger & DeNisi, Psychological Bulletin, 1996), and after a century of research there is little consistent evidence that appraisal on its own improves anything (DeNisi & Murphy, Journal of Applied Psychology, 2017). A pile of averaged scores is appraisal. A picture that points somewhere is development.

Why isn't one lead's view enough?

Because a single rater is a noisy instrument, however senior. A supervisor's rating of overall performance has an interrater reliability of only about .52 — two competent raters of the same person agree only moderately (Viswesvaran, Ones & Schmidt, Journal of Applied Psychology, 1996). That isn't incompetence; it's the nature of judging knowledge work from one vantage point. And much of what does land in a single score is the rater, not the ratee: the 62% rater-idiosyncrasy finding (Scullen, Mount & Goff, 2000) is the empirical case for never letting one lead stand in for the whole picture.

There's a deeper point. Appraisal is a social, contextual process, not a neutral measurement (Murphy & Cleveland, Understanding Performance Appraisal, 1995). Different leads, on different engagements, under different client pressure, legitimately see different people. That's exactly why you gather several — and why the disagreements between them are worth reading rather than smoothing over.

Which inputs do you gather, and from whom?

Pick contributors by proximity to the real work, not by the org chart. For a typical cycle that means three kinds of input:

  1. The person's self-assessment. Their own account of what they worked on and what they're trying to grow. Treat it as evidence to reconcile, not as the answer — self-views diverge from how others see the same work (more on that below).
  2. The staffing leads who actually ran their engagements. The people who set the brief, watched delivery and dealt with the client. Weight them by how much of the cycle they really covered.
  3. The peers who worked alongside them. Collaboration, reliability, how they make the team better — things a lead above the work often can't see. Choose peers who shared real work, not friendly bystanders.

Note who is missing or thin. A two-week overlap or a second-hand opinion is a data point with a big error bar: label it, and don't let it count like a lead who ran 80% of the cycle.
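One way to make that weighting explicit is to record, for each contributor, how much of the cycle they covered and whether they saw the work first-hand. The sketch below is illustrative only: the `Contributor` fields, the 0.5 discount for second-hand input and the 10% "thin" threshold are assumptions, not figures from the research cited in this article.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    cycle_share: float  # fraction of the review cycle worked together, 0.0 to 1.0
    first_hand: bool    # False if the opinion is second-hand


# Hypothetical settings: tune them to your own firm.
SECOND_HAND_DISCOUNT = 0.5
THIN_THRESHOLD = 0.10


def weigh(contributors):
    """Return (name, normalised weight, is_thin) for each contributor."""
    raw = [
        c.cycle_share * (1.0 if c.first_hand else SECOND_HAND_DISCOUNT)
        for c in contributors
    ]
    total = sum(raw)
    if total == 0:
        raise ValueError("no contributor covered any of the cycle")
    return [
        (c.name, r / total, r < THIN_THRESHOLD)
        for c, r in zip(contributors, raw)
    ]


inputs = [
    Contributor("Lead A", 0.80, True),   # ran most of the cycle
    Contributor("Lead B", 0.15, True),
    Contributor("Peer C", 0.04, True),   # roughly a two-week overlap
    Contributor("Peer D", 0.10, False),  # second-hand opinion
]

for name, weight, thin in weigh(inputs):
    flag = " (thin: label it)" if thin else ""
    print(f"{name}: {weight:.0%}{flag}")
```

The weights here are a reading aid for how much confidence each input deserves, not coefficients for averaging scores.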

How do you collect feedback that's actually comparable?

Comparable inputs need a common yardstick and concrete examples — gather first, judge second. Two rules do most of the work:

  1. Rate against a shared bar. When a written competency framework spells out, in observable behaviours, what "good" looks like at each level (CIPD, Competence and competency frameworks), a "strong on ownership" from one lead means roughly what it means from another. Without it, you're stitching together different yardsticks and calling the seam a picture.
  2. Ask for evidence, not adjectives. Every input should be anchored to a specific example tied to the framework — "renegotiated the scope on the X engagement when the client moved the goalposts," not "great under pressure." Examples are what let you compare, weight and, later, investigate a disagreement. Adjectives just average.

Collect the raw inputs before anyone sees a summary or a proposed grade. The moment a number is on the table, contributors anchor to it and you lose the independence that made multiple inputs worth gathering.

What do you do when leads and peers disagree?

Treat a split view as a question to investigate, not an average to compute. Disagreement usually encodes information: one lead saw a turnaround engagement and another saw routine delivery; one worked with the person in month two and another in month ten; one is rating outcomes and another is rating behaviours. Averaging a 4 and a 2 into a 3 throws all of that away and describes no one.

So when inputs diverge, ask three things:

  1. What did each person actually see (scope, role, how much of the cycle)?
  2. When: is one view simply more recent?
  3. Against what: are they even rating the same behaviour on the shared bar?

Often the "disagreement" dissolves into "both true, different contexts," which is far more useful in a growth conversation than a blended number.
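To see why the average hides the story, the sketch below groups ratings by competency and flags any competency where the ratings spread by two points or more, listing each rater's context instead of blending the numbers. The 1-to-5 scale, the two-point threshold and the field layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Each input: (rater, competency, rating on an assumed 1-5 scale, context note)
ratings = [
    ("Lead A", "ownership", 4, "turnaround engagement, month 10"),
    ("Lead B", "ownership", 2, "routine delivery, month 2"),
    ("Lead A", "client communication", 4, "turnaround engagement, month 10"),
    ("Peer C", "client communication", 4, "shared workstream, months 6-9"),
]

SPREAD_TO_INVESTIGATE = 2  # hypothetical threshold


def find_divergences(rows, threshold=SPREAD_TO_INVESTIGATE):
    """Return competencies whose ratings differ by at least `threshold`."""
    by_competency = defaultdict(list)
    for rater, competency, rating, context in rows:
        by_competency[competency].append((rater, rating, context))

    flagged = {}
    for competency, entries in by_competency.items():
        scores = [rating for _, rating, _ in entries]
        if max(scores) - min(scores) >= threshold:
            flagged[competency] = entries
    return flagged


for competency, entries in find_divergences(ratings).items():
    scores = [rating for _, rating, _ in entries]
    print(f"{competency}: average {mean(scores)} hides a split")
    for rater, rating, context in entries:
        print(f"  {rater} gave {rating} ({context})")
```

The output of a check like this is a list of questions to take back to the raters, not a corrected score.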

Self-versus-others deserves its own care, because the gap is systematic, not occasional: self and supervisor ratings of the same person correlate only about .22 (Heidemeier & Moser, Journal of Applied Psychology, 2009). When someone's self-view diverges from what leads and peers saw, close the gap with evidence and examples, not authority — "here's what three people independently observed" lands very differently from "I disagree."

How do you turn the inputs into a growth picture, not a score?

Synthesise, then point forward. Read across all the weighted inputs for themes — strengths that show up under more than one lead, development areas that recur across engagements — and separate a genuine pattern from a one-off tied to a single hard project. Then write the short narrative and, crucially, end in a concrete plan: two or three development priorities, what support or staffing makes them possible, and a date to revisit.
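Separating a pattern from a one-off can be as simple as counting how many distinct engagements a theme shows up on. A minimal sketch, assuming each input has already been tagged with a theme by whoever writes the synthesis; the tags, names and the two-engagement rule are hypothetical.

```python
from collections import defaultdict

# (source, engagement, theme, kind), tagged by hand during the synthesis
observations = [
    ("Lead A", "Engagement X", "scoping with clients", "strength"),
    ("Lead B", "Engagement Y", "scoping with clients", "strength"),
    ("Peer C", "Engagement X", "late hand-offs", "development"),
    ("Peer D", "Engagement Y", "late hand-offs", "development"),
    ("Lead B", "Engagement Y", "slide quality", "development"),
]


def classify_themes(rows, min_engagements=2):
    """Split themes into recurring patterns and one-offs by engagement count."""
    engagements = defaultdict(set)
    for _source, engagement, theme, kind in rows:
        engagements[(theme, kind)].add(engagement)

    patterns, one_offs = [], []
    for (theme, kind), seen_on in engagements.items():
        target = patterns if len(seen_on) >= min_engagements else one_offs
        target.append((theme, kind))
    return patterns, one_offs


patterns, one_offs = classify_themes(observations)
print("Patterns:", patterns)
print("One-offs:", one_offs)
```

A count like this only sorts the evidence; deciding which patterns become the two or three development priorities is still a judgment call.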

That ending is not decoration — it's the part that works. Improvement after multisource feedback is generally small and shows up mainly when feedback is followed by coaching and goal-setting, and the effect differs sharply by source (Smither, London & Reilly, Personnel Psychology, 2005). The picture earns its keep by changing what the person does next, not by scoring what they already did.

Figure: Whose feedback moves performance most? Mean unweighted effect size of performance improvement after multisource feedback, by rater source:

  • Direct reports: 0.24
  • Supervisors (leads): 0.14
  • Peers: 0.12
  • Self: 0.00

Source: Smither, London & Reilly, Personnel Psychology, 2005. Effects are small overall and larger when feedback is followed by coaching and goal-setting.

Figure: From many viewpoints to one growth picture. Self-assessment, staffing leads and peers are each weighted by proximity to the work; divergence is investigated (what, when, against what); the inputs are then synthesised into one growth picture (themes, forward plan, revisit date), kept apart from any calibrated rating, which uses the same evidence in a separate conversation.

How is this different from calibrating ratings?

Same raw evidence, opposite purpose. Calibration is an evaluation step: leads who rated different people reconcile their numbers against a shared bar so a grade means the same thing across the firm, because that grade feeds pay, promotion and the partner track. The growth picture is a development step: it synthesises the same inputs into a forward plan the person owns.

Keep them in separate conversations. Tie the growth talk to the grade in the same meeting and candour collapses — no one volunteers a weakness that will lower their score, and the development conversation quietly becomes a negotiation. High-quality firms run both off the same evidence but keep the rating and the growth picture apart, so people can hear hard feedback without it instantly threatening their number.

Why does this matter for agencies and consulting boutiques?

Because in a project-based firm, the manager who watched someone all year doesn't exist. People are staffed across several engagements under different leads, and they work alongside different peers each cycle, so no single reviewer ever saw the whole thing — and the person writing it up often wasn't beside them for most of the work. The 62% rater-idiosyncrasy finding (Scullen, Mount & Goff, 2000) isn't abstract here; it's the difference between a defensible promotion case and a political one.

So combining feedback fairly is the core skill, not a nicety. A shared bar lets different leads mean the same thing. Weighting by proximity stops a brief cameo from outvoting the lead who carried the engagement. Investigating divergence turns "my leads disagree" into real signal about where someone thrives. And because the result decides who makes the partner track, an evidence-based growth picture is what makes the call fair — and what keeps the senior talent a boutique can least afford to lose, since an opaque verdict reads as an unfair one. Build it light enough to run without burning the billable hours that pay for it.

A quick self-check: is your combined picture fair?

Score one point per "yes". Six or seven and your synthesis is genuinely fair; four or fewer and you're averaging where you should be investigating.

  • Inputs come from the self, the staffing leads and the peers who actually did the work — chosen by proximity, not the org chart.
  • Everyone rates against the same written bar of what "good" looks like at each level.
  • Every input is anchored to a concrete example, not an adjective.
  • Raw inputs are collected before any score or summary is shared.
  • Disagreements are investigated (what / when / against what), not averaged.
  • Thin or second-hand input is flagged and down-weighted, not counted equally.
  • The picture ends in a forward-looking growth plan with a revisit date — kept separate from any calibrated rating.
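The scoring rule can be written down directly. A small sketch, with the checklist items paraphrased; note that the rule above gives no verdict for a score of exactly five, so the "borderline" label is an assumption added here.

```python
CHECKLIST = [
    "Inputs chosen by proximity to the work",
    "Everyone rates against the same written bar",
    "Every input anchored to a concrete example",
    "Raw inputs collected before any score or summary is shared",
    "Disagreements investigated, not averaged",
    "Thin or second-hand input flagged and down-weighted",
    "Picture ends in a growth plan with a revisit date",
]


def self_check(answers):
    """Score one point per True answer and return (score, verdict)."""
    if len(answers) != len(CHECKLIST):
        raise ValueError(f"expected {len(CHECKLIST)} answers")
    score = sum(bool(a) for a in answers)
    if score >= 6:
        verdict = "genuinely fair"
    elif score <= 4:
        verdict = "averaging where you should be investigating"
    else:
        # Exactly five is not covered by the rule; this label is an assumption.
        verdict = "borderline"
    return score, verdict


score, verdict = self_check([True, True, True, True, True, False, True])
print(f"{score}/7: {verdict}")
```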

FAQ

Should I just average the scores from each lead and peer?

No. Averaging treats disagreement as error, but much of it is signal — different leads saw different engagements, scopes and moments. A supervisor's single rating is only moderately reliable to begin with (.52; Viswesvaran, Ones & Schmidt, 1996), and most of what a single score captures is the rater, not the person (Scullen, Mount & Goff, 2000). Investigate the splits and synthesise; don't blend them into a number that describes no one.

Whose feedback should count most?

Weight by proximity to the actual work and by how much of the cycle each person covered — a lead who ran most of someone's engagements outweighs a peer who overlapped for two weeks. There's no universal ranking by job title; the question is who saw the behaviour you're assessing. Multisource research shows the effect of feedback differs by source (Smither, London & Reilly, 2005), so read each input for what it's good at rather than treating all sources as equal.

What if the person's self-assessment doesn't match what their leads say?

Expect some gap — self and supervisor ratings correlate only about .22 (Heidemeier & Moser, 2009). Close it with evidence, not authority: lay the concrete examples side by side and let the pattern speak. A divergence is a conversation about what each side saw, not proof that one of them is wrong.

Is the growth picture the same as the performance rating?

No, and keeping them separate is the point. The rating is evaluation — a calibrated number that feeds pay and the partner track. The growth picture is development — a forward plan the person owns. Tie them together in one meeting and candour disappears, because admitting a weakness now costs you a grade (the backdrop to why feedback so often fails to help: Kluger & DeNisi, 1996).

How do we do this without eating billable hours?

Make the heavy parts one-time and the recurring parts light. Build the shared bar and the input templates once; each cycle is then mostly collecting short, example-based inputs asynchronously around client work, plus a focused session to investigate divergences. The cost is in chasing vague adjectives — a clear bar and concrete examples make the synthesis fast.

About us

Both ex-McKinsey, we bring the best practices of people growth to the agency world, building simple, lovable people systems without the corporate HR heritage.

Pauline Bertry

Product Growth · CX Design

10+ years leading product & design teams. Built from scratch and led Design Hubs at McKinsey Moscow and Budapest. Created career frameworks and growth systems tested with 100+ person cross-functional product teams.

Alexey Lobachev

People Strategy · Engagement

15+ years inside top professional services organisations. At McKinsey led Employee Engagement and Talent Management programmes, including competencies review, region-wide DEI transformation, and a Top Talent Retention Program.

Sources

  1. Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85(6), 956–970. Across 4,492 managers, ~62% of rating variance traced to idiosyncratic rater effects and only ~21% to actual performance.
  2. Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81(5), 557–574. Interrater reliability of supervisory ratings of overall job performance ≈ .52 (intra-rater reliability >.80).
  3. Smither, J. W., London, M., & Reilly, R. R. (2005). Does performance improve following multisource feedback? Personnel Psychology, 58, 33–66. Improvement generally small and conditional; mean unweighted effect sizes by source: direct reports .24, supervisors .14, peers .12, self .00; larger when followed by coaching and goal-setting.
  4. Heidemeier, H., & Moser, K. (2009). Self–other agreement in job performance ratings: A meta-analytic test of a process model. Journal of Applied Psychology, 94(2), 353–370. Self and supervisor ratings of the same person correlate only about .22 (corrected ρ ≈ .34; k = 115, n = 37,752).
  5. Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance. Psychological Bulletin, 119(2), 254–284. Across 607 effect sizes, feedback raised performance on average (d = .41), but over one-third of interventions decreased it.
  6. DeNisi, A. S., & Murphy, K. R. (2017). Performance appraisal and performance management: 100 years of progress? Journal of Applied Psychology, 102(3), 421–433. Little consistent evidence that appraisal on its own improves performance.
  7. Murphy, K. R., & Cleveland, J. N. (1995). Understanding Performance Appraisal: Social, Organizational, and Goal-Based Perspectives. Sage Publications. Appraisal is a social, goal-driven process, so divergence between raters carries meaning about context, not just error.
  8. CIPD. Competence and competency frameworks (factsheet). A competency framework sets out the behaviours valued and recognised at each level: a behavioural "map" for roles. cipd.org