4.6 Multi-Instance & Multi-Pass Review

4.6.1 Who Watches the Watcher?

Task Statement 4.6 — the last in Domain 4 — is about REVIEWING output for quality, and it brings together threads from across the course: the independent-review idea from CI (3.6) and the attention-dilution idea from task decomposition (1.6). The central question is deceptively simple: if Claude generates something, how do you reliably check it's good?

The tempting answer is 'just ask the same Claude to review its own work.' Don't — and the reason is human-familiar. When you write an essay and immediately proofread it, you read what you MEANT to say, not what's on the page; your mind fills the gaps. A model that just generated code is in the same position: it carries its generation REASONING in context, so when it reviews, it tends to CONFIRM its earlier decisions rather than challenge them. Self-review is structurally biased toward approval.

This lesson covers two techniques that overcome that bias: using an INDEPENDENT instance to review (so there's no generation reasoning to confirm), and MULTI-PASS review (so a large review doesn't suffer attention dilution). Both are about getting an honest, thorough check rather than a rubber stamp.

A session that generated code confirms its own decisions when reviewing (like proofreading your own essay). An independent instance with no prior reasoning reviews critically and catches more.

ℹ️

The one idea to hold onto

A model reviewing its OWN work in the same session tends to confirm its generation reasoning rather than challenge it. Use an INDEPENDENT instance (no prior context) for an honest review, and multi-pass review to avoid attention dilution on large reviews.

4.6.2 Independent Review Beats Self-Review

The first technique: when you want code (or any output) reviewed, use a SEPARATE Claude instance with no prior context — not the session that generated it. The independent reviewer hasn't talked itself into believing the code is correct; it approaches the work cold, the way a second engineer reviewing your PR would, and is far more likely to catch subtle issues the author glossed over.

Notice what does NOT achieve this. Telling the same session 'now review your work critically' doesn't remove the generation reasoning — it's still in context, still biasing toward confirmation. Even extended thinking in the same session doesn't escape the bias, because the reasoning that needs challenging is the very reasoning still sitting in context. The ONLY clean fix is a genuinely independent instance with no prior context. (This is the same principle as the CI independent-review in Lesson 3.6 — there, a separate claude -p invocation; here, a fresh instance — same idea, applied to review quality.)

⭐

4.6.2 — Key Concept

Use an INDEPENDENT Claude instance (no prior context) to review generated output — it has no generation reasoning to confirm and catches more. Self-review instructions ('review critically') or extended thinking in the SAME session don't remove the confirmation bias, because the biasing reasoning is still in context.

4.6.3 Multi-Pass Review for Large Reviews

The second technique tackles SIZE. Recall attention dilution from Lesson 1.6: when one pass processes many items, quality thins out — thorough early, superficial late, sometimes contradictory (flagging a pattern in one file, approving identical code in another). A large code review is exactly this situation, so it suffers the same dilution.

The fix is the same multi-pass approach: split the review into per-file LOCAL passes (analyze each file on its own, so each gets full attention) plus a separate cross-file INTEGRATION pass (examine how the files interact — data flow, shared assumptions). Per-file passes give consistent depth; the integration pass catches cross-cutting issues no single-file view would. And the crucial reminder, tested here just as in 1.6: a bigger context window does NOT fix this. The problem is attention QUALITY spread across items, not capacity — only splitting the work restores depth.

This is the third time the multi-pass principle has appeared (1.6 decomposition, then implied in 3.6 CI review, now here) — a signal of how central it is to the exam. Whenever you see inconsistent quality across many items in one pass, the answer is structural: split into per-item plus integration passes, not a bigger model or window.

⭐

4.6.3 — Key Concept

For large multi-file reviews, split into per-file LOCAL passes (consistent depth) plus a separate cross-file INTEGRATION pass (cross-cutting issues) to avoid attention dilution. A bigger context window does NOT fix it — the problem is attention quality, not capacity.

4.6.4 Confidence Calibration for Routing

A third practice helps decide WHERE review effort goes. You can have the model self-report a confidence score alongside each finding, then ROUTE based on it — high-confidence findings go straight to developers, low-confidence ones to human review. This concentrates scarce human attention where it's most needed.

But there's the now-familiar caveat (from 4.1, and revisited in 5.5): raw model self-confidence is POORLY CALIBRATED — a reported 0.95 might mean 88% accurate in one context and 99% in another. So you can't use the raw score directly. You CALIBRATE it against a LABELED validation set: measure what the model's confidence levels actually correspond to in reality, then set your routing thresholds based on that real mapping. Confidence routing is useful — but only after calibration turns a vibe into a measured probability.

ℹ️

4.6.4 — Key Concept

Confidence-based routing (high → developers, low → human review) concentrates review effort, but raw self-reported confidence is poorly calibrated — calibrate it against a labeled validation set before setting thresholds (revisited in Domain 5.5).

4.6.5 The Exam Traps

The 4.6 traps test the self-vs-independent review distinction, the multi-pass fix for big reviews, and uncalibrated confidence. They echo 1.6 and 3.6, so the patterns should feel familiar.

Symptom	Wrong fix	Right fix
Want code reviewed well	Ask the generating session to self-review	A separate independent instance, no prior context
Self-review misses issues	Tell it to 'review more critically'	Independent instance — instructions don't remove the bias
14-file review inconsistent/contradictory	Bigger model or context window	Per-file passes + integration pass
Routing review by confidence	Use raw self-confidence directly	Calibrate against a labeled validation set first

The recurring distractors: same-session self-review, 'review more critically', a bigger window, and raw uncalibrated confidence. Each has a structural correct answer.

⚠️

4.6.5 — Exam Trap

✗ Same-session self-review (or 'review critically' / extended thinking in-session) — the generation reasoning still biases it. ✗ A bigger model/window for an inconsistent large review (it's attention dilution). ✗ Raw self-confidence for routing. ✓ Independent instance, per-file + integration passes, and calibrated confidence.

4.6.6 Put It Together: Review at Scale

You now have the three review techniques: independent instances, multi-pass review, and calibrated confidence routing. The exercise contrasts self-review with independent review and reproduces the attention dilution fix one more time so it's locked in.

✨

4.6.6 — Build Exercise (30 min)

(1) Generate some code, then review it two ways — in the same session, and via a fresh independent instance with no prior context — and compare which catches more issues. (2) Take a 14-file review, run it as one pass (note the inconsistency and contradictions), then restructure into per-file passes plus a cross-file integration pass and compare. (3) Have the model emit a confidence score per finding; build a small labeled validation set, calibrate the scores, and set routing thresholds (high → auto-accept, low → human) based on the calibrated mapping rather than the raw numbers.

That completes Domain 4 — explicit criteria (4.1), few-shot (4.2), structured output (4.3), validation-retry (4.4), batch processing (4.5), and review architectures (4.6). The final domain, Domain 5, ties everything together around CONTEXT and RELIABILITY — managing long conversations, escalation, error propagation, large-codebase context, human review, and provenance.

ℹ️

Where this shows up on the exam

4.6 questions ask how to review reliably (independent instance, not self-review), how to fix an inconsistent large review (multi-pass, not a bigger window), and how to route by confidence (calibrate first). These echo 1.6 and 3.6 — the multi-pass and independent-review patterns recur.

Key Takeaways

✓A model reviewing its OWN work in the same session tends to confirm its generation reasoning rather than challenge it — self-review is structurally biased toward approval.
✓Use an INDEPENDENT instance (no prior context) for review; it catches more because it has no generation reasoning to confirm — the same principle as CI independent review (3.6).
✓'Review more critically' instructions or extended thinking in the SAME session don't remove the bias, because the biasing reasoning is still in context — only a fresh instance does.
✓For large multi-file reviews, split into per-file LOCAL passes (consistent depth) plus a separate cross-file INTEGRATION pass (cross-cutting issues) to avoid attention dilution.
✓A bigger context window does NOT fix an inconsistent large review — it's an attention-quality problem (the 1.6 multi-pass principle), not capacity.
✓Confidence-based routing (high → developers, low → human) concentrates review effort, but raw self-confidence is poorly calibrated.
✓Calibrate confidence against a labeled validation set before setting routing thresholds (revisited in 5.5).

Check Your Understanding

Test what you learned in this lesson.

Q1.You want Claude to review code it just generated, and you want the review to catch real issues. What's the most effective approach?

Q2.A single-pass review of a 14-file PR gives inconsistent depth and even contradicts itself. What's the right restructuring?

Q3.Why doesn't telling the generating session to 'review your work critically' fix self-review bias?

Q4.You want to route review findings by the model's confidence (high → developers, low → human). What must you do before trusting the scores?

Practice This Lesson

4.5 Batch Processing Strategies

5.1 Managing Conversation Context