5.5 Human Review & Confidence Calibration

5.5.1 When Can You Trust the Automation?

Task Statement 5.5 is about a decision every production AI system eventually faces: how much human review do you need, and where should it go? Reduce human review too aggressively and errors slip into production; keep too much and you've automated nothing. The skill is designing review workflows that put scarce human attention exactly where it's needed — and the lesson opens with a trap that makes systems look far safer than they are.

That trap is the AGGREGATE-METRICS trap. Your extraction system reports '97% overall accuracy,' and that sounds great — surely safe to automate most of it. But the aggregate HIDES the distribution. Break that 97% down by document type and you might find: invoices 99.5%, handwritten forms 60%, scanned PDFs 72%, international documents 45%. The headline number is a weighted average dominated by the easy, common cases, while whole categories are failing badly. It's like saying a hospital has a '97% survival rate' while one ward is at 45% — the average conceals the disaster. An aggregate metric can make a system look production-ready when specific segments are nowhere near it.
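The arithmetic behind the trap can be sketched in a few lines of Python. The per-type accuracies are the ones quoted above; the document counts are hypothetical, chosen only so that the weighted average lands near 97%.

```python
# Per-type accuracies come from the example above. The counts are
# hypothetical volumes, picked so the weighted average is close to 97%.
segments = {
    "invoices":      {"count": 9400, "accuracy": 0.995},
    "handwritten":   {"count": 200,  "accuracy": 0.60},
    "scanned_pdfs":  {"count": 300,  "accuracy": 0.72},
    "international": {"count": 100,  "accuracy": 0.45},
}

total = sum(s["count"] for s in segments.values())
correct = sum(s["count"] * s["accuracy"] for s in segments.values())
aggregate = correct / total  # volume-weighted average accuracy

print(f"aggregate accuracy: {aggregate:.1%}")
for name, s in segments.items():
    share = s["count"] / total
    print(f"{name:<14} share={share:.1%}  accuracy={s['accuracy']:.1%}")
```

Under these assumed volumes, invoices make up 94% of the traffic, so the headline number sits close to the invoice accuracy and barely moves when the three small segments fail.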

So this lesson is about looking BENEATH the aggregate — validating by segment, sampling intelligently, calibrating confidence, and routing review where uncertainty is highest. Let's take them in order.

A 97% aggregate accuracy hides a wide distribution by document type (handwritten 60%, international 45%). Validate by segment, never by aggregate alone, before reducing human review.

ℹ️

The one idea to hold onto

Aggregate accuracy metrics ('97% overall') hide poor performance on specific document types or fields. Validate by SEGMENT (document type and field), never aggregate alone, before reducing human review.

5.5.2 Validate by Segment, Sample by Stratum

The fix for the aggregate trap is to validate accuracy by DOCUMENT TYPE and FIELD SEGMENT before trusting automation. Don't ask 'is the system accurate?'; ask 'is it accurate on invoices? on handwritten forms? on the date field? on the total field?' Only when each segment meets your bar is it safe to reduce human review for that segment — and the failing segments (handwritten at 60%) keep their human review. Validation-before-automation means proving accuracy per segment first, not flipping the whole system to automatic on the strength of an average.
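A per-segment gate might look like the sketch below. The record layout (`doc_type`, `field`, `extracted`, `expected`), the function names, and the 0.98 bar are all illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """Return {(doc_type, field): accuracy} computed from labeled records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["doc_type"], r["field"])
        totals[key] += 1
        hits[key] += int(r["extracted"] == r["expected"])
    return {key: hits[key] / totals[key] for key in totals}

def review_policy(records, min_accuracy=0.98):
    """Decide per segment: automate only where that segment meets the bar."""
    return {
        key: ("automate" if acc >= min_accuracy else "keep_review")
        for key, acc in accuracy_by_segment(records).items()
    }

# Tiny hypothetical labeled sample, for illustration only.
records = [
    {"doc_type": "invoice", "field": "total", "extracted": "10.00", "expected": "10.00"},
    {"doc_type": "invoice", "field": "total", "extracted": "25.50", "expected": "25.50"},
    {"doc_type": "handwritten", "field": "date", "extracted": "2024-01-07", "expected": "2024-01-01"},
    {"doc_type": "handwritten", "field": "date", "extracted": "2024-02-03", "expected": "2024-02-03"},
]
policy = review_policy(records)
for segment, decision in policy.items():
    print(segment, decision)
```

The decision is keyed on the (type, field) pair, so a strong segment can move to automation while a weak one keeps its reviewers. A real gate would also want enough labeled items per segment for the accuracy estimate to be meaningful.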

How do you MEASURE ongoing accuracy without reviewing everything? Stratified random sampling. Divide the population into strata (by document type, say) and sample from each — and crucially, include the HIGH-CONFIDENCE items in your sampling, not just the low-confidence ones. Why sample the items the model is sure about? Because that's where NOVEL error patterns hide: a new failure mode might produce confidently-wrong outputs that you'd never catch if you only review low-confidence cases. Sampling each stratum, including the confident ones, surfaces problems an only-review-the-uncertain approach would miss.
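One way to sketch this sampling in Python is shown here. The strata are document type crossed with a confidence band; the 0.9 cutoff between bands, the item layout, and the function name are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(items, per_stratum, high_cutoff=0.9, seed=0):
    """Draw up to `per_stratum` random items from every (doc_type, band) stratum.

    High-confidence items form their own strata, so they are always sampled.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for item in items:
        band = "high" if item["confidence"] >= high_cutoff else "low"
        strata[(item["doc_type"], band)].append(item)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical population: (doc_type, reported confidence) pairs.
population = (
    [("invoice", 0.99)] * 20 + [("invoice", 0.55)] * 5
    + [("handwritten", 0.95)] * 5 + [("handwritten", 0.40)] * 10
)
items = [
    {"id": i, "doc_type": doc_type, "confidence": conf}
    for i, (doc_type, conf) in enumerate(population)
]
sample = stratified_sample(items, per_stratum=3)
print(len(sample), "items sampled for review")
```

Because the high-confidence bands are strata in their own right, they cannot be crowded out of the sample, which is the point of the technique: a confidently-wrong failure mode has a chance to surface.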

⭐

5.5.2 — Key Concept

Validate accuracy by document TYPE and FIELD segment before reducing human review (validation-before-automation). Use stratified random sampling that INCLUDES high-confidence items — that's where novel, confidently-wrong error patterns hide.

Practice	Why
Validate by type & field segment	Aggregate hides per-segment failures
Reduce review per segment, not globally	Failing segments (e.g. handwritten) keep review
Stratified random sampling	Measures each segment, not just the common one
Include high-confidence items in sampling	Catches novel confidently-wrong error patterns

Look beneath the aggregate: validate and sample by segment, and don't skip the high-confidence items — novel failures often look confident.

5.5.3 Calibrating Confidence and Routing Review

You can have the model output a CONFIDENCE score per field, then route review by it — but only after addressing the calibration problem you've now met several times (4.1, 4.6). Raw self-reported confidence is POORLY CALIBRATED: a reported 0.95 might correspond to 88% real accuracy in one context and 99% in another. So you can't read the raw number as a probability.

The fix is to CALIBRATE field-level confidence against a LABELED VALIDATION SET. You take items where you KNOW the right answer, see what the model's confidence levels actually correspond to in reality, and build a calibration curve mapping reported confidence to true accuracy. Now '0.9 calibrated' actually means ~90% accurate, and you can set routing thresholds that mean something. Calibration turns a vibe into a measured probability.
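A simple binned calibration curve can be sketched as follows. This is one common way to build such a curve, not necessarily the method the course has in mind; the function name, bin count, and sample data are assumptions. The sample is built so that a reported 0.95 corresponds to 88% observed accuracy, matching the figure used earlier.

```python
def calibration_curve(labeled, n_bins=5):
    """Bin labeled items by reported confidence and measure accuracy per bin.

    labeled: iterable of (reported_confidence, was_correct) pairs, where
    was_correct comes from comparing the output to a known right answer.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in labeled:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, correct))

    curve = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # no labeled data in this confidence range
        curve.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "n": len(members),
            "mean_reported": sum(c for c, _ in members) / len(members),
            "observed_accuracy": sum(1 for _, ok in members if ok) / len(members),
        })
    return curve

# Hypothetical labeled validation set.
labeled = (
    [(0.95, True)] * 22 + [(0.95, False)] * 3
    + [(0.65, True)] * 6 + [(0.65, False)] * 4
)
curve = calibration_curve(labeled)
for row in curve:
    print(row["bin"], row["n"], round(row["mean_reported"], 2), row["observed_accuracy"])
```

Routing thresholds are then set on `observed_accuracy`, not on the raw reported score. Bins with few labeled items give noisy estimates, so the validation set needs to cover the confidence ranges you plan to act on.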

With calibrated confidence, route to concentrate scarce human attention: send the HIGHEST-uncertainty items, along with ambiguous or contradictory source documents, to human review FIRST, as a dynamic priority queue, rather than distributing reviewers evenly across everything. Items with the lowest uncertainty, in segments where calibration has shown that high confidence really does mean high accuracy, can be auto-accepted. The principle: spend human review where the risk is, and a calibrated confidence score is what tells you where that is.
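Uncertainty-first routing can be sketched with Python's standard `heapq` module. The item fields (`calibrated_confidence`, `ambiguous_source`), the 0.98 auto-accept threshold, and the choice to rank ambiguous sources ahead of everything else are assumptions made for this example.

```python
import heapq

AUTO_ACCEPT_AT = 0.98  # hypothetical bar, applied to CALIBRATED confidence

def route(items):
    """Split items into auto-accepted ones and a review queue, most uncertain first."""
    auto_accepted, heap = [], []
    for order, item in enumerate(items):
        confident = item["calibrated_confidence"] >= AUTO_ACCEPT_AT
        if confident and not item["ambiguous_source"]:
            auto_accepted.append(item)
            continue
        # Smaller tuples pop first: ambiguous sources, then lowest confidence.
        # `order` is unique, so the item dicts themselves are never compared.
        priority = (0 if item["ambiguous_source"] else 1,
                    item["calibrated_confidence"], order)
        heapq.heappush(heap, (priority, item))
    review_queue = [heapq.heappop(heap)[1] for _ in range(len(heap))]
    return auto_accepted, review_queue

items = [
    {"id": "a", "calibrated_confidence": 0.99, "ambiguous_source": False},
    {"id": "b", "calibrated_confidence": 0.60, "ambiguous_source": False},
    {"id": "c", "calibrated_confidence": 0.99, "ambiguous_source": True},
    {"id": "d", "calibrated_confidence": 0.85, "ambiguous_source": False},
]
auto_accepted, review_queue = route(items)
print("auto:", [i["id"] for i in auto_accepted])
print("review order:", [i["id"] for i in review_queue])
```

Item `c` is reviewed first despite its high score because its source is flagged as ambiguous. In a long-running system the heap would stay live and new items would be pushed as they arrive, which is what makes the queue dynamic; here it is drained once to show the order.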

⭐

5.5.3 — Key Concept

Raw self-reported confidence is poorly calibrated — calibrate field-level confidence against a LABELED validation set to map reported confidence to true accuracy, then ROUTE highest-uncertainty / ambiguous items to human review first (a dynamic priority queue), not an even distribution.

5.5.4 The Exam Traps

The 5.5 traps test the aggregate trap, segment validation, calibration, and review prioritization. The signature scenario: '97% accuracy, so let's automate everything above 95% confidence.'

• Trusting the aggregate. ✗ Automating because overall accuracy is 97%. ✓ Validate by document type and field: segments may be failing (handwritten 60%, international 45%).
• Raw confidence for automation. ✗ Auto-accepting everything above 0.95 self-reported confidence. ✓ Calibrate against labeled data first: raw scores are poorly calibrated.
• Sampling only low-confidence items. ✗ Reviewing only what the model is unsure about. ✓ Stratified sampling that includes high-confidence items catches novel errors.
• Even reviewer distribution. ✗ Spreading reviewers evenly across all items. ✓ Route highest-uncertainty / ambiguous items first.

⚠️

5.5.4 — Exam Trap

For '97% accuracy, automate above 95% confidence': ✗ trusting the aggregate (validate by segment — some types fail badly); ✗ trusting raw confidence (calibrate against labeled data first). ✓ Segment validation, stratified sampling including high-confidence items, calibrated confidence, and uncertainty-first review routing.

5.5.5 Put It Together: Design a Review Workflow

You now know the aggregate trap, segment validation, stratified sampling, confidence calibration, and uncertainty-first routing. The exercise has you build a review workflow that's honest about where the system actually fails.

✨

5.5.5 — Build Exercise (30 min)

(1) Take a system reporting ~97% aggregate accuracy and break it down by document type and field; identify the segments that are actually failing. (2) Set up stratified random sampling that includes high-confidence items, and look for novel error patterns among the confidently-wrong. (3) Have the model output field-level confidence, build a small labeled validation set, and calibrate the scores into a reported-vs-true-accuracy curve. (4) Build a dynamic review queue that routes the highest-uncertainty and ambiguous/contradictory documents to humans first, auto-accepting only well-calibrated high-accuracy segments. (5) Confirm you validated per segment BEFORE reducing review anywhere.

Human review and calibration tell you when to trust automated output. The final lesson, 5.6, closes Domain 5 — and the course — with provenance: preserving where information came from when synthesizing across many sources.

ℹ️

Where this shows up on the exam

5.5 questions feature a high aggregate accuracy and a proposal to automate. The risks: the aggregate hides per-segment failures, and raw confidence needs calibration. Validate by segment, sample including high-confidence items, calibrate, and route by uncertainty.

Key Takeaways

✓ Aggregate accuracy ('97% overall') hides poor performance on specific document types or fields: validate by SEGMENT (type and field), never aggregate alone, before reducing human review.
✓ Validation-before-automation: prove accuracy per segment first; failing segments (e.g. handwritten 60%) keep their human review while strong segments can be automated.
✓ Use stratified random sampling that INCLUDES high-confidence items: novel, confidently-wrong error patterns hide there and an only-review-the-uncertain approach misses them.
✓ Raw self-reported confidence is poorly calibrated (a 0.95 may mean 88% in one context, 99% in another), so you can't read it as a probability directly.
✓ Calibrate field-level confidence against a LABELED validation set to map reported confidence to true accuracy before setting routing thresholds.
✓ Route the highest-uncertainty and ambiguous/contradictory items to human review FIRST (a dynamic priority queue), not an even distribution; spend review where the risk is.
✓ This calibration theme recurs across the exam (4.1 confidence filtering, 4.6 review routing): raw self-confidence always needs calibration.

Check Your Understanding

Test what you learned in this lesson.

Q1. Your extraction system reports 97% overall accuracy, and someone proposes automating everything with >95% confidence. What's the primary risk to flag?

Q2. When sampling extractions to measure ongoing accuracy, why include HIGH-confidence items in the sample?

Q3. You want to route review by the model's confidence scores. What must you do before the scores are trustworthy?

Q4. With limited human reviewers, how should you prioritize which extractions they review?
