Domain 4: Prompt Engineering & Structured Output (20%)Lesson 23 of 30

4.5 Batch Processing Strategies

4.5.1 Processing Thousands of Documents Cheaply

Lessons 4.3 and 4.4 handled ONE document at a time. But many real tasks involve VOLUME — extract data from 10,000 invoices, run an evaluation over thousands of test cases, generate an overnight report across a whole codebase. Sending those one by one through the normal real-time API works, but it's expensive and you're paying a premium for speed you may not need. Task Statement 4.5 is about the Message Batches API: a way to process large volumes at much lower cost, and — just as importantly — knowing when it's the wrong choice.

The trade-off is simple and worth internalizing: the Message Batches API costs 50% LESS, in exchange for being slower and asynchronous. You submit a whole batch of requests, they're processed in the background, and you collect the results later. Think of it like standard shipping versus overnight courier — standard is far cheaper, and perfectly fine for anything that isn't urgent. You'd never pay courier prices to send yourself a document you don't need until next week.

So the whole lesson reduces to one judgment: is this workload URGENT (someone's waiting) or LATENCY-TOLERANT (it can finish whenever)? Get that right and batch saves you half your cost on the right jobs; get it wrong and you either overpay or leave a developer waiting hours. Let's nail down the exact numbers and the matching rule.

Real-time vs. Message Batches APIReal-time (synchronous)immediate, full priceovernight courierfor BLOCKING workflowsBatches API50% cheaper, ≤24h windowstandard shippingfor LATENCY-TOLERANT jobs

The Message Batches API trades speed for cost: 50% cheaper, asynchronous, with up to a 24-hour window. Use it for latency-tolerant volume work, not for anything a person is waiting on.

ℹ️

The one idea to hold onto

The Message Batches API processes large volumes at 50% lower cost, asynchronously. The whole decision is whether the workload is latency-tolerant (use batch) or someone is waiting on it (use real-time).

4.5.2 The Numbers That Decide

Four facts about the Batch API are exam-tested, and they all bear on the latency trade-off. First, the cost: 50% savings versus the synchronous API. Second, the timing: most batches finish in under an hour, BUT there is NO guaranteed latency SLA, and the window can be up to 24 HOURS — a batch even EXPIRES if it doesn't complete within 24 hours. That 'up to 24 hours, no guarantee' is the crux: you cannot promise a batch result by any particular moment.

Third, a structural limitation that surprises people: the Batch API does NOT support multi-turn tool calling within a single request. You can't have a batched request execute a tool mid-way and feed the result back — batch requests are one-shot. So an agentic loop (Domain 1) cannot run inside a batch. Fourth, each request carries a custom_id, a label YOU assign, and the results come back tagged with it so you can correlate each response to the request that produced it (results don't arrive in guaranteed order).

FactDetail
Cost50% cheaper than the synchronous API
TimingMost <1h, but up to 24h; NO latency SLA; expires after 24h
Multi-turn tool callingNOT supported within a single batched request
custom_idYou assign it; correlates each response to its request

Four Batch API facts. The 24-hour-no-SLA window rules out blocking work; the no-multi-turn-tools limit rules out agentic loops; custom_id is how you match results to requests.

4.5.2 — Key Concept

The Batch API: 50% cheaper, up to a 24-hour window with NO latency SLA (expires after 24h), does NOT support multi-turn tool calling in a request, and uses custom_id to correlate each response to its request.

4.5.3 The Matching Rule: Blocking vs Latency-Tolerant

Now the decision the exam tests most. Because batch has no latency guarantee (up to 24h), the deciding factor is whether the workflow BLOCKS someone who is waiting on the result.

A BLOCKING workflow has a human or a process waiting on the output before it can proceed — a pre-merge check that gates a developer's merge, a real-time user request. These MUST use the synchronous (real-time) API, even at full price, because a 24-hour wait is unacceptable when someone's blocked. A LATENCY-TOLERANT workflow has nobody waiting — an overnight technical-debt report, a weekly audit, nightly test generation. These should use the Batch API to capture the 50% saving; it doesn't matter if they take three hours.

The classic exam scenario: a manager proposes moving BOTH a blocking pre-merge check AND an overnight report to batch for the cost savings. The right evaluation: batch the overnight report (nobody waits), keep the pre-merge check on the synchronous API (developers wait to merge). It's not all-or-nothing — you match each workflow to its latency tolerance. There's even a planning angle: if you must meet a 30-hour SLA with a 24-hour batch window, that leaves a 6-hour buffer, so submitting every 4–6 hours keeps you safe.

4.5.3 — Key Concept

Match the API to latency tolerance: synchronous (real-time) for BLOCKING workflows (pre-merge checks, user-facing requests where someone waits); Batch API for LATENCY-TOLERANT workflows (overnight/weekly reports) to capture the 50% saving. Evaluate each workflow separately — not all-or-nothing.

4.5.4 Running Batches Well

Once you've decided batch is right, two practices make it cost-effective. First, handle failures by custom_id: when some requests in a batch fail, don't resubmit the whole batch — use the custom_id to identify ONLY the failures and resubmit just those, with modifications (for example, chunk a document that exceeded the context limit). Targeted resubmission keeps cost down.

Second — and this is the highest-leverage habit — REFINE on a small sample BEFORE running the full batch. Take 5–10 representative documents, iterate your prompt until the first-pass success rate is high, THEN launch the big batch. The math is compelling: at a 90% first-pass rate, 1,000 documents need ~100 retries; at 60%, they need ~400. A little refinement up front saves a mountain of resubmission cost and time. Batching amplifies whatever quality your prompt has — so get the prompt right on a sample first.

ℹ️

4.5.4 — Key Concept

Handle batch failures by custom_id — resubmit ONLY the failed requests (with fixes like chunking oversized docs), not the whole batch. And refine your prompt on a 5–10 doc sample BEFORE the full batch: a higher first-pass rate dramatically cuts retries (90% vs 60% = ~100 vs ~400 retries on 1,000 docs).

4.5.5 The Exam Traps

The 4.5 traps test the blocking-vs-tolerant decision, the 'batch is fast' misconception, and the no-multi-turn-tools limit.

  • Batching a blocking workflow. ✗ Moving a pre-merge check to batch for the savings. ✓ Blocking work stays on the synchronous API; only latency-tolerant work goes to batch.
  • Assuming batch is fast. ✗ 'It usually finishes in an hour, so it's fine for blocking work.' ✓ There's NO SLA — it can take up to 24h; never rely on it being fast.
  • All-or-nothing thinking. ✗ Moving every workflow to batch (or none). ✓ Evaluate each: batch the tolerant ones, keep blocking ones real-time.
  • Multi-turn tools in a batch. ✗ Running an agentic tool-calling loop inside a batched request. ✓ Batch doesn't support multi-turn tool calling — use the synchronous API for that.
⚠️

4.5.5 — Exam Trap

✗ Batching blocking workflows (no SLA, up to 24h). ✗ Assuming batch is reliably fast. ✗ All-or-nothing moves. ✗ Multi-turn tool calling in a batch (unsupported). ✓ Batch only latency-tolerant volume work for the 50% saving; evaluate each workflow; correlate results by custom_id; refine on a sample first.

4.5.6 Put It Together: Choose and Run a Batch

You now know the Batch API's cost/timing trade-off, its limits, the blocking-vs-tolerant matching rule, and how to run batches cost-effectively. The exercise has you classify workloads and run a batch end-to-end.

4.5.6 — Build Exercise (45 min)

(1) Classify two workflows: a blocking pre-merge check (synchronous) and an overnight report (batch); justify each by latency tolerance. (2) Submit a batch of ~100 documents with a custom_id on each. (3) Handle failures by custom_id — resubmit only the failures, chunking any that exceeded the context limit. (4) Calculate submission frequency for a 30-hour SLA given the 24-hour window (≈ every 4–6h). (5) Refine your prompt on 5–10 sample docs first and compare the first-pass success rate (and resulting retry count) against running the batch unrefined.

Batch processing scales extraction to volume. The final lesson of Domain 4, 4.6, returns to QUALITY at scale — multi-instance and multi-pass review architectures that catch what a single pass misses.

ℹ️

Where this shows up on the exam

4.5 questions ask whether to use batch (latency-tolerant → yes; blocking → no), evaluate a 'move everything to batch' proposal (split by workflow), or recall the facts (50% cheaper, ≤24h no SLA, no multi-turn tools, custom_id).

Key Takeaways

  • The Message Batches API processes large volumes at 50% lower cost, asynchronously — the whole decision is latency tolerance.
  • Key facts: 50% cheaper; most batches <1h but up to a 24-hour window with NO latency SLA (expires after 24h); does NOT support multi-turn tool calling in a request; custom_id correlates each response to its request.
  • Matching rule: synchronous (real-time) API for BLOCKING workflows (pre-merge checks, user-facing requests); Batch API for LATENCY-TOLERANT work (overnight/weekly reports) to capture the 50% saving.
  • Evaluate each workflow separately — a 'move everything to batch' proposal should batch the tolerant jobs and keep blocking ones real-time; it's not all-or-nothing.
  • Never assume batch is fast — there's no SLA; relying on it finishing quickly for blocking work is a mistake.
  • Handle failures by custom_id: resubmit only the failed requests with fixes (e.g. chunk oversized docs), not the whole batch.
  • Refine the prompt on a 5–10 doc sample BEFORE the full batch — a higher first-pass rate dramatically cuts retries (90% vs 60% ≈ 100 vs 400 retries on 1,000 docs).

Check Your Understanding

Test what you learned in this lesson.

Q1.Your team wants to cut costs by moving two workflows to the Message Batches API: (1) a blocking pre-merge check developers wait on before merging, and (2) an overnight technical-debt report. How should you evaluate this?

Q2.Which is a true limitation of the Message Batches API?

Q3.Why should you refine your prompt on a small sample before submitting a large batch?

Q4.Some requests in your 100-document batch fail. What's the most cost-effective way to handle them?

Practice This Lesson