How to automate document processing
With the right architecture, modern LLMs can extract data from unstructured documents at high accuracy (99%+ is a realistic steady-state target, not a day-one guarantee). This guide covers the extraction pipeline: ingest → AI extract → confidence-score → human review on edge cases → push to system of record.
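The five stages above can be sketched as one function. This is a minimal sketch, not a reference implementation: every helper name here (ingest, ai_extract, human_review) is a hypothetical stand-in for the components built in the steps below.

```python
def ingest(raw: str) -> str:
    # Stand-in for OCR / text extraction.
    return raw.strip()

def ai_extract(text: str) -> dict:
    # Stand-in for the LLM call; returns fields plus per-field confidence.
    return {"fields": {"total": "120.00"}, "confidence": {"total": 99}}

def human_review(extraction: dict) -> dict:
    # Stand-in for the review queue; a person confirms or corrects fields.
    extraction["reviewed"] = True
    return extraction

def process_document(raw: str, threshold: int = 95) -> dict:
    """Run one document through the pipeline; review only the edge cases."""
    extraction = ai_extract(ingest(raw))
    if min(extraction["confidence"].values()) < threshold:
        extraction = human_review(extraction)
    return extraction  # a real pipeline would post this to the ERP/CRM
```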
Steps
1. Identify the top document type to automate
Most valuable candidates: invoices (high volume, structured), tax docs (accuracy-critical), contracts (reasoning-heavy), receipts (high volume). Start with one type and perfect that pipeline before expanding.
2. Define the extraction schema
Write the JSON schema you want extracted. For invoices: vendor name, vendor address, invoice #, invoice date, due date, line items (description + quantity + unit price + total), subtotal, tax, total.
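For invoices, the schema above could be written in JSON Schema like this. The field names and required-field choices are suggestions, not a standard; adapt them to your documents.

```python
# Invoice extraction schema as JSON Schema. Nullable types let the model
# output null for missing fields instead of guessing.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": ["string", "null"]},
        "vendor_address": {"type": ["string", "null"]},
        "invoice_number": {"type": ["string", "null"]},
        "invoice_date": {"type": ["string", "null"], "description": "ISO 8601"},
        "due_date": {"type": ["string", "null"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "total": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price", "total"],
            },
        },
        "subtotal": {"type": ["number", "null"]},
        "tax": {"type": ["number", "null"]},
        "total": {"type": ["number", "null"]},
    },
    "required": ["vendor_name", "invoice_number", "total"],
}
```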
3. Build the extraction prompt
System prompt: 'You are a document data extractor. Output valid JSON matching this schema: [schema]. If a field isn't present, output null. Never invent values.' User prompt: 'Extract from this document: [doc text or image]'.
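A small helper can assemble these prompts so the schema stays the single source of truth. This is a sketch; the wording follows the step above, and `build_messages` is a hypothetical name, not a library function.

```python
import json

def build_messages(schema: dict, doc_text: str) -> tuple[str, list]:
    """Assemble the system prompt and user message for the extraction call."""
    system = (
        "You are a document data extractor. Output valid JSON matching "
        f"this schema: {json.dumps(schema)}. If a field isn't present, "
        "output null. Never invent values."
    )
    messages = [{"role": "user",
                 "content": f"Extract from this document:\n{doc_text}"}]
    return system, messages
```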
4. Add confidence scoring per field
Ask the LLM to output a confidence score (0-100) alongside each extracted field. Use a structured-output or tool-use feature (e.g. Claude's tool use) for guaranteed JSON validity.
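One way to shape this, assuming a tool-use style API that accepts a JSON Schema: wrap every field's value in an object that also carries its confidence, then drive routing off the weakest field. The schema fragment and helper below are a sketch, not a vendor API.

```python
# Per-field wrapper: the model returns {"value": ..., "confidence": N}
# for each field instead of a bare value.
FIELD_WITH_CONFIDENCE = {
    "type": "object",
    "properties": {
        "value": {},  # any JSON type, or null when absent
        "confidence": {"type": "integer", "minimum": 0, "maximum": 100},
    },
    "required": ["value", "confidence"],
}

def lowest_confidence(extraction: dict) -> int:
    """The smallest per-field confidence drives the routing decision."""
    return min(f["confidence"] for f in extraction.values())
```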
5. Route by confidence
If every field scores 95 or above: auto-post to the ERP/CRM. If the lowest field scores 70-94: queue for human review with the AI's reasoning shown. If any field scores below 70: a human handles the document from scratch.
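The routing rule above is a few lines of code. The thresholds are the starting points from this step, exposed as parameters so you can tune them on your own data; the function name and return labels are illustrative.

```python
def route(extraction: dict, auto: int = 95, assist: int = 70) -> str:
    """Route the whole document by its weakest field."""
    lowest = min(f["confidence"] for f in extraction.values())
    if lowest >= auto:
        return "auto_post"       # straight to ERP/CRM
    if lowest >= assist:
        return "review_with_ai"  # human verifies, AI reasoning shown
    return "manual"              # human keys it in from scratch
```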
6. Build the audit log
Every extraction logs: input doc hash, model used, prompt sent, output JSON, confidence per field, reviewer ID (if reviewed), decision (approve/edit/reject), timestamp. Required for compliance, useful for re-training.
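A log row covering those fields might look like this. The sketch assumes append-only storage exists elsewhere; only the record shape is shown, and the function name is illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(doc_bytes: bytes, model: str, prompt: str, output: dict,
                 confidence: dict, reviewer_id=None,
                 decision="approve") -> dict:
    """One append-only log row per extraction."""
    return {
        "doc_hash": hashlib.sha256(doc_bytes).hexdigest(),
        "model": model,
        "prompt": prompt,
        "output": json.dumps(output),
        "confidence": confidence,
        "reviewer_id": reviewer_id,  # None when fully automatic
        "decision": decision,        # approve / edit / reject
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```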
7. Re-train on human corrections weekly
Every human edit is supervised-learning data. Aggregate corrections weekly and fold them into the prompt's few-shot examples. Track accuracy week over week; target 99%+ at steady state.
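The weekly loop can be sketched from the audit log alone. Assumptions: log rows carry a `decision` field as defined in the previous step, and reviewed rows additionally carry hypothetical `doc_text` and `corrected_output` fields; both helper names are illustrative.

```python
from collections import Counter

def weekly_accuracy(log: list) -> float:
    """Fraction of reviewed extractions the human approved unchanged."""
    decisions = Counter(row["decision"] for row in log)
    total = sum(decisions.values())
    return decisions["approve"] / total if total else 0.0

def corrections_as_few_shots(log: list, limit: int = 5) -> list:
    """Turn this week's human edits into few-shot examples for the prompt."""
    edits = [row for row in log if row["decision"] == "edit"]
    return [{"input": row["doc_text"], "expected": row["corrected_output"]}
            for row in edits[:limit]]
```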
