Rebuilt invoice OCR with 94% accuracy

Industry

Procurement

Engagement

5 months

Team

2 engineers + 1 ML lead

Stack

Python, AWS Textract, custom transformer, Postgres

The challenge

The legacy OCR pipeline was a brittle Tesseract chain hitting 71% accuracy on real-world invoices. Misreads cascaded into AP automation errors that customers had to reconcile by hand.

Our approach

We built a hybrid pipeline — Textract for layout, a fine-tuned transformer for field extraction, with confidence-aware fallbacks to human-in-the-loop. Every release went through a regression eval against a 4,000-invoice gold set.

The outcome

94% accuracy in production. Customer support tickets on AP errors dropped by more than half. Tooling is now testable and the team can iterate weekly.

Let's
build
something.

Tell us what you're working on. We'll reply within a business day with a short plan, not a brochure.

Start a project →

Rebuilt invoice OCR with 94% accuracy

Let'sbuildsomething.

Let's
build
something.