Document AI / Financial Services

Structured Extraction From 50,000 Financial Documents for a Document AI Vendor

How a layered annotation pipeline modeled on LayoutLMv3 and TableFormer raised field-level extraction accuracy from 71% to 94.3% on invoices, contracts, and bank statements.

PDF Extraction · Document AI · LayoutLM · Tables

Client

Series B Document AI vendor

Volume

50,000 mixed PDFs

Duration

8 weeks

Team

24 PDF extraction specialists, 4 senior reviewers

Languages

English C1

The challenge

The client's document understanding system was trained on a mix of public datasets and synthetic data. It plateaued at 71% field-level accuracy on production traffic with a 14% silent failure rate, where fields were extracted but populated with the wrong value.
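To make the two headline metrics concrete, the following is a minimal sketch of how field-level accuracy and a silent failure rate could be computed against ground truth. The function name `score_fields`, the dict-based record format, and the choice of denominator (all expected fields) are assumptions for illustration, not the client's evaluation code.

```python
from typing import Dict, Optional, Tuple


def score_fields(
    truth: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Return (field_accuracy, silent_failure_rate) over the expected fields.

    Assumption: a field is a silent failure when the system returned a
    non-empty value that differs from the ground truth. A missing or empty
    value is treated as a visible failure, since downstream checks can see it.
    """
    if not truth:
        raise ValueError("truth must contain at least one field")

    correct = 0
    silent = 0
    for name, expected in truth.items():
        value = predicted.get(name)
        if value == expected:
            correct += 1
        elif value:  # populated, but with the wrong value
            silent += 1

    total = len(truth)
    return correct / total, silent / total


# Hypothetical invoice with four expected fields.
truth = {"total": "1200.00", "currency": "EUR", "date": "2023-04-01", "iban": "DE00"}
predicted = {"total": "1200.00", "currency": "EUR", "date": "2023-01-04", "iban": None}

accuracy, silent_rate = score_fields(truth, predicted)
print(accuracy, silent_rate)  # 0.5 0.25
```

In this example the swapped day and month in `date` is the silent failure: the field looks populated, so nothing routes it to manual review.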

Their template coverage spanned 12 invoice formats, 6 contract templates, and 8 bank statement layouts across 5 jurisdictions. Rotated scans, low-DPI faxes, and handwritten amendments broke their OCR + rules baseline. The manual review queue saturated at 1,200 documents per day.

They needed annotated training data covering three things at once: region-level layout segmentation, named-entity extraction with cross-page linking, and full table reconstruction across page breaks.

Our approach

Stage 1: Layout segmentation

Annotators classified document regions using a schema adapted from LayoutLMv3 (Huang et al., 2022). Regions included header, body text, table, footer, signature block, and stamp.

Each region was outlined with a polygon, not a bounding box, because financial documents often have non-rectangular layouts (rotated stamps, overlapping signatures, marginal handwritten notes).

Reading order tags for multi-column layouts
Per-region rotation angle when content was off-axis
Cross-page region linking for multi-page contracts
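A region annotation carrying the attributes above could be represented roughly as follows. This is a sketch under stated assumptions: the class name `Region`, its field names, and the snake_case type labels are hypothetical and do not reproduce the project's actual export format. The two area methods illustrate why a polygon is a tighter fit than a box for rotated content.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

# Region types named in the text; the snake_case spellings are an assumption.
REGION_TYPES = {"header", "body_text", "table", "footer", "signature_block", "stamp"}


@dataclass
class Region:
    region_id: str
    page: int
    region_type: str
    polygon: List[Point]                     # vertices in order, page coordinates
    reading_order: Optional[int] = None      # position in the page's reading flow
    rotation_deg: float = 0.0                # off-axis angle of the content
    linked_region_id: Optional[str] = None   # same logical region on another page

    def __post_init__(self) -> None:
        if self.region_type not in REGION_TYPES:
            raise ValueError(f"unknown region type: {self.region_type}")
        if len(self.polygon) < 3:
            raise ValueError("a polygon needs at least three vertices")

    def polygon_area(self) -> float:
        """Area of the polygon by the shoelace formula."""
        total = 0.0
        n = len(self.polygon)
        for i in range(n):
            x1, y1 = self.polygon[i]
            x2, y2 = self.polygon[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0

    def bounding_box_area(self) -> float:
        """Area of the axis-aligned box enclosing the polygon."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return (max(xs) - min(xs)) * (max(ys) - min(ys))


# A square stamp rotated 45 degrees: a diamond with its vertices on the axes.
stamp = Region(
    region_id="r-001",
    page=1,
    region_type="stamp",
    polygon=[(10.0, 0.0), (20.0, 10.0), (10.0, 20.0), (0.0, 10.0)],
    rotation_deg=45.0,
)

print(stamp.polygon_area())       # 200.0
print(stamp.bounding_box_area())  # 400.0
```

For this stamp the enclosing box covers twice the polygon's area, and the extra half would be whatever text or table cells sit next to the stamp.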

Stage 2: Entity extraction

Entity-level annotation followed the FUNSD form-understanding schema (Jaume et al., 2019), extended with a cross-page entity linking layer for contracts and multi-page statements.

Entity types: counterparty name, amount, currency, date, account number, IBAN, tax ID
Per-entity confidence flag for ambiguous cases (smudged numerals, partial occlusion)
Semantic role labels distinguishing tax amount from subtotal from total
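The entity layer described above can be sketched as a record plus a grouping step for cross-page links. The names `Entity`, `link_id`, and `group_linked` are hypothetical, and modeling the confidence flag as a boolean `ambiguous` field is an assumption; the source does not specify the storage format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

ENTITY_TYPES = {
    "counterparty_name", "amount", "currency", "date",
    "account_number", "iban", "tax_id",
}
AMOUNT_ROLES = {"subtotal", "tax_amount", "total"}


@dataclass
class Entity:
    entity_id: str
    page: int
    entity_type: str
    text: str
    role: Optional[str] = None     # semantic role, used for amounts
    ambiguous: bool = False        # smudged numerals, partial occlusion
    link_id: Optional[str] = None  # shared by mentions of the same entity

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.role is not None and self.role not in AMOUNT_ROLES:
            raise ValueError(f"unknown role: {self.role}")


def group_linked(entities: List[Entity]) -> Dict[str, List[Entity]]:
    """Collect mentions that share a link_id, ordered by page."""
    groups: Dict[str, List[Entity]] = defaultdict(list)
    for entity in entities:
        if entity.link_id is not None:
            groups[entity.link_id].append(entity)
    return {key: sorted(items, key=lambda e: e.page) for key, items in groups.items()}


# Hypothetical contract: the counterparty appears on the cover and signature page.
entities = [
    Entity("e3", 7, "counterparty_name", "Acme GmbH", link_id="cp-1"),
    Entity("e1", 1, "counterparty_name", "ACME GmbH", link_id="cp-1"),
    Entity("e2", 2, "amount", "1,190.00", role="total", ambiguous=True),
]

linked = group_linked(entities)
print([e.page for e in linked["cp-1"]])  # [1, 7]
```

Keeping the role on the amount itself, instead of inferring it from position, is what lets a model learn to tell a tax amount from a subtotal or total when layouts differ.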

Stage 3: Table reconstruction

Table annotation referenced the schema published in PubTables-1M (Smock et al., 2022, Microsoft Research) and the cell structure approach from TableFormer (Nassar et al., 2022, IBM Research).

Borderless tables, nested tables, and tables rotated for fold-out printing were treated as separate edge cases with their own guideline pages.

Row and column indices, including merged cells
Header row vs. data row classification
Continuation tables across page breaks linked by table ID
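Linking continuation tables by table ID makes it possible to rebuild one logical table from its per-page fragments. The sketch below shows one way that could work. The names `Cell`, `TableFragment`, and `stitch` are hypothetical, and dropping header rows that repeat on continuation pages is an assumption about the desired output, not a documented project rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    row: int
    col: int
    text: str
    row_span: int = 1   # values above 1 represent merged cells
    col_span: int = 1
    is_header: bool = False


@dataclass
class TableFragment:
    table_id: str       # shared by all fragments of one logical table
    page: int
    cells: List[Cell] = field(default_factory=list)


def stitch(fragments: List[TableFragment], table_id: str) -> List[Cell]:
    """Merge the fragments of one table, in page order, into a single grid."""
    parts = sorted(
        (f for f in fragments if f.table_id == table_id),
        key=lambda f: f.page,
    )
    merged: List[Cell] = []
    offset = 0  # first free row index in the merged table
    for index, part in enumerate(parts):
        # Assumption: headers repeated on continuation pages are dropped.
        keep = [c for c in part.cells if index == 0 or not c.is_header]
        if not keep:
            continue
        first_row = min(c.row for c in keep)
        for c in keep:
            merged.append(
                Cell(c.row - first_row + offset, c.col, c.text,
                     c.row_span, c.col_span, c.is_header)
            )
        offset = max(c.row + c.row_span for c in merged)
    return merged


# Hypothetical bank statement table split across pages 3 and 4.
page4 = TableFragment("t-1", 4, [
    Cell(0, 0, "Date", is_header=True), Cell(0, 1, "Amount", is_header=True),
    Cell(1, 0, "2023-01-03"), Cell(1, 1, "250.00"),
])
page3 = TableFragment("t-1", 3, [
    Cell(0, 0, "Date", is_header=True), Cell(0, 1, "Amount", is_header=True),
    Cell(1, 0, "2023-01-02"), Cell(1, 1, "100.00"),
])

table = stitch([page4, page3], "t-1")
print([(c.row, c.text) for c in table])
```

The fragments are passed out of order on purpose: the merge sorts by page, so the result has one header row followed by the data rows from page 3 and then page 4.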

Quality assurance

Every document went through a two-pass review. Cohen's kappa stayed above 0.87 across the project. The domain lead audited a 10% random sample weekly and adjudicated all region polygon disagreements, meaning cases where the polygons from the two passes overlapped at less than 80% IoU.
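For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below computes it over categorical labels. Applying it to region type labels is an assumption for illustration; the source does not state which labels the reported 0.87 was measured on.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)
    # Share of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample: ten regions, one disagreement on the last item.
first_pass = ["table"] * 5 + ["stamp"] * 5
second_pass = ["table"] * 5 + ["stamp"] * 4 + ["table"]

kappa = cohens_kappa(first_pass, second_pass)
print(round(kappa, 3))  # 0.8
```

Here the raters agree on 9 of 10 items (0.9 observed) while chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.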

Results

94.3%

Field accuracy

4.2%

Silent failure rate

68%

Review queue reduction

48,219

Documents delivered

What made it work

1
Native English annotators with finance-adjacent backgrounds (bookkeeping, accounts payable) made fewer terminology errors than generalist annotators on the same documents.
2
Guidelines went through four revisions in the first week. Locking the schema too early would have forced re-annotation across the full corpus.
3
Daily standups between the client's ML team and our reviewers during the first sprint cut clarification cycles from two days to two hours.
4
Polygons beat bounding boxes for financial documents. Rotated stamps and signatures inside dense layouts cost accuracy when boxed.

References

Published research that informed the labeling schema and workflow.

Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. · ACM MM 2022
Nassar, A., Livathinos, N., Lysak, M., Staar, P. (2022). TableFormer: Table Structure Understanding with Transformers. · CVPR 2022
Jaume, G., Ekenel, H. K., Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. · ICDAR-OST 2019
Smock, B., Pesala, R., Abraham, R. (2022). PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents. · CVPR 2022
Blecher, L., Cucurull, G., Scialom, T., Stojnic, R. (2023). Nougat: Neural Optical Understanding for Academic Documents. · Meta AI
