Document AI / Financial Services
Structured Extraction From 50,000 Financial Documents for a Document AI Vendor
How a layered annotation pipeline modeled on LayoutLMv3 and TableFormer raised field-level extraction accuracy from 71% to 94.3% on invoices, contracts, and bank statements.
Client: Series B Document AI vendor
Volume: 50,000 mixed PDFs
Duration: 8 weeks
Team: 24 PDF extraction specialists, 4 senior reviewers
Languages: English C1
The challenge
The client's document understanding system was trained on a mix of public datasets and synthetic data. It plateaued at 71% field-level accuracy on production traffic with a 14% silent failure rate, where fields were extracted but populated with the wrong value.
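The two headline metrics can be made concrete with a small sketch. The record format (`FieldResult`) and the choice to compute both rates per field are assumptions made for illustration; the client's actual evaluation harness is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    """One field on one document (hypothetical record format)."""
    expected: str              # ground-truth value from annotation
    predicted: Optional[str]   # None means the system left the field empty


def field_metrics(results: list[FieldResult]) -> dict[str, float]:
    """Return field-level accuracy and silent failure rate, both per field."""
    if not results:
        raise ValueError("no field results to score")
    total = len(results)
    correct = sum(1 for r in results if r.predicted == r.expected)
    # Silent failure: a value was emitted, but it is the wrong value.
    silent = sum(
        1 for r in results
        if r.predicted is not None and r.predicted != r.expected
    )
    return {
        "field_accuracy": correct / total,
        "silent_failure_rate": silent / total,
    }


sample = [
    FieldResult("1,250.00", "1,250.00"),
    FieldResult("2024-03-01", "2024-03-01"),
    # Last digit differs: populated but wrong, so it counts as silent.
    FieldResult("DE89370400440532013000", "DE89370400440532013008"),
    # Empty field: a visible miss, not a silent failure.
    FieldResult("ACME GmbH", None),
]
metrics = field_metrics(sample)
```

The distinction matters operationally: an empty field can be routed to manual review automatically, while a populated but wrong field passes through unless something downstream catches it.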
Their template coverage spanned 12 invoice formats, 6 contract templates, and 8 bank statement layouts across 5 jurisdictions. Rotated scans, low-DPI faxes, and handwritten amendments broke their OCR + rules baseline. The manual review queue saturated at 1,200 documents per day.
They needed annotated training data covering three things at once: region-level layout segmentation, named-entity extraction with cross-page linking, and full table reconstruction across page breaks.
Our approach
Stage 1: Layout segmentation
Annotators classified document regions using a schema adapted from LayoutLMv3 (Huang et al., 2022). Regions included header, body text, table, footer, signature block, and stamp.
Each region was outlined with a polygon, not a bounding box, because financial documents often have non-rectangular layouts (rotated stamps, overlapping signatures, marginal handwritten notes).
- Reading order tags for multi-column layouts
- Per-region rotation angle when content was off-axis
- Cross-page region linking for multi-page contracts
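A region record carrying these attributes might look like the sketch below. The field names and the `Region` class are hypothetical and are not the project's actual schema. The example also shows why a polygon is tighter than a box for a rotated stamp: the axis-aligned bounding box of a 100 x 40 stamp rotated by 30 degrees is more than twice the stamp's own area.

```python
import math
from dataclasses import dataclass
from typing import Optional

Point = tuple[float, float]


@dataclass
class Region:
    """Hypothetical layout-region record; field names are illustrative."""
    region_id: str
    page: int
    label: str                        # header, body_text, table, footer, signature_block, stamp
    polygon: list[Point]              # outline vertices in page coordinates
    reading_order: Optional[int] = None
    rotation_deg: float = 0.0         # per-region rotation when content is off-axis
    linked_region_id: Optional[str] = None  # cross-page link


def rotated_rect(cx: float, cy: float, w: float, h: float, angle_deg: float) -> list[Point]:
    """Corners of a w x h rectangle centred at (cx, cy), rotated by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a) for x, y in corners]


def polygon_area(points: list[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    pairs = zip(points, points[1:] + points[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2


def bbox_area(points: list[Point]) -> float:
    """Area of the axis-aligned bounding box around the polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


stamp = Region(
    region_id="r-17",
    page=3,
    label="stamp",
    polygon=rotated_rect(300, 500, 100, 40, 30),
    rotation_deg=30.0,
)
# Share of the bounding box that is actually stamp (roughly 0.44 here).
fill = polygon_area(stamp.polygon) / bbox_area(stamp.polygon)
```

In a dense layout, the remaining share of the box is likely to contain neighbouring text or other regions.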
Stage 2: Entity extraction
Entity-level annotation followed the FUNSD form-understanding schema (Jaume et al., 2019), extended with a cross-page entity linking layer for contracts and multi-page statements.
- Entity types: counterparty name, amount, currency, date, account number, IBAN, tax ID
- Per-entity confidence flag for ambiguous cases (smudged numerals, partial occlusion)
- Semantic role labels distinguishing tax amount from subtotal from total
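One reason semantic role labels are useful is that they make amounts checkable against each other. The sketch below is an illustration only: the `Entity` record and the `check_totals` helper are hypothetical, the source does not say such a check was part of the workflow, and the parsing assumes English-style thousands separators.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Entity:
    """Hypothetical entity annotation; field names are illustrative."""
    entity_id: str
    page: int
    entity_type: str                  # e.g. "amount", "date", "iban", "tax_id"
    text: str
    role: Optional[str] = None        # "subtotal", "tax" or "total" for amounts
    ambiguous: bool = False           # confidence flag for smudged or occluded values
    linked_to: Optional[str] = None   # entity_id of the same entity on another page


def check_totals(entities: list[Entity], tolerance: Decimal = Decimal("0.01")) -> Optional[bool]:
    """Return True/False if subtotal + tax matches total, None if a role is missing."""
    amounts = {
        e.role: Decimal(e.text.replace(",", ""))  # assumes "1,190.00"-style formatting
        for e in entities
        if e.entity_type == "amount" and e.role is not None
    }
    if not {"subtotal", "tax", "total"} <= amounts.keys():
        return None
    return abs(amounts["subtotal"] + amounts["tax"] - amounts["total"]) <= tolerance


invoice = [
    Entity("e1", 1, "amount", "1,000.00", role="subtotal"),
    Entity("e2", 1, "amount", "190.00", role="tax"),
    Entity("e3", 2, "amount", "1,190.00", role="total", linked_to="e1"),
]
consistent = check_totals(invoice)

# A misread digit in the total makes the roles disagree.
misread = invoice[:2] + [Entity("e3", 2, "amount", "1,790.00", role="total")]
inconsistent = check_totals(misread)
```

Without the role labels, all three values are just "amount" and the check cannot be expressed.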
Stage 3: Table reconstruction
Table annotation referenced the schema published in PubTables-1M (Smock et al., 2022, Microsoft Research) and the cell structure approach from TableFormer (Nassar et al., 2022, IBM Research).
Borderless tables, nested tables, and tables rotated for fold-out printing were treated as separate edge cases with their own guideline pages.
- Row and column indices, including merged cells
- Header row vs. data row classification
- Continuation tables across page breaks linked by table ID
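Linking continuation tables by table ID lets the per-page fragments be stitched back into one logical table. The following is a minimal sketch under stated assumptions: the `Cell` and `TableFragment` records are hypothetical, and dropping repeated header rows on continuation pages is an illustrative choice, not a documented rule from the guidelines.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Cell:
    row: int
    col: int
    text: str
    row_span: int = 1        # >1 for vertically merged cells
    col_span: int = 1        # >1 for horizontally merged cells
    is_header: bool = False  # header row vs. data row


@dataclass
class TableFragment:
    table_id: str            # shared by all fragments of one logical table
    page: int
    cells: list[Cell]


def stitch(fragments: list[TableFragment]) -> dict[str, list[Cell]]:
    """Merge fragments with the same table_id into one table, in page order."""
    by_id: dict[str, list[TableFragment]] = defaultdict(list)
    for frag in sorted(fragments, key=lambda f: f.page):
        by_id[frag.table_id].append(frag)

    merged: dict[str, list[Cell]] = {}
    for table_id, frags in by_id.items():
        out: list[Cell] = []
        row_offset = 0
        for i, frag in enumerate(frags):
            # Assumption: headers repeated on continuation pages are dropped.
            cells = [c for c in frag.cells if not (i > 0 and c.is_header)]
            if not cells:
                continue
            base = min(c.row for c in cells)
            for c in cells:
                out.append(Cell(c.row - base + row_offset, c.col, c.text,
                                c.row_span, c.col_span, c.is_header))
            # Next fragment starts below the lowest row written so far.
            row_offset = max(c.row + c.row_span for c in out)
        merged[table_id] = out
    return merged


page1 = TableFragment("t-1", 1, [
    Cell(0, 0, "Date", is_header=True), Cell(0, 1, "Amount", is_header=True),
    Cell(1, 0, "2024-03-01"), Cell(1, 1, "120.00"),
])
page2 = TableFragment("t-1", 2, [
    Cell(0, 0, "Date", is_header=True), Cell(0, 1, "Amount", is_header=True),
    Cell(1, 0, "2024-03-02"), Cell(1, 1, "75.50"),
])
table = stitch([page2, page1])["t-1"]
```

The fragments are passed out of order on purpose: sorting by page number inside `stitch` keeps the result independent of input order.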
Quality assurance
Every document went through a two-pass review. Inter-annotator agreement, measured with Cohen's kappa, stayed above 0.87 across the project. The domain lead audited a 10% random sample weekly and adjudicated every case where two annotators' region polygons overlapped at less than 80% IoU.
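For reference, Cohen's kappa corrects raw agreement between two annotators for the agreement expected by chance. The sketch below is a plain-Python version of the standard formula; the label sequences are made-up examples, not project data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up region labels from two annotation passes over ten regions.
pass_1 = ["table", "table", "header", "footer", "stamp",
          "table", "header", "footer", "stamp", "table"]
pass_2 = ["table", "table", "header", "footer", "stamp",
          "table", "header", "footer", "stamp", "header"]
kappa = cohens_kappa(pass_1, pass_2)
```

In this toy example the annotators agree on 9 of 10 regions (90% raw agreement), but kappa comes out at roughly 0.86 because some of that agreement would be expected by chance.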
Results
Field accuracy: 94.3%
Silent failure rate: 4.2%
Review queue reduction: 68%
Documents delivered: 48,219
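Set against the baseline in the challenge section (71% field accuracy, 14% silent failure rate), these figures can be restated as relative reductions. The arithmetic below uses only numbers reported in this case study; the `relative_reduction` helper is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate fell, relative to its starting value."""
    return (before - after) / before


# Field-level error rate is the complement of field-level accuracy.
error_before = 100.0 - 71.0   # 29.0 percentage points
error_after = 100.0 - 94.3    # about 5.7 percentage points
field_error_reduction = relative_reduction(error_before, error_after)

# Silent failure rate, before and after.
silent_reduction = relative_reduction(14.0, 4.2)

# Share of the 50,000-document volume that was delivered.
delivered_share = 48_219 / 50_000
```

That works out to roughly an 80% relative reduction in field-level errors, a 70% relative reduction in silent failures, and about 96.4% of the stated volume delivered.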
What made it work
1. Native English annotators with finance-adjacent backgrounds (bookkeeping, accounts payable) made fewer terminology errors than generalist annotators on the same documents.
2. Guidelines went through four revisions in the first week. Locking the schema too early would have forced costly re-annotation across the full corpus.
3. Daily standups between the client's ML team and our reviewers during the first sprint cut clarification cycles from two days to two hours.
4. Polygons beat bounding boxes for financial documents. Rotated stamps and signatures inside dense layouts cost accuracy when boxed.
References
Published research that informed the labeling schema and workflow.
- Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. · ACM MM 2022
- Nassar, A., Livathinos, N., Lysak, M., Staar, P. (2022). TableFormer: Table Structure Understanding with Transformers. · CVPR 2022
- Jaume, G., Ekenel, H. K., Thiran, J.-P. (2019). FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. · ICDAR-OST 2019
- Smock, B., Pesala, R., Abraham, R. (2022). PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents. · CVPR 2022
- Blecher, L., Cucurull, G., Scialom, T., Stojnic, R. (2023). Nougat: Neural Optical Understanding for Academic Documents. · Meta AI