
What Document AI Research Tells Us About PDF Annotation Workflows

LayoutLMv3, Nougat, and TableFormer changed how researchers approach document understanding. We pull out the parts that matter when annotating real-world PDFs at production scale.

2025-05 · 8 min read

Most teams building document AI hit the same wall: their model works on the benchmark dataset and breaks on real client PDFs. Faxed scans, rotated stamps, marginalia, low-DPI invoices, multi-column statements. Benchmarks such as CORD (receipts) and FUNSD (scanned forms) cover little of that.

The research community has been working on exactly this gap for the last five years. Three lines of work changed how we think about annotation: pre-training on document structure (LayoutLM through LayoutLMv3), end-to-end transcription without OCR (Donut and Nougat), and structured table understanding (TableFormer and PubTables-1M). Reading these papers does not give you a labeling guideline, but ignoring them costs you accuracy you can't get back.

Layout, not just text

Early document AI annotation treated PDFs as text streams with bounding boxes. The text is in the boxes, the boxes are in regions, the regions are in pages. LayoutLMv3 (Huang et al., 2022) showed that joint pre-training on text, layout, and image yields representations that transfer well across document types.

The practical implication for annotation is that layout is a first-class signal, not metadata. Region polygons (not bounding boxes), reading order tags, and rotation angles need to be in the annotation schema from day one. Adding them later is a re-annotation job, not a schema tweak.

Polygon regions, because financial and legal documents are not all rectangles
Reading order labels for multi-column layouts
Rotation angle for stamps and signatures off the page axis
Cross-page region linking for multi-page contracts
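The four schema elements above can be sketched as a single record type. This is a minimal illustration, not a standard format: the `Region` class, its field names, and the `validate` helper are all hypothetical, and the coordinate units are assumed to be whatever the annotation tool exports.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in page coordinates (units are an assumption)


@dataclass
class Region:
    """Hypothetical layout-aware annotation record."""
    region_id: str
    page: int                              # zero-based page index
    polygon: List[Point]                   # three or more vertices, not just a box
    label: str                             # e.g. "paragraph", "stamp", "signature"
    reading_order: Optional[int] = None    # position in the page's reading sequence
    rotation_deg: float = 0.0              # rotation relative to the page axis
    continues_from: Optional[str] = None   # region_id on an earlier page, if any

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Axis-aligned box derived from the polygon: (x_min, y_min, x_max, y_max)."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return (min(xs), min(ys), max(xs), max(ys))


def validate(regions: List[Region]) -> List[str]:
    """Return human-readable schema problems; an empty list means none were found."""
    problems: List[str] = []
    by_id = {r.region_id: r for r in regions}
    seen_order = set()
    for r in regions:
        if len(r.polygon) < 3:
            problems.append(f"{r.region_id}: polygon needs at least 3 points")
        if r.continues_from is not None:
            prev = by_id.get(r.continues_from)
            if prev is None:
                problems.append(f"{r.region_id}: unknown link {r.continues_from}")
            elif prev.page >= r.page:
                problems.append(f"{r.region_id}: link must point to an earlier page")
        if r.reading_order is not None:
            key = (r.page, r.reading_order)
            if key in seen_order:
                problems.append(f"{r.region_id}: duplicate reading order on page {r.page}")
            seen_order.add(key)
    return problems
```

Deriving the bounding box from the polygon is cheap, while the reverse is impossible, which is one reason to store the polygon as the source of truth.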

Tables are their own modality

PubTables-1M (Smock et al., 2022) and TableFormer (Nassar et al., 2022) treat table structure as a parsing problem distinct from text extraction. Cells have row and column indices. Headers are typed. Merged cells are represented as cells that span several rows or columns. Production tables also continue across page breaks, and the schema has to capture that too.

Annotation schemas that flatten tables into bounding boxes lose all of this. The downstream model can't reconstruct the table because the labels never encoded its structure. The practical rule: annotate the structure you want the model to produce. If the model needs to output cell coordinates, annotate cell coordinates.
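As a rough sketch of what structure-preserving table labels look like, the snippet below stores each cell with its row and column index plus spans, then rebuilds the grid from those labels alone. The `Cell` class and `to_grid` function are hypothetical names chosen for illustration and do not follow the format of any particular dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical structured table-cell annotation."""
    row: int               # zero-based index of the top row the cell occupies
    col: int               # zero-based index of the leftmost column
    text: str
    row_span: int = 1      # greater than 1 for vertically merged cells
    col_span: int = 1      # greater than 1 for horizontally merged cells
    is_header: bool = False


def to_grid(cells: List[Cell]) -> List[List[Optional[str]]]:
    """Rebuild a dense grid from cell labels; merged cells repeat their text."""
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                if grid[r][k] is not None:
                    raise ValueError(f"cells overlap at row {r}, col {k}")
                grid[r][k] = c.text
    return grid


cells = [
    Cell(0, 0, "Item", is_header=True),
    Cell(0, 1, "Amount", col_span=2, is_header=True),  # header merged over two columns
    Cell(1, 0, "Fee"),
    Cell(1, 1, "10"),
    Cell(1, 2, "USD"),
]
grid = to_grid(cells)
```

A positions-only schema has no equivalent of `to_grid`: without indices and spans there is nothing to rebuild the table from.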

End-to-end transcription is real

Nougat (Blecher et al., 2023, Meta AI) demonstrated that academic PDFs can be transcribed end-to-end into structured markup without a separate OCR stage. Donut (Kim et al., 2022) made the same case for forms and receipts.

For annotation work, this changes what training data looks like. Instead of separately annotating regions, then text, then layout, then tables, the labels become the final structured output you want the model to produce. Annotation effort per document goes up, but there are fewer pipeline stages to label for. Whether that is a net win depends on the corpus.
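To make "the label is the final structured output" concrete, here is a minimal sketch that flattens a nested annotation into one target string. The tag format (`<s_key>...</s_key>`, with `<sep/>` between list items) is an illustrative assumption loosely modeled on the tag-style sequences used by OCR-free models. It is not the exact tokenization of any library, and `to_target_sequence` is a hypothetical helper.

```python
from typing import Any


def to_target_sequence(node: Any) -> str:
    """Flatten a nested annotation into a tagged string (illustrative format only)."""
    if isinstance(node, dict):
        # Each key becomes an opening and closing tag around its serialized value.
        return "".join(
            f"<s_{key}>{to_target_sequence(value)}</s_{key}>"
            for key, value in node.items()
        )
    if isinstance(node, list):
        # List items are joined by a separator tag.
        return "<sep/>".join(to_target_sequence(item) for item in node)
    return str(node)


# Hypothetical receipt annotation: the whole document label is one structure.
annotation = {
    "vendor": "Acme",
    "items": [
        {"name": "Fee", "price": "10"},
        {"name": "Tax", "price": "2"},
    ],
}
target = to_target_sequence(annotation)
```

With this kind of target, the annotator's job is to get the structure right once per document, with no separate box, text, and table passes.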

What stays the same

Research papers consistently underplay three things because they are operational rather than methodological: native-speaker annotators for the document language, domain-adjacent backgrounds for terminology-heavy work such as finance or healthcare, and iterative guideline revisions during the pilot week. None of these show up in benchmarks. On most production projects, all three matter more than the choice of model architecture.

The takeaway

The research literature is more practical than it looks. Layout-aware schemas, structured table annotation, and end-to-end transcription approaches all reflect real lessons about what makes document AI generalize. Read the papers, then design the annotation schema around what you want the model to predict.

References

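Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. · ACM MM 2022
Smock, B., Pesala, R., Abraham, R. (2022). PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents. · CVPR 2022
Nassar, A. et al. (2022). TableFormer: Table Structure Understanding with Transformers. · CVPR 2022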
Blecher, L. et al. (2023). Nougat: Neural Optical Understanding for Academic Documents. · Meta AI
Kim, G. et al. (2022). OCR-free Document Understanding Transformer (Donut). · ECCV 2022
