AI vs Human PDF Extraction: Building a 2.5 Million Row University Database

The Brief: A Century of University Graduation Records

A prominent US research institute — one you would recognise — came to us with an ambitious data project. They needed a comprehensive, structured database of student graduation records spanning 100 years across 50 US universities. The end goal was a clean, consistent, queryable dataset of 2.5 million rows covering a century of academic history.

The source material was university yearbooks — scanned PDFs spanning from the early 1900s through to the present day. The project timeline was six months. The team would be 30 contracted staff plus project management and technical oversight. This was not a small job.

The Source Material: Why This Was Hard

Anyone who has looked at a university yearbook from 1923 knows immediately why automated extraction was going to be a challenge. The documents were scanned PDFs of physical yearbooks spanning an entire century of printing technology, design conventions, and photographic reproduction quality.

The challenges stacked up quickly. Print quality varied enormously — early yearbooks were often printed with very small type, and scans of already-aged physical books produced image quality that made text recognition genuinely difficult even for a human reading carefully. Formatting had zero consistency — not just between universities and decades, but sometimes between different sections of the same yearbook. Layouts changed every year at every institution, with no standard structure for how graduation records were presented. Some yearbooks listed students in columns, some in paragraphs, some with photographs and captions, some without. Names, degrees, hometowns, honours — the fields that existed and how they were laid out changed constantly across 50 institutions and 100 years.

There was no template that worked. There was no pattern that held across the dataset. Every document was, in some meaningful sense, its own unique problem.

AI vs Humans: The Honest Result

Before committing to a fully human-led extraction approach, we tested AI extraction thoroughly on representative samples from across the document set. The conclusion was unambiguous: AI 0, Humans 1.

The core problem was that current AI extraction relies on pattern recognition. A model learns from examples, identifies structure, and applies that structure to new documents. That works well when there is consistent structure to learn from. When there isn't, and every document is genuinely different, pattern recognition has nothing reliable to work with.

On the clean, modern yearbooks from recent decades, AI extraction produced reasonable output. On anything pre-1960, anything with damaged or low-quality scans, anything with unusual layouts, or anything with very small print, the error rates were unacceptable. Not marginally wrong — fundamentally unreliable. Names were misread, records were merged incorrectly, fields were confused with each other, and entire sections were silently skipped.

The insidious problem with AI extraction errors on this kind of material is that they aren't always visible. Output that looks plausible but contains quietly incorrect data is worse than output that obviously fails — because at least obvious failures get caught. With 2.5 million records going into a research database, silent errors at scale were not an acceptable outcome.
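To make the scale argument concrete, here is a small arithmetic sketch. The error rates below are purely illustrative assumptions, not rates measured on this project; the only figure taken from the project is the 2.5 million row total.

```python
# Illustrative arithmetic only: the error rates are hypothetical,
# not figures measured during the project.
TOTAL_ROWS = 2_500_000  # target size of the database


def expected_bad_rows(error_rate: float, total_rows: int = TOTAL_ROWS) -> int:
    """Expected number of incorrect rows at a given per-row error rate."""
    return round(total_rows * error_rate)


if __name__ == "__main__":
    for rate in (0.001, 0.01, 0.05):  # hypothetical silent-error rates
        print(f"{rate:.1%} error rate: {expected_bad_rows(rate):,} bad rows")
```

Even a rate that sounds small per row leaves thousands of incorrect records in a dataset of this size, and none of them announce themselves.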

AI was not used for extraction on this project. That decision was made early, on the strength of rigorous testing, and was not revisited.

The Approach: Human Extraction With Structured QA

The solution was a 30-person contracted team working through the yearbook scans manually, supported by a structured workflow that maintained consistency and caught errors before they reached the final database.

The technical infrastructure handled everything the humans shouldn't have to think about. Document organisation, batching, progress tracking, data ingestion, format normalisation, and database output were all managed systematically so the team could focus entirely on the extraction work itself. Each extracted record went through a QA validation layer before being committed to the database, with ambiguous or uncertain records flagged for secondary review rather than being guessed at.
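As a rough illustration of what a QA validation layer of this kind can look like, here is a minimal sketch. The record fields, the validation rules, the year bounds, and the names `GraduateRecord`, `validate`, and `route` are all assumptions made for the example; the project's actual schema and checks are not described here.

```python
from dataclasses import dataclass, field

# Assumed bounds for a plausible graduation year (hypothetical values).
MIN_YEAR, MAX_YEAR = 1900, 2025


@dataclass
class GraduateRecord:
    """Hypothetical record layout, for illustration only."""
    university: str
    year: int
    name: str
    degree: str
    uncertain: bool = False  # set by the operator when the scan is hard to read
    issues: list[str] = field(default_factory=list)


def validate(record: GraduateRecord) -> GraduateRecord:
    """Attach a list of issues; an empty list means the record can be committed."""
    issues = []
    if not record.name.strip():
        issues.append("missing name")
    if not MIN_YEAR <= record.year <= MAX_YEAR:
        issues.append("year out of range")
    if not record.degree.strip():
        issues.append("missing degree")
    if record.uncertain:
        issues.append("operator marked as uncertain")
    record.issues = issues
    return record


def route(records):
    """Split records into those ready to commit and those needing secondary review."""
    commit, review = [], []
    for record in records:
        target = review if validate(record).issues else commit
        target.append(record)
    return commit, review
```

The point of the design is that a record is never silently corrected or guessed at: it either passes every check and is committed, or it goes to a review queue with the reasons attached.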

Consistency across 30 people working on documents spanning a century of varying formats required clear standards and ongoing calibration. When the team encountered new document types or edge cases — which happened regularly across such a diverse source set — those were handled centrally and the resolution fed back to the full team rather than being resolved differently by different operators.

The batch delivery model meant the research institute received validated data throughout the six-month project rather than waiting for a single end-of-project handover. That also meant any systematic issues could be identified and corrected early rather than discovered at the end.
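One way to spot systematic issues between batch deliveries is to compare how often records in each batch were flagged for review. The sketch below is an assumption about how such a check could work, not a description of the tooling used on the project; the function names, the input shape, and the `tolerance` threshold are hypothetical.

```python
from statistics import median


def batch_flag_rates(batches):
    """Map batch id -> fraction of records flagged for secondary review.

    `batches` maps a batch id to a (flagged_count, total_count) pair.
    Empty batches are skipped to avoid dividing by zero.
    """
    return {
        batch_id: flagged / total
        for batch_id, (flagged, total) in batches.items()
        if total
    }


def unusual_batches(rates, tolerance=2.0):
    """Return ids of batches whose flag rate exceeds `tolerance` x the median rate."""
    if not rates:
        return []
    baseline = median(rates.values())
    return sorted(b for b, rate in rates.items() if rate > tolerance * baseline)
```

A batch that stands out this way is a prompt to look at the source documents or the instructions behind it before the next delivery, rather than proof that anything is wrong.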

The Result: 2.5 Million Rows, Consistent and Clean

Six months after project start, the research institute had a 2.5 million row database of student graduation records covering 50 US universities and 100 years of academic history — structured, consistent, and validated to the accuracy standard required for research use.

A century of graduation records that had existed only as physical yearbooks and low-quality scans was now queryable, analysable, and usable for research purposes. That transformation required human expertise applied at scale, not automation.

When Does AI Actually Win?

To be fair to the technology — and to give you an honest answer rather than a self-serving one — AI extraction genuinely works well in the right conditions. Clean, machine-generated PDFs with consistent structure. Modern documents with standard layouts. High-volume processing where occasional errors are acceptable and easy to spot. Situations where the document set is homogeneous enough that a trained model can learn reliable patterns.

The mistake is assuming that because AI extraction works in those conditions, it will work in all conditions. It won't. Legacy documents, scanned physical materials, inconsistent formats, very small print, handwritten content, and document sets that span decades of changing design conventions are all situations where human extraction — properly organised and quality-controlled — produces better results.

The honest answer to "AI or humans?" is always: it depends on the documents. Anyone who tells you AI extraction works on everything hasn't tried it on a 1924 university yearbook.

