Assessing and Improving OCR Quality in the HathiTrust

The rise of large-scale digitized book collections, such as those provided by Google Books, the HathiTrust, and the Internet Archive, is enabling a fundamentally new kind of text analysis that exploits the scale of these collections to ask questions not possible with smaller corpora. This movement into distant reading has allowed researchers to study the changing semantic fields of British novels from the long 19th century (Heuser and Le-Khac, 2012), the variable attention given to geographical locations in American texts published before and after the Civil War (Wilkens, 2013), and the representation of gender across two centuries of literary history (Underwood et al., 2018), among many other topics.

Our long-term goal is to enable this kind of large-scale distant reading by general researchers: both current experts in computational text analysis and the next generation of literary scholars who are learning empirical methods alongside traditional techniques for close reading. While large-scale digital libraries like the HathiTrust in many ways present the greatest opportunity for computational critical research, they also present several important challenges. We aim to improve these collections along three dimensions: improving metadata, recognizing the document structure of OCR’d books, and measuring and improving OCR quality.

Researchers

David Bamman, Cody Hennesy

Department/school

School of Information, The Library

Project type

Research