Digitization Workflows: Scanning, OCR, and Audio Transcription

Converting documents, text, images, and sound files to digital and/or machine-readable formats is a prerequisite for many digital humanities projects. Digitization is the process of capturing analog materials as digital images. Optical Character Recognition (OCR) programs “read” these images and convert them to text documents which can be easily searched, copied, edited, or used for computational text analysis methods. Transcription is the process of translating audio or video files into a text format. Explore more tools these tasks in the ‘Capture’ category on the DiRT Directory.

Digitization / Capture

Whether you plan to publish scanned images in a digital collection or apply OCR tools, aim to capture the highest resolution images possible. In cases where a scanner is not available (i.e. for archival materials), a DSLR camera may provide a good substitute for high quality image capture. A variety of scanning stations and image capture services are available on campus:

BookScan stations offer free high-resolution scanning for documents up to 11” x 17” in a variety of campus libraries. Several stations are scattered throughout Main Stacks, Doe Library, and Moffitt Library. Scanning stations allow users to perform basic editing functions (cropping, rotating, etc.) before saving files to a USB drive.
Free flatbed scanning is available at several Educational Technology Services (ETS) computer facilities. Appointments can be made to use reserve the scanners for up to 2 hours per day.
Large format scanning and printing can be performed by staff at Earth Sciences & Map Library. Note: printing is for cartographic materials only.
Self-service microfilm / microfiche reader-scanners are available at the Newspapers & Microfilm Library.
Bancroft Library
- The Duplication Services Unit offers the services below. Duplication can be ordered using this form.
  - High-resolution imaging (publication quality)
  - Mid-resolution PDFs (research quality)
  - Paper photocopies
  - Audio-visual (via CD/DVD)
  - Microfilm-to-microfilm duplication
- Personal cameras can be used at the Bancroft Library, but users must pay a fee. See the Personal Camera Use Policy.

Optical Character Recognition (OCR)

OCR programs process scanned documents (e.g. books, newspapers) to extract text. The quality of the text extraction will depend on the resolution of the scanned document, the format and quality of the print materials, and how well the OCR program deals with other languages or diacritical marks. No OCR program is 100% accurate. How much time a project decides to invest in correcting OCR will depend on the aims of the project. For example, researchers producing a digital edition as a resource for other scholars will need to invest significant resources in correcting OCR accuracy, whereas researchers performing large scale computational analyses (e.g. thousands of newspaper articles) may accept a lower degree of OCR accuracy. In the latter case, testing OCR accuracy on a subset of documents and quantifying OCR accuracy is good methodological practice (see Digital Humanities Quarterly for a discussion of this question as it applies to digitized historical newspapers). OCR on smaller collections (such as a 50 page PDF) can help researchers search for details within a single document. Comparison of original book scan and OCR results

A variety of programs allow users to perform OCR on documents in batches and export collections of documents. Note that the accuracy of OCR will vary from program to program, and the quality of the scanned document will be the determining factor. OCR capabilities are limited for scholars working with cursive or handwriting, though this is an active area of research in the field.

Software

Adobe Acrobat Pro (Windows / Mac, closed source, commercial)
Acrobat Pro is available to all UC Berkeley affiliates via campus license. Claim your Adobe Creative Cloud license to install Acrobat on your machine or utilize ETS computer facilities.

ABBYY Finereader (Windows / Mac, closed source, commercial with educational discount and free trial)
ABBYY FineReader is a robust tool for OCR. ABBYY FineReader works well with digital camera images, unusually structured text (e.g. magazine layouts, newspaper columns), offers automated workflows for conversion, and supports up to 190 languages.

Tesseract (Windows / Mac / Linux, open source, free)
Tesseract is an open source OCR engine. It can be used directly (via the command line) or with an API. Several third-party graphical user interfaces (GUI) are available for users who would like a drag-and-drop interface. Specialized packages for working with different languages and scripts, such as cuneiform and Vietnamese, are also available. Read Ammon Shepherd’s “Watermarking and OCR-ing Your Images” blog for a short walkthrough of using Tesseract without a GUI. Shepherd also provides scripts for batch processing.

Google Docs (Web, free)
Google Docs allows users to perform OCR on uploaded images and PDFs. See this blog for a walkthrough and screenshots. Read about recommended document specifications here.

Audio Transcription

Pop Up Archive (web, commercial with free monthly quota)
Researchers who are interested in working with collections of sound and would like to retrieve automated, time-stamped transcripts should consider working with the Pop Up Archive. Use Pop Up Archive’s free hour of processing per month to test out the service with your sound recordings. Basic transcripts are machine-generated; premium transcripts draw upon archival audio for a more accurate transcript and can identify multiple speakers ( Pop Up Archive’s premium transcripts are English-only and are optimized for North American and British dialects). Transcripts are correctable within the Pop Up Archive platform. Processed files are exportable to a variety of formats, such as SRT (for captions), XML, and JSON. See the following tutorials for a preview of their features:

oTranscribe (web, open source, free)
oTranscribe combines an audio and video file player with a transcription window. Keyboard shortcuts allow a user to pause, play, and rewind footage, and insert interactive timestamps. Transcript files can be exported to Markdown, plain text and Google Docs. Files are stored locally (on your machine), not on the web. See oTranscribe’s discussion of file security here.

Other resources:

Spreads: a Python package for digitization workflows
Ammon Shepherd “Watermarking and OCR-ing Your Images” (dh+lib review)
Thomas Padilla, “Getting Started with Open Refine” - a tool for cleaning data

Image: Comparison of typewritten text and Tesseract OCR results | Courtesy of Ammon Shepherd (CC-BY 4.0)

Last updated: Jun 15, 2015