This fall, David Bamman joined the School of Information faculty as an Assistant Professor. His research in natural language processing and machine learning has direct applications for digital humanities scholarship. Bamman himself has a background in the humanities, including undergraduate studies in classics and English literature at the University of Wisconsin-Madison and an M.A. in Applied Linguistics from Boston University. He has also worked as a senior researcher for the Perseus Project, developing analysis tools for Greek and Latin texts.
Bamman’s interdisciplinary interests extend to his teaching as well. While pursuing his Ph.D. in Computer Science at Carnegie Mellon University’s Language Technologies Institute, Bamman co-taught “Digital Literary and Cultural Studies” with Chris Warren, Associate Professor of English. In the course, undergraduates learned a “broad range of methods in the digital humanities (including social network analysis, clustering, and classification) with application to problems in Renaissance and Early Modern Studies.” Bamman explained that such interdisciplinary classes give students from different disciplines an important place to build a shared vocabulary by articulating, and thinking critically about, the core assumptions of their home disciplines. As an example, he pointed to the concept of validity. “In computer science, validity is taken as a given. If you can’t verify your method, it’s not useful,” Bamman said. “But validity adheres to different criteria in literary studies. What counts as a meaningful contribution to the field of study? What is evidence?” In literature, these arguments don’t necessarily need to be quantifiable, but they do still need to adhere to standards of evidence. To engage in interdisciplinary work, Bamman argued, we need to begin by “establishing that there are very different perspectives on these core ideas.”
Bamman invites students from a variety of disciplines to join his upcoming spring semester course, “Information 290: Deconstructing Data Science.” Students will explore “a range of methods in machine learning and data analysis that leverage information produced by people in order to draw inferences.” Bamman’s example applications include discerning the authorship of documents, examining the political sentiments of social media users, charting the reuse of language in legislative bills, tagging the genres of songs, and extracting social networks from literary texts. Machine learning has seen increasing adoption in Berkeley’s digital humanities community; DH Fellow Elizabeth Honig currently collaborates with researchers at Duke University who are using machine learning to discern common motifs in large collections of paintings (such as images adapted from templates in the Brueghel family workshop). As students survey these methods, they will also engage critically with the models of the world these methods construct and with their limitations. Students will have the option to complete the homework in one of two ways: either by writing a critique of an algorithm and of published work that has used it, or, if they have programming experience, by implementing and evaluating a method on a dataset.
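To make one of these methods concrete, here is a minimal sketch of authorship attribution framed as text classification, using a naive Bayes model built from the Python standard library. The corpus below is invented for illustration and is not material from Bamman’s course; a real study would use full texts and more careful preprocessing.

```python
import math
from collections import Counter

def train_nb(docs_by_author):
    """Fit per-author word log-probabilities with add-one smoothing."""
    vocab = {w for docs in docs_by_author.values() for d in docs for w in d.split()}
    models = {}
    for author, docs in docs_by_author.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values()) + len(vocab)
        models[author] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return models

def predict(models, text):
    """Return the author whose model assigns the text the highest log-likelihood."""
    scores = {
        author: sum(logp.get(w, 0.0) for w in text.split())
        for author, logp in models.items()
    }
    return max(scores, key=scores.get)

# Toy corpus: short invented snippets, purely illustrative
corpus = {
    "austen": ["it is a truth universally acknowledged", "she was fond of the society"],
    "melville": ["call me ishmael the sea was grey", "the whale breached the grey sea"],
}
models = train_nb(corpus)
print(predict(models, "the grey sea and the whale"))  # → melville
```

The same bag-of-words framing underlies several of the other applications Bamman lists, such as tagging song genres or classifying the political sentiment of social media posts; only the labels and features change.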
Bamman highlighted some of these critical approaches during a keynote address, “NLP for the Long Tail,” at the Digital Humanities at Berkeley Summer Institute. Natural language processing, a field concerned with “the development of automatic methods that can reason about the internal structure of language, including part-of-speech tagging, syntactic parsing and named entity recognition—identifying the people and places in text and discovering the structure of who does what to whom,” is currently limited by the scope of the data used to train these tools. Prominent tools have been developed and tested against gigantic corpora of standard English newswire, but perform poorly when analyzing literary English, modern non-English languages, or historical languages. With an understanding of these limitations, Bamman exhorted humanists to join him in creating a repository of annotated texts in underserved languages that might form the basis of future NLP research and in developing robust tools for humanistic inquiry.
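The training-data limitation Bamman describes can be seen even in a toy setting. The sketch below trains the simplest possible part-of-speech tagger, one that memorizes each word’s most frequent tag, on a few invented “newswire-style” sentences; it then fails on archaic, literary vocabulary it has never seen. This is a deliberately crude stand-in for the domain mismatch that affects real NLP pipelines, not an implementation of any particular tool.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sents):
    """Learn, for each word, its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word.lower()][tag_] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, unknown="UNK"):
    """Tag each word; fall back to UNK for words never seen in training."""
    return [(w, model.get(w.lower(), unknown)) for w in words]

# Tiny "newswire-style" training set (invented, purely illustrative)
train = [
    [("the", "DET"), ("market", "NOUN"), ("fell", "VERB"), ("sharply", "ADV")],
    [("the", "DET"), ("president", "NOUN"), ("spoke", "VERB")],
]
model = train_tagger(train)

# In-domain vocabulary is covered...
print(tag(model, ["the", "market", "spoke"]))
# ...but literary or archaic vocabulary falls into the long tail
print(tag(model, ["thou", "art", "the", "whale"]))
```

Real taggers back off to contextual and sub-word features rather than a bare `UNK`, but their accuracy still degrades sharply out of domain, which is precisely why Bamman argues for annotated corpora in underserved languages.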
More resources:
- Information 290: Deconstructing Data Science (Mondays and Wednesdays, 9-10:30 AM, CCN: 41616)
- Blog: Math and Art History find common ground in dictionary learning