This course examines natural language processing as a set of methods for exploring and reasoning about text as data. It focuses on the applied side of NLP: using existing NLP methods and libraries in Python in new and creative ways, rather than exploring the core algorithms underlying them (see Info 159/259 for that).
Students will apply and extend existing libraries (including scikit-learn, TensorFlow, Keras, and spaCy) to textual problems. Topics include text-driven forecasting and prediction (using text for problems involving classification or regression); experimental design; the representation of text, including features derived from linguistic structure (such as parts of speech, named entities, syntax, and coreference) and features derived from low-dimensional representations of words, sentences, and documents; exploring textual similarity for the purpose of clustering, deduplication, and retrieval; information extraction (extracting relations between entities mentioned in text); and human-in-the-loop interactive NLP (involving people in the NLP pipeline, including active learning for annotation, computer-assisted clustering, and interactive search). This class will focus both on modern neural methods for these problems (including architectures such as CNNs, RNNs, LSTMs, and attention) and on classical methods (logistic/linear regression, Bayesian models).
This is an applied course; each class period will be divided between a short lecture and in-class lab work using Jupyter notebooks (roughly 50% each). Students will be programming extensively during class, and will work in groups with other students and the instructors. Students must prepare for each class and submit preparatory materials before class; attendance in class is required.
This course is intended for graduate students across a range of disciplines (such as information, English, sociology, public policy, journalism, computer science, and law) who are interested in text as data and can program in Python but may not have formal technical backgrounds.
Prerequisites
Graduate student status; proficiency in Python programming (experience writing programs of at least 200 lines of code).
Course Website