My first encounter with text analysis came in the summer between finishing my undergraduate degree and starting the PhD program at GMU. Wanting to be studious and proactive before entering the program, which I chose for its digital humanities specialty, I started reading about text mining and topic modeling. The first article I read (I can’t recall which one it was) instantly went over my head as it discussed the statistics associated with text analysis. I came away from the article with a sinking feeling that I was “getting in over my head.” Thankfully, that feeling subsided, and I have learned a lot about text analysis from my fellowship and my Clio Wired classes. In fact, my Clio Wired One final project used Voyant and text mining to analyze a corpus of Revolutionary War sermons (check it out here).
The readings for this week represent a shifting point in the course. Prior to Text Analysis, the topics were more abstract and theory-based: instead of talking about an actual program or task, the readings addressed the state of the field. This week, as a result, there were fewer “arguments” made in the readings than in prior weeks. Instead, I was able to read about the application of various text analysis methods (text mining or topic modeling). Distant reading really got its start, within the humanities, in literary studies. It was interesting to read about the various ways of using distant reading described in Matthew Jockers’ book Macroanalysis: Digital Methods and Literary History. He discusses the theory behind macroanalysis, or distant reading, while also explaining its application to various levels of documents (metadata, theme, gender, etc.). It was interesting to see how different questions can be explored (down to trying to identify the author of an anonymous text). Personally, his text helped me cross the chasm from document to data. Understanding the corpus as a grouping of various levels of data facilitates new ways of answering questions, or at least prompts new questions. It was enjoyable, if at times a bit heavy on the literary stuff (understandably).
Yet, I was left unsatisfied… Multiple articles delved, to varying degrees, into the scholars’ research interests, their use of distant reading methods, and the results those methods produced. However, there seemed to be very little discussion of the pitfalls that inexperienced DHers encounter as they apply distant reading methods to their fields. Ben Schmidt’s article “Words Alone: Dismantling Topic Models in the Humanities” helped to highlight some of these. While somewhat beyond me at points, Schmidt basically outlines two assumptions made in topic modeling that aren’t always true:
1. A topic is coherent: “a topic is a set of words that all tend to appear together, and will therefore have a number of things in common.”
2. A topic is stable: “if a topic appears at the same rate in two different types of documents, it means essentially the same thing in both.”
Prior to reading his article, I would have made these assumptions blindly and without question. This is in part because I am an inexperienced DHer who has done little with text analysis. To understand how these two assumptions could be false, I encourage you to read Schmidt’s article. For my purposes, he highlights that there are assumptions/questions/methods/etc. that need to be articulated and explained to better understand the process of text analysis.
To close, I want to ask about one of those assumptions made in the reading. Before the user runs the topic modeling program (most likely MALLET), they need to identify the number of topics they want the program to assign. Most of the projects we read used 40 topics, as if it were some default. Yet I came away from each project, including the ones not using 40, asking “Why?” Perhaps I missed the explanation, but how does one go about choosing the number of topics? Is it arbitrary, and is that the reason to run the model numerous times with varying numbers of topics? On top of that, how much do the topics change when you run the same corpus with 10, 30, or 100 topics? Ultimately, I came away not knowing how much importance I should place on the number of topics…
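One rough way I can imagine probing that last question is to run the model at two different topic counts and then check how much the top words of each topic overlap between the runs. The Python sketch below is my own assumption, not a method taken from any of the readings. It does not call MALLET; it assumes you have already copied each topic's top words into plain lists. The function names (`jaccard`, `best_matches`) and the word lists are hypothetical, and the words are invented placeholders, not results from my sermon corpus.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = none, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b.

    Returns a list of (index_in_run_b, overlap_score) pairs.
    """
    matches = []
    for topic in run_a:
        scores = [jaccard(topic, other) for other in run_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Invented placeholder words standing in for two runs of the same corpus,
# one with fewer topics and one with more. These are NOT real model output.
run_small = [
    ["god", "lord", "sin", "grace"],
    ["liberty", "tyranny", "king", "rights"],
]
run_large = [
    ["god", "lord", "grace", "mercy"],
    ["sin", "repent", "judgment", "wrath"],
    ["liberty", "rights", "freedom", "people"],
    ["king", "parliament", "taxes", "crown"],
]

for i, (j, score) in enumerate(best_matches(run_small, run_large)):
    print(f"topic {i} (small run) is closest to topic {j} (large run), overlap {score:.2f}")
```

A low overlap score would suggest that a topic from the smaller run has split apart or dissolved in the larger one, which is at least a starting point for asking whether the number of topics is changing the story the model tells. Word overlap is a crude measure, though, so I would treat it as a prompt for closer reading and not as an answer.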