Mining and Modeling Text

Digital historians are not focused only on increasing access to sources. Accessibility is just one facet of the overall benefit of the digital world. I can picture myself trying to explain to a friend how a source is made available online. I could go into detail about which version of the source should be digitized (the original, not microfilm), the quality of the scanned image, the pros and cons of OCR, the importance of transparency, databases, user interfaces, and so on. After all of that, they would say, “Yeah, but what can you do with it?” That is the ultimate question. Once we have digitized sources, what are we supposed to do with them? This week’s readings on text/data mining and topic modeling provide answers to that question.

Text/data mining and topic modeling are two phrases I have heard mentioned around the Center for History and New Media and within the Digital Humanities community, yet I did not fully understand what they meant. Ted Underwood’s article “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago” really helped to shed some light on them. Granted, I do not understand the statistics that go into the “black boxes” of topic modeling or text/data mining, but I can understand the process and the end result. Text/data mining is a statistical method for analyzing large collections of text. An example of text/data mining would be a keyword search through a database. Topic modeling is a statistical method that organizes text into clusters of words, or “topics,” that tend to occur in the same bodies of text. In essence, it models the different topics in a collection of texts. Both of these are powerful aids in the research process.
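To make the keyword-search example a little more concrete, here is a minimal sketch in Python. It is not taken from any of the readings: the three-document corpus and the function names (`tokenize`, `keyword_search`, `term_counts`) are made up for illustration, and a real project would load cleaned, machine-readable text instead. It shows only the simplest kind of text mining (finding documents that contain a word and counting how often words appear). Real topic modeling, such as latent Dirichlet allocation, involves considerably more statistics than this.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
corpus = {
    "doc1": "The railroad reached the valley and the settlers followed the railroad.",
    "doc2": "Settlers wrote letters home about the harvest and the long winter.",
    "doc3": "The harvest failed, and the railroad carried families east.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_search(corpus, keyword):
    """Return the sorted ids of documents that contain the keyword as a whole word."""
    keyword = keyword.lower()
    return sorted(
        doc_id for doc_id, text in corpus.items() if keyword in tokenize(text)
    )

def term_counts(corpus):
    """Count how often each word appears across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts

hits = keyword_search(corpus, "railroad")
counts = term_counts(corpus)
print(hits)                                   # documents mentioning "railroad"
print(counts["railroad"], counts["harvest"])  # corpus-wide frequencies
```

Even at this toy scale the “distant” view is visible: the counts say something about the collection as a whole without anyone reading a single document closely.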

I found it interesting in Frederick Gibbs and Dan Cohen’s article “A Conversation with Data: Prospecting Victorian Words and Ideas” that historians push back against these computational and statistical methods. What is more interesting is that I find myself pushing back as well. As Gibbs and Cohen articulate, there is a cleft between close reading (reading through a text in the traditional way) and computer-enhanced distant reading (“reading” through a large collection of texts to find overarching themes, topics, and meaning). The objection is to the notion of reducing a book to the mere occurrence of words. The romantic in me wants to push back against the notion that a novel, or play, or journal can be reduced to just its words. As an undergraduate, I took an online music theory course. At the time, I was dating my wife (well, she wasn’t my wife then…anyway), and I was explaining to her how you could break music apart into mathematical sequences. I found it very interesting, whereas she pulled back from it. She didn’t like the notion that a piece of music she plays on the piano could be reduced to mathematics. For her, knowing this would somehow degrade the beauty of the music. I can sympathize with that as I read through the text/data mining and topic modeling articles. However, I also find these things extremely interesting. Ultimately, both close and distant reading are important to research in the humanities. Researchers can glean information from distant reading that is not possible to get from close reading, whereas close reading is necessary for in-depth research on individual sources. They both serve the greater purpose of arriving at a better understanding of our past than we had previously.

Another reaction to the readings is how much more needs to be done before these methods can be put to good use. In Ted Underwood’s blog post “Where to start with Text Mining,” he touches on the issue of available, usable sources. Underwood states, “Where do I get all those texts? That’s what I was asking myself 18 months ago. A lot of excitement about digital humanities is premised on the notion that we already have large collections of digitized sources waiting to be used. But it’s not true, because page images are not the same thing as clean, machine-readable text.” This is an issue I see in my field of Mormon History. We have a great opportunity, but the resources are lacking. As I mentioned in my previous post on OCR, our ability to translate images into machine-readable text is lacking. It would appear to me that more should be done to remedy this. It is as if we are doing great with Part A (providing digitized sources) and Part C (distant reading), but Part B (OCR, translating digitized sources into machine-readable text) is lacking. Maybe I am just disillusioned by my recent experiment using Google Drive’s OCR…

Also, as a geographer, I really enjoyed reading Cameron Blevins’s article “Mining and Mapping the Production of Space.” Interactive web maps hold a special place in my heart (yes, I know my geek is showing). His use of text mining for frequency counts of place names in the newspapers was really interesting. I am always looking for ways of utilizing mapping in Digital History, and this article is a great example. The maps were fun to play with (although they didn’t work in Firefox, so I had to use Chrome), and they were very well done from a cartographic standpoint. This got me really excited to utilize these methods in my own work.
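For anyone curious what a frequency count of place names might look like in practice, here is a rough sketch in Python. To be clear, this is my own toy version and not Blevins’s actual method: the gazetteer (the list of place names), the sample article text, and the `count_places` function are all invented for illustration. It simply counts whole-phrase matches for each place name, which is the kind of table one could later join to coordinates and put on a map.

```python
import re
from collections import Counter

# Hypothetical gazetteer and sample text, invented for illustration only.
gazetteer = ["Salt Lake City", "Chicago", "New York"]
article = (
    "Freight from Chicago arrived on Tuesday. Merchants in Salt Lake City "
    "expect more shipments from Chicago and New York next month."
)

def count_places(text, places):
    """Count whole-phrase occurrences of each place name in the text."""
    counts = Counter()
    for place in places:
        # \b keeps "York" inside another word from matching; re.escape
        # protects any punctuation that might appear in a place name.
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text))
    return counts

place_counts = count_places(article, gazetteer)
for place, n in place_counts.most_common():
    print(place, n)
```

A simple match like this will miss misspellings and OCR errors, and it cannot tell two towns with the same name apart, so real newspaper data would need more careful handling.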
