For the most part, I didn’t know what this project would yield. I had hopes of discovering trends in scriptural usage, but this was largely an exploratory exercise. Much like finding your way through a darkened tunnel, I was feeling my way along with both the corpus and the tools.
I was surprised at how time-consuming data prep can be. I thought that since Internet Archive was providing the OCRed text, I would be able to work through the sermons quickly. While the full text format was INCREDIBLY helpful, I still had to devote more than half of my project time to cleaning and formatting the text. I didn’t come into this project without experience in data preparation. During my undergraduate studies, I had to take a data management class. The assignments were focused on preparing data for use and consumption in GIS. Everything from converting and cleaning CAD data to creating a points layer from a list of addresses was explored. Even with this experience, I had the false assumption that working with actual text would be easier and quicker.
Also, the better your data preparation, the better your analysis will go. I could easily have skipped fixing so many typos and errors in the text. Some of the words I was fixing did not pertain to my initial questions. However, when looking at keywords in context, typos in the text would be distracting and could obscure insights. Furthermore, my thoroughness in data preparation now will aid me in the future. This project prompted additional questions that I will explore using other programs and tools, and typos and other issues with the text could adversely affect those processes.
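Much of this cleanup can be scripted. As a minimal sketch (the substitution patterns here are hypothetical examples, not the ones actually used on the sermons), a table of common OCR misreadings can be applied across the whole corpus at once:

```python
import re

# Hypothetical examples of common OCR errors; a real list would be
# built up while reading through the transcripts.
OCR_FIXES = {
    r"\bfaithfull\b": "faithful",  # archaic/misread spelling
    r"\bvv": "w",                  # OCR often reads "w" as "vv"
    r"\bIhe\b": "The",             # "T" misread as "I"
}

def clean_text(text: str) -> str:
    """Apply each OCR correction pattern, then normalize whitespace."""
    for pattern, replacement in OCR_FIXES.items():
        text = re.sub(pattern, replacement, text)
    # Collapse runs of whitespace left over from removed line breaks
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Ihe  faithfull vvitness"))  # "The faithful witness"
```

A script like this is only as good as the error list behind it, which is why building that list still consumed so much project time.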
I am coming out of this project with a different view of my corpus of sermons. I did not obtain the results I was expecting. My focus at the beginning of the project was on scriptural use by the ministers. My system of asterisks and standardized footnotes was ultimately not successful. Other than the asterisks, there was no unifying term that could be searched to identify all the scriptural references. This is a lesson in data preparation. For future analysis, I will need to establish a better system to group the scriptures together. Perhaps the use of a markup system would prove fruitful.
However, the results of the text mining led me in other directions. I took notice of other words that showed interesting trends and usages. These discoveries would not necessarily have occurred had I not undertaken the task of text mining the corpus.
Voyant (Text Mining):
My reaction to Voyant is twofold:
Voyant is an extremely powerful and useful tool for analysis. The corpus of sermons is one that I am familiar with. I have done a close reading of them (though it was years ago) and have already written on their content. Yet Voyant was able to highlight and articulate other points of interest. The power of word frequencies, as well as the location of those words in the documents, became evident in this project.
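The two measures that proved most revealing can be sketched in a few lines. This is not how Voyant is implemented, just an illustration of the underlying idea: raw term counts (as in Voyant's Cirrus and Terms views) and the relative positions of a term across a document (as in its Trends view). The sample sentence is invented for demonstration:

```python
import re
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count lowercase word tokens across a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def positions(text: str, term: str) -> list[float]:
    """Relative positions (0.0 to 1.0) of a term in the token stream,
    showing where in a sermon the word clusters."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [i / len(tokens) for i, t in enumerate(tokens) if t == term]

sermon = "the law of grace and the law of love"  # placeholder text
print(word_frequencies(sermon)["law"])  # 2
print(positions(sermon, "law"))
```

Frequency alone says whether a word matters; position says whether it matters at the opening, the doctrine, or the application of a sermon, which is where the unexpected patterns surfaced.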
Voyant is not the “end-all” for your analysis but should prompt further investigation. I did not come away from this project with the notion that my questions had been answered. At most, I was given a large amount of data, both related to my questions and prompting new ones. It did answer very simple questions such as “Was the word ‘law’ used often?” Voyant’s results provide a resounding “No.” However, the follow-up question, “Why was it not discussed during a time rife with political action?”, cannot be easily answered with the Voyant results. This would require much more analysis and possibly some topic modeling. The scope of this project did not allow me to do any topic modeling (I was planning on using MALLET but did not have the time).