Text Analysis

Text mining is a quantitative method of distant reading. Distant reading is defined by the New York Times as “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.” In simpler terms, by compiling statistics on the entire corpus, we are able to gain new perspectives on the material. Text mining is centered around word counts, or what are called word frequencies. While this may seem trivial, having word frequencies for every word in the corpus allows for greater insight into the structure of the documents.
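
To make the idea of word frequencies concrete, here is a minimal Python sketch of counting words in a text. It is only an illustration and not how Voyant itself is implemented; the function name and the sample sentence are made up for the example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word appears in a text."""
    # Treat runs of letters (with optional internal apostrophes) as words.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

# Made-up sample sentence, not taken from the sermons.
sample = "The king and the people; the people and the law."
freqs = word_frequencies(sample)

# List the words from most to least frequent.
for word, count in freqs.most_common():
    print(word, count)
```

Run over a whole corpus instead of one sentence, the same kind of count is what sits behind a word cloud or a frequency list.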

The program I used for text mining is Voyant version 1.0. It is a free, web-based program. It provides a plethora of visualization tools, all of which are embeddable. After I fed in my 21 sermons, I spent some time looking through the statistics and the various visualizations.

Note: Another valuable part of Voyant is that it provides the option for stop words. A stop word is a word that is inconsequential to your analysis. Examples could be: the, and, 1, 2, is, am, was, but, might, now, perhaps, etc. Voyant allows users to decide what stop words they want to employ, and it even provides pre-made lists. In my analysis I used the “English taporware” list as provided by Voyant. I decided to use this list as it is much more comprehensive than anything I could try to create. I skimmed through the contents to at least be aware of what words I was excluding from my analysis.
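
As a rough sketch of what a stop list does, the Python below drops stop words before counting. The tiny list here is a hypothetical stand-in built from the examples above; it is not the “English taporware” list, and the sample sentence is invented.

```python
import re
from collections import Counter

# A tiny illustrative stop list (an assumption for this example);
# a pre-made list would be far longer.
STOP_WORDS = {"the", "and", "is", "am", "was", "but", "might", "now", "perhaps"}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count words in a text, skipping anything on the stop list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up sample sentence, not taken from the sermons.
sample = "The war was long, but the peace is now perhaps near."
counts = content_word_counts(sample)
print(sorted(counts))
```

Only the words that survive the filter ("long", "near", "peace", "war") are counted, which is why it is worth knowing what a stop list contains before relying on it.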

Summary Statistics:

The first thing I did after Voyant returned my results was to look at the summary statistics. I began by examining Cirrus, or the word cloud. While not entirely practical, the word cloud is fun to look through and does provide a very quick visual of the most common words in your corpus.

After I explored the word cloud, I looked through the “Summary” window, which provides general information about your corpus. I found the range in the size of my documents interesting. The difference between my largest sermon (by Ezra Stiles) and my shortest (by Jacob Duche) is 26,645 words. With such a large spread, I wondered if this could influence what was discussed in each sermon. Other factors, such as allotted time to speak, location of the sermon, and the audience, could also influence the content of a sermon.

The final summary window I used before delving deeper into the material is the Words in the Entire Corpus window. While it repeats what I saw in Cirrus (word frequencies), it presents the data in a list, which is more practical. Beyond that, it shows a reference graph called Trend. Essentially this is the Word Trends window inserted into a column in this window. Using this as a quick reference, I was able to see both the total frequency of a word and its distribution across my corpus. This was a helpful reference tool as I began to investigate specific words.

Delving Deeper:

After looking over the summary information, I wanted to see how many scriptural citations were in the corpus. I tried to search for “*” in both the Word Trends window and the Words in the Entire Corpus window. In both cases, the search did not execute, as the program did not recognize the asterisk. At the suggestion of my professor, I checked the stop words list and found a “/*” in the list. I deleted it from the list, with no effect on the search. This is disappointing, as I purposely left the asterisks in for this function. Without the asterisk search, I could only search for scriptural book titles or common words used in scriptures. Each method comes with its own issues.

Note: The Word Trends graph has two different return value settings (found in the bottom left corner). The default is Relative. This means the number is relative to 10,000 words. This is a great way for me to standardize my data, as my sermons vary greatly in size. The second setting is Raw. This returns the raw frequency count of the term. Since I wanted to see how often scriptures were used, I set the return value to Raw.
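
The difference between the two settings can be sketched in a few lines of Python. This assumes that “relative” means occurrences per 10,000 words, as described above; the exact calculation Voyant performs is not confirmed here, and the function names and the two documents are made up.

```python
def raw_frequency(words, term):
    """Number of times the term appears in a list of words."""
    return sum(1 for w in words if w == term)

def relative_frequency(words, term, per=10_000):
    """Occurrences of the term per `per` words (assumed meaning of 'relative')."""
    if not words:
        return 0.0
    return raw_frequency(words, term) * per / len(words)

# Two made-up documents of very different lengths.
short_doc = ["war"] * 5 + ["peace"] * 995      # 1,000 words
long_doc = ["war"] * 20 + ["peace"] * 9_980    # 10,000 words

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, raw_frequency(doc, "war"), relative_frequency(doc, "war"))
```

In this toy case the long document has the higher raw count (20 against 5), but the short one has the higher relative frequency (50 against 20 per 10,000 words), which is why the setting matters when documents vary in size.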

Searching for scriptural books would require searching for 66 different books (this is based on Protestant scriptures and not Catholic or other canons that contain even more books). Voyant limits the number of items included in Word Trends to five. This meant that I would have to create 14 different graphs just to see all the books, and even then I would not be able to see them all together.

Searching for common scriptural terms was even more hit or miss. Not every scripture contains one of the five words I could choose to search. Beyond this, “old English” style writing (the “thee”s and “thou”s style) was also in common use in sermons in general. This means that the use of “shall” could be in the context of a scripture, or it could just be the minister speaking in a scriptural voice.

With scriptural references obscured, I decided to explore other terms.

I decided to compare the use of “king” with the use of “subject”. In this chart, you can see how both of the terms follow a similar pattern up until 1778. After that, “subject” fluctuates while “king” almost flatlines. The biggest variable in this is the context of the term “king”. Some sermons discuss the kings of the Old Testament in addition to the king of England.

While scrolling through the Words in the Entire Corpus window, I noticed an interesting trend in the words “law” and “laws”. In this graph I collapsed the terms. This means I combined the results of the terms into one graph line. It is interesting to note that, except for a handful of sermons, the majority did not discuss law or laws. While not definitive, these results provide fodder for further exploration on the subject.
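
Collapsing terms amounts to adding their counts together document by document. Here is a minimal Python sketch of that idea, under the assumption that a collapsed line is simply the sum of the individual term counts; the function name and the three short “documents” are invented for illustration.

```python
import re

def collapsed_counts(documents, terms):
    """Return, for each document, the combined count of all the given terms."""
    wanted = {t.lower() for t in terms}
    totals = []
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        totals.append(sum(1 for w in words if w in wanted))
    return totals

# Made-up stand-ins for documents in a corpus.
docs = [
    "The law is good and the laws are just.",
    "No mention of the subject here.",
    "Laws, laws, and more laws.",
]
print(collapsed_counts(docs, ["law", "laws"]))
```

Each number in the result is one point on the single collapsed line, one per document.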

The terms “war” and “wars” also proved to be very interesting. The graph shows little use of the terms up to the end of the war. The exception to this is Nathaniel Whitaker’s Antidote against Toryism, delivered in 1777. Then there is an increase after the Treaty of Paris in 1783. This prompts a series of questions: Why was the term used so little before and during the Revolution? Why did Whitaker’s sermon use it so often? Why was there an increase after the conclusion of the Revolution? Ultimately, this data provides a great platform to launch into additional research.

With Nathaniel Whitaker’s sermon being such an “outlier,” seeing his use of “war” in context would be helpful. Voyant provides a Keyword in Context window that does just that. I can scroll through this list and see each use of “war” with its surrounding text. This allows me to see if he is using the term in reference to the current times or if he is using examples of war from the scriptures.
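
A keyword-in-context listing is simple to sketch: find each occurrence of the term and keep a few words on either side. The Python below is an illustration only, not Voyant's implementation; the function name, the `width` parameter, and the sample sentence are assumptions made for the example.

```python
import re

def keyword_in_context(text, keyword, width=3):
    """Return (left, keyword, right) for each occurrence of the keyword."""
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits

# Made-up sentence, not taken from Whitaker's sermon.
sample = "They made war upon the land, and after the war the people rested."
for left, kw, right in keyword_in_context(sample, "war"):
    print(f"{left:>20} | {kw} | {right}")
```

Lining the keyword up in a column like this makes it quick to scan the neighboring words and judge how the term is being used.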

Differing Visualizations:

Part of the purpose of this project is to better understand the tool(s) in addition to the material. With this in mind, I explored the various ways data is visualized using Voyant. It provides so many different visualizations because each one approaches the data differently. Not only can you see different information about the data, but one visualization may “click” with you more than another. My experience is that some of the visualizations Voyant provides do not work. I am not sure if they do not work with my corpus or if the tool itself is not functional. However, I will highlight two visualizations I found insightful. I will use the same terms (war and wars) so that a comparison between these visualizations and the Word Trends graph can be made.

The MicroSearch tool is extremely useful. Each column represents one of the documents in the corpus. The length of the column corresponds to the length of the document. Thus the longest column represents Ezra Stiles’s 1783 election sermon (number 18), which is my longest document at over 30,000 words. The search operates by locating the word in each document and highlighting that location with a red marker. The example above shows where the words “war” and “wars” can be found in the corpus. This is extremely useful in that it shows not only how often a word is used but also where in the sermon it appears. For example, Nathaniel Whitaker’s sermon Antidote against Toryism (number 11) uses these terms throughout, but most frequently in the middle. This can help me examine the documents with a greater understanding of the distribution of words. Very useful tool…
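
The idea behind this kind of display can be sketched as recording where in a document each hit falls, expressed as a fraction of the document's length. This is an assumption about the general approach, not a description of how MicroSearch is actually built; the function name and the sample text are invented.

```python
import re

def term_positions(text, terms):
    """Return the document length in words and the relative position (0 to 1) of each hit."""
    wanted = {t.lower() for t in terms}
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0, []
    positions = [i / len(words) for i, w in enumerate(words) if w in wanted]
    return len(words), positions

# Made-up ten-word "document".
sample = "peace peace war peace peace war wars peace peace peace"
length, positions = term_positions(sample, ["war", "wars"])
print(length, positions)
```

The length would set the height of a column, and each position would place one marker along it, so hits bunched around 0.5 would show up as a cluster in the middle of the column.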

The Bubble Line visualization, while similar to MicroSearch, puts the emphasis on the term. Using proportional circles, this visualization shows not just the distribution along a line but also where the term(s) are grouped together. Since the circles are color coded, comparisons between multiple terms can be made. For example, I could see if the terms “liberty” and “freedom” cluster or occur in the same place. This allows for another level of analysis within the corpus.
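
One way to think about the proportional circles is to cut a document into equal segments and count the hits in each; a larger count would mean a larger circle at that point on the line. The Python below is a sketch under that assumption and is not Voyant's own method; the function name, the number of segments, and the sample text are all hypothetical.

```python
import re

def segment_counts(text, term, segments=5):
    """Count occurrences of a term in each of `segments` equal slices of the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [0] * segments
    for i, w in enumerate(words):
        if w == term.lower():
            # Map the word's index to a segment index in 0..segments-1.
            counts[i * segments // len(words)] += 1
    return counts

# Made-up ten-word "document".
sample = "liberty liberty peace peace peace peace freedom freedom liberty peace"
for term in ("liberty", "freedom"):
    print(term, segment_counts(sample, term))
```

Comparing the two lists segment by segment shows whether the terms cluster in the same part of the document or in different parts.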
