In a very naive way, one would think digitizing documents is just a question of taking the time to do so. If I wanted to digitize one of my undergraduate essays, it would take me less than 5 minutes to scan it with my home printer and upload the document to my website. Simple and quick… However, in the larger scheme of digitization and digital history, the process is much more complex than my 5-minute scenario.
What is the purpose of digitizing the document? When I was surfing the web for Mormon History websites, I came across one that had some primary sources transcribed into an HTML document. The author of the website did not post scanned images of the original document, and I found myself “dinging” that site. I thought, “Well, I can’t see the original document, so I am a bit skeptical.” On another website, however, I found scanned images of newspapers from the nineteenth century. In my mind I ranked this site “higher” than the one that did not have images available. Why did I do that? This week’s readings (found here) fleshed out the various reasons a document should be digitized. Both of these websites were presenting historical sources, but in different ways. Depending on the purpose of my research, I could choose one over the other. The transcribed webpages are searchable, whereas the scanned images of the newspapers are not. Yet the images of the newspapers allow me to see the layout, font style, spacing, margins, etc. of the document, something the transcribed documents don’t necessarily provide. Having finished this week’s readings, I have become much more cognizant of the pros and cons associated with each way of digitizing sources. Really, the digitizer needs to ask (and answer) the question: who is my audience, and what are their needs with regard to this document?
What about the “false” negative? An important aspect of digitization is Optical Character Recognition (OCR). Simon Tanner, in his article Deciding whether Optical Character Recognition is Feasible, describes OCR as “a type of document image analysis where a scanned digital image that contains either machine printed or handwritten script is input into an OCR software engine and translating it into an editable machine readable digital text format.” In essence, OCR software takes an image of text and converts it into a searchable text document. This process is not perfect. Errors occur, and how those errors are mitigated becomes an important question. One kind of error that struck me as profound is the “false” negative. For example, if I have sources that have been run through OCR software to produce searchable documents and I search those documents for a word or phrase like “Mormon,” every correctly transcribed occurrence of the word “Mormon” will be returned. But if the OCR software transcribes the word “Mormon” in one document as “Moman” or “Mermon,” then that document will not show up in my search, even though it is relevant. This is called a “false” negative, and it presents an issue in research. As a historian, by becoming reliant on keyword searches of documents, I am putting my trust in the programmers, the OCR software, the website administrator, and others, trusting that they did quality work. My research becomes dependent on them as well as on myself. It is a strange feeling and has caused me to pause for a few seconds before I click the “search” button.
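To make the “false” negative concrete, here is a minimal sketch in Python. The four short documents, the function names, and the 0.7 similarity threshold are all made up for illustration; this is not how any particular archive's search engine works. It compares a plain keyword search against a rough "fuzzy" search built on the standard library's difflib module.

```python
import re
from difflib import SequenceMatcher

# Hypothetical OCR output for four documents (invented for this example).
docs = {
    "doc_a": "The Mormon settlers arrived in the valley.",
    "doc_b": "The Moman settlers arrived in the valley.",  # OCR misread
    "doc_c": "A Mermon elder spoke to the crowd.",         # OCR misread
    "doc_d": "The harvest was poor that year.",
}

def exact_hits(docs, term):
    """Return ids of documents containing the term exactly (ignoring case)."""
    return [doc_id for doc_id, text in docs.items()
            if term.lower() in text.lower()]

def fuzzy_hits(docs, term, threshold=0.7):
    """Return ids of documents with at least one word similar to the term.

    The threshold is an arbitrary assumption: lower values catch more OCR
    misreadings but also let in more unrelated words.
    """
    hits = []
    for doc_id, text in docs.items():
        words = re.findall(r"[A-Za-z]+", text)
        if any(SequenceMatcher(None, term.lower(), word.lower()).ratio() >= threshold
               for word in words):
            hits.append(doc_id)
    return hits

print("exact:", exact_hits(docs, "Mormon"))
print("fuzzy:", fuzzy_hits(docs, "Mormon"))
```

In this toy example the exact search finds only the first document, while the fuzzy search also picks up the two misread ones. The trade-off is that a looser match can return words that merely look similar, so it swaps some false negatives for possible false positives rather than removing the problem.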
How do we mitigate too much of a reliance on digitized records? The digitized sources that are available are amazing. Between the Library of Congress and various university projects, it is remarkable what can be found through the internet. However, this amounts to mere grains of sand on the vast and rich seashore that is history. Ian Milligan discusses this issue in his article Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997-2010. He compares various Ph.D. dissertations and the sources they use, and notices an increase in the use of some sources and a decrease in others. One of his concerns is that reliance on digitized copies has increased in some areas further than it should. When the better source is not available in digital form, the researcher generalizes from the digitized newspaper instead. Any particular instance may not be a major problem, but the idea behind it, that of relying more on easily accessed, possibly searchable sources than on those not digitized, could mean skewed research. This might not be the case, but it does cause me to think about my own research and research methods.