A Tale of Two Projects

Link to the original post

In a week’s time, the semester, and by extension the DH Fellowship, will come to an end. As such, it is time for the end-of-semester blog post. In the time since my last blog post, I have divided my time between two projects associated with Digital Humanities Now. The first project (Web Scraping) focused on the content published by DHNow, while the second (Web Mapping) focused on DHNow’s Editors-at-Large base.

Web Scraping

Over the years, Digital Humanities Now has published hundreds of Editor’s Choice pieces. For 2014 alone, roughly 165 Editor’s Choice articles from numerous authors were featured. Such a large corpus of documents provided a ready source of data about the publishing patterns of DHNow. To turn those documents into usable data, we needed to convert each Editor’s Choice article into machine-readable text. The task, then, was to go through each Editor’s Choice article and scrape its body text into a .txt file. I had never scraped a website before, so this project was going to be a great learning opportunity.

I began the project by reading through Jeri Wieringa’s web scraping tutorial on Programming Historian, which uses a Python library called Beautiful Soup to pull data out of a website. During my rotation in the Research Division last semester, the three first-year Fellows had quickly worked through the Beautiful Soup tutorial, but I needed a refresher. However, I switched from Python to R. This change came at the suggestion of Amanda Regan, who has experience using R. As she explained it, R is a statistical computing language and would be a better fit than Python for analyzing the corpus of Editor’s Choice articles. After downloading RStudio (a great IDE) and playing around with R, I found it to be a fairly intuitive language (more so for those who have some background in coding). I came to rely on Mandy and Lincoln Mullen when I ran into issues, and they were both extremely helpful. Learning R was fun, and it was also exciting because R is the primary language taught and used in the Clio Wired III course, which I plan on taking the next time it is offered.

To scrape the body text of each post, I relied on the class names of the HTML tags containing the text. I imported a .csv file of all the Editor’s Choice articles and searched each page for a specific class name. When it was found, R scraped all the text in that tag and placed it in a .txt file whose name corresponds to the article’s ID number. Finding the right class name was a hang-up, but I was able to use the SelectorGadget tool to expedite the process. It essentially makes a webpage’s CSS structure interactive, allowing you to click on elements to view their extent and class names. I learned a lot about website structure while identifying each body text’s class name. In the end, I was able to scrape 150 of the 165 Editor’s Choice articles.
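
My full code is on my GitHub account (linked below), and my actual script may differ in its details, but the sketch below shows the general shape of this kind of scraping loop using the rvest package. The .csv file name, its column names (id, url), and the CSS class are assumptions for illustration.

```r
# A minimal sketch of the scraping loop using the rvest package; the file name,
# the column names (id, url), and the CSS class are illustrative assumptions.
library(rvest)

articles <- read.csv("editors_choice_2014.csv", stringsAsFactors = FALSE)

for (i in seq_len(nrow(articles))) {
  page <- read_html(articles$url[i])
  # Pull the element holding the body text, identified with SelectorGadget
  body <- html_text(html_nodes(page, ".entry-content"))
  # Save the text to a .txt file named after the article's ID number
  writeLines(body, paste0(articles$id[i], ".txt"))
}
```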

You can find my code on my GitHub account here.

Web Mapping

The second project I was fortunate to work on was displaying our Editors-at-Large spatially on a map. My undergraduate work was in Geographic Information Systems (GIS), so this project in part grew out of my interests and prior experience. In association with this project, I am writing two blog posts for the soon-to-launch DHNow blog. The first will detail the process of developing and designing the map, while the second will delve into what the map is “telling us.” For the sake of the Fellows blog, I will instead reflect on my experience in creating the Editors-at-Large map and will link to the other two posts when they are published.

It had been almost a year since I devoted any real time to cartography. I decided to follow the same model I used in my undergraduate capstone class on web mapping. To begin with, I needed a dataset that I could use on the web. As an undergraduate, I used ArcGIS to convert a .csv into a GeoJSON file that could be used on the web. However, since coming to GMU and the Center, I have embraced open source (both by choice and by financial force) and instead relied on Quantum GIS (QGIS). I had no real experience with QGIS, so this project gave me an opportunity to become familiar with the platform, an added benefit that I both appreciated and enjoyed. In the end, converting to GeoJSON was fairly straightforward.
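
The conversion itself was done in QGIS, but for anyone who would rather script that step, here is a rough R equivalent using the sf package. The file name and the coordinate column names (longitude, latitude) are assumptions.

```r
# A sketch of the same .csv-to-GeoJSON conversion in R with the sf package;
# the file name and coordinate column names are assumptions.
library(sf)

editors <- read.csv("editors_at_large.csv", stringsAsFactors = FALSE)

# Build spatial points in WGS84 and write them out as GeoJSON for the web
editors_sf <- st_as_sf(editors, coords = c("longitude", "latitude"), crs = 4326)
st_write(editors_sf, "editors_at_large.geojson", driver = "GeoJSON")
```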

To render the web map, I used Leaflet, which I was introduced to in my undergraduate coursework. As an undergraduate, I found Leaflet somewhat difficult to use, but that was probably because I was simultaneously learning HTML, CSS, and JavaScript while working with it. Returning to Leaflet, I was struck by how easy it is to use and how intuitive its design is. I attribute this change in attitude to the training and supportive nature of the Research Division, where I was exposed to Python and other coding languages. In the end, the map turned out well, and my work on the project has reignited my passion for cartography and all things spatial.
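
The published map is written against Leaflet’s JavaScript API, so the snippet below is not the map’s actual code; it is only a rough R-side sketch of the same idea using the leaflet package, an R wrapper around Leaflet. The column names (longitude, latitude, name) are assumptions.

```r
# A rough sketch of a Leaflet-style point map using the R leaflet package,
# which wraps Leaflet; column names are illustrative assumptions.
library(leaflet)

editors <- read.csv("editors_at_large.csv", stringsAsFactors = FALSE)

leaflet(data = editors) %>%
  addTiles() %>%                      # default OpenStreetMap basemap
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   popup = ~name, radius = 5)
```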

In the final days of the Fellowship, I feel both excited and melancholy. I am sad that the Fellowship is coming to an end and that I am moving out of the Center. It has been a wonderful experience working with great people on interesting and engaging projects. Yet it is exciting to think back to my first day of the Fellowship and realize how far I have come in my digital work.
