For this exercise in my Digital Humanities class, we practiced creating and organizing data taken from the 1885 Seventh-day Adventist Yearbook (PDF). While we each focused on only a small sample of the data contained in the yearbook, we thought critically about how to organize and classify our selected data to optimize potential analyses and to facilitate questions that could be asked of it. The students in the class worked on a collaborative spreadsheet, with each of us creating a new tab for our individual work. My tab is “Jade-EducationalSociety (p. 10).”
The Educational Society’s Faculty
The specific data I chose to organize, the Educational Society’s Faculty, can be found on page ten of the yearbook PDF. What follows is a screenshot of my organizing spreadsheet and an explanation of my schema.
I identified the observation type as the faculty of the Educational Society and their respective roles: each role is one observation and thus occupies one row. For each observation, I created the variables role, prefix, full name as listed, last name, first name, first initial, abbreviated first, middle initial, and degree.
As can be seen, several of the columns do not possess a value for all observations; however, to preserve the information as listed, I did not fundamentally change its presentation or attempt to fill in gaps. I preserved the original punctuation, capitalization, spellings, and abbreviations. For example, while I can guess that the abbreviated “Wm.” means “William,” I do not want to make the data say something that it does not; therefore, I created for it a separate variable, “abbreviatedFirst.” While the degree “A. B.” probably refers to a Bachelor of Arts and differs from the usual “B. A.” that we use currently, the presentation of the degree in this way could be an interesting historical artifact itself that I would want to preserve.
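To make the schema concrete, here is a minimal sketch in Python of how the nine variables might be written down, with blank cells kept blank rather than filled in. I built the actual table by hand in a spreadsheet, so this is only an illustration. Apart from “role” and “abbreviatedFirst,” the column spellings are assumptions, `make_row` is a hypothetical helper, and the role text in the sample row is a placeholder rather than a value from the yearbook.

```python
import csv
import io

# Only "role" and "abbreviatedFirst" are spelled this way in the post;
# the other column names are assumed to follow the same style.
COLUMNS = [
    "role", "prefix", "fullNameAsListed", "lastName", "firstName",
    "firstInitial", "abbreviatedFirst", "middleInitial", "degree",
]

def make_row(**values):
    """Build one observation, leaving any unlisted variable blank."""
    unknown = set(values) - set(COLUMNS)
    if unknown:
        raise KeyError(f"not in schema: {sorted(unknown)}")
    return {column: values.get(column, "") for column in COLUMNS}

# Sample row: the role is a placeholder, and leaving firstInitial blank
# when a full first name is listed is an assumption about the schema.
row = make_row(
    role="(role as listed)",
    fullNameAsListed="Ferris S. Hafford",
    lastName="Hafford",
    firstName="Ferris",
    middleInitial="S.",
)

# Write the header and the row as CSV text to inspect the layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Variables with no listed value, such as prefix and degree in this row, come out as empty cells, which mirrors the choice not to fill in gaps.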
I also wanted to provide a variable for the full name of the faculty member as listed so as not to cut up the identity of the person into discrete parts without a holistic context. This enables an analyst to sort the data by any individual variable that I provide (by last name, perhaps) without losing the person’s entire name as listed.
I chose “role” as a variable name over “department” or “subject” because at least four members are responsible for duties that do not fall within commonly understood academic subject categories. Similarly, the president may not necessarily be a “department.” To me, “role” best encompassed the majority of the observations along with the outliers.
Ferris S. Hafford occupied a unique position in the data in that this individual seems to have taught three subjects. According to “tidy data” principles, each role/subject would comprise a distinct observation, so I placed Hafford in three different rows, one for each role. I determined that the yearbook listed three separate roles, rather than one long role, because commas separate them. The “Greek and Latin Languages and Literature” role, by contrast, does not contain any commas that might suggest it refers to separate roles. Additionally, the spreadsheet, which contains one table, adheres to tidy data principles because it describes only one observation type: faculty role.
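The comma rule described above can be sketched in Python as follows. This is an illustration of the reasoning, not the method I used, since I split the rows by hand. The function names are hypothetical, and “Subject A, Subject B, Subject C” is a placeholder standing in for Hafford’s listing, not the yearbook’s wording.

```python
def split_roles(listed_roles):
    """Treat each comma-separated piece of a listing as one role."""
    return [part.strip() for part in listed_roles.split(",") if part.strip()]

def rows_for_person(full_name, listed_roles):
    """Return one row (observation) per role for a single person."""
    return [
        {"role": role, "fullNameAsListed": full_name}
        for role in split_roles(listed_roles)
    ]

# A listing without commas stays a single role.
single = split_roles("Greek and Latin Languages and Literature")

# Placeholder listing with two commas: three roles, so three rows.
hafford_rows = rows_for_person(
    "Ferris S. Hafford", "Subject A, Subject B, Subject C"
)

print(len(single), len(hafford_rows))
```

Because the split happens only on commas, the “and” inside “Greek and Latin Languages and Literature” does not produce extra rows, while the full name as listed is repeated on each of Hafford’s rows.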
When creating this data from the text of the yearbook, I attempted to uphold several principles:
- present the data in a tidy format;
- communicate all information from the section of the yearbook as listed;
- err on the side of abundant variables, even if a variable contains few values, instead of cutting out information that does not fit perfectly;
- and formulate column names that are easily understood and that encompass the nature of all of their values.
Analysts may decide to clean data in certain ways and exclude some variables or values, but I did not make these design decisions at this point of data handling.