Tidy Data

Hadley Wickham’s “tidy data” approach structures a data set so that it is easier to manipulate and analyze, whether in a spreadsheet or in code. A data set is tidy when it meets three criteria:

  1. Each variable gets a column.
  2. Each observation gets a row.
  3. Each type of observational unit gets a table.

In the following example, I have manually transformed a messy data set into a tidy data set and added one calculated column. The messy data set is the top, wide table, and the tidy data set is the bottom, long table. The raw data comes from a Pew research poll.

In the tidy data set, the variables are the religious tradition, frequency of prayer, sample size of the religious group, percentage, and count (count is the additional calculation that I performed). The fixed variables, which provide context, are religious tradition, frequency, and sample size. The measured variables are percentage and count. Each row is one observation. For example, the row stating that 16 percent of the Buddhists sampled pray weekly is a single observation. The values are the contents of the individual cells, excluding the column headings: “Buddhist,” “Weekly,” “16%,” and so on.
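To make the wide-to-long reshaping concrete, here is a minimal Python sketch of the same kind of transformation. It is an illustration rather than the method I used (I did the transformation by hand). The column names, the helper `melt`, and every number except the 16 percent for Buddhists who pray weekly are placeholders I made up for the example, not figures from the poll.

```python
# Wide ("messy") layout: one row per religious tradition, with one
# column per prayer frequency. All numbers are placeholders except
# the Buddhist "Weekly" value of 16, which comes from the text.
wide_rows = [
    {"tradition": "Buddhist", "sample_size": 100, "Daily": 40, "Weekly": 16},
    {"tradition": "Hindu", "sample_size": 50, "Daily": 60, "Weekly": 20},
]

# Hypothetical names of the columns that hold percentages.
FREQUENCY_COLUMNS = ["Daily", "Weekly"]


def melt(rows, id_columns, value_columns):
    """Turn each (row, value column) pair into one tidy observation."""
    tidy = []
    for row in rows:
        for column in value_columns:
            # Fixed variables are copied over unchanged.
            record = {key: row[key] for key in id_columns}
            # The old column heading becomes a value of "frequency".
            record["frequency"] = column
            # The old cell becomes a value of "percentage".
            record["percentage"] = row[column]
            tidy.append(record)
    return tidy


tidy_rows = melt(wide_rows, ["tradition", "sample_size"], FREQUENCY_COLUMNS)

for record in tidy_rows:
    print(record)
```

Each dictionary in `tidy_rows` corresponds to one row of the long table: the fixed variables (tradition, sample size, frequency) identify the observation, and the percentage is the measured variable.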

Tidy data sets are arranged so that calculations, analyses, and manipulations are easier for computers to carry out. In the messy data set, it is hard to use built-in spreadsheet formulas to calculate the counts of people from the percentages. Doing so would require either restructuring the table to add count cells or squeezing the counts into existing cells. In the tidy data set, by contrast, the counts can simply be appended as another column and calculated with a single spreadsheet formula. Analytically, this layout also calls more attention to the sample sizes. For instance, one can easily see that one percent of the Hindus sampled is a much smaller number of people than one percent of the Evangelical Protestants sampled. This is harder to see in the messy data set.
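The count calculation can be sketched in a few lines of Python. This is only an illustration of the idea: the helper `add_counts`, the "Example" frequency label, and the sample sizes of 200 and 8,000 are placeholders I chose to show the contrast, not the actual figures from the poll.

```python
# Tidy rows: one observation per row. Sample sizes and percentages
# here are placeholders chosen for illustration, not poll figures.
tidy_rows = [
    {"tradition": "Hindu", "frequency": "Example",
     "sample_size": 200, "percentage": 1},
    {"tradition": "Evangelical Protestant", "frequency": "Example",
     "sample_size": 8000, "percentage": 1},
]


def add_counts(rows):
    """Return new rows with a "count" column appended to each one."""
    return [
        {**row, "count": round(row["sample_size"] * row["percentage"] / 100)}
        for row in rows
    ]


rows_with_counts = add_counts(tidy_rows)

for row in rows_with_counts:
    print(row["tradition"], row["percentage"], row["count"])
```

Because every row carries its own sample size and percentage, the same one-line formula works for every observation, which is the spreadsheet equivalent of filling a formula down a new column. With these placeholder numbers, the same one percent works out to 2 people in the smaller group and 80 in the larger one.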
