Unfortunately, data comes in all shapes and sizes, especially when you're analyzing data from authorities. You have to be able to deal with PDFs, merged table cells and frequent changes in terms and spelling.

When I analyzed the Swiss arms export data as an intern at SRF Data, we had to work with scanned copies of data sheets that weren't machine-readable, datasets with French, German or both French and German country names in the same column, as well as merged cells and changing spelling of the categories.

Unsurprisingly, preparing and cleaning messy datasets is often the most time-consuming part of data analysis. Hadley Wickham, creator of R packages like ggplot2 and reshape, wrote a very interesting paper about an important part of this data cleaning: data tidying.
According to him, tidy data has a specific structure:

Each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets.

As you may have seen in our post on ggplot2, Wickham calls this tidy format molten data. The idea behind it is to facilitate the analysis by minimizing the effort of preparing the data over and over again for different tools. His suggestion: work on tidy, molten data with a set of tidy tools, and use the saved time to focus on the results.
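
To make this concrete, here's a minimal made-up example (hypothetical states "A" and "B", not the marriage data we'll work with below): a wide table with one column per year, melted into the long, molten format with tidyr's gather(). We'll install tidyr properly in a minute.

# a small made-up wide table: one column per year
wide <- data.frame(state = c("A", "B"),
                   `1990` = c(10, 20),
                   `2003` = c(12, 22),
                   check.names = FALSE)

# molten format: each row is one observation (state, year, value)
long <- tidyr::gather(wide, key = "year", value = "marriages", -state)
long
#   state year marriages
# 1     A 1990        10
# 2     B 1990        20
# 3     A 2003        12
# 4     B 2003        22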

[Figure: Excerpt of Wickham's "Tidy Data" paper]

Practicing data tidying

But how do we tidy messy data? How do we get from raw to molten data, and what are tidy tools? Let's practice this on a messy dataset.

On our GitHub page, we've uploaded an Excel file containing some data on marriages in Germany per state and for different years. Download it and open it in Excel to have a first look at it. As you'll see, it's a workbook with seven sheets. We have data for 1990, 2003 and every year from 2010 through 2014. Although this is quite a small dataset that we could tidy manually in Excel, we'll use it to practice skills that will come in handy with bigger datasets.

Now check whether this marriage data needs to be tidied:

  • Are there any changing terms?
  • Is the spelling correct?
  • Is every column that contains numbers correctly saved as a numeric column?
  • Are there needless titles, merged cells, empty cells or other annoying noise?

Spoiler alert: the sheets for 2010 to 2014 are okay, but the first two, not so much. We have different spelling and terms here, as well as three useless columns and one useless row, plus all the numbers saved as text in the first sheet. As mentioned, the mess in this example is limited and we could tidy it up manually with a few clicks. But let's keep in mind that we're here to learn how to handle these problems in larger datasets as well.

Within Excel, we will:

  • Delete spare rows and columns (we could do that in R too when it comes to REALLY messy data)
  • Save columns containing numbers as numeric type

Now we’ll change to R.

First of all, we need to install and load all the packages we're going to use. We'll do that with an if statement telling R to install a package only if it hasn't been installed yet. You could of course skip the conditional and install directly if you remember whether you've already installed a package, but this is a quick way to make sure you don't needlessly install something twice.

# readxl to efficiently read in the Excel workbook
if(!require(readxl)) {
  devtools::install_github("hadley/readxl") # requires the devtools package
  require(readxl)
}
# tidyr for tidying the data
if(!require(tidyr)) {
  install.packages("tidyr", repos = "http://cran.us.r-project.org")
  require(tidyr)
}
# dplyr for quickly preparing the data for different analyses
if(!require(dplyr)) {
  install.packages("dplyr", repos = "http://cran.us.r-project.org")
  require(dplyr)
}
# ggplot2 for the visual analysis
if(!require(ggplot2)) {
  install.packages("ggplot2", repos = "http://cran.us.r-project.org")
  require(ggplot2)
}

To read in the sheets of an Excel workbook, read_excel() from the readxl package is a useful function. Because we don't want to load the sheets separately, we'll use a loop for this. If you're interested in learning more about loop functions, stay tuned for our upcoming tutorial on this topic.

# as always, don't forget to set the working directory correctly.
# in my case, it's: 
setwd("/Users/MarieLou/Desktop/Journocoding/GitHub/tidydata")

# create empty list named messy_data
messy_data <- list()

# load messy data in empty list with loop and read_excel
for(i in 1:7){
# save sheets as local data frames  
  messy_data[[i]] <- read_excel("messydata.xlsx", sheet = i)
# add a column named timestamp to every list element  
  messy_data[[i]]$timestamp <- i 
}

messy_data 

messy_data is now a list of seven local data frames, with messy_data[[1]] containing the data for 1990, messy_data[[2]] the data for 2003 and so on. Also, we added a "timestamp" column to each list element which contains the index of that list element.
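
If you want to make sure the loop did what we expect, a quick optional sanity check (not strictly part of the workflow) is to inspect one of the list elements:

# optional sanity check: inspect single list elements
str(messy_data[[1]])  # structure of the 1990 sheet, including the new timestamp column
head(messy_data[[2]]) # first rows of the 2003 sheet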

Saving the sheets as list elements is time-saving, but we want all the data in one data frame:

# bind list to data frame
messy_dataframe <- do.call(rbind.data.frame, messy_data)

messy_dataframe

If you get an error telling you the frames have different lengths, you probably forgot to delete the spare columns in the 1990 sheet. Sometimes there even seems to be something invisible left in empty Excel columns. I usually delete three or so of the empty columns and rows next to my data to be sure there isn't something left that I can't see.
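
If you'd rather handle this in R than in Excel, one possible approach (a sketch, assuming the leftover columns are completely empty, i.e. all NA) is to drop such columns from every list element before binding:

# drop columns that are completely empty (all NA) in each sheet,
# so that all seven data frames end up with the same columns
messy_data <- lapply(messy_data, function(df) df[, colSums(!is.na(df)) > 0, drop = FALSE])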

Next part: Restructuring the data

With the gather() function from Wickham's tidyr package, we'll melt the raw data frame to convert it into a molten format. And let's change the timestamps created by the read-in loop to the actual years (we could do that with a loop, too, but this is good enough for now).

# restructuring the data
# condensing raw data to molten data (from wide to long)
data_restructed <- messy_dataframe %>% 
  as.data.frame() %>% 
  tidyr::gather(key = "variable", value = "value", 2:5)

# change timestamps to actual years
data_restructed$timestamp[data_restructed$timestamp == "1"] <- "1990"
data_restructed$timestamp[data_restructed$timestamp == "2"] <- "2003"
data_restructed$timestamp[data_restructed$timestamp == "3"] <- "2010"
data_restructed$timestamp[data_restructed$timestamp == "4"] <- "2011"
data_restructed$timestamp[data_restructed$timestamp == "5"] <- "2012"
data_restructed$timestamp[data_restructed$timestamp == "6"] <- "2013"
data_restructed$timestamp[data_restructed$timestamp == "7"] <- "2014"

# look at the restructured data
data_restructed
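
By the way, the seven assignments above could be replaced by a lookup vector indexed by the sheet number, which scales better with more sheets. A sketch, assuming the sheet order stays as above; use it instead of, not in addition to, the block above:

# alternative: translate sheet indices into years in one step
years <- c("1990", "2003", "2010", "2011", "2012", "2013", "2014")
data_restructed$timestamp <- years[as.numeric(data_restructed$timestamp)]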

Oo-De-Lally! This is tidy data in a molten format! Now we can check whether we have to correct the state names (because with bigger datasets, you can't quickly check and correct spelling and term issues in Excel):

# have a closer look at the state names
states <- arrange(data_restructed, State) # sort by state name, ascending
length(unique(states$State)) # how many unique names are there?

So we get 19 different German Bundesländer. But Google tells us that there are only 16 states in Germany! Let's have a closer look at the names to check whether we'll find duplicates:

unique(states$State) # Are there any duplicates?

Yes, there are! For example, Baden-Württemberg and BaWü refer to the same state, as do Hessen, Hesse and Hesssen. You can just correct these manually. For really big datasets, you could also work with regular expressions and string replacement to find and fix the duplicates (see the sketch after the next code block), but for now, this should be enough:

# manually correct wrong statenames
data_restructed$State[data_restructed$State == "BaWü"] <- "Baden-Württemberg"
data_restructed$State[data_restructed$State == "Hesssen" | data_restructed$State == "Hesse" ] <- "Hessen"

length(unique(data_restructed$State)) # how many state names are left now? This should be 16
sort(unique(data_restructed$State)) # look through the sorted names: overlooked something?

data_restructed # now THIS is tidy data!

# if you want, save the tidy data as a csv
# write.csv(data_restructed, file = "tidydata.csv", fileEncoding = "latin1")
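
For larger datasets, the regular-expression route mentioned above could look like the following sketch. The patterns are tailored to the typos in this toy dataset and would need adapting to your own data:

# sketch: fix spelling variants via pattern matching instead of exact strings
data_restructed$State[grepl("^BaWü", data_restructed$State)] <- "Baden-Württemberg"
data_restructed$State[grepl("^Hes", data_restructed$State)] <- "Hessen"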

Now that your data is tidy, the actual analysis can start. A very useful package for preparing molten data is dplyr. Its functions make it easy to filter or group the data. Not only is this great for taking a closer look at certain subsets of your data, but because Wickham's graphics package ggplot2 was built to fit the tidy data principle, we can quickly shape the data for visual analysis, too.

Here are some examples showing how tidy data and tidy tools can work hand in hand. If you want to learn something about the graphics package ggplot2 first, visit our beginners' post on it!

## Start analyzing your tidy data
# how many marriages do we have per year?
marriageperyear <- data_restructed %>% 
  group_by(timestamp) %>% # groups the data by year, stored in column "timestamp"
  dplyr::summarise(value = sum(value)) # adds up the values per year. The "dplyr::" prefix is
                                       # necessary if you've loaded another package with a
                                       # function called "summarise", which can confuse R:
                                       # it tells R that at this point we want the
                                       # summarise function from the dplyr package

marriageperyear # subset of data_restructed holding only the information we want to look at
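
Because the data is molten, the same pattern answers other questions with only small changes. For instance (a variation on the snippet above, not something the plots below depend on), the total number of marriages per season across all years:

# same pattern, different grouping: marriages per season across all years
marriageperseason <- data_restructed %>% 
  group_by(variable) %>% # the "variable" column holds the season
  dplyr::summarise(value = sum(value))

marriageperseason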

Visual analysis with ggplot2: this may look complicated at first, but once you've coded your first ggplot, you only have to change and/or add a few things to create several more, totally different plots.

## PLOT 1
# Compare the marriages of 1990, 2003 and 2014 and colourize them by season
# filter data by the years 1990, 2003 and 2014, group it by year and season and add up the values
marriageperyearnseason <- data_restructed %>%
  filter(timestamp %in% c("1990", "2003", "2014")) %>%
  group_by(timestamp, variable) %>% 
  dplyr::summarise(value = sum(value)) 

# ggplot the filtered data
ggplot(data = marriageperyearnseason, aes(x = timestamp, y = value, 
                                          fill = factor(variable), order = variable)) +  
  geom_bar(stat = "identity") + # stacked barplot
  theme_minimal() + # theme with white background
  xlab("Year") + # name of the x-axis
  scale_y_continuous(name = "Marriages", # name of the y-axis
                     # add a thousands separator to the y-axis labels
                     labels = function(x) format(x, big.mark = "'", scientific = FALSE)) + 
  guides(fill = guide_legend(title = "season", reverse = T)) + # edit the legend
  ggtitle("Marriages per year and season") # graphic title

[Plot 1: stacked bar chart comparing the marriages of 1990, 2003 and 2014, coloured by season]

## PLOT 2
# Compare the number of marriages in 2014 per state and colourize them by season
# filter data by the year 2014 and group it by season and state, add up the values
marriageperstate14 <- data_restructed %>%
  filter(timestamp %in% c("2014")) %>%
  group_by(variable, State) %>% 
  dplyr::summarise(value = sum(value)) 

# stacked barplot with ggplot (incl. dashed meanline)
meanlabel <- (sum(marriageperstate14$value)/16)-2000 # y-coordinate of label for meanline

ggplot(data = marriageperstate14, aes(x = State, y = value, 
                                      fill = factor(variable), order = variable)) +  
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("State") + 
  scale_y_continuous(breaks = seq(0, 85000, 10000), name = "Marriages", 
                     labels = function(x) format(x, big.mark = "'", scientific = FALSE)) +
# change x-axis labels to shortcuts
  scale_x_discrete(labels=c("BaWü","Bay","Ber","Bra","Bre","Ham","Hes","Meck","Nie","Nor",
                            "Rhe","Saa","Sac","SaAn","Sch","Thü")) + 
  theme(axis.text.x = element_text(angle = - 50, vjust = 0.9, hjust = 0.1)) +
  guides(fill=guide_legend(title="season", reverse = T)) +
  ggtitle("Marriages per state and season \n 2014") +
# add meanline
  geom_hline(aes(yintercept = (sum(value)/16)), color = "black", linetype = "dashed", 
             size = 0.5) + # geom_hline for horizontal lines, geom_vline for vertical lines
# add text to meanline
  annotate("text", x = 10, y = meanlabel, label="Mean", color="black", 
           size = 4, hjust = 1)

Maybe you’ve got some other questions this data could answer for you? Feel free to continue this analysis or try to tidy your own data set!

If you have any questions, problems or feedback, simply leave a comment, email us or join our slack team to talk to us any time!

 

{Credits for the awesome featured image go to Phil Ninh}