While crunching numbers, visualizing your data can help you get an overview or compare filtered information at a glance. Aside from the built-in graphics package, R has many additional packages to help you with that.
We want to focus on ggplot2 by Hadley Wickham, which is a very nice and quite popular graphics package.

ggplot2 is based on a statistical philosophy laid out in a book I really recommend reading. In The Grammar of Graphics, author Leland Wilkinson goes deep into the structure of quantitative plotting and, as a result, establishes a rulebook for building charts the right way. Hadley Wickham built ggplot2 to follow these aesthetics and principles.

To imitate the inherent structure of graphics Wilkinson describes, a ggplot is constructed piece by piece: first, you create the canvas and axes for your chart, then you determine the type of chart and its specific look, like colors and legends. The range of charts ggplot2 offers is rather big: from ordinary bar plots and tile plots to maps. There’s a bunch of different, independent components you can choose from. And, of course, ggplot2 follows Wickham’s understanding of tidy data. In my first week as an intern at the SRF Data Team, I analyzed data on Swiss arms exports according to these principles. We mostly used graphics for data analysis and for the fact sheet we would give to the other editors.

So why is ggplot2 a good choice? Because it’s fast, clear and good-looking by default. So let’s start ggplotting!

Generating data

Earlier, we noted that ggplot2 wants to be given tidy data. In the next post, we will show you how to convert your messy numbers into a well-structured form and what tidy data really means, so stay tuned for this important topic. For now, we’ll start by simply generating a random dataset for our plots:

# first, fill some vectors with random numbers
a <- sample(1:20, 10, replace=TRUE)
b <- sample(1:6, 10, replace=TRUE)
c <- sample(1:30, 10, replace=TRUE)
d <- sample(1:10, 10, replace=TRUE)
# as a reminder: sample() draws ten random numbers from the interval
# 1 to 20 (or 6, 30, 10). replace = TRUE means the same number
# can be drawn more than once

# going for the structured data frame
gg_data <- data.frame(year = rep(2001:2010, times = 4),  # one block of years per category
                      value = c(a, b, c, d),
                      variable = c(rep("A", length(a)), 
                                   rep("B", length(b)),
                                   rep("C", length(c)), 
                                   rep("D", length(d))))

This data frame looks odd, doesn’t it? Instead of a wide table we built a condensed, long data frame. Wickham calls this format molten data.

[Figure: excerpt from Wickham’s “Tidy Data”]
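To make the difference between a wide table and the long format a bit more tangible, here is a minimal sketch using only base R. The names wide, stacked and long are made up for this illustration, and stack() is just one of several ways to reshape; it is not necessarily the approach the next post will take.

```r
set.seed(42)  # fixed seed so the random numbers are reproducible

# hypothetical wide table: one row per year, one column per category
wide <- data.frame(year = 2001:2010,
                   A = sample(1:20, 10, replace = TRUE),
                   B = sample(1:6, 10, replace = TRUE),
                   C = sample(1:30, 10, replace = TRUE),
                   D = sample(1:10, 10, replace = TRUE))

# stack() from base R puts all value columns into one column ("values")
# and records the original column name in a factor ("ind")
stacked <- stack(wide[, c("A", "B", "C", "D")])

# long format: one row per year and category
long <- data.frame(year = rep(wide$year, times = 4),
                   value = stacked$values,
                   variable = stacked$ind)

dim(wide)   # rows and columns of the wide table
dim(long)   # rows and columns of the long table
head(long)
```

The wide table has one column per category, while the long one has a single value column plus a variable column that says which category each value belongs to. That is the shape our gg_data has.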


Say we want to compare the sum of values for each year, but also see the amounts of A, B, C and D. How do we do that? With a stacked barplot in ggplot2.

First of all, let’s install the package. The following snippet loads the package and only installs it first if it isn’t available yet.

if(!require(ggplot2)) { 
  install.packages("ggplot2", repos="http://cran.us.r-project.org")
  require(ggplot2)
}

The snippet basically tells R: “try to load ggplot2, and if that fails because it isn’t installed yet, please install it and load it now”.
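If you wonder why this pattern uses require() at all, here is a small sketch of its return value. The package name surelyNotInstalledPkg is made up for the demonstration; I’m assuming nothing with that name is installed on your machine.

```r
# require() tries to attach a package and reports success as TRUE or FALSE
loaded_stats <- require(stats)  # stats ships with R, so this works

# for a package that is not installed, require() warns and returns FALSE
# instead of stopping with an error ("surelyNotInstalledPkg" is a made-up name)
loaded_fake <- suppressWarnings(
  require("surelyNotInstalledPkg", character.only = TRUE, quietly = TRUE)
)

loaded_stats
loaded_fake

# library() would raise an error in the same situation, which is why
# the if(!require(...)) pattern relies on require()
```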

So now we’re going through this step by step. For the whole code without my nasty explanations, visit the ggplotting repository on our GitHub-Page.

Every ggplot starts with the function ggplot(). Within this function, we specify the data we want to plot. aes() is short for aesthetics and holds additional plotting information, like which variables are mapped to the axes. The argument fill, in the case of a bar plot, tells ggplot which variable is used to color the bars. This works for area plots as well, but for line and scatter plots you use the colour aesthetic instead. You can use colour for bar plots, too, but this doesn’t fill the bars; it changes the color of their outlines.

gg_stackedbar <- ggplot(data = gg_data, aes(x = year, y = value, fill = variable))

On its own, this function doesn’t do anything visual yet. It initializes a plot, but we haven’t told ggplot what type of plot we actually want. So now we have to add more components. Like, what kind of plot did we want to do again?
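You can check this yourself by peeking into the plot object. The sketch below uses a tiny made-up data frame called toy so that it runs on its own, and it already borrows the + operator that is explained a bit further down. Looking at the layers element is an inspection trick that worked in the ggplot2 versions I know, so treat it as an assumption rather than official API.

```r
library(ggplot2)

# small made-up data frame, only for illustration
toy <- data.frame(year = c(2001, 2001, 2002, 2002),
                  value = c(3, 5, 2, 6),
                  variable = c("A", "B", "A", "B"))

base_plot <- ggplot(data = toy, aes(x = year, y = value, fill = variable))

length(base_plot$layers)   # no layers yet, so nothing is drawn inside the axes

with_bars <- base_plot + geom_bar(stat = "identity")

length(with_bars$layers)   # one layer after adding geom_bar()
length(base_plot$layers)   # the original object is unchanged
```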

ggplot2 offers a range of possibilities for this: geom_area, geom_line and geom_point are just a few of them. For our stacked bar chart, we’ll use geom_bar and set stat = "identity". stat defines the statistical transformation; for geom_bar it defaults to "count" (called "bin" in ggplot2 versions before 2.0.0). This makes the height of each bar equal to the number of cases in each group. If you use this setting, you can’t specify the y argument in aes(). If you want the height of your bars to be defined by a column in your dataset, use stat = "identity" instead to map a value to the y aesthetic.
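Here is a minimal sketch of the difference between the two settings. The data frame toy and the plot names are invented for this example, and I use layer_data() only to look at the numbers ggplot2 computes for each bar.

```r
library(ggplot2)

# made-up toy data: group "x" has two rows, group "y" has one
toy <- data.frame(group = c("x", "x", "y"),
                  amount = c(5, 7, 2))

# default stat for geom_bar(): bar height = number of rows per group,
# so aes() gets no y
p_count <- ggplot(toy, aes(x = group)) +
  geom_bar()

# stat = "identity": bar height = the values in the amount column
# (rows of the same group are stacked on top of each other)
p_identity <- ggplot(toy, aes(x = group, y = amount)) +
  geom_bar(stat = "identity")

# layer_data() returns the values ggplot2 computed for a layer
layer_data(p_count)$count          # rows per group
max(layer_data(p_identity)$ymax)   # top of the tallest stacked bar
```

In the first plot the bars only tell you how many rows each group has; in the second one they add up the amount column, which is what we want for our stacked bar chart.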

ggplot2 components are linked with the “+” operator. It’s pretty intuitive: it tells R that all functions linked by a “+” belong to the same plot. Let’s try it:

gg_stackedbar + geom_bar(stat = "identity")

Oo-De-Lally! Doesn’t it look awesome?

Well, yes. That’s nice. But we want more! What about a title or a different theme?

We will name each intermediate step individually to watch the differences the added bricks make step by step:

step1 <- gg_stackedbar + geom_bar(stat = "identity")

# add a title
step2 <- step1 + ggtitle("Values per Year and Category")

# change axis labels
step3 <- step2 + xlab("YEAR") + ylab("VALUES")

# change legend title and order of shown variables
step4 <- step3 + guides(fill = guide_legend(title = "CATEGORIES", 
                                            reverse = TRUE))

# change theme (there are some default themes but you can even write your own)
step5 <- step4 + theme_minimal()

# add specific breaks and labels for the x-axis 
step6 <- step5 + scale_x_continuous(breaks = seq(2000, 2010, 1),
                                    labels = abs(seq(2000, 2010, 1)))

# change angle of x-labels
step7 <- step6 + theme(axis.text.x = element_text(angle = 40, vjust = 0.5))

# print any step to look at it, e.g. the final plot
step7
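If you find yourself adding the same bricks to every chart, one option is to collect them in a plain list and add that list in one go. This is only a sketch: toy and house_style are names I made up, and the list simply reuses components from the steps above.

```r
library(ggplot2)

# small made-up data frame, only for illustration
toy <- data.frame(year = c(2001, 2001, 2002, 2002),
                  value = c(3, 5, 2, 6),
                  variable = c("A", "B", "A", "B"))

# a plain list of ggplot2 components; house_style is a hypothetical name
house_style <- list(
  theme_minimal(),
  xlab("YEAR"),
  ylab("VALUES"),
  theme(axis.text.x = element_text(angle = 40, vjust = 0.5))
)

# adding the list with "+" applies its elements one after another
styled <- ggplot(toy, aes(x = year, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  house_style

styled
```

The order inside the list matters in the same way as in a chain of “+”: theme_minimal() comes first so that the theme() tweak after it isn’t overwritten.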

Hell yeah! A customized ggplot! Now that we’re getting the hang of ggplot2, let’s build some other plots by simply changing some arguments:

Area Plot

Try commenting out some components (like coord_flip() for example) or changing values like vjust to check what they do.

ggplot(data = gg_data, aes(x = year, y = value, fill = variable)) +
  geom_area() +
  ggtitle("Area ggplot") +
  theme_grey() + # complete theme first, otherwise it resets the theme() tweaks
  theme(plot.title = element_text(family = "Arial", face = "bold", size = 14)) + # change font and font size
  xlab("YEAR") +
  ylab("VALUES") +
  guides(fill = guide_legend(title = "VAR", reverse = FALSE)) +
  scale_x_continuous(breaks = seq(2000, 2010, 2),
                     labels = abs(seq(2000, 2010, 2))) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  scale_fill_brewer(palette = "Dark2") + # change colors with colour brewer palette
  coord_flip() # flip the plot

Line Plot

ggplot(data = gg_data, aes(x = year, y = value, colour = variable)) + # colour, not fill!
  geom_line(size = 0.2) + # in ggplot2 3.4.0 and later, use linewidth = 0.2 instead
  geom_point(size = 4) +
  ggtitle("Line ggplot \n with points") + 
  theme(plot.title = element_text(family = "Times New Roman", face = "bold", size = 20)) +
  xlab("X") +
  ylab("Y") +
# what about the y-labelling?  
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = abs(seq(0, 30, 5))) +
  theme(axis.text.x = element_text(angle = -45, vjust = -0.5), 
        axis.text.y = element_text(angle = -45, vjust = 0.5)) +
# changing colors manually  
  scale_colour_manual(values = c("A" = "#6FCCF3", "B" = "#FF8705", 
                                 "C" = "#E557F5", "D" = "#8BFF92"), drop = FALSE)
# the same works for geom_bar and geom_area if you use "scale_fill_manual()" instead!
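To spell out that last comment, here is a minimal sketch of manual colors on a stacked bar chart. The data frame toy and the names category_colours and bar_manual are invented for this example; the hex codes are the first two from the line plot above.

```r
library(ggplot2)

# small made-up data frame, only for illustration
toy <- data.frame(year = c(2001, 2001, 2002, 2002),
                  value = c(3, 5, 2, 6),
                  variable = c("A", "B", "A", "B"))

# named vector: category name = hex color
category_colours <- c("A" = "#6FCCF3", "B" = "#FF8705")

# bars are colored via the fill aesthetic, so we need scale_fill_manual()
bar_manual <- ggplot(toy, aes(x = year, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = category_colours)

bar_manual
```

If you used scale_colour_manual() here instead, the bars would keep their default fill colors, because nothing is mapped to colour in this plot.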

Awwww yiss! Now feel free to change some arguments, add new functions and try out all the other ggplots! If you have any problems, questions or feedback simply leave a comment!


{Credits for the awesome featured image go to Phil Ninh}