This is the first in a series of posts about using basic tools for data analysis in R. I’ve put it together as much for my own pleasure as for anyone else to read. If you read this and like it, don’t hesitate to give me a thumbs up in the comments below!
Overview
Usually, any analyst will want to start her endeavour by diving into the data. When I first started learning my way around R, I used to believe that the best way of looking at data was passing it to `View()` or something similar, to inspect it the Excel way. After all, this is how many analysts look at data in software like SPSS, SAS, or most other commercial “statistical packages” out there.
However, I quickly came to realise that this method is often highly subpar for gaining any deeper insight into data, and might sometimes even give you a false picture of what the data actually looks like. Instead, I have found that there is a multitude of ways of describing data that present a more accurate view of the data at hand and quickly surface a couple of crucial pieces of metadata.
In the sections below I describe four common techniques for examining data: inspection, summarization, aggregation, and handling of missing values.
Inspection
The first commands any analyst will want to learn are those used for simply inspecting data, i.e. just looking at it without any preprocessing. Here I have found `str()` and `head()`/`tail()` particularly good. Below you'll see them used on three built-in R datasets.
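A minimal sketch of what this could look like (the `iris`, `mtcars`, and `airquality` datasets ship with base R and are chosen here purely for illustration):

```r
# str() compactly shows the dimensions, column types, and first values
str(iris)

# head() prints the first six rows of a dataset...
head(mtcars)

# ...and tail() the last six
tail(airquality)
```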
To summarize: `str()` and `head()`/`tail()` are useful commands for simply displaying parts of your data along with some metadata about it.
Summarization
The step following inspection usually consists of carrying out some kind of summarization of the variables contained in your data. As you might already have guessed, the `summary()` command is usually an excellent starting point for this kind of examination. It can be used to describe vectors of different kinds, displaying information relevant to the data at hand.
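For instance (a sketch using columns of the built-in `iris` and `airquality` datasets; the `> 80` threshold is made up for the example):

```r
# Numeric vector: minimum, quartiles, mean, and maximum
summary(iris$Sepal.Length)

# Factor: a count per level instead
summary(iris$Species)

# Logical vector: counts of TRUE, FALSE, and NA values
summary(airquality$Ozone > 80)
```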
`summary()` can, of course, also be used to describe tabular data. Note that the first four columns of the `iris` dataset are all of the `numeric` type whereas the last column, `Species`, is a `factor`; `summary()` adjusts its output accordingly.
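For example:

```r
# Quartile summaries for the four numeric columns,
# level counts for the Species factor
summary(iris)
```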
To summarize (pun intended): `summary()` is useful for describing your data in an automated way, providing you with more metadata than pure inspection will generate.
Aggregation
Many insights can be gained from aggregation of data. Especially when dealing with data that comprises both categorical variables (a.k.a. background variables) and measurement variables (i.e. outcomes), the analyst can often learn much by aggregating the data in different ways. Unsurprisingly, `aggregate()` is the function you'll need most of the time. It makes extensive use of R's `formula` interface (activated by the `~` operator). This is a slightly more complex way of looking at data, but it can be a very efficient tool for gaining more advanced insights, e.g. about significant within-data differences, or about which variables are potential predictors and which might be outcomes.
Here we see the result of aggregating the `iris` dataset, taking the mean of the values in each column for each species (there are three species of iris flowers recorded in the dataset).
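One way to express this with the formula interface (a sketch; the `.` on the left-hand side stands for "all other columns"):

```r
# Mean of every measurement column, grouped by Species
aggregate(. ~ Species, data = iris, FUN = mean)
```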
`aggregate()` takes a `FUN` argument, which is the function applied to each aggregated group. `FUN` can be any function, even a user-defined one.
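For instance, an anonymous user-defined function (a sketch computing the spread within each species):

```r
# FUN can be anything that maps a vector to a value;
# here, the range (max - min) of each column per species
aggregate(. ~ Species, data = iris, FUN = function(x) max(x) - min(x))
```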
Another useful tool is the `table()` function, which is used to create frequency tables, and its nephew `prop.table()`, used to create relative frequency tables. The most basic usage is tabulating the frequencies in a vector:
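For example:

```r
# Count how many times each value occurs
table(iris$Species)

# Works on any discrete vector, e.g. the observation months in airquality
table(airquality$Month)
```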
However, there are more useful ways of putting `table()` to work, since it can be applied to any kind of tabular data. Be warned, though: do NOT pass an entire data frame containing many variables and/or rows to `table()`, since this will cause it to tabulate all possible combinations of the variables in the data set. If you really want to try this to get a better understanding of how `table()` works, make sure to do it on a really small dataset, like `table(cars)` (`cars` has the dimensions 50×2).
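Passing two columns cross-tabulates them, which is usually what you actually want (a sketch using the built-in `mtcars` dataset, chosen for illustration because both columns are discrete):

```r
# Cross-tabulate cylinder count against number of gears
table(mtcars$cyl, mtcars$gear)
```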
`table()` can also be combined with the `with()` function to create local measurement variables and categories.
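One way this might look (a sketch; the variable name `HighOzone` and the `> 80` cutoff are made up for the example):

```r
# with() evaluates the expression inside airquality, so we can
# derive a logical category on the fly and tabulate it by month
with(airquality, table(HighOzone = Ozone > 80, Month))
```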
`table()` returns a `table` object that can be used as input to other functions, like `prop.table()`, which converts the counts into proportions.
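For example:

```r
species_counts <- table(iris$Species)

# Each cell divided by the grand total; all cells sum to 1
prop.table(species_counts)
```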
If we want the proportions to sum to one row- or column-wise instead, we can pass a 1 (for rows) or a 2 (for columns) as an additional argument to `prop.table()`.
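A sketch, reusing the hypothetical `HighOzone` table from above:

```r
tab <- with(airquality, table(HighOzone = Ozone > 80, Month))

prop.table(tab, 1)  # proportions within each row
prop.table(tab, 2)  # proportions within each column
```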
To summarize: `aggregate()` and `table()` are methods for categorizing data and understanding which values are common and which are rare. `prop.table()` helps you put the output from `table()` in perspective. `with()` can be used to make `table()` compute new variables when creating frequency tables.
Handling missing values
Sometimes (well actually, most of the time when working with real-world data) your data will have missing values in it. In R these are usually represented by the `NA` value. This is problematic, e.g. because many base functions like `sum()` will return `NA` if there is a single `NA` among the input values.
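For instance:

```r
sum(airquality$Ozone)                # NA: the Ozone column has missing values
sum(airquality$Ozone, na.rm = TRUE)  # na.rm = TRUE drops NAs before summing
```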
To find out whether there are any missing values in a vector (or column, if you will) we turn to the `summary()` function yet again:
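```r
# The number of missing values is reported as a separate "NA's" entry
summary(airquality$Solar.R)
```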
So the column `Solar.R` in the `airquality` dataset contains 7 `NA` values. We can locate their positions by applying `is.na()` and passing the output to `which()`:
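```r
# Indices of the missing values in the Solar.R column
which(is.na(airquality$Solar.R))
```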
This returns a vector of positions within the column; these are the row numbers in the `airquality` dataset that have NAs in them. We can now print those rows to learn what other data they contain:
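```r
# Subset airquality to only the rows where Solar.R is missing
airquality[is.na(airquality$Solar.R), ]
```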
Missing values can be handled in several ways. We can remove them from a vector:
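```r
x <- airquality$Solar.R

x[!is.na(x)]   # keep only the non-missing elements
na.omit(x)     # equivalent, via a base helper
```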
Or we can remove entire rows from a data frame:
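```r
na.omit(airquality)                       # drop every row containing an NA
airquality[complete.cases(airquality), ]  # the same, via a logical index
```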
More importantly, we can also impute missing values using `impute()` from the `Hmisc` package.
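A minimal sketch (assuming `Hmisc` is installed; by default `impute()` fills in the median, but another function or a constant can be supplied as the second argument):

```r
library(Hmisc)

# Replace the 7 missing Solar.R values with the column mean
imputed <- impute(airquality$Solar.R, mean)
summary(imputed)
```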
To summarize: `is.na()` and `which()` are useful commands for locating missing values and displaying them in some context (using tabular data). `impute()` can be used to fill in values in place of the missing ones if needed.
Moving on
This article introduces a subset of the tools I find myself using recurrently for data examination. The last thing you saw was how to impute missing values, which is actually not about learning about data but about modifying it, in this case using a technique often referred to as feature engineering. That will be the topic of future posts.
Seasoned R programmers might also object that many staple R packages, like `Hmisc`, `psych`, `dplyr`, `data.table`, `stats`, and many others, have functions for data inspection that are vastly superior to the base-R functionality. I tend to agree, but I thought it would be nice to provide R beginners (as well as myself) with a basic set of tools that one can easily learn without having to familiarize yourself with the API of any particular package (as anyone familiar with both the dplyr and data.table packages will tell you, how data is handled can differ substantially between packages!). Also, I constantly find myself returning to the base-R functions on the occasions when there is a particular task that my favorite dplyr or Hmisc function doesn't handle as well as it should.
All the source code on display here can be found in a separate GitHub repository: LCHansson/rTutorials. If you think anything in the code should be improved, please consider making a pull request to that repository and I will update this page accordingly.
I hope this has been of some help to you if you made it this far into the article. Any feedback is more than welcome!