Lesson 3: DATA FRAMES

In this lesson, we will cover more details about how to work with data frames in R, including how to import data files, retrieve information about a data frame, and manipulate data frames.

3.1 Setting your working directory

Today we will be working on importing data into R, so it is first important for us to know how to tell R where to look (i.e., in what folder on our computer) for the files we want to import.

In the first lesson, when we looked at the layout of RStudio, I noted that one of the tabs in the bottom right-hand corner is the “Files” tab. This tab will show you all the files that are in your working directory, which is the folder R will search to find and save files. R usually has a default folder that it will set as your working directory unless you tell it otherwise.

To help keep things organized, it is useful to store all of your R-related files for a single project in the same folder, and you will likely want to keep separate folders for separate projects. Then when you open R, you can set your working directory to that folder for the project you want to work on.

There are a few ways to set your working directory. The easiest way for beginners is to use the drop-down menus in RStudio. Select “Session” from the menu bar at the top. Then hover over “Set Working Directory” and select “Choose Directory” from the new menu that appears. Then you will be able to use your file explorer to select the folder you want. When you do this, you will see a line of code appear in your console. That line, a call to the setwd() function, tells R to change the working directory. If you know you will want to use that working directory again, then I would recommend copying and pasting the line of code into the top of your R script file. In the future, when you are working with that R script file, you can simply run that line of code to set the working directory.
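As a rough sketch of what that line of code does, the example below sets the working directory with setwd() and then checks it with getwd(). The folder used here is only a stand-in: tempdir() is a temporary folder that R creates for each session, used so the example runs on any computer. In your own script you would put the path to your project folder in its place.

```r
# Stand-in for a project folder (an assumption for this example).
# In practice, replace tempdir() with the path to your own folder, in quotes.
project_folder <- tempdir()

# Tell R to use that folder as the working directory
setwd(project_folder)

# Confirm which folder R is now using
current_wd <- getwd()
current_wd

# List the files R can see in the working directory
list.files()
```

Running getwd() at the start of a session is a quick way to check where R is looking before you try to import anything.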

3.2 Loading data

In the previous lesson, we created a simple data frame directly in R. In general, particularly with large data sets, it is far easier to enter your data into a program like Excel and then import the data into R. For the remainder of the lesson, we will work with a simple example data set I provided you with called “DataFrameExample.csv”. I recommend saving data files you want to work with in R as csv files. You can import Excel files into R, but it is often slower, especially if the files are large. The other benefit of csv files is that they are a non-proprietary file type, so they don’t require a specific program like Excel. This will make it easier to share your files with other researchers.

Save the “DataFrameExample.csv” file in the folder you want to use as your working directory for this lesson. If you haven’t already, follow the directions in the previous section to set your working directory. Now you are ready to import the data. We will use the read.csv() function to import the data, and we will save the data as an object called DATA. R will automatically save it as a data frame object. Here is the code to use:

DATA <- read.csv("DataFrameExample.csv")
DATA
##    Tree Year Number.cones Growth Temperature
## 1     A 2009            0   0.21        8.58
## 2     A 2010           14   0.26        8.67
## 3     A 2011            0   0.20        8.32
## 4     A 2012           35   0.15        9.68
## 5     A 2013           14   0.21        9.09
## 6     B 2009           64   0.21        7.28
## 7     B 2010           22   0.23        7.36
## 8     B 2011           64   0.23        7.05
## 9     B 2012           22   0.19        8.45
## 10    B 2013           57   0.17        7.85

Notice that R automatically identified the first row as a header row. That is because the read.csv() function has an optional “header” argument, and the default for that argument is TRUE. If your data file does not have a header row, you would want to change the header argument to FALSE.
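To illustrate the header argument, here is a small sketch that reads a file with no header row. The file is not part of the lesson materials: the code writes three rows of the lesson's data to a temporary file (the names no_header_file and NO_HEADER are made up for this example) so that it can run on its own.

```r
# Write a small csv file WITHOUT a header row to a temporary location
no_header_file <- tempfile(fileext = ".csv")
writeLines(c("A,2009,0", "A,2010,14", "B,2009,64"), no_header_file)

# header = FALSE tells R that the first row is data, not column names
NO_HEADER <- read.csv(no_header_file, header = FALSE)
NO_HEADER

# R supplies default names (V1, V2, V3), so we replace them with our own
names(NO_HEADER) <- c("Tree", "Year", "Number.cones")
NO_HEADER
```

If we had left header at its default of TRUE, R would have treated the first row of data as column names, and that observation would have been lost.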

You now have a data frame object ready to work with in R!

3.3 Getting information about your data frame

R is pretty good at importing data and correctly identifying what type of data you have for each variable. Nevertheless, it is still important to take a look at your data frame to make sure it was imported correctly. You also might sometimes need a reminder of what variables you have in your data frame (I sometimes forget). Here, we will learn some ways to get basic information about your data frame.

Viewing your data

In the section above, we viewed the data frame right in the console by just typing the name of the data frame. If you have a small data frame like ours, that works pretty well. If you have a large data frame with lots of variables, lots of observations, or both, printing the whole thing will be unwieldy. Fortunately, we have some other options!

One option is to just view the top few rows of our data frame. This is a good option if you want to remind yourself of what is in your data frame and confirm that everything looks the way you would expect (this can be helpful after you have imported your data or made changes to your data frame). To view the top few rows of your data frame (the first six by default), use the head() function:

head(DATA)
##   Tree Year Number.cones Growth Temperature
## 1    A 2009            0   0.21        8.58
## 2    A 2010           14   0.26        8.67
## 3    A 2011            0   0.20        8.32
## 4    A 2012           35   0.15        9.68
## 5    A 2013           14   0.21        9.09
## 6    B 2009           64   0.21        7.28
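If six rows is more or fewer than you want, head() takes an optional n argument, and the companion function tail() shows the last rows instead. The sketch below is one way to try both. Its first statement rebuilds the lesson's data by hand so the example runs on its own; skip that statement if you have already imported DataFrameExample.csv.

```r
# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# View only the first three rows
head(DATA, n = 3)

# View only the last three rows
tail(DATA, n = 3)
```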

On occasion, you might want to view your entire data frame. If your data frame is large, you can use the View() function to view the data frame in a separate tab:

View(DATA)

You should now see your data frame appear in a new tab next to your script files.

You can also view just the names of the variables (i.e., the column names) in your data frame:

names(DATA)
## [1] "Tree"         "Year"         "Number.cones" "Growth"       "Temperature"

The structure of your data frame

The variables in your data frame can be a number of different data types, including continuous numeric data, integers, factors (categorical variables), and characters (text). When you import a data frame, R will do its best to identify the variable type based on the data in each column. It is usually pretty good, but sometimes it makes mistakes (often because we used some poor practices when entering the data, like putting notes into a variable column instead of having a separate column for notes). Sometimes it doesn’t matter if the data type is identified incorrectly, but for some purposes including statistical tests and graphing, it can be important. It is therefore good practice to check your data frame and make sure the data types are correct.

To look at the structure of a data frame, including the data type for each variable, we use the str() function.

str(DATA)
## 'data.frame':    10 obs. of  5 variables:
##  $ Tree        : chr  "A" "A" "A" "A" ...
##  $ Year        : int  2009 2010 2011 2012 2013 2009 2010 2011 2012 2013
##  $ Number.cones: int  0 14 0 35 14 64 22 64 22 57
##  $ Growth      : num  0.21 0.26 0.2 0.15 0.21 0.21 0.23 0.23 0.19 0.17
##  $ Temperature : num  8.58 8.67 8.32 9.68 9.09 7.28 7.36 7.05 8.45 7.85

You will now see a list of the variables in your data frame. Next to each variable, R shows the data type (chr = character, int = integer, num = continuous numeric), along with the first several values of each variable in your data frame. It looks like R did a good job identifying our data types, but let’s make one change to show you how it is done. Our Tree variable was identified as a character variable, but because that is a categorical variable identifying the tree each data point came from, it might be helpful to make it a factor variable instead. Run the following code to make the change:

DATA$Tree <- as.factor(DATA$Tree)

Notice I used the $ symbol to identify the variable in the data frame I wanted to change, and I told R to replace the original variable with the factor version of the same variable.

Now let’s look at the structure of the data frame again.

str(DATA)
## 'data.frame':    10 obs. of  5 variables:
##  $ Tree        : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
##  $ Year        : int  2009 2010 2011 2012 2013 2009 2010 2011 2012 2013
##  $ Number.cones: int  0 14 0 35 14 64 22 64 22 57
##  $ Growth      : num  0.21 0.26 0.2 0.15 0.21 0.21 0.23 0.23 0.19 0.17
##  $ Temperature : num  8.58 8.67 8.32 9.68 9.09 7.28 7.36 7.05 8.45 7.85

We can see that the Tree variable is now a factor with two different levels: A and B.

There are equivalent functions to convert to the other variable types: as.numeric(), as.integer(), and as.character().
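Here is a brief sketch of those conversion functions in action. It uses a small made-up vector (year_text) rather than the lesson's data frame. It also shows one thing to watch for that is not covered above: calling as.numeric() directly on a factor returns the internal level codes, not the original values, so it is safer to convert to character first.

```r
# A made-up vector of years stored as text
year_text <- c("2009", "2010", "2011")

# Convert text to numbers, and numbers back to text
year_number <- as.numeric(year_text)
year_integer <- as.integer(year_text)
year_back <- as.character(year_number)

# Caution with factors: as.numeric() returns the level codes (1, 2, 3)
year_factor <- as.factor(year_text)
codes <- as.numeric(year_factor)
codes

# Converting to character first recovers the original values
recovered <- as.numeric(as.character(year_factor))
recovered
```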

3.4 Manipulating data frames with the dplyr package

There are any number of ways we might want to manipulate our data sets, and, like with many aspects of programming, there are various ways to go about it. In this class, we will focus on using the dplyr package. We will look at a few different ways of manipulating our data frames, but there are many more! There is a great resource online called R for Data Science. It is a comprehensive guide to using R for managing, visualizing, and analyzing data, with a particular focus on dplyr. Check it out if you want/need to learn more!

For this section, you will need to activate the dplyr package. This is something that needs to be done every time you start a new R session if there are packages in addition to base R that you want to use. Enter the following code to activate dplyr:

library(dplyr)

Filtering data

Filtering is used to select a subset of observations from your data frame. For example, our data frame contains data from two different trees, but we might want to create a data frame that contains only the data for a single tree. We can do this with the filter function as follows:

DATA_A <- filter(DATA, Tree=="A")
DATA_A
##   Tree Year Number.cones Growth Temperature
## 1    A 2009            0   0.21        8.58
## 2    A 2010           14   0.26        8.67
## 3    A 2011            0   0.20        8.32
## 4    A 2012           35   0.15        9.68
## 5    A 2013           14   0.21        9.09

The new data frame now only has the data for Tree A. Notice a couple things about the code. First, notice the use of the double equals signs (==). We use the double equals signs when we are making a logical statement (i.e., give me all the values that equal a particular value). Also notice that the “A” is in quotes. For factor and character variables, the value of the variable needs to be in quotes.

We are not restricted to using == in our logical statement. We can also use things like greater than (>) or less than (<), and we can include multiple criteria. Below is an example of using greater than or less than criteria for two different variables.

DATA_SUBSET <- filter(DATA, Growth < 0.2, Temperature > 8)
DATA_SUBSET
##   Tree Year Number.cones Growth Temperature
## 1    A 2012           35   0.15        9.68
## 2    B 2012           22   0.19        8.45

The resulting data frame has only observations where growth was less than 0.2 and temperature was greater than 8.
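Separating criteria with commas means “and”: an observation must meet every criterion to be kept. As a sketch of two related options, the code below writes the same filter with the & operator and then uses the | operator, which means “or”. The cutoff of 9 for temperature is chosen only for illustration. The first statement after library(dplyr) rebuilds the lesson's data by hand so the example runs on its own; skip it if DATA is already loaded.

```r
library(dplyr)

# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# "And": both criteria must be true (same result as separating with a comma)
DATA_BOTH <- filter(DATA, Growth < 0.2 & Temperature > 8)
DATA_BOTH

# "Or": keep observations where at least one criterion is true
DATA_EITHER <- filter(DATA, Growth < 0.2 | Temperature > 9)
DATA_EITHER
```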

Selecting columns

In the section above, we selected a subset of observations, but we might also want to select a subset of variables (columns) from our data frame. For example, maybe we are only interested in the number of cones produced by each tree in each year. We can use the select() function to select only those three variables:

DATA_CONES <- select(DATA, Tree, Year, Number.cones)
DATA_CONES
##    Tree Year Number.cones
## 1     A 2009            0
## 2     A 2010           14
## 3     A 2011            0
## 4     A 2012           35
## 5     A 2013           14
## 6     B 2009           64
## 7     B 2010           22
## 8     B 2011           64
## 9     B 2012           22
## 10    B 2013           57
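The select() function can also be told which columns to leave out, or given a range of neighboring columns. The following is a sketch of both; the object names DATA_NO_TEMP and DATA_RANGE are made up for this example. The first statement after library(dplyr) rebuilds the lesson's data by hand so the example runs on its own; skip it if DATA is already loaded.

```r
library(dplyr)

# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# A minus sign drops a column and keeps everything else
DATA_NO_TEMP <- select(DATA, -Temperature)
names(DATA_NO_TEMP)

# A colon selects a range of neighboring columns (Tree through Number.cones)
DATA_RANGE <- select(DATA, Tree:Number.cones)
names(DATA_RANGE)
```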

Creating new variables

One common manipulation we might want to make to a data frame is to calculate a new variable, possibly based on data we already have in our data frame. For example, we might have measured the length and width of a plant and then want to multiply the length and width variables together to calculate the area covered by the plant. We can use the mutate() function in the dplyr package to create new variables. In our example, we will log transform (i.e., take the natural log of) our temperature variable:

DATA_LOG <- mutate(DATA,log_Temperature = log(Temperature))
DATA_LOG
##    Tree Year Number.cones Growth Temperature log_Temperature
## 1     A 2009            0   0.21        8.58        2.149434
## 2     A 2010           14   0.26        8.67        2.159869
## 3     A 2011            0   0.20        8.32        2.118662
## 4     A 2012           35   0.15        9.68        2.270062
## 5     A 2013           14   0.21        9.09        2.207175
## 6     B 2009           64   0.21        7.28        1.985131
## 7     B 2010           22   0.23        7.36        1.996060
## 8     B 2011           64   0.23        7.05        1.953028
## 9     B 2012           22   0.19        8.45        2.134166
## 10    B 2013           57   0.17        7.85        2.060514

If we wanted to add multiple new variables at the same time, we would simply add each new variable as a new argument, separated by commas.
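For instance, the sketch below adds two new variables in a single mutate() call. The second variable, sqrt_cones, is invented purely to illustrate the syntax. The first statement after library(dplyr) rebuilds the lesson's data by hand so the example runs on its own; skip it if DATA is already loaded.

```r
library(dplyr)

# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# Each new variable is its own argument, separated by commas
DATA_NEW <- mutate(DATA,
                   log_Temperature = log(Temperature),
                   sqrt_cones = sqrt(Number.cones))
head(DATA_NEW)
```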

Summarizing data

Another useful manipulation is summarizing our data. Maybe we want to summarize the data for each tree across all years of the study. There are a number of ways to summarize the data. We could calculate the average, maximum, minimum, or sum, or apply any number of other functions. We’ll use a different function on each variable we summarize, so you can see a variety of examples.

The first step in summarizing is to tell R what we want to use as our grouping variable(s). If we want to summarize the data for each tree across all years, Tree will be our grouping variable. We will create a new data frame, grouping our data by our grouping variable, using the group_by() function.

TREE <- group_by(DATA,Tree)

If we had wanted to group by multiple variables, we could add them after Tree, separated by commas.
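As a sketch of grouping by more than one variable, the code below groups by both Tree and Year and stores the result under a separate, made-up name (TREE_YEAR) so it does not overwrite the TREE object. The first statement after library(dplyr) rebuilds the lesson's data by hand so the example runs on its own; skip it if DATA is already loaded.

```r
library(dplyr)

# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# Group by two variables: every Tree and Year combination is its own group
TREE_YEAR <- group_by(DATA, Tree, Year)

# Check which variables define the groups and how many groups there are
group_vars(TREE_YEAR)
n_groups(TREE_YEAR)
```

In this data set every tree has exactly one observation per year, so each group holds a single row. Grouping by several variables is more useful when each combination has multiple observations.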

Now we will summarize the data for each tree. We will calculate the total (sum) number of cones, the average growth, and the maximum temperature.

DATA_SUMMARY <- summarise(TREE, Total.cones = sum(Number.cones), Mean.growth = mean(Growth), Max.temperature = max(Temperature))
DATA_SUMMARY
## # A tibble: 2 × 4
##   Tree  Total.cones Mean.growth Max.temperature
##   <fct>       <int>       <dbl>           <dbl>
## 1 A              63       0.206            9.68
## 2 B             229       0.206            8.45

When we use the summarise() function, notice that our first argument is not our original data frame. It is the new data frame we created with the group_by() function. The following arguments are the summary variables we want to create. We first provide a name for our new variable, followed by an equal sign, and then provide the function and variable we want to summarize. In this case we are creating three summary variables using different summary functions.
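The same pattern works with other summary functions. The sketch below uses n() (a dplyr function that counts the rows in each group), min(), and sd(). The summary variable names and the object name DATA_SUMMARY2 are made up for this example. The first statement after library(dplyr) rebuilds the lesson's data by hand so the example runs on its own; skip it if DATA is already loaded.

```r
library(dplyr)

# Rebuild the example data by hand (only needed if DATA is not already loaded)
DATA <- data.frame(
  Tree = rep(c("A", "B"), each = 5),
  Year = rep(2009:2013, times = 2),
  Number.cones = c(0, 14, 0, 35, 14, 64, 22, 64, 22, 57),
  Growth = c(0.21, 0.26, 0.20, 0.15, 0.21, 0.21, 0.23, 0.23, 0.19, 0.17),
  Temperature = c(8.58, 8.67, 8.32, 9.68, 9.09, 7.28, 7.36, 7.05, 8.45, 7.85)
)

# Group by tree, then summarize with three different functions
TREE <- group_by(DATA, Tree)
DATA_SUMMARY2 <- summarise(TREE,
                           Years.observed = n(),
                           Min.growth = min(Growth),
                           SD.temperature = sd(Temperature))
DATA_SUMMARY2
```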

Data Frames Team Challenge

Work with your team to complete the following tasks. Use the same DataFrameExample data set from above. (If the rest of your group isn’t ready yet, feel free to start on the Data Frames Practice Problems assignment while you wait.)

  1. Start with the original data frame. Extract the Number.cones column from the data frame and save it as a vector. (Review the Data Frame section in Lesson 2.3 if necessary.)
  2. Use indexing to remove the zeros from your vector. Then calculate the mean and standard deviation of the remaining numbers of cones. (Review the Vectors section in Lesson 2.3 if necessary.)
  3. Group the original data frame by year and summarize the data frame to calculate the mean growth and mean temperature for each year.
  4. Starting with your new summarized data frame, filter the data frame to only include data from 2009. What was the average growth in 2009?