ESS 3500 R code templates

R Basics

Mathematical functions

log(X) # natural logarithm of x
log10(X) # logarithm base 10 of x
sqrt(X) # square root of x
abs(X) # absolute value of x

Vector functions

VECTOR <- c(X1,X2,X3,...) # concatenate function to put values together in a vector
VECTOR <- seq(X1,XN, by=INCREMENT) # create a vector of numbers in a sequence from X1 to XN, in increments of INCREMENT

sum(VECTOR, na.rm = TRUE/FALSE) # sum of a set of numbers; set na.rm to TRUE to remove missing values
mean(VECTOR, na.rm = TRUE/FALSE) # mean of a set of numbers; set na.rm to TRUE to remove missing values
sd(VECTOR, na.rm = TRUE/FALSE) # standard deviation of a set of numbers; set na.rm to TRUE to remove missing values

VECTOR[X] # extract value X from a vector
VECTOR[X1:X2] # extract values X1 through X2 from a vector
VECTOR[c(X1,X2)] # extract values X1 and X2 from a vector
VECTOR[-X] # remove value X from a vector
VECTOR[-(X1:X2)] # remove values X1 through X2 from a vector
VECTOR[-c(X1,X2)] # remove values X1 and X2 from a vector
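
As a worked illustration of the vector templates above, here is a short sketch. The vector names and values are made up for the example.

```r
# Hypothetical measurements; NA marks a missing value
heights <- c(12, 15, NA, 20, 18)

sum(heights)                 # NA, because one value is missing
sum(heights, na.rm = TRUE)   # 65
mean(heights, na.rm = TRUE)  # 16.25

# Sequence from 2 to 10 in steps of 2
evens <- seq(2, 10, by = 2)  # 2 4 6 8 10
evens[2:3]                   # 4 6 (values 2 through 3)
evens[-1]                    # 4 6 8 10 (first value removed)
evens[-c(1, 5)]              # 4 6 8 (first and fifth values removed)
```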

Data frame basics

DATA <- data.frame(VARIABLE1=VALUES1,VARIABLE2=VALUES2,...) # create a data frame by inputting values for variables

DATA$VARIABLE # Extract a variable from a data frame
DATA[X,Y] # Extract the value in row X and column Y
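
A minimal sketch of these templates, using a small made-up data frame (the names streams, site, and temp are hypothetical):

```r
# Hypothetical data: three sites with one water temperature reading each
streams <- data.frame(site = c("A", "B", "C"),
                      temp = c(11.2, 13.5, 9.8))

streams$temp        # 11.2 13.5 9.8 (the whole temp column)
streams[2, 2]       # 13.5 (row 2, column 2)
streams[2, "temp"]  # 13.5 (columns can also be referenced by name)
```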

Data frames

Creating and viewing data frames

DATA <- read.csv("FILENAME.csv") # create a data frame by reading in a csv file

head(DATA) # view the top few rows of your data frame
View(DATA) # view the entire data frame in a new tab (note the capital V)
names(DATA) # view the variable (column) names in your data frame
str(DATA) # view the structure of your data frame, including variable names and types

Changing variable types

DATA$VARIABLE <- as.factor(DATA$VARIABLE) # convert a variable to a factor variable
DATA$VARIABLE <- as.character(DATA$VARIABLE) # convert a variable to a character variable
DATA$VARIABLE <- as.integer(DATA$VARIABLE) # convert a variable to an integer variable
DATA$VARIABLE <- as.numeric(DATA$VARIABLE) # convert a variable to a numeric variable
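
One thing to watch for when changing types: applying as.numeric directly to a factor returns the internal level codes, not the values you see printed. The sketch below uses a made-up factor to show the difference and a common workaround (convert to character first).

```r
# Hypothetical factor holding years stored as text
years <- factor(c("2019", "2021", "2021"))

as.numeric(years)                # 1 2 2 (level codes, not the years)
as.numeric(as.character(years))  # 2019 2021 2021 (the actual values)
```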

Manipulating variables with the tidyverse package

library(tidyverse) # load tidyverse

DATA_FILTER <- filter(DATA, VARIABLE==CRITERIA) # Filter data so the new data frame includes only the rows where a variable meets a certain CRITERIA. If the variable is a text variable, the CRITERIA needs to go inside quotes. You can also use <, >, and != (not equal to) in place of ==.
DATA_SELECT <- select(DATA, VARIABLE1, VARIABLE2,...) # select particular variables from a data frame
DATA_MUTATE <- mutate(DATA,NEWVARIABLE = FUNCTION(VARIABLE)) # Create a new variable in the data frame by applying a function to existing variable(s). Functions can include, but are not limited to: log, sum, min, max, mean.
GROUP <- group_by(DATA,VARIABLE) # Group a data frame by one or more variables in preparation for summarizing it. The grouped data frame will not look different from the original, but above the data frame, you will be able to see the number of groups for the variable you grouped by.
DATA_SUMMARY <- summarise(GROUP, SUMMARYVARIABLE1 = FUNCTION(VARIABLE1), SUMMARYVARIABLE2 = FUNCTION(VARIABLE2),...) # Summarize variables in a data frame. Note that the first argument is a grouped data frame created using the group_by function. You can summarize more than one variable at a time with this function.
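
To show how these functions chain together, here is a sketch using mtcars, a data set that ships with R. It loads dplyr, the tidyverse package that provides these functions; library(tidyverse) would work the same way. The object names are made up for the example, and n() is a dplyr helper that counts rows per group.

```r
library(dplyr)  # part of the tidyverse

heavy   <- filter(mtcars, wt > 3)             # keep cars with weight above 3 (1000 lbs)
small   <- select(heavy, mpg, cyl, wt)        # keep only three variables
logged  <- mutate(small, log_mpg = log(mpg))  # add a log-transformed variable
by_cyl  <- group_by(logged, cyl)              # group by number of cylinders

# One row per group, with the mean mpg and the number of cars in each group
cyl_summary <- summarise(by_cyl, mean_mpg = mean(mpg), n_cars = n())
cyl_summary
```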

Graphing with the ggplot2 package

Note: ggplot2 is part of the tidyverse set of packages

General structure

ggplot(DATA,aes(x=INDEPENDENTVARIABLE,y=DEPENDENTVARIABLE)) # Initialize a ggplot graph by identifying the data frame, the independent variable, and the dependent variable. The x and y inside aes must be lowercase. Note that some graphs (e.g., histograms) might only require an independent variable

# Graph components are added after the ggplot function, using a plus sign (+) to add each additional component, as follows
ggplot(DATA,aes(x=INDEPENDENTVARIABLE,y=DEPENDENTVARIABLE)) +
  geom_GRAPHTYPE() +
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()

Specific graph types

Note: multiple graph types can be added as components to the same plot

# Histogram (use this to graph the frequency of values for a single variable)
ggplot(DATA,aes(x=INDEPENDENTVARIABLE)) +
  geom_histogram() +
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()

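# Density (use this to graph the probability density of values for a single variable; with stat="identity", DENSITY must be a variable in DATA that already holds the density values)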
ggplot(DATA,aes(x=INDEPENDENTVARIABLE,y=DENSITY)) +
  geom_density(stat="identity",linewidth=1)+
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()

# Vertical line (use this to add a vertical line to a graph)
ggplot(DATA,aes(x=INDEPENDENTVARIABLE)) +
  geom_vline(aes(xintercept=INTERCEPT),color="COLOR",linewidth=1) +
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()

# Boxplot (use this for a categorical independent variable and a continuous dependent variable)
ggplot(DATA, aes(x=INDEPENDENTVARIABLE, y=DEPENDENTVARIABLE)) +
  geom_boxplot() +
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()

# Scatterplot with best fit line (use this for a continuous independent variable and a continuous dependent variable)
ggplot(DATA, aes(x=INDEPENDENTVARIABLE, y=DEPENDENTVARIABLE))+
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "X AXIS LABEL", y = "Y AXIS LABEL") +
  theme_classic()
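
As an illustration of adding more than one component to the same plot, the sketch below draws a histogram of mpg from mtcars (a data set that ships with R) and marks the mean with a vertical line. The choice of 10 bins and the color are arbitrary.

```r
library(ggplot2)  # part of the tidyverse

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +                                         # histogram layer
  geom_vline(xintercept = mean(mtcars$mpg), color = "red", linewidth = 1) +  # line at the mean
  labs(x = "Miles per gallon", y = "Number of cars") +
  theme_classic()

p  # printing the object draws the plot
```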

Classical frequentist tests

Chi-square tests

Categorical independent variable, discrete (count) dependent variable

# Goodness of fit test: compare the frequency in one categorical variable to an expected frequency
chisq.test(DATATABLE, p=c(EXPECTEDFREQUENCY1,EXPECTEDFREQUENCY2,...)) # The data table should be a table with a simple count for each category. The expected frequencies are proportions and must sum to 1. You need to include one expected frequency for each category, so add additional expected frequencies if you have more than two categories.

# Contingency table test: Use when you have counts for two categorical variables to test whether the frequencies of one categorical variable are affected by the other categorical variable
chisq.test(DATATABLE, simulate.p.value = TRUE/FALSE) # Set the simulate.p.value argument to TRUE if you have a low count (<5) in one or more categories

# If your data frame is not formatted as a simple count for each category, but instead lists the category value for each observation, you can use the table function to create the data tables for the chisq.test function:
DATATABLE <- table(DATA$VARIABLE1,...) # If you have more than one categorical variable that you want to summarize, list all variables in the arguments
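
A short sketch of both chi-square tests with made-up data. The counts, names, and the 3:1 expected ratio are hypothetical.

```r
# Goodness of fit: 25 red and 15 white flowers tested against an expected 3:1 ratio
flower_counts <- c(red = 25, white = 15)
gof <- chisq.test(flower_counts, p = c(0.75, 0.25))
gof$statistic  # chi-square statistic, 3.333 for these counts
gof$p.value

# Contingency table: one row per observation, so build the table of counts first
obs <- data.frame(habitat = c("forest", "forest", "field", "field", "forest", "field"),
                  present = c("yes", "no", "yes", "yes", "no", "yes"))
counts <- table(obs$habitat, obs$present)
counts

# Counts are small, so simulate the p-value (the result varies slightly between runs)
chisq.test(counts, simulate.p.value = TRUE)
```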

T-tests

Categorical independent variable (only 2 categories), continuous dependent variable

# Two sample t-test, equal variances
t.test(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA, var.equal = TRUE)

# Two sample t-test, unequal variances
t.test(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA, var.equal = FALSE)

# Mann-Whitney-Wilcoxon test (alternative to t-test for when residuals are not normally distributed)
wilcox.test(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA)

# One sample t-test
t.test(x = DATA$DEPENDENTVARIABLE, mu = TRUEMEAN)

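# Paired t-test (newer versions of R do not accept paired = TRUE together with a formula, so give the two sets of paired measurements as separate variables, with one row per pair)
paired <- t.test(x = DATA$MEASUREMENT1, y = DATA$MEASUREMENT2, paired = TRUE)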
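
For a concrete run of the two sample and one sample templates, this sketch uses mtcars, a data set that ships with R. The variable am has two categories (automatic and manual transmission), and the hypothesized mean of 20 is arbitrary.

```r
# Two sample t-test, unequal variances: does mpg differ between transmission types?
welch <- t.test(mpg ~ am, data = mtcars, var.equal = FALSE)
welch$p.value   # p-value for the difference in means
welch$estimate  # mean mpg in each group

# One sample t-test: is mean mpg different from 20?
one_sample <- t.test(x = mtcars$mpg, mu = 20)
one_sample$estimate  # sample mean of mpg
```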

ANOVA

Categorical independent variable (2+ categories), continuous dependent variable

# ANOVA, equal variances
aov(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA)

# ANOVA, unequal variances (Welch's ANOVA)
oneway.test(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA, var.equal = FALSE)

# Kruskal-Wallis test (alternative to ANOVA for when residuals are not normally distributed)
kruskal.test(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, data = DATA)
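
The three tests can be tried on PlantGrowth, a data set that ships with R, with plant weight measured in three treatment groups. Note that aov returns a fitted model, so wrap it in summary to see the F statistic and p-value. The object names here are made up for the example.

```r
# ANOVA, equal variances
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)  # ANOVA table with the F statistic and p-value

# Welch's ANOVA, unequal variances
welch <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = FALSE)
welch$p.value

# Kruskal-Wallis test
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
kw$p.value
```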

Linear regression

Continuous independent variable, continuous dependent variable

# Build model
MODEL <- lm(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, DATA)

# View model summary
summary(MODEL)

Multiple predictors

Continuous dependent variable, independent variables can be continuous or categorical

# Build model with no interaction between independent variables
MODEL <- lm(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE1 + INDEPENDENTVARIABLE2, DATA)

# Build model with an interaction between independent variables
MODEL <- lm(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE1*INDEPENDENTVARIABLE2, DATA)

# View model summary
summary(MODEL)
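
To see what the + and * formulas change, this sketch fits both versions to mtcars (a data set that ships with R) and lists the coefficients each model estimates. The model names are made up for the example.

```r
# No interaction: separate effects of weight and horsepower
additive <- lm(mpg ~ wt + hp, data = mtcars)
names(coef(additive))     # "(Intercept)" "wt" "hp"

# With interaction: the effect of weight is allowed to depend on horsepower
interacting <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(interacting))  # "(Intercept)" "wt" "hp" "wt:hp"

summary(interacting)      # estimates, standard errors, and p-values
```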

General linear models

T-tests, ANOVAs, linear regressions, and models with multiple predictors (two-way ANOVAs, multiple regressions, ANCOVAs) are all part of a broader class of models called general linear models. The lm function can be used to build any of these models. To include random effects in the model in addition to fixed effects, use the lmer function from the lme4 package.

# Fixed effects model
fixed <- lm(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, DATA) # You can add in additional independent variables by adding them onto the formula. Use a + sign to add variables if you don't want to include interactions between variables and a * to include interactions.

# Random effects model (SAMPLENUMBER is the random effect)
library(lme4) # load lme4, which provides the lmer function
random <- lmer(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE + (1|SAMPLENUMBER), DATA) # You can add in additional independent variables by adding them onto the formula. Use a + sign to add variables if you don't want to include interactions between variables and a * to include interactions.

# Extracting coefficients
COEFFICIENTS <- summary(MODELNAME)$coefficients

# Extracting model standard deviation (the standard deviation of the residuals)
SIGMA <- summary(MODELNAME)$sigma
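
A minimal sketch of both extractions on a fixed effects model fit to mtcars (a data set that ships with R). For an lm model, the coefficients come back as a matrix with one row per term, so individual values can be pulled out by row and column name.

```r
model <- lm(mpg ~ wt, data = mtcars)

coefs <- summary(model)$coefficients
coefs                      # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs["wt", "Estimate"]    # slope for weight
coefs["wt", "Pr(>|t|)"]    # p-value for the slope

sigma_hat <- summary(model)$sigma  # standard deviation of the residuals
sigma_hat
```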

Maximum likelihood approach

Code has the same structure for all general linear model tests

# First use the lm function (see previous section) to build null and alternative models to represent your hypotheses

# Null model (intercept only, no independent variable)
null <- lm(DEPENDENTVARIABLE ~ 1, DATA)

# Alternative model
alternative <- lm(DEPENDENTVARIABLE ~ INDEPENDENTVARIABLE, DATA)

# Calculate AIC to compare your models
AIC(null,alternative) # The model with the lowest AIC has the most support. You can compare more than two models at a time, so if you have more than one alternative model, you can add them to the list.
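
Here is the same workflow run on mtcars (a data set that ships with R), asking whether car weight helps explain mpg. This is only a sketch; the choice of variables is arbitrary.

```r
# Null model: mpg is explained by its overall mean only
null <- lm(mpg ~ 1, data = mtcars)

# Alternative model: mpg depends on weight
alternative <- lm(mpg ~ wt, data = mtcars)

# One row per model, with the number of estimated parameters (df) and the AIC
aic_table <- AIC(null, alternative)
aic_table

# Lower AIC indicates more support; here the alternative model has the lower value
aic_table["alternative", "AIC"] < aic_table["null", "AIC"]  # TRUE
```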