Lesson 9: Linear Regression

In this lesson, you will tackle a test that can be used when you have two continuous variables: linear regression. We will work with a plant cover data set that looks at climate variables as possible predictors of native plant cover.

9.1 Simple linear regression

In this first section, we will individually test the effect of mean temperature and maximum temperature on the cover of native plant species. To begin, we will build and visualize the models.

First, load the data set. Be sure your working directory is set correctly.

plant <- read.csv("PlantSumm.csv")
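Before modeling, it is worth confirming that the file loaded correctly and that the columns you need are present. The sketch below uses a tiny made-up data frame with the same column names this lesson uses (tot_cover, Mean_tempC, Max_tempC); with the real file, you would run str() and head() on plant itself.

```r
# Sketch on a made-up data frame; with the real data, use str(plant)
# and head(plant) instead. The values here are invented for illustration.
demo <- data.frame(tot_cover = c(35, 50, 28),
                   Mean_tempC = c(0.5, -0.2, 1.1),
                   Max_tempC = c(1.2, 0.3, 2.0))
str(demo)   # check that the response and predictors are numeric columns
head(demo)  # preview the first few rows
```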

For now, we’ll build three models: the null model, an alternative model for mean temperature, and an alternative model for maximum temperature.

plant_null <- lm(tot_cover ~ 1, plant)
plant_mean <- lm(tot_cover ~ Mean_tempC, plant)
plant_max <- lm(tot_cover ~ Max_tempC, plant)

To view the output of your models, type the name of each model. As in the models we have built in previous lessons, the null model will have just an intercept, and each alternative model will have an intercept and a slope term for the effect of either mean or maximum temperature on plant cover.
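If you prefer to pull the coefficients out programmatically rather than reading them off the printed output, coef() works on any lm object. Here is a minimal sketch using simulated data; in this lesson you would simply call coef(plant_null), coef(plant_mean), or coef(plant_max).

```r
# Sketch with simulated data; x and y are stand-ins for the lesson's
# temperature and cover variables.
set.seed(42)
x <- rnorm(13)                        # stand-in predictor
y <- 40 - 2 * x + rnorm(13, sd = 15)  # stand-in response
m_null <- lm(y ~ 1)  # intercept-only (null) model
m_alt  <- lm(y ~ x)  # intercept + slope (alternative) model
coef(m_null)  # one number: the intercept, which is just mean(y)
coef(m_alt)   # two numbers: intercept and slope
```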

Now, let’s create graphs to visualize our data. Because we have two continuous variables, a scatterplot is a good option. We will also include a best fit line based on our alternative model. We’ll make two graphs, one for each of our independent variables. If you haven’t already, load the ggplot2 package first.

library(ggplot2)
ggplot(plant, aes(x=Mean_tempC, y=tot_cover)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(x="Mean temperature (C)", y="Total plant cover") +
  theme_classic()

ggplot(plant, aes(x=Max_tempC, y=tot_cover)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(x="Max temperature (C)", y="Total plant cover") +
  theme_classic()

Classical Frequentist Approach

Let’s again start by analyzing the models using a frequentist approach. We don’t actually need to run any additional tests for this. We can just look at some additional output from the models we already ran.

To view the additional output, use the summary function.

summary(plant_mean)
## 
## Call:
## lm(formula = tot_cover ~ Mean_tempC, data = plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.707 -13.027   4.722  10.964  21.556 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   40.812      4.359   9.362 1.42e-06 ***
## Mean_tempC    -2.451      3.993  -0.614    0.552    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.47 on 11 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.03311,    Adjusted R-squared:  -0.05479 
## F-statistic: 0.3767 on 1 and 11 DF,  p-value: 0.5518
summary(plant_max)
## 
## Call:
## lm(formula = tot_cover ~ Max_tempC, data = plant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.058  -7.017   3.003   7.405  19.695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   41.846      3.913  10.695 3.76e-07 ***
## Max_tempC     -6.526      3.583  -1.821   0.0958 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.79 on 11 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.1618 
## F-statistic: 3.317 on 1 and 11 DF,  p-value: 0.09585

When you view the output, you will see a number of things. First, you will see the formula you used to build the model. Then you will see some information on the distribution of the residuals (the leftover variation not explained by your model). Next, you will see the coefficients from your model, along with the standard error of each estimate. The coefficients section also shows t-values and p-values for each coefficient; these are one-sample t-tests comparing the value of each coefficient to zero.
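If you want to look at the residuals more closely than the five-number summary in the printout allows, you can extract and plot them directly. A sketch with a simulated model follows; in the lesson, you would replace m with plant_mean or plant_max.

```r
# Sketch with simulated data; replace m with plant_mean or plant_max
# to check the lesson's models.
set.seed(42)
x <- rnorm(13)
y <- 40 - 2 * x + rnorm(13, sd = 15)
m <- lm(y ~ x)
res <- residuals(m)  # leftover variation not explained by the model
fit <- fitted(m)     # the model's predicted values
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around zero
```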

The information we really want for our linear regression test is down at the very bottom. In the final section, you will see some R-squared values. These are a measure of how much variation in your dependent variable is explained by your independent variable (we will talk about this more in class). Below that, you will see the output of the linear regression test. First is the F-statistic (the same statistic that was calculated for the ANOVA), followed by the p-value. Based on these values, would you reject or fail to reject the null hypothesis? What does this tell you about the effect of temperature on plant cover?
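The numbers at the bottom of the printout can also be extracted from the summary object directly, which is handy when reporting results. The sketch below uses simulated data; in the lesson, you would apply summary() to plant_mean or plant_max. Note that in simple regression with a single predictor, the overall F-test gives the same p-value as the t-test on the slope.

```r
# Sketch with simulated data; s would be summary(plant_mean) or
# summary(plant_max) in the lesson.
set.seed(42)
x <- rnorm(13)
y <- 40 - 2 * x + rnorm(13, sd = 15)
s <- summary(lm(y ~ x))
s$r.squared      # proportion of variance in y explained by x
s$adj.r.squared  # R-squared penalized for the number of predictors
s$fstatistic     # F value plus its numerator and denominator df
# Recompute the p-value shown on the last line of the printout:
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```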

Likelihood-based approach

Next we will use a likelihood-based approach to test the same question. Once again, the approach is the same as what you used for the t-test and ANOVA, using the AIC function to compare the models. Since we have three models, we can include all three in the AIC function and then compare the AIC values of the plant_mean and plant_max models to that of the null model.

AIC(plant_null, plant_mean, plant_max)
##            df      AIC
## plant_null  2 110.3634
## plant_mean  3 111.9256
## plant_max   3 108.9372
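One way to make the comparison easier to read is to compute each model's AIC difference from the best (lowest-AIC) model; a common rule of thumb is that a model roughly 2 or more AIC units below another is meaningfully better supported. The sketch below uses simulated data; in the lesson, you would pass plant_null, plant_mean, and plant_max to AIC().

```r
# Sketch with simulated data; in the lesson, the call would be
# AIC(plant_null, plant_mean, plant_max).
set.seed(42)
x <- rnorm(13)
y <- 40 - 6 * x + rnorm(13, sd = 10)
m_null <- lm(y ~ 1)
m_alt  <- lm(y ~ x)
aic_tab <- AIC(m_null, m_alt)  # data frame with df and AIC columns
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)  # difference from best model
aic_tab
```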

Based on this output, what would you conclude about the effects of mean and maximum temperature on plant cover?