In the previous lessons, we have been working with t-tests, which are used when you have a categorical independent variable with only 2 values and a continuous dependent variable. In this lesson, we will build on what you have learned and cover ANOVAs. ANOVAs are similar to t-tests, but they can be used when the categorical independent variable has 2 or more values. In other words, they are a generalization of the t-test. They can be used for the same type of data as a t-test, but in some additional circumstances as well.
One option for analyzing the effects of a categorical independent variable with more than two values would be to run a two-sample t-test for each pair of values of the categorical variable. For example, if we had categories A, B, and C, we could run a t-test comparing categories A and B, then run a second t-test comparing categories B and C, and then run a final t-test comparing categories A and C. Aside from being tedious, particularly if you have many possible values for your categories, why is this not the best idea?
This gets back to the concept of type I error that we discussed when we covered hypothesis testing. Type I error is when you reject the null hypothesis when the null hypothesis is true. We control this type of error with our threshold p-value. The lower we set the threshold, the lower the probability that we commit type I error. With our standard threshold of p=0.05, we would have a 5% chance of type I error. But what happens to this error rate if we conduct multiple tests (such as multiple t-tests) on the same data set? The short answer is that it increases. If we have a 5% chance of type I error for each individual test, then the chance of committing type I error on at least one test when we run multiple tests will be higher than 5%. Specifically, the probability will be \(1-(1-\alpha)^m\), where \(\alpha\) is the threshold p-value and \(m\) is the number of tests (this follows from the basic laws of probability: each test avoids a type I error with probability \(1-\alpha\), so the chance that all \(m\) independent tests avoid it is \((1-\alpha)^m\)). Scientists like to be conservative about committing type I error, so we try to avoid running multiple tests and inflating our type I error rate.
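To see how quickly this error rate grows, we can plug a few numbers into the formula in R:
# Probability of at least one type I error across m tests at threshold alpha
alpha <- 0.05
m <- 3  # e.g., all three pairwise t-tests among categories A, B, and C
1 - (1 - alpha)^m  # about 0.14, nearly triple the 5% rate of a single test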
ANOVAs are one solution to this problem. With an ANOVA, we run a single test to do an overall comparison across all of our categories. If we find a significant difference in an ANOVA, we can then follow that with pairwise comparisons between the categories to see which specific pairs differ from each other.
The general process for an ANOVA is very similar to a t-test. We start by identifying our hypotheses. Then we calculate a test statistic that summarizes the signal (strength of the pattern) and noise (variation) in our data. Finally, we use the test statistic and its associated probability distribution to calculate a p-value.
Our null and alternative hypotheses for an ANOVA are almost identical to those for a two-sample t-test, with a slight twist in the alternative hypothesis.
Null: The means of all groups are equal.
Alternative: The mean of at least one group is different.
Note the phrasing in the alternative hypothesis. We are not testing whether all of the groups differ from each other, or which specific groups differ from each other. In the ANOVA itself, we are simply testing if at least one group is different.
The test statistic for an ANOVA (the F-statistic) is conceptually similar to the t-statistic that we calculated for t-tests. It is, on the whole, a signal-to-noise ratio. It tells us how strong the pattern in our data is relative to the amount of variation in our data. The main difference is that when we have more than two groups that we are comparing to each other, the calculations for the signal and noise are slightly more complicated. I won't provide the full formula for the F-statistic here. We will just go over the general concept.
The F-statistic is essentially a ratio of two variances: the between-group variance and the within-group variance. The numerator of the statistic is the between-group variance, a measure of how spread out the group means are around the overall mean. This is our measure of the signal. The bigger the differences among the means, the higher the value of the F-statistic. The denominator of the statistic is the within-group variance, a measure of the average variance within each group, after accounting for any differences in means. This is our measure of the noise. The lower the variation within each group, the higher the value of the F-statistic. This aligns with our expectations for what should lead to a statistically significant result. If we have a stronger pattern and less variation, the F-statistic will be higher, and we will be more likely to get a statistically significant result.
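Although we are skipping the full formula, it can help to see the general idea in code. Here is a small sketch in R with made-up numbers, computing the signal and noise pieces and taking their ratio (this mirrors what a one-way ANOVA computes):
# Made-up example data: three groups of five observations each
g1 <- c(4.1, 5.0, 4.6, 5.2, 4.8)
g2 <- c(6.3, 5.9, 6.8, 6.1, 6.5)
g3 <- c(5.1, 4.9, 5.6, 5.3, 5.0)
groups <- list(g1, g2, g3)
grand_mean <- mean(unlist(groups))
k <- length(groups)          # number of groups
n <- length(unlist(groups))  # total sample size
# Signal: spread of the group means around the grand mean
between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2)) / (k - 1)
# Noise: variation of the observations around their own group means
within <- sum(sapply(groups, function(g) sum((g - mean(g))^2))) / (n - k)
between / within  # the F-statistic: signal relative to noise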
Once you have the F-statistic, the process for calculating the p-value is the same process that we have covered for previous tests. You use a probability distribution (in this case, the F-distribution) to calculate the probability of getting an F-statistic equal to or more extreme than yours, assuming the null hypothesis is true. This is, once again, the p-value, and if the p-value is less than 0.05, you reject the null hypothesis.
Like with our other probability distributions for test statistics, the shape of the F-distribution depends on the degrees of freedom. For an ANOVA, we have two degrees of freedom: the between-category degrees of freedom (number of categories - 1) and the within-category degrees of freedom (total sample size - number of categories). For example, with 3 categories and 30 total observations, we would have 2 between-category degrees of freedom and 27 within-category degrees of freedom.
If, when we run an ANOVA, we decide to reject our null hypothesis, meaning at least one of the groups differs significantly, we often want to follow the ANOVA with post-hoc tests, which do pairwise comparisons between our groups to determine which specific groups differ from the others.
Post-hoc tests can be used as a follow-up to an ANOVA to do pairwise comparisons between groups, while accounting for the inflated type I error that comes from doing multiple comparisons. Different types of post-hoc tests handle the problem in different ways, but a common strategy, and the one used by the post-hoc tests we will cover in this class, is to adjust the p-value itself to reduce the chance of type I error. Essentially, the adjusted p-value will be higher than what you would get from a normal t-test, so you will be less likely to reject the null hypothesis, and therefore less likely to commit type I error. One advantage of this strategy is that you can interpret the results in the same way that you would for a t-test. The null and alternative hypotheses are the same as what they would be for a t-test, and you would still reject the null hypothesis if p < 0.05.
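To illustrate what this adjustment does, here is the Holm correction applied to a few made-up raw p-values using R's built-in p.adjust function (the pairwise.t.test function we will use below applies this same adjustment internally):
# Made-up raw p-values from three pairwise comparisons
raw_p <- c(0.010, 0.020, 0.040)
p.adjust(raw_p, method="holm")  # returns 0.03 0.04 0.04
Note that every adjusted p-value is at least as large as the corresponding raw p-value, which is what makes the procedure more conservative.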
The assumptions of an ANOVA and the associated post-hoc tests are the same as they are for a t-test:
Independence: the observations are independent of each other.
Normality: the residuals are normally distributed.
Equal variance: the variances of the groups are equal.
We can use the same methods to test the latter two assumptions (graphical visualizations and formal tests) and if the assumptions are not met, we can address the problem in the same way. Data transformations (such as the log or square root transform) can often correct the problem, and, if not, there are alternative tests we can use.
Here are the alternative tests that can be used in place of ANOVAs and post-hoc pairwise comparisons when assumptions are not met:
Kruskal-Wallis test: an alternative to the ANOVA when the residuals are not normally distributed.
Welch's ANOVA: an alternative to the ANOVA when the variances are not equal.
Pairwise Wilcoxon tests: an alternative to the pairwise t-tests when the residuals are not normally distributed.
Pairwise t-tests with non-pooled SD: an alternative to the standard pairwise t-tests when the variances are not equal.
In the next section, you will learn how to use R to run ANOVAs, pairwise comparisons, and their alternatives.
For this lesson, we will work with a new data set on the Coachella Valley fringe-toed lizard. This lizard is very endangered, so we are interested in how to best manage the species to improve the chances of persistence. We have data based on a simulation that estimates the time to extinction (TTE) for three different management plans: no reserve (“none”), a single reserve (“single”), and multiple reserves in a network (“network”).
First, download the lizard data and load the data into R (don’t worry if you get a warning message when you do this - it’s not a problem).
lizard <- read.csv("lizard.csv")
Also, load the ggplot2 package:
library(ggplot2)
Now, we are ready to get started.
When we checked our assumptions, we determined that we needed to work with the log-transformed extinction times, so we will add the log-transformed variable to our data set:
lizard$log_TTE <- log(lizard$TTE)
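In case you are curious what that assumption check might look like in code, here is one possible sketch (the specific diagnostics covered in the earlier lessons may differ):
# Fit a model to the raw extinction times, then inspect the residuals
raw_model <- aov(TTE ~ Plan, data=lizard)
hist(residuals(raw_model))              # graphical check: look for a bell shape
shapiro.test(residuals(raw_model))      # formal test of normality
bartlett.test(TTE ~ Plan, data=lizard)  # formal test of equal variances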
To run an ANOVA with the frequentist approach, the structure is similar to a t-test, but we will use a different function: aov. However, the arguments (input) of the function are the same as what we use for a t-test: our independent and dependent variable (don't forget to use the log of TTE) and the data set. To view the output of the test, instead of just typing the name of the test object, we will use the summary function, with the test object as the input of the function.
lizard_anova <- aov(log_TTE ~ Plan, data=lizard)
summary(lizard_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Plan 2 6.66 3.329 10.96 3.67e-05 ***
## Residuals 146 44.33 0.304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As with the t-test, when we view the output, we can see the degrees of freedom (Df), the F value, and the p-value (Pr(>F)). Note the two values for the degrees of freedom. These are the between-category and within-category degrees of freedom that were discussed in the conceptual overview of ANOVAs above. We report both when we report our results.
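As a quick sanity check, we can reproduce the reported p-value ourselves by asking the F-distribution for the probability of an F value at least as large as ours, using the two degrees of freedom from the output:
# Upper tail of the F-distribution with 2 and 146 degrees of freedom
pf(10.96, df1=2, df2=146, lower.tail=FALSE)
This should match the Pr(>F) value in the summary table, up to the rounding of the F value.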
Looking at the p-value, we can see that it is below 0.05, so we reject the null hypothesis and tentatively accept that at least one of our categories has a different time to extinction than the rest.
If your assumptions of normally distributed residuals and/or equal variances are not met and can't be fixed with a transform, there are alternative tests that you can run.
The Kruskal-Wallis test is a rank test similar to the Mann-Whitney U test that we used as an alternative for the t-test. It can be used when your residuals are not normally distributed. If we were going to use it for the lizard data (we don't have to because our assumptions were met, but this is for the sake of demonstration), it would look like this:
kruskal.test(TTE ~ Plan, data=lizard)
If our variances are not equal, we can instead use Welch's ANOVA, which would look like this:
oneway.test(TTE ~ Plan, data=lizard, var.equal = FALSE)
Again, you don’t need to run these tests in addition to the normal ANOVA. You would run one of these as an alternative to the ANOVA if the assumption is not met.
If we want to know specifically which pairs of treatments differ from each other, we can use post-hoc tests to do pairwise comparisons between our categories. This is like running t-tests between each pair of categories, except the test corrects for the problem of multiple comparisons. We will use a function (pairwise.t.test) that gives us the flexibility to choose the type of correction we want to do, and will also work for data with unequal variances. We will use the Holm correction, which is a good choice in general because it is not too conservative but also doesn't add other assumptions about our data.
The required inputs of this function are the dependent and independent variables for the test, in that order. There is not a separate argument for the data frame that contains those variables, so we have to provide the data frame name as part of the variable names. To choose the type of p-value correction we want to use, we also need to add the “p.adjust.method” argument. We set this equal to “holm” to use the Holm correction.
pairwise.t.test(lizard$log_TTE,lizard$Plan,p.adjust.method="holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: lizard$log_TTE and lizard$Plan
##
## network none
## none 2e-05 -
## single 0.029 0.029
##
## P value adjustment method: holm
The output shows a table of p-values for each pairwise comparison between our groups. The row and column names tell you which two groups are being compared. Let's look at the value in the upper left corner. It compares no reserve (“none”) to a network of reserves (“network”). The p-value for this pairwise comparison is 2e-05 (i.e., 0.00002), which is clearly less than 0.05, so that tells us that there is a significant difference in the extinction time between the two treatments.
Now look at the remaining two pairwise comparisons. Based on the p-values, which other pairs of categories had significantly different extinction times?
In this example, this isn't necessary, because after the log-transform our data meet the assumptions of an ANOVA and these follow-up post-hoc tests. But if our data did violate these assumptions, we could adjust the tests we use for the post-hoc tests.
If we do not have equal variances, we can still use the pairwise.t.test function. We just add an additional argument (pool.sd=FALSE) to the function. If we were going to use this approach for the lizard data, it would look like this:
pairwise.t.test(lizard$TTE,lizard$Plan,p.adjust.method="holm",pool.sd=FALSE)
##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: lizard$TTE and lizard$Plan
##
## network none
## none 4.6e-05 -
## single 0.0490 0.0086
##
## P value adjustment method: holm
To do the pairwise comparisons if we don't have normally-distributed residuals, we need a new function: pairwise.wilcox.test. This will run pairwise tests similar to t-tests, but that allow for the non-normal distribution. The arguments of this function are the same as the arguments for the pairwise.t.test function, so to use this function on the lizard data, it would look like this:
pairwise.wilcox.test(lizard$TTE,lizard$Plan,p.adjust.method="holm")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: lizard$TTE and lizard$Plan
##
## network none
## none 0.00011 -
## single 0.07094 0.02147
##
## P value adjustment method: holm
We interpret the output of these tests in the same way we would for the normal pairwise comparisons.