# Two-way analysis of variance

In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one dependent variable. The two-way ANOVA not only determines the main effect of each independent variable but also identifies whether there is a significant interaction effect between them.

## History

Ronald Fisher mentions the two-way ANOVA in his celebrated 1925 book, Statistical Methods for Research Workers (chapters 7 and 8). In 1934, Frank Yates published procedures for the unbalanced case.[1] Since then, an extensive literature has been produced, reviewed in 1993 by Fujikoshi.[2] In 2005, Andrew Gelman proposed a different approach to ANOVA, viewed as a multilevel model.[3]

## Assumptions to use two-way ANOVA

As with other parametric tests, we make the following assumptions when using two-way ANOVA:

- The populations from which the samples are obtained are (approximately) normally distributed.
- The observations are independent of one another.
- The variances of the populations are equal (homoscedasticity).

## Model

Let us imagine a data set for which a dependent variable may be influenced by two factors (sources of variation). The first factor has $I$ levels ($i \in \{1,\ldots,I\}$) and the second has $J$ levels ($j \in \{1,\ldots,J\}$). Each combination $(i,j)$ defines a treatment, for a total of $I \times J$ treatments. We represent the number of replicates for treatment $(i,j)$ by $n_{ij}$, and let $k$ be the index of the replicate in this treatment ($k \in \{1,\ldots,n_{ij}\}$).

From these data, we can build a contingency table, where $n_{i+} = \sum_{j=1}^J n_{ij}$ and $n_{+j} = \sum_{i=1}^I n_{ij}$, and the total number of replicates is equal to $n = \sum_{i,j} n_{ij} = \sum_i n_{i+} = \sum_j n_{+j}$.

The design is balanced if each treatment has the same number of replicates, $K$. In such a case, the design is also said to be orthogonal, which allows the effects of the two factors to be fully distinguished. We can then write $\forall i,j \; n_{ij} = K$, and $\forall i,j \; n_{ij} = \frac{n_{i+} \times n_{+j}}{n}$.
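The orthogonality condition above is easy to check numerically. The following sketch builds the contingency table of replicate counts for a hypothetical balanced design (the dimensions $I = 2$, $J = 3$, $K = 4$ are illustrative choices, not from the text) and verifies that each cell count equals $n_{i+} n_{+j} / n$:

```python
import numpy as np

# Hypothetical balanced design: I = 2 levels of the first factor,
# J = 3 levels of the second, K = 4 replicates per treatment.
I, J, K = 2, 3, 4
n_ij = np.full((I, J), K)        # contingency table of replicate counts

n_i = n_ij.sum(axis=1)           # row totals n_{i+}
n_j = n_ij.sum(axis=0)           # column totals n_{+j}
n = n_ij.sum()                   # total number of replicates

# Orthogonality: in a balanced design, n_{ij} = n_{i+} * n_{+j} / n.
expected = np.outer(n_i, n_j) / n
print(np.array_equal(n_ij, expected))
```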

Let us denote as $y_{ijk}$ the value of the dependent variable of unit $k$ which received treatment $(i,j)$. The two-way ANOVA model can be written as:

$y_{ijk} = \mu_{ij} + \epsilon_{ijk}$ where $\epsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)$

The effects of both factors can be written explicitly as:

$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$

where $\mu$ is the grand mean, $\alpha_i$ is the additive main effect of level $i$ of the first factor (the $i$-th row of the contingency table), $\beta_j$ is the additive main effect of level $j$ of the second factor (the $j$-th column of the contingency table), and $\gamma_{ij}$ is the non-additive interaction effect of treatment $(i,j)$ (the cell at row $i$, column $j$ of the contingency table).

To ensure identifiability of parameters, we can add the following "sum-to-zero" constraints:

$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \sum_j \gamma_{ij} = 0$
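Under the sum-to-zero constraints, the parameters can be recovered from the cell means by subtracting marginal averages. A minimal sketch, using a made-up $2 \times 3$ table of cell means $\mu_{ij}$ (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical 2x3 table of treatment (cell) means mu_ij.
mu_ij = np.array([[10.0, 12.0, 14.0],
                  [11.0, 15.0, 16.0]])

mu = mu_ij.mean()                         # grand mean
alpha = mu_ij.mean(axis=1) - mu           # row main effects alpha_i
beta = mu_ij.mean(axis=0) - mu            # column main effects beta_j
gamma = mu_ij - mu - alpha[:, None] - beta[None, :]  # interaction gamma_ij

# The sum-to-zero constraints hold by construction.
assert np.isclose(alpha.sum(), 0) and np.isclose(beta.sum(), 0)
assert np.allclose(gamma.sum(axis=0), 0) and np.allclose(gamma.sum(axis=1), 0)

# The decomposition reconstructs every cell mean exactly.
assert np.allclose(mu + alpha[:, None] + beta[None, :] + gamma, mu_ij)
```

The constraints hold automatically because each effect is defined as a deviation from the relevant marginal mean; without them, the same $\mu_{ij}$ could be produced by infinitely many parameter combinations.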

## Hypothesis testing

In the classical approach, the null hypotheses (that the factors have no effect) are tested by computing a sum of squares for each source of variation and comparing the resulting mean squares with the error mean square via F-tests.

Testing if the interaction term is significant can be difficult because of the potentially-large number of degrees of freedom.[4]

## Notes

1. ^ Yates, Frank (March 1934). "The analysis of multiple classifications with unequal numbers in the different classes". Journal of the American Statistical Association (American Statistical Association) 29 (185): 51–66. Retrieved 19 June 2014.
2. ^ Fujikoshi, Yasunori (1993). "Two-way ANOVA models with unbalanced data". Discrete Mathematics (Elsevier) 116 (1): 315–334. doi:10.1016/0012-365X(93)90410-U.
3. ^ Gelman, Andrew (February 2005). "Analysis of variance? why it is more important than ever". The Annals of Statistics 33 (1): 1–53. doi:10.1214/009053604000001048.
4. ^ Ko, Yi-An; et al. (September 2013). "Novel Likelihood Ratio Tests for Screening Gene-Gene and Gene-Environment Interactions with Unbalanced Repeated-Measures Data". Genetic Epidemiology 37 (6): 581–591. doi:10.1002/gepi.21744.