Introduction
A repeated-measures design is a research framework in which multiple measurements are taken on the same subject [1]. These repeated measurements can occur sequentially over time. Examples include measuring systolic blood pressure or body weight weekly or monthly. Repeated measurements can also be taken under different experimental conditions. Furthermore, multiple measurements can be taken from the same subject at a single point in time, for example, observations from various parts of a single subject. A key characteristic of a repeated-measures design is the dependency or correlation among measurements taken within the same subject. Repeated measures analysis of variance (RM-ANOVA) is a specialized form of analysis of variance (ANOVA) that accounts for this dependency among observations measured repeatedly over time or across parts of the body within the same subject.
A thorough understanding of RM-ANOVA allows its application to a wide range of research. Anesthesiology and pain medicine is a field that frequently deals with vital sign data measured at regular intervals. A previous article in the Korean Journal of Anesthesiology has explained the principles of RM-ANOVA in detail using various models [2]. This article focuses on RM-ANOVA using repeated-measures variables with an order, such as time, within the same subject. It particularly emphasizes analyzing the interaction between the between-subject and within-subject effects through contrast testing to identify when this interaction occurs.
Meaning of P values in repeated measures analysis of variance
RM-ANOVA is used when two or more repeated measurements are taken from a single subject, and it can be applied whether there is one group or several. In a one-way RM-ANOVA with a single group, only the P value for the within-subject effect is presented. In a two-way RM-ANOVA with two or more groups, P values for the between-subject effect, within-subject effect, and interaction between these effects are presented. The meanings of these P values in a two-way RM-ANOVA are illustrated in Fig. 1.
A common research design in which investigators use RM-ANOVA involves dividing subjects into control and experimental groups based on the application of a specific treatment and analyzing the differences in treatment effects over time. Although the emphasis in interpreting RM-ANOVA results may vary across disciplines, this research design primarily aims to test changes in the outcome variable due to the treatment (between-subject effect) or over time (within-subject effect). Researchers often claim a significant difference when the P value for the between-subject or within-subject effect is significant; however, this interpretation is incomplete. The between-subject effect evaluates changes in the outcome variable based on treatment status, while the within-subject effect assesses changes due to repeated measurements at each time point. The P value for the between-subject effect statistically determines whether there is a difference in group means, without considering the changes in the outcome variable over time within each group (Fig. 1A). Even if the outcome variable increases in one group and decreases in another, the P value will not show a significant difference if the group means are identical.
On the other hand, the within-subject effect compares the means of measurements at each time point, irrespective of group assignment (Fig. 1B). In this case, it is impossible to distinguish how each group influenced the change in the outcome variable. In RM-ANOVA, attention should therefore be paid to the analysis results for the interaction between group and time (Fig. 1C) [3]. The P value in this analysis indicates whether there is a significant difference in the magnitude of change in the outcome variable between Time 1 and Time 2 across the different groups. This change is represented by the slope of the line connecting the means of the two measurements within each group. A difference in the slopes of the two lines indicates that the grouping factor or treatment influences the outcome variable over time. Conversely, if the two lines are parallel (i.e., no difference in slope), the treatment has no effect on the outcome variable over time.
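The three views described above can be illustrated with a minimal Python sketch. The cell means below are hypothetical (they are not taken from the article or its figures) and were chosen so that the group means and the time means are identical while the slopes differ; the variable names are illustrative only.

```python
# Hypothetical cell means (NOT from the article): one list per group, [Time 1, Time 2].
cell_means = {
    "control": [10.0, 14.0],    # increases over time
    "treatment": [14.0, 10.0],  # decreases over time
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Between-subject view: one mean per group, averaged over time (cf. Fig. 1A).
group_means = {group: mean(values) for group, values in cell_means.items()}

# Within-subject view: one mean per time point, averaged over groups (cf. Fig. 1B).
time_means = [mean([values[t] for values in cell_means.values()]) for t in range(2)]

# Interaction view: change from Time 1 to Time 2 within each group (cf. Fig. 1C).
changes = {group: values[1] - values[0] for group, values in cell_means.items()}
difference_in_change = changes["treatment"] - changes["control"]

print("group means:", group_means)
print("time means:", time_means)
print("changes:", changes, "difference in change:", difference_in_change)
```

In this constructed case, neither the group means nor the time means differ, yet the two groups change in opposite directions; only the difference in change reflects that pattern.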
Fig. 2 illustrates various examples of P values for tests concerning the between-subject effect, within-subject effect, and the group-by-time interaction effect, based on changes in the outcome variable. The presented figures and P values are not derived from actual data but are hypothetical examples created for better understanding. As seen in the P values for each scenario, the P values for group or time have no bearing on the significance of the interaction. Therefore, the interaction between the two factors is crucial for determining the effect, and the P values for group and time should only be considered as supplementary information to aid in understanding the data during result interpretation.
Sphericity test
Sphericity is one of the essential assumptions of RM-ANOVA. It requires that the variances of the differences between all possible pairs of within-subject conditions (or time points) be equal. Note that this does not refer to the variance of each repeated measurement level or time point itself. Mauchly’s test is the most commonly used method to verify the assumption of sphericity. If the sphericity assumption is not met, the variances of the differences between repeated measurement levels or time points are unequal. In this case, the distribution of the F-statistic for the ANOVA test is distorted and the P value is underestimated, which increases the probability of a Type I error. To compensate, the degrees of freedom (df) used to evaluate the F-statistic are adjusted before hypothesis testing proceeds: the original df are multiplied by epsilon (ε). The most commonly used ε are the Greenhouse-Geisser ε and the Huynh-Feldt ε. Based on the Greenhouse-Geisser ε, it is generally recommended to apply the Huynh-Feldt correction when ε is greater than 0.75 and to apply the Greenhouse-Geisser correction when ε is less than or equal to 0.75 or when nothing is known about sphericity [4].
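As a rough illustration of the correction rule above, the following Python sketch computes the Greenhouse-Geisser ε from a covariance matrix of the repeated measures using the standard double-centering formula and then applies the 0.75 threshold. The function names and the example covariance matrix are hypothetical; statistical packages report ε directly, so this is only meant to show what the number represents.

```python
def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix of repeated measures."""
    k = len(cov)
    grand_mean = sum(sum(row) for row in cov) / (k * k)
    row_means = [sum(row) / k for row in cov]
    col_means = [sum(cov[i][j] for i in range(k)) / k for j in range(k)]
    # Double-center the covariance matrix (remove row, column, and grand means).
    centered = [
        [cov[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_of_squares = sum(x * x for row in centered for x in row)
    return trace ** 2 / ((k - 1) * sum_of_squares)


def choose_correction(gg_epsilon, threshold=0.75):
    """Rule of thumb from the text: Huynh-Feldt above the threshold, else Greenhouse-Geisser."""
    return "Huynh-Feldt" if gg_epsilon > threshold else "Greenhouse-Geisser"


# Hypothetical compound-symmetric covariance matrix: sphericity holds exactly.
spherical_cov = [
    [4.0, 2.0, 2.0],
    [2.0, 4.0, 2.0],
    [2.0, 2.0, 4.0],
]
epsilon = greenhouse_geisser_epsilon(spherical_cov)
print(epsilon, choose_correction(epsilon))
```

ε equals 1 when sphericity holds exactly and falls toward its lower bound of 1/(k − 1) as the violation becomes more severe, which is why multiplying the df by ε makes the test more conservative.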
Post-hoc analysis methods and interpretation
If the results of the RM-ANOVA show a significant difference, separate post-hoc tests are needed to identify where the differences occur. Graphs can aid in the intuitive interpretation of post-hoc test results. When an interaction exists between factors, estimated marginal means must be used to calculate group means adjusted for the interaction.
Fig. 3 presents several examples of hypothetical result graphs that might be observed when post-hoc tests are performed in the presence of an interaction, along with possible interpretations. Group 1 is designated as the control group and Group 2 as the treatment group based on treatment status, and three measurements (Time 1 to 3) are taken from each subject. Multiple pairwise comparisons are performed for Time 1 vs. 2, Time 2 vs. 3, and Time 1 vs. 3, irrespective of group, yielding adjusted P values ($P_{12}$, $P_{23}$, $P_{13}$) for each comparison. Various comparison methods and P value adjustment methods (e.g., using paired t-tests for comparisons and Bonferroni correction for P values) exist to account for the increased Type I error from multiple comparisons. Refer to Lee and Lee for details [5].
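As a small illustration of the P value adjustment mentioned above, the sketch below applies the Bonferroni correction to three pairwise comparisons. The raw P values and the function name are hypothetical and are not results from the article.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw P value by the number of comparisons, capping the result at 1."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical raw P values for the three pairwise comparisons (NOT from the article).
raw_p = {"Time 1 vs 2": 0.004, "Time 2 vs 3": 0.030, "Time 1 vs 3": 0.400}
adjusted_p = dict(zip(raw_p, bonferroni_adjust(list(raw_p.values()))))

for label, p in adjusted_p.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: adjusted P = {p:.3f} ({verdict})")
```

Note that the second comparison would be declared significant on its raw P value but not after adjustment, which is the intended protection against an inflated Type I error.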
As shown in Fig. 3A, the outcome variable is significantly higher in Group 2 than in Group 1, resulting in a significant difference between groups starting at Time 2 ($P_{12} < 0.05$), which is maintained until Time 3 ($P_{13} < 0.05$). In Fig. 3B, although the difference in the outcome variable between groups significantly decreased from Time 2 to Time 3 ($P_{23} < 0.05$), the difference remains significant, indicating a residual treatment effect ($P_{13} < 0.05$). Fig. 3C shows that the treatment effect disappeared at Time 3, as evidenced by no difference between groups ($P_{13} > 0.05$). Fig. 3D shows that the treatment effect first appeared at Time 3 ($P_{13} < 0.05$). The interpretation of results is not strictly defined and can vary depending on the nature of the data and the established hypotheses.
When a significant difference is observed in the interaction effect, the most commonly used method for post-hoc testing is the simple effect (or simple main effect) analysis. The main effect refers to the influence of a specific independent factor on the outcome variable. It is the difference in the mean of the outcome variable between the levels of the independent variable (Fig. 4A). A simple main effect refers to the main effect of one independent variable only at a specific level of another independent variable. The difference between the overall means of Group 1 and Group 2 is the main effect, while the group differences observed at each time point are simple main effects (Fig. 4A). As the name suggests, the simple main effect analysis tests whether there is a significant difference in the mean outcome variable between the two groups at each measurement point. A significant difference at a specific time point suggests that the interaction occurs at that point.
However, this method can lead to distorted analysis results because it reflects both the main effect and the interaction effect. For example, consider a case where the outcome variable in Group 1 increases from 1 to 2 between Time 1 and Time 2, whereas it decreases from 2 to 1 in Group 2 (Fig. 4B). Defining the mean outcome variable for Group $i$ at Time $j$ as $G_iT_j$, the differences in means between the two groups (simple main effects) at Time 1 and Time 2 are $G_1T_1 - G_2T_1 = 1 - 2 = -1$ and $G_1T_2 - G_2T_2 = 2 - 1 = 1$, respectively. That is, the simple main effect at each time point has an absolute value of 1. The difference between the change in Group 1’s outcome and the change in Group 2’s outcome from Time 1 to Time 2 (interaction effect) is $(G_1T_1 - G_1T_2) - (G_2T_1 - G_2T_2) = (1 - 2) - (2 - 1) = -2$, which is twice as large in absolute value. Thus, even when a difference is observed in the interaction effect, post-hoc analysis using simple main effects may fail to detect it.
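The arithmetic of this example can be checked with a few lines of Python. The cell means are those of the Fig. 4B example in the text; the dictionary layout and variable names are illustrative choices only.

```python
# Cell means from the Fig. 4B example, keyed by (group, time).
cell_means = {(1, 1): 1.0, (1, 2): 2.0, (2, 1): 2.0, (2, 2): 1.0}

# Simple main effect of group at each time point: Group 1 mean minus Group 2 mean.
simple_main_effects = {
    time: cell_means[(1, time)] - cell_means[(2, time)] for time in (1, 2)
}

# Main effect of group: the simple main effects averaged over the time points.
main_effect_group = sum(simple_main_effects.values()) / len(simple_main_effects)

# Interaction effect: change in Group 1 minus change in Group 2 (Time 1 minus Time 2).
interaction_effect = (cell_means[(1, 1)] - cell_means[(1, 2)]) - (
    cell_means[(2, 1)] - cell_means[(2, 2)]
)

print("simple main effects:", simple_main_effects)
print("main effect of group:", main_effect_group)
print("interaction effect:", interaction_effect)
```

The interaction effect also equals the simple main effect at Time 1 minus that at Time 2, which shows why testing each simple main effect separately is not the same as testing the interaction itself.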
Although not as commonly used as simple main effect analysis, interaction contrast testing provides an excellent alternative. A contrast is a linear combination used to perform statistical analysis for specific comparisons between groups. In RM-ANOVA, contrasts are used to test main effects and interaction effects and are particularly useful for finely assessing differences between specific time points or groups. Unlike simple main effect testing, which compares group differences at each measurement point, interaction contrast testing focuses on the change in the outcome variable over time, which is a more appropriate method considering the meaning of interaction [6]. Before explaining interaction contrasts in detail, let us look at the definition of a contrast. A contrast $\psi$ of $k$ means $\mu_1, \mu_2, \ldots, \mu_k$ has the following mathematical form:
$$\psi = c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k = \sum_{i=1}^{k} c_i\mu_i$$
where the coefficients $c_i$ must satisfy the following condition:
$$\sum_{i=1}^{k} c_i = 0$$
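For example, consider measuring an outcome variable at Times 1, 2, and 3 in Groups 1 and 2, assuming that each group has the same number of subjects $n$ and that there are no missing values during measurements. The means and standard deviations for each group and time point are shown in Table 1. The group × time interaction contrasts between Time 1 and Time 2, Time 1 and Time 3, and Time 2 and Time 3 are the differences between the change in measurements over time in one group and the corresponding change over the same time in the other group. Writing the mean of Group $i$ at Time $j$ as $\mu_{ij}$, they can be calculated as follows:
$$\psi_{12} = (\mu_{11} - \mu_{12}) - (\mu_{21} - \mu_{22})$$
$$\psi_{13} = (\mu_{11} - \mu_{13}) - (\mu_{21} - \mu_{23})$$
$$\psi_{23} = (\mu_{12} - \mu_{13}) - (\mu_{22} - \mu_{23})$$
Table 2 summarizes the coefficients for each mean corresponding to the three contrasts above.
With the means ordered as $\mu_{11}, \mu_{12}, \mu_{13}, \mu_{21}, \mu_{22}, \mu_{23}$, the sum of the coefficients for each contrast is as follows:
$$\psi_{12}: 1 + (-1) + 0 + (-1) + 1 + 0 = 0$$
$$\psi_{13}: 1 + 0 + (-1) + (-1) + 0 + 1 = 0$$
$$\psi_{23}: 0 + 1 + (-1) + 0 + (-1) + 1 = 0$$
Therefore, they satisfy the condition for contrasts. The contrast set to test the interaction between two of the three time points is 1 and −1. Expressed as a matrix, the column vector L of contrast coefficients is
$$L = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
and the matrix of means (B) is
$$B = \begin{bmatrix} \mu_{11} & \mu_{12} & \mu_{13} \\ \mu_{21} & \mu_{22} & \mu_{23} \end{bmatrix}$$
When L is transformed into a transposed vector to enable matrix operations with B, it becomes:
$$L' = \begin{bmatrix} 1 & -1 \end{bmatrix}$$
Furthermore, the transformation matrices required to extract two time points from the three means for calculating interaction contrasts are defined as follows:
$$M_{12} = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \quad M_{13} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}, \quad M_{23} = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}$$
Using the transposed column vector (L') representing comparison of the two groups and the transformation matrices extracting two time points from three, each interaction contrast can be expressed in matrix form as follows:
$$\psi_{12} = L'BM_{12}, \quad \psi_{13} = L'BM_{13}, \quad \psi_{23} = L'BM_{23}$$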
An independent two-sample t-test is used to determine whether the calculated interaction contrast is significant (Supplementary Material 1). This method can determine whether there is a difference in the mean change of measurements between Time 1 and Time 2 across two groups, that is, whether there is a group × time interaction between Time 1 and Time 2. Similarly, it can determine the presence of a group × time interaction between Time 2 and Time 3 and between Time 1 and Time 3. Jang and Jung [3], through computer simulations, compared the performance of simple main effect tests with that of interaction contrast tests used as post-hoc methods for interaction effects in repeated measures data. They reported that interaction contrast tests showed lower Type I error rates and higher power than did simple main effect analysis.
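The matrix form of the interaction contrast described above can be sketched in plain Python. The cell means in `B` are hypothetical (Table 1 is not reproduced here), and the names `L`, `B`, and `M` simply mirror the contrast vector, matrix of means, and transformation matrices in the text.

```python
def interaction_contrast(l_vec, b_mat, m_vec):
    """Compute L'BM for a group contrast l_vec, a groups x times matrix of means b_mat,
    and a time contrast m_vec."""
    # L'B: combine the rows (groups) of B with the group contrast coefficients.
    lb = [
        sum(l_vec[i] * b_mat[i][j] for i in range(len(l_vec)))
        for j in range(len(m_vec))
    ]
    # (L'B)M: combine the resulting time-point values with the time contrast coefficients.
    return sum(lb[j] * m_vec[j] for j in range(len(m_vec)))


# Hypothetical cell means (NOT Table 1): rows are Groups 1-2, columns are Times 1-3.
B = [
    [10.0, 12.0, 15.0],
    [10.0, 11.0, 11.0],
]
L = [1, -1]  # Group 1 minus Group 2
M = {
    "Time 1 vs Time 2": [1, -1, 0],
    "Time 1 vs Time 3": [1, 0, -1],
    "Time 2 vs Time 3": [0, 1, -1],
}

contrasts = {label: interaction_contrast(L, B, m_vec) for label, m_vec in M.items()}
for label, value in contrasts.items():
    print(f"{label}: {value}")
```

Each value equals the change in Group 1 minus the change in Group 2 over the corresponding pair of time points; for instance, the first is (10 − 12) − (10 − 11) = −1.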
Example of RM-ANOVA and interaction contrast testing
Materials and Methods
The hypothetical data provided in the supplement (Supplementary Material 2) consist of values measured at three time points in two groups. These data were statistically analyzed using RM-ANOVA with IBM SPSS Statistics for Windows (Version 20.0, IBM Corp.). To enter the syntax, select New from the File tab, then select Syntax. To run the syntax, select the desired commands and press Ctrl+R. The syntax for performing RM-ANOVA and interaction contrast tests, with an explanation after each "/*", is as follows:
GLM Time1 Time2 Time3 BY Group /* Analyze temporal changes in the outcome variable across two groups */
  /WSFACTOR=Time 3 Polynomial /* Within-subject factor 'Time' has 3 levels */
  /METHOD=SSTYPE(3)
  /PRINT=DESCRIPTIVE /* Calculate mean and standard deviation */
  /CRITERIA=ALPHA(.05) /* Significance level is 0.05 */
  /WSDESIGN=Time
  /LMATRIX Group 1 -1 /* Contrast coefficient matrix is (1, -1) */
  /MMATRIX
    "Time 1 vs Time 2" ALL 1 -1 0 ; /* Transformation matrix for Time 1 vs Time 2 is (1, -1, 0) */
    "Time 1 vs Time 3" ALL 1 0 -1 ; /* Transformation matrix for Time 1 vs Time 3 is (1, 0, -1) */
    "Time 2 vs Time 3" ALL 0 1 -1 ; /* Transformation matrix for Time 2 vs Time 3 is (0, 1, -1) */
  /DESIGN=Group.
With a significance level of 0.05, a P value <0.05 is considered significant.
Results
The means and standard deviations (SDs) of the outcome variable measured from Time 1 to Time 3 in each group are shown in Table 3. When the results are plotted (Fig. 5), a difference in the slopes of the outcome variable between the two groups over time is visible, but further statistical analysis is required to determine whether this difference is significant.
Mauchly’s test of sphericity
Given that there are three measurement time points for the dependent variable, RM-ANOVA requires the sphericity assumption to be satisfied, which means that the variances of the differences between all pairs of time points must be equal. Mauchly’s test was performed to assess this. The P value was 0.031, which was less than 0.05; therefore, the data violated the sphericity assumption. However, because the violation was not severe (ε > 0.75), a univariate analysis with adjusted df for the F-statistic could be performed. Because the Greenhouse-Geisser ε of 0.897 is greater than 0.75, the df were corrected using the Huynh-Feldt ε of 0.9398.
Test of group × time interaction effect
Table 4 shows the results for the group × time interaction effect from the SPSS output. The mean square was calculated by dividing the sum of squares for the interaction effect and the error by their respective Huynh-Feldt ε-corrected df. The F-statistic was then obtained by dividing the mean square for the interaction by the mean square for error. The corresponding P value was 0.019, confirming a significant interaction between group and time. Therefore, interaction contrast testing should be performed to explore between which time points the interactions occur.
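The df correction described above can be sketched as follows. The Huynh-Feldt ε of 0.9398, the two groups, and the three time points come from the text; the total of 60 subjects assumes 30 per group, as in the worked t-test, and the sums of squares are hypothetical placeholders because Table 4 is not reproduced here. The function names are illustrative.

```python
def corrected_df(n_subjects, n_groups, n_times, epsilon):
    """Epsilon-corrected df for the group x time interaction and its error term."""
    df_interaction = (n_groups - 1) * (n_times - 1) * epsilon
    df_error = (n_subjects - n_groups) * (n_times - 1) * epsilon
    return df_interaction, df_error


def f_statistic(ss_interaction, ss_error, df_interaction, df_error):
    """F = mean square for the interaction divided by mean square for error."""
    return (ss_interaction / df_interaction) / (ss_error / df_error)


# Values from the text: Huynh-Feldt epsilon, 2 groups, 3 time points.
# Assumption: 60 subjects in total (30 per group).
df_int, df_err = corrected_df(n_subjects=60, n_groups=2, n_times=3, epsilon=0.9398)
print(f"corrected df: interaction = {df_int:.4f}, error = {df_err:.4f}")

# Hypothetical sums of squares (NOT from Table 4), used only to show the calculation.
ss_interaction, ss_error = 40.0, 500.0
f_corrected = f_statistic(ss_interaction, ss_error, df_int, df_err)
f_uncorrected = f_statistic(ss_interaction, ss_error, 2, 116)
print(f"F with corrected df = {f_corrected:.4f}, with uncorrected df = {f_uncorrected:.4f}")
```

Because both df are multiplied by the same ε, the F-statistic itself is unchanged; the correction affects the P value only through the smaller df of the reference F distribution.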
Test of group × time interaction contrast
First, to calculate the group × time interaction contrast between Time 1 and Time 2, the mean ($\bar{X}_{11} - \bar{X}_{12}$) and variance ($S^2_{X_{11} - X_{12}}$) of the difference in the value between Time 1 and Time 2 in Group 1 ($X_{11} - X_{12}$) and the mean ($\bar{X}_{21} - \bar{X}_{22}$) and variance ($S^2_{X_{21} - X_{22}}$) of the difference in the value between Time 1 and Time 2 in Group 2 ($X_{21} - X_{22}$) should be calculated.
Given that the P value in Levene’s test is 0.879, we can assume equal variances for $X_{11} - X_{12}$ and $X_{21} - X_{22}$. To calculate the t-statistic, we first need to calculate the pooled variance ($S_p^2$), where $n_1 = n_2 = 30$ are the numbers of subjects in Groups 1 and 2:
$$S_p^2 = \frac{(n_1 - 1)S^2_{X_{11} - X_{12}} + (n_2 - 1)S^2_{X_{21} - X_{22}}}{n_1 + n_2 - 2}$$
Therefore, the t-statistic is
$$t = \frac{(\bar{X}_{11} - \bar{X}_{12}) - (\bar{X}_{21} - \bar{X}_{22})}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
For these data, the absolute value of the t-statistic is 1.467. In the t-distribution with 30 + 30 − 2 = 58 df, the probability of obtaining an absolute value of the t-statistic greater than 1.467 ($\Pr[|T_{58}| > 1.467]$) is 0.148, which is greater than the significance level of 0.05. Therefore, the group × time interaction contrast between Time 1 and Time 2 is not significant, providing no evidence of an interaction between group and time in this interval. Testing the group × time interaction contrasts between Time 2 and Time 3 and between Time 1 and Time 3 in the same manner showed that the group × time interaction effect is significant only between Time 1 and Time 3 and not for the other time differences (Time 1 vs. Time 2, and Time 2 vs. Time 3) (Table 5).
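The steps above (difference scores within each group, pooled variance, t-statistic) can be sketched with the Python standard library. The measurements below are hypothetical and far smaller than the article's dataset, and the function name is illustrative; the P value would then be read from a t distribution with the returned df, which is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def interaction_contrast_t(diff_group1, diff_group2):
    """Pooled-variance two-sample t-statistic on within-subject difference scores."""
    n1, n2 = len(diff_group1), len(diff_group2)
    # Pooled variance of the difference scores (equal variances assumed).
    pooled_variance = (
        (n1 - 1) * variance(diff_group1) + (n2 - 1) * variance(diff_group2)
    ) / (n1 + n2 - 2)
    t_stat = (mean(diff_group1) - mean(diff_group2)) / sqrt(
        pooled_variance * (1 / n1 + 1 / n2)
    )
    return t_stat, n1 + n2 - 2


# Hypothetical measurements (NOT the article's data): one value per subject.
group1_time1, group1_time2 = [5.0, 6.0, 7.0, 8.0], [6.0, 8.0, 9.0, 11.0]
group2_time1, group2_time2 = [5.0, 6.0, 7.0, 8.0], [5.0, 7.0, 7.0, 9.0]

# Within-subject differences, Time 1 minus Time 2, computed separately in each group.
diff1 = [a - b for a, b in zip(group1_time1, group1_time2)]
diff2 = [a - b for a, b in zip(group2_time1, group2_time2)]

t_stat, df = interaction_contrast_t(diff1, diff2)
print(f"t = {t_stat:.3f} with {df} df")
```

Working on difference scores is what makes an independent two-sample test appropriate here: each subject contributes a single difference, so the two sets of differences come from separate groups of subjects.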
Advantages and disadvantages of RM-ANOVA, generalized estimating equations, and mixed effect models
RM-ANOVA can separate out and remove the variation attributable to individual differences between subjects because each subject is measured repeatedly. Therefore, well-designed studies can obtain substantial data even with relatively small sample sizes and have greater power to detect an effect. However, RM-ANOVA also has limitations in its application. First, obtaining data that meet the following basic assumptions required for the analysis [7] can be challenging:
1. The dependent variable must be continuous.
2. Samples must be randomly drawn from the population, and the dependent variable must be independent between subjects.
3. The dependent variable must follow a normal distribution, and there should be no outliers.
4. If there are three or more measurement time points, the variances of the differences between all pairs of repeated measures must be equal, that is, sphericity must be satisfied.
Furthermore, when collecting data repeatedly from the same subjects, compliance tends to decrease as the number of repetitions increases, leading to more data loss. Dropouts can occur if subjects experience sufficient effects and no longer require treatment or, conversely, if they perceive no effect and withdraw. In such cases, it is not simply a matter of sample reduction but the possibility of introducing bias regarding treatment effects. Thus, significant effort is needed from the design phase to the study’s end to minimize dropouts. Nevertheless, missing values can occur during repeated measurements, and unfortunately, RM-ANOVA cannot use data from a subject if even a single measurement is missing.
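The cost of this restriction can be illustrated with a short sketch that keeps only subjects with complete measurements. The subject identifiers and values are hypothetical, with `None` marking a missing measurement.

```python
# Hypothetical repeated measurements (NOT from the article); None marks a missing value.
records = {
    "S01": [120, 118, 115],
    "S02": [130, None, 121],
    "S03": [125, 124, None],
    "S04": [118, 117, 116],
}

# Keep only subjects whose measurements are all present (complete cases).
complete_cases = {
    subject: values
    for subject, values in records.items()
    if all(value is not None for value in values)
}

n_excluded_subjects = len(records) - len(complete_cases)
n_observed = sum(value is not None for values in records.values() for value in values)
n_used = sum(len(values) for values in complete_cases.values())

print(f"subjects excluded: {n_excluded_subjects} of {len(records)}")
print(f"observed measurements discarded: {n_observed - n_used} of {n_observed}")
```

In this toy example, two missing values lead to the loss of two subjects and four measurements that were actually observed.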
Conversely, generalized estimating equations (GEEs) and mixed-effects models can use data from subjects with missing values during a series of repeated measurements [8], minimizing data loss. They have been reported to be particularly advantageous over RM-ANOVA when missing values exist in small sample sizes [9]. Ma et al. [10], through a simulation study, reported that GEEs and mixed-effects models achieved higher power than did RM-ANOVA with smaller sample sizes or fewer repeated measurements, for both complete and missing data. Furthermore, these methods can analyze data where correlations between repeated measures are not equal [8]. GEE, in particular, can be used to analyze non-continuous outcome variables, such as binary outcomes, using link functions and does not assume a normal distribution, making it applicable to non-normally distributed data or data with unknown distributions [8]. If it is anticipated that the data collected from repeated measurements will not satisfy the assumptions of RM-ANOVA, the use of GEE or mixed-effects models should be planned in advance during the study design phase [11]. Using these as substitutes for RM-ANOVA later, without considering them in the design phase, is not advisable.
Conclusions
Among the between-subject effect, within-subject effect, and their interaction shown by RM-ANOVA for analyzing temporally ordered data, the interaction effect, which evaluates the temporal change of the outcome variable according to groups, is crucial. If the interaction effect is significant, analyzing interaction contrasts can reveal between which time points the interaction effect exists. Therefore, researchers must use RM-ANOVA appropriately, fully understanding its assumptions and limitations, and choose alternative analysis methods when RM-ANOVA is not applicable.