- Classification (or Type) of Multiple Comparison: Single-step versus Stepwise Procedures
As mentioned earlier, repeated testing on the same groups results in the serious problem known as α inflation. Numerous MCT methods have therefore been developed in statistics over the years.^{2)} Most researchers in the field are interested in understanding the differences between relevant groups. These groups could be all pairs in the experiment, one control and the other groups, or one subgroup of two or more groups compared with another subgroup of experimental groups. Irrespective of the type of pairs to be compared, all post hoc subgroup comparison methods should be applied only when the overall ANOVA result is significant.^{3)}
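To make the α inflation mentioned above concrete, the following minimal sketch computes the probability of at least one false positive when every pair of groups is tested at an unadjusted α of 0.05. It assumes the tests are independent, which pairwise comparisons sharing a group are not, so the numbers are an approximation; the function names are illustrative only.

```python
from math import comb


def n_pairwise(n_groups):
    """Number of distinct pairs among n_groups groups."""
    return comb(n_groups, 2)


def familywise_error_rate(alpha, n_tests):
    """P(at least one type I error) over n_tests tests, ASSUMING independence."""
    return 1.0 - (1.0 - alpha) ** n_tests


for g in (3, 4, 5):
    k = n_pairwise(g)
    print(g, k, round(familywise_error_rate(0.05, k), 3))
# 3 groups ->  3 tests -> 0.143
# 4 groups ->  6 tests -> 0.265
# 5 groups -> 10 tests -> 0.401
```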
Usually, MCTs are categorized into two classes: single-step and stepwise procedures. Stepwise procedures are further divided into step-up and step-down methods. This classification depends on the method used to handle type I error. As indicated by its name, a single-step procedure assumes one hypothetical type I error rate. Under this assumption, almost all pairwise comparisons (multiple hypotheses) are tested using one critical value. In other words, every comparison is carried out independently of the outcomes of the others. A typical example is Fisher’s least significant difference (LSD) test. Other examples are the Bonferroni, Sidak, Scheffé, Tukey, Tukey-Kramer, Hochberg’s GT2, Gabriel, and Dunnett tests.
The stepwise procedure handles type I error according to previously selected comparison results; that is, it processes pairwise comparisons in a predetermined order, and each comparison is performed only when the previous comparison result is statistically significant. In general, this method improves the statistical power of the process while preserving the type I error rate throughout. Among the comparison test statistics, the most significant test (for step-down procedures) or the least significant test (for step-up procedures) is identified first, and comparisons are performed successively as long as the previous test result is significant. In a step-down procedure, if one comparison fails to reject a null hypothesis, testing stops and none of the remaining comparisons is declared significant. This method does not determine the same level of significance as single-step methods; rather, it classifies all relevant groups into statistically similar subgroups. The stepwise methods include the Ryan-Einot-Gabriel-Welsch Q (REGWQ), Ryan-Einot-Gabriel-Welsch F (REGWF), Student-Newman-Keuls (SNK), and Duncan tests. These methods have different uses. For example, the SNK test starts by comparing the two groups with the largest difference; the two groups with the second largest difference are compared only if the prior comparison showed a significant difference. Such methods are therefore called step-down methods, because the extent of the differences is reduced as the comparisons proceed. Note that the critical value for comparison varies for each pair; that is, it depends on the range of mean differences between groups. The smaller the range of comparison, the smaller the critical value for the range; hence, although the power increases, the probability of type I error also increases.
All the aforementioned methods can be used only when the equal variance assumption holds. If the equal variance assumption is violated during the ANOVA process, pairwise comparisons should be based on Tamhane’s T2, Dunnett’s T3, Games-Howell, or Dunnett’s C tests.
- Tukey method
This test uses pairwise post-hoc testing to determine whether there is a difference between the means of all possible pairs, using a studentized range distribution. This method tests every possible pair of all groups. Initially, the Tukey test was called the ‘Honestly significant difference’ test, or simply the ‘T test,’^{4)} because this method was based on the t-distribution. Note that the Tukey test assumes the same sample counts between groups (balanced data), as in ANOVA. Subsequently, Kramer modified this method so that it could be applied to unbalanced data, and it became known as the Tukey-Kramer test. This method uses the harmonic mean of the cell sizes of the two groups being compared. The statistical assumptions of ANOVA also apply to the Tukey method.^{5)}
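The Tukey-Kramer statistic described above can be sketched as follows. This is a minimal illustration, assuming the usual one-way ANOVA mean squared error (MSE) as the pooled variance and the harmonic mean of the two group sizes in the standard error; the data and the function name are hypothetical. The resulting q values would still have to be compared against a studentized range critical value (with k groups and N − k degrees of freedom) taken from a table or statistical software.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def tukey_kramer_q(groups):
    """Return {(name_a, name_b): q} for every pair of groups.

    groups maps a group name to its list of observations.
    """
    k = len(groups)
    n_total = sum(len(values) for values in groups.values())
    means = {name: mean(values) for name, values in groups.items()}
    # Pooled within-group variance (MSE of the one-way ANOVA).
    sse = sum((x - means[name]) ** 2
              for name, values in groups.items() for x in values)
    mse = sse / (n_total - k)

    q = {}
    for a, b in combinations(groups, 2):
        # Harmonic mean of the two cell sizes (equals n when balanced).
        n_h = 2.0 / (1.0 / len(groups[a]) + 1.0 / len(groups[b]))
        q[(a, b)] = abs(means[a] - means[b]) / sqrt(mse / n_h)
    return q


# Hypothetical, unbalanced toy data.
toy = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [2, 3, 4, 5]}
q_values = tukey_kramer_q(toy)
for pair, value in q_values.items():
    print(pair, round(value, 3))
```

With balanced data the harmonic mean reduces to the common group size, so the same function covers the original Tukey test.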
Fig. 2 depicts example results of one-way ANOVA and the Tukey test for multiple comparisons. According to this figure, the Tukey test is performed with one critical level, as described earlier, and the results of all pairwise comparisons are presented in one table under the section ‘post-hoc test.’ The results indicate that groups A and B are different, whereas groups A and C are not different and groups B and C are also not different. These odd results carry over to the last table, named ‘Homogeneous subsets.’ Groups A and C are similar and groups B and C are also similar; however, groups A and B are different. An inference of this type differs from syllogistic reasoning. In mathematics, if A = B and B = C, then A = C. However, in statistics, when A = B and B = C, A is not necessarily the same as C, because all these results are probabilistic outcomes. Such contradictory results can originate from inadequate statistical power, that is, a small sample size. The Tukey test is a generous (less conservative) method for detecting differences in pairwise comparisons; to avoid this illogical result, an adequate sample size should be guaranteed, which gives rise to smaller standard errors and increases the probability of rejecting the null hypothesis.
- Bonferroni method: α splitting (Dunn’s method)
The Bonferroni method can be used to compare different groups at the baseline, study the relationship between variables, or examine one or more endpoints in clinical trials. It is applied as a post-hoc test in many statistical procedures such as ANOVA and its variants, including analysis of covariance (ANCOVA) and multivariate ANOVA (MANOVA); multiple t-tests; and Pearson’s correlation analysis. It is also used in several nonparametric tests, including the Mann-Whitney U test, Wilcoxon signed rank test, and Kruskal-Wallis test by ranks [4], and as a test for categorical data, such as the Chi-squared test. When used as a post hoc test after ANOVA, the Bonferroni method uses thresholds based on the t-distribution; it is more rigorous than the Tukey test, which tolerates type I errors, and more generous than the very conservative Scheffé’s method.
However, it also has disadvantages, since it is unnecessarily conservative (with weak statistical power). The adjusted α is often smaller than required, particularly if there are many tests and/or the test statistics are positively correlated. Therefore, this method often fails to detect real differences. If the proposed study requires that type II error be avoided and possible effects not be missed, the Bonferroni correction should not be used. Rather, a more liberal method such as Fisher’s LSD, which does not control the family-wise error rate (FWER), should be used.^{6)} Another alternative to the overly conservative Bonferroni correction is a stepwise (sequential) method, such as the Bonferroni-Holm or Hochberg method, both of which are less conservative than the Bonferroni test [5].
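The difference between the single-step Bonferroni correction and the sequential alternatives named above can be sketched on a list of p-values. This is a minimal illustration of the standard Bonferroni, Holm (step-down), and Hochberg (step-up) decision rules, not code from the article; the p-values are hypothetical.

```python
def bonferroni(pvals, alpha=0.05):
    """Single-step: every p-value is compared with alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down: start at the smallest p-value, stop at the first non-rejection."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


def hochberg(pvals, alpha=0.05):
    """Step-up: start at the largest p-value; the first rejection
    also rejects every hypothesis with a smaller p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (rank + 1):
            for j in order[rank:]:
                reject[j] = True
            break
    return reject


pvals = [0.012, 0.015, 0.040, 0.300]  # hypothetical p-values
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]
print(hochberg(pvals))    # [True, True, False, False]
```

On this toy input the sequential procedures reject one hypothesis more than Bonferroni at the same nominal α, which is the sense in which they are less conservative.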
- Dunnett method
This is a particularly useful method to analyze studies having control groups, based on modified t-test statistics (Dunnett’s t-distribution). It is a powerful statistic and, therefore, can discover relatively small but significant differences among groups or combinations of groups. The Dunnett test is used by researchers interested in testing two or more experimental groups against a single control group. However, the Dunnett test has the disadvantage that it does not compare the groups other than the control group among themselves at all.
As an example, suppose there are three experimental groups A, B, and C, in which an experimental drug is used, and a control group in a study. In the Dunnett test, a comparison of control group with A, B, C, or their combinations is performed; however, no comparison is made between the experimental groups A, B, and C. Therefore, the power of the test is higher because the number of tests is reduced compared to the ‘all pairwise comparison.’
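The reduction in the number of tests can be counted directly. The short sketch below uses the four groups of the example above (the labels are placeholders) and considers only the simple comparisons of each experimental group with the control.

```python
from itertools import combinations

groups = ["control", "A", "B", "C"]  # placeholder labels from the example

# Every possible pair, as in an 'all pairwise comparison' procedure.
all_pairwise = list(combinations(groups, 2))

# Many-to-one comparisons: each experimental group against the control only.
many_to_one = [("control", g) for g in groups if g != "control"]

print(len(all_pairwise))  # 6
print(len(many_to_one))   # 3
```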
On the other hand, the Dunnett method is capable of ‘two-tailed’ or ‘one-tailed’ testing, which makes it different from other pairwise comparison methods. For example, if the effect of a new drug is not known at all, the two-tailed test should be used to confirm whether the effect of the new drug is better or worse than that of a conventional control. Subsequently, once the direction of the effect is known, a one-sided test is required to compare the new drug and the control. Since a two-sided or one-sided test can be performed according to the situation, the Dunnett method can be used in either setting.
- Scheffé’s method: exploratory post-hoc method
Scheffé’s method is not a simple pairwise comparison test. Based on the F-distribution, it is a method for performing simultaneous, joint pairwise comparisons for all possible pairwise combinations of each group mean [6]. It controls the FWER after considering every possible pairwise combination, whereas the Tukey test controls the FWER when only all pairwise comparisons are made.^{7)} This is why Scheffé’s method is more conservative than other methods and has little power to detect differences. Since Scheffé’s method generates hypotheses based on all possible comparisons to confirm significance, this method is preferred when the theoretical background for differences between groups is unavailable or previous studies have not been completely implemented (exploratory data analysis). The hypotheses generated in this manner should be tested by subsequent studies that are specifically designed to test new hypotheses. This is important in exploratory data analysis or the theoretic testing process (e.g., if a type I error is likely to occur in this type of study and the differences should be identified in subsequent studies). Follow-up studies testing specific subgroup contrasts discovered through the application of Scheffé’s method should use Bonferroni methods, which are appropriate for theoretical test studies. It is further noted that Bonferroni methods are less sensitive to type I errors than Scheffé’s method. Finally, Scheffé’s method enables simple or complex averaging comparisons in both balanced and unbalanced data.
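A Scheffé-type test of a simple or complex contrast can be sketched as follows. This is a minimal illustration under the usual textbook formulation: the squared contrast estimate is divided by its estimated variance and by k − 1, and the result is compared with the critical value of the F-distribution with k − 1 and N − k degrees of freedom. The critical value is not computed here and would have to be taken from a table or statistical software; the data and function name are hypothetical.

```python
from statistics import mean


def scheffe_statistic(groups, coeffs):
    """Return (contrast F-ratio) / (k - 1) for the contrast sum(c_i * mean_i).

    groups is a list of samples; coeffs are contrast coefficients summing to 0.
    Compare the result with the F critical value on (k - 1, N - k) df.
    """
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    # Pooled within-group variance (MSE of the one-way ANOVA).
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mse = sse / (n_total - k)
    estimate = sum(c * m for c, m in zip(coeffs, means))
    variance_of_estimate = mse * sum(c ** 2 / len(g)
                                     for c, g in zip(coeffs, groups))
    return estimate ** 2 / variance_of_estimate / (k - 1)


data = [[1, 2, 3], [4, 5, 6], [2, 3, 4, 5]]  # hypothetical, unbalanced

# Simple (pairwise) contrast: group 1 versus group 2.
simple_contrast = scheffe_statistic(data, [1, -1, 0])
# Complex contrast: group 1 versus the average of groups 2 and 3.
complex_contrast = scheffe_statistic(data, [1, -0.5, -0.5])
print(round(simple_contrast, 3), round(complex_contrast, 3))
```

Because the same critical value protects every possible contrast, a pairwise comparison tested this way faces a stricter threshold than it would under the Tukey test.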
- Violation of the assumption of equivalence of variance
One-way ANOVA is performed only in cases where the assumption of equivalence of variance holds. However, it is a robust statistic that can be used even when there is a deviation from the equivalence assumption. In such cases, the Games-Howell, Tamhane’s T2, Dunnett’s T3, and Dunnett’s C tests can be applied.
The Games-Howell method is an improved version of the Tukey-Kramer method and is applicable in cases where the equivalence of variance assumption is violated. It is a t-test using Welch’s degrees of freedom. This method uses a strategy for controlling the type I error for the entire set of comparisons and is known to maintain the preset significance level even when the sample sizes differ. However, the smaller the number of samples in each group, the more tolerant the type I error control becomes. Thus, this method can be applied when the number of samples is six or more.
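The two ingredients named above, a t statistic with separate variances and Welch’s degrees of freedom, can be sketched for a single pair of groups. This is a minimal illustration with hypothetical data; in the Games-Howell test the value |t|·√2 would then be compared with a studentized range critical value for k groups and the computed degrees of freedom, which is not done here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_and_df(x, y):
    """Unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of mean(x)
    vy = variance(y) / ny  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical samples with clearly unequal variances (six per group).
group_x = [1, 2, 3, 4, 5, 6]
group_y = [2, 4, 6, 8, 10, 12]
t_stat, welch_df = welch_t_and_df(group_x, group_y)
print(round(t_stat, 3), round(welch_df, 3))
```

Note that the degrees of freedom fall well below the pooled value of 10, which is how the procedure accounts for the unequal variances.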
Tamhane’s T2 method gives a test statistic using the t-distribution by applying the concept of ‘multiplicative inequality’ introduced by Sidak. Sidak’s multiplicative inequality theorem implies that the probability of the intersection of the events is greater than or equal to the product of the probabilities of the individual events. Compared to the Games-Howell method, Sidak’s theorem provides a more rigorous multiple comparison method by adjusting the significance level. In other words, it is more conservative in type I error control. In contrast, Dunnett’s T3 method does not use the t-distribution but uses the studentized maximum modulus distribution, which always provides a narrower CI than T2. The degrees of freedom are calculated using the Welch method, as in Games-Howell or T2. Dunnett’s T3 test is understood to be more appropriate than the Games-Howell test when the number of samples in each group is less than 50. Note that Dunnett’s C test uses the studentized range distribution, which generates a slightly narrower CI than the Games-Howell test for a sample size of 50 or more in the experimental group; however, the power of Dunnett’s C test is better than that of the Games-Howell test.
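Returning to Sidak’s multiplicative inequality, its practical consequence is an adjusted per-comparison significance level. The sketch below assumes the standard Sidak adjustment, 1 − (1 − α)^(1/k) for k comparisons, and contrasts it with the Bonferroni level α/k; the function names are illustrative only.

```python
def sidak_alpha(alpha, n_comparisons):
    """Per-comparison level that keeps 1 - (1 - a)^k at the nominal alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison level under simple alpha splitting."""
    return alpha / n_comparisons


for k in (3, 6, 10):
    print(k, round(sidak_alpha(0.05, k), 5), round(bonferroni_alpha(0.05, k), 5))
```

The Sidak level is always slightly larger than the Bonferroni level for the same number of comparisons, so it is marginally less strict while still bounding the overall error rate under the inequality’s assumptions.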