### Multiple Comparison Test and Its Imitations

When the null hypothesis (H₀: μA = μB = μC, in the case of three groups) is rejected after ANOVA, we still do not know which group differs from which. The result of ANOVA does not provide detailed information regarding the differences among the various combinations of groups. Therefore, researchers usually perform additional analyses to clarify the differences between particular pairs of experimental groups. If the null hypothesis (H₀) is rejected in the ANOVA for the three groups, the following cases are possible: μA ≠ μB = μC, μB ≠ μA = μC, μC ≠ μA = μB, or μA ≠ μB ≠ μC (all three means differ).

### Meaning of P value and α Inflation

The P value is the probability of obtaining the observed result, or one more extreme, given that the null hypothesis (H₀) is true. It is a random variable defined on the sample space of the experiment and takes values between 0 and 1.

Type I error refers to a false positive, in which H₀ is statistically rejected even though it is actually true, whereas type II error refers to a false negative, in which H₀ is statistically accepted even though it is actually false (Table 1). In the situation of comparing three groups, they may form the following three pairs: group 1 versus group 2, group 2 versus group 3, and group 1 versus group 3. A set of comparisons such as this is called a ‘family.’ The type I error that occurs when the members of a family are compared is called the ‘family-wise error’ (FWE); a multiple comparison method is, in essence, a method developed to appropriately control the FWE. α inflation occurs when the same (unadjusted) significance level is applied to the statistical analyses of one family and other families simultaneously [2]. For example, suppose one performs a Student’s *t*-test between two given groups A and B at a 5% α level and obtains a nonsignificant result; the probability of correctly retaining H₀ (the hypothesis that groups A and B are the same) is then 95%. Now consider another group, group C, which we want to compare with groups A and B. If one performs another Student’s *t*-test between groups B and C and its result is also nonsignificant, the real probability that both comparisons (A versus B, and B versus C) are correctly nonsignificant is 0.95 × 0.95 = 0.9025, i.e., 90.25%, and, consequently, the overall testing α error is 1 − 0.9025 = 0.0975, not 0.05. Likewise, if the statistical analysis between groups A and C also yields a nonsignificant result, the probability that all three pairs (families) are correctly nonsignificant is 0.95 × 0.95 × 0.95 = 0.857, and the actual testing α error is 1 − 0.857 = 0.143, which is more than 14%.

In general, when n independent comparisons are each tested at level α, the probability of falsely rejecting at least one H₀ inflates to 1 − (1 − α)^n (equation 1), growing according to the number of comparisons.
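The α inflation described above can be computed directly. A minimal sketch (the function name is ours, not from the paper):

```python
def familywise_alpha(alpha, n_comparisons):
    """Probability of at least one false positive (type I error) when
    n independent comparisons are each tested at significance level alpha."""
    return 1 - (1 - alpha) ** n_comparisons

# Reproduce the figures from the text: two and three unadjusted comparisons.
print(familywise_alpha(0.05, 2))  # ~0.0975
print(familywise_alpha(0.05, 3))  # ~0.1426, i.e., more than 14%
```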

### Classification (or Type) of Multiple Comparison: Single-step versus Stepwise Procedures

Most researchers in the field are interested in understanding the differences between relevant groups. These could be all pairs of groups in the experiment, one control group versus the other groups, or two or more groups (one subgroup) versus other experimental groups (another subgroup). Irrespective of the type of pairs to be compared, all post hoc pairwise comparison methods should be applied only when the overall ANOVA result is significant.


### Tukey method

The Tukey method (the honestly significant difference test) compares all possible pairs of group means; its critical values come from the studentized range distribution, which is closely related to the t-distribution. Note that, like ANOVA, the Tukey test assumes equal sample sizes across groups (balanced data). Subsequently, Kramer modified the method to apply it to unbalanced data, and it became known as the Tukey-Kramer test. This method uses the harmonic mean of the sample sizes of the two groups being compared. The statistical assumptions of ANOVA apply to the Tukey method as well.
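As a small illustration of the Tukey-Kramer adjustment for unbalanced data, the "effective" cell size for a pair of groups is the harmonic mean of their sample sizes (a sketch; the variable names are ours):

```python
from statistics import harmonic_mean

# Unbalanced comparison: group A has 10 subjects, group B has 40.
n_a, n_b = 10, 40

# Tukey-Kramer replaces the common cell size of the balanced Tukey test
# with the harmonic mean of the two group sizes being compared.
n_effective = harmonic_mean([n_a, n_b])
print(n_effective)  # 16 — pulled toward the smaller group, not the average of 25
```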


### Bonferroni method: α splitting (Dunn’s method)

The Bonferroni correction can be applied not only after ANOVA but also to nonparametric tests, such as the Mann-Whitney *U* test, the Wilcoxon signed rank test, and the Kruskal-Wallis test by ranks [4], and to tests for categorical data, such as the Chi-squared test. When used as a post hoc test after ANOVA, the Bonferroni method uses thresholds based on the t-distribution; it is more rigorous than the Tukey test, which tolerates type I errors, and more generous than the very conservative Scheffé’s method.

An alternative for when the Bonferroni correction yields overly conservative results is to use a stepwise (sequential) method, such as the Bonferroni-Holm and Hochberg procedures, which are less conservative than the Bonferroni test [5].
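The Bonferroni-Holm step-down idea can be sketched in a few lines (our own illustrative implementation, not code from the paper): the sorted P values are tested against increasingly generous thresholds α/m, α/(m−1), ..., and testing stops at the first failure.

```python
def holm_reject(pvalues, alpha=0.05):
    """Bonferroni-Holm step-down procedure.
    Returns a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])  # indices, smallest P first
    reject = [False] * m
    for step, k in enumerate(order):
        if pvalues[k] <= alpha / (m - step):  # alpha/m, then alpha/(m-1), ...
            reject[k] = True
        else:
            break  # once one test fails, all larger P values are retained
    return reject

# Plain Bonferroni (0.05/3 ≈ 0.0167) would reject only the first of these,
# but the less conservative Holm procedure rejects all three.
print(holm_reject([0.01, 0.02, 0.03]))  # [True, True, True]
```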

### Dunnett method

The Dunnett method compares each experimental group with a single control group using modified *t*-test statistics (Dunnett’s t-distribution). It is a powerful test and can therefore detect relatively small but significant differences among groups or combinations of groups. The Dunnett test is used by researchers interested in testing two or more experimental groups against a single control group. However, it has the disadvantage that it does not compare the experimental groups with one another at all.

### Scheffé’s method: exploratory post-hoc method

Scheffé’s method controls the family-wise error over all possible contrasts, including complex combinations of group means; this is why it is more conservative than the other methods and has less power to detect differences. Since Scheffé’s method generates hypotheses based on all possible comparisons to confirm significance, it is preferred when a theoretical background for differences between groups is unavailable or when previous studies have not been fully implemented (exploratory data analysis). The hypotheses generated in this manner should be tested by subsequent studies specifically designed to test the new hypotheses. This matters in exploratory data analysis and in the subsequent theory-testing process (e.g., a type I error is likely to occur in this type of study, and the differences should be confirmed in subsequent studies). Follow-up studies testing specific subgroup contrasts discovered through the application of Scheffé’s method should use Bonferroni-type methods, which are appropriate for theory-testing studies. Note, further, that Bonferroni methods are less sensitive to type I errors than Scheffé’s method. Finally, Scheffé’s method enables simple or complex mean comparisons in both balanced and unbalanced data.

### Violation of the assumption of equivalence of variance

When the assumption of equal variances is violated, the Games-Howell method can be used; it modifies the Tukey-Kramer approach by applying a *t*-test with Welch’s degrees of freedom. This method uses a strategy for controlling the type I error across the entire set of comparisons and is known to maintain the preset significance level even when the sample sizes differ. However, the smaller the number of samples in each group, the more liberal the type I error control becomes. Thus, this method should be applied only when each group contains six or more samples.
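The Welch degrees of freedom mentioned above come from the Welch-Satterthwaite formula, which can be written out directly (a sketch with our own function name):

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximate degrees of freedom for a two-sample
    t-test with unequal variances (var1, var2 are the sample variances)."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# With equal variances and equal sizes it reduces to the usual n1 + n2 - 2:
print(welch_df(4.0, 10, 4.0, 10))  # ~18 (= 10 + 10 - 2)
# A small group with a large variance is penalized with far fewer degrees
# of freedom than the classical n1 + n2 - 2 = 34:
print(welch_df(25.0, 6, 1.0, 30))  # ~5
```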

### Methods for Adjusting P value

### Controlling the family-wise error rate: Bonferroni adjustment

With the Bonferroni adjustment, the significance level α is divided by the number of tests performed, and this lower value becomes the per-test critical value. If H₀ is true for all tests, the probability of obtaining a significant result from this new, lower critical value is 0.05. In other words, if all the null hypotheses, H₀, are true, the probability that the family of tests includes one or more false positives due to chance is 0.05. These methods are usually used when it is important not to make any type I errors at all. The methods belonging to this category are the Bonferroni, Holm, Hochberg, and Hommel adjustments, among others. The Bonferroni method is one of the most commonly used methods for controlling the FWER. As the number of hypotheses tested increases, the type I error increases; therefore, the significance level is divided by the number of hypothesis tests, which lowers the type I error. In other words, the higher the number of hypotheses to be tested, the more stringent the criterion, the lower the probability of type I errors, and the lower the power.

For example, if an experiment involved 50 *t*-tests, one would set the critical value for each *t*-test to 0.05 / 50 = 0.001. Therefore, one should consider a test significant only for P < 0.001, not P < 0.05 (equation 2).
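The α splitting in equation 2 is simple enough to state in code; equivalently, each raw P value can be multiplied by m, capped at 1, and compared with the original α (an illustrative sketch, names ours):

```python
def bonferroni_threshold(alpha, m):
    """Per-test critical value after Bonferroni splitting (equation 2)."""
    return alpha / m

def bonferroni_adjusted_p(pvalues):
    """Equivalent formulation: inflate each P value by m, capped at 1.0,
    and compare the adjusted values against the original alpha."""
    m = len(pvalues)
    return [min(p * m, 1.0) for p in pvalues]

# The 50-test example from the text: the per-test threshold becomes 0.001.
print(bonferroni_threshold(0.05, 50))
# With two tests, a raw P of 0.0004 stays significant; 0.6 is capped at 1.0.
print(bonferroni_adjusted_p([0.0004, 0.6]))  # [0.0008, 1.0]
```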

The Bonferroni adjustment controls the FWER under the universal null hypothesis (H₀) that all tests are not significant. It is therefore appropriate when it is essential to avoid any type I error, or when many tests are performed without a preplanned hypothesis for the purpose of obtaining significant results [8].

### Controlling the false discovery rate: Benjamini-Hochberg adjustment

To apply the Benjamini-Hochberg adjustment, sort the individual P values in ascending order and rank them: the smallest P value has rank *i* = 1, the next smallest has *i* = 2, and so on. Each P value is then compared with its Benjamini-Hochberg critical value, (*i* / m)∙Q (equation 3) (i, rank; m, total number of tests; Q, chosen FDR). The largest P value that satisfies P < (*i* / m)∙Q is significant, and all the P values smaller than it are also significant, even the ones that are not less than their own Benjamini-Hochberg critical values.
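The step-up procedure just described can be sketched directly (an illustrative implementation with our own names; real analyses would typically use a statistics package):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q.
    Returns a list of booleans: True where the hypothesis is significant."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])  # indices, smallest P first
    # Find the largest rank i (1-based) with P_(i) <= (i / m) * q ... equation 3.
    largest = 0
    for rank, k in enumerate(order, start=1):
        if pvalues[k] <= rank / m * q:
            largest = rank
    # That P value and every smaller one are significant, even those that
    # individually exceeded their own critical values.
    significant = [False] * m
    for rank, k in enumerate(order, start=1):
        if rank <= largest:
            significant[k] = True
    return significant

# Ranks 1..4 have critical values 0.0125, 0.025, 0.0375, 0.05 at Q = 0.05.
# P = 0.040 exceeds its own threshold (0.0375) but is still declared
# significant because the larger P = 0.045 passes its threshold (0.05).
print(benjamini_hochberg([0.010, 0.013, 0.040, 0.045]))  # [True, True, True, True]
```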

### Conclusions and Implications

Although the *Korean Journal of Anesthesiology* has not formally examined this view, it is expected that the journal’s position on this subject does not differ significantly from the view expressed in this paper [8]. Therefore, it is important that all authors be aware of the problems posed by multiple comparisons, and further efforts are required to spread awareness of these problems and their solutions.