Korean Journal of Anesthesiology

In and Lee: Alternatives to the P value: connotations of significance

Statistical Round: Published online: April 4, 2024

DOI: https://doi.org/10.4097/kja.23630

Alternatives to the P value: connotations of significance

Junyong In, Dong Kyu Lee

Department of Anesthesiology and Pain Medicine, Dongguk University Ilsan Hospital, Goyang, Korea

Corresponding author: Dong Kyu Lee, M.D., Ph.D., Department of Anesthesiology and Pain Medicine, Dongguk University Ilsan Hospital, 27 Dongguk-ro, Ilsandong-gu, Goyang 10326, Korea
Tel: +82-31-961-7869 Fax: +82-31-961-7864 Email: entopic@dongguk.edu

Received August 20, 2023; Revised November 26, 2023; Accepted March 11, 2024

© The Korean Society of Anesthesiologists, 2024

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: The statistical significance of a clinical trial analysis result is determined by a mathematical calculation and probability based on null hypothesis significance testing. However, statistical significance does not always align with meaningful clinical effects; thus, assigning clinical relevance to statistical significance is unreasonable. A statistical result incorporating a clinically meaningful difference is a better approach to present statistical significance. Thus, the minimal clinically important difference (MCID), which requires integrating minimum clinically relevant changes from the early stages of research design, has been introduced. As a follow-up to the previous statistical round article on P values, confidence intervals, and effect sizes, in this article, we present hands-on examples of MCID and various effect sizes and discuss the terms statistical significance and clinical relevance, including cautions regarding their use.

Keywords: Clinical relevance, Clinical significance, Confidence intervals, Effect size, Minimal clinically important difference, Patient outcome assessment, P value, Statistical significance, Statistics

Introduction

Confidence intervals and effect sizes

Minimal clinically important difference (MCID)

Clinical relevance vs. statistical significance

Conclusion

NOTES

1) In the “Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals,” published by the ICMJE in 2023, clinical relevance is expressed using the term clinical significance. In this article, we limit the use of the term significance to refer to statistical significance (a statistical term) and thus use the term clinical relevance rather than clinical significance.

2) Besides MCID, several terms have been proposed, such as the minimal clinical difference (MCD), minimal clinically important improvement (MCII), and robust clinically important difference (RCID).

3) Triangulation is a measurement technique that uses the properties of triangles. Here, the triangulating method is used to determine a more accurate and reliable MCID by deciding between two different MCID values.

NOTES

Funding

None.

Conflicts of Interest

No potential conflict of interest relevant to this article was reported.

Data Availability

The datasets generated and analyzed for this article are available as Supplementary Material 2.

Author Contributions

Junyong In (Writing – original draft; Writing – review & editing)

Dong Kyu Lee (Conceptualization; Data curation; Formal analysis; Methodology; Software; Validation; Writing – original draft; Writing – review & editing)

Supplementary Materials

Fig. 1.

An example of the Global Assessment Scale (GAS). The term 'moderately' suggests that changes are noticeable but not dramatic. 'Substantially' indicates that the changes are considerable and impactful. 'Significantly' denotes that the changes have far exceeded what was anticipated.

Table 1.

Statistical Methods and Recommended Effect Sizes

Statistical method	Effect size	Calculation	Symbol definitions	Effect size value	Interpretation
Student’s t-test	Cohen’s d	$d = \frac{\bar{X}_T - \bar{X}_C}{S_{pooled}}$	$\bar{X}_T$, $\bar{X}_C$: Means of treatment and control groups	0.2 to < 0.5	Small effect
		$S_{pooled} = \sqrt{\frac{(n_T - 1)s_T^2 + (n_C - 1)s_C^2}{n_T + n_C - 2}}$	$S_{pooled}$: Pooled SD	0.5 to < 0.8	Medium effect
			$n_T$, $n_C$: Sample size of treatment and control groups	≥ 0.8	Large effect
			$s_T$, $s_C$: SD of treatment and control groups
	Correlation coefficient r (BESD)	$r = \sqrt{\frac{t^2}{t^2 + df}}$	$t$: Test statistic	0.1 to < 0.3	Small effect
		$= \sqrt{\frac{d^2}{d^2 + 4}}$	$df$: Degrees of freedom	0.3 to < 0.5	Medium effect
			$d$: Cohen’s d	≥ 0.5	Large effect
Mann-Whitney rank sum test (Mann-Whitney U test)	r	$r = \frac{|z|}{\sqrt{n}}$	$z$: Test statistic	0.1 to < 0.3	Small effect
			$n$: Total sample size	0.3 to < 0.5	Medium effect
				≥ 0.5	Large effect
	Vargha and Delaney’s A	$VDA = \frac{U}{n_1 \times n_2}$	$VDA$: Vargha and Delaney’s A	0.56 to < 0.64, > 0.34 to 0.44	Small effect
			$U$: U statistic	0.64 to < 0.71, > 0.29 to 0.34	Medium effect
			$n_1$, $n_2$: Sample size of each group	≥ 0.71, ≤ 0.29	Large effect
	Cliff’s delta^*	$\delta = 2(VDA - 0.5)$	$\delta$: Cliff’s delta	0.11 to < 0.28	Small effect
			$VDA$: Vargha and Delaney’s A	0.28 to < 0.43	Medium effect
				≥ 0.43	Large effect
Paired t-test	Cohen’s d	$d = \frac{\bar{X}_{after} - \bar{X}_{before}}{S_{difference}}$	$\bar{X}_{after}$, $\bar{X}_{before}$: Mean of observations before and after	0.2 to < 0.5	Small effect
			$S_{difference}$: SD from differences in observations	0.5 to < 0.8	Medium effect
				≥ 0.8	Large effect
Wilcoxon signed rank test (Wilcoxon Z test)	Matched-pairs rank biserial correlation coefficient	$r_c = \frac{4\,|T - 0.5(R_+ + R_-)|}{N(N + 1)}$	$r_c$: Biserial correlation coefficient	0.1 to < 0.3	Small effect
			$T$: Smaller value between $R_+$ and $R_-$	0.3 to < 0.5	Medium effect
			$R_+$, $R_-$: Sum of ranks with a positive or a negative sign	≥ 0.5	Large effect
			$N$: Number of pairs
	r	$r = \frac{z}{\sqrt{N}}$	$z$: Test statistic	0.1 to < 0.3	Small effect
			$N$: Number of pairs	0.3 to < 0.5	Medium effect
				≥ 0.5	Large effect
ANOVA	Coefficient η²	$\eta^2 = \frac{SS_{ef}}{SS_t}$	$SS_{ef}$: Sum of squares for the effect^†	0.02 to < 0.13	Small effect
			$SS_t$: Total sum of squares	0.13 to < 0.26	Medium effect
				≥ 0.26	Large effect
	Partial η²	$\eta^2_p = \frac{SS_{ef}}{SS_{ef} + SS_{er}}$	$\eta^2_p$: Partial eta-squared	0.02 to < 0.13	Small effect
			$SS_{ef}$: Sum of squares for the effect	0.13 to < 0.26	Medium effect
			$SS_{er}$: Sum of squares error^†	≥ 0.26	Large effect
	Coefficient ω² (for between-subject designs, unbiased estimate of η²)	$\omega^2 = \frac{df_{ef}(MS_{ef} - MS_{er})}{SS_t + MS_{er}}$	$df_{ef}$: Degrees of freedom for the effect	0.01 to < 0.06	Small effect
			$MS_{ef}$, $MS_{er}$: Mean square of the effect, error	0.06 to < 0.14	Medium effect
			$SS_t$: Total sum of squares	≥ 0.14	Large effect
	Partial ω²	$\omega^2_p = \frac{df_{ef}(MS_{ef} - MS_{er})}{df_{ef}MS_{ef} + (n - df_{ef})MS_{er}}$	$\omega^2_p$: Partial ω²	0.01 to < 0.06	Small effect
			$df_{ef}$: Degrees of freedom for the effect	0.06 to < 0.14	Medium effect
			$MS_{ef}$, $MS_{er}$: Mean square of the effect, error	≥ 0.14	Large effect
			$n$: Sample size
	Cohen’s f	$f = \sqrt{\frac{\sum_{j=1}^{p} (\mu_j - \mu)^2 / p}{\sigma^2}} = \sqrt{\frac{\eta^2}{1 - \eta^2}}$	$p$: Number of groups	0.10 to < 0.25	Small effect
			$\mu_j$: Mean of each group	0.25 to < 0.40	Medium effect
			$\mu$: Mean of whole sample	≥ 0.40	Large effect
			$\sigma$: Standard deviation of whole sample
			$\eta^2$: Coefficient η² (Cohen’s f is interchangeable with η², as shown)
Kruskal-Wallis ANOVA on ranks (Kruskal-Wallis H)	η²	$\eta^2 = \frac{H - k + 1}{n - k}$	$H$: Kruskal-Wallis test statistic	0.02 to < 0.13	Small effect
			$k$: Number of groups	0.13 to < 0.26	Medium effect
			$n$: Total number of observations	≥ 0.26	Large effect
	E² (Epsilon-squared)	$E^2 = \frac{H}{(n^2 - 1)/(n + 1)}$	$H$: Kruskal-Wallis test statistic	0.01 to < 0.08	Small effect
			$n$: Total number of observations	0.08 to < 0.26	Medium effect
				≥ 0.26	Large effect
RM ANOVA	Partial η²	$\eta^2_p = \frac{SS_{ef}}{SS_{ef} + SS_{er}}$	$\eta^2_p$: Partial eta-squared	0.02 to < 0.13	Small effect
			$SS_{ef}$: Sum of squares for the effect	0.13 to < 0.26	Medium effect
			$SS_{er}$: Sum of squares error^†	≥ 0.26	Large effect
	Generalized η²	$\eta^2_G = \frac{SS_{ef}}{\delta \times SS_{ef} + \sum SS_{measured}}$	$\eta^2_G$: Generalized η²	0.02 to < 0.13	Small effect
			$\delta$: 0 if the effect involves one or more measured factors;	0.13 to < 0.26	Medium effect
			1 if the effect involves only manipulated factors	≥ 0.26	Large effect
			$SS_{ef}$: Sum of squares for the effect
			$SS_{measured}$: Sum of squares for the measurement^‡
	Partial ω²	$\omega^2_p = \frac{df_{ef}(MS_{ef} - MS_{er})}{df_{ef}MS_{ef} + (n - df_{ef})MS_{er}}$	$\omega^2_p$: Partial ω²	0.01 to < 0.06	Small effect
			$df_{ef}$: Degrees of freedom for the effect	0.06 to < 0.14	Medium effect
			$MS_{ef}$, $MS_{er}$: Mean square of the effect, error	≥ 0.14	Large effect
			$n$: Sample size
Friedman RM ANOVA on ranks	Kendall’s W (coefficient of concordance)	$W = \frac{\chi_w^2}{N(k - 1)}$	$\chi_w^2$: Friedman test statistic	0.1 to < 0.3	Small effect
			$N$: Sample size	0.3 to < 0.5	Medium effect
			$k$: Number of measurements per subject	≥ 0.5 (see footnote^§)	Large effect
Chi-square test	Cramér’s V (Cramér’s phi)^¶	$\phi_c = \sqrt{\frac{\chi^2}{n(k - 1)}}$	$\phi_c$: Cramér’s V (extended phi coefficient for 2 × 2 table)	k = 2
			$\chi^2$: Chi-square statistic	0.1 to < 0.3	Small effect
			$n$: Sample size	0.3 to < 0.5	Medium effect
			$k$: Lesser value between numbers of column and row	≥ 0.5 (see footnote^¶)	Large effect
Fisher’s exact test	Phi coefficient^**	$\phi = \sqrt{\frac{\chi^2}{n}}$	$\chi^2$: Chi-square statistic	0.1 to < 0.3	Small effect
			$n$: Sample size	0.3 to < 0.5	Medium effect
				≥ 0.5	Large effect
Two-proportions z-test	Cohen’s h	$h = |2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}|$	$p_1$, $p_2$: Two given probabilities or proportions	0.2 to < 0.5	Small effect
			$2\arcsin\sqrt{p}$: Arcsine transformation	0.5 to < 0.8	Medium effect
				≥ 0.8	Large effect
Correlation analysis	Pearson correlation coefficient r	$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$	$r$: Correlation coefficient	0.1 to < 0.3	Small effect
			$\bar{x}$, $\bar{y}$: Means of x and y	0.3 to < 0.5	Medium effect
			$x_i$, $y_i$: Samples of variable x and y	≥ 0.5	Large effect
	Spearman’s ρ	$r_s = \frac{\mathrm{cov}(R(X), R(Y))}{\sigma_{R(X)} \sigma_{R(Y)}}$	$r_s$: Spearman’s ρ	0.1 to < 0.3	Small effect
			$\mathrm{cov}(R(X), R(Y))$: Covariance of the rank variables	0.3 to < 0.5	Medium effect
			$\sigma_{R(X)}$, $\sigma_{R(Y)}$: Standard deviations of the rank variables	≥ 0.5	Large effect

This table does not list every available effect size; the authors selected the effect sizes shown according to their preference and recommendation.

BESD: binomial effect size display [3], SD: standard deviation, ANOVA: analysis of variance, RM: repeated measures.

^*The original definition of Cliff’s delta involves a pre-defined matrix and a subsequent calculation process [7]. Fortunately, Cliff’s delta is linearly related to Vargha and Delaney’s A [8]. Therefore, the interpretation of Cliff’s delta is converted from the recommendation for Vargha and Delaney’s A.

^†“effect” indicates the within- and between-subject factors.

^‡Generalized η² is different from η² and partial η². Generalized η² is an effect size that considers the interaction and has various formulas according to the study design. For details, refer to the article by Bakeman [9].

^§Kendall’s W uses Cohen’s interpretation guidelines. Kendall’s W is a test statistic of agreement between groups, where W = 1 indicates that all groups have identical ranks by the intervention, that is, complete agreement. A high Kendall’s W therefore represents concordant changes by the intervention (a repeated measures factor).

^∥As mentioned above, Kendall’s W is a statistic related to Friedman’s ANOVA and represents the general effect of the overall ANOVA test. The effect size r from each multiple comparison (such as Bonferroni-corrected Wilcoxon signed-rank tests) would be informative.

^¶Guideline based on Cohen’s suggestion. Alternatively, refer to Table 4 of the previously published article [3].

k	Small effect	Medium effect	Large effect
k = 2	0.100 to < 0.300	0.300 to < 0.500	≥ 0.500
k = 3	0.071 to < 0.212	0.212 to < 0.354	≥ 0.354
k = 4	0.058 to < 0.173	0.173 to < 0.289	≥ 0.289
k = 5	0.050 to < 0.150	0.150 to < 0.250	≥ 0.250
k = 6	0.045 to < 0.134	0.134 to < 0.224	≥ 0.224
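The cut-offs listed for k = 2 to 6 appear to follow a single rule: the k = 2 benchmarks (0.1, 0.3, 0.5) divided by √(k − 1), which matches the relationship between Cohen’s w and Cramér’s V. The short Python sketch below reproduces the listed values under that assumption. The function names and the "negligible" label are hypothetical and are not taken from the article.

```python
import math

# Cohen's benchmarks for k = 2, as listed in the table above.
BASE_THRESHOLDS = {"small": 0.1, "medium": 0.3, "large": 0.5}


def cramers_v_thresholds(k):
    """Lower bound of each effect-size band for a given k.

    k is the lesser of the number of rows and columns.
    Assumption: each bound equals the k = 2 bound divided by sqrt(k - 1).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    scale = math.sqrt(k - 1)
    return {label: bound / scale for label, bound in BASE_THRESHOLDS.items()}


def interpret_cramers_v(v, k):
    """Classify Cramér's V using the bounds for the given k."""
    bounds = cramers_v_thresholds(k)
    if v >= bounds["large"]:
        return "large"
    if v >= bounds["medium"]:
        return "medium"
    if v >= bounds["small"]:
        return "small"
    # Hypothetical label for values below the "small" band.
    return "negligible"


# Print the bounds for k = 2..6, rounded to three decimals as in the table.
for k in range(2, 7):
    rounded = {label: round(b, 3) for label, b in cramers_v_thresholds(k).items()}
    print(k, rounded)
```

The same value of V can therefore fall into different bands depending on k, so a reported Cramér’s V is best accompanied by the table dimensions.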

^**Using the phi coefficient for the Fisher’s exact test is controversial because it comes from the chi-square statistic. Using the odds ratio instead of the phi coefficient has been recommended. However, some articles still report using the phi coefficient or Cramér’s V for the Fisher’s exact test results. One problem with using the phi coefficient, Cramér’s V, and Cohen’s w is that they require uniformly distributed marginals for a 2 × 2 table.
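As an illustration of how several formulas in Table 1 could be computed, the following Python sketch uses only the standard library. The data are made up for demonstration and the function names are hypothetical. Two assumptions are labeled in the code: the conversion from d to r with the constant 4 is generally stated for groups of equal size, and U is counted here as the number of pairs in which the first group exceeds the second, with ties counted as 0.5.

```python
import math
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Cohen's d for two independent groups, using the pooled SD."""
    n_t, n_c = len(treatment), len(control)
    s_t, s_c = stdev(treatment), stdev(control)  # sample SDs (n - 1)
    pooled = math.sqrt(
        ((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2)
    )
    return (mean(treatment) - mean(control)) / pooled


def r_from_d(d):
    """Convert Cohen's d to r; assumes equal group sizes (the '+ 4' term)."""
    return math.sqrt(d**2 / (d**2 + 4))


def vargha_delaney_a(group1, group2):
    """VDA = U / (n1 * n2).

    Assumption: U counts pairs where a group1 value exceeds a group2 value,
    with each tie contributing 0.5.
    """
    u = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in group1
        for y in group2
    )
    return u / (len(group1) * len(group2))


def cliffs_delta(vda):
    """Cliff's delta from its linear relation to Vargha and Delaney's A."""
    return 2 * (vda - 0.5)


def cohens_h(p1, p2):
    """Cohen's h for two proportions via the arcsine transformation."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))


# Made-up example data, for illustration only.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]

d = cohens_d(treatment, control)
vda = vargha_delaney_a(treatment, control)
print("Cohen's d:", round(d, 3))
print("r from d:", round(r_from_d(d), 3))
print("VDA:", round(vda, 3))
print("Cliff's delta:", round(cliffs_delta(vda), 3))
print("Cohen's h (0.6 vs 0.4):", round(cohens_h(0.6, 0.4), 3))
```

The resulting values can then be read against the interpretation columns of Table 1. For real analyses, a validated statistical package is preferable to hand-written functions such as these.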

References

1. Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat 2016; 70: 129-33.

2. Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals 2023 [Internet]. International Committee of Medical Journal Editors. Available from http://www.icmje.org/icmje-recommendations.pdf

3. Lee DK. Alternatives to P value: confidence interval and effect size. Korean J Anesthesiol 2016; 69: 555-62.

4. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016; 31: 337-50.

5. Kwak S. Are Only p-values less than 0.05 significant? A p-value greater than 0.05 is also significant! J Lipid Atheroscler 2023; 12: 89-95.

6. Benjamini Y, De Veaux RD, Efron B, Evans S, Glickman M, Graubard BI, et al. The ASA president’s task force statement on statistical significance and replicability. Ann Appl Stat 2021; 15: 1084-5.

7. Cliff N. Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 1993; 114: 494-509.

8. Vargha A, Delaney HD. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 2000; 25: 101-32.

9. Bakeman R. Recommended effect size statistics for repeated measures designs. Behav Res Methods 2005; 37: 379-84.

10. Libster R, Pérez Marc G, Wappner D, Coviello S, Bianchi A, Braem V, et al. Early high-titer plasma therapy to prevent severe Covid-19 in older adults. N Engl J Med 2021; 384: 610-8.

11. Concannon P, Rich SS, Nepom GT. Genetics of type 1A diabetes. N Engl J Med 2009; 360: 1646-54.

12. Hays J, Ockene JK, Brunner RL, Kotchen JM, Manson JE, Patterson RE, et al. Effects of estrogen plus progestin on health-related quality of life. N Engl J Med 2003; 348: 1839-54.

13. Chow JT, Turkstra TP, Yim E, Jones PM. The degree of adherence to CONSORT reporting guidelines for the abstracts of randomised clinical trials published in anaesthesia journals: a cross-sectional study of reporting adherence in 2010 and 2016. Eur J Anaesthesiol 2018; 942-8.

14. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials 1989; 10: 407-15.

15. Batterham AM, Hopkins WG. Making meaningful inferences about magnitudes. Int J Sports Physiol Perform 2006; 1: 50-7.

16. Sainani KL. The problem with “magnitude-based inference”. Med Sci Sports Exerc 2018; 50: 2166-76.

17. Sainani KL, Lohse KR, Jones PR, Vickers A. Magnitude-based Inference is not Bayesian and is not a valid method of inference. Scand J Med Sci Sports 2019; 29: 1428-36.

18. Lemieux J, Beaton DE, Hogg-Johnson S, Bordeleau LJ, Goodwin PJ. Three methods for minimally important difference: no relationship was found with the net proportion of patients improving. J Clin Epidemiol 2007; 60: 448-55.

19. Wyrwich KW. Minimal important difference thresholds and the standard error of measurement: is there a connection? J Biopharm Stat 2004; 14: 97-110.

20. Todd KH, Funk KG, Funk JP, Bonacci R. Clinical significance of reported changes in pain severity. Ann Emerg Med 1996; 27: 485-9.

21. Danoff JR, Goel R, Sutton R, Maltenfort MG, Austin MS. How much pain is significant? Defining the minimal clinically important difference for the visual analog scale for pain after total joint arthroplasty. J Arthroplasty 2018; 33: S71-5.

22. Tubach F, Ravaud P, Baron G, Falissard B, Logeart I, Bellamy N, et al. Evaluation of clinically relevant changes in patient reported outcomes in knee and hip osteoarthritis: the minimal clinically important improvement. Ann Rheum Dis 2005; 64: 29-33.

23. Howard R, Phillips P, Johnson T, O’Brien J, Sheehan B, Lindesay J, et al. Determining the minimum clinically important differences for outcomes in the DOMINO trial. Int J Geriatr Psychiatry 2011; 26: 812-7.

24. Powell CV, Kelly AM, Williams A. Determining the minimum clinically significant difference in visual analog pain score for children. Ann Emerg Med 2001; 37: 28-31.

25. Juniper EF, Guyatt GH, Willan A, Griffith LE. Determining a minimal important change in a disease-specific quality of life questionnaire. J Clin Epidemiol 1994; 47: 81-7.

26. Farrar JT, Portenoy RK, Berlin JA, Kinman JL, Strom BL. Defining the clinically important difference in pain outcome measures. Pain 2000; 88: 287-94.

27. Myles PS, Myles DB, Galagher W, Boyd D, Chew C, MacDonald N, et al. Measuring acute postoperative pain using the visual analog scale: the minimal clinically important difference and patient acceptable symptom state. Br J Anaesth 2017; 118: 424-9.

28. Malec JF, Ketchum JM. A standard method for determining the minimal clinically important difference for rehabilitation measures. Arch Phys Med Rehabil 2020; 101: 1090-4.

29. Muñoz-Leyva F, El-Boghdadly K, Chan V. Is the minimal clinically important difference (MCID) in acute pain a good measure of analgesic efficacy in regional anesthesia? Reg Anesth Pain Med 2020; 45: 1000-5.

