Korean J Anesthesiol > Volume 78(3); 2025 > Article
Kwak and Kim: Comprehensive reporting guidelines and checklist for studies developing and utilizing artificial intelligence models

Abstract

Background

The rapid advancement of artificial intelligence (AI) in healthcare necessitates comprehensive and standardized reporting guidelines to ensure transparency, reproducibility, and ethical applications in clinical research. Existing reporting standards are limited by their focus on specific study designs. We aimed to develop a comprehensive set of guidelines and a checklist for reporting studies that develop and utilize AI models in healthcare, covering all essential components of AI research regardless of the study design.

Methods

Two experts in statistics from the Statistical Round of the Korean Journal of Anesthesiology developed these guidelines and checklist. The key elements essential for AI model reporting were identified and organized into structured sections, including study design, data preparation, model training and evaluation, ethical considerations, and clinical implementation. Iterative reviews and feedback from clinicians and researchers were used to finalize the guidelines and checklist.

Results

These guidelines provide a detailed description of each item on the checklist, ensuring comprehensive reporting of AI model research. Full details regarding the AI model specifications and data-handling processes are provided.

Conclusions

These guidelines and checklist are meant to serve as valuable tools for researchers, addressing key aspects of AI reporting, and thereby supporting the reliability, accountability, and ethical use of AI in healthcare research.

Introduction

The use of artificial intelligence (AI) in healthcare research is rapidly transforming clinical practice and decision-making by enhancing diagnostic accuracy, improving treatment strategies, and streamlining patient management [1,2]. Given the increase in the use of AI models across various medical disciplines, ensuring AI-based studies are transparent, reproducible, and ethically sound is crucial [3]. Existing guidelines, such as CONSORT-AI [4], DECIDE-AI [5], TRIPOD-AI [6], and CLAIM [7], provide valuable standards for reporting AI research. However, these guidelines are tailored to specific study designs or applications, such as clinical trials or diagnostic accuracy studies, and may not encompass the full range of AI study methodologies.
Comprehensive reporting of the critical aspects of AI research, such as model training, data handling, evaluation metrics, and safety protocols, is essential to ensure that AI systems are effective and reliable in real-world applications. Furthermore, issues of equity, transparency, and patient safety should be addressed to build public trust and ensure that AI innovations contribute positively to healthcare outcomes [8].
This article presents a set of comprehensive reporting guidelines and a checklist designed to standardize the reporting of AI studies across diverse study designs. The checklist provides a succinct overview, while the guidelines elaborate on each checklist item. Key aspects of AI model development and utilization are addressed, including study design, data handling, model training and evaluation, error management, clinical applicability, and safety and ethical considerations, to support the reproducibility, validity, and ethical integrity of AI research [9]. By standardizing the reporting process, these guidelines and checklist aim to foster trust in AI research and advance the responsible use of AI in clinical practice.  

Materials and Methods

Identification of key reporting elements

These guidelines and checklist were developed by two experts in statistics from the Statistical Round of the Korean Journal of Anesthesiology. At the beginning of the development process, all elements essential for the transparent and accurate reporting of AI-based studies were identified. These elements encompass all critical aspects of AI model development and usage in clinical settings and address the practical challenges that researchers face when documenting AI research processes to ultimately enhance reproducibility and accountability.

Drafting of the guidelines and checklist

A preliminary draft was created, with the guidelines structured into sections relevant to AI model reporting. Concurrently, a checklist was designed to include the items that researchers could follow to systematically document their work. Researchers are meant to use this checklist as a guide for comprehensive reporting.

Incorporation of clinical scenarios

Clinical scenarios were incorporated into the guidelines when necessary. These examples supplement the checklist items by illustrating common challenges encountered in AI-based clinical research, such as variability in data sources, potential biases in model predictions, and considerations for patient safety when using AI in clinical decision-making.

Finalization of the draft after iterative review and revision

The guidelines and checklist underwent multiple rounds of review and revision. Each item was carefully evaluated to ensure that the final version was comprehensive and user-friendly. Feedback from clinicians and researchers was sought for diverse AI research applications in medical practice. The draft was finalized once a structured and systematic framework for documenting AI studies in clinical research was reached.

Results

This section presents the guidelines and checklist that were developed by the authors (Table 1). The guidelines provide detailed elaborations of each checklist item to guide researchers in thoroughly documenting each component of their AI model study.

Comprehensive reporting guidelines for studies developing and utilizing artificial intelligence models

Title

Indicating the use of AI techniques in the title enhances the searchability of the study. Using broader terms such as ‘artificial intelligence’ or ‘machine learning’ in the title can also be understood by a wider audience. However, if a specific type of AI model is well known (e.g., deep learning), it may also be used in the title. More precise terminology on the specific AI models and architectures may be reserved for the Abstract.

Abstract

The abstract should provide a structured summary that includes the study design/setting (e.g., prospective or retrospective), an overview of the study population (number of patients, users, examinations, and/or images and age and sex distributions), the type of underlying AI algorithms, an outline of the statistical analyses performed (e.g., P values for comparisons, 95% CI), primary and secondary outcomes, main results, and conclusions. The abstract should thus be comprehensible without reading the full manuscript. The abstract should also indicate the public availability of the software, data, and/or resulting model.

Introduction

The use or setting of AI techniques and the selection of the target population should be supported by pre-existing or unpublished evidence that addresses clinically and scientifically important issues and questions requiring the use of the AI system. This evidence may indicate previous development of the AI model, internal and external validation, and/or modifications made prior to the current study. The medical conditions of interest, their related problems (e.g., limitations of the standard medical practice that the AI-based model is to be compared with), and the target population should be clearly described. The role that the AI will play within the clinical pathway to resolve the addressed clinical problems should also be described. In addition, a specific question that could be answered using the AI model should be established. As the role of the AI model may differ depending on who uses it, its users (e.g., physicians, patients, the public) should also be clearly defined. The study objectives, rationale, hypothesis, and anticipated clinical effects or outcomes should also be described.

Methods

1. Study design, setting, and population

1) Study design

The authors should clearly indicate whether the study was conducted prospectively or retrospectively. Additionally, whether the study goal was to evaluate the feasibility of the AI system or its superiority or non-inferiority to the reference test or model (current standard methodologies or models), to conduct an exploratory analysis, or to build a predictive model should be clearly stated. Details on the study goal should also be described (e.g., screening, diagnosis, or staging of a disease; prediction of disease development; anticipation of the prognosis). A reference (e.g., current clinical standards or model), which is to be compared with the AI system to assess its performance, should be clearly described.

2) Study settings

The authors should provide details on the setting where the study was conducted. This includes the type and size of the study environment (e.g., tertiary university hospital, private clinics, public health centers); its sublocation (e.g., operating room, examination room, immunization unit); and the availability of supporting facilities, services, or technologies relevant to the study (e.g., robotic surgical system, otoscope, COVID-19 vaccines). How the study setting and cohorts represent real-world clinical conditions should also be stated. Technical requirements and configurations specific to each study site should also be described in detail (e.g., software, hardware, specialized computing devices, vendor-specific equipment, site-specific modification of the AI algorithm). Because AI models may perform well only in the environment in which they were developed, providing this information allows limitations in generalizability to be better understood [10].

3) Study population

The process used to recruit the participants should be stated, along with the inclusion and exclusion criteria. A flow diagram indicating the number of participants included at each stage should be included if possible [4]. Alternatively, a flow chart can be presented and explained in the Results section.

2. Description of the AI system

1) Study data

The characteristics and quality of the input data significantly affect the performance of AI systems [11]. In addition, a detailed description of input data handling allows for the replication of AI system use outside the study setting and can be used to determine whether data handling is standardized across study sites. Therefore, the original data source (e.g., electronic medical records, public data registry, prospective data collection) and the process used to obtain it from the study population should be stated along with the time period when the data were obtained. If the data were acquired using a specific device/software (e.g., electrocardiographic waveform from a patient monitor), the product information (e.g., device/software name and model, manufacturer information [name, city, country of origin]) and data acquisition protocols should be described in detail (e.g., the frequency at which the waveform was recorded, the type of filter [low-pass or band-pass filter], and the ranges of the filtered frequency). If the data underwent reformatting, the process should be fully reported with its relevant parameters (e.g., the frequency at which the obtained waveforms were resampled or downsampled). If data collection depended on the investigator’s subjective expertise, the number of investigators, their qualifications, and the instructions and training materials used should be described. Whether the measurements and/or observations were independent among the investigators and how inter- and intra-investigator variability was detected and handled should be specified as well. Additionally, the inclusion and exclusion criteria for the AI system input data should be described.
The authors should also specify whether the data were structured. Structured data have clearly defined features (e.g., name of diagnosis, medical procedure, medication, laboratory test result values, population characteristic variables [age, sex, height, weight]), whereas unstructured data lack features that can be explicitly defined (e.g., images, videos, audio recordings, text data, time-series data).
Data pre-processing converts raw data with different formats from various sources into a uniform and consistent format that can be read and used as input by the AI system. The data format compatible with the intended use of the AI model varies according to the type of AI system used (e.g., radiographic images, hemodynamic parameters, laboratory results). Minimum requirements should be set that determine the eligibility of the data before input to the AI system (e.g., image resolution ranges, number of complete or missing data per participant). The authors should also describe how data that did not meet the minimum requirements were handled and how this impacted the clinical pathway that includes the AI system.
In particular, the definition of missing and poor-quality data (e.g., electrocautery artifacts in the electroencephalogram) and outliers, their quantity, and how they were detected and handled should be described, as they diminish AI system performance [12]. If these data were imputed using specific techniques (e.g., last observation carried forward), the resulting biases should be described. The same information should be provided for the comparator (control or reference intervention).
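As a minimal sketch of the last-observation-carried-forward technique mentioned above (the `sbp` series and its values are illustrative only), the imputation can be expressed as follows; note that LOCF propagates stale measurements forward, which is exactly the kind of bias that should be reported:

```python
# Minimal sketch of last-observation-carried-forward (LOCF) imputation,
# assuming a beat-to-beat series in which None marks missing values.
# LOCF can bias results toward outdated measurements, which should be
# acknowledged when reporting, as noted above.

def locf_impute(series):
    """Replace each missing value (None) with the last observed value."""
    imputed, last = [], None
    for value in series:
        if value is None:
            imputed.append(last)  # carry the previous observation forward
        else:
            imputed.append(value)
            last = value
    return imputed

sbp = [112, None, None, 95, None, 101]
filled = locf_impute(sbp)  # -> [112, 112, 112, 95, 95, 101]
```

Leading missing values have no prior observation to carry forward and remain missing, another limitation worth documenting.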
Details on data transformation (e.g., normalization, standardization, rescaling, natural-log transformation, encoding categorical variables), feature engineering, and feature selection should be provided such that other researchers can reproduce the process. In particular, the data should be de-identified, protected health information should be completely removed, and facial images should also be rendered unidentifiable [13]. Accordingly, the processes used to de-identify and protect personal information should be fully described.
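Two of the transformations listed above, z-score standardization and one-hot encoding of a categorical variable, can be sketched as follows; the feature names (`ages`, `asa_class`) and values are hypothetical examples, not data from any study:

```python
# Illustrative sketch of two common data transformations: standardization
# of a numeric feature and one-hot encoding of a categorical feature.
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(categories, vocabulary):
    """Encode each category as a binary indicator vector over the vocabulary."""
    return [[1 if c == v else 0 for v in vocabulary] for c in categories]

ages = [35, 45, 55, 65]                    # hypothetical numeric feature
asa_class = ["I", "III", "II", "I"]        # hypothetical categorical feature
z_scores = standardize(ages)
encoded = one_hot(asa_class, ["I", "II", "III"])
```

Reporting the exact transformation applied to each feature, including the fitted means and scales, is what makes the process reproducible by other researchers.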
Whether the data were processed or unprocessed before the analysis should be specified, along with whether the data were acquired before the application of the AI system or were generated with the use of the AI system.
The ground truth, which is used as a reference for comparison in supervised learning and can be clinically measured using the gold standard (e.g., histopathologic diagnosis or consensus agreement from a panel of experts), should be annotated with a precise definition. For example, hypotension is defined as a clinical condition in which the beat-to-beat systolic blood pressure, measured from a catheter placed in the lumen of the right radial artery with the transducer diaphragm at the level of the mid-axillary line intersecting the fourth intercostal space [14], is maintained below 80 mmHg for more than 1 min between anesthesia induction and the end of surgery. Using unclear definitions with insufficient information, such as blood pressure < 80 mmHg, should be avoided.
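A precise definition like the one above can be applied programmatically when annotating the ground truth. The sketch below assumes one systolic pressure sample per second; the 80 mmHg threshold and 1 min duration come from the example definition, not from a clinical recommendation:

```python
# Sketch of ground-truth annotation under the example definition above:
# systolic blood pressure below 80 mmHg sustained for more than 60 s.
# Assumes a regularly sampled series at 1 Hz (one value per second).

def label_hypotension(sbp_per_second, threshold=80, min_duration_s=60):
    """Return True if SBP stays below threshold for more than min_duration_s."""
    run = 0  # length of the current consecutive sub-threshold run, in seconds
    for sbp in sbp_per_second:
        run = run + 1 if sbp < threshold else 0
        if run > min_duration_s:
            return True
    return False

# 61 consecutive seconds below 80 mmHg satisfies "more than 1 min".
episode = label_hypotension([75] * 61 + [120] * 10)  # -> True
```

Encoding the definition as an explicit function makes the annotation rule unambiguous and reusable across study sites.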

2) Study output

The output of the AI system should be specified. In the context of the clinical problem that the current study intends to address, this can include disease diagnosis, grading of disease severity, probability of a clinical event or disease occurrence, disease treatment options, and prediction of clinical parameter values. A uniform format should be used for both the output and ground truth.
If the AI output is used to determine subsequent clinical management that ultimately affects clinical outcomes, this should be described in detail. If clinical management involves medical practice performed by the researchers, this should be standardized. The output of the AI system should also be fully understood and interpreted by the researchers involved in clinical management guided by the output. For example, an intraoperative hypotension prediction algorithm that calculates the probability of a hypotensive episode requires researchers to interpret the probability and take standardized actions based on the probability threshold (e.g., administering weight-based doses of a vasoactive agent when the probability of a hypotensive episode is > 70%).
If the study design compares the performance of the AI system to that of a reference clinical protocol, the process used in the reference protocol to determine subsequent clinical management should be explained in the same manner that the AI system is used to make clinical decisions. Accordingly, the rationale for using the reference standard and its inherent limitations, including errors and biases, should be described.

3) Data separation

The process used to split up the full dataset at the beginning of the study should be described. The dataset can be split into training and test sets, or into training, validation (tuning), and test sets. The proportion of each set to the full dataset should be reported with the rationale. Using an external test set from an independent study site to test the trained model (external validation) is the ideal standard. Otherwise (in the case of internal validation), explicitly reporting and justifying the decision not to take the test set data from a data source external to that of the training data is essential. If the data structure between the training and test sets differs, the measures used to accommodate the difference should be explained.
To prevent bias, the test set needs to represent the target population. Certain methods (e.g., stratified sampling) can be used to maintain the distribution of the clinical outcome variables in the test set such that it is similar to that of the target population. Accordingly, the distribution of variables (including demographics and clinical parameters) in the training, validation, and test sets should be reported and statistically compared to show that the distributions are similar across sets. If any systematic differences are found, the factors involved should be investigated.
To prevent overfitting of the trained model, which manifests as good performance on the training set but poor performance on the validation set, internal k-fold cross-validation can be conducted. This type of validation involves splitting the dataset into k subsets and training and validating the model k times, using each subset once as the validation set and the remaining k-1 subsets as the training set. When splitting a dataset, information leakage should be prevented. Information leakage occurs when the training set is contaminated with information from the validation or test sets that should remain exclusive to those sets. To ensure that each set is divided at the patient level or higher, each set should be split from the study population at the beginning of the study, before data pre-processing and model training. Details on the steps taken to prevent overfitting and information leakage, both of which result in poor generalization of the model, should be provided.
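A patient-level split of the kind described above can be sketched as follows; the 70/15/15 proportions and the fixed seed are illustrative choices that a manuscript would need to report and justify:

```python
# Sketch of a patient-level data split that guards against information
# leakage: all records from a given patient land in exactly one set.
import random

def split_patients(patient_ids, seed=42, train=0.7, val=0.15):
    """Shuffle unique patient IDs, then partition them into three sets."""
    ids = sorted(set(patient_ids))          # deduplicate to the patient level
    random.Random(seed).shuffle(ids)        # fixed seed for reproducibility
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train_ids, val_ids, test_ids = split_patients(range(100))
# The three sets are mutually exclusive, so no patient contributes
# records to more than one of them.
assert not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)
```

Performing this split before any pre-processing, and filtering each record into its set by patient ID, is what prevents leakage through per-patient statistics.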

4) Concise description of the AI system

Scientific rationales should be used to determine the type of model to train. The model task (e.g., classification and regression [numerical prediction]) and its beneficiaries (if any) should be specified. The mathematical algorithm of the AI system; hardware environment; and supporting software, library, or package required for its operation, including the versions, should be described. Relevant information includes the name of the developer and/or manufacturer, their location, and specific configuration settings. If previous development/validation studies of the AI system used are available, they should be cited in the manuscript, and their AI system information should be presented in the same manner as that of the current study. The provision of AI system information from development/validation studies allows changes in the performance of the AI system to be assessed when the current study population differs from that of the development/validation studies. Using a standardized reporting system also allows concise information about the AI model to be provided [15]. If a new unpublished mathematical model is developed and used in the study, a full description should be provided as an appendix or as supplementary material at the end of the manuscript, or the model should be published, along with accession details, in an established public database that either prevents the model development processes from being arbitrarily revised once registered or mandates retaining a complete history of revisions.
Several sequential stages, which require considerable time and computing power, are required to construct the final AI model. Throughout this process, several versions of AI models might be created. If the AI system was used in a clinical trial [4] or its ability to make appropriate or optimal clinical decisions is being assessed [5], the version should be clearly indicated with a regulatory marker, such as a unique device identifier. If the version of the AI model was modified, this should be justified by scientific rationale, and the changes made to the original version should be described.
The architecture of the AI model should be fully described such that it can be reconstructed by other researchers. This includes the inputs, outputs, and components specific to the type of AI model. Scientific rationale for the selection of each component should be provided. The architecture of the AI model can be provided in code as supplementary data.
As an example, a convolutional neural network model consists of 1) an input layer (e.g., one-dimensional 5-min electrocardiogram waveform collected at 300 Hz) characterized by the number of nodes and number and size of batches; 2) convolutional and pooling layers, which are characterized by the number and order of the layers, number and size of the kernel(s) in each layer, type of activation function (e.g., rectifier activation function, softplus), type of pooling operations (e.g., max, min, average, hybrid), type of normalization of output from a previous layer (e.g., batch or layer normalization), and use of dropout layers and their dropout rates; 3) fully connected layers, which are characterized by the number of hidden layers and number of nodes from each layer, and the type of activation function, similar to that of the previous convolutional and pooling layers; 4) output layers (e.g., whether hypotension develops); 5) loss (objective) function (e.g., mean squared error, mean absolute error for regression models, binary or categorical cross-entropy loss for classification models) and model optimization algorithms (e.g., Adam optimizer, gradient descent) with their hyperparameters (e.g., learning rates [degree of error reduction], exponential decay rates, epsilon); 6) regularization algorithms (e.g., L1 regularization [Lasso regularization], L2 regularization [Ridge regularization], elastic net regularization that combines the two regularization techniques); 7) hyperparameter tuning strategy (e.g., grid search, random search); 8) stopping criteria for training (e.g., maximum number of epochs after which training processes stop regardless of whether the trained model converges, patience parameters [number of epochs during which validation performance is allowed to improve]); and 9) criteria used to select the model with the best performance.
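The convolution, activation, and pooling operations enumerated in items 1)–3) above can be illustrated with a pure-Python sketch of a single one-dimensional convolutional block; a real model would stack many such layers in a deep-learning framework, and the five-sample signal and two-tap kernel here are toy values:

```python
# Pure-Python sketch of one building block of a 1-D convolutional network:
# a valid (no-padding) convolution with a single kernel, a ReLU activation,
# and non-overlapping max pooling.

def conv1d(signal, kernel):
    """Valid 1-D convolution: slide the kernel over the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Rectifier activation function: negative values are clipped to zero."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size=2):
    """Max pooling over non-overlapping windows of the given size."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

signal = [0.0, 1.0, -1.0, 2.0, 0.5]        # toy 1-D input
features = max_pool(relu(conv1d(signal, [1.0, -1.0])))  # -> [2.0, 1.5]
```

Reporting each layer at this level of detail (kernel sizes, activation, pooling type) is what allows the architecture to be reconstructed by other researchers.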

5) Model training

Details on all the model training processes used should be provided such that other researchers could reproduce them. Providing them in code is strongly encouraged.
If training data augmentation is required (for images, text, audio, etc.), the techniques used should be described (e.g., geometric transformations, paraphrasing, introducing noise).
The initialization of the parameters in the AI model (to prevent issues such as vanishing or exploding gradients and to enhance the convergence speed and model performance) should be described. If the initial parameters are randomly drawn from a specific distribution (e.g., uniform or normal distribution), the distribution should be described with its key parameters (e.g., lower and upper bounds for a uniform distribution and mean and standard deviation for a normal distribution). If the initial parameters are obtained from a model previously trained on a different large dataset (transfer learning), the source of the initial parameters from the pre-trained model should be provided. When using both random initialization and transfer learning, the initialized parameters and the modality used to initialize them should be specified. If some parameters are obtained from transfer learning and cannot be modified, they should be indicated as frozen or restricted. In addition, details on the specific restrictions applied and the portion of training affected by the restrictions should be specified.
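As an illustration of random initialization from a normal distribution, the sketch below uses the widely known He scheme, in which the standard deviation is scaled by the number of inputs to the layer; the layer sizes are arbitrary examples:

```python
# Sketch of random weight initialization from a normal distribution with
# He scaling (std = sqrt(2 / fan_in)), one of the schemes a manuscript
# should name and parameterize when describing initialization.
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in weight matrix from N(0, 2 / fan_in)."""
    rng = random.Random(seed)               # seeded for reproducibility
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = he_init(fan_in=128, fan_out=64)   # illustrative layer sizes
```

Reporting the distribution, its parameters, and the seed (or the source of pre-trained weights, for transfer learning) fulfills the requirements described above.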
The convergence of the model should be monitored by checking whether the pre-defined stopping criteria for the training are satisfied by the best hyperparameter combinations. If convergence of the model is not achieved, reviewing the data quality, feature scaling, learning rate, batch size, model architecture, parameter initialization, regularization (if any), and gradient descent issues, such as vanishing or exploding gradient problems, optimization, and hyperparameters, is mandatory. For example, for a neural network model that fails to converge, the researcher can attempt to vary the number of hidden layers and their nodes, apply different network activation functions, or adjust the learning rate.
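The patience-based stopping criterion mentioned above can be sketched as a simple loop over validation losses; the loss values below are synthetic placeholders standing in for a real training run:

```python
# Sketch of patience-based early stopping: training halts when the
# validation loss has not improved for `patience` consecutive epochs,
# or when the maximum number of epochs is reached.

def train_with_early_stopping(val_losses, patience=3, max_epochs=100):
    """Return the number of epochs actually run before stopping."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0      # improvement resets the counter
        else:
            since_best += 1
        if since_best >= patience:          # no improvement for `patience` epochs
            return epoch
    return min(len(val_losses), max_epochs)

# Validation loss stops improving after epoch 3, so training stops at epoch 6.
epochs_run = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.61])
```

The patience value and maximum epoch count are exactly the stopping-criterion hyperparameters that the guidelines ask authors to report.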
The metrics used for model validation should also be described in detail (e.g., sensitivity, specificity, positive predictive value [precision], negative predictive value, area under the receiver operating characteristic curve, mean squared error, root mean squared error, accuracy).
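Several of the threshold-based metrics listed above follow directly from the confusion-matrix cell counts, as the sketch below shows for binary labels (the example label lists are synthetic):

```python
# Sketch of common validation metrics computed from binary labels:
# sensitivity, specificity, PPV (precision), NPV, and accuracy.

def classification_metrics(y_true, y_pred):
    """Return threshold-based metrics from paired binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),      # true positive rate (recall)
        "specificity": tn / (tn + fp),      # true negative rate
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "accuracy": (tp + tn) / len(y_true),
    }

metrics = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Naming the exact formula used for each reported metric avoids ambiguity between closely related terms such as precision and PPV.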

6) Model evaluation

If more than one AI model is trained and planned to be evaluated using a test set that is independent of both the training and validation sets, a pre-specified method and performance metrics must be defined for comparing the models and selecting the most relevant one, typically the model demonstrating the best performance (e.g., area under the receiver operating characteristic curve, accuracy, and precision for classification; mean squared error for regression [numeric prediction]). The metrics used to measure model performance should be presented with statistical uncertainty (e.g., 95% CI) and compared between models using appropriate statistical tests that determine the statistical significance of the metric differences, thereby addressing the clinical problems that the current study is meant to address. If CIs cannot be directly calculated owing to unknown error distributions, they can be non-parametrically estimated using bootstrapping.
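The bootstrap approach mentioned above can be sketched as follows for a simple metric (accuracy); the 2,000 resamples and the percentile method are common conventions rather than requirements, and the label lists are synthetic:

```python
# Sketch of a non-parametric percentile-bootstrap 95% CI for accuracy,
# usable when the error distribution of a metric is unknown.
import random

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for accuracy over paired label lists."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

low, high = bootstrap_ci([1, 0, 1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1, 1, 1])
```

The same resampling scheme applies to any metric (e.g., AUC) by substituting the statistic computed on each resample.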
As no single methodology is perfect for evaluating a model, using more than one technique is strongly recommended. However, authors should have the flexibility to choose the most appropriate evaluation method and should provide a rationale for the selection. Strictly adhering to a pre-defined or standardized evaluation protocol is discouraged.
If multiple models are determined to be the best-performing models, the final model selection should be justified. If the goal of the study is to construct an ensemble of models, descriptions of the three components of the ensemble method should be provided: 1) the allocation function that assigns training data (e.g., via bootstrapping sampling) to each model; 2) the combination function that reconciles the prediction disagreements among models (e.g., the final prediction is made by a majority vote from models, weighting votes from each model based on their performance, or learning various combinations of each model’s prediction [stacking]); and 3) a full description of each model in the ensemble, as mentioned above. If models have previously been published that address the same clinical problem being addressed in the current study, comparisons of the final model to those models should be provided. To assess the robustness of the study findings, the sensitivity of the AI model should be analyzed using different assumptions or various initial conditions.
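The majority-vote combination function described in item 2) above can be sketched in a few lines; the three "models" here are stand-in prediction lists rather than trained systems:

```python
# Sketch of the combination function for a voting ensemble: the final
# binary prediction for each case is the unweighted majority vote
# across the ensemble's member models.

def majority_vote(predictions_per_model):
    """Combine per-model binary predictions case by case by majority vote."""
    n_models = len(predictions_per_model)
    combined = []
    for votes in zip(*predictions_per_model):  # one tuple of votes per case
        combined.append(1 if sum(votes) > n_models / 2 else 0)
    return combined

model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
final = majority_vote([model_a, model_b, model_c])  # -> [1, 0, 1, 1]
```

Weighted voting and stacking replace this function with a performance-weighted sum or a learned combiner, respectively, which is why the guidelines ask for the combination function to be described explicitly.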
Because misinterpretations of model outcomes can lead to biases and inappropriate applications in healthcare settings [8], the intended manner of interpreting or explaining the results of the AI model should be described (e.g., an AI model developed to predict intraoperative hypotension being misapplied to predict the development of hypotension in intensive care unit settings).

3. Miscellaneous aspects of the AI model description

1) Defining features and response variables

If possible, using common data elements that provide standardized, uniform, and consistent names, definitions, formats, and coding of variables across studies that are compatible with different study settings is strongly recommended [16].

2) Sample size estimation

Whenever possible, the sample size required for the study should be calculated from the results of a pilot or previous study to achieve the predetermined statistical power at an acceptable type I error rate.
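As an illustration, the normal-approximation sample size for comparing two proportions can be computed as below; the 30% and 15% event rates are hypothetical pilot values, not figures from this article, and the two-sided alpha of 0.05 with 80% power are conventional defaults:

```python
# Illustrative sample-size calculation for comparing two proportions
# (e.g., event rates under an AI-guided vs. a standard protocol) using
# the standard normal-approximation formula.
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided test of two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

n = n_per_group(0.30, 0.15)  # hypothetical pilot event rates -> 121 per group
```

Whatever formula or software is used, the assumed rates, alpha, and power should be reported so the calculation can be verified.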

Results

1. Study data

Including a flowchart or diagram to show the inclusion or exclusion of participants and/or data at each stage based on their corresponding exclusion/inclusion criteria is strongly recommended. The number of included or excluded subjects and data along with the criteria used for inclusion/exclusion should be presented. The resulting number of participants or data included at each stage should also be reported. If a flowchart or diagram is provided in the Methods section, presenting it in the Results section is redundant.
When summarizing the technical characteristics of the dataset, authors should specify whether the dataset was prepared as planned in the Methods section. If the characteristics of each partitioned dataset with statistical comparisons are reported in the Methods section, reporting them again in the Results section is redundant.
Even minor differences in the input datasets can have a significant impact on the output of the AI model (i.e., its performance) and, subsequently, on patient safety if the model is used for major clinical decision-making [17,18]. Therefore, caution is advised if the distribution of the data changes (dataset shift) [18] between the training and test sets or between study settings. To address this issue, the population characteristics of the training and test sets should be described and compared.
The baseline characteristics should be selectively reported according to the task of the AI model, factors influencing the study outcomes, or protection of privacy (e.g., age, sex, gender, race, ethnicity, socioeconomic status, geographical location, prevalence, distribution [categorization/severity], risk factors of the medical conditions of interest, features input into the AI model, and coexisting medical conditions relevant to the study).
The presence of missing data significantly affects model performance and contributes to ethical issues [19-21]. The degree of missing data depends on the study settings (e.g., computer simulation modeling vs. clinical settings). Accordingly, the quantity of missing data according to the data features input into the AI model should be clearly reported.

2. Model performance

1) Reporting metrics with statistical uncertainty

Metrics with statistical uncertainty and the significance of model performance on the training, validation, and test sets should be reported as planned in the Methods section. The types of metrics reported are dependent on the data type and models used in the study (e.g., F-score, Dice-Sørensen coefficient).

2) Clinical translation of model performance

In addition to the performance of the model itself, evaluating how its predictive performance translates into clinical outcomes with specific metrics such as sensitivity, specificity, positive predictive value, negative predictive value, area under the receiver operating characteristic curve, and the number needed to treat is essential. Accordingly, a justification of the metrics selected should be presented with the scientific rationale. The performance of the final model can be statistically compared with that of the standard technique or baseline model.

3) Feature contribution analysis

The contribution of each feature to the predictive performance of the AI model should be clearly described. For example, a SHapley Additive exPlanations (SHAP) plot is useful for showing the impact of every feature from each sample on the output predicted by the model.
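SHAP values are typically produced with the `shap` library (e.g., `shap.Explainer` and `shap.summary_plot`). As a dependency-light illustration of the same idea of per-feature contribution, the sketch below uses scikit-learn's permutation importance on synthetic data in which only the first feature carries signal; this is an alternative technique, not a SHAP computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: feature 0 determines the label; feature 1 is pure noise.
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Permutation importance: the drop in score when one feature is shuffled,
# averaged over repeats; larger drop = larger contribution.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```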

4) Sub-group performance

If a subgroup analysis was performed, the subgroups for which the AI model performed best and worst should be indicated. The performance of any important subgroup should also be reported, as mentioned above. To demonstrate the performance and limitations of a classification model, providing a confusion matrix that shows whether the predicted classification matches the actual classification can be helpful.
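A confusion matrix of the kind recommended above can be produced directly from the predicted and actual labels. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test-set labels and model predictions (illustrative values only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn convention: rows are actual classes, columns are predicted
# classes, so off-diagonal entries show where the model misclassifies.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The same function can be applied within each subgroup to expose subgroup-specific error patterns.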

5) Sensitivity analysis

For a sensitivity analysis of classification models, descriptions can be provided of cases showing the highest model confidence with correct and incorrect predictions, as well as the lowest confidence regardless of prediction correctness. For example, in an 84-year-old male patient with hypertension and congestive heart failure who underwent emergent pneumonectomy, a classification model predicting reintubation in the post-anesthetic care unit calculated a model confidence of 95%, and his trachea was actually reintubated according to the ground truth. This case shows high model confidence and a correct prediction, implying that the model correctly identified a patient at high risk for reintubation with definite risk factors such as old age, multiple comorbidities, and surgery involving the respiratory tract.
Similarly, for regression models, a sensitivity analysis can be performed by describing cases with the largest difference (error) between a lower predicted value and a higher actual value, cases with the largest difference between a higher predicted value and a lower actual value, and cases with the smallest difference between the predicted and actual values. For example, a case in which the systolic blood pressure predicted by a regression model is 300 mmHg and the actual value is 150 mmHg may be identified as the case with the largest difference between a higher predicted value and a lower actual value.
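The case-selection step for a regression-model sensitivity analysis can be mechanized as follows. This sketch uses synthetic values standing in for, e.g., systolic blood pressure; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic actual and predicted values (e.g., systolic blood pressure, mmHg).
actual = rng.normal(120, 15, 100)
predicted = actual + rng.normal(0, 10, 100)

# Identify the three kinds of cases described above: the largest
# over-prediction, the largest under-prediction, and the closest prediction.
error = predicted - actual
cases = {
    "largest_over_prediction": int(np.argmax(error)),   # predicted >> actual
    "largest_under_prediction": int(np.argmin(error)),  # predicted << actual
    "smallest_error": int(np.argmin(np.abs(error))),
}
print(cases)
```

The selected cases can then be described clinically, as in the blood-pressure example above.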

6) Unsupervised model assessment

The results from unsupervised learning can be assessed by field experts for accuracy and relevance by comparing them with typical patterns. Together with the sensitivity analyses described above, such assessment enhances the understanding of model behavior, fosters transparency, and guides future improvements.
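Expert review of unsupervised outputs can be complemented by quantitative internal metrics. The sketch below, on synthetic two-cluster data assumed for illustration, computes the silhouette score (range -1 to 1; higher means better-separated clusters) for a k-means clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters standing in for, e.g., patient phenotypes.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette score: a quantitative complement to expert judgment of whether
# the discovered clusters match clinically typical patterns.
score = silhouette_score(X, labels)
print(f"silhouette: {score:.2f}")
```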

3. Use of an AI model in clinical practice

If an AI model is planned to be used as part of clinical practice or clinical decision-making, the adherence or non-adherence of investigators to the study protocols for the use of the AI model should be reported because it not only affects the study outcomes but also provides useful information for the implementation of the same AI model in subsequent studies. If possible, it would be helpful to describe an exemplary non-adherence case where the AI model could not be used, either deliberately or accidentally, even though it had been planned to be used.
Unexpected changes in or impacts on medical practice or patient experience, which are caused by using an AI model, should be reported because they may act as confounding factors. For example, laboratory tests and/or radiographic imaging required for the AI model, which are performed in addition to routine clinical practice; manual input of the data to the AI model interface; or manual retrieval of the AI output and subsequent recording of the output in the medical record, can increase patient discomfort and inconvenience, risks to patient safety, medical personnel workload, and/or time required before clinical decision-making and relevant clinical practice is performed. If any changes external to the implementation of the AI model are considered to have affected AI model performance and conduct of the study, they should be reported.
If any modification was made to the AI algorithm during the study, the type of modification, the stage of the study at which it was made, and its impact on the study outcomes should be fully reported.
If the clinical recommendations provided by the AI model are determined to be erroneous based on the ground truth and have the potential to threaten patient safety, the subsequent steps in the clinical pathway that follow from the erroneous recommendations should be interrupted and redirected by the investigators. By contrast, a correct decision that disagrees with the investigators’ decision can be made using the AI model, highlighting its effectiveness. For an appropriate appraisal of the AI model for clinical decision-making, the numbers of agreements and disagreements in clinical decisions between the AI model and the investigators, as adjudicated against the ground truth, should be reported.

Discussion

A summary of the study results, their contribution to advancing our knowledge, their clinical implications, and their impact on relevant academic fields should be described. Whether the use of the AI model is supported by the study results compared to previous studies or current standards can also be stated. Comparisons can be made by referring to the performance metrics presented in the Results section.
Study limitations should be stated regarding the study materials and methods, unanticipated results, statistical uncertainty, any kind of bias, generalizability of the study results, any issues and challenges preventing a wide application of the AI model to clinical fields, and questions that remain unanswered by the current work. By balancing the strengths and weaknesses (limitations) of the evidence provided, the extent of support for the tested AI model can be determined, rather than judging it solely by its potential benefits.
The effects of human factors on model performance should be discussed. Future actions to be taken based on the study results should also be described (e.g., improvements to and/or modifications of the AI model for the next phase [widening its indications in different clinical settings]).
For safety issues, the authors should discuss the following: errors and risks related to the use of the AI model, adverse events and significant changes to the subsequent steps in the clinical pathways as a result, including whether they were attributable to errors, and the contributions of human factors to the errors. To mitigate these aspects for future studies, specific strategies, such as retraining or modification of the models, should be suggested with relevant rationales (e.g., model modification is recommended because it requires less time than retraining the model from scratch and has a higher likelihood of reducing risks compared to retraining). To obtain public trust in new technologies, all safety issues and strategies to mitigate or prevent them should be reported fully and transparently and discussed openly.

Public accessibility of the AI system, source code, and raw data

In the absence of the source code and data used to train the AI model, the model cannot be reproduced. The algorithm with relevant source code and training data, as well as the data collected during the study using the AI model, must be shared publicly so that the generalizability of the AI model can be evaluated transparently and unbiased comparisons with different models in different settings can be conducted.
To enable independent researchers to verify the code and replicate the results claimed by the original authors without modifying the code, the code should be provided in well-documented and easily understandable scripts or notebooks with clear and detailed explanations and annotations. Formatted raw data used as the model input, versions of libraries, packages, modules, or software components necessary for the code to function correctly, and any computer system configuration requirements should also be shared. Accessible links to repositories, contact information, and instructions for obtaining access should be provided.
Beyond the final outcome, generating as many intermediate results or outputs as possible at each stage of the model-building process can help independent researchers identify the specific steps at which replication may diverge from the original process. This detailed replication process enables other researchers to validate or adapt the model to their clinical cohorts, thereby accelerating the development of new, similar models for different clinical settings and thus establishing best clinical practices.
Unless the AI system is proprietary to commercial entities or governed by licenses that restrict its use, openly sharing the AI system and/or its code for public access and use is strongly recommended. If access to the AI system is restricted, the reason should be stated. If privacy protection issues limit access to the training data, at least the source code of the AI model should be publicly released.
For reference, an AI modeling checklist can be used, which categorizes the level of sharing on a 4-tier scale from fully open sharing to no sharing [22]. Reproducibility standards with three levels of computational reproducibility are also available [23]. Accordingly, model repositories and academic journals can set appropriate levels of sharing based on their policies, standards, and/or requirements.

Other information

1. Pre-registration of AI research to prevent p-hacking

P-hacking is closely associated with the issues of overfitting and information leakage in AI research. When researchers selectively report results by repeatedly performing analyses until significant outcomes are obtained [24], often by testing multiple model configurations with repetitive parameter tuning, they inadvertently cause overfitting, in which the model captures noise or irrelevant patterns in the training data rather than the true characteristics that are generalizable to different data. This behavior undermines the performance of the model on unseen data, resulting in misleading predictive power [25]. Furthermore, p-hacking can increase the risk of information leakage, where data meant to remain independent from validation or testing inadvertently influence the training process, leading to artificially inflated performance metrics that are not borne out in real-world settings. Pre-registration of AI research addresses these issues by requiring researchers to commit to a specific study design, data-handling procedures, and analysis methods before accessing the data. This commitment limits the flexibility that facilitates p-hacking and enforces strict guidelines for splitting the dataset to prevent information leakage, thereby promoting transparency in the research process that ensures true model performance and generalizability [26].
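The strict dataset separation that pre-registration enforces can be sketched as follows: split the data first, then fit every preprocessing step on the training set only. Fitting, for instance, a scaler on the full dataset would leak test-set statistics into training. The data here are synthetic and illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 4)), rng.integers(0, 2, 300)

# Split FIRST, before any preprocessing, as a pre-registered protocol would
# specify; stratify to keep class proportions comparable across splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)  # fit on the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # reuse training-set statistics
print(X_train_s.shape, X_test_s.shape)
```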

2. Safety issues related to errors following the use of AI models in medical practice

If the recommendations provided by the AI system are found to mislead clinical practice and affect patient safety and clinical outcomes, they can be rejected by the researchers. Authors should indicate who makes clinical decisions at each step along the clinical pathway.
Errors that occur as a result of using an AI algorithm are often unforeseen and can have catastrophic results when the algorithm is used on a large scale. Therefore, AI algorithm (performance) errors, errors external to AI model use, and human errors should be reported along with their occurrence rate, causes, and impacts on the clinical pathway/practice, study outcomes, and patient safety. How the errors were detected and handled should also be described so they can be corrected, along with whether the AI algorithm and/or human errors were detected before patient safety was jeopardized. Accordingly, efforts should be made to reduce the risks caused by these errors. Transparency in reporting and analyzing these errors helps to prevent repeating the same errors in future studies using the same AI model. It also helps to improve and upgrade the AI model.
Both the direct and indirect and expected and unexpected adverse events, which are attributed to errors and misuse of the AI model and even its correct use, should be reported along with the strategies used to mitigate them. The relevant risks to patient safety should also be identified and assessed. The safety profile of the AI model based on the aforementioned harms and risks can be the cornerstone of preventive measures to mitigate them in future studies and help determine appropriate target populations and timing for safe AI model use.
To demonstrate how well the investigators are trained, both in collecting the data for AI model development and in handling the data produced by the AI model, metrics for learning curves should be presented chronologically. A graphical representation is also encouraged so that other investigators can use the same AI model in different clinical settings.
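A learning curve of the kind described above could be computed, for example, as a rolling proportion of protocol-adherent AI model uses over consecutive cases. The data below are synthetic, and the rising success probability is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical chronological record of whether each of 100 consecutive uses
# of the AI model followed the protocol correctly; the probability of correct
# use is assumed to rise with investigator experience.
p_correct = np.linspace(0.5, 0.98, 100)
correct = rng.random(100) < p_correct

# Rolling proportion correct over a 25-case window: a simple learning curve
# that can be tabulated or plotted chronologically.
window = 25
rolling = np.convolve(correct, np.ones(window) / window, mode="valid")
print(f"first window {rolling[0]:.2f} -> last window {rolling[-1]:.2f}")
```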
Unless reporting errors and related safety profiles are planned or performed, the reasons for the omission should be explained.

3. Human errors in AI model use

Human errors in data preparation can significantly affect the performance of an AI system and its outcomes. Whether the data were prepared manually or based on an automatic operating algorithm must be stated. If the process is automated, the tools and algorithms should be fully described along with the parameters. If the input data are selectively acquired by a researcher, the researcher should be fully trained in appropriate data selection according to a standardized data selection protocol. Otherwise, ethical issues can arise in the case of adverse events. For example, a histological image from normal tissue rather than from a cancerous region may be input into an AI algorithm, thereby leading to misdiagnosis. In the absence of a standardized data selection protocol, whether the adverse events were caused by human errors in input data selection or algorithmic flaws in the AI system will be unclear. In addition, determining whether real clinical practice can accommodate the standardized data selection protocol is essential. Researchers using AI models in clinical practice should also be fully trained because data collected by poorly trained researchers who are not familiar with AI model use could bias the study results. Effectively presenting the training status could also involve evaluating the learning curves.

4. Errors external to the AI system

If the AI system is used in real clinical settings, the external factors influencing its performance should be considered. For example, any clinical decision that is made by the AI system and then approved by the researchers can be declined by patients for their own private reasons, unrelated to the clinical conditions and medical treatments (e.g., financial circumstances, religion, life perspective).

5. Ethical considerations regarding equity

Considerable effort should be made to assess and promote fairness and equity when developing an AI model, given that inequity is embedded in the current reference standard practice in the healthcare system. The rationale for these efforts should be described in accordance with the study goals. For example, the American Heart Association Get with the Guidelines-Heart Failure Risk Score systematically gives black patients low risk scores [27] without a rationale for this adjustment [28], making this population less likely to benefit from the cardiology service [29]. To address these issues, the AI model that is intended to predict heart failure risk should be appropriately adjusted by including a sufficient proportion of black patients and features relevant to this population (e.g., high blood pressure, left ventricular hypertrophy) in the input data [30].

Discussion

Unlike existing reporting standards, the guidelines and checklist developed in this study offer a comprehensive and versatile framework for reporting studies involving AI models in healthcare regardless of the specific study design. In particular, efforts have been made to ensure that essential components of AI research and reporting are not overlooked. However, these guidelines and checklist were developed solely by the two authors, both experts in statistics, and may not fully reflect diverse perspectives from other AI experts. Further refinement may thus be necessary to address these limitations and incorporate broader expertise.
In conclusion, these guidelines and checklist provide a valuable tool for reporting AI model research through enhancing transparency, ensuring reproducibility, and promoting the appropriate use of AI models in clinical practice.

Funding

None.

Conflicts of Interest

Sang Gyu Kwak and Jonghae Kim have been board members of the Statistical Rounds of the Korean Journal of Anesthesiology since 2016. However, they were not involved in any review process for this article, including peer reviewer selection, evaluation, or decision-making. No other potential conflict of interest relevant to this article was reported.

Data Availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Author Contributions

Sang Gyu Kwak (Conceptualization; Methodology; Project administration; Supervision; Validation; Writing – original draft; Writing – review & editing)

Jonghae Kim (Conceptualization; Methodology; Project administration; Resources; Supervision; Validation; Writing – original draft; Writing – review & editing)

Table 1.
Comprehensive Reporting Checklist for Studies Developing and Utilizing Artificial Intelligence Models
Item number Section/sub-section Elaboration Page number or reasons for omitting the corresponding item
1 Title and abstract
1-1 Title Use broad AI terms like “artificial intelligence” or “machine learning” in the title, reserving specific model details for the abstract.
1-2 Abstract Provide a structured summary including the study design, population details, AI algorithm type, statistical analysis methods, primary and secondary outcomes, main results, conclusions, and any public availability of the software, data, or model.
2 Introduction Provide the scientific and clinical background, including pre-existing evidence for AI use, related clinical problems, and the clinical role the AI plays in solving them; the research questions answered using the AI model; and study objectives and hypotheses.
3 Methods
3-1 Study design, setting, and population
3-1-1 Study design Clearly define the study design (prospective or retrospective), specific objectives (feasibility, superiority, prediction), and the reference standard used to benchmark the AI model’s performance.
3-1-2 Study settings Provide comprehensive details on the study settings, including the type, size, and specific location of the environment (e.g., hospital type, clinic area), availability of relevant facilities or technologies, representativeness of the real-world clinical conditions, and any technical configurations or site-specific adaptations, to clarify limitations of generalizing the AI model.
3-1-3 Study population Describe the participant recruitment process, including inclusion and exclusion criteria, and illustrate the number of participants at each stage using a flow diagram, if possible.
3-2 Description of the AI system
3-2-1 Study data · Data source and collection: specify the origin of input data and the collection period. Include any devices or software used, along with detailed acquisition protocols.
· Data structuring: indicate whether data is structured or unstructured.
· Data preprocessing and transformation: describe preprocessing steps to standardize and format data for the AI system. Include criteria for minimum data quality and how outliers or missing data were handled.
· Investigator expertise: if data collection relied on investigator expertise, list the number of investigators and their qualifications, any training provided, and the methods used to address inter- and intra-investigator variability.
· De-identification and privacy protection: provide details on the methods used for de-identifying data and protecting personal health information.
· Ground truth annotation: clearly define the ground truth reference (e.g., gold-standard clinical measurements) used for model validation. Ensure the definition is precise and reproducible.
· Data processing status: state whether data were pre-processed before analysis or processed during AI system application, and specify whether the data were generated during or before AI use.
3-2-2 Study output · Definition of AI output: specify the AI system’s output in relation to the clinical problem. Ensure that the format of both the AI output and ground truth are uniform.
· Impact on clinical management: describe how the AI output guides clinical management, especially if it influences clinical outcomes. If researchers perform standardized clinical actions based on the AI output, this should be documented.
· Interpretability: confirm that researchers understand and can interpret the AI output, which determines whether standardized actions are required (e.g., administering medication based on a probability threshold generated by the AI).
· Reference protocol comparison: if comparing AI performance to a reference clinical protocol, provide details on how the AI system and reference are used to guide clinical decision-making. Explain the rationale for using the reference standard, including its limitations, errors, and potential biases.
3-2-3 Data separation · Dataset splitting: describe how the dataset was initially split (e.g., training, validation, and test sets), including the proportions and rationale for each subset. Preferably, use an external test set for validation; if using internal validation, explain and justify the method used.
· Population representation: ensure that the test set represents the target population, using methods such as stratified sampling if necessary. Report and statistically compare the distribution of key variables across training, validation, and test sets, and investigate any systematic differences.
· Prevention of overfitting and information leakage to ensure generalizability: outline the methods used to prevent overfitting, such as k-fold cross-validation. Ensure that dataset splitting was done at the beginning, maintaining separation of sets to prevent information leakage. Describe the measures taken to prevent both issues to ensure the model is generalizable beyond the training data.
3-2-4 Concise description of the AI system · Model selection and task specification: identify the AI model type, intended task (e.g., classification or regression), and any specific beneficiaries. Include scientific rationales for model selection.
· Algorithm and supporting environment: describe the mathematical algorithm, hardware, and software (with versions) supporting the AI system, as well as developer or manufacturer details and configuration settings.
· Model versions from previous studies: cite prior development/validation studies, presenting the AI system information as it was used in those studies to facilitate performance comparison. If the AI was modified, provide scientific rationales for changes and describe the modifications clearly.
· Supplemental data for new models: provide unpublished models as supplementary materials or register them in a public database with accession details to ensure version history and transparency.
· AI model architecture: fully document the model architecture to enable replication, detailing inputs, outputs, components, and the scientific rationale for each. Include architectural elements such as layers, activation functions, pooling and normalization types, dropout layers, optimization algorithms, and hyperparameters.
· Version history and identification: for clinical trials, provide a unique device identifier or regulatory marker for the AI model. If the model has multiple versions, document modifications and the rationale behind each change.
· Reporting standards: use a standardized format for reporting concise information in the AI model, when possible.
3-2-5 Model training · Detailed training process: document all training processes in detail to ensure reproducibility, ideally providing the code.
· Data augmentation: describe any data augmentation techniques used (e.g., geometric transformations, paraphrasing, noise introduction) if required for specific data types, such as images or text.
· Parameter initialization: explain the initialization method for model parameters. For random initialization, describe the distributions from which the parameters are drawn (e.g., uniform, normal) and the key parameters of the distributions. For transfer learning, the source of the initial parameters should be provided. Specify any combination of random initialization and transfer learning, indicating unmodifiable parameters where applicable.
· Convergence monitoring: provide details on the methods used for monitoring model convergence, including pre-defined stopping criteria and hyperparameters. If convergence was not achieved, describe any adjustments considered (e.g., feature scaling, learning rate, batch size, architectural changes).
· Validation metrics: specify the metrics used to validate model performance (e.g., sensitivity, specificity, precision, area under the receiver operating characteristic curve, mean squared error).
3-2-6 Model evaluation · Performance metrics: specify the primary performance metric for model selection (e.g., area under the receiver operating characteristic curve, accuracy, and precision for classification; mean squared error for regression). Present metrics with statistical uncertainty (e.g., 95% CI) and compare them between models using appropriate statistical tests.
· Model evaluation methodology: provide a rationale for the chosen model evaluation method, allowing for flexibility in method selection rather than adherence to a predefined protocol.
· Multiple model selection: if more than one model is chosen, justify the selection. For ensemble models, provide 1) the training data allocation to each model, 2) the combination function to resolve disagreements among models, and 3) a full description of each model in the ensemble.
· Comparison with existing models: if applicable, compare the final model with previously published models addressing the same clinical problem.
· Sensitivity analysis: assess model robustness through sensitivity analysis by testing the model under various assumptions or initial conditions.
· Result interpretation: provide clear guidance on interpreting the AI model results to prevent misapplication in clinical settings.
3-3 Miscellaneous aspects of the AI model description
3-3-1 Defining features and response variables Use common data elements to ensure standardized, consistent, and compatible variable definitions and formats across study settings.
3-3-2 Sample size estimation Calculate the required sample size, if possible, based on results from a pilot or previous study to achieve the pre-determined statistical power with an acceptable type I error rate.
4 Results
4-1 Study data · Flowchart for inclusion/exclusion: use a flowchart or diagram to illustrate the inclusion and exclusion of participants or data at each stage, based on the specified criteria. Include the number of participants/data included or excluded and their corresponding criteria. If a flowchart is already provided in the Methods section, disregard this item.
· Dataset preparation: confirm the dataset was prepared as previously planned in the Methods section. If statistical comparisons among partitioned data are already reported in the Methods section, disregard this item.
· Population characteristics: describe and compare the characteristics of the training and test sets to detect any dataset shifts.
· Selective reporting of baseline characteristics: report baseline characteristics relevant to the AI model’s task or study outcomes.
· Missing data: clearly report the amount of missing data across the dataset’s features.
4-2 Model performance · Reporting metrics with statistical uncertainty: report performance metrics on the training, validation, and test sets, including statistical uncertainty and significance as outlined in the Methods section. Provide scientific rationale for each selected metric.
· Clinical translation of model performance: evaluate and justify how model performance metrics relate to clinical outcomes. Statistically compare the performance of the final model to that of standard techniques or baseline models.
· Feature contribution analysis: describe the contribution of each feature to model performance, with relevant plots for visualizing feature contributions if applicable.
· Sub-group performance: report sub-group analysis, identifying groups for which the model performed best and worst, as well as any critical sub-group performance.
· Sensitivity analysis: describe cases based on model confidence and prediction correctness for classification models. Report cases with the largest and smallest differences between predicted and actual values for regression models.
· Unsupervised model assessment: assess the accuracy and relevance of unsupervised model outputs through expert review.
4-3 Use of the AI model in clinical practice · Adherence to protocols: report adherence or non-adherence of the investigators to the study protocols for AI model use in clinical practice. Include details on any cases of non-adherence, particularly if the AI model was not used as planned.
· Impact on medical practice and patient experience: document any unexpected changes in medical practice or patient experience caused by AI model usage. Note any additional procedures, manual data handling, or increased workload associated with AI integration.
· External influences on model performance: report any external changes or influences beyond AI model implementation that may affect model performance or conduct of the study.
· Modifications to the AI algorithm: describe any modifications made to the AI algorithm during the study.
· Error handling: document instances of both agreement and disagreement between the AI model’s recommendations and investigators’ decisions, based on the ground truth, to assess the model’s reliability in clinical decision-making.
5 Discussion · Summary of study results: summarize the study findings, their contribution to advancing knowledge, clinical implications, and impact on the relevant academic field. Indicate whether the study results support the use of the AI model compared to previous studies or current standards, referring to performance metrics when applicable.
· Study limitations: identify study limitations related to materials and methods, unexpected results, statistical uncertainty, biases, generalizability, challenges in clinical application, and unanswered questions. Balance these limitations against the study’s strengths to assess the extent of the evidence supporting the AI model’s potential benefits.
· Human factors: discuss the effects of human factors on model performance.
· Future actions: outline potential future actions based on the study results.
· Safety issues: address any safety concerns and propose specific strategies to mitigate these issues in future studies, along with the rationale for each approach.
6 Public accessibility of the AI system, source code, and raw data · Source code and training data availability: ensure that the algorithm, source code, and training data, as well as data collected during the study, are shared publicly.
· Code documentation: provide the code in well-documented, easily understandable scripts or notebooks, with clear explanations and annotations.
· Formatted data and software/hardware requirements: share the formatted raw data used as model input, along with the versions of libraries, packages, modules, and software components, and any specific computer system configuration requirements necessary for the code to function.
· Repository access information: include accessible links to repositories, contact information, or instructions for obtaining access to all relevant files.
· Intermediate results: generate and share as many intermediate outputs as possible at each stage of model development.
· Access restrictions: if access to the AI system or data is restricted due to proprietary or licensing issues, clearly state the reason. If privacy concerns limit access to training data, release at least the model’s source code for public access.
· Levels of sharing: refer to the existing categorization of levels of sharing and adhere to model repository or journal-specific policies on data and code sharing.
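The software/hardware-requirement and repository items above can be supported by shipping a machine-readable environment manifest alongside the code. The sketch below uses only the Python standard library; the helper name `environment_manifest` and the package list are illustrative assumptions, not prescribed by these guidelines:

```python
import json
import platform
import sys
from importlib import metadata

def environment_manifest(packages):
    """Record the Python version, platform, and installed package
    versions needed to rerun the shared code."""
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            manifest["packages"][name] = "not installed"
    return manifest

# Write the manifest next to the shared source code.
print(json.dumps(environment_manifest(["numpy", "torch"]), indent=2))
```

Committing such a manifest to the repository pins the library, package, and module versions that the reported results depend on.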
7 Other information
7-1 Pre-registration of AI research to prevent p-hacking
· Pre-registration of AI research protocols: pre-register the study design, data handling (including strict separation among training, validation, and testing datasets), model configurations, and parameter tuning procedures.
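A pre-registered, strictly separated split is easiest to audit when it is deterministic. The sketch below is one possible illustration, not a prescribed method; the function name `split_ids`, the fractions, and the seed are assumptions. Fixing and pre-registering the seed lets reviewers reproduce the exact train/validation/test separation:

```python
import random

def split_ids(record_ids, val_frac=0.15, test_frac=0.15, seed=42):
    """Deterministic record-level split; pre-registering the seed and
    fractions prevents post hoc reshuffling between datasets."""
    ids = sorted(record_ids)          # canonical order before shuffling
    random.Random(seed).shuffle(ids)  # fixed seed -> reproducible split
    n = len(ids)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": set(ids[:n_test]),
        "val": set(ids[n_test:n_test + n_val]),
        "train": set(ids[n_test + n_val:]),
    }

parts = split_ids(range(100))
# The three sets are disjoint and together cover all 100 record IDs.
```

For patient-level data, the IDs should identify patients rather than individual records, so that no patient contributes to more than one dataset.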
7-2 Safety issues related to errors in AI model use in medical practice
· Misleading recommendations: indicate whether the AI model’s recommendations were found to mislead clinical practice, affecting patient safety and clinical outcomes. Specify who is responsible for clinical decisions at each step in the clinical pathway.
· Reporting errors: document AI algorithm errors, errors external to AI model use, and human errors, including their occurrence rate, causes, and impacts on the clinical pathway, study outcomes, and patient safety.
· Error detection and management: describe how errors were detected, managed, and corrected, noting whether AI algorithm or human errors were identified before they jeopardized patient safety.
· Risk reduction efforts: outline any efforts made to reduce the risks caused by AI-related and external errors.
· Reporting adverse events: report all direct and indirect, expected and unexpected adverse events related to AI model use, misuse, or even correct use, along with strategies to mitigate these events.
· Risk assessment for patient safety: identify and assess relevant risks to patient safety associated with AI model use.
· Learning curve metrics: provide metrics for learning curves of the investigators involved in data collection for AI model development. Present these metrics chronologically, with graphical representation, if possible.
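One simple chronological learning-curve metric for data collectors is the rolling proportion of correct annotations. The sketch below is an illustrative assumption (the name `learning_curve` and the window size are not prescribed by these guidelines); its output can be plotted against annotation order to show the curve graphically:

```python
def learning_curve(correct_flags, window=20):
    """Rolling proportion of correct annotations, in chronological
    order, as a simple learning-curve metric for data collectors."""
    curve = []
    for i in range(window, len(correct_flags) + 1):
        recent = correct_flags[i - window:i]  # most recent annotations
        curve.append(sum(recent) / window)
    return curve

# An investigator who improves over time yields a rising curve.
flags = [0] * 20 + [1] * 20
print(learning_curve(flags, window=20))
```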
7-3 Human errors in AI model use
· Data preparation method: specify whether data were prepared manually or through an automated algorithm. If automated, describe the tools, algorithms, and parameters used in the process.
· Researcher training for data selection: if input data were selectively acquired by the researchers, confirm they were fully trained in a standardized data selection protocol. Clarify whether this protocol can be accommodated in real clinical practice.
· Researcher training for AI model use in clinical practice: confirm that researchers using the AI model in clinical settings received adequate training.
7-4 Errors external to the AI system
· External influences: document external factors that could influence AI system performance in clinical settings.
7-5 Ethical considerations regarding equity
· Fairness and equity assessment: describe efforts made to assess and promote fairness and equity in the AI model, acknowledging any existing inequities in current healthcare standards.
· Inclusion of underrepresented groups: ensure the input data adequately include underrepresented groups (e.g., racial or ethnic populations) and relevant features to support fair prediction.
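One concrete way to operationalize the fairness items above is to stratify a performance metric by group membership. The sketch below is illustrative only (the helper name `per_group_accuracy` is an assumption); it computes per-group accuracy to surface performance gaps affecting underrepresented populations:

```python
from collections import defaultdict

def per_group_accuracy(preds, labels, groups):
    """Accuracy stratified by subgroup (e.g., race or ethnicity),
    to surface performance gaps across populations."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        totals[g] += 1
        hits[g] += int(p == y)
    return {g: hits[g] / totals[g] for g in totals}

# Toy example with two subgroups of two cases each.
print(per_group_accuracy([1, 0, 1, 1], [1, 1, 1, 0], ["a", "a", "b", "b"]))
```

Large gaps between subgroup metrics, especially for sparsely represented groups, should be reported alongside the overall performance figures.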

Indicate whether each item and sub-item of this checklist applies to the submitted work. In the last column for each item, add, underlined, the page numbers where the relevant items are described in the text body. Note that not all items may be relevant to the submitted work; if an item is not applicable, provide an underlined, justifiable reason for its omission in the last column. Elaborations on each item are provided in the text body. This checklist focuses on evaluating the processes involved in AI model development and/or use. Miscellaneous aspects not directly related to the AI model itself are listed under the item “Other information”. AI: artificial intelligence.

Kwak and Kim, 2025. This table is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). https://creativecommons.org/licenses/by-nc/4.0/

References

1. Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med 2016; 375: 1216-9.
2. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019; 25: 44-56.
3. CONSORT-AI and SPIRIT-AI Steering Group. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat Med 2019; 25: 1467-8.
4. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020; 26: 1364-74.
5. Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, Denaxas S, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 2022; 28: 924-33.
6. Collins GS, Moons KG, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385: e078378.
7. Mongan J, Moy L, Kahn CE Jr. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell 2020; 2: e200029.
8. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366: 447-53.
9. Deo RC. Machine learning in medicine. Circulation 2015; 132: 1920-30.
10. Kelly A, Shults J, Mostoufi-Moab S, McCormack SE, Stallings VA, Schall JI, et al. Pediatric bone mineral accrual z-score calculation equations and their application in childhood disease. J Bone Miner Res 2019; 34: 195-203.
11. Sabottke CF, Spieler BM. The effect of image resolution on deep learning in radiography. Radiol Artif Intell 2020; 2: e190015.
12. Heaven D. Why deep-learning AIs are so easy to fool. Nature 2019; 574: 163-6.
13. Willemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, et al. Preparing medical imaging data for machine learning. Radiology 2020; 295: 4-15.
14. Ortega R, Connor C, Kotova F, Deng W, Lacerra C. Use of pressure transducers. N Engl J Med 2017; 376: e26.
15. Sendak MP, Gao M, Brajer N, Balu S. Presenting machine learning model information to clinical end users with model facts labels. NPJ Digit Med 2020; 3: 41.
16. Wandner LD, Domenichiello AF, Beierlein J, Pogorzala L, Aquino G, Siddons A, et al. NIH's Helping to End Addiction Long-term initiative (NIH HEAL Initiative) clinical pain management common data element program. J Pain 2022; 23: 370-8.
17. Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med 2021; 385: 283-6.
18. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020; 21: 345-52.
19. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med 2018; 178: 1544-7.
20. Marshall A, Altman DG, Royston P, Holder RL. Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 2010; 10: 7.
21. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018; 169: 866-72.
22. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med 2020; 26: 1320-4.
23. Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC. Reproducibility standards for machine learning in the life sciences. Nat Methods 2021; 18: 1132-5.
24. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol 2015; 13: e1002106.
25. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2: e124.
26. Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proc Natl Acad Sci U S A 2018; 115: 2600-6.
27. Peterson PN, Rumsfeld JS, Liang L, Albert NM, Hernandez AF, Peterson ED, et al. A validated risk score for in-hospital mortality in patients with heart failure from the American Heart Association get with the guidelines program. Circ Cardiovasc Qual Outcomes 2010; 3: 25-32.
28. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight - reconsidering the use of race correction in clinical algorithms. N Engl J Med 2020; 383: 874-82.
29. Eberly LA, Richterman A, Beckett AG, Wispelwey B, Marsh RH, Cleveland Manchanda EC, et al. Identification of racial inequities in access to specialized inpatient heart failure care at an academic medical center. Circ Heart Fail 2019; 12: e006214.
30. Havranek EP, Froshaug DB, Emserman CD, Hanratty R, Krantz MJ, Masoudi FA, et al. Left ventricular hypertrophy and cardiovascular mortality by race and ethnicity. Am J Med 2008; 121: 870-5.