Introducing our Responsible AI Framework
Learn how to responsibly, safely, and compassionately develop and deploy AI solutions for healthcare.
Prepared By: Mattea Welch, Benjamin Grant, and Christopher Deutschman
In Consultation With: Clare McElcheran, Adam Badzynski, Jennifer A.H. Bell, Andrew Hope, Robert C. Grant, Tran Truong, Kelly Lane, Patti Leake, Divya Sharma, Ian Stedman, Mike Lovas, Ale Berlin, Jeremy Petch, Benjamin Haibe-Kains, and James A. Anderson
1. General
1.1. Fairness Definitions
While both ‘bias’ and ‘inequity’ have myriad meanings, for the purposes of this document, the terms will be defined as follows:
Bias
An umbrella term for both social biases and methodological biases. Social bias, which can be conscious or unconscious, can be defined as “discrimination for, or against, a person or group, or a set of ideas or beliefs, in a way that is prejudicial”.(1) Methodological bias can be defined as “systematic errors stemming from choices made during the research process”(2), such as selection bias, observer bias, response bias, and publication bias.(3) In the machine learning sphere, there is often an intersection between social and methodological biases (for example, data based on nursing assessments may have a methodological observer bias, where part of the inter-observer variability is due to each nurse’s individual social biases). Within the context of AI, a biased solution is one that has different performance for different subgroups of patients or inequitable impact on different subgroups of patients.
Inequity
A term defined broadly as “unjust differences between populations in the access, use, quality, and outcomes of care.”(4) The key difference between ‘inequities’ and ‘disparities’ is that ‘inequity’ is a value-laden term – the differences described are unjust or unfair – whereas ‘disparity’ merely describes that a difference is present. Serious scrutiny is required to determine in which cases a ‘disparity’ should or should not be considered an ‘inequity’. Inequities are systematic differences in the opportunities groups have to achieve optimal health.(5) These systematic differences are influenced by social biases and by structural differential access opportunities.
Responsibility
A ‘responsible’ AI solution is one that is applied in a way that is the most appropriate (just, ethical) – but what is most appropriate for a given AI solution in a given situation is not static. The framework that follows is adaptable to most definitions of ‘responsibility’ that could be chosen. An AI solution that is developed and deployed following all of the steps provided in this framework should implicitly follow, at minimum, the principles of equity, fairness, scientific rigor, and democratic engagement, but other values will be infused by choices made throughout the AI lifecycle (such as which patient populations are served or prioritized, how fairness is defined and to what extent the definition is enforced, how information about the AI is disseminated to end users, and so on). Identifying which values are important to the project at the outset is critical to making these decisions downstream.
An AI solution could be considered responsible if, in the end, it functions in alignment with the core principles of clinical ethics(6) (beneficence, non-maleficence, autonomy, and justice); or if it meets a pre-existing set of ethical standards such as those put forth by the Coalition for Health AI(7) (usefulness, fairness, safety, transparency, and privacy). Alternatively, the guiding definition of responsibility for your solution may be, for example, that the AI contributes to the common good(8). However, when using a broader philosophical definition of responsibility, care must be taken to operationalize it appropriately. For example, in the case of defining responsible AI as contributing to the common good, who is the “common” – the patients served by your institution? All residents of the city where your institution is located? The nation? Humanity? After defining “common”, what is the definition of “good”, and who gets to have input on that definition?
1.2. Common Methodological Biases
Below is a list of commonly found methodological biases that can be present at different stages of an AI-Solution’s development and deployment. This list is not meant to be a comprehensive overview of all potential biases, but rather an introduction to some of the most common ones.
- Selection bias: Occurs when the training data is not representative of the broader patient population, resulting in skewed results. This can lead to models that perform well on certain subgroups but fail on others, especially if certain patient demographics are underrepresented (e.g., minority groups).(9,10) This can be the result of biased sampling, measurement errors, or incomplete data collection, leading to non-representative datasets. In healthcare, this can occur if electronic health records (EHRs) or datasets are incomplete or contain errors, leading to skewed models. For example, if data is collected only from a specific hospital type or region, the model may not generalize well to other settings. Missing data also introduces bias when it is missing in a systematic way, leading to skewed results. For instance, if healthier patients are more likely to have missing records (i.e., fewer tests performed), models may underpredict positive outcomes.
- Confirmation bias: The tendency to search for, interpret, or prioritize information that confirms one’s preexisting beliefs or hypotheses, while disregarding contradictory evidence. For example, a belief that a population of patients is not impacted by inequitable care may lead a team to forgo a thorough literature search.(11,12)
- Incorporation bias: A subtype of verification bias, where the predictive model includes features or data points that are part of the outcome it is trying to predict. For example, if an ML model predicts a diagnosis and uses test results that are part of the diagnostic process itself, it may artificially inflate model performance.(13)
- Survivorship bias: Arises when only patients who have survived or remained in care are included in the study, while those who have died or dropped out are ignored. In ML model predictions, this can lead to over-optimistic performance in predicting patient outcomes, as the model ignores negative outcomes. For qualitative research related to patient experiences, this will also, likely, result in over-optimistic evaluation of an AI-Solution.(14,15)
- Model overfitting: When a ML model is too complex, it may fit the noise or anomalies in the training data rather than the underlying patterns. This leads to excellent performance on the training set but poor generalization to new, unseen patient data.(16,17)
- Spectrum effects: Occur when a model’s performance varies across different patient subgroups, such as those with different disease severities. This bias can result in a model performing well for a mild disease but poorly for severe cases, leading to misleading evaluations of model accuracy.(18)
- Automation bias: This bias occurs when clinicians overly trust ML model outputs without critically evaluating them. Over-reliance on automation can lead to errors, especially if the model is flawed or the input data is incorrect. (19)
- Unconscious bias: Also known as implicit bias, unconscious bias can be introduced if the data or model development reflects the prejudices of the research team. For example, models might inadvertently favor certain patient groups over others if the underlying data contains implicit biases related to race, gender, or socioeconomic status.(20)
2. Problem Identification and Study Design
2.1. Define the problem and the solution
Assembling your cross-functional team
The cross-functional team composition and involvement of specific individuals should depend on the scope, topic and the current stage of the project. The team is not a formal review committee, but one that ensures each stage of development is considered by an appropriate expert. Over the lifetime of the project, the team should, at some point, include ML researchers, clinicians, data analysts, statisticians, equity and ethics experts, patient representatives, clinical champions and end-users, as well as ethics, privacy and strategy representatives.
Patient engagement and participation
Patient voices, represented by patient partners, family and caregivers, and/or professional patient advocates, should be present within and beyond the cross-functional team. Patient partnerships are critical to ensuring that patients guide the research in identifying unmet needs and improving outcomes related to those needs. Patient voices can be present in the whole spectrum of AI research, from problem identification and study design through to clinical implementation and lifecycle monitoring.
Many institutions and disease specific advocacy groups will have their own supports for patient partners, so it is important to look for local initiatives. In Canada, the Canadian Institutes of Health Research has a framework for patient-oriented research and engagement (https://cihr-irsc.gc.ca/e/48413.html) – similar initiatives may exist in other countries.
Non-technical solutions
The allure of AI in healthcare often creates a “shiny object syndrome”(21) where research teams and organizations rush to implement cutting-edge technology simply because it’s novel and exciting, rather than because it’s the most effective solution. An AI-first mindset can lead to overlooking simpler, proven interventions that might better serve patients and healthcare workers. When enthusiasm for AI’s potential overshadows careful consideration of alternatives, organizations risk investing substantial resources into complex technical solutions for problems that might be better solved through basic process improvements, better staffing, or enhanced communication systems. It’s worth remembering that sometimes what appears to be a technology problem is actually a human systems problem in disguise, and no amount of sophisticated AI can fix underlying issues with workflow, staffing, or resource allocation.
When addressing healthcare challenges, it’s crucial to first explore fundamental non-technical and non-AI-Solutions that may be more effective and sustainable. These approaches may require fewer resources, face fewer implementation challenges, and can be more easily adapted to local contexts compared to AI-Solutions.
2.2. Focus on Equity in the Problem Space
Explore and document baseline inequities
A key component of responsible AI development and deployment is mitigating bias and inequity and promoting fairness. In order to do this, there need to be clearly defined equity objectives and fairness metrics to measure success. These definitions will be specific to the population, problem space, and solution that is being addressed by a particular AI-Solution. Exploring and understanding baseline inequities is paramount to developing a fair and equitable model, and can be done in a few different ways (please see the sections below). Taken together, these learnings lay the foundation for the quantitative analysis of biases and inequities within your target population once you have access to local retrospective data, as described in Section 3.2.
Literature review (social determinants of health, sources of bias and inequity)
The equitable and compassionate components of this framework rely on an extensive review of documented biases in the healthcare literature. However, it should be acknowledged that biases identified in the literature are not guaranteed to be exhaustive, nor always fully reflected in the collected data or healthcare system, particularly when the evolution of protected attributes over time is considered (e.g. newly adopted gender definitions that are more granular and representative, or new discoveries that increase the specificity of a disease definition). Key patient factors associated with healthcare inequity include race(22–26), age(27–30), sex(31–33), geographical location of residence(34,35), patient support systems(36,37), primary language, income level, and other social determinants of health. Each of these elements independently may impact healthcare access, treatment, and/or outcomes, and the effects often compound when elements are present together, creating unique barriers for those with intersectional identities. Common intersections include age/sex and ethnicity/sex. Accounting for intersectional identities and intersections of other health-related groupings remains challenging(38); there are far more combinations of intersecting demographic factors than can be practically addressed. Broadening bias assessment to include intersectional identities is essential, but often constrained by data availability.
Consult with knowledgeable partners
There will be gaps in the healthcare literature regarding what biases exist in a given problem space. By consulting with individuals who understand the patient experience in the specific problem space, as recommended in our framework, the investigator may begin to fill in these gaps. There are numerous stakeholders that can inform the team’s understanding of these biases, including patients, patient partners, families/caregivers, patient navigators, and healthcare providers. The goal of this consultation process is to identify any biases or inequities, real or perceived, that must be considered and either successfully mitigated or identified as a limitation to the work.
Working with First Nations, Indigenous, and Métis Research and Data
For AI-Solutions that are focused on First Nations, Indigenous, and/or Métis populations, or are known to utilize First Nations, Indigenous, or Métis data, special care should be taken to respect the data sovereignty of these groups. If the AI-Solution will use data from specific individual nations, the project team must seek approval from appropriate leaders from those nations prior to commencing the project.
To bolster the researcher’s own capacity to respect/assert the principles of data sovereignty, it is recommended that all members of the research team complete The Fundamentals of OCAP® course that was developed by the First Nations Information Governance Centre. OCAP stands for Ownership, Control, Access and Possession, and the course provides valuable information on how to appropriately engage with these communities and their data.
Through this course, participants will learn about the history and motivation behind OCAP, as well as receive practical steps for participating in, or seeking to participate in, relevant research studies. As examples, individuals should ask themselves the following questions, among others that are outlined in the course:
- Was this data collected with the approval and knowledge of the First Nations?
- Does the project align with the priorities of the First Nation(s) from whom the data is coming?
- Could undertaking this project/using this data cause harm to the First Nation(s) or its members?
- Are First Nation(s) members being included in the project from conception to analysis to implementation?
2.3. Outcome Measurements and Data Requirements
Assess AI Output Integration into Clinic
Prior to embarking on the lengthy and costly endeavor of AI-Solution development, testing, and integration, ensure that the solution will have a measurable impact on clinical processes. Two important questions to ask are: (1) What will be done differently in the clinic, either operationally or for patients, with the AI-Solution outputs? (2) Do pathways exist to act on AI-Solution outputs?
The clinical impact of an AI-Solution’s output may be affected by a variety of scenarios, including the state of care facilities outside of the healthcare institution (e.g. an output recommending discharge of patients to long-term care centers that do not have vacancies), available treatments (e.g. an output identifying patients who will become septic, a condition whose treatment has not advanced substantially in recent decades), and healthcare resources (e.g. an output that predicts emergency room visits when the emergency room is already at capacity). The feasibility of addressing these scenarios, and the degree of control the research team has over them, varies and should be thoughtfully considered.
Outcome measurement
Defining the ideal outcome and determining what should be measured to assess that outcome are two separate steps. First, the ideal outcome should be defined based on the ideal clinical state after AI-Solution integration. For example, for an AI-Solution meant to predict and intervene in post-surgical pain crisis, the ideal clinical state post-deployment might be ‘more patients with appropriately managed pain’. After the ideal clinical state has been defined, the most appropriate outcome measurement(s) can be identified. In this example, measurement may be simple, such as self-reported patient pain or pre- and post-deployment number of analgesic orders per patient.
The challenge lies in making sure that the measurement(s) chosen is/are fair and equally assessable across all affected subpopulations. Patients of colour are often prescribed pain medication at a lower rate than white patients, despite being in equal or higher levels of pain. When they are prescribed analgesics, patients of colour are also more likely to be given a lower dose of the medication or a different regimen entirely. If the AI-Solution is making predictions of the likelihood of pain crisis based in part on historic analgesic orders, the prediction will not be appropriately sensitive for patients of colour.
In this example, measuring outcomes such as ‘rate of analgesics ordered across target subpopulations’ or ‘average patient self-reported pain pre- and post-deployment per target subpopulation’ may be more appropriate to capture how the AI-Solution is affecting different groups. Without taking care to do measurement by subpopulation, there may still be an overall increase in orders or overall decrease in reported pain after surgery, but those trends may be primarily influenced by the experiences of white patients, and the unequal increase in orders/dosing would lead to an overall inequitable clinical state.
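As a minimal illustration of this kind of subgroup-level measurement, the sketch below (Python, with hypothetical file and column names) summarizes analgesic orders and self-reported pain by subpopulation; comparing pre- and post-deployment versions of such a table is one way to check whether overall improvements mask inequitable changes.

```python
# Hypothetical sketch: stratify outcome measurements by subpopulation
# rather than reporting a single overall value.
import pandas as pd

# Assume one row per surgical patient, with self-identified race/ethnicity,
# a self-reported pain score, and the number of analgesic orders placed.
df = pd.read_csv("post_surgical_outcomes.csv")

by_group = df.groupby("race_ethnicity").agg(
    n_patients=("patient_id", "nunique"),
    mean_pain_score=("self_reported_pain", "mean"),
    analgesic_orders_per_patient=("analgesic_orders", "mean"),
)
print(by_group)  # compare pre- vs post-deployment extracts of this table
```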
Data requirements and availability
Before moving on to model training and development, it is important to consider the data that is required to allow your model to perform effectively and equitably. This includes determining specific outcome measures, the features that need to be included in the model, and whether they are collected in a format, and for a purpose, that is adequate for your specific problem and population. Most clinicians, researchers, and AI specialists are not particularly familiar with the sources, types, structures, and availability of data, so it is worth doing this investigation up front to avoid significant problems at later stages.
It is also important when identifying data elements and data sources to understand the context in which the data was gathered. How and why the data was collected and the intended purpose/user of the data is important information when assessing potential biases and limitations with the data.
3. Model Training and Development
3.1. Accessing Retrospective Data
If you are doing any kind of health research that involves real-world, non-public healthcare data, you will need approval to conduct this research from an official governing body. It is up to your team to identify what the appropriate pathway is to get approval, and to seek that approval before accessing any data.
3.2. Appropriateness of Retrospective Data
Identify current clinical benchmarks
Clinical benchmarks are essential for evaluating new machine learning models in healthcare because they ensure relevance and impact. By setting a comparison standard grounded in real-world clinical practice, these benchmarks help establish whether a model provides tangible benefits over existing methods. This relevance ensures models address actual clinical needs rather than producing technically interesting but clinically irrelevant results.
Furthermore, benchmarks foster trust and safety. They align models with regulatory standards, reducing risks and proving the model’s utility before deployment. Comparative benchmarking also helps stakeholders gauge the model’s effectiveness against current methods, ensuring that any advancements are not just statistically significant but clinically meaningful, thereby justifying the model’s use in healthcare settings.
Assessing dataset for disparities and ‘fit for purpose’
Biases that should be considered during this phase of the project include sampling, convergence and participation bias. Additionally, using known societal inequities and biases that were identified during the literature search of the problem space, assess your retrospective dataset to ensure it is properly representative of these known issues. This can be achieved most simply by looking at dataset distributions, and minority and majority class representations. Furthermore, for protected or minority groups, a statistical power analysis should be performed to ensure there are enough individuals in the dataset to garner meaningful information.
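As a rough illustration, the hedged sketch below (Python, with hypothetical file and column names) checks subgroup representation and outcome rates in a retrospective extract, and uses statsmodels to approximate the subgroup size needed to detect a given difference in event rates.

```python
# Hypothetical sketch: check subgroup representation in the retrospective
# dataset and estimate the minimum subgroup size needed to detect a given
# difference in outcome rates (two-proportion z-test approximation).
import pandas as pd
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

df = pd.read_csv("retrospective_cohort.csv")  # assumed local extract

# Distribution of a protected attribute, and the outcome rate within each group
print(df["sex"].value_counts(normalize=True))
print(df.groupby("sex")["outcome"].mean())

# Rough power analysis: how many patients per group are needed to detect
# a difference between a 20% and a 30% event rate at alpha=0.05, power=0.8?
effect = proportion_effectsize(0.20, 0.30)
n_required = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n_required:.0f} patients needed in each group")
```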
3.3. Defining Objectives and Metrics
Appropriate statistical measures of performance
Defining appropriate statistical measures of performance for an AI-Solution is crucial to ensuring its effectiveness and reliability. The choice of metrics should align with the specific problem domain, objectives, and nature of the dataset. For instance, in binary classification tasks, measures like AUC-ROC (Area Under the Receiver Operating Characteristic Curve)(39,40) and PR-AUC (Precision-Recall Area Under the Curve)(41,42) are often used. However, selecting between these metrics depends on the data characteristics and known imbalances. AUC-ROC provides an overall view of the model’s ability to discriminate between classes, but it may not be as informative in cases of class imbalance. PR-AUC, on the other hand, focuses on the precision and recall trade-off, making it a better choice when the positive class is rare or when false positives and false negatives have differing consequences. Failure to select the right metric can result in a misleading evaluation of the model’s performance, with results dominated by, and biased toward, the majority class.
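The toy example below, using scikit-learn on synthetic imbalanced data, illustrates how AUC-ROC and PR-AUC can tell different stories about the same classifier; it is a sketch for illustration rather than a recommended evaluation protocol.

```python
# Illustrative comparison (synthetic data) of AUC-ROC and PR-AUC on an
# imbalanced binary task, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# With a rare positive class, AUC-ROC can look reassuring while PR-AUC
# (average precision) reveals much weaker performance on the minority class.
print("AUC-ROC:", roc_auc_score(y_te, scores))
print("PR-AUC :", average_precision_score(y_te, scores))
```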
Some metrics also have known dependencies that must be considered to ensure accurate interpretation of performance. For example, the Dice Similarity Coefficient (DICE), commonly used in medical image segmentation tasks, is influenced by the size or volume of the segmentation(43,44). Larger segmentations may artificially inflate the DICE score, while smaller ones might unfairly penalize it. Understanding these dependencies is essential to avoid obscuring the true performance of the model. Practitioners should complement such metrics with additional measures or normalization techniques to account for these biases and provide a more holistic evaluation.
Another key aspect is the validation strategy employed to assess the AI model. Techniques like k-fold cross-validation help to mitigate overfitting and provide a robust estimate of the model’s generalizability(45,46). Using too few folds might lead to high variance in performance estimates, while excessively large numbers of folds can increase computational costs without significant gains in reliability. An inadequate validation process can lead to over-optimistic results on retrospective data that fail to generalize to new prospective data, undermining the solution’s practical applicability.
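A brief sketch of stratified k-fold cross-validation with scikit-learn (synthetic data) is shown below; reporting the spread across folds, not just the mean, helps expose unstable performance estimates.

```python
# Sketch of stratified 5-fold cross-validation for a more robust performance
# estimate than a single train/test split (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC-ROC per fold: {scores.round(3)}; mean = {scores.mean():.3f} +/- {scores.std():.3f}")
```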
Overfitting is another common pitfall that arises when statistical measures are not appropriately applied or understood(16). Overfitting occurs when a model learns the training data too well, capturing noise instead of generalizable patterns. This often happens when performance metrics are optimized exclusively on the training data or when complex models are used without adequate regularization. For instance, reporting high accuracy on training data while neglecting poor performance on unseen data can give a false sense of success. Similarly, reliance on a single metric, such as accuracy, in imbalanced datasets can obscure significant deficiencies, such as a model’s inability to correctly identify minority class instances.
Misusing statistical measures—whether by selecting inappropriate metrics, inadequately validating models, or overfitting to training data—can lead to incorrect conclusions and poor real-world performance. By thoughtfully addressing these considerations, practitioners can create AI-Solutions that are both accurate and trustworthy.
Defining equity objectives
Equity objectives for a project are informed by current inequities related to the problem space, and will fall on a spectrum from maintaining current inequity levels to reducing them; an AI-Solution should never make inequity levels worse. To obtain the identified equity objective, appropriate and complementary fairness metrics are required.
Measures of Fairness
Selecting appropriate measures of fairness is imperative to the assessment of developed AI-Solutions. A few example measures are highlighted below, but investigators are encouraged to do a review of the literature to see if any newer and more relevant measures have been developed. (47–50)
Fairness metrics
Designed to analytically assess equality and equity issues in order to give insight into the nature of the model’s performance.
- Group Unawareness: Group unawareness means that a model does not use a sensitive variable during prediction(51). For example, in a simple linear regression, the coefficient of the sensitive variable would be set to 0. Group Unawareness can be assessed using feature-attribution methods such as SHAP, or by testing the statistical significance of a univariable Ordinary Least Squares regression fit to predict the outcome of interest from the sensitive variable alone. Caution is needed when using Group Unawareness in scenarios where the sensitive variable may be highly correlated with a proxy variable.
- Statistical Parity/Demographic Parity: measures whether a model predicts a positive outcome at equal rates for each segment of a subgroup.(52)
- Equal Opportunity: measures whether individuals from each segment of a protected class, who are eligible/qualified, have the same probability of receiving a positive outcome (e.g. being offered participation in a clinical trial). However, assessment of eligibility/qualification can be subjective, and Equal Opportunity does not address biases in the assessment process.(49,53)
- Equal Odds: this metric is used to quantify whether equal true positive rates and false positive rates exist between different groups. It is more restrictive than equal opportunity, making it more appropriate when there are existing biases in the data. However, this is a very restrictive metric and may reduce the model’s performance.(54,55)
- Positive Predictive Value (PPV) – Parity: given a positive prediction, the precision is equal across different groups. For example, if a model positively predicts that a patient should receive “treatment X”, the probability of this treatment being successful is equal in all segments of a protected class.(56)
- False Positive Rate (FPR) – Parity: this metric complements PPV-Parity and aims to ensure that each segment of a protected class has the same false positive rate. For example, among patients for whom “treatment X” would not be beneficial, the probability of the model nonetheless recommending it is equal in all segments of the protected class.(57)
Open-Source fairness metric libraries
- FairLearn: An open-source Python package designed for metric calculation and reduction of bias in algorithms (see the sketch after this list).
- Fairness Indicators: A package from Google designed to work with TensorFlow. Can be used for evaluation and visualization of group disparities.
- AIF360: An open-source package used to detect and mitigate biases. It is known for its extensive documentation and tutorials.
- Themis-ML: An open-source package for easy integration with scikit-learn. It is used mainly for binary classes and evaluation.
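As an illustration of how such metrics can be computed in practice, the sketch below uses FairLearn on small, hypothetical arrays to report a per-group metric and two of the scalar disparity measures described above.

```python
# Minimal sketch of group fairness metrics with FairLearn
# (hypothetical arrays; y_pred would come from any trained classifier).
import numpy as np
from fairlearn.metrics import (MetricFrame, demographic_parity_difference,
                               equalized_odds_difference)
from sklearn.metrics import recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
sex    = np.array(["F", "F", "F", "M", "M", "M", "M", "F"])

# Per-group recall (true positive rate), reported side by side
mf = MetricFrame(metrics=recall_score, y_true=y_true, y_pred=y_pred,
                 sensitive_features=sex)
print(mf.by_group)

# Scalar gaps: 0 means parity, larger values mean larger disparities
print("Demographic parity difference:",
      demographic_parity_difference(y_true, y_pred, sensitive_features=sex))
print("Equalized odds difference:",
      equalized_odds_difference(y_true, y_pred, sensitive_features=sex))
```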
3.4. Model Training and Testing
Acquisition of a third-party model
Assessment of an AI-model, or AI-Solution, acquired from a third party for deployment in a healthcare institution is imperative for the safety of patients. Comprehensive questioning is required across multiple domains. Some questions that should be asked include:
- General Information
- What is the intended purpose and scope of the model?
- Has the model been used in clinical settings similar to ours?
- What are the specific clinical problems it aims to solve?
- Were patients and end-users consulted during the conception and development of this model?
- Development and Validation
- What data was used to train and validate the model? (e.g., size, sources, geographic diversity, demographic representation) Are we able to do an internal assessment of the data?
- What validation processes were followed? (e.g., external validation, cross-validation)
- What are the key performance metrics, and how do they vary across subpopulations of interest?
- Bias and Fairness Assessment
- Was the training data representative of the populations the model will serve? This may be challenging if the training data cannot be accessed and compared to local data.
- What methods, if any, were used to detect and mitigate biases in the development phase?
- Are there known performance disparities across demographic groups (e.g., age, sex, race, ethnicity)?
- What fairness frameworks or metrics were used to evaluate the model?
- Does the model include safeguards to minimize inequities in its recommendations?
- Are there transparency mechanisms to report when biases or unfair outcomes are detected post-deployment?
- Safety and Risk Management
- What safeguards are in place to ensure patient safety in case of errors?
- Has the model undergone stress testing for edge cases or rare clinical scenarios?
- What adverse outcomes or unintended consequences were identified during testing?
- Is there a protocol for monitoring and reporting errors or adverse events post-deployment?
- Compliance and Regulatory Adherence
- Does the model comply with relevant regulations?
- Are there documented audit trails for data usage and model outputs?
- Operational Integration
- What are the technical requirements for integration with our existing systems (EHR, PACS, etc.)?
- What resources are needed for deployment and maintenance?
- What user training and support are provided?
- Post-Deployment Monitoring
- What tools or processes are available for continuous monitoring of model performance, clinical impact, and operational and regulatory adherence?
- How frequently should the model be retrained or updated?
- How is feedback from clinicians and patients incorporated into updates?
- Are there mechanisms to adjust or recalibrate the model based on observed disparities?
- Ownership and Intellectual Property
- Who owns the model and any updates made during its use?
- What are the terms of data sharing, if applicable?
Data Representativeness and Quality
When the retrospective dataset is split into training and testing cohorts (through, e.g., randomized stratified splitting, bootstrapping, or time-wise splitting), investigators should take care to ensure each cohort retains the same representation of known biases and inequities (i.e. the distributions should be the same when compared to the full retrospective dataset).
Data quality should also be assessed at this step. Data missingness and consistency in variable naming are two such checks that should be completed prior to commencing ML training. Additionally, if using longitudinal data, an understanding of whether there were any changes in clinical practice over time is required to avoid temporal bias (e.g. changes in variable naming standards, standard treatment regimens, or pandemics affecting patient presentation).
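A hedged sketch of these checks is shown below (Python, with hypothetical file and column names): the split is stratified jointly on outcome and a protected attribute, subgroup outcome rates are compared across cohorts, and basic missingness and schema checks are run before training.

```python
# Hypothetical sketch: split retrospective data while preserving the joint
# distribution of outcome and a protected attribute, then run basic
# data quality checks before training.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("retrospective_cohort.csv")  # assumed local extract

# Stratify on outcome AND protected attribute so both cohorts retain the
# same subgroup representation as the full dataset.
strata = df["outcome"].astype(str) + "_" + df["sex"].astype(str)
train_df, test_df = train_test_split(df, test_size=0.3, stratify=strata,
                                     random_state=0)

for name, cohort in [("full", df), ("train", train_df), ("test", test_df)]:
    print(name, cohort.groupby("sex")["outcome"].mean().to_dict())

# Simple data quality checks: per-column missingness and unexpected columns.
print(df.isna().mean().sort_values(ascending=False).head(10))
expected_cols = {"patient_id", "sex", "age", "outcome"}  # hypothetical schema
print("Unexpected columns:", set(df.columns) - expected_cols)
```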
Mitigation methodologies
As with the fairness and equity measures presented in Section 3.3 of our appendix, this section highlights only a few methods for improving model fairness. Additional methods can be found in some of the open-source fairness metric libraries mentioned above. A literature review is recommended if none of the methods below are appropriate, or to determine whether newer and more relevant methods have been developed.
Data pre-processing
In some scenarios, it is appropriate to change or adjust the dataset to be fairer before training an ML model.
- Relabeling and perturbation: These methods involve changing the end-point label or features in a dataset. Relabeling attempts to balance the dataset by changing the end-point label, while perturbation involves varying the features/variables to create a more balanced representation of the data. Two examples of these methods are disparate impact remover(58) and “massaging”(59). However, relabeling or perturbing data can introduce inaccuracies and distort the data’s original distribution. This may lead to models learning incorrect or overly simplified relationships, so these techniques must be thoroughly validated against the original data distributions.
- Sampling: A dataset can be sampled up or down by adding or removing samples, respectively, to shift the sample distribution toward a more balanced one. Upsampling the minority class can be done by duplicating existing samples or generating synthetic data, but caution is warranted since the model may start to overfit to duplicated or synthetic examples. Downsampling can involve removing majority group samples, but similarly, caution is warranted since data complexity can be reduced. Methods such as the Synthetic Minority Over-sampling Technique (SMOTE)(60) attempt to balance these trade-offs (see the sketch below).
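The sketch below shows SMOTE applied with the imbalanced-learn package on synthetic data; note that resampling is applied to the training data only, so that the evaluation set keeps its natural class distribution.

```python
# Sketch of upsampling a rare positive class with SMOTE (imbalanced-learn),
# applied to the training set only, using synthetic data for illustration.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample ONLY the training data; the test set must keep its natural
# class distribution so evaluation remains realistic.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print("Before:", Counter(y_tr), "After:", Counter(y_res))
```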
ML training methods
These methods are used to modify or alter ML training algorithms to improve model fairness.
- Regularization and constraints: these methods alter the loss function of an algorithm. During regularization, an extra term is used to penalize discrimination, while constraints are used to limit the allowed bias level according to a certain loss function. Prejudice Remover(61), Exponentiated Gradient Reduction(62) (sketched below), Grid Search Reduction(62) and Meta Fair Classifier(63) are all examples of these techniques. A potential drawback of these methods is the difficulty in correctly defining the penalization terms or constraints. Overly strict constraints might lead to underfitting and reduced model performance, while poorly chosen terms can fail to adequately address bias. Additionally, these techniques may require significant computational resources and careful tuning, which can be challenging in practice.
- Adversarial learning involves training two models that compete to improve their performance(64). Specifically, one model attempts to predict the true label of a dataset, while the other model attempts to exploit a known fairness issue using equality metrics. A drawback of adversarial learning is its complexity and the risk of instability during training, as the competing objectives of the models can lead to convergence issues. Adversarial models may inadvertently reduce overall predictive accuracy if fairness constraints conflict significantly with optimizing performance. Furthermore, adversarial models are not appropriate in scenarios where there are known differences between subgroups. As an example, when developing auto-segmentation models with a known performance difference between males and females, adversarial learning should not be used since morphological differences are known to exist between sexes(65).
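As one concrete illustration of the reduction-based approach named above, the sketch below applies FairLearn's Exponentiated Gradient with a demographic parity constraint to synthetic data and a randomly assigned sensitive feature; the constraint choice and tuning would need to reflect the project's actual equity objectives.

```python
# Minimal sketch of Exponentiated Gradient Reduction with a demographic
# parity constraint, using FairLearn (synthetic data and sensitive feature).
import numpy as np
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
sensitive = np.random.default_rng(0).choice(["A", "B"], size=len(y))

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=sensitive)
y_pred = mitigator.predict(X)  # predictions constrained toward parity
```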
Prediction post-processing
Post-processing methods act on the model predictions and are used when access to training data or the model is limited. These methods would be more appropriately used in scenarios where the model is commissioned from an external group.
- Classifier correction: A trained ML model is adapted to remove discrimination based on equalized odds and equality of opportunity constraints. Calibrated Equalized Odds(66) is an example of one of these methods. However, classifier correction depends on accurately identifying sources of bias and setting appropriate constraints. Poorly defined constraints can lead to either insufficient fairness improvements or a decrease in the model’s predictive accuracy. Additionally, applying such corrections post-training may not address deeper issues of bias inherent in the data, limiting their effectiveness in mitigating unfairness.
- Output correction: model outputs are modified to obtain fairer distribution of the data. Reject Option based Classification(67) is an example of this type of method that assigns more favourable outcomes to protected groups based upon low confidence regions of the classifier. While output correction can improve fairness metrics, it may distort predicted probabilities and reduce trust in model predictions. This approach addresses bias at the output level without resolving biases present in the training data or the model itself, potentially leading to superficial fairness improvements that fail to generalize to new datasets. Furthermore, aggressive corrections can harm overall model performance and usability for certain tasks.
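The sketch below is not an implementation of the cited methods, but a related post-processing approach available in FairLearn: ThresholdOptimizer adjusts group-specific decision thresholds of an already-trained (for example, commissioned) classifier to approximately satisfy equalized odds.

```python
# Related post-processing sketch (not the cited methods themselves):
# FairLearn's ThresholdOptimizer tunes per-group thresholds on a fitted
# classifier to approximately satisfy an equalized-odds constraint.
import numpy as np
from fairlearn.postprocessing import ThresholdOptimizer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=0)
group = np.random.default_rng(0).choice(["A", "B"], size=len(y))

base = LogisticRegression(max_iter=1000).fit(X, y)  # e.g., a commissioned model

postproc = ThresholdOptimizer(estimator=base, constraints="equalized_odds",
                              prefit=True, predict_method="predict_proba")
postproc.fit(X, y, sensitive_features=group)
y_fair = postproc.predict(X, sensitive_features=group)
```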
4. Silent Deployment and Clinical Evaluation
4.1. Prospective Deployment Preparation
Approvals for prospective deployment
Similar to Section 3.1., access to and use of prospective clinical data requires institutional approval. Which type of institutional approval to seek depends on the specifics of the project, including the type of data that is required, the sensitivity of that data, the intended audience of the model outputs, and the nature of any potential interventions based on the model outputs.
Some important questions to consider:
- Which clinical system(s) will provide prospective data?
- Do you require data on demand (live data feed) or on a schedule?
- Is there sufficient infrastructure to meet your data needs?
- Are there limitations on how the prospective data can be used (e.g., Silent Mode data cannot be used to develop new models)?
- What are the potential interventions that could be prompted by your deployed model and could they have a direct impact on clinical decision-making?
Retrospective to prospective mapping
Healthcare data and EHR systems are constantly changing. In many cases, AI models are developed and tested using historical data, and/or from historical records systems. In these cases, every effort must be made to evaluate how this data corresponds to the data expected from defined sources of prospective healthcare data, and clearly map the differences. It is also important to consider spectrum effects and any unintended disparities in data.
While navigating a change in EMR system between model development and silent deployment is a dramatic (albeit common) scenario, similar approaches are a necessary part of any prospective deployment. Even within any given EMR, important definitions such as procedure codes, names of fields, and other critical data may change over time without warning. After any initial data mapping is complete, prospective data needs to be compared to historical data to ensure consistency and accuracy. Questions you can ask yourself are:
- Are all of the features included in your model being captured?
- Are there any unexplained shifts in your data? (missingness, volumes, values, etc.)
This is an iterative process that continues until the prospective data quality meets expectations, and it carries on as part of lifecycle monitoring.
Prospective Data Representativeness
In ideal scenarios, the prospective data used during Silent Mode and Prospective Deployment will have demographic distributions and data collection methods similar to those of the retrospective data used for development. However, there are certain scenarios where retrospective and prospective datasets will differ, but the model still meets the defined objectives and metrics. In this scenario, the model is robust to certain acceptable variable/data variations. These acceptable variations should be assessed and documented so that the model does not undergo unnecessary retraining or decommissioning.
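One lightweight way to compare retrospective and prospective distributions is sketched below (Python, with hypothetical file and column names), using a Kolmogorov-Smirnov test for a continuous feature and a chi-squared test for a categorical one; flagged shifts should be reviewed with the cross-functional team rather than acted on automatically.

```python
# Hypothetical sketch: compare retrospective and prospective feature
# distributions to flag unexpected shifts.
import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency

retro = pd.read_csv("retrospective_cohort.csv")   # assumed extracts
prosp = pd.read_csv("silent_mode_extract.csv")

# Continuous feature: two-sample Kolmogorov-Smirnov test
stat, p = ks_2samp(retro["age"].dropna(), prosp["age"].dropna())
print(f"Age distribution shift: KS={stat:.3f}, p={p:.3g}")

# Categorical feature: chi-squared test on the contingency table
counts = pd.concat([retro["sex"].value_counts(), prosp["sex"].value_counts()],
                   axis=1, keys=["retro", "prosp"]).fillna(0)
chi2, p, dof, _ = chi2_contingency(counts.T)
print(f"Sex distribution shift: chi2={chi2:.2f}, p={p:.3g}")
```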
4.2. Prospective Model Assessment (Silent Mode)
Silent Deployment Considerations
Many approaches to silent mode
There are many different potential approaches and phases to silent mode, and what is best will depend entirely on the specific solution being evaluated. As a general guide, silent mode involves: (1) developing the required workflow integration components, with consideration of human factors and user experience; (2) validating the model’s performance; (3) assessing reliability and integration within the clinical workflow under real-world conditions; and (4) identifying potential biases, usability issues, and any unintended consequences.
Threshold selection
Some AI-Solutions will require the selection of a statistical threshold, above and below which different actions are taken. This threshold determines the binarization point of an ML prediction. It should be selected in collaboration with clinical end-users and the project’s cross-functional team, and chosen to balance false positive and false negative rates, thereby minimizing unnecessary operational burden and interventions on the one hand, and delayed diagnosis or intervention on the other(68). These thresholds should be selected with consideration of subgroups.
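A hedged sketch of this kind of threshold exploration is shown below on synthetic arrays: a candidate threshold is chosen where overall false positive and false negative rates are roughly balanced, and the same trade-off is then inspected within each subgroup before the threshold is adopted.

```python
# Sketch of examining candidate thresholds overall and by subgroup
# (synthetic stand-ins for validation-set labels, scores, and groups).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, 1000), 0, 1)
group = rng.choice(["A", "B"], 1000)

def fpr_fnr_at(threshold, y, s):
    pred = s >= threshold
    fpr = np.mean(pred[y == 0])    # false positive rate among true negatives
    fnr = np.mean(~pred[y == 1])   # false negative rate among true positives
    return fpr, fnr

# Candidate threshold where FPR and FNR are roughly balanced overall...
fprs, tprs, ths = roc_curve(y_true, scores)
t = ths[np.argmin(np.abs(fprs - (1 - tprs)))]
print("Candidate threshold:", round(float(t), 3))

# ...then verify the trade-off holds within each subgroup before adoption.
for g in np.unique(group):
    m = group == g
    print(g, "FPR/FNR:", [round(v, 3) for v in fpr_fnr_at(t, y_true[m], scores[m])])
```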
Assessing Silent Mode Performance
Silent mode performance should broadly assess how the model performs in a real-world setting, as well as how the solution is integrated into current clinical workflows. The cross-functional team should be consulted if the model fails to meet any of the defined statistical measures of performance, equity objectives, or fairness metrics, to determine whether the model should be adapted, mitigation strategies should be implemented, or the AI-Solution is not meeting the intent of the project and should be discontinued.
Define auditing methods
In collaboration with the cross-functional team, an auditing plan should be generated prior to solution rollout. The frequency of audits will be determined by the team. Audits of the solution are comprehensive and are not meant to replace regular assessment of model performance. An audit of the solution should consider the following:
- Ethics, security and privacy check of solution and data being used.
- Changes in standard practice, structural changes, or other sources of data drift, that may affect the data being used.
- Model performance across patient subgroups.
- Whether or not equity objectives and fairness metrics are still appropriate and being met.
- Feedback from end-users regarding experience, questions, adoption and any additional recommendations about the solution.
- Failures that have occurred since the last audit (including if they were properly communicated and addressed)
4.3. Prospective Clinical Evaluation
Education materials
Developing education materials to provide during the Clinical Evaluation of an AI-Solution in healthcare is critical to ensure all stakeholders understand the technology, its goals, and their roles. This section highlights a sampling of materials a cross-functional team may need to develop for their Clinical Evaluation, depending on the stakeholders involved.
- End-users:
- User Manuals and Quick Start Guides:
- Include examples of AI outputs and how to interpret them.
- Provide mechanisms for users to report issues or provide feedback.
- Decision Support Documentation:
- Highlight evidence supporting the AI model’s recommendations.
- Address limitations, biases, and areas of uncertainty.
- Describe situations where the model should not be used or where caution is needed.
- Compliance and Safety Guidelines:
- Clarify the pilot’s regulatory compliance and safeguards to prevent patient harm.
- For Patients (and Caregivers, if applicable):
- Simplified Information Sheets or Videos:
- Describe the AI-Solution, its role in their care, and expected benefits.
- Trust-Building FAQs:
- Address concerns about AI (e.g., “Will AI replace my doctor?”).
- Reassure participants about clinician oversight and privacy safeguards as relevant.
- Patient Feedback Channels:
- Provide materials explaining how participants can offer feedback or raise concerns.
Recruit end-users for future clinical integration
When assessing the AI-Solution during a Clinical Evaluation, obtaining feedback from end-users and patients involved in the pilot is essential to evaluate whether clinical integration is effective. Various methods can be used to obtain this feedback including surveys, interviews, focus groups, or direct observation. Here are some examples of feedback to consider:
Feedback from end-users:
- Is the AI-Solution intuitive and user-friendly?
- Are the AI insights/actionable recommendations accurate and clinically meaningful?
- Do the AI insights/actionable recommendations align with evidence-based practices or clinical guidelines?
- Does the AI tool save time or reduce manual workload? Does it increase confidence in developing a care plan? Does it improve equitable care?
- Is it reducing cognitive load or decision fatigue?
- Is the rationale behind AI outputs clear and explainable?
- If the end-user is a clinician: Do you feel confident relying on the tool?
- If the end-user is a clinician: Are you noticing measurable improvements in patient care quality or outcomes?
- Was adequate training provided on how to use this tool?
- Is ongoing technical support accessible and helpful?
Feedback from Patients:
- Are you aware that AI is part of your care process?
- Do you understand how the AI contributes to your treatment?
- Do you feel the tool is enhancing your care (e.g., faster results, personalized treatments, fewer errors or delays)?
- Are there any concerns about depersonalization of care?
- Do you feel your data is secure and used responsibly?
- Are you comfortable with AI being used in decision-making?
- Do you trust the collaboration between clinicians and AI?
5. Operationalization and Lifecycle Monitoring
5.1 Preparation and Documentation
Comprehensive documentation
Comprehensive documentation bridges the gap between development, deployment, and real-world use. Documentation should be generated both for internal usage (to ensure safe, ethical, and effective implementation within the local care system) and external publishing (to foster transparency, reproducibility, and collaboration).
Some best practices to keep in mind for both internal and external documentation include:
(1) generating plain language summaries for accessibility to non-technical stakeholders;
(2) ensuring all documentation aligns with FAIR principles (Findable, Accessible, Interoperable, and Reusable)(69) when publishing; and
(3) including a revision history of any updates and improvements to the AI-Solution.
Below are some recommendations for documentation for internal usage and external publishing:
Internal Documentation:
1. Model Development and Technical Specifications
- Model Overview:
- Objectives and intended use cases.
- Description of the problem the AI addresses.
- Model Architecture:
- Detailed explanation of algorithms, architecture, and training pipeline.
- Hyperparameters and optimization techniques.
- Data Sources and Preprocessing:
- Description of datasets used for training, validation, and testing.
- Details on data cleaning, augmentation, and handling missing values.
- Consider use of Data Labels(70)
- Fairness, Equity and Performance Metrics:
- Intended equity objective(s) and utilized fairness metric(s)
- Performance evaluation metrics (e.g., accuracy, precision, recall, AUC-ROC).
- Comparison with baseline methods.
- Ethics and Bias:
- Description of steps taken to minimize bias and promote fairness.
- Description of known biases and inequities in AI-Solution.
- Versioning:
- Model version history and updates.
2. Deployment Documentation
- System Integration:
- How the AI integrates with existing healthcare infrastructure (including e.g. what data sources are accessed, who the primary end-users are, etc).
- Operational Guidelines:
- Deployment process, runtime environment, and maintenance requirements.
- Interfaces:
- User interface guides for clinicians or administrators.
3. Regulatory and Compliance Documentation
- Data Privacy and Security:
- Compliance with regulations like GDPR, HIPAA, or equivalent.
- Explanation of data storage, access controls, and encryption methods.
- Risk Management:
- Identified risks and mitigation strategies.
4. Monitoring and Post-Deployment Validation
- Performance Monitoring Plans:
- Procedures for tracking model performance in real-world settings.
- Update and Retraining Guidelines:
- When and how the model should be updated or retrained.
- Audit Logs:
- Documentation of decision-making processes for accountability.
External Documentation (For Publishing to Journals or Open Access Repositories):
The TRIPOD+AI(71) statement is an excellent resource that should be followed for external publishing of an AI-Solution. As a starting point, we recommend the following sections for inclusion in any external publication:
- Research and Model Development
- Introduction and Background:
- Problem statement, clinical relevance, and literature review.
- Methods:
- Detailed description of the AI development process, including:
- Model architecture and algorithms.
- Training pipeline and dataset descriptions.
- Statistical methods used for validation.
- Explanation of fairness assessments and steps to mitigate bias
- Results:
- Intended equity objective and utilized fairness metric
- Evaluation metrics (e.g., accuracy, precision, recall, AUC-ROC), statistical significance tests, and comparative analysis.
- Comparison with baseline methods or alternatives.
- Visualization of results (e.g., confusion matrices, ROC curves).
- Discussion:
- Interpretation of findings, limitations, and implications for clinical practice.
- Intended users
- Known limitations
- Dataset Documentation (if sharing data)
- Data Description:
- Features, data sources, and collection methods.
- Data Preprocessing Steps:
- Cleaning, normalization, and other transformations.
- Data Dictionary:
- Definitions of variables and labels.
- Ethical and Legal Considerations:
- How patient privacy was protected.
- Any restrictions on dataset usage.
- Open Access Repositories
- Code Repository (e.g., GitHub, GitLab):
- Source code with comprehensive comments and clear organization.
- Instructions for reproducing results (e.g., environment setup, dependencies).
- Model Repository (e.g., Hugging Face, Zenodo):
- Pre-trained models with detailed usage instructions.
- Associated metadata for easy reference.
- FAQs:
- Common questions from reviewers, researchers, or practitioners.
5.2. Communication and Education
Modifying pilot educational materials
Well-designed educational materials help build confidence, trust, and engagement across all stakeholders, enabling a smooth clinical integration of the AI-Solution. During this step, the education materials created during the Clinical Evaluation should be modified for a broad audience. Some recommended additions to the above section on education materials (Section 4.3) include:
- End-users:
- User Manuals and Quick Start Guides:
- Expand to explain system functionalities, workflows and common troubleshooting steps.
- Clinical Workflow Integration Training:
- Provide interactive sessions or videos on how to use the AI-Solution.
- FAQ and Troubleshooting Tips:
- Address common concerns and technical issues.
- For Organizational Stakeholders (e.g. administrators, IT):
- Implementation Guides:
- Highlight infrastructure, security, and resource requirements.
- Compliance and Governance Briefs:
- Outline regulatory, ethical, and operational responsibilities.
Communication plan
The communication plan should ensure that the team has the following information readily available, complete, and free from reporting bias:
- Equity objectives
- Fairness metrics
- Model performance across defined subgroups
- Information about non-technical components of AI-Solution
- Contact information of solution management team
- Educational material to avoid automation bias
- Educational material on system usage
Communication plans should encompass dissemination of regular reports and audits, patient and end-user educational materials, and adverse event reporting. Plans should also be generated in advance for events such as AI-Solution update, AI-Solution failure, and AI-Solution decommissioning.
5.3. AI-Solution Rollout and Monitoring
Regular monitoring and reporting
The metrics that should be monitored and reported on will depend on the final AI-Solution. However, it is recommended that AI-Solution monitoring focuses on performance, safety, compliance, and operational impact. As examples:
- Model performance and data quality can be monitored through error rates (false positives and false negatives), drift detection (in both data and model performance), and statistical metrics.
- Clinical outcomes can be monitored through workflow integration and adverse events, such as patient safety issues.
- Operational metrics, security, and privacy can be monitored through uptime and response times, as well as usage metrics, data breaches, and access logs.
- Periodic revalidation of the AI-Solution may also be beneficial, using updated data or following a system update that affects the software or hardware used by the AI-Solution.
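As one example of a simple drift check, the sketch below computes a population stability index (PSI) for a single feature against its training-time baseline (synthetic data; alerting thresholds are a local policy decision, though values above roughly 0.2 are a common rule of thumb for review).

```python
# Simple population stability index (PSI) sketch for drift monitoring,
# comparing a feature's live distribution against the training baseline.
import numpy as np

def psi(baseline, current, bins=10):
    """PSI between two samples of one continuous feature."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    # Clip so prospective values outside the baseline range fall into the end bins
    b = np.histogram(np.clip(baseline, edges[0], edges[-1]), bins=edges)[0] / len(baseline)
    c = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(0)
baseline_age = rng.normal(60, 12, 5000)   # training-time distribution
current_age = rng.normal(64, 12, 1000)    # recent prospective data

print(f"PSI = {psi(baseline_age, current_age):.3f}")  # larger values = larger drift
```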
Utilizing the developed communication plan, regularly report on the AI-Solution’s impact and performance, as determined by audits (Section 4.2), feedback from end-users and patients, and regular monitoring metrics. These regular reports should be comprehensive and include disparities in AI-Solution performance, current clinical adoption and impact, as well as any adverse events that have occurred.
5.4. Updating or Decommissioning of AI-Solution
If at any point an AI-Solution is failing to meet the set objectives and requirements (i.e. fairness and performance metrics are worsening, institutional requirements are no longer being met, or end-users and/or patients are no longer seeing benefit from the solution), the cross-functional team should be consulted. During this consultation there are three possible next steps: Pausing the AI-Solution, Updating the AI-Solution, or Decommissioning the AI-Solution.
Each of these steps will look different depending on the institution and AI-Solution. Broadly speaking, it should be determined what each of these steps would look like from a clinical and regulatory perspective and who should be informed. Additionally, depending on the decided step, the results and/or changes will need to be communicated.
References
- Webster CS, Taylor S, Thomas C, Weller JM. Social bias, discrimination and inequity in healthcare: mechanisms, implications and recommendations. BJA Education. 2022 Apr 1;22(4):131–7.
- Fernández Pinto M. Methodological and Cognitive Biases in Science: Issues for Current Research and Ways to Counteract Them. Perspectives on Science. 2023 Oct 1;31(5):535–54.
- Popovic A, Huecker MR. Study Bias. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 [cited 2024 Oct 7]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK574513/
- Baumann AA, Cabassa LJ. Reframing implementation science to address inequities in healthcare delivery. BMC Health Serv Res. 2020 Mar 12;20(1):190.
- National Academies of Sciences E, Division H and M, Practice B on PH and PH, States C on CBS to PHE in the U, Baciu A, Negussie Y, et al. The Root Causes of Health Inequity. In: Communities in Action: Pathways to Health Equity [Internet]. National Academies Press (US); 2017 [cited 2024 Oct 25]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK425845/
- Varkey B. Principles of Clinical Ethics and Their Application to Practice. Med Princ Pract. 2021 Feb;30(1):17–28.
- Coalition for Health AI. Responsible AI Guidance. [Internet]. [cited 2025 July 24]. Available from: https://www.chai.org/workgroup/responsible-ai
- Coeckelbergh M. Artificial intelligence, the common good, and the democratic deficit in AI governance. AI Ethics. 2025 Apr 1;5(2):1491–7.
- Cortes C, Mohri M, Riley M, Rostamizadeh A. Sample Selection Bias Correction Theory. In: Lecture Notes in Computer Science [Internet]. Berlin, Heidelberg: Springer-Verlag; 2008 [cited 2025 July 30]. p. 38–53. Available from: https://doi.org/10.1007/978-3-540-87987-9_8
- National Cancer Institute. NCI Dictionary of Cancer Terms. 2011 [cited 2025 July 30]. Definition of selection bias. Available from: https://www.cancer.gov/publications/dictionaries/cancer-terms/def/selection-bias
- Nickerson RS. Confirmation Bias: A Ubiquitous Phenomenon in Many Guises. Review of General Psychology. 1998 June 1;2(2):175–220.
- Pilat D, Krastev S. The Decision Lab. 2021 [cited 2025 July 30]. Confirmation Bias – Biases & Heuristics. Available from: https://thedecisionlab.com/biases/confirmation-bias
- Kea B, Hall MK, Wang R. Recognising bias in studies of diagnostic tests part 2: interpreting and verifying the index test. Emerg Med J. 2019 Aug;36(8):501–5.
- Elston DM. Survivorship bias. Journal of the American Academy of Dermatology [Internet]. 2021 June 17 [cited 2025 July 30];0(0). Available from: https://www.jaad.org/article/S0190-9622(21)01986-1/abstract
- Pasqualetti F, Barberis A, Zanotti S, Montemurro N, De Salvo GL, Soffietti R, et al. The impact of survivorship bias in glioblastoma research. Critical Reviews in Oncology/Hematology. 2023 Aug 1;188:104065.
- Hawkins DM. The Problem of Overfitting. J Chem Inf Comput Sci. 2004 Jan 1;44(1):1–12.
- Park Y, Ho JC. Tackling Overfitting in Boosting for Noisy Healthcare Data. IEEE Transactions on Knowledge and Data Engineering. 2021 July;33(7):2995–3006.
- Usher-Smith JA, Sharp SJ, Griffin SJ. The spectrum effect in tests for risk prediction, screening, and diagnosis. BMJ. 2016 June 22;353:i3139.
- Abdelwanis M, Alarafati HK, Tammam MMS, Simsekler MCE. Exploring the risks of automation bias in healthcare artificial intelligence applications: A Bowtie analysis. Journal of Safety Science and Resilience. 2024 Dec 1;5(4):460–9.
- Sabin JA. Tackling Implicit Bias in Health Care. New England Journal of Medicine. 2022 July 13;387(2):105–7.
- Entrepreneur [Internet]. 2017 [cited 2025 July 30]. Do You Have “Shiny Object” Syndrome? What It Is and How to Beat It. Available from: https://www.entrepreneur.com/living/do-you-have-shiny-object-syndrome-what-it-is-and-how-to/288370
- Guo LN, Lee MS, Kassamali B, Mita C, Nambudiri VE. Bias in, bias out: Underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection—A scoping review. Journal of the American Academy of Dermatology. 2022 July 1;87(1):157–9.
- Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021 Dec;27(12):2176–82.
- Khor S, Haupt EC, Hahn EE, Lyons LJL, Shankaran V, Bansal A. Racial and Ethnic Bias in Risk Prediction Models for Colorectal Cancer Recurrence When Race and Ethnicity Are Omitted as Predictors. JAMA Netw Open. 2023 June 1;6(6):e2318495.
- Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen LC, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health. 2022 June;4(6):e406–14.
- Reeder-Hayes KE, Jackson BE, Kuo TM, Baggett CD, Yanguela J, LeBlanc MR, et al. Structural Racism and Treatment Delay Among Black and White Patients With Breast Cancer. J Clin Oncol. 2024 Nov 10;42(32):3858–66.
- Ring A, Harder H, Langridge C, Ballinger RS, Fallowfield LJ. Adjuvant chemotherapy in elderly women with breast cancer (AChEW): an observational study identifying MDT perceptions and barriers to decision making. Ann Oncol. 2013 May;24(5):1211–9.
- Protière C, Viens P, Rousseau F, Moatti JP. Prescribers’ attitudes toward elderly breast cancer patients. Discrimination or empathy? Crit Rev Oncol Hematol. 2010 Aug;75(2):138–50.
- Ring A. The influences of age and co-morbidities on treatment decisions for patients with HER2-positive early breast cancer. Crit Rev Oncol Hematol. 2010 Nov;76(2):127–32.
- Haase KR, Sattar S, Pilleron S, Lambrechts Y, Hannan M, Navarrete E, et al. A scoping review of ageism towards older adults in cancer care. J Geriatr Oncol. 2023 Jan;14(1):101385.
- Cirillo D, Catuara-Solarz S, Morey C, Guney E, Subirats L, Mellino S, et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit Med. 2020;3:81.
- Lee MS, Guo LN, Nambudiri VE. Towards gender equity in artificial intelligence and machine learning applications in dermatology. J Am Med Inform Assoc. 2022 Jan 12;29(2):400–3.
- Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences. 2020 June 9;117(23):12592–4.
- Turner M, Fielding S, Ong Y, Dibben C, Feng Z, Brewster DH, et al. A cancer geography paradox? Poorer cancer outcomes with longer travelling times to healthcare facilities despite prompter diagnosis and treatment: a data-linkage study. Br J Cancer. 2017 July 25;117(3):439–49.
- Ambroggi M, Biasini C, Del Giovane C, Fornari F, Cavanna L. Distance as a Barrier to Cancer Diagnosis and Treatment: Review of the Literature. Oncologist. 2015 Dec;20(12):1378–85.
- Maly RC, Umezawa Y, Leake B, Silliman RA. Mental health outcomes in older women with breast cancer: impact of perceived family support and adjustment. Psychooncology. 2005 July;14(7):535–45.
- Bevan JL, Pecchioni LL. Understanding the impact of family caregiver cancer literacy on patient health outcomes. Patient Educ Couns. 2008 June;71(3):356–64.
- Mickel J. The Importance of Multi-Dimensional Intersectionality in Algorithmic Fairness and AI Model Development. [Internet]. The University of Texas at Austin; 2023 [cited 2024 Dec 3]. Available from: https://hdl.handle.net/2152/122644
- Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med. 2013;4(2):627–35.
- Çorbacıoğlu ŞK, Aksel G. Receiver operating characteristic curve analysis in diagnostic accuracy studies: A guide to interpreting the area under the curve value. Turk J Emerg Med. 2023;23(4):195–8.
- Czakon J. F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose? [Internet]. neptune.ai. 2022 [cited 2025 July 30]. Available from: https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc
- Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns [Internet]. 2024 June 14 [cited 2025 July 30];5(6). Available from: https://www.cell.com/patterns/abstract/S2666-3899(24)00109-0
- Bertels J, Robben D, Vandermeulen D, Suetens P. Theoretical analysis and experimental validation of volume bias of soft Dice optimized segmentation maps in the context of inherent uncertainty. Medical Image Analysis. 2021 Jan 1;67:101833.
- Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging. 2015 Aug 12;15(1):29.
- Gorriz JM, Clemente RM, Segovia F, Ramirez J, Ortiz A, Suckling J. Is K-fold cross validation the best model selection method for Machine Learning? [Internet]. arXiv; 2024 [cited 2025 July 30]. Available from: http://arxiv.org/abs/2401.16407
- Jung Y, Hu J. A K-fold Averaging Cross-validation Procedure. J Nonparametr Stat. 2015;27(2):167–79.
- Carey S, Pang A, de Kamps M. Fairness in AI for healthcare. Future Healthcare Journal. 2024 Sept 1;11(3):100177.
- Gao J, Chou B, McCaw ZR, Thurston H, Varghese P, Hong C, et al. What is Fair? Defining Fairness in Machine Learning for Health [Internet]. arXiv; 2025 [cited 2025 July 30]. Available from: http://arxiv.org/abs/2406.09307
- Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. 2018 Dec 18;169(12):866–72.
- Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY, et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng. 2023 June;7(6):719–42.
- Fabris A, Esuli A, Moreo A, Sebastiani F. Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach. Journal of Artificial Intelligence Research. 2023 Apr 22;76:1117–80.
- Bou V. Achieving Demographic Parity Across Multiple Artificial Intelligence Applications: A New Approach for Real-Time Bias Mitigation [Internet]. Preprints; 2024 [cited 2025 July 30]. Available from: https://www.preprints.org/manuscript/202412.0468/v1
- Davoudi A, Chae S, Evans L, Sridharan S, Song J, Bowles KH, et al. Fairness gaps in Machine learning models for hospitalization and emergency department visit risk prediction in home healthcare patients with heart failure. International Journal of Medical Informatics. 2024 Nov 1;191:105534.
- Roberts-Licklider K, Trafalis T. Machine Learning Techniques with Fairness for Prediction of Completion of Drug and Alcohol Rehabilitation [Internet]. arXiv; 2024 [cited 2025 July 30]. Available from: http://arxiv.org/abs/2404.15418
- Feng Q, Du M, Zou N, Hu X. Fair Machine Learning in Healthcare: A Review [Internet]. arXiv; 2024 [cited 2025 July 30]. Available from: http://arxiv.org/abs/2206.14397
- D’Souza G, Zhang HH, D’Souza WD, Meyer RR, Gillison ML. Moderate predictive value of demographic and behavioral characteristics for a diagnosis of HPV16-positive and HPV16-negative head and neck cancer. Oral Oncol. 2010 Feb;46(2):100.
- Kim SY, Choi Y, Kim EK, Han BK, Yoon JH, Choi JS, et al. Deep learning-based computer-aided diagnosis in screening breast ultrasound to reduce false-positive diagnoses. Sci Rep. 2021 Jan 11;11(1):395.
- Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S. Certifying and Removing Disparate Impact. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [Internet]. New York, NY, USA: Association for Computing Machinery; 2015 [cited 2025 July 30]. p. 259–68. (KDD ’15). Available from: https://dl.acm.org/doi/10.1145/2783258.2783311
- Kamiran F, Calders T. Classifying without discriminating. In: 2009 2nd International Conference on Computer, Control and Communication [Internet]. 2009 [cited 2025 July 30]. p. 1–6. Available from: https://ieeexplore.ieee.org/document/4909197
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Int Res. 2002 June 1;16(1):321–57.
- Kamishima T, Akaho S, Asoh H, Sakuma J. Fairness-Aware Classifier with Prejudice Remover Regularizer. In: Flach PA, De Bie T, Cristianini N, editors. Machine Learning and Knowledge Discovery in Databases. Berlin, Heidelberg: Springer; 2012. p. 35–50.
- Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H. A Reductions Approach to Fair Classification [Internet]. arXiv; 2018 [cited 2025 July 30]. Available from: http://arxiv.org/abs/1803.02453
- Celis LE, Huang L, Keswani V, Vishnoi NK. Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees [Internet]. arXiv; 2020 [cited 2025 July 30]. Available from: http://arxiv.org/abs/1806.06055
- Yang J, Soltan AAS, Eyre DW, Yang Y, Clifton DA. An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digit Med. 2023 Mar 29;6:55.
- Fan Y, Penington A, Kilpatrick N, Hardiman R, Schneider P, Clement J, et al. Quantification of mandibular sexual dimorphism during adolescence. J Anat. 2019 May;234(5):709–17.
- Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger KQ. On Fairness and Calibration [Internet]. arXiv; 2017 [cited 2025 July 30]. Available from: http://arxiv.org/abs/1709.02012
- Franc V, Prusa D, Voracek V. Optimal Strategies for Reject Option Classifiers. Journal of Machine Learning Research. 2023;24(11):1–49.
- Ruopp MD, Perkins NJ, Whitcomb BW, Schisterman EF. Youden Index and Optimal Cut-Point Estimated from Observations Affected by a Lower Limit of Detection. Biometrical Journal. 2008;50(3):419–30.
- Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3(1):160018.
- Chmielinski KS, Newman S, Taylor M, Joseph J, Thomas K, Yurkofsky J, et al. The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence [Internet]. arXiv; 2022 [cited 2025 July 30]. Available from: http://arxiv.org/abs/2201.03954
- Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024 Apr 16;385:e078378.


