Predicting Pathologic Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Using Machine Learning Models.

Florence Nightingale Hospital, Istanbul

Status

Completed

Conditions

Breast Cancer

Study type

Observational

Funder types

Other

Identifiers

NCT07426653

FNH2026-1

Details and patient eligibility

About

This retrospective observational study aims to develop and validate a clinicopathology-based machine learning model to predict pathological complete response (pCR) following neoadjuvant chemotherapy in patients with breast cancer. Clinical and pathological data collected between 2010 and 2025 were used to train multiple machine learning algorithms, which were evaluated with cross-validation and independent holdout testing. The primary outcome was pathological complete response after neoadjuvant chemotherapy. Model performance was assessed using discrimination and classification metrics, including ROC-AUC, precision-recall AUC, F1-score, and Matthews correlation coefficient. The resulting model is intended to support clinical decision-making by providing individualized probability estimates of treatment response.

Full description

This retrospective observational study was conducted using a breast cancer registry containing clinical and pathological data from patients who received neoadjuvant chemotherapy between January 2010 and December 2025. The objective of the study was to develop and validate a machine learning-based predictive model for pathological complete response (pCR) using routinely available clinicopathological variables.

An initial dataset consisting of 298 patients and 144 recorded variables was curated by breast oncology experts to identify clinically relevant predictors. A total of 20 established clinicopathological variables were selected, representing demographic characteristics, tumor staging, biomarker profiles, and treatment-related factors. Feature engineering techniques, including ordinal encoding, one-hot encoding, and binary mapping, were applied to prepare the dataset for model development. Missing values were handled using median imputation within a cross-validation pipeline to prevent data leakage.
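The leakage concern behind fold-wise imputation can be illustrated with a minimal sketch. It assumes missing values are represented as None and uses a simple interleaved fold split rather than the stratified split used in the study; the function names and toy rows are hypothetical and not taken from the study code.

```python
from statistics import median

def fit_medians(rows):
    """Learn one median per column from the given rows, ignoring None."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column is entirely missing in this fold.
        medians.append(median(observed) if observed else 0.0)
    return medians

def impute(rows, medians):
    """Replace each None with the median learned for its column."""
    return [[medians[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def cross_validated_imputation(rows, n_folds=5):
    """Yield (train, test) pairs imputed with medians fit on train only."""
    for fold in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        test = [r for i, r in enumerate(rows) if i % n_folds == fold]
        medians = fit_medians(train)  # the test fold never informs the medians
        yield impute(train, medians), impute(test, medians)

# Toy rows (hypothetical values): [numeric feature 1, numeric feature 2]
rows = [[50, 2.0], [None, 3.0], [60, None], [70, 4.0], [40, 1.0], [55, None]]
folds = list(cross_validated_imputation(rows, n_folds=3))
```

Computing the medians once on the full dataset before splitting would let information from held-out patients influence the training data, which is the leakage the pipeline is designed to avoid.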

Feature selection was performed using a hybrid importance framework integrating mutual information analysis, SHAP-based attribution from gradient boosting models, and L1-regularized logistic regression coefficients. Sequential feature subset evaluation identified an optimal subset of 10 predictors for model development.
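The record does not state how the three importance signals were combined. One plausible scheme, shown here purely as an assumption, is to min-max normalize each set of scores, average them per feature, and rank the result. The feature names and score values below are illustrative placeholders, not study data.

```python
def min_max(scores):
    """Rescale a {feature: score} table to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def hybrid_ranking(*score_tables):
    """Rank features by the mean of their normalized absolute scores.

    Absolute values are taken so that signed logistic regression
    coefficients count by magnitude rather than direction.
    """
    normalized = [min_max({k: abs(v) for k, v in t.items()})
                  for t in score_tables]
    features = normalized[0].keys()
    combined = {f: sum(t[f] for t in normalized) / len(normalized)
                for f in features}
    return sorted(combined, key=combined.get, reverse=True)

# Illustrative placeholder scores for three hypothetical features.
mutual_info = {"feature_a": 0.10, "feature_b": 0.02, "feature_c": 0.06}
mean_abs_shap = {"feature_a": 0.30, "feature_b": 0.05, "feature_c": 0.20}
l1_coefficients = {"feature_a": -0.8, "feature_b": 0.0, "feature_c": 0.5}

ranking = hybrid_ranking(mutual_info, mean_abs_shap, l1_coefficients)
top_k = ranking[:2]  # a sequential subset search would vary this size
```

A sequential subset evaluation would then fit the model on the top 1, 2, 3, ... ranked features and keep the subset size with the best cross-validated score.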

Multiple machine learning algorithms, including logistic regression, random forest, gradient boosting models, support vector machines, k-nearest neighbors, and ensemble learning approaches, were trained and evaluated using 5-fold stratified cross-validation. Final performance was assessed on independent validation and holdout datasets using ROC-AUC, precision-recall AUC, F1-score, and Matthews correlation coefficient.
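Two of the classification metrics named above, F1-score and Matthews correlation coefficient, follow directly from the confusion matrix. The sketch below uses their standard definitions; the labels and predictions are toy values chosen for illustration, with 1 standing for pCR.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: 3 responders and 5 non-responders.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```

Unlike F1, MCC uses all four cells of the confusion matrix, which makes it informative when the pCR and non-pCR classes are imbalanced.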

The primary outcome was pathological complete response following neoadjuvant chemotherapy. Threshold optimization was performed to identify a clinically meaningful probability cutoff that balanced sensitivity and specificity for predicting treatment response. Model performance was compared against a prevalence-adjusted stochastic baseline using Monte Carlo simulation to confirm predictive validity beyond chance.
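The record does not specify the threshold criterion or the exact form of the stochastic baseline. The sketch below assumes Youden's J (sensitivity + specificity - 1) as the balancing criterion and a baseline that predicts pCR at random with probability equal to the observed prevalence. All names and toy data are hypothetical.

```python
import random

def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting 1 for prob >= threshold."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def best_threshold(y_true, y_prob):
    """Pick the observed probability that maximizes Youden's J."""
    def youden(th):
        sens, spec = sens_spec(y_true, y_prob, th)
        return sens + spec - 1
    return max(sorted(set(y_prob)), key=youden)

def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def baseline_f1_distribution(y_true, n_sim=2000, seed=0):
    """F1 scores of random guesses drawn at the observed prevalence."""
    rng = random.Random(seed)
    prevalence = sum(y_true) / len(y_true)
    return [f1(y_true, [1 if rng.random() < prevalence else 0 for _ in y_true])
            for _ in range(n_sim)]

# Toy data: 3 responders among 10 patients, with model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

threshold = best_threshold(y_true, y_prob)
model_f1 = f1(y_true, [1 if p >= threshold else 0 for p in y_prob])
baseline = baseline_f1_distribution(y_true)
# Empirical p-value: share of random baselines that match or beat the model.
p_value = sum(1 for s in baseline if s >= model_f1) / len(baseline)
```

A small empirical p-value indicates that the model's score is unlikely to arise from prevalence-matched random guessing. In practice the threshold should be chosen on training or validation data and then applied unchanged to the holdout set.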

This study evaluates the feasibility of applying clinicopathology-based machine learning models to predict treatment response in breast cancer and to support individualized clinical decision-making in the neoadjuvant setting.

Enrollment

298 patients

Sex

Female

Ages

18 to 90 years old

Volunteers

No Healthy Volunteers

Inclusion criteria