Front Med (Lausanne). 2026;13:1754916.
Background: Hypertension is a critical comorbidity in patients with type 2 diabetes mellitus and significantly increases cardiovascular risk. Several predictive models have been developed using conventional logistic regression or basic machine learning algorithms, but these approaches have important limitations. Many existing models lack external validation, which limits their generalizability, or operate as black boxes that offer no interpretable clinical insight. Furthermore, most prior studies have focused exclusively on biological indicators and overlooked the potential impact of socioeconomic determinants and lifestyle factors on disease progression.
Objective: To address these gaps, this study aimed to develop a high-performance random forest model for predicting hypertension risk in diabetic patients by integrating multidimensional data, including clinical metrics, lifestyle habits, and socioeconomic status. The study further sought to validate the model's robustness in an independent external cohort and to assess its clinical utility through SHAP (SHapley Additive exPlanations) analysis, providing transparent interpretations of risk factors to guide personalized medical decision-making.
Methods: A multicenter retrospective cohort study was conducted using electronic medical records from two tertiary hospitals. Adults with type 2 diabetes and no prior hypertension were eligible. A total of 900 patients were included: 420 in the training cohort, 180 in the testing cohort, and 300 in the external validation cohort. Feature selection combined the Boruta and LASSO methods and yielded seven predictors. Seven algorithms were compared, and model performance was assessed through cross-validation, independent testing, and external validation. The random forest model was explained using SHAP analysis.
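The evaluation design described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn, uses synthetic data in place of the patient records, and picks hyperparameters (300 trees, 5 folds) that the abstract does not report. The Boruta/LASSO selection and SHAP steps are omitted, and the "external" cohort here is just a held-out slice of the same synthetic data rather than a second hospital.

```python
# Illustrative sketch only: synthetic data stands in for the study's records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# 900 synthetic "patients" with seven predictors, mirroring the cohort size
# and predictor count in the abstract (the values themselves are made up).
X, y = make_classification(n_samples=900, n_features=7, n_informative=5,
                           n_redundant=0, random_state=0)

# Hold out 300 rows as a stand-in for the external validation cohort, then
# split the remaining 600 rows into 420 training and 180 testing rows.
X_dev, X_ext, y_dev, y_ext = train_test_split(
    X, y, test_size=300, stratify=y, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=180, stratify=y_dev, random_state=0)

# Hyperparameters are assumptions; the abstract does not state them.
model = RandomForestClassifier(n_estimators=300, random_state=0)

# Step 1: cross-validated AUC on the training cohort only.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Step 2: fit once on the full training cohort, then score the two
# cohorts the model has never seen.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
ext_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"CV AUC: {cv_auc.mean():.2f}  test AUC: {test_auc:.2f}  "
      f"external AUC: {ext_auc:.2f}")
```

Keeping the external cohort out of every fitting and tuning step is what allows its AUC to be read as an estimate of generalizability.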
Results: Among 900 participants, the random forest model achieved the best discrimination, with AUCs of 0.89 in internal testing and 0.83 in external validation. Calibration and decision curve analyses confirmed stability and clinical utility. Key predictors included alcohol consumption, triglycerides, diabetes duration, health insurance type, fasting blood glucose, estimated glomerular filtration rate, and exercise frequency.
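Decision curve analysis is commonly summarized by the net benefit at a chosen risk threshold. The sketch below uses the standard definition, net benefit = TP/n - (FP/n) x pt/(1 - pt); the abstract does not state which formulation the authors used, and the function name and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(y_true)
    # True positives: flagged patients who do develop the outcome.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    # False positives: flagged patients who do not.
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy example (made-up labels and predicted risks, not study data).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1, 0.4, 0.3, 0.05]

nb_model = net_benefit(y_true, y_prob, 0.5)
# "Treat all" reference strategy: every patient is flagged.
nb_treat_all = net_benefit(y_true, [1.0] * len(y_true), 0.5)

print(nb_model, nb_treat_all)  # 0.125 -0.25
```

A model is usually considered clinically useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.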
Conclusion: The validated random forest model effectively predicts hypertension in type 2 diabetes patients, integrating metabolic, behavioral, and socioeconomic factors. Its interpretability and robust performance support its potential use for early identification and personalized prevention of hypertension in clinical practice.
Keywords: hypertension risk; machine learning; predictive modeling; random forest; type 2 diabetes mellitus