J Voice. 2025 Oct 13. pii: S0892-1997(25)00400-X. [Epub ahead of print]
OBJECTIVE: To develop and validate a multimodal, machine learning-based framework that integrates acoustic voice features with baseline clinical parameters for noninvasive and accurate screening of type 2 diabetes mellitus (T2DM).
MATERIALS AND METHODS: We analyzed data from 3129 individuals, including 1158 with T2DM and 1971 without. Voice recordings were collected under standardized conditions and processed with the openSMILE toolkit to extract 88 acoustic features, encompassing prosodic, spectral, cepstral, and voice quality parameters. In parallel, 30 clinical features were obtained from demographic, anthropometric, biochemical, lifestyle, and medical history variables. After preprocessing and imputation, feature selection was conducted using least absolute shrinkage and selection operator (LASSO) regression, analysis of variance (ANOVA), mutual information, and recursive feature elimination. Dimensionality reduction with principal component analysis was also evaluated. Logistic regression, random forest, XGBoost, TabNet, and TabTransformer models were trained with cross-validation and tuned through grid and randomized search. Performance was assessed on an independent test set using accuracy, recall, and area under the curve (AUC). Model interpretability was addressed via SHapley Additive exPlanations (SHAP) analysis, t-distributed stochastic neighbor embedding (t-SNE) visualization, and radar plots. Clinical utility was assessed with nomogram construction, calibration, and decision curve analysis (DCA).
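For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how accuracy, recall, and AUC could be computed on a held-out test set. It assumes that AUC refers to the area under the receiver operating characteristic curve and that class labels are assigned at a 0.5 probability threshold; neither detail is stated in the abstract. The function names and the toy labels and scores are hypothetical and are not study data.

```python
# Illustrative only: toy labels and scores, not data from the study.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a
    random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test-set labels (1 = T2DM) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

# Assumed decision threshold of 0.5 for converting scores to labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 0.75
print(recall(y_true, y_pred))    # 0.75
print(roc_auc(y_true, scores))   # 0.9375
```

Accuracy and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three are typically reported together.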
RESULTS: Models using clinical features alone achieved moderate performance (AUC ≈ 69%). Acoustic-only models performed better, with the LASSO + XGBoost combination reaching an AUC of 80.8%. The fused feature set markedly outperformed both unimodal approaches, with the LASSO + XGBoost model achieving 94.1% accuracy, 93.6% recall, and an AUC of 95.2%. SHAP analysis identified HbA1c, fasting glucose, HOMA-IR, and acoustic markers such as jitter and shimmer as top predictors. Calibration plots showed excellent agreement between predicted and observed probabilities, while DCA demonstrated superior net clinical benefit.
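The net benefit quantity that a decision curve plots can be sketched as follows. This uses the standard definition, net benefit = TP/n - (FP/n) * pt/(1 - pt), where pt is the threshold probability at which a clinician would act, and compares a model against the "screen everyone" strategy. The function names and the toy labels and probabilities are hypothetical; the abstract does not report the thresholds or comparators used in the study.

```python
# Illustrative only: toy labels and probabilities, not data from the study.

def net_benefit(y_true, probs, pt):
    """Net benefit of acting on everyone whose predicted risk is >= pt."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (pt / (1 - pt))

def net_benefit_treat_all(y_true, pt):
    """Net benefit of the reference strategy that flags every individual."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * (pt / (1 - pt))

# Hypothetical labels (1 = T2DM) and model-predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, probs, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"pt={pt}: model={model_nb:.5f}, treat-all={all_nb:.5f}")
# pt=0.2: model=0.40625, treat-all=0.37500
# pt=0.5: model=0.25000, treat-all=0.00000
```

A model shows clinical benefit at a given threshold when its net benefit exceeds both the treat-all value and zero (the "treat none" strategy); a decision curve repeats this comparison over a range of thresholds.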
CONCLUSIONS: Our multimodal framework provides an accurate, interpretable, and clinically actionable approach for noninvasive T2DM screening. Future studies should validate these findings in diverse populations and explore integration into real-world digital health platforms.
Keywords: Acoustic analysis; Decision curve analysis; Machine learning; Nomogram; Type 2 diabetes mellitus; Voice biomarkers