Front Public Health. 2026;14:1760871
Background: Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are being widely used for health queries, yet their accuracy, reliability and readability in the context of GDM remain unclear.
Methods: We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) on 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors; clinical manifestations and diagnosis; maternal and neonatal outcomes; and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM patient-education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using the DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using the Automated Readability Index (ARI), Coleman-Liau index (CL), Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), Gunning Fog Index (GFI) and Simple Measure of Gobbledygook (SMOG).
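As a concrete illustration of the readability step, the sketch below scores a single chatbot response with the six indices named above. The use of the open-source textstat package and the sample sentence are our own assumptions for illustration; they are not part of the study protocol.

```python
# Minimal sketch: computing the six readability indices named in the Methods
# (ARI, CL, FKGL, FRES, GFI, SMOG) for one chatbot response.
# Assumes the open-source `textstat` package; the example text is illustrative only.
import textstat

response = (
    "Gestational diabetes mellitus is high blood sugar that is first "
    "recognized during pregnancy. It is usually managed with diet, "
    "exercise and, when needed, insulin."
)

scores = {
    "ARI":  textstat.automated_readability_index(response),
    "CL":   textstat.coleman_liau_index(response),
    "FKGL": textstat.flesch_kincaid_grade(response),
    "FRES": textstat.flesch_reading_ease(response),
    "GFI":  textstat.gunning_fog(response),
    "SMOG": textstat.smog_index(response),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Lower grade-level scores (ARI, CL, FKGL, GFI, SMOG) and a higher FRES indicate easier text; the sixth-grade target referenced in the Results corresponds approximately to FKGL ≤ 6.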
Results: Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0-0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets.
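The abstract reports a significant overall difference in MCQ accuracy but does not name the statistical test. One plausible reconstruction, sketched below, treats each of the 600 responses per model (200 items × 3 repeats) as correct or incorrect and compares the six models with a chi-square test of independence; both the choice of test and the correct-response counts are placeholders for illustration, not the study's actual analysis or data.

```python
# Hypothetical sketch of the accuracy comparison; the test and the counts
# below are assumptions for illustration, not the study's reported analysis.
from scipy.stats import chi2_contingency

N_RESPONSES = 200 * 3  # 200 MCQs, each posed three times per model

# Placeholder correct-response counts per model (invented for illustration).
correct = {
    "ChatGPT-5": 553,
    "ChatGPT-4o": 500,
    "DeepSeek-V3.2": 545,
    "DeepSeek-R1": 505,
    "Gemini 2.5 Pro": 540,
    "Claude Sonnet 4.5": 495,
}

# 6 x 2 contingency table: rows = models, columns = (correct, incorrect).
table = [[c, N_RESPONSES - c] for c in correct.values()]
chi2, p_value, dof, _ = chi2_contingency(table)

for model, c in correct.items():
    print(f"{model}: accuracy = {c / N_RESPONSES:.2%}")
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4g}")
```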
Conclusion: Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.
Keywords: artificial intelligence; gestational diabetes mellitus; large language models; patient education; readability