Digit Health. 2026 Jan-Dec;12:20552076261435836
Objective: This study aimed to systematically evaluate five leading large language models (LLMs), namely ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity, in providing health information related to maintenance hemodialysis (MHD). The primary objectives were to determine (1) the reliability of MHD-related information generated by these LLMs and (2) whether its readability meets the recommended standards for patient education materials.
Methods: A cross-sectional comparative design was adopted, with all responses generated in October 2025. Seventeen frequently asked MHD-related questions were identified using Google Trends and two online patient-caregiver forums. Each question was submitted to the five LLMs (ChatGPT-4o, Copilot, Gemini 2.5 Pro, Perplexity Pro, and DeepSeek-V3.2-Exp), and their responses were assessed for reliability using the DISCERN instrument, the Ensuring Quality Information for Patients (EQIP) tool, the JAMA benchmark criteria, and the Global Quality Scale (GQS), and for readability using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease Score (FRES), Simple Measure of Gobbledygook (SMOG), Coleman-Liau Index (CLI), Automated Readability Index (ARI), and Linsear Write Formula (LWF). A heatmap analysis was also conducted to evaluate intra-model response variability.
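To make the readability criteria concrete, the minimal sketch below (not taken from the study, whose authors presumably used standard readability tooling) shows how two of the reported indices, FRES and FKGL, are computed from raw text with their published formulas. The syllable counter is a naive vowel-group heuristic included only for illustration.

```python
import re

def count_syllables(word: str) -> int:
    """Naive vowel-group syllable heuristic (real tools use dictionaries)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """Compute FRES and FKGL from raw text using the standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    w, s = len(words), max(len(sentences), 1)
    asl = w / s          # average sentence length (words per sentence)
    asw = syllables / w  # average syllables per word
    return {
        # Flesch Reading Ease Score: higher = easier; 80-90 is the "easy" band
        "FRES": 206.835 - 1.015 * asl - 84.6 * asw,
        # Flesch-Kincaid Grade Level: about 6 or lower for patient materials
        "FKGL": 0.39 * asl + 11.8 * asw - 15.59,
    }

print(readability("Hemodialysis removes waste products from the blood when the kidneys fail."))
```

Under the standards cited in this study, patient-facing text would be expected to reach an FRES of roughly 80-90 and an FKGL at or below the sixth-grade level.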
Results: Inter-rater reliability between the two experts was high (intraclass correlation coefficients [ICCs] for average measures ranged from 0.851 to 0.879, all P < .001). Significant differences were observed among the five LLMs in both reliability and readability. Overall reliability scores were relatively low; however, Perplexity consistently achieved higher DISCERN, EQIP, and JAMA scores than Gemini, ChatGPT, Copilot, and DeepSeek (P < .001). In terms of readability, all models produced text above the recommended sixth-grade reading level: ARI, GFI, FKGL, CLI, and SMOG scores were notably higher than recommended, and FRES scores fell substantially below the 80-90 range. Heatmap analysis further showed that although Perplexity and ChatGPT maintained relatively stable mean scores, they exhibited higher variability across individual queries.
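As an illustration of the average-measures agreement statistic reported above, the following sketch computes ICCs with the pingouin package on hypothetical two-rater DISCERN scores (the data and column names are assumptions, not the study's ratings); the ICC2k/ICC3k rows correspond to average-measures ICC.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: one row per (question, rater) pair,
# each holding that rater's DISCERN score for an LLM response.
ratings = pd.DataFrame({
    "question": [f"Q{i}" for i in range(1, 6)] * 2,
    "rater":    ["Rater1"] * 5 + ["Rater2"] * 5,
    "discern":  [52, 48, 55, 50, 47, 54, 47, 57, 49, 46],
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="discern")
# Average-measures ICC (ICC2k / ICC3k) is the quantity the abstract
# reports as ranging from 0.851 to 0.879 across instruments.
print(icc[["Type", "ICC", "CI95%"]])
```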
Conclusions: Current LLMs exhibit significant variability in delivering MHD information. While all five evaluated models demonstrated limitations in information quality, transparency, and readability, Perplexity performed relatively better overall. However, persistent deficiencies in source attribution, language accessibility, and response consistency limit their immediate clinical and educational utility. Future LLM development should prioritize readability optimization and context-aware customization to better support patient education.
Keywords: Large language models; health communication; maintenance hemodialysis; readability; reliability