J Med Syst. 2026 Feb 10;50(1):20.
Large language models (LLMs) are increasingly consulted for medical advice, yet the readability and quality of their responses remain suboptimal. Current research focuses on evaluating LLM outputs, with little investigation of practical optimization strategies for clinical use.

On August 9, 2025, we identified the top 25 search keywords for five common cancers via Google Trends and adapted them into six prompt types. Each prompt was posed to ChatGPT-4o and ChatGPT-5 between August 10 and August 12, 2025 under two query conditions: isolated (a single question per page) and aggregated (all questions for one cancer type on the same page). Readability was assessed with four indices: Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FKRE), Gunning Fog Index (GFI), and the Simple Measure of Gobbledygook (SMOG). Quality was rated on a 5-point Likert scale across accuracy, relevance, comprehensiveness, empathy, and falsehood.

ChatGPT-5 generated responses with significantly fewer words (292.85 ± 14.52), sentences (18.77 ± 1.07), syllables (515.01 ± 27.89), and hard words (58.35 ± 4.05) than ChatGPT-4o (316.81 ± 12.96, p = 0.003; 19.79 ± 1.01, p = 0.039; 551.93 ± 24.55, p = 0.006; 62.33 ± 3.60, p = 0.005), while also achieving higher scores for accuracy (W = 2.116, p = 0.034), relevance (W = 2.454, p = 0.014), comprehensiveness (W = 2.574, p = 0.010), and empathy (W = 2.174, p = 0.030). The 6th-grade prompt markedly improved readability over the other strategies (ChatGPT-5: FKRE 64.92 ± 8.56, GFI 8.10 ± 1.13, FKGL 8.74 ± 1.73, SMOG 6.97 ± 1.26; ChatGPT-4o: FKRE 65.44 ± 7.43, GFI 8.04 ± 1.48, FKGL 8.73 ± 1.80, SMOG 6.86 ± 1.53). Aggregating queries on a single page yielded higher accuracy, relevance, and comprehensiveness scores than isolated questioning (ChatGPT-4o: W = 4.451, p < 0.001; W = 4.356, p < 0.001; W = 1.965, p = 0.049; ChatGPT-5: W = 3.234, p < 0.001; W = 2.697, p = 0.007; W = 3.885, p < 0.001).
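The four readability indices above are computed from surface counts of the text. A minimal sketch of the standard published formulas (not necessarily the scoring tool used in this study) is shown below, assuming word, sentence, syllable, and complex-word (3+ syllable) counts have already been extracted from a response:

```python
import math

def fkgl(words, sentences, syllables):
    # Flesch-Kincaid Grade Level: higher = harder (U.S. school grade)
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fkre(words, sentences, syllables):
    # Flesch Reading Ease: higher = easier (60-70 approximates plain English)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def gfi(words, sentences, complex_words):
    # Gunning Fog Index: complex_words = words with three or more syllables
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog(sentences, polysyllables):
    # SMOG grade: polysyllable count normalized to a 30-sentence sample
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

For example, a 100-word, 5-sentence passage with 150 syllables scores FKGL 9.91 and FKRE 59.635, i.e., roughly a 10th-grade reading level.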
ChatGPT-5 produces more concise and qualitatively superior responses than ChatGPT-4o. The patient prompt generated responses with high readability and strong empathy and is therefore recommended for patient use. In addition, aggregating related questions on a single page is advised to obtain higher-quality answers.
Keywords: Cancer; ChatGPT; LLM; Prompt Engineering; Readability