J Oral Maxillofac Surg. 2026 Jan 09. pii: S0278-2391(26)00032-7. [Epub ahead of print]
BACKGROUND: Large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI, San Francisco, CA) and Claude (Anthropic, San Francisco, CA) are increasingly used by patients seeking information about surgical procedures, including external sinus lifting. However, the accuracy, quality, and readability of these artificial intelligence (AI)-generated explanations remain uncertain.
PURPOSE: The purpose of this study was to measure and compare the reliability, quality, usefulness, and readability of responses from 2 AI language models to frequently asked patient questions about external sinus lifting.
STUDY DESIGN, SETTING, AND SAMPLE: This cross-sectional study assessed responses generated by 2 LLMs, referred to as the decoder-only-based LLM (DO-LLM) and the transformer-based LLM (TB-LLM), to standardized patient questions.
PREDICTOR VARIABLE: The predictor variable was AI model type (DO-LLM vs TB-LLM).
MAIN OUTCOME VARIABLES: The main outcome variables were reliability, quality, usefulness, and readability, assessed using the modified DISCERN instrument, the Global Quality Score, a 4-point usefulness scale, and 2 readability indices (Flesch Reading Ease and Flesch-Kincaid Grade Level), respectively. Seventy-two standardized questions across 10 clinical domains were submitted to both models. Responses were independently evaluated by 1 oral and maxillofacial surgeon and 2 periodontists, followed by consensus scoring.
COVARIATES: Not applicable.
ANALYSES: Descriptive statistics summarized the outcomes. Depending on the normality of the data, group comparisons used the independent-samples t-test or the Mann-Whitney U test. Associations between categorical variables were analyzed using the Pearson χ² test or the Fisher exact test.
RESULTS: For the modified DISCERN score, DO-LLM scored 21.88 (3.09), 22.14 (2.04), and 22.63 (2.56) in preoperative preparation, graft materials, and risks/complications, whereas TB-LLM scored 13.88 (4.52), 17.29 (4.27), and 19.00 (1.77), respectively (P < .05). For the Global Quality Score in lifestyle and behavioral recommendations, TB-LLM scored 4.00 (0) compared with 3.29 (0.49) for DO-LLM (P < .05). Moderate-quality responses were more common with DO-LLM (56.9%), whereas TB-LLM produced a higher proportion of good-quality responses (29.2%) (P < .05).
CONCLUSION AND RELEVANCE: Both AI models demonstrated potential value for patient education on external sinus lifting, though their strengths differed by content domain. DO-LLM provided stronger procedural and risk-related explanations, whereas TB-LLM offered more comprehensive lifestyle-related guidance. Continued refinement of dental-specific AI tools and integration of patient-centered considerations remain essential.