J Dent. 2025 Apr 05. pii: S0300-5712(25)00177-0. [Epub ahead of print] 105733
OBJECTIVES: To evaluate the performance of chatbots for discrete steps of a systematic review (SR) on artificial intelligence (AI) in pediatric dentistry.
METHODS: Two chatbots (ChatGPT-4 and Gemini) and two non-expert reviewers were compared against two experts in a SR on AI in pediatric dentistry. Five tasks were assessed: (1) formulating a PICO question, (2) developing search queries for eight databases, (3) screening studies, (4) extracting data, and (5) assessing the risk of bias (RoB). Chatbots and non-experts received identical prompts, with experts providing the reference standard. Performance was measured using accuracy, precision, sensitivity, specificity, and F1-score for the search and screening tasks; Cohen's kappa for RoB assessment; and a modified Global Quality Score (1-5) for PICO question formulation and data extraction quality. Statistical comparisons were performed using Kruskal-Wallis and Dunn's post-hoc tests.
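For reference, the screening metrics reported below follow their standard definitions in terms of true/false positives and negatives (TP, FP, TN, FN); the notation here is a conventional restatement, not taken from the paper itself:

\[
\text{Precision} = \frac{TP}{TP+FP}, \quad
\text{Sensitivity} = \frac{TP}{TP+FN}, \quad
\text{Specificity} = \frac{TN}{TN+FP}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision}+\text{Sensitivity}}
\]

Because F1 is the harmonic mean of precision and sensitivity, a precision below 25% caps the F1-score below 40% even when sensitivity reaches 90%, which is consistent with the chatbot screening results reported next.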
RESULTS: In PICO formulation, ChatGPT slightly outperformed Gemini, while non-experts scored the lowest. Experts identified 1,261 records, compared to 569 (ChatGPT), 285 (Gemini), and 722 (non-experts). In screening, chatbots achieved 90% sensitivity, >60% specificity, <25% precision, and F1-scores <40%, versus 84% sensitivity, 91% specificity, and a 39% F1-score for non-experts. For data extraction, mean±standard deviation scores (maximum 45) were 31.6±12.3 for ChatGPT, 29.2±12.3 for Gemini, and 30.4±11.3 for non-experts. For RoB, agreement with experts was 49.4% for ChatGPT, 51.2% for Gemini, and 48.8% for non-experts (p>0.05).
CONCLUSION: Chatbots could enhance SR efficiency, particularly for the study screening and data extraction steps. Human oversight remains critical for ensuring accuracy and completeness.
Keywords: ChatGPT; chatbot; large language models; artificial intelligence; pediatric dentistry