bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-04-13
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Dent. 2025 Apr 05. pii: S0300-5712(25)00177-0. [Epub ahead of print] 105733
       OBJECTIVES: The performance of chatbots for discrete steps of a systematic review (SR) on artificial intelligence (AI) in pediatric dentistry was evaluated.
    METHODS: Two chatbots (ChatGPT-4/Gemini) and two non-expert reviewers were compared against two experts in a SR on AI in pediatric dentistry. Five tasks were assessed: (1) formulating a PICO question, (2) developing search queries for eight databases, (3) screening studies, (4) extracting data, and (5) assessing the risk of bias (RoB). Chatbots and non-experts received identical prompts, with the experts providing the reference standard. Performance was measured using accuracy, precision, sensitivity, specificity, and F1-score for the search and screening tasks, Cohen's kappa for the RoB assessment, and a modified Global Quality Score (1-5) for PICO question formulation and data extraction quality. Statistical comparisons were performed using Kruskal-Wallis and Dunn's post-hoc tests (a computational sketch of these metrics follows this entry).
    RESULTS: In PICO formulation, ChatGPT slightly outperformed Gemini, while non-experts scored the lowest. Experts identified 1,261 records, compared with 569 (ChatGPT), 285 (Gemini), and 722 (non-experts). In screening, chatbots showed 90% sensitivity, >60% specificity, <25% precision, and F1-scores <40%, versus 84% sensitivity, 91% specificity, and a 39% F1-score for non-experts. For data extraction, mean±standard deviation scores (maximum 45) were 31.6±12.3 for ChatGPT, 29.2±12.3 for Gemini, and 30.4±11.3 for non-experts. For RoB, agreement with the experts was 49.4% for ChatGPT, 51.2% for Gemini, and 48.8% for non-experts (p>0.05).
    CONCLUSION: Chatbots could enhance SR efficiency, particularly for the study screening and data extraction steps. Human oversight remains critical for ensuring accuracy and completeness.
    Keywords:  ChatGPT; Chatbot; Large language models; artificial intelligence; pediatric dentistry
    DOI:  https://doi.org/10.1016/j.jdent.2025.105733
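    A minimal sketch of how screening performance and RoB agreement of the kind reported above can be computed, using scikit-learn on invented labels (not the study's data or code):
      # Screening performance of a chatbot against the expert reference standard,
      # plus Cohen's kappa for RoB agreement. Labels are invented for illustration.
      from sklearn.metrics import confusion_matrix, cohen_kappa_score, f1_score

      expert  = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = include, 0 = exclude
      chatbot = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]

      tn, fp, fn, tp = confusion_matrix(expert, chatbot).ravel()
      sensitivity = tp / (tp + fn)
      specificity = tn / (tn + fp)
      precision   = tp / (tp + fp)
      f1 = f1_score(expert, chatbot)

      # RoB judgements on an ordinal scale (e.g. low/unclear/high -> 0/1/2)
      expert_rob  = [0, 2, 1, 0, 1, 2]
      chatbot_rob = [0, 2, 2, 0, 1, 1]
      kappa = cohen_kappa_score(expert_rob, chatbot_rob)

      # Group comparisons (Kruskal-Wallis, Dunn's post hoc) can be run with
      # scipy.stats.kruskal and the scikit-posthocs package.
      print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
            f"prec={precision:.2f} F1={f1:.2f} kappa={kappa:.2f}")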
  2. BMJ Evid Based Med. 2025 Apr 08. pii: bmjebm-2024-113066. [Epub ahead of print]
       OBJECTIVE: To assess custom GPT-4 performance in extracting and evaluating data from medical literature to assist in the systematic review (SR) process.
    DESIGN: A proof-of-concept comparative study was conducted to assess the accuracy and precision of custom GPT-4 models against human-performed reviews of randomised controlled trials (RCTs).
    SETTING: Four custom GPT-4 models were developed, each specialising in one of the following areas: (1) extraction of study characteristics, (2) extraction of outcomes, (3) extraction of bias assessment domains and (4) evaluation of risk of bias using results from the third GPT-4 model. Model outputs were compared against data from four SRs conducted by human authors. The evaluation focused on accuracy in data extraction, precision in replicating outcomes and agreement levels in risk of bias assessments.
    PARTICIPANTS: From the four SRs chosen, 43 studies were retrieved for the data extraction evaluation. Additionally, 17 RCTs were selected for comparison of risk of bias assessments, for which both the human comparator SRs and an analogous SR provided assessments.
    INTERVENTION: Custom GPT-4 models were deployed to extract data and evaluate risk of bias from selected studies, and their outputs were compared to those generated by human reviewers.
    MAIN OUTCOME MEASURES: Concordance rates between GPT-4 outputs and human-performed SRs in data extraction, effect size comparability and inter/intra-rater agreement in risk of bias assessments.
    RESULTS: When comparing the automatically extracted data to the first table of study characteristics in the published review, GPT-4 showed 88.6% concordance with the original review, with <5% discrepancies due to inaccuracies or omissions; it exceeded human accuracy in 2.5% of instances. Study outcomes were extracted, and pooling of results showed effect sizes comparable to the comparator SRs. Risk of bias assessment using GPT-4 showed fair to moderate but significant intra-rater agreement (ICC=0.518, p<0.001) and inter-rater agreement with the human comparator SR (weighted kappa=0.237) and the analogous SR (weighted kappa=0.296). In contrast, agreement between the two human-performed SRs was poor (weighted kappa=0.094). (A sketch of a hypothetical extraction call and of these agreement statistics follows this entry.)
    CONCLUSION: Custom GPT-4 models perform well in extracting precise data from medical literature, with potential for use in bias assessment. While the evaluated tasks are simpler than the broader range of SR methodologies, they provide an important initial assessment of GPT-4's capabilities.
    Keywords:  Health Services Research; Systematic Reviews as Topic
    DOI:  https://doi.org/10.1136/bmjebm-2024-113066
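    A loose sketch of both halves of such a pipeline: a hypothetical GPT-4 extraction call (prompt, field list, and model settings are assumptions, not the authors' configuration) followed by the agreement statistics (weighted kappa, ICC) used to evaluate it, on invented ratings:
      # Part 1: hypothetical study-characteristics extractor in the spirit of the
      # custom GPT-4 models described above; prompt and field list are assumptions.
      import json
      import pandas as pd
      import pingouin as pg
      from openai import OpenAI
      from sklearn.metrics import cohen_kappa_score

      PROMPT = (
          "From the randomised controlled trial report below, extract: first "
          "author, year, country, sample size, population, intervention, "
          "comparator and primary outcome. Reply with one JSON object using "
          "exactly those keys."
      )

      def extract_characteristics(article_text: str) -> dict:
          client = OpenAI()  # needs OPENAI_API_KEY in the environment
          response = client.chat.completions.create(
              model="gpt-4",  # model and temperature choices are illustrative
              temperature=0,
              messages=[
                  {"role": "system", "content": "You extract data for systematic reviews."},
                  {"role": "user", "content": f"{PROMPT}\n\n{article_text}"},
              ],
          )
          # A human should still verify the parsed output against the source report
          return json.loads(response.choices[0].message.content)

      # Part 2: agreement statistics on risk-of-bias ratings
      # (low / some concerns / high -> 0 / 1 / 2); ratings are invented.
      gpt4  = [0, 1, 2, 1, 0, 2, 1, 1]
      human = [0, 1, 1, 1, 0, 2, 2, 1]
      wkappa = cohen_kappa_score(gpt4, human, weights="linear")  # weighted kappa

      # Intra-rater reliability across two GPT-4 runs on the same trials (ICC)
      runs = pd.DataFrame({
          "trial":  list(range(8)) * 2,
          "run":    ["run1"] * 8 + ["run2"] * 8,
          "rating": gpt4 + [0, 1, 2, 1, 1, 2, 1, 1],
      })
      icc = pg.intraclass_corr(data=runs, targets="trial", raters="run",
                               ratings="rating")
      print(f"weighted kappa = {wkappa:.3f}")
      print(icc[["Type", "ICC", "pval"]])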
  3. J Orthop Case Rep. 2025 Apr;15(4):1-3
      Conventionally, systematic reviews and meta-analyses have constituted the highest level of evidence in medical research. With the introduction of artificial intelligence (AI), rapid analysis of large amounts of medical data and synthesis of useful results are possible in a fraction of the time humans take to conduct systematic reviews and meta-analyses. However, AI is not without drawbacks. This article discusses the implications of AI for the future of systematic reviews and meta-analyses in the medical literature.
    Keywords:  Artificial intelligence; meta analysis; orthopedics; systematic reviews
    DOI:  https://doi.org/10.13107/jocr.2025.v15.i04.5420
  4. Urogynecology (Phila). 2025 Apr 08.
       IMPORTANCE: As the volume of medical literature continues to expand, the use of artificial intelligence (AI) to produce concise, accessible summaries has the potential to enhance the efficacy of content review.
    OBJECTIVES: This project assessed the readability and quality of summaries generated by ChatGPT in comparison to the Plain Text Summaries from Cochrane Review, a systematic review database, in incontinence research.
    STUDY DESIGN: Seventy-three abstracts from the Cochrane Library tagged under "Incontinence" were summarized using ChatGPT-3.5 (July 2023 Version) and compared with their corresponding Cochrane Plain Text Summaries. Readability was assessed using the Flesch-Kincaid Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Score, SMOG Index, Coleman-Liau Index, and Automated Readability Index. A 2-tailed t test was used to compare the summaries. Each summary was also evaluated by 2 blinded, independent reviewers on a 5-point scale, where higher scores indicated greater accuracy and adherence to the abstract. (A computational sketch of these readability indices follows this entry.)
    RESULTS: Compared with ChatGPT, Cochrane Review's Plain Text Summaries scored significantly higher on the Flesch-Kincaid Reading Ease score and required significantly lower education levels on the 5 other readability metrics, indicating better readability. However, ChatGPT earned a significantly higher mean accuracy grade (4.25) than Cochrane Review's summaries (4.05).
    CONCLUSIONS: Cochrane Review's Plain Text Summaries provided clearer summaries of the incontinence literature than ChatGPT, yet ChatGPT generated more comprehensive summaries. While ChatGPT can effectively summarize the medical literature, further work is needed to improve the accessibility of its summaries to readers.
    DOI:  https://doi.org/10.1097/SPV.0000000000001688
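    The six readability indices named above are available in the Python textstat package; a minimal sketch on placeholder texts and scores, assuming a paired two-tailed t test across matched summaries:
      # Compute the six readability indices for a summary and compare paired
      # readability scores with a two-tailed t test; inputs are placeholders.
      import textstat
      from scipy import stats

      def readability(text: str) -> dict:
          return {
              "flesch_reading_ease": textstat.flesch_reading_ease(text),
              "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
              "gunning_fog": textstat.gunning_fog(text),
              "smog_index": textstat.smog_index(text),
              "coleman_liau_index": textstat.coleman_liau_index(text),
              "automated_readability_index": textstat.automated_readability_index(text),
          }

      # Hypothetical Flesch Reading Ease scores for three matched summary pairs
      cochrane = [60.1, 58.3, 62.0]
      chatgpt  = [48.7, 51.2, 47.9]
      t, p = stats.ttest_rel(cochrane, chatgpt)   # paired, two-sided by default

      print(readability("The treatment probably reduces leakage episodes."))
      print(t, p)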
  5. BMJ Evid Based Med. 2025 Apr 07. pii: bmjebm-2024-113123. [Epub ahead of print]
       BACKGROUND: Evaluation of the quality of evidence in systematic reviews (SRs) is essential for sound decision-making. Although the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach provides a consolidated framework for rating the level of evidence, its application is complex and time-consuming. Artificial intelligence (AI) could help overcome these barriers.
    DESIGN: Analytical experimental study.
    OBJECTIVE: To develop and appraise a proof-of-concept AI-powered tool for the semiautomation of an adapted GRADE classification system to determine levels of evidence in SRs with meta-analyses of randomised clinical trials.
    METHODS: The URSE-automated system was based on an algorithm created to enhance the objectivity of the GRADE classification. It was developed in Python, with user-friendly interfaces built using the React library. The URSE-automated system was evaluated by analysing 115 SRs from the Cochrane Library and comparing the predicted levels of evidence with those assigned by human evaluators (a toy rule-based sketch follows this entry).
    RESULTS: The open-source URSE code is available on GitHub (http://www.github.com/alisson-mfc/urse). Agreement between the URSE-automated GRADE system and the human evaluators on the quality of evidence was 63.2%, with a Cohen's kappa coefficient of 0.44. Accuracy and F1-scores for the GRADE domains evaluated were 0.97 and 0.94 for imprecision (number of participants), 0.73 and 0.70 for risk of bias, 0.90 and 0.90 for I² (heterogeneity), and 0.98 and 0.99 for methodological quality (A Measurement Tool to Assess Systematic Reviews, AMSTAR), respectively.
    CONCLUSION: The results demonstrate the potential of AI for assessing the quality of evidence. However, given the GRADE approach's emphasis on subjective judgement and on understanding the context of evidence production, full automation of the classification process is not advisable. Nevertheless, combining the URSE-automated system with human evaluation, or integrating the tool into other platforms, represents a promising direction for future work.
    Keywords:  Evidence-Based Practice; Systematic Reviews as Topic
    DOI:  https://doi.org/10.1136/bmjebm-2024-113123
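    The abstract does not spell out the algorithm's rules; as a loose illustration of what a rule-based adaptation of GRADE downgrading can look like (thresholds and domains below are invented, not URSE's):
      # Toy rule-based GRADE-style downgrading; thresholds are illustrative only.
      from dataclasses import dataclass

      LEVELS = ["very low", "low", "moderate", "high"]

      @dataclass
      class MetaAnalysis:
          n_participants: int       # pooled sample size (imprecision proxy)
          i_squared: float          # heterogeneity, in percent
          high_risk_of_bias: bool   # any trial judged at high risk of bias
          low_amstar_quality: bool  # low methodological quality on AMSTAR

      def grade_level(ma: MetaAnalysis) -> str:
          level = 3  # evidence from RCTs starts at "high"
          if ma.n_participants < 400:    # invented imprecision threshold
              level -= 1
          if ma.i_squared > 50:          # invented heterogeneity threshold
              level -= 1
          if ma.high_risk_of_bias:
              level -= 1
          if ma.low_amstar_quality:
              level -= 1
          return LEVELS[max(level, 0)]

      print(grade_level(MetaAnalysis(1200, 30.0, False, False)))  # "high"
      print(grade_level(MetaAnalysis(150, 70.0, True, False)))    # "very low"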