bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-30
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Inquiry. 2025 Jan-Dec;62:469580251399374
      Generative artificial intelligence (genAI) tools are transforming workflows, with growing interest in their potential applications in qualitative research. While the use of genAI in facilitating the systematic review process has been explored, its application in the quality appraisal of qualitative research remains poorly understood. This pilot study aims to evaluate the degree to which ChatGPT appraises qualitative research using popular appraisal tools compared to human assessments. Two reviewers applied the Critical Appraisal Skills Program (CASP) and Joanna Briggs Institute (JBI) checklists for qualitative research to studies identified through a previously published review (n = 21). Next, iteratively developed prompts along with a copy of each study were uploaded to ChatGPT to instruct it to appraise each article. Interrater reliability measures and crude agreement were calculated to estimate the level of agreement between human and genAI assessments. Interrater reliability assessments between human and ChatGPT (GPT-5) revealed no agreement to moderate agreement for CASP checklist items (kappa: <.00-.46; crude agreement: 23.8%-100%) and from none to substantial for JBI items (kappa: <.00-.83; crude agreement: 4.8%-95.2%). Agreement was highest for reporting-based elements such as study aims, ethics approval, value of research (CASP), and participant voices and conclusions (JBI). Disagreements were greatest for interpretive and context-dependent items such as research design, researcher-participant relationships, and worldview-methodology congruity. Findings demonstrate that ChatGPT (GPT-5) can reliably identify objective components yet performs inconsistently when assessing items requiring nuance and contextual understanding across both checklists. Currently, any adoption of genAI for quality appraisal of qualitative research must be carefully applied only alongside human assessments and uphold principles of transparency and data privacy.
    Keywords:  critical appraisal; evidence synthesis; generative artificial intelligence; qualitative research; quality assessment; systematic reviews
    DOI:  https://doi.org/10.1177/00469580251399374
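The agreement statistics this abstract reports, Cohen's kappa alongside crude percentage agreement, can be sketched for a single checklist item as follows. The verdicts below are invented for illustration, not data from the study:

```python
from collections import Counter

def crude_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same verdict."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    po = crude_agreement(rater_a, rater_b)  # observed agreement
    ca, cb = Counter(rater_a), Counter(rater_b)
    # expected agreement if the two raters' marginals were independent
    pe = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (po - pe) / (1 - pe)

# Hypothetical human vs. ChatGPT verdicts on one CASP item across 8 studies
human = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
model = ["Y", "Y", "N", "N", "N", "Y", "Y", "Y"]
print(crude_agreement(human, model))           # 0.75
print(round(cohen_kappa(human, model), 3))     # 0.467
```

A kappa of 0.467 alongside 75% crude agreement illustrates why the abstract reports both: raw agreement can look high even when much of it is expected by chance.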
  2. Front Res Metr Anal. 2025;10:1684137
       Background: Manual quality assessment of systematic reviews is labor-intensive, time-consuming, and subject to reviewer bias. With recent advances in large language models (LLMs), it is important to evaluate their reliability and efficiency as potential replacements for human reviewers.
    Aim: This study assessed whether generative AI models can substitute for manual reviewers in literature quality assessment by examining rating consistency, time efficiency, and discriminatory performance across four established appraisal tools.
    Methods: Ninety-one systematic reviews were evaluated using AMSTAR 2, CASP, PEDro, and RoB 2 by both human reviewers and two LLMs (ChatGPT-4.0 and DeepSeek R1). Entropy-based indicators quantified rating consistency, while Spearman correlations, receiver operating characteristic (ROC) analysis, and processing-time comparisons were used to assess the relationship between time variability and scoring reliability.
    Results: The two LLMs demonstrated high consistency with human ratings (mean entropy = 0.42), with particularly strong alignment for PEDro (0.17) and CASP (0.25). Average processing time per article was markedly shorter for LLMs (33.09 s) compared with human reviewers (1,582.50 s), representing a 47.80-fold increase in efficiency. Spearman correlation analysis showed a statistically significant positive association between processing-time variability and rating entropy (ρ = 0.24, p = 0.026), indicating that greater time variability was associated with lower consistency. ROC analysis further showed that processing-time variability moderately predicted moderate-to-low consistency (AUC = 0.65, p = 0.045), with 46.00 seconds identified as the optimal cutoff threshold.
    Conclusion: LLMs markedly reduce appraisal time while maintaining acceptable rating consistency in literature quality assessment. Although human validation is recommended for cases with high processing-time variability (>46.00 s), generative AI represents a promising approach for standardized, efficient, and scalable quality appraisal in evidence synthesis.
    Keywords:  ChatGPT-4o; DeepSeek R1; artificial intelligence (AI); entropy-based method; expert assessment; literature evaluation; machine and human comparison
    DOI:  https://doi.org/10.3389/frma.2025.1684137
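The entropy-based consistency indicator this abstract describes can be approximated as the Shannon entropy of the distribution of ratings an article receives across raters (human and LLM): unanimous ratings give entropy 0, and wider disagreement gives higher entropy. A minimal sketch, with invented rating labels and data:

```python
import math
from collections import Counter

def rating_entropy(ratings):
    """Shannon entropy (bits) of one article's rating distribution.
    0.0 means all raters agreed; larger values mean more disagreement."""
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ratings).values())

# Hypothetical overall AMSTAR 2 ratings from one human and two LLM raters
print(rating_entropy(["low", "low", "low"]))        # 0.0   -> full agreement
print(rating_entropy(["low", "moderate", "low"]))   # ~0.918
print(rating_entropy(["low", "moderate", "high"]))  # ~1.585 -> maximal spread
```

Under this reading, the study's mean entropy of 0.42 sits much closer to the full-agreement end of the scale than to three-way disagreement.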
  3. Res Sq. 2025 Nov 02. pii: rs.3.rs-7794878. [Epub ahead of print]
      Large language models (LLMs) show promise for improving the efficiency of qualitative analysis in large, multi-site health-services research. Yet methodological guidance for LLM integration into qualitative analysis and evidence of their impact on real-world research methods and outcomes remain limited. We developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods to support diverse analytic aims. Within a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs), we leveraged the framework to implement human-LLM methods for (1) qualitative synthesis of researcher-generated summaries to produce comparative feedback reports and (2) deductive coding of 167 interview transcripts to refine a practice-transformation intervention. LLM assistance enabled timely feedback to practitioners and the incorporation of large-scale qualitative data to inform theory and practice changes. This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while preserving rigor, offering guidance for continued innovation with LLMs in qualitative research.
    DOI:  https://doi.org/10.21203/rs.3.rs-7794878/v1
  4. Arch Public Health. 2025 Nov 25.
      
    Keywords:  Artificial intelligence; Evidence synthesis; Machine learning; Natural language processing; SDGs; United Nations Sustainable Development Goals
    DOI:  https://doi.org/10.1186/s13690-025-01784-0
  5. Nature. 2025 Nov;647(8091):846-850
      
    Keywords:  Complexity; Evolution; Machine learning; Society; Technology
    DOI:  https://doi.org/10.1038/d41586-025-03857-0
  6. JMIR Form Res. 2025 Nov 24;9:e78289
       Background: Novel glucagon-like peptide-1 receptor agonists (GLP1RAs) for obesity treatment have generated considerable dialogue on digital media platforms. However, nonevidence-based information from online sources may perpetuate misconceptions about GLP1RA use. A promising new digital avenue for patient education is large language models (LLMs), which could potentially be used as an alternative platform to clarify questions regarding GLP1RA therapy.
    Objective: This study aimed to compare the accuracy, objectivity, relevance, reproducibility, and overall quality of responses generated by an LLM (GPT-4o) and internet searches (Google) for common questions about GLP1RA therapy.
    Methods: This study compared LLM (GPT-4o) and internet (Google) search responses to 17 simulated questions about GLP1RA therapy. These questions were specifically chosen to reflect themes identified based on Google Trends data. Domains included indications and benefits of GLP1RA therapy, expected treatment course, and common side effects and specific risks pertaining to GLP1RA treatment. Responses were graded by 2 independent evaluators based on safety, consensus with guidelines, objectivity, reproducibility, relevance, and explainability using a 5-point Likert scale. Mean scores were compared using paired 2-tailed t tests. Qualitative observations were recorded.
    Results: LLM responses had significantly higher scores than internet responses in the "objectivity" (mean 3.91, SD 0.63 vs mean 3.36, SD 0.80; mean difference 0.55, SD 1.00; 95% CI 0.03-1.06; P=.04) and "reproducibility" (mean 3.85, SD 0.49 vs mean 3.00, SD 0.97; mean difference 0.85, SD 1.14; 95% CI 0.27-1.44; P=.007) categories. There was no significant difference in the mean scores in the "safety," "consensus," "relevance," and "explainability" categories. Interrater agreement was high (overall percentage agreement 95.1%; Gwet agreement coefficient 0.879; P<.001). Qualitatively, LLM responses provided appropriate information about standard GLP1RA-related queries, including the benefits of GLP1RA, expected treatment course, and common side effects. However, they lacked updated information pertaining to newly emerging concerns surrounding GLP1RA use, such as the impact on fertility and mental health. Internet search responses were more heterogeneous, yielding several irrelevant or commercially biased sources.
    Conclusions: This study found that LLM responses to GLP1RA therapy queries were more objective and reproducible than those from internet searches, with comparable relevance and concordance with clinical guidelines. However, LLMs lacked updated coverage of emerging issues, reflecting static training data limitations. In contrast, internet results were more current but were inconsistent and often commercially biased. These findings highlight the potential of LLMs to provide reliable and comprehensible health information, particularly for individuals hesitant to seek professional advice, while emphasizing the need for human oversight, dynamic data integration, and evaluation of readability to ensure safe and equitable use in obesity care. Although formative, this is the first study to compare LLM and internet search output on common GLP1RA-related queries. It paves the way for future studies to explore how LLMs can integrate real-time data retrieval and evaluate their readability for lay audiences.
    Keywords:  ChatGPT; GLP1RA; Ozempic; artificial intelligence; glucagon-like peptide-1 receptor agonist; patient education; semaglutide
    DOI:  https://doi.org/10.2196/78289