bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-02
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70058
       Introduction: Systematic literature reviews (SLRs) of randomized clinical trials (RCTs) underpin evidence-based medicine but can be limited by the intensive resource demands of data extraction. Recent advances in accessible large language models (LLMs) hold promise for automating this step; however, testing across different outcomes and disease areas remains limited.
    Methods: This study developed prompt engineering strategies for GPT-4o to extract data from RCTs across three disease areas: non-small cell lung cancer, endometrial cancer and hypertrophic cardiomyopathy. Prompts were iteratively refined during the development phase, then tested on unseen data. Performance was evaluated via comparison to human extraction of the same data, using F1 scores, precision, recall and percentage accuracy.
    Results: The LLM was highly effective for extracting study and baseline characteristics, often equaling human performance, with test F1 scores exceeding 0.85. Complex efficacy and adverse event data proved more challenging, with test F1 scores ranging from 0.22 to 0.50. Transferability of prompts across disease areas was promising but varied, highlighting the need for disease-specific refinement.
    Conclusion: Our findings demonstrate the potential of LLMs, guided by rigorous prompt engineering, to augment the SLR process. However, human oversight remains essential, particularly for complex and nuanced data. As these technologies evolve, continued validation of AI tools will be necessary to ensure accuracy and reliability and to safeguard the quality of evidence synthesis.
    Keywords:  artificial intelligence; data extraction; large‐language model; prompt engineering; systematic literature review
    DOI:  https://doi.org/10.1002/cesm.70058
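    A minimal Python sketch of the field-level scoring used to judge extraction quality in entry 1, assuming an exact-match rule after simple normalisation; the field names and values are invented and this is not the authors' pipeline.
      def normalise(value):
          # Collapse whitespace and lower-case so trivial formatting differences are not counted as errors.
          return " ".join(str(value).split()).lower()

      def field_level_scores(llm_fields, human_fields):
          # Human extraction is the reference; an LLM value is a true positive only if it matches after normalisation.
          tp = sum(1 for k, v in human_fields.items()
                   if k in llm_fields and normalise(llm_fields[k]) == normalise(v))
          fp = sum(1 for k in llm_fields
                   if k not in human_fields or normalise(llm_fields[k]) != normalise(human_fields[k]))
          fn = sum(1 for k in human_fields
                   if k not in llm_fields or normalise(llm_fields[k]) != normalise(human_fields[k]))
          precision = tp / (tp + fp) if tp + fp else 0.0
          recall = tp / (tp + fn) if tp + fn else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          return precision, recall, f1

      # Invented example: two baseline-characteristic fields match, one does not.
      human = {"n_randomised": "451", "median_age": "63", "ecog_0_1_pct": "94"}
      llm = {"n_randomised": "451", "median_age": "63", "ecog_0_1_pct": "49"}
      print(field_level_scores(llm, human))  # -> roughly (0.67, 0.67, 0.67)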
  2. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70059
       Introduction: While artificial intelligence (AI) tools have been utilized for individual stages within the systematic literature review (SLR) process, no tool has previously been shown to support each critical SLR step. In addition, the need for expert oversight has been recognized to ensure the quality of SLR findings. Here, we describe a complete methodology for utilizing our AI SLR tool with human-in-the-loop curation workflows, as well as AI validations, time savings, and approaches to ensure compliance with best review practices.
    Methods: SLRs require completing Search, Screening, and Extraction from relevant studies, with meta-analysis and critical appraisal where relevant. We present a full methodological framework for completing SLRs using our AutoLit software (Nested Knowledge). This system integrates AI models into the central SLR steps: Search strategy generation, Dual Screening of Titles/Abstracts and Full Texts, and Extraction of qualitative and quantitative evidence. The system also offers manual Critical Appraisal and Insight drafting and fully automated Network Meta-analysis. Validations comparing AI performance with experts are reported and, where relevant, time savings and 'rapid review' alternatives to the SLR workflow.
    Results: Search strategy generation with the Smart Search AI can turn a Research Question into full Boolean strings with 76.8% and 79.6% Recall in two validation sets. Supervised machine learning tools can achieve 82-97% Recall in reviewer-level Screening. Population, Interventions/Comparators, and Outcomes (PICOs) extraction achieved an F1 of 0.74; accuracies for study type, location, and size were 74%, 78%, and 91%, respectively. Time savings of 50% in Abstract Screening and 70-80% in qualitative extraction were reported. Extraction of user-specified qualitative and quantitative tags and data elements remains exploratory and requires human curation for SLRs.
    Conclusion: AI systems can support high-quality, human-in-the-loop execution of key SLR stages. Transparency, replicability, and expert oversight are central to the use of AI SLR tools.
    Keywords:  artificial intelligence; evidence synthesis; human‐in‐the‐loop; meta‐analysis; systematic literature review
    DOI:  https://doi.org/10.1002/cesm.70059
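    A small, generic illustration of the human-in-the-loop idea in entry 2 (not Nested Knowledge's actual logic): records the screening model is confident about are advanced or excluded automatically, and everything else is queued for a human reviewer. The thresholds, record structure, and probabilities are invented.
      from dataclasses import dataclass

      @dataclass
      class Record:
          record_id: str
          title: str
          include_probability: float  # from a supervised screening model

      def route(records, include_at=0.95, exclude_at=0.05):
          # Split records into auto-include, auto-exclude, and human-review queues.
          auto_include, auto_exclude, needs_human = [], [], []
          for r in records:
              if r.include_probability >= include_at:
                  auto_include.append(r)
              elif r.include_probability <= exclude_at:
                  auto_exclude.append(r)
              else:
                  needs_human.append(r)
          return auto_include, auto_exclude, needs_human

      records = [
          Record("r1", "RCT of drug A vs placebo", 0.98),
          Record("r2", "Narrative review of drug A", 0.02),
          Record("r3", "Conference abstract, unclear design", 0.40),
      ]
      inc, exc, human = route(records)
      print(len(inc), len(exc), len(human))  # -> 1 1 1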
  3. J Clin Epidemiol. 2025 Oct 27. pii: S0895-4356(25)00360-9. [Epub ahead of print] 112027
      Systematic reviews are widely recognized as the cornerstone of evidence-based health decision-making. Individual studies are synthesized into coherent summaries, providing clarity amid the complexity of modern scientific research. However, the systematic review model is facing a crisis. The rapid increase in the number of reviews has paradoxically undermined their utility, with many becoming outdated, redundant, or of low quality. Despite methodological advancements and the introduction of rapid and living reviews, the evidence ecosystem remains fragmented and inefficient. In this paper, it is argued that current reform efforts fall short because the crisis is not addressed holistically, and the concept of "sustainable knowledge" is proposed to frame evidence synthesis through the lens of sustainability. Lessons learned during the COVID-19 pandemic and emerging innovations are drawn upon to reframe the problem as systemic, calling for a reconsideration of the entire lifecycle of systematic reviews, including their creation, updating, and application in practice. Stronger networks of collaboration are encouraged, alongside careful use of automation and artificial intelligence where genuine value is added. It is suggested that academic incentives be reshaped so that quality and relevance are prioritized over the sheer number of publications. It is proposed that by adopting sustainability as a guiding principle, systematic reviews can better fulfill their purpose of providing timely, high-quality, and actionable evidence for health decision-making.
    Keywords:  Evidence synthesis; Sustainable knowledge; Systematic reviews
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112027
  4. J Med Internet Res. 2025 Oct 29;27: e79379
       BACKGROUND: Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach might become the new "gateway" to the internet. However, open-web retrieval exposes models to nonprofessional sources, risking hallucinations and factual errors that might jeopardize evidence-based care.
    OBJECTIVE: We aimed to quantify the impact of guideline-domain whitelisting on the answer quality of 3 publicly available Perplexity web-based retrieval-augmented generation (RAG) models and compare their performance using a purpose-built, biomedical literature RAG system (OpenEvidence).
    METHODS: We applied a validated 130-item question set derived from the American Academy of Neurology (AAN) guidelines (65 factual and 65 case-based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried 4 times per question with open-web retrieval and again with retrieval restricted to aan.com and neurology.org ("whitelisted"). OpenEvidence was queried 4 times. Two neurologists, blinded to condition, scored each response (0=wrong, 1=inaccurate, and 2=correct); any disagreements that arose were resolved by a third neurologist. Ordinal logistic models were used to assess the influence of question type and source category (AAN or neurology vs nonprofessional) on accuracy.
    RESULTS: From the 3640 LLM answers that were rated (interrater agreement: κ=0.86), correct-answer rates were as follows (open vs whitelisted, respectively): Sonar, 60% vs 78%; Sonar-Pro, 80% vs 88%; and Sonar-Reasoning-Pro, 81% vs 89%; for OpenEvidence, the correct-answer rate was 82%. A Friedman test on modal scores across the 7 configurations was significant (χ²₆=73.7; P<.001). Whitelisting improved mean accuracy on the 0 to 2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 nonprofessional source halved the odds of a higher rating in Sonar (odds ratio [OR] 0.50, 95% CI 0.37-0.66; P<.001), whereas citing an AAN or neurology document doubled it (OR 2.18, 95% CI 1.64-2.89; P<.001). Furthermore, factual questions outperformed case vignettes across Perplexity models (ORs ranged from 1.95, 95% CI 1.28-2.98 [Sonar + whitelisting] to 4.28, 95% CI 2.59-7.09 [Sonar-Reasoning-Pro]; all P<.01) but not for OpenEvidence (OR 1.44, 95% CI 0.92-2.27; P=.11).
    CONCLUSIONS: Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8 to 18 percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that at least performed on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for maintaining continuously updated, web-based RAG-augmented LLMs fit for evidence-based neurology.
    Keywords:  artificial intelligence; evidence-based medicine; information retrieval; large language models; medical guidelines; neurology
    DOI:  https://doi.org/10.2196/79379
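    A minimal sketch of the domain-whitelisting idea in entry 4: retrieved sources are kept only if their domain is on an allow-list (here aan.com and neurology.org, as in the study) before being passed to the generation step. This is a generic post-retrieval filter with invented URLs and data structures, not necessarily how the study itself restricted retrieval.
      from urllib.parse import urlparse

      ALLOWED_DOMAINS = {"aan.com", "neurology.org"}  # guideline domains used in entry 4

      def is_whitelisted(url):
          # Keep exact matches and subdomains of the allowed domains.
          host = urlparse(url).netloc.lower()
          return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

      def whitelist_sources(retrieved):
          # Drop any retrieved source whose domain is not on the allow-list.
          return [s for s in retrieved if is_whitelisted(s["url"])]

      retrieved = [  # invented examples
          {"url": "https://www.aan.com/Guidelines/example", "snippet": "..."},
          {"url": "https://n.neurology.org/content/example", "snippet": "..."},
          {"url": "https://somehealthblog.example.com/post", "snippet": "..."},
      ]
      print([s["url"] for s in whitelist_sources(retrieved)])  # keeps only the first two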
  5. J Health Econ Outcomes Res. 2025;12(2): 154-162
       Introduction: Economic evaluations are essential for informed healthcare decision-making but often face challenges due to inconsistent reporting and methodological complexity. Large Language Models (LLMs) offer a scalable alternative for evaluating adherence to reporting standards. Building on Hileas, a previously developed tool, this study assesses the accuracy of LLM-generated evaluations compared with human reviewers, aiming to quantify reliability, identify limitations, and advance automated but assistive quality-assessment methods in health economic research.
    Methods: In all, 110 peer-reviewed economic evaluation papers were evaluated using the CHEERS checklist through structured LLM prompts and scored by 2 human reviewers on a 0-4 ordinal scale. Interrater agreement and LLM performance were measured using Cohen's kappa, sensitivity, specificity, and area under the curve. LLM outputs were compared against human consensus ratings, and usability of the review platform was assessed with the System Usability Scale.
    Results: Among 2860 item-level evaluations, 25.3% showed disagreement between human reviewers, with generally low interrater reliability (kappa=-0.07 to 0.43). Compared with human consensus, the LLM achieved 72.3% to 94.7% agreement, with areas under the curve up to 0.96 but variable performance across checklist items. At the paper level, LLM-assigned CHEERS scores (median, 17) were consistently lower than human-reviewed scores (median, 18-21).
    Conclusion: This study demonstrated an exploratory proof-of-concept application of LLMs to research quality evaluation. Our results suggest that the LLM was generally able to provide well-reasoned evaluations that closely aligned with human assessments, although with some limitations in fully supporting its judgments.
    Keywords:  CHEERS; Large Language Models; artificial intelligence; publication quality
    DOI:  https://doi.org/10.36469/001c.145214
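    A hedged sketch of the agreement statistics named in entry 5: Cohen's kappa between two human reviewers and an AUC for an LLM's confidence on a single (invented) CHEERS item. The study scored papers on a 0-4 ordinal scale; binary labels are used here only to keep the illustration short.
      from sklearn.metrics import cohen_kappa_score, roc_auc_score

      # Invented item-level ratings for one checklist item across ten papers (1 = adequately reported).
      reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
      reviewer_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
      consensus  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

      # Invented LLM confidence scores that the item is adequately reported.
      llm_scores = [0.9, 0.7, 0.2, 0.8, 0.1, 0.95, 0.85, 0.4, 0.6, 0.9]

      print("kappa:", cohen_kappa_score(reviewer_a, reviewer_b))  # interrater agreement
      print("AUC  :", roc_auc_score(consensus, llm_scores))       # LLM vs. human consensus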
  6. Front Artif Intell. 2025;8: 1658316
       Background: The integration of generative artificial intelligence (AI), particularly large language models (LLMs), into medical statistics offers transformative potential. However, it also introduces risks of erroneous responses, especially in tasks requiring statistical rigor.
    Objective: To evaluate the effectiveness of various prompt engineering strategies in guiding LLMs toward accurate and interpretable statistical reasoning in biomedical research.
    Methods: Four prompting strategies (zero-shot, explicit instruction, chain-of-thought, and hybrid) were assessed using artificial datasets involving descriptive and inferential statistical tasks. Outputs from GPT-4.1 and Claude 3.7 Sonnet were evaluated using Microsoft Copilot as an LLM-as-a-judge, with human oversight.
    Results: Zero-shot prompting was sufficient for basic descriptive tasks but failed in inferential contexts due to lack of assumption checking. Hybrid prompting, which combines explicit instructions, reasoning scaffolds, and format constraints, consistently produced the most accurate and interpretable results. Evaluation scores across four criteria (assumption checking, test selection, output completeness, and interpretive quality) confirmed the superiority of structured prompts.
    Conclusion: Prompt design is a critical determinant of output quality in AI-assisted statistical analysis. Hybrid prompting strategies should be adopted as best practice in medical research to ensure methodological rigor and reproducibility. Additional testing with newer models, including Claude 4 Sonnet, Claude 4 Opus, o3 mini, and o4 mini, confirmed the consistency of results, supporting the generalizability of findings across both Anthropic and OpenAI model families. This study highlights prompt engineering as a core competency in AI-assisted medical research and calls for the development of standardized prompt templates, evaluation rubrics, and further studies across diverse statistical domains to support robust and reproducible scientific inquiry.
    Keywords:  AI-assisted data analysis; LLM-as-a-judge; evaluation frameworks; large language models; medical research; prompt engineering; statistical assumption checking; statistical reasoning
    DOI:  https://doi.org/10.3389/frai.2025.1658316
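    An illustrative sketch (not the authors' actual prompts) of the hybrid strategy evaluated in entry 6, combining explicit instructions, a chain-of-thought reasoning scaffold, and output format constraints for a two-group comparison task; all wording is invented.
      def hybrid_prompt(outcome, group_a, group_b):
          # Explicit instructions: the task and the data context.
          instructions = (
              f"You are assisting with the statistical analysis of a biomedical dataset.\n"
              f"Compare {outcome} between {group_a} and {group_b}."
          )
          # Chain-of-thought scaffold: force assumption checking before test selection.
          reasoning_scaffold = (
              "Work step by step:\n"
              "1. State and check the assumptions (normality, equal variances, independence).\n"
              "2. Choose the appropriate test (e.g. t-test vs Mann-Whitney U) and justify it.\n"
              "3. Report the test statistic, p-value, and an effect size with a 95% CI."
          )
          # Format constraints: structured, auditable output.
          format_constraints = (
              "Return your answer in three labelled sections: ASSUMPTIONS, TEST CHOICE, RESULTS.\n"
              "Do not report a p-value without naming the test it came from."
          )
          return "\n\n".join([instructions, reasoning_scaffold, format_constraints])

      print(hybrid_prompt("systolic blood pressure", "the treatment arm", "the control arm"))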