bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026–05–31
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Arthroplast Today. 2026 Jun;39 102035
       Background: Systematic reviews and meta-analyses represent the highest level of evidence in clinical research, but the process of article retrieval and screening is labor-intensive. Large language models, such as ChatGPT-5, may offer an efficient alternative, yet their performance in full systematic review workflows remains untested. This study compares ChatGPT-5's Deep Research and Agent Modes with human researchers in replicating gold standard systematic reviews in total joint arthroplasty.
    Methods: Five published systematic reviews were selected as reference articles. Three groups (orthopaedic research fellows, ChatGPT-5 Deep Research Mode, and ChatGPT-5 Agent Mode) independently identified eligible articles using standardized search terms and inclusion/exclusion criteria. Artificial intelligence (AI) searches were repeated 3 times for reproducibility. Extracted articles were evaluated against the gold standard for recall, precision, false positives/negatives, and time efficiency. Newly identified eligible studies were also assessed.
    Results: The research fellows dedicated 268 hours to screening 9101 articles, achieving 85.2% recall of gold standard articles. Deep Research and Agent Modes averaged 12-14 minutes per search, identifying 47.5% and 40.9% of gold standard articles, respectively. Fellows had fewer false negatives (n = 5) compared with Deep Research (n = 19) and Agent Mode (n = 12). AI models retrieved several additional eligible studies not captured by humans, demonstrating complementary potential.
    Conclusions: Human reviewers remain superior to current AI models in replicating systematic review article selection, particularly for nuanced inclusion/exclusion criteria. However, ChatGPT-5 significantly reduces search time and can identify additional relevant studies, suggesting its role as a valuable adjunct in systematic review workflows with expert oversight.
    Keywords:  Article screening; Artificial intelligence; Large language models; Meta-analysis; Systematic review; Total joint arthroplasty
    DOI:  https://doi.org/10.1016/j.artd.2026.102035
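    Editor's illustration: the abstract above evaluates retrieved articles against a gold standard for recall, precision, and false positives/negatives. A minimal Python sketch of that comparison follows. It is not the authors' code; the function name `screening_metrics` and the article IDs are hypothetical, and it assumes each article can be matched by a unique identifier.

```python
def screening_metrics(retrieved, gold_standard):
    """Compare retrieved article IDs with a gold-standard set (illustrative only)."""
    retrieved = set(retrieved)
    gold = set(gold_standard)
    true_pos = retrieved & gold    # gold-standard articles that were found
    false_pos = retrieved - gold   # found, but not in the gold standard
    false_neg = gold - retrieved   # gold-standard articles that were missed
    return {
        "recall": len(true_pos) / len(gold) if gold else 0.0,
        "precision": len(true_pos) / len(retrieved) if retrieved else 0.0,
        "false_positives": sorted(false_pos),
        "false_negatives": sorted(false_neg),
    }


# Hypothetical IDs, not data from the study.
gold = {"A", "B", "C", "D"}
found = {"A", "B", "E"}
result = screening_metrics(found, gold)
print(result)
```

    In this sketch anything outside the gold standard counts as a false positive, so an item such as "E" would still need manual review before being treated as a newly identified eligible study.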
  2. Res Synth Methods. 2026 May 29. 1-18
      Data extraction in systematic reviews, maps, and meta-analyses is time-consuming and prone to human error or subjective judgment. Large Language Models offer the potential for saving time, yet their performance has been evaluated in a limited range of platforms, disciplines, and review types. We assessed the performance of the Elicit platform across diverse data extraction tasks using journal articles from seven systematic reviews in life and environmental sciences. Human-extracted data served as the gold standard. For each review, we used eight articles for prompt development and another eight for testing. Initial prompts were iteratively refined to exceed 87% accuracy or up to five rounds. We then tested extraction accuracy, reproducibility across user accounts, and the effect of Elicit's high-accuracy mode. Of 90 considered prompts, 70 exceeded 87% accuracy when compared to the gold standard, but accuracy tended to be lower when the prompts were tested on a new set of articles. Repeating data extractions with different Elicit user accounts resulted in 90% agreement on extracted values, though supporting quotes and reasoning matched in only 46% and 30% of cases, respectively. In high-accuracy mode, value matches dropped to 77%, with just 10% quote matches and 0% reasoning matches. Extraction accuracy did not differ by data types. Elicit also helped identify eight (<1%) errors in the gold standard data. Our results show that Elicit can complement, but not replace, human data extractors. Elicit may be best used for sanity checks and to evaluate the clarity of data extraction protocols. Prompts must be fine-tuned and independently validated.
    Keywords:  artificial intelligence; evidence synthesis; meta-analysis; proof of concept; research methods; systematic maps
    DOI:  https://doi.org/10.1017/rsm.2026.10080
  3. Eur Heart J Digit Health. 2026 Jun;7(5): ztag070
    iCARE4CVD consortium
       Aims: Artificial intelligence (AI) tools utilizing large language models (LLMs) can accelerate scientific literature reviews by automating title, abstract, and full-text-based screenings of relevant patient populations and biomarkers. We developed an AI-based tool to automate and improve full-text screening performance using LLMs to accurately identify relevant publications that meet complex criteria.
    Methods and results: We conducted a literature review utilizing the Population, Intervention-biomarkers, Comparison, Outcome framework to define our inclusion and exclusion criteria, focusing on biomarkers in heart failure with reduced ejection fraction (HFrEF). An AI-based full-text screening tool was created to process 5405 selected publications, combining multi-level and task-oriented retrieval-augmented generation (RAG) and agent-based methods, establishing ground truth standards to evaluate performance metrics both for the tool and human reviewers. Intra-LLM reliability was assessed by rerunning screenings on a batch of publications. Among the public and private domain models, LLaMA 3.3 70B was selected for its superior accuracy (82%), precision (71%), and recall (100%) in screening 49 manuscripts by LLMs. During the training phase, based on several hundred manuscripts, performance metrics significantly improved. Validation results showed a sensitivity of 91.4%, specificity of 53.2%, a false positive rate of 46.8%, and a false negative rate of 8.6%. The LLM outperformed human reviewers in F1 score and interrater reliability, achieving 100% consistency across multiple runs, with each run consisting of multiple LLMs on 1000 documents.
    Conclusion: Our study demonstrated that an AI tool can reduce labour-intensive efforts while maintaining accuracy in literature reviews, with greater inter-rater agreement compared to human reviewers.
    Keywords:  Artificial intelligence (AI); Biomarkers; Full-text screening; Heart failure; Large language models (LLMs); Retrieval-augmented generation (RAG)
    DOI:  https://doi.org/10.1093/ehjdh/ztag070
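    Editor's illustration: the validation figures above pair up as complements (sensitivity 91.4% with a false negative rate of 8.6%, specificity 53.2% with a false positive rate of 46.8%). A minimal Python sketch of how these four rates follow from a confusion matrix is given below. The function name `screening_rates` and the counts are hypothetical and are not taken from the study.

```python
def screening_rates(tp, fp, tn, fn):
    """Derive screening rates from confusion-matrix counts (illustrative only)."""
    return {
        "sensitivity": tp / (tp + fn),          # share of relevant papers kept
        "specificity": tn / (tn + fp),          # share of irrelevant papers excluded
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
    }


# Hypothetical counts chosen for easy arithmetic, not data from the study.
rates = screening_rates(tp=90, fp=40, tn=60, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

    Because each pair sums to 100%, reporting the false positive and false negative rates alongside sensitivity and specificity restates the same information from the error side.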
  4. Gynecol Obstet Fertil Senol. 2026 May 25. pii: S2468-7189(26)00139-X. [Epub ahead of print]
       OBJECTIVE: Systematic reviews are a cornerstone of evidence-based medicine but remain time-consuming and prone to human error. Large Language Models (LLMs), such as ChatGPT or Claude, offer new opportunities for partial automation of these tasks. This article aims to provide a critical synthesis of the current uses of LLMs across the key stages of systematic reviews and meta-analyses in healthcare.
    METHODS: We conducted a narrative review based on a literature search in PubMed and Scopus (2019-2025), including empirical studies, scoping reviews, preprints, and technical reports discussing the use of LLMs in any stage of the systematic review process.
    RESULTS: LLMs can support research question formulation, search strategy development, reference screening, data extraction, meta-analysis scripting, and result synthesis. Reported performances are often high, especially for screening and quantitative data extraction, with sensitivities of 95-98%. However, significant limitations persist: hallucinations, bias, misinterpretations, and variability across models. Independent validation remains scarce.
    CONCLUSION: LLMs show promising potential to accelerate several stages of systematic reviews, provided their use is methodologically controlled. A semi-automated approach, combining AI capabilities with human expertise, currently appears the safest. Prompt structuring, result validation, and transparent reporting of AI involvement are essential to ensure the quality and reliability of the synthesized evidence.
    Keywords:  Artificial Intelligence; Evidence-Based Medicine; Grand modèle de langue; Intelligence artificielle; Large Language Models; Meta-Analysis as topic; Médecine factuelle; Méta-analyse comme sujet; Revue systématique; Systematic Review
    DOI:  https://doi.org/10.1016/j.gofs.2026.05.004
  5. Psychol Bull. 2026 Mar;152(3): 349-354
      Using a metascience framework for improving meta-analyses, Jansen et al. (2025) tested the accuracy and efficiency of data extraction from primary studies used in meta-analyses with a range of large language models. Efficiency was impressive: Across thousands of studies and hundreds of variables, eight large language models took less than an hour combined to extract hundreds of thousands of data points, work estimated to take a human coder >6,500 hr. Nevertheless, accuracy was inconsistent, ranging from high to low depending on the variable. From these results, Jansen et al. recommended (a) a research agenda for investigating the use of artificial intelligence (AI) for data extraction and (b) methods for using AI as a partner for data extraction when conducting systematic reviews. This commentary expands on recommendations for the research agenda, such as investigating AI-induced bias, the illusion of exploratory depth, and using AI to extract study quality data. This commentary also offers further considerations regarding using AI as a meta-analysis partner, such as how iterative prompts might reduce coding independence. Finally, the commentary discusses speed-accuracy tradeoffs in meta-analyses. (PsycInfo Database Record (c) 2026 APA, all rights reserved).
    DOI:  https://doi.org/10.1037/bul0000519
  6. Complement Med Res. 2026 May 28. 1-23
      Network meta-analysis (NMA) plays an important role in comparative effectiveness research, particularly in fields such as traditional and complementary medicine, where multiple interventions often need to be assessed within a single evidence framework. However, conducting NMA remains labor-intensive, methodologically demanding, and difficult to complete efficiently using conventional workflows. Although recent advances in large language models have created new opportunities for supporting evidence synthesis, their routine use in NMA is still constrained by limited transparency, prompt dependency, and difficulty integrating with structured analytical procedures. In this study, we introduce SmartEBM, a web-based human-AI collaborative platform designed to support the full workflow of NMA. SmartEBM provides an integrated environment for title and abstract screening, full-text screening, data extraction, risk of bias assessment, statistical analysis, and certainty of evidence assessment. The platform is organized around a human-in-the-loop model, in which AI-assisted functions support repeated and labor-intensive tasks while researchers retain oversight of methodological judgement, verification, and final decision-making. Through its six functional modules, SmartEBM offers low-code interfaces, structured outputs, and verification-oriented workspaces that help connect major steps of evidence synthesis within one platform. Rather than functioning as a stand-alone automation tool, SmartEBM is intended as a practical platform for end-to-end NMA support. This platform-oriented approach may help make evidence synthesis more manageable, traceable, and accessible in routine research practice, especially in complex review settings.
    DOI:  https://doi.org/10.1159/000552717
  7. Nature. 2026 May;653(8116): 983
      
    Keywords:  Machine learning; Medical research; Publishing; Research data
    DOI:  https://doi.org/10.1038/d41586-026-01616-3
  8. Genet Med. 2026 May 27. pii: S1098-3600(26)00930-5. [Epub ahead of print] 102612
       PURPOSE: Variant assessment of rare disease diagnostics depends on using domain knowledge in the time-intensive process of retrieving, reviewing, and synthesizing clinical and technical information.
    METHODS: To address these challenges, we developed the Evidence Aggregator (EvAgg), an open-source, generative-AI-based tool designed to support rare disease diagnosis that systematically extracts relevant information from the scientific literature for any human gene. Further, we constructed an expert-curated dataset and evaluated EvAgg's performance for the tasks of relevant paper selection, finding observations of human genetic variation within those papers, and extracting specific details about those observations (e.g. zygosity, variant inheritance, variant type, functional study, phenotype, and study type). A user study evaluated utility and user experience in rare disease case analysis.
    RESULTS: Our evaluation study revealed that EvAgg achieved 92% recall in identifying relevant papers, 96% recall in detecting instances of genetic variation within those papers, and ∼80% accuracy in extracting individual case and variant-level content. Our subsequent user study evaluated the utility and user experience in rare disease case analysis. We found that EvAgg reduced review time by 34% (p-value < 0.002) and increased the number of papers, variants, and cases evaluated per unit time.
    CONCLUSION: EvAgg provides a thorough and current summary of observed genetic variants and their associated clinical features, supporting the process of manual literature review and enabling rapid synthesis of evidence concerning gene-disease relationships. The demonstrated time savings have the potential to reduce diagnostic latency and increase solve rates for challenging rare disease cases.
    Keywords:  entity recognition and linking; evidence aggregation; generative AI; information retrieval; rare disease
    DOI:  https://doi.org/10.1016/j.gim.2026.102612
  9. Med Sci Educ. 2026 Apr;36(2): 1063-1064
      
    Keywords:  AI-assisted analysis; Artificial intelligence; ChatGPT; Qualitative research; Thematic analysis
    DOI:  https://doi.org/10.1007/s40670-025-02618-y