bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-02-01
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Med Internet Res. 2026 Jan 27. 28: e76130
       Background: Living evidence (LE) synthesis refers to the method of continuously updating systematic evidence reviews to incorporate new evidence. It has emerged to address the limitations of the traditional systematic review process, particularly the absence of or delays in publication updates. The emergence of COVID-19 accelerated progress in the field of LE synthesis, and the applications of artificial intelligence (AI) in LE synthesis are now expanding rapidly. However, the question of which phases of LE synthesis AI should be used in remains unanswered.
    Objective: This study aims to (1) document the phases of LE synthesis where AI is used and (2) investigate whether AI improves the efficiency, accuracy, or utility of LE synthesis.
    Methods: We searched Web of Science, PubMed, the Cochrane Library, Epistemonikos, the Campbell Library, IEEE Xplore, medRxiv, COVID-19 Evidence Network to support Decision-making, and McMaster Health Forum. We used Covidence to facilitate the monthly screening and extraction processes to maintain the LE synthesis process. Studies that used or developed AI or semiautomated tools in the phases of LE synthesis were included.
    Results: A total of 24 studies were included: 17 on LE syntheses (4 involving tool development) and 7 on living meta-analyses (3 involving tool development). First, a total of 34 AI or semiautomated tools were involved, comprising 12 AI tools and 22 semiautomated tools. The most frequently used were machine learning classifiers (n=5) and the Living Interactive Evidence synthesis platform (n=3). Second, 20 AI or semiautomated tools were used for the data extraction or collection and risk of bias assessment phases, and only 1 AI tool was used for the publication update phase. Third, 3 studies demonstrated improvements in efficiency based on time, workload, and conflict rate metrics. Nine studies applied AI or semiautomated tools in LE synthesis, obtaining a mean recall of 96.24%, and 6 studies achieved a mean F1-score of 92.17%. Additionally, 8 studies reported precision values ranging from 0.2% to 100%.
    Conclusions: AI and semiautomated tools primarily facilitate data extraction or collection and risk of bias assessment. Their use in LE synthesis improves efficiency and yields high accuracy, recall, and F1-scores, while precision varies across tools.
    Keywords:  accuracy; artificial intelligence; efficiency; living evidence synthesis; phases; semiautomated tools; utility
    DOI:  https://doi.org/10.2196/76130
  2. Cochrane Evid Synth Methods. 2026 Jan;4(1): e70068
       Introduction: Priority screening has the potential to reduce the number of records that need to be annotated in systematic literature reviews. So-called technology-assisted reviews (TAR) use machine learning with prior include/exclude annotations to continuously rank unseen records by their predicted relevance, so that relevant records are found earlier. In this article, we present a systematic evaluation of methods to determine when it is safe to stop screening under prioritization.
    Methods: We implement an open-source evaluation framework that features a novel method to generate rankings and simulate priority screening processes for 81 real-world data sets. We use these simulations to evaluate 15 statistical or rule-based (heuristic) stopping methods, testing a range of hyperparameters for each.
    Results: The work-saving potential and performance of stopping criteria rely heavily on "good" rankings, which are typically not achieved by a single ranking algorithm across the entire screening process. Our evaluation shows that almost all existing stopping methods either fail to reliably stop without missing relevant records or fail to realize the full potential work savings. Only one method reliably meets the set recall target, but it stops conservatively.
    Conclusions: Many digital evidence synthesis tools provide priority screening features that are already used in many research projects. However, the theoretical work-savings demonstrated in retrospective simulations of prioritization can only be unlocked with safe and reproducible stopping criteria. Our results highlight the need for improved stopping methods and guidelines on how to responsibly use priority screening. We also urge screening platforms to provide indicators and authors to transparently report metrics when automating (parts of) their synthesis.
    Keywords:  digital evidence synthesis; priority screening; stopping methods; systematic maps; systematic reviews; technology‐assisted reviews
    DOI:  https://doi.org/10.1002/cesm.70068
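    Editor's note: a minimal sketch of the kind of rule-based (heuristic) stopping method evaluated in this paper, applied to a simulated priority-screening ranking. The "stop after k consecutive excludes" rule, the threshold k=50, and the simulated relevance labels are illustrative assumptions, not the authors' methods or data sets.

      # Illustrative only: heuristic stopping rule on a simulated ranked screening list.
      import random

      random.seed(0)

      # Simulate 1,000 ranked records; the probability of relevance decays down the
      # ranking, mimicking a reasonably good prioritization algorithm.
      labels = [random.random() < max(0.02, 0.6 - i / 300) for i in range(1000)]

      def screen_with_stop_rule(labels, k=50):
          """Screen records in ranked order; stop after k consecutive irrelevant records."""
          screened = found = consecutive_irrelevant = 0
          for is_relevant in labels:
              screened += 1
              if is_relevant:
                  found += 1
                  consecutive_irrelevant = 0
              else:
                  consecutive_irrelevant += 1
                  if consecutive_irrelevant >= k:
                      break
          return screened, found

      screened, found = screen_with_stop_rule(labels, k=50)
      print(f"screened {screened}/{len(labels)} records, "
            f"recall {found / sum(labels):.0%}")

    Rules like this save screening work but, as the evaluation above shows, may stop before a set recall target is reached, which is why statistically grounded stopping criteria are of interest.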
  3. Minerva Obstet Gynecol. 2026 Jan 29.
       INTRODUCTION: Identifying eligible studies is a foundational component of systematic reviews, requiring careful interpretation of complex inclusion and exclusion criteria. Given the rapid integration of large language models (LLMs) into evidence synthesis, their reliability in autonomously performing this task necessitates timely evaluation. This study aims to assess whether general-purpose LLMs can accurately perform the study identification phase of a published systematic review in obstetrics using only predefined eligibility criteria.
    EVIDENCE ACQUISITION: Six publicly accessible LLMs were given a standardized prompt, without iterative refinement or human oversight, to identify eligible studies from a 2023 JAMA Network Open meta-analysis. Each model's output was compared with the 14 studies included in the reference review. Primary outcomes were precision, recall, F1 score, and hallucination severity.
    EVIDENCE SYNTHESIS: Claude 3.7 achieved the highest accuracy, correctly identifying 5 of 14 reference studies (precision 71.4%, recall 35.7%, F1 score 0.48). In comparison, all other models exhibited substantially lower performance with minimal variation in F1 scores (ranging from 0.08 to 0.12), indicating a poor balance of precision and recall. Precision was generally inversely related to the number of false positives, and LLMs that returned more total studies tended to produce more hallucinations.
    CONCLUSIONS: Current general-purpose LLMs are unreliable for autonomous study identification in clinically relevant systematic reviews and human oversight remains essential. The low F1 scores highlight major limitations in current LLMs' ability to accurately and comprehensively identify relevant studies. These findings underscore the need for fine-tuning and hybrid AI-human workflows before safe integration into evidence synthesis in obstetrics and gynecology.
    DOI:  https://doi.org/10.23736/S2724-606X.25.05849-X
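    Editor's note: a minimal sketch of how the reported precision, recall, and F1 score follow from set overlap between a model's returned studies and the reference review's 14 included studies. The record identifiers are placeholders; only the counts (5 true positives out of 7 returned studies, inferred from the abstract's figures for Claude 3.7) come from the paper.

      # Illustrative only: precision/recall/F1 from overlap with the reference review.
      reference = {f"ref_{i}" for i in range(14)}                     # 14 included studies
      llm_output = {f"ref_{i}" for i in range(5)} | {"fp_1", "fp_2"}  # 5 hits + 2 false positives

      true_positives = len(llm_output & reference)
      precision = true_positives / len(llm_output)            # 5/7  = 71.4%
      recall = true_positives / len(reference)                # 5/14 = 35.7%
      f1 = 2 * precision * recall / (precision + recall)      # about 0.48

      print(f"precision {precision:.1%}, recall {recall:.1%}, F1 {f1:.2f}")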
  4. Nature. 2026 Jan;649(8099): 1099-1101
      
    Keywords:  History; Language; Machine learning; Technology
    DOI:  https://doi.org/10.1038/d41586-026-00245-0
  5. Healthcare (Basel). 2026 Jan 19;14(2): 248. [Epub ahead of print]
      Background/Objectives: Vast amounts of textual data are generated by healthcare organizations every year. Traditional content analysis is time-intensive, error-prone, and potentially biased. This study demonstrates how freely available large language model (LLM) artificial intelligence (AI) tools can efficiently and effectively analyze qualitative healthcare data and uncover insights missed by traditional manual analysis. Interview data from chief nursing officers (CNOs) at top-performing academic medical centers were analyzed to identify factors contributing to their operational and patient quality success.
    Methods: Semi-structured interviews were conducted with CNOs from top-performing academic medical centers that achieved top-decile quality measures while using resources most efficiently. Interview transcripts were analyzed using a combination of traditional text mining with LSA and Gemini 2.5. The capability of four freely available AI platforms (Gemini 2.5, Scholar AI 5.1, Copilot's Chat, and Claude's Sonnet 4.5) was also reviewed.
    Results: LLM AI analysis identified ten primary factors, comprising twenty-four subtopics, that characterized successful hospital performance. Notably, AI analysis identified a theoretical connection that manual analysis had missed, revealing how the identified framework aligned with Donabedian's seminal structure-process-outcomes quality model. The AI analysis reduced the required time from weeks to nearly instantaneous.
    Conclusions: LLM AI tools offer a transformative approach to unlocking insight from qualitative textual data in healthcare settings. These tools can provide rapid insight that is accessible to personnel with minimal text-mining expertise and offer a practical solution for healthcare organizations to unlock insight hidden in the vast amounts of textual data they hold.
    Keywords:  Donabedian model; Large Language Models (LLMs); artificial intelligence in healthcare; healthcare quality; qualitative data analysis; text mining
    DOI:  https://doi.org/10.3390/healthcare14020248
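    Editor's note: assuming "LSA" here refers to latent semantic analysis, a minimal sketch of the traditional text-mining side of the workflow using scikit-learn. The interview snippets below are invented placeholders, not the study's CNO transcripts, and the LLM side of the analysis is not shown.

      # Illustrative only: latent semantic analysis (TF-IDF + truncated SVD) over
      # placeholder interview snippets; the real study analysed CNO interview transcripts.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD

      documents = [
          "Nurse staffing ratios and retention drive quality outcomes.",
          "Executive rounding and shared governance support frontline staff.",
          "Data dashboards track patient safety and readmission metrics.",
          "Interdisciplinary collaboration shortens length of stay.",
      ]

      tfidf = TfidfVectorizer(stop_words="english")
      X = tfidf.fit_transform(documents)

      svd = TruncatedSVD(n_components=2, random_state=0)
      doc_topics = svd.fit_transform(X)        # document-by-topic matrix

      terms = tfidf.get_feature_names_out()
      for i, component in enumerate(svd.components_):
          top_terms = [terms[j] for j in component.argsort()[::-1][:4]]
          print(f"Topic {i}: {', '.join(top_terms)}")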
  6. Knee Surg Sports Traumatol Arthrosc. 2026 Jan 28.
    ESSKA Artificial Intelligence Working Group
      Research communication is undergoing a paradigm shift. The traditional linear manuscript, foundational for centuries, increasingly reveals limitations in the digital era, struggling with information overload, delayed dissemination, and rigid formats. We propose a transition towards 'living publications': interactive, artificial intelligence (AI)-enhanced platforms that evolve with new evidence. Unlike static papers, these systems utilise large language models (LLMs) and vector databases to interpret context, synthesise real-time findings, and map interdisciplinary connections. This shift promises to democratise knowledge, accelerate validation, and enable dynamic evidence synthesis. However, it necessitates robust frameworks for verification, 'version of record' tracking, and peer review to maintain rigour. Successfully navigating this transition requires balancing technological innovation with preservation of academic values, ensuring that increased speed and accessibility enhance rather than diminish the quality of scientific discourse. As interactive platforms mature, they may reshape how knowledge is shared, discovered, and applied, ideally accelerating scientific advancement through more dynamic, accessible, and collaborative research communication. LEVEL OF EVIDENCE: NA.
    Keywords:  artificial intelligence; digital publishing; knowledge dissemination; natural language processing; scientific communication
    DOI:  https://doi.org/10.1002/ksa.70286
  7. J Glob Health. 2026 Jan 30. 16: 04037
       Background: Artificial intelligence (AI) tools based on large language models (LLMs) are being increasingly used by researchers and may play a role in health-related research priority-setting exercises (RPSEs). However, little is known about how these tools may differ in the types of research priorities they generate.
    Methods: We examined research priorities aimed at improving treatments for four diseases: cancer, COVID-19, HIV, and Alzheimer's disease. We compared the outputs from five AI tools (DeepSeek, ChatGPT, Claude, Perplexity, and Gemini) using SBERT-BioBERT embeddings and cosine similarity scores, and assessed the stability of differences between them by re-running identical prompts and slightly modified versions.
    Results: We found that the outputs produced by Gemini were highly similar to those produced by the other tools. The two most divergent outputs were those produced by DeepSeek and Perplexity: the former tended to emphasise technical medical issues, while the latter emphasised public health concerns. This substantive distinction between DeepSeek and Perplexity remained stable across repeated and tweaked prompts.
    Conclusions: Our exploratory analysis suggests that Gemini performs well for researchers who prefer to generate health-related research priorities using a single AI model. For those planning to draw on multiple models, Perplexity and DeepSeek offer complementary perspectives.
    DOI:  https://doi.org/10.7189/jogh.16.04037
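    Editor's note: a minimal sketch of the embedding-and-cosine-similarity comparison described in the methods, using the generic "all-MiniLM-L6-v2" sentence-transformers model as a stand-in for the authors' SBERT-BioBERT embeddings; the two priority statements are invented placeholders.

      # Illustrative only: compare research-priority statements from two tools via
      # sentence embeddings and cosine similarity.
      from sentence_transformers import SentenceTransformer
      from sklearn.metrics.pairwise import cosine_similarity

      tool_a = ["Develop targeted therapies for treatment-resistant tumours."]
      tool_b = ["Expand community-based screening and equitable access to care."]

      model = SentenceTransformer("all-MiniLM-L6-v2")
      emb_a = model.encode(tool_a)
      emb_b = model.encode(tool_b)

      score = cosine_similarity(emb_a, emb_b)[0, 0]
      print(f"cosine similarity between tool outputs: {score:.2f}")

    Aggregated over many priority statements, pairwise scores like this support the kind of similarity comparison reported between DeepSeek, Perplexity, and the other tools.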
  8. Sci Rep. 2026 Jan 24. 16(1): 3258
      
    Keywords:  AI performance evaluation; Adherence to reporting standards; Adherence to standardised checklists and guidelines; Evidence synthesis; Generative AI; Research checklists
    DOI:  https://doi.org/10.1038/s41598-025-29591-1
  9. J Clin Nurs. 2026 Jan 27.
      
    Keywords:  artificial intelligence; clinical decision‐making; healthcare technology; nursing informatics; systematic review
    DOI:  https://doi.org/10.1111/jocn.70200
  10. J Dent Sci. 2026 Jan;21(1): 679-680
      
    Keywords:  Academic writing; Artificial intelligence; Fabricated references; Large language models
    DOI:  https://doi.org/10.1016/j.jds.2025.10.024