bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-02-22
fourteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Syst Rev. 2026 Feb 17.
       BACKGROUND: The process of developing and updating an evidence gap map (EGM) is based on the principles of systematic reviews and requires extensive time and financial resources. Artificial intelligence (AI) tools, like prioritisation screening (PS), integrated into programmes such as EPPI-Reviewer (ER) and Copilot 365, can potentially mimic human performance in systematic review processes. ER is a subscription-based web application employed by systematic review groups, while Copilot 365, integrated into Microsoft 365, offers real-time assistance. Although ER shows promise in speeding up screening, the optimal threshold for accuracy remains unclear. Additionally, there is no evidence on the effectiveness of any version of Copilot in systematic review and EGM processes.
    OBJECTIVES: To assess the accuracy and efficiency of Copilot 365 and PS integrated into ER at different stages of an EGM update, comparing them to human performance.
    METHODS: We will conduct both manual and automated screening of references, full-text screening, data extraction, and critical appraisal. Two reviewers will independently screen studies for inclusion, extract data, and appraise included studies, resolving conflicts through discussion. We will assess the accuracy and efficiency of Copilot 365 and ER at different EGM update stages, comparing them to human performance. To evaluate PS accuracy, we will use 20% and 40% manual screening thresholds, calculating the proportion of relevant references prioritised by PS and the total number of relevant citations missed. We will compare Copilot 365's full-text screening accuracy to reviewers' decisions and assess consistency using Cohen's Kappa. For automated data extraction and appraisal, we will manually inspect 20% of Copilot 365's outputs, comparing them to reviewers' results, measuring consistency with Cohen's Kappa, and evaluating time savings by comparing the time taken for manual extraction versus using Copilot 365.
    DISCUSSION: This study will offer insights into ER's accuracy in screening small samples of citations and potentially guide future applications in this context. Additionally, by evaluating Copilot 365, which shares similar features with other AI tools, we will gain a broader understanding of its applicability and limitations in evidence synthesis, making the results relevant to other AI applications in this field.
    SYSTEMATIC REVIEW REGISTRATION: Registered at Open Science Framework: https://doi.org/10.17605/OSF.IO/49BX8.
    Keywords:  Accuracy; Artificial intelligence; Automation; Copilot 365; Critical appraisal; Data extraction; Large language models; Priority screening; Protocol; Systematic review
    DOI:  https://doi.org/10.1186/s13643-026-03101-4
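    To make the protocol's threshold-based evaluation concrete, here is a minimal Python sketch of recall at a 20%/40% screening cut-off and of Cohen's kappa for paired include/exclude decisions; only the thresholds follow the protocol, while the ranked labels and decisions are invented examples, not study data or the authors' code:
      # Illustrative sketch only: evaluates a hypothetical priority-screening run.
      from sklearn.metrics import cohen_kappa_score

      def recall_at_threshold(ranked_relevance, fraction):
          """Proportion of all relevant references found within the top `fraction`
          of a priority-ranked list (1 = relevant, 0 = irrelevant)."""
          cutoff = int(len(ranked_relevance) * fraction)
          found = sum(ranked_relevance[:cutoff])
          total = sum(ranked_relevance)
          return found / total if total else 0.0

      # Hypothetical priority-ranked screening labels (best-ranked first).
      ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
      for frac in (0.20, 0.40):
          print(f"Recall after screening top {frac:.0%}: {recall_at_threshold(ranked, frac):.2f}")

      # Consistency between two sets of include/exclude decisions (e.g., tool vs reviewer).
      tool_decisions = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
      reviewer_decisions = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
      print("Cohen's kappa:", round(cohen_kappa_score(tool_decisions, reviewer_decisions), 2))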
  2. Am J Perinatol. 2026 Feb 19.
      Objective: Systematic reviews depend on rigorous risk-of-bias (RoB) assessments to ensure credibility, yet manual evaluation using the Cochrane RoB 2 tool is resource-intensive. While Large Language Models (LLMs) offer potential for automation, their alignment with human judgment remains underexplored. This study evaluates the reliability of ChatGPT-4o, ChatGPT-5, and Claude 3.5 Sonnet in assessing RoB in randomized controlled trials (RCTs), comparing their agreement with human reviewers and internal consistency.
    Study Design: We retrospectively analyzed 180 RCTs from systematic reviews published in the American Journal of Obstetrics and Gynecology (2021-2023) reporting complete human RoB 2 ratings. Each LLM processed full-text PDFs using a standardized prompt incorporating the complete RoB 2 algorithm. Model performance was evaluated against human benchmarks using Cohen's kappa and prevalence- and bias-adjusted kappa (PABAK). Intra-model reliability was assessed across three independent runs to measure consistency.
    Results: ChatGPT-5 consistently outperformed other models, achieving the highest agreement in randomization (Domain 1; 76%), missing outcome data (Domain 3; 80%), and outcome measurement (Domain 4; 76%). It showed moderate concordance for deviations from intended interventions (69%). However, all models struggled with selective reporting (Domain 5), where agreement dropped to 47-51%. For overall risk-of-bias judgments, ChatGPT-5 demonstrated superior concordance (60-62%, κ=0.36-0.40) compared to ChatGPT-4o (45%) and Claude 3.5 Sonnet (43%). ChatGPT-5 also exhibited substantial to near-perfect internal consistency.
    Conclusion: Among the evaluated models, ChatGPT-5 most closely approximated human RoB 2 assessments and achieved superior internal consistency, suggesting it could serve as a practical first-pass tool to reduce reviewer burden. However, persistent limitations in detecting selective reporting, likely due to the inability to cross-reference external trial registries, highlight that expert human oversight remains essential for accurate evidence synthesis.
    DOI:  https://doi.org/10.1055/a-2793-9092
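    As a worked illustration of the agreement statistics used above, the following sketch computes Cohen's kappa via scikit-learn and PABAK from its standard formula, (k*Po - 1)/(k - 1) for k rating categories; the paired ratings are invented, not the study's data:
      # Illustrative sketch: Cohen's kappa and prevalence- and bias-adjusted kappa
      # (PABAK) for paired RoB 2 judgments. The ratings below are invented examples.
      from sklearn.metrics import cohen_kappa_score

      def pabak(rater_a, rater_b, n_categories):
          """PABAK = (k * Po - 1) / (k - 1), where Po is observed agreement
          and k is the number of rating categories (Byrt et al., 1993)."""
          observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
          return (n_categories * observed - 1) / (n_categories - 1)

      # Hypothetical domain-level judgments: low / some concerns / high risk of bias.
      human = ["low", "low", "some", "high", "low", "some", "low", "low", "high", "low"]
      model = ["low", "low", "low", "high", "low", "some", "low", "some", "high", "low"]

      print("Cohen's kappa:", round(cohen_kappa_score(human, model), 2))
      print("PABAK:", round(pabak(human, model, n_categories=3), 2))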
  3. Cochrane Evid Synth Methods. 2026 Mar;4(2): e70074
       Objective: PubReMiner is a text-mining tool that analyses a seed set of citations to assess word frequency in titles, abstracts, and Medical Subject Headings (MeSH). This study aimed to determine the sensitivity and precision of search strategies developed using the PubReMiner tool compared to conventional search strategies developed by a librarian at our institution.
    Methods: Twelve consecutive reviews conducted at our center were included from September 2023 to January 2025. These reviews spanned various types of evidence synthesis, including rapid reviews and systematic reviews, and covered a variety of topics. One librarian developed a comprehensive search strategy, including a conventional MEDLINE search, for each review. Separately, two librarians independently developed MEDLINE search strategies using PubReMiner-generated word frequency tables (PubReMiner 1 and PubReMiner 2). All search strategies were constructed by experienced librarians using predefined work instructions. Primary outcomes were sensitivity and precision. Secondary outcomes included the number needed to read, the number of unique references retrieved, and the time taken to construct each strategy.
    Results: Sensitivity of PubReMiner strategies was generally lower than that of conventional strategies; however, in one review, PubReMiner achieved a higher sensitivity (83.87%) than the conventional strategy (58.06%). Only the sensitivity outcome showed a statistically significant difference between search methods (Friedman test p = 0.0065). No statistically significant difference in precision between the searches was identified. PubReMiner strategies were typically faster to construct but yielded inconsistent performance across reviews and between librarians.
    Conclusion: While PubReMiner offers efficiency advantages, its inconsistent performance in retrieving relevant studies suggests that it should not replace conventional search strategies. The study illustrates the value of multi-review SWARs (studies within a review) in producing evidence that informs evidence synthesis practices.
    Keywords:  SWAR; information retrieval; study within a review; systematic search methods; text‐mining
    DOI:  https://doi.org/10.1002/cesm.70074
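    For readers who want the outcome definitions spelled out, a small Python sketch of sensitivity, precision, and number needed to read follows; the record IDs and counts are invented and are not the review data:
      # Illustrative sketch: search-strategy metrics from invented sets of
      # retrieved and relevant records.
      def search_metrics(retrieved_ids, relevant_ids):
          retrieved, relevant = set(retrieved_ids), set(relevant_ids)
          true_positives = len(retrieved & relevant)
          sensitivity = true_positives / len(relevant)        # share of relevant studies found
          precision = true_positives / len(retrieved)         # share of retrieved records that are relevant
          nnr = 1 / precision if precision else float("inf")  # records screened per relevant hit
          return sensitivity, precision, nnr

      # Hypothetical IDs: 30 records known to be relevant, 400 retrieved by the strategy.
      relevant = range(1, 31)
      retrieved = list(range(1, 27)) + list(range(1000, 1374))

      sens, prec, nnr = search_metrics(retrieved, relevant)
      print(f"Sensitivity: {sens:.1%}  Precision: {prec:.1%}  NNR: {nnr:.1f}")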
  4. Syst Rev. 2026 Feb 21.
      Systematic reviews are crucial for synthesizing evidence, but their manual processes, particularly abstract screening, are labor-intensive and prone to error. Advances in machine learning (ML) offer solutions to enhance efficiency and accuracy. Using a review of learning analytics (LA) in higher education as a case study, this tutorial provides a step-by-step guide to implementing two ML solutions to streamline abstract screening. The first solution is ASReview, an active learning-based ML framework; we detail data preparation, ASReview setup, and its active learning capabilities, which significantly reduce manual workloads while maintaining high recall rates. The second solution is ChatGPT, a GPT-4-powered large language model (LLM); we demonstrate how to optimize prompts and parameters in Python's Google Colab environment for accurate and consistent screening results. We present performance metrics, including sensitivity, specificity, and accuracy, to evaluate each tool's strengths and limitations. ASReview excels in handling large datasets, while ChatGPT enhances screening precision with well-designed prompts. This tutorial empowers researchers to integrate ML into systematic reviews, ensuring rigor, transparency, and efficiency while addressing the growing complexity of evidence synthesis.
    Keywords:  Abstract screening; Machine learning; Systematic review; Tutorial
    DOI:  https://doi.org/10.1186/s13643-026-03111-2
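    The tutorial's ChatGPT screening workflow can be sketched roughly as below; the prompt wording, inclusion criteria, and model name are placeholders rather than the authors' actual settings, and the call uses the current OpenAI Python client:
      # Minimal sketch of LLM-assisted title/abstract screening. Prompt, criteria,
      # and model name are assumptions for illustration, not the tutorial's setup.
      from openai import OpenAI  # assumes the `openai` package and an API key are configured

      client = OpenAI()
      CRITERIA = "Include only empirical studies of learning analytics in higher education."

      def screen_abstract(title: str, abstract: str, model: str = "gpt-4o") -> str:
          response = client.chat.completions.create(
              model=model,
              temperature=0,  # reduce run-to-run variation in decisions
              messages=[
                  {"role": "system",
                   "content": "You are screening records for a systematic review. "
                              "Answer with exactly one word: INCLUDE or EXCLUDE."},
                  {"role": "user",
                   "content": f"Criteria: {CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}"},
              ],
          )
          return response.choices[0].message.content.strip().upper()

      # decision = screen_abstract("Predicting dropout with LMS logs", "We analysed ...")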
  5. NPJ Digit Med. 2026 Feb 19.
      Systematic reviews provide the highest level of evidence but remain resource-intensive. We evaluated the performance of a large language model (LLM; ChatGPT, OpenAI) in a PRISMA-guided review of randomized controlled trials on vaginal vault prolapse surgery. Prompts were carefully designed to minimize errors, and outputs were verified. Each task was completed within minutes. For title/abstract screening, recall was 69.8% and precision 85.7% (κ = 0.77); full-text agreement 94.1-100% (κ = 0.82-1); data extraction accuracy 87.5-99.7%. From 18 RCTs (1668 women), sacrocolpopexy (SC) showed higher anatomic success than sacrospinous fixation (SSF) (OR 1.42, 95% CI 0.71-2.84). Transvaginal mesh improved 3-year objective success compared with SSF (OR 1.84, 95% CI 1.13-2.99) but had higher reoperation rates (5-16% vs 2-4%) than SC. We did not find conclusive evidence that any single technique is superior; most comparisons were underpowered, with wide confidence intervals and substantial heterogeneity. All LLM-derived statistical results were identical to those from conventional R analyses, confirming robustness. Validated LLM workflows can enable more efficient and scalable evidence synthesis.
    DOI:  https://doi.org/10.1038/s41746-026-02431-w
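    As a reminder of what sits behind pooled odds ratios like those quoted above, here is a minimal fixed-effect inverse-variance sketch with invented 2x2 trial counts; the published analyses were performed separately (and verified against R), so this is illustration only:
      # Inverse-variance pooling of log odds ratios (fixed-effect), toy data only.
      import math

      # (events_treatment, n_treatment, events_control, n_control) for hypothetical trials
      trials = [(40, 60, 35, 62), (55, 80, 48, 79), (22, 35, 18, 34)]

      weights, weighted_log_ors = [], []
      for a, n1, c, n2 in trials:
          b, d = n1 - a, n2 - c                 # non-events in each arm
          log_or = math.log((a * d) / (b * c))  # log odds ratio for one trial
          var = 1 / a + 1 / b + 1 / c + 1 / d   # variance of the log OR
          weights.append(1 / var)
          weighted_log_ors.append(log_or / var)

      pooled_log_or = sum(weighted_log_ors) / sum(weights)
      se = math.sqrt(1 / sum(weights))
      or_, lo, hi = (math.exp(x) for x in (pooled_log_or,
                                           pooled_log_or - 1.96 * se,
                                           pooled_log_or + 1.96 * se))
      print(f"Pooled OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")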
  6. Syst Rev. 2026 Feb 19.
       BACKGROUND: Health Technology Assessment (HTA) is a cornerstone of evidence for informing health policy and resource allocation globally. Rapid advancements in, and the proliferation of, digital health technologies (DTs) and artificial intelligence (AI) have prompted a re-examination of HTA processes and methods. While traditional approaches are manual and labor-intensive, the use of AI and other digital technologies for automation, decision support, and evidence synthesis is now being explored within HTA processes. To date, however, few studies have mapped the technological innovations used in HTA, the models of integration, and the associated barriers, facilitators, and governance considerations. This scoping review therefore aims to address this gap by mapping global knowledge and practices related to AI and DTs used in and for HTA and by identifying the key barriers and enablers influencing their adoption, integration, and effective application within HTA processes.
    METHODS: A scoping review will be conducted between August and November 2025, following the Arksey and O'Malley framework, enhanced by Joanna Briggs Institute (JBI) recommendations, and reported according to Preferred Reporting Items for Systematic Reviews and Meta‑Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines. Literature searches will be performed in electronic databases such as Medline (Ovid), Embase (Ovid), Global Health (Ovid), CINAHL (Ebsco), Scopus, Web of Science, and all regional indexes in the World Health Organization's Global Index Medicus, and other region-specific sources for studies published between 2020 and 2025. Eligible studies will include peer-reviewed articles and grey literature describing the integration of digitization, automation, and AI in global HTA processes. Dual independent screening, data extraction, and quality appraisal will be employed.
    DISCUSSION: Findings from this review will provide a map of how digitization, automation, and AI are integrated into HTA practice, highlighting key enablers, barriers, and knowledge gaps. The insights will be used to better guide researchers, policymakers, HTA agencies, and AI developers, further supporting future research and implementation strategies for better informed decision-making.
    Keywords:  Artificial Intelligence; Digital Technologies; Health Technology Assessment (HTA); Scoping Review Evidence
    DOI:  https://doi.org/10.1186/s13643-026-03120-1
  7. Prev Sci. 2026 Feb 17.
      The field of prevention science seeks to identify and implement effective strategies to address social, emotional, and health challenges. A critical aspect of this endeavor is determining the core components of prevention programs that drive positive outcomes. This article presents a case study utilizing artificial intelligence (AI)-assisted systematic review methods to identify key components of healthy marriage and relationship education programs. Given the growing body of research in this domain, AI tools offer a promising means to enhance the efficiency and accuracy of literature reviews. This study employed AI to screen, code, and validate research articles, demonstrating its effectiveness in expediting systematic reviews while maintaining high accuracy in inclusion screening. This case study involved a systematic review of 22,028 resources (identified from PsycINFO, Academic Search Ultimate, and Google) and a final data set of 268 relevant studies. AI screening was integral to conducting multiple rounds of screening effectively. However, findings also highlight challenges in AI-assisted qualitative data abstraction, underscoring the continued need for human expertise in complex coding tasks. The study contributes to the ongoing discourse on integrating AI into prevention science methodologies and offers insights for optimizing AI applications in systematic reviews.
    Keywords:  AI-assisted literature review; AI-assisted meta analysis; AI-supported systematic review; Artificial intelligence; Literature review; Meta analysis; Systematic review
    DOI:  https://doi.org/10.1007/s11121-026-01885-4
  8. Knee. 2026 Feb 17;60:104388. pii: S0968-0160(26)00066-9. [Epub ahead of print]
       OBJECTIVE: The purpose of this study was to prompt GPT-4 to analyze qualitative data used in a published scientific article where qualitative content analysis was performed by human researchers, and to qualitatively compare results from the published article with the results generated by GPT-4.
    METHODS: This study was conducted using the full interview dataset from a published qualitative study that aimed to explore experiences of patients treated with rehabilitation alone after an anterior cruciate ligament (ACL) injury. Interview transcripts were analyzed by GPT-4 through iterative prompting to replicate the original six-step content analysis process. Different attempts were conducted to improve GPT-4's output. GPT-4's final output was qualitatively compared with the human-generated results.
    RESULTS: While the human-made analysis produced one overarching theme supported by three main categories and nine sub-categories, GPT-4's analysis resulted in four themes, six main categories, and 15 sub-categories. Both analyses captured uncertainty and the impact of knee-related symptoms. GPT-4's results showed a suspiciously equal distribution of codes across sub-categories, and introduced a theme not grounded in the source data. Multiple prompts were required to produce and organize the material.
    CONCLUSION: The analysis performed by humans and GPT-4 had similarities and differences. The use of GPT-4 for qualitative analysis in its present form is challenging and needs to be performed across several steps. Currently, GPT-4 should not be used as the only tool in a qualitative analysis of interview data.
    Keywords:  Language processing; Qualitative research; Rehabilitation
    DOI:  https://doi.org/10.1016/j.knee.2026.104388
  9. Drug Alcohol Rev. 2026 Feb;45(2): e70128
      
    Keywords:  large language models; research metrics; research waste; systematic reviews
    DOI:  https://doi.org/10.1111/dar.70128
  10. Therapie. 2026 Jan 15. pii: S0040-5957(26)00016-8. [Epub ahead of print]
      The integration of language models into pharmacovigilance offers a valuable opportunity to enhance quality and efficiency across workflows. With their rapid evolution, these techniques have found diverse applications in pharmacovigilance, revealing promising advances in time-consuming and low-added-value tasks; nevertheless, several practical constraints continue to hinder their adoption. This work examines the use of traditional natural language processing techniques and advanced language models across three major pharmacovigilance domains: (i) adverse drug reactions extraction from medical records, (ii) case processing, and (iii) evidence screening. A PubMed® search was conducted to identify potentially relevant studies. Subsequently, expert-based selection refined the core literature set, which was expanded through related-reference screening to capture highly cited or conceptually linked papers. Traditional natural language processing techniques, including rule-based systems, dictionaries, and statistical models, offer transparent and efficient mechanisms for structured information extraction, and have shown practical applications, particularly in adverse drug reaction identification and coding. However, they often struggle with the linguistic variability typical of clinical narratives. In contrast, advanced language models, including large pre-trained transformers and large language models, demonstrate superior contextual understanding and adaptability to unstructured and heterogeneous text sources. Yet, their regulatory acceptance remains limited by hallucination risks, reduced transparency/reproducibility, dependence on parameter tuning, and the continued need for human-in-the-loop validation to ensure reliability. Additionally, the substantial computational requirements of large language models impose significant environmental costs, emphasizing the importance of rational and sustainable implementation strategies. Overall, natural language processing and advanced language models should be regarded as complementary approaches. Their integration can augment human expertise and foster a scalable, trustworthy, and sustainable pharmacovigilance ecosystem. At present, an interdisciplinary approach, where pharmacovigilance professionals actively contribute to model design, validation, and oversight, remains essential to harness automation benefits while maintaining clinical integrity.
    Keywords:  Adverse drug reaction reporting systems; Artificial intelligence; Data annotation; Expert system; Sustainable development
    DOI:  https://doi.org/10.1016/j.therap.2025.12.003
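    A toy example of the dictionary- and rule-based extraction the review contrasts with advanced language models, using an invented term list and a crude negation rule:
      # Toy dictionary + rule-based ADR extraction; terms and rules are illustrative only.
      import re

      ADR_DICTIONARY = {"nausea", "rash", "headache", "dizziness", "hepatotoxicity"}
      NEGATIONS = ("no ", "denies ", "without ")

      def extract_adrs(note: str):
          """Return dictionary terms found in a clinical note, skipping simple negations."""
          findings = []
          for sentence in re.split(r"[.;]\s*", note.lower()):
              for term in ADR_DICTIONARY:
                  if term in sentence and not any(neg + term in sentence for neg in NEGATIONS):
                      findings.append(term)
          return sorted(set(findings))

      note = "Patient reports nausea and a mild rash after starting amoxicillin. Denies headache."
      print(extract_adrs(note))  # ['nausea', 'rash']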
  11. Cureus. 2026 Jan;18(1): e101858
      Background: Artificial intelligence (AI) tools such as ChatGPT are increasingly being explored for clinical decision support, yet their role in geriatric medicine remains uncertain due to the complexity of multimorbidity and care planning. This study aimed to evaluate the clinical accuracy, completeness, and guideline alignment of ChatGPT's responses to common geriatric scenarios using standardized vignettes.
    Methodology: Seven standardized vignettes representing common geriatric scenarios, namely, polypharmacy, falls, dementia, delirium, frailty, advance care planning, and urinary incontinence, were submitted to ChatGPT (GPT-5). Responses were evaluated by five independent consultant geriatricians using a standardized rubric across the following five domains: accuracy, completeness, guideline alignment, safety, and clarity (0-2 score per domain). Descriptive statistics summarized performance, and qualitative feedback was thematically analyzed. Inter-rater reliability was assessed using Krippendorff's alpha.
    Results: ChatGPT scored the highest in clarity (66/70) and safety (63/70), with slightly lower performance in accuracy (59/70) and completeness (55/70). Guideline alignment was generally strong (61/70). Advance care planning received the highest domain scores; urinary incontinence scored the lowest. Krippendorff's alpha showed high inter-rater agreement (0.969). Reviewers identified key omissions, such as missing assessments or guideline-recommended tools, in multiple vignettes.
    Conclusions: ChatGPT showed potential as a supportive tool in geriatric care, offering clear and generally safe responses aligned with guidelines. However, it lacked clinical depth and missed key elements in complex scenarios. AI tools such as ChatGPT should be used with caution, under expert oversight, and not as standalone decision makers in clinical practice.
    Keywords:  artificial intelligence in medicine; chatgpt; geriatric medicine; geriatric syndromes; large language model
    DOI:  https://doi.org/10.7759/cureus.101858
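    The inter-rater statistic reported above can be reproduced on toy data with the third-party krippendorff package (an assumption; any implementation of the coefficient would do):
      # Sketch of a Krippendorff's alpha check on invented 0-2 rubric scores,
      # one row per rater; relies on the third-party `krippendorff` package.
      import krippendorff

      # Rows = 5 raters, columns = rubric items scored 0-2 (hypothetical data).
      ratings = [
          [2, 2, 1, 2, 0, 1, 2],
          [2, 2, 1, 2, 0, 1, 2],
          [2, 1, 1, 2, 0, 1, 2],
          [2, 2, 1, 2, 1, 1, 2],
          [2, 2, 1, 2, 0, 2, 2],
      ]

      alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
      print(f"Krippendorff's alpha: {alpha:.3f}")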
  12. J Am Med Inform Assoc. 2026 Feb 04. pii: ocag015. [Epub ahead of print]
       OBJECTIVES: To develop and evaluate a human-LLM (Large Language Model) collaborative approach for systematic ontology updating, demonstrated with the Dietary Lifestyle Ontology (DILON).
    MATERIALS AND METHODS: One hundred dietary questionnaire items from English and Korean sources were semantically annotated by 4 state-of-the-art language models, which generated candidate concepts for inclusion in DILON. Outputs were refined through cross-model reconciliation, followed by expert review. The models curated the candidate concepts within DILON, and experts reviewed and refined the outputs in Protégé to ensure accuracy and consistency.
    RESULTS: Claude Sonnet 4 effectively supported local tasks, including harvesting new concepts, detecting redundancies, and refining hierarchical segments. Global optimization of the ontology, however, required systematic examination by human experts.
    DISCUSSION: These findings highlight the complementary strengths of LLMs and humans: LLMs accelerate repetitive and local updates, whereas humans maintain overall structural integrity.
    CONCLUSION: Human-LLM collaboration improves efficiency, scalability, and sustainability in ontology engineering, supporting the maintenance of complex biomedical ontologies.
    Keywords:  Dietary Lifestyle Ontology (DILON); Large Language Model (LLM); Person-Generated Health Data (PGHD); human-AI collaboration; ontology engineering
    DOI:  https://doi.org/10.1093/jamia/ocag015
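    The cross-model reconciliation step described in the methods could, in its simplest form, be a majority vote over candidate concepts, as in this toy sketch with invented concept names and model outputs:
      # Toy "cross-model reconciliation": keep concepts proposed by a majority of
      # models, flag the rest for expert review. All names below are placeholders.
      from collections import Counter

      model_outputs = {
          "model_a": {"intermittent fasting", "plant-based diet", "sodium intake"},
          "model_b": {"intermittent fasting", "plant-based diet", "meal timing"},
          "model_c": {"plant-based diet", "sodium intake", "ketogenic diet"},
      }

      votes = Counter(concept for concepts in model_outputs.values() for concept in concepts)
      threshold = len(model_outputs) // 2 + 1  # simple majority of models

      accepted = sorted(c for c, n in votes.items() if n >= threshold)
      flagged = sorted(c for c, n in votes.items() if n < threshold)

      print("Auto-accepted for curation:", accepted)
      print("Flagged for expert review:", flagged)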
  13. J Am Med Inform Assoc. 2026 Jan 28. pii: ocag002. [Epub ahead of print]
       OBJECTIVES: Research on artificial intelligence (AI)-based clinical decision-support (AI-CDS) systems has returned mixed results. Sometimes providing AI-CDS to a clinician will improve decision-making performance, sometimes it will not, and it is not always clear why. This scoping review seeks to clarify existing evidence by identifying clinician-level and technology design factors that impact the effectiveness of AI-assisted decision-making in medicine.
    MATERIALS AND METHODS: We searched MEDLINE, Web of Science, and Embase for peer-reviewed papers that studied factors impacting the effectiveness of AI-CDS. We identified the factors studied and their impact on 3 outcomes: clinicians' attitudes toward AI, their decisions (eg, acceptance rate of AI recommendations), and their performance when utilizing AI-CDS.
    RESULTS: We retrieved 5850 articles and included 45. Four clinician-level and technology design factors were commonly studied. Expert clinicians may benefit less from AI-CDS than nonexperts, with some mixed results. Explainable AI increased clinicians' trust, but could also increase trust in incorrect AI recommendations, potentially harming human-AI collaborative performance. Clinicians' baseline attitudes toward AI predict their acceptance rates of AI recommendations. Of the 3 outcomes of interest, human-AI collaborative performance was most commonly assessed.
    DISCUSSION AND CONCLUSION: Few factors have been studied for their impact on the effectiveness of AI-CDS. Due to conflicting outcomes between studies, we recommend future work should leverage the concept of "appropriate trust" to facilitate more robust research on AI-CDS, aiming not to increase overall trust in or acceptance of AI but to ensure that clinicians accept AI recommendations only when trust in AI is warranted.
    Keywords:  artificial intelligence; human–computer interaction; medical decision-making; review
    DOI:  https://doi.org/10.1093/jamia/ocag002