bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-02-15
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Syst Rev. 2026 Feb 10.
      Evidence synthesis (ES) involves rigorous, reproducible methodologies and is increasingly presented in the form of 'living' systematic reviews. As such, ES is critical to evidence-informed decision-making processes, such as the development, implementation, evaluation and monitoring of health technology assessments, practice guidelines and policies. However, the ES process is time-intensive, typically requiring months or years and extensive manual effort. Technological advancements, particularly artificial intelligence (AI), offer opportunities to automate various ES steps, potentially increasing efficiency and reducing costs. AI tools and platforms, including large language models (LLMs), facilitate faster ES through advanced natural language processing (NLP) capabilities. Despite their potential, AI tools have limitations, including risks of automation bias and a lack of true semantic understanding, and they require careful evaluation to ensure trustworthiness. We conducted the first scoping review to update and map all data science tools, including LLMs, that are being developed and/or deployed to optimise ES steps, and to assess their impact in both low- and middle-income countries (LMICs) and high-income countries (HICs). Our scoping review identified 137 studies and 388 such AI tools and platforms, responding to the World Health Organization's call for safe and ethical AI in health and documenting the current landscape to identify barriers and facilitators to equitable and sustainable access for glocal researchers. We further outline three recommendations: (1) promote collaborative AI platforms that ensure equitable access, including for the gap regions identified (Latin America, Africa, the Middle East); (2) establish evaluation standards for methods testing and reporting; and (3) emphasise human input and multidisciplinary capacity building when developing and implementing AI tools in ES.
    DOI:  https://doi.org/10.1186/s13643-025-02842-y
  2. PLoS One. 2026;21(2): e0342895
       BACKGROUND: Network meta-analysis (NMA) can compare several interventions at once by combining head-to-head and indirect trial evidence. However, identifying eligible studies, extracting their data, and fitting the models often takes months, delaying evidence updates in many therapeutic areas.
    OBJECTIVE: To develop and validate MetaMind, an end-to-end, transformer-driven framework that automates NMA processes, including study retrieval, structured data extraction, and meta-analysis execution, while minimizing human input.
    METHODS: MetaMind integrates Promptriever, a fine-tuned retrieval model, to semantically retrieve high-impact clinical trials from PubMed; a multi-agent Mixture of Agents (MoA) LLM pipeline to extract PICO-structured (Population, Intervention, Comparison, Outcome) endpoints; and GPT-4o-generated Python and R scripts to perform Bayesian random-effects NMA and other NMA designs within a unified workflow. Validation was conducted by comparing MetaMind's outputs against manually performed NMAs in ulcerative colitis (UC) and Crohn's disease (CD).
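    A minimal sketch of the embedding-based retrieval step described above, using the off-the-shelf SentenceTransformer baseline that the results compare against (Promptriever itself is a fine-tuned model and is not reproduced here); the model name, query, and candidate abstracts are illustrative assumptions:

        from sentence_transformers import SentenceTransformer, util

        # Off-the-shelf baseline encoder; Promptriever is a fine-tuned retriever not shown here.
        model = SentenceTransformer("all-MiniLM-L6-v2")

        # Hypothetical query and candidate PubMed abstracts, for illustration only.
        query = "Randomised trials of biologics inducing clinical remission in ulcerative colitis"
        abstracts = [
            "A phase 3 randomised trial of an anti-TNF agent in moderate-to-severe ulcerative colitis...",
            "A retrospective cohort study of surgical outcomes in Crohn's disease...",
        ]

        # Embed the query and candidates, then rank candidates by cosine similarity.
        query_emb = model.encode(query, convert_to_tensor=True)
        doc_embs = model.encode(abstracts, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, doc_embs)[0]   # one similarity score per abstract
        ranked = scores.argsort(descending=True)        # candidate indices, most similar first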
    RESULTS: Promptriever outperformed the baseline SentenceTransformer, with higher similarity scores (0.7403 vs. 0.7049 for UC; 0.7142 vs. 0.7049 for CD) and narrower relevance ranges. Compared with a previously published NMA, Promptriever achieved 82.1% recall, 91.1% precision, and an F1 score of 86.4%. MetaMind achieved 100% accuracy in PICO element extraction on a limited set of remission endpoints and produced comparative effect estimates and credible intervals closely matching the manual analyses.
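    The recall, precision, and F1 figures above compare the studies retrieved by the pipeline against those included in a previously published NMA; a minimal sketch of that calculation, with hypothetical study identifiers:

        def retrieval_metrics(retrieved: set, reference: set) -> tuple:
            # Precision, recall, and F1 of retrieved studies against a reference review's included studies.
            true_pos = len(retrieved & reference)
            precision = true_pos / len(retrieved) if retrieved else 0.0
            recall = true_pos / len(reference) if reference else 0.0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
            return precision, recall, f1

        # Hypothetical PubMed IDs for illustration.
        retrieved = {"PMID1", "PMID2", "PMID3", "PMID4"}
        reference = {"PMID1", "PMID2", "PMID5"}
        print(retrieval_metrics(retrieved, reference))  # precision 0.50, recall 0.67, F1 0.57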
    CONCLUSIONS: In our validation studies, MetaMind reduced the end-to-end NMA process to less than a week, compared with the several months typically needed for manual workflows, while preserving statistical rigor. This suggests its potential for future scaling of evidence synthesis to additional therapeutic areas.
    DOI:  https://doi.org/10.1371/journal.pone.0342895
  3. JMIR Form Res. 2026 Feb 12. 10: e69707
       Background: Annotated bibliographies summarize literature, but training, experience, and time are needed to create concise yet accurate annotations. Summaries generated by artificial intelligence (AI) can save human resources, but AI-generated content can also contain serious errors.
    Objective: To determine the feasibility of using AI as an alternative to human annotators, we explored whether ChatGPT can generate annotations with characteristics that are comparable to those written by humans.
    Methods: We had 2 humans and 3 versions of ChatGPT (3.5, 4, and 5) independently write annotations on the same set of 15 publications. We collected data on word count and Flesch Reading Ease (FRE). In this study, 2 assessors who were masked to the source of the annotations independently evaluated (1) capture of main points, (2) presence of errors, and (3) whether the annotation included a discussion of both the quality and context of the article within the broader literature. We evaluated agreement and disagreement between the assessors and used descriptive statistics and assessor-stratified binary and cumulative mixed-effects logit models to compare annotations written by ChatGPT and humans.
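    Flesch Reading Ease (FRE), the readability measure used above, is a fixed formula over word, sentence, and syllable counts; a minimal sketch with hypothetical counts (syllable counting itself would come from a separate library or manual count):

        def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
            # Higher FRE scores indicate text that is easier to read.
            return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

        # Illustrative counts for a short annotation (hypothetical values).
        print(round(flesch_reading_ease(words=90, sentences=4, syllables=170), 1))  # about 24.2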
    Results: On average, humans wrote shorter annotations (mean 90.2, SD 36.8 words) than ChatGPT (mean 113, SD 16 words), and the human annotations were easier to read (human FRE score: mean 15.3, SD 12.4; ChatGPT FRE score: mean 5.76, SD 7.32). Our assessments of agreement and disagreement revealed that one assessor was consistently stricter than the other. However, assessor-stratified models of main points, errors, and quality/context led to similar qualitative conclusions. There was no statistically significant difference in the odds of presenting a better summary of main points between ChatGPT- and human-generated annotations for either assessor (Assessor 1: OR 0.96, 95% CI 0.12-7.71; Assessor 2: OR 1.64, 95% CI 0.67-4.06). However, both assessors observed that human annotations had lower odds of containing one or more types of errors compared with ChatGPT (Assessor 1: OR 0.31, 95% CI 0.09-1.02; Assessor 2: OR 0.10, 95% CI 0.03-0.33). On the other hand, human annotations also had lower odds of summarizing the paper's quality and context compared with ChatGPT (Assessor 1: OR 0.11, 95% CI 0.03-0.33; Assessor 2: OR 0.03, 95% CI 0.01-0.10). That said, ChatGPT's summaries of quality and context were sometimes inaccurate.
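    The odds ratios and confidence intervals above come from assessor-stratified mixed-effects logit models; the sketch below shows only how an odds ratio and approximate 95% CI for human versus ChatGPT annotations fall out of a plain binary logit, using entirely hypothetical data rather than the study's dataset:

        import numpy as np
        import statsmodels.api as sm

        # Hypothetical data: source (1 = human-written, 0 = ChatGPT-written) and
        # whether the annotation contained at least one error (1 = yes).
        source = np.array([1] * 30 + [0] * 45)
        error = np.array([0] * 25 + [1] * 5 + [0] * 25 + [1] * 20)

        fit = sm.Logit(error, sm.add_constant(source)).fit(disp=0)
        beta, se = fit.params[1], fit.bse[1]
        odds_ratio = np.exp(beta)                          # OR for human vs ChatGPT
        ci = np.exp([beta - 1.96 * se, beta + 1.96 * se])  # approximate 95% CI
        print(f"OR {odds_ratio:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}")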
    Conclusions: Rapidly learning a body of scientific literature is a vital yet daunting task that may be made more efficient by AI tools. In our study, ChatGPT quickly generated concise summaries of academic literature and also provided quality and context more consistently than humans. However, ChatGPT's discussion of quality and context was not always accurate, and ChatGPT annotations included more errors. Annotated bibliographies that are AI-generated and carefully verified by humans may thus be an efficient way to provide a rapid overview of the literature. More research is needed to determine the extent to which prompt engineering can reduce errors and improve chatbot performance.
    Keywords:  ChatGPT; annotated bibliography; artificial intelligence; evidence synthesis; information management; large language model
    DOI:  https://doi.org/10.2196/69707
  4. Clin Lab. 2026 Feb 01. 72(2)
       BACKGROUND: Gestational diabetes mellitus (GDM) affects millions of people worldwide. Patients often turn to the internet and artificial intelligence (AI)-based conversational models for information. The CLEAR tool evaluates the quality of health-related content produced by AI-based models. This study assessed the responses provided by medical guidelines, ChatGPT, and Google Bard to the ten most frequently asked online questions about GDM, utilizing the CLEAR tool for evaluation.
    METHODS: The most common online questions about GDM were identified using Google Trends, and the top 10 questions were selected. Answers were then gathered from two experienced physicians, ChatGPT-4o mini, and Google Bard, with responses categorized into 'Guide,' 'ChatGPT,' and 'Bard' groups. Answers from the AI models were obtained using two computers and two separate sessions to ensure consistency and minimize bias.
    RESULTS: ChatGPT received higher scores than the medical guidelines, while Bard scored lower than ChatGPT. The medical guidelines provided more accessible answers for the general audience, while ChatGPT and Bard required higher literacy levels. Good reliability (0.781) was observed between the two reviewers. Regarding readability, the medical guidelines were the easiest to read, while Bard provided the most challenging text.
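    The abstract reports inter-reviewer reliability of 0.781 without naming the statistic; the sketch below assumes a weighted Cohen's kappa on ordinal CLEAR item scores, purely to illustrate how such agreement can be computed (the ratings are hypothetical):

        from sklearn.metrics import cohen_kappa_score

        # Hypothetical 1-5 CLEAR item scores from two reviewers for the same set of answers.
        reviewer_1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
        reviewer_2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

        # Quadratic weights penalise larger disagreements more heavily on an ordinal scale.
        kappa = cohen_kappa_score(reviewer_1, reviewer_2, weights="quadratic")
        print(round(kappa, 3))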
    CONCLUSIONS: ChatGPT and Google Bard perform well in content completeness and relevance but face challenges in readability and misinformation. Future research should improve accuracy and readability, integrate AI with peer-reviewed sources, and ensure healthcare professionals guide patients to reliable AI information.
    DOI:  https://doi.org/10.7754/Clin.Lab.2025.250544
  5. JMA J. 2026 Jan 15. 9(1): 369-371
      Generative artificial intelligence (GenAI) is now widely used in medicine, including medical writing. Its merits and demerits have been discussed; however, such discussion has not been grounded in evidence-based medicine (EBM). Here, I focus primarily on GenAI use in medical writing, illustrating how it has already spread before its safety, especially its long-term safety, has been confirmed by EBM, and I make several modest proposals. Assuming GenAI is a new drug, its use has not yet cleared even the first step of a phase I trial; assuming it is a new procedure, it remains at the "experience" or "case report" stage. EBM requires the completion of phase I-III trials and randomized controlled trials or meta-analyses before any drug or procedure is confirmed safe and effective. Emergency use of unproven interventions can be justified for life-threatening medical conditions; no such emergency applies to "writing." Nevertheless, the publication world has already gone far beyond this point: GenAI use is already considerable in medical publication. I therefore make three propositions. First, we must recognize that the use of GenAI for writing operates outside the usual EBM framework. Second, we should conduct trials, even if they are difficult and time-consuming, to evaluate the safety and effectiveness of GenAI in writing. Third, we should use GenAI in writing only modestly until its safety is confirmed. What is true often becomes evident only long afterwards, and thus I believe that we should take a cautious stance toward GenAI use in writing; how cautious we should be deserves wide discussion. This viewpoint may contribute to the discussion of GenAI use more generally, beyond medical writing.
    Keywords:  ChatGPT; artificial intelligence; evidence-based medicine; future; regulation
    DOI:  https://doi.org/10.31662/jmaj.2025-0443
  6. Interdiscip Cardiovasc Thorac Surg. 2026 Feb 06. pii: ivag038. [Epub ahead of print]
       OBJECTIVES: Large language models (LLMs) are generative AI systems that produce text output resembling human conversation. We aimed to assess the ability of LLMs to answer patients' questions and to benchmark their output against a Best Evidence Topic (BET).
    METHODS: We asked LLMs whether robot-assisted thoracic surgery (RATS) or video-assisted thoracoscopic surgery (VATS) lobectomy had better perioperative outcomes for postoperative pain, length of hospital stay (LOS) and mortality. A BET was constructed for the same questions according to a structured protocol. An initial search yielded 324 papers, of which 12 represented the best evidence.
    RESULTS: LLM outputs were almost instantaneous, whereas constructing the BET took many hours of database searching for relevant evidence. However, current iterations and models of LLMs did not provide relevant outputs, suffered from hallucinations, and could be restricted by copyright and paywall issues. The BET, on the other hand, was tailored to the scenario through specialist human oversight and was therefore more reliable and nuanced.
    CONCLUSIONS: There were no major differences between RATS and VATS lobectomy for T1cN0M0 NSCLC apart from shorter LOS following RATS. Current LLMs may not be entirely reliable for answering clinical questions. An LLM-BET protocol could be used as a standardised process to compare LLM outputs for different clinical scenarios, each benchmarked with a BET. It can also be used to analyse outputs of different models of current and future LLMs.
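    A minimal sketch of the kind of record a standardised LLM-BET protocol might pair for each clinical scenario; all class and field names are illustrative assumptions rather than the authors' protocol:

        from dataclasses import dataclass, field

        @dataclass
        class BETBenchmark:
            clinical_question: str        # e.g. RATS vs VATS lobectomy: pain, LOS, mortality
            bet_conclusion: str           # conclusion reached by the structured BET
            papers_screened: int          # 324 in this study
            best_evidence_papers: int     # 12 in this study

        @dataclass
        class LLMAnswer:
            model_name: str               # e.g. "ChatGPT", "Gemini", "Grok", "Microsoft Copilot"
            answer_text: str
            cited_sources: list = field(default_factory=list)

        def agrees_with_bet(benchmark: BETBenchmark, answer: LLMAnswer, key_phrases: list) -> bool:
            # Crude first-pass check: does the LLM answer mention the BET's key finding?
            text = answer.answer_text.lower()
            return any(phrase.lower() in text for phrase in key_phrases)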
    Keywords:  NSCLC (non-small cell lung cancer); RATS (robotic-assisted thoracoscopic surgery); VATS (video-assisted thoracoscopic surgery); ChatGPT; Gemini; Grok; Microsoft Copilot
    DOI:  https://doi.org/10.1093/icvts/ivag038
  7. Ann Med Surg (Lond). 2026 Feb;88(2): 2196-2197
      The expanding volume and heterogeneity of post-marketing drug-safety data have exposed the limitations of conventional pharmacovigilance systems. Artificial intelligence (AI) offers scalable solutions for detecting adverse drug reactions by integrating electronic health records, spontaneous reporting systems, and patient-generated data. Recent advances in natural language processing and deep learning enable earlier identification of rare and complex safety signals while reducing reporting delays. Nevertheless, issues related to algorithmic transparency, bias, and regulatory oversight remain critical. This letter highlights emerging applications, current challenges, and future directions for responsible AI integration in pharmacovigilance.
    Keywords:  adverse drug reactions; artificial intelligence; pharmacovigilance; real-world evidence; signal detection
    DOI:  https://doi.org/10.1097/MS9.0000000000004699