bims-arines 2026-04-05 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–04–05
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Artificial Intelligence Tools for Automating Evidence Synthesis: Scoping Review.
Large language models for risk-of-bias assessment in randomised clinical trials-a comparative validation study.
Large language models for thematic analysis in healthcare research: A blinded mixed-methods comparison with human analysts.
Improving few-shot named entity recognition for large language models using structured dynamic prompting with retrieval augmented generation.
A critical evaluation of generative query expansion on biomedical literature retrieval.
Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing.
Structured taxonomy and framework for developing medical benchmark in large language models derived from scoping review.
Green prompt engineering for sustainable generative AI.
Hallucinated citations are polluting the scientific literature. What can be done?
Fabricated citations: the quiet crisis of AI hallucinations in scientific publishing.

J Med Internet Res. 2026 Mar 30. 28 e81597

Artificial Intelligence Tools for Automating Evidence Synthesis: Scoping Review.

Sashika Harasgama, Helen Pearce, Cameron Appel, Liam Loftus, Helena Painter, Isla Kuhn, Justine Karpusheff, Aji Ceesay, John Ford.

   Background: Rapidly and accurately synthesizing large volumes of evidence is a time- and resource-intensive process. Once published, reviews often risk becoming outdated, limiting their usefulness for decision makers. Recent advancements in artificial intelligence (AI) have enabled researchers to automate stages of the evidence synthesis process, from literature searching and screening to data extraction and analysis. As previous reviews on this topic have been published, a significant number of tools have been further developed and evaluated. Furthermore, as generative AI increasingly automates evidence synthesis, understanding how it is studied and applied is crucial, given both its benefits and risks.
Objective: This review aimed to map the current landscape of evaluated AI tools used to automate evidence synthesis.
Methods: Following the Joanna Briggs Institute methodology for scoping reviews, we searched Ovid MEDLINE, Ovid Embase, Scopus, and Web of Science in February 2025 and conducted a gray literature search in April 2025. We included articles published in any language from January 2021 onward. Two reviewers independently screened citations using Rayyan, and data were extracted based on study design and key AI-related technical features.
Results: We identified 7841 unique citations through database searches and 19 records through gray literature searching. A total of 222 articles were included in the review. We identified 65 AI tools and 25 open-source models or machine learning (ML) algorithms that automate parts of or the whole evidence synthesis pathway. A total of 54.1% (n=120) of the studies were published in 2024, reflecting a trend toward researching general-purpose large language models (LLMs) for evidence synthesis automation. The most popular tool studied was generative pretrained transformer models, including its conversational interface ChatGPT (n=70, 31.5%). Moreover, 31.1% (n=69) studied tools automated by traditional ML algorithms. No studies compared traditional ML tools to LLM-based tools. In addition, 61.7% (n=137) and 26.1% (n=58) studied AI-assisted automation of title and abstract screening and data extraction, respectively, the 2 most intensive stages and, therefore, amenable to automation. Technical performance outcomes were the most frequently reported, with only 4.1% (n=9) of studies reporting time- or workload-specific outcomes. Few studies pragmatically evaluated AI tools in real-world evidence synthesis settings.
Conclusions: This review comprehensively captures the broad, evolving suite of AI automation tools available to support evidence synthesis, leveraged by increasingly complex AI approaches that range from traditional ML to LLMs. The notable shift toward studying general-purpose generative AI tools reflects how these technologies are actively transforming evidence synthesis practice. The lack of studies in our review comparing different AI approaches for specific automation stages or evaluating their effectiveness pragmatically represents a significant research gap. Optimal tool selection will likely depend on the review topic and methodology and researcher priorities. While they offer potential for reducing workload, ongoing evaluation to mitigate AI bias and to ensure the integrity of reviews is essential for safeguarding evidence-based decision-making.

Keywords:  ChatGPT; artificial intelligence; automation; evidence synthesis; large language models; machine learning; systematic reviews as a topic

DOI:  https://doi.org/10.2196/81597
EBioMedicine. 2026 Mar 28. pii: S2352-3964(26)00120-9. [Epub ahead of print]126 106238

Large language models for risk-of-bias assessment in randomised clinical trials-a comparative validation study.

Lauri Nyrhi, Ville Ponkilainen, Juho Laaksonen, Lauri Kuikka, Lauri Paljakka, Teemu Karjalainen, Ville M Mattila, Ilari Kuitunen.

   BACKGROUND: Large language models (LLMs) are emerging tools for evidence synthesis. Risk of bias (RoB) assessment of trials remains an essential but time-consuming step inconsistent even amongst experts. Early LLM studies showed mixed reliability. Advances in reasoning-enabled models warrant evaluation of their accuracy and consistency for RoB screening across randomised trials to reduce reviewer workload.
METHODS: We conducted a preregistered comparative validation study (March 11-May 19, 2025) of four LLMs-ChatGPT o3, DeepSeek v3, Google Gemini Flash 2.0, and Grok 3-prompted with full-text randomised clinical trial articles and protocols. Two corpora were analysed: 100 RCTs from recent Cochrane reviews (RoB 1) and 100 RCTs from meta-analyses in high-impact journals (RoB 2). The reference standard was published human RoB judgements. The primary outcome was interobserver reliability (Cohen κ, 95% CI); secondary outcomes were intraobserver agreement and diagnostic accuracy (sensitivity, specificity, predictive values, F1-score).
FINDINGS: For RoB 1, interobserver agreement ranged from κ 0.0.27 (95% CI 0.07-0.46) with Gemini Flash 2.0 to κ 0.39 (0.20-0.59) with DeepSeek v3. For RoB 2, agreement was lower, from κ 0.06 (-0.07 to 0.18) with ChatGPT o3 to κ 0.13 (-0.04 to 0.31) with Gemini. Diagnostic performance was limited with sensitivity ranging 0.05-0.55, specificity 0.78-0.99, PPV 0.31-0.50, and NPV 0.48-0.61 across models, with models consistently over-flagging concerns.
INTERPRETATION: None of the evaluated LLMs were sufficiently reliable for fully autonomous RoB assessment. DeepSeek v3 and ChatGPT o3 approximated human performance best on RoB 1, but RoB 2 rule-in and rule-out performance remained modest. Current use should be supervised, with possible application of LLMs for triage or as a second assessor. Major improvements in protocol retrieval, task-specific tuning, and calibrated thresholds, prospectively validated, are needed for safe stand-alone deployment.
FUNDING: This study received no financial support.

Keywords:  Artificial intelligence; Large language model; Methodology; Risk of bias

DOI:  https://doi.org/10.1016/j.ebiom.2026.106238
PLOS Digit Health. 2026 Apr;5(4): e0001189

Large language models for thematic analysis in healthcare research: A blinded mixed-methods comparison with human analysts.

Callum Hill, Arun Dahil, Glenn Simpson, David Hardisty, Jacob Keast, Cameron Kumar Pinn, Hajira Dambha-Miller.

Large language models (LLMs) are increasingly used for qualitative thematic analysis, yet evidence on their performance in analysing focus-group data, where polyvocality and context complicate coding, remains limited. Given the increasing role of such models in thematic analysis, there is a need for methodological frameworks that enable systematic, metric-based comparisons between human and model-based analyses. We conducted a blinded mixed-methods comparison of two general-purpose LLMs (ChatGPT-5 and Claude 4 Sonnet), an LLM-based qualitative coding application (QualiGPT), and blinded human analysts on an in-person focus-group transcript informing an AI-enabled digital health proposal. We evaluated deductive coding using a 10-code, 6-theme codebook against an expert consensus adjudication; inductive coding with a structured Likert-scale comparison to a reference-standard set of inductive themes generated by expert consensus; and manual quote verification of LLM segments to define LLM hallucination (evidence absent or non-supportive) and error rate (including partial matches and speaker-coded segments). During deductive coding against an expert consensus adjudication, large language models yielded a mean agreement of 93.5% (95% CI 92.5-94.5) with κ = 0.34 (95% CI 0.26-0.40); blinded human coders achieved 92.7% (95% CI 91.6-93.9) agreement with κ = 0.34 (95% CI 0.26-0.41). Mean Gwet's AC1 was 0.92 (95% CI 0.90-0.93) for the blinded human analysis, and 0.93 (95% CI 0.92-0.94) for the LLM-assisted deductive analysis, reflecting high agreement despite the low overall code prevalence (7.8%, SD = 3.2%). Only one model achieved non-inferiority in inductive analysis of the transcript (p = 0.043). The strict hallucination rate in inductive analysis was 1.2% (SD = 2.1%). LLMs were non-inferior to human analysts for deductive coding of the focus-group data, with variable performance in inductive analysis. Low hallucination but significant comprehensive error rates indicate that LLMs can augment qualitative analysis but require human verification.

DOI: https://doi.org/10.1371/journal.pdig.0001189
NPJ Artif Intell. 2026 ;2(1): 39

Improving few-shot named entity recognition for large language models using structured dynamic prompting with retrieval augmented generation.

Yao Ge, Yuting Guo, Sudeshna Das, Abeed Sarker.

  Biomedical named entity recognition (NER) is a high-utility natural language processing task, and large language models (LLMs) show promise in few-shot settings. In this article, we address performance challenges for few-shot biomedical NER by investigating innovative prompting strategies involving retrieval-augmented generation. Using five biomedical NER datasets, we implemented and evaluated a systematically-structured multi-component static prompt and a dynamic prompt engineering technique, where the prompt is dynamically updated via retrieval with most relevant in-context examples based on the input texts. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further boosted performance and was evaluated on GPT-4, LLaMA 3-70B, and the recently released open-weight GPT-OSS-120B model, with TF-IDF based retrieval yielding the best results, improving average F1-scores by 8.8% and 6.3% in 5-shot and 10-shot settings, respectively. An ablation study on retrieval pool size demonstrated that strong performance can be achieved with relatively small number of annotated samples, reinforcing the annotation efficiency and scalability of our framework in real-world settings.

Keywords:  Computational biology and bioinformatics; Health care; Mathematics and computing

DOI:  https://doi.org/10.1038/s44387-025-00062-2
J Am Med Inform Assoc. 2026 Apr 01. pii: ocag037. [Epub ahead of print]

A critical evaluation of generative query expansion on biomedical literature retrieval.

Yilu Fang, Gongbo Zhang, Fangyi Chen, Yifan Peng, Chunhua Weng.

   OBJECTIVE: To evaluate the effectiveness of generative query expansion for biomedical literature retrieval.
MATERIALS AND METHODS: We thoroughly examined eight generative query expansion methods using three large language models across five datasets for biomedical literature retrieval. We further performed a quantitative analysis, including performance comparisons, rank transition analysis, and article-type effect analysis. We also conducted a qualitative examination of representative cases, from which we derived an error taxonomy.
RESULTS: On BioASQ-Y/N, GPT-4o-based query expansion shifts Recall@10 to 0.417-0.512 and nDCG@10 to 0.358-0.479, relative to a baseline of 0.491 and 0.456. For PubMedQA, Precision@1 ranges from 0.764 to 0.876 and nDCG@10 from 0.847 to 0.931, compared with baseline values of 0.893 and 0.935. For 2019-Trec-PM, query expansion yields Recall@100 of 0.217-0.256 and nDCG@100 of 0.272-0.312, versus a baseline of 0.227 and 0.274. Similarly, for 2018-TREC-PM, Recall@100 spans 0.169-0.227 and nDCG@100 spans 0.195-0.250, relative to baseline scores of 0.164 and 0.191. For 2017-TREC-PM, Recall@100 and nDCG@100 fall within 0.111-0.139 and 0.154-0.191 under query expansion, compared with baseline metrics of 0.102 and 0.147. Both general-purpose and domain-specific Llama-based models demonstrate similar performance to GPT-4o.
DISCUSSION AND CONCLUSION: The impact of query expansion varies significantly by the expansion methods and type of evidence, but is relatively agnostic to backbone model choice. Notably, query expansion primarily affects article ranking but has a limited impact on the screening stage. Our findings underscore the unique challenges of biomedical literature retrieval and highlight the need to develop domain-specific information retrieval techniques.

Keywords:  Biomedical Literature Retrieval; Large Language Models; Query Expansion

DOI:  https://doi.org/10.1093/jamia/ocag037
J Am Med Inform Assoc. 2026 Mar 30. pii: ocag039. [Epub ahead of print]

Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing.

Paul Windisch, Carole Koechli, Fabio Dennstädt, Daniel M Aebersold, Daniel R Zwahlen, Robert Förster, Christina Schröder.

   BACKGROUND: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for trial-success classification across temperature and reasoning/thinking settings and determine whether single-run reporting suffices.
MATERIALS AND METHODS: We utilized 250 trial abstracts labeled based on primary endpoint success. We evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0 and GPT-5.2 across reasoning-effort levels (none to x-high) with an additional temperature sweep when reasoning was disabled. Each setting was run 3 times.
RESULTS: Reproducibility was high for Gemini (κ = 0.942-1.000; invalid outputs 0%-1.5%) and GPT-5.2 (κ = 0.984-0.995; no invalid outputs). F1 remained stable (mean/majority vote 0.955-0.971), with marginal gains from majority voting.
CONCLUSION: For binary biomedical classification with tightly constrained outputs, both models were reproducible across decoding and reasoning settings, suggesting single runs are often sufficient, with minimal replication as a practical stability check.

Keywords:  large language models; natural language processing; reasoning models; reproducibility; temperature

DOI:  https://doi.org/10.1093/jamia/ocag039
NPJ Digit Med. 2026 Mar 31.

Structured taxonomy and framework for developing medical benchmark in large language models derived from scoping review.

Junbok Lee, Jaeyong Shin, Belong Cho.

With the rapid advancement of large language model technology, numerous studies have explored its application in the medical field. Robust evaluation is crucial for ensuring reliability and safety, leading to the development of diverse benchmark datasets. In this study, we propose a structured taxonomy to provide researchers with practical guidance for benchmark selection. Furthermore, we introduce READY, a development framework built on five principles - Reliable, Ethical, Annotated, Diverse, Yield-validated - to support the systematic design of medical benchmarks and strengthen future evaluation practices. To establish the taxonomy and framework, we systematically reviewed benchmark datasets designed for evaluating LLMs in medical context. A comprehensive literature search yielded 55 relevant studies. Each benchmark was analyzed using a structured framework encompassing the dataset construction and evaluation methodology. To assess the applicability of the proposed framework, five domain experts independently applied the READY framework to benchmark studies, demonstrating consistent inter-rater agreement. We anticipate that this research will promote more rigorous and ethical LLM evaluation, paving the way for the safe application of LLMs in clinical settings.

DOI: https://doi.org/10.1038/s41746-026-02567-9
Environ Sci Ecotechnol. 2026 Mar;30 100684

Green prompt engineering for sustainable generative AI.

Sanjay Podder, Hema Date, Shankar Murthy.

  Prompt engineering involves manual design and optimization of text-based instructions or queries, enabling precise control over outputs generated by pre-trained large language models (LLMs) and ensuring alignment with desired responses. However, substantial computational costs and energy footprint of prompt inferencing process remain critical challenges while building generative AI applications. The energy efficiency of LLM inferences is particularly impacted by suboptimal prompts, which may require multiple iterations, thereby escalating energy consumption and the associated carbon footprint. To address these challenges, we propose a series of practices and guidelines designed to enhance the likelihood of obtaining desired responses from LLMs with minimal reiterations. Empirical evaluation demonstrates that, across a range of LLMs and test scenarios, energy consumption and corresponding operational greenhouse gas emissions were reduced by 32-48% when best practices were applied. Drawing upon these insights, our proposed best practices can be seamlessly integrated into the design frameworks of generative AI applications, thereby enhancing the energy efficiency of prompt inferencing. By addressing the challenge of establishing a cohesive framework for energy-efficient prompt design and inferencing, this paper advocates for the sustainable and effective deployment of generative AI technologies.

Keywords:  Generative AI; Green AI; Green computing; Greenhouse gases; Inference mechanisms; Prompt engineering

DOI:  https://doi.org/10.1016/j.ese.2026.100684
Nature. 2026 Apr;652(8108): 26-29

Hallucinated citations are polluting the scientific literature. What can be done?

Miryam Naddaf, Elizabeth Quill.



Keywords:  Ethics; Machine learning; Publishing; Scientific community

DOI:  https://doi.org/10.1038/d41586-026-00969-z
Am J Vet Res. 2026 Mar 16. pii: ajvr.87.04.editorial. [Epub ahead of print]87(4):

Fabricated citations: the quiet crisis of AI hallucinations in scientific publishing.

Lisa A Fortier.

DOI: https://doi.org/10.2460/ajvr.87.04.editorial