bims-arines 2026-03-08 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–03–08
thirteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

An overview of artificial intelligence approaches for automating evidence synthesis.
Assessing Large Language Models for Early Article Identification in Otolaryngology-Head and Neck Surgery Systematic Reviews.
Large language model-based multiagent collaboration for abstract screening toward automated systematic reviews.
Searching smarter, not harder: leveraging AI to enhance literature searches for theory-driven reviews-A methodological case study.
Using Large Language Models to Address Contextual Questions in Systematic Reviews.
Artificial intelligence and large language models for interview transcription in qualitative research: competency, politeness and ethical implications.
Using Explainable Artificial Intelligence in a Systematic Literature Review of Pressure Injury Prevention: Lessons Learned.
LLM-assisted systematic review of large language models in clinical medicine.
Use of artificial intelligence to draft a mini-HTA report on a new medical device belonging to class IIb-III.
AI and Qualitative Health Research: Working Through a Necessary Grieving Process.
Addressing the Special Issue: Intersections (Existing, Emerging, and Imagined) Between Artificial Intelligence and Qualitative Health Research.
Prevention, artificial intelligence, and mathematics in Health economics: a framework for dynamic value modelling under long-term uncertainty.
Talk to the bot: A scoping review on using AI-powered transcription tools in qualitative research.

Public Health. 2026 Mar 02. pii: S0033-3506(26)00089-2. [Epub ahead of print]254 106220

An overview of artificial intelligence approaches for automating evidence synthesis.

Sashika Harasgama, Helen Pearce, Liam Loftus, Helena Painter, John Ford.

  Artificial intelligence tools are increasingly being used to automate the evidence synthesis process, particularly for researcher-intensive tasks such as literature screening and data extraction. Researchers can face challenges in selecting appropriate tools from the large array available, often due to a limited understanding of their technicalities and therefore their capabilities and limitations. This paper provides a comprehensive overview of AI approaches leveraged by these evidence synthesis tools, examining the evolution from traditional machine learning to modern transformers such as large language models. We examine each approach's strengths and limitations within evidence synthesis, highlighting issues of accuracy, transparency, and task specificity. While AI has demonstrated significant potential for optimising researcher time and workload, important limitations remain regarding its statistical precision, interpretability, and reliability, which require careful consideration and continued human oversight. We conclude with recommendations for responsible adoption and future research directions to enhance the transparency and effectiveness of AI-assisted evidence synthesis.

Keywords:  Artificial intelligence; Automation; ChatGPT; Deep learning; Evidence synthesis; Large language models; Machine learning; Neural networks; Systematic reviews as a topic

DOI:  https://doi.org/10.1016/j.puhe.2026.106220
Health Care Sci. 2026 Feb;5(1): 19-28

Assessing Large Language Models for Early Article Identification in Otolaryngology-Head and Neck Surgery Systematic Reviews.

Ajibola B Bakare, Young Lee, Jhuree Hong, Claus-Peter Richter, Jonathan P Kuriakose.

   Background: Assess ChatGPT and Bard's effectiveness in the initial identification of articles for Otolaryngology-Head and Neck Surgery systematic literature reviews.
Methods: Three PRISMA-based systematic reviews (Jabbour et al. 2017, Wong et al. 2018, and Wu et al. 2021) were replicated using ChatGPTv3.5 and Bard. Outputs (author, title, publication year, and journal) were compared to the original references and cross-referenced with medical databases for authenticity and recall.
Results: Several themes emerged when comparing Bard and ChatGPT across the three reviews. Bard generated more outputs and had greater recall in Wong et al.'s review, with a broader date range in Jabbour et al.'s review. In Wu et al.'s review, ChatGPT-2 had higher recall and identified more authentic outputs than Bard-2.
Conclusion: Large language models (LLMs) failed to fully replicate peer-reviewed methodologies, producing outputs with inaccuracies but identifying relevant, especially recent, articles missed by the references. While human-led PRISMA-based reviews remain the gold standard, refining LLMs for literature reviews shows potential.

Keywords:  Bard; ChatGPT; artificial intelligence; large language models; systematic review

DOI:  https://doi.org/10.1002/hcs2.70048
Biol Methods Protoc. 2026 ;11(1): bpag006

Large language model-based multiagent collaboration for abstract screening toward automated systematic reviews.

Opeoluwa Akinseloyin, Xiaorui Jiang, Vasile Palade.

  Systematic reviews (SRs) are essential for evidence-based practice but remain labor-intensive, especially during abstract screening. This study evaluates whether multiple large language models (multi-LLMs) collaboration can improve the efficiency and reduce costs for abstract screening. Abstract screening was framed as a question-answering (QA) task using cost-effective LLMs. Three multi-LLM collaboration strategies were evaluated, including majority voting by averaging opinions of peers, multi-agent debate for answer refinement, and LLM-based adjudication against answers of individual QA baselines. These strategies were evaluated on 28 SRs of the CLEF eHealth 2019 technology-assisted review benchmark using standard performance metrics such as mean average precision (MAP) and work saved over sampling at 95% recall (work saved over sampling WSS@95%). Multi-LLM collaboration significantly outperformed QA baselines. Majority voting was overall the best strategy, achieving the highest MAP 0.462 and 0.341 on subsets of SRs about clinical intervention and diagnostic technology assessment, respectively, with WSS@95% 0.606 and 0.680, enabling in theory up to 68% workload reduction at 95% recall of all relevant studies. Multi-agent debate improved weaker models most. Our own adjudicator-as-a-ranker method was the second strongest approach, surpassing adjudicator-as-a-judge, but at a significantly higher cost than majority voting and debating. Multi-LLM collaboration substantially improves abstract screening efficiency, and the success lies in model diversity. Making the best use of diversity, majority voting stands out in terms of both excellent performance and low cost compared to adjudication. Despite context-dependent gains and diminishing model diversity, multi-agent debate is still a cost-effective strategy and a potential direction of further research.

Keywords:  abstract screening; ensemble; large language model; multiagent system; systematic review

DOI:  https://doi.org/10.1093/biomethods/bpag006
BMC Med Res Methodol. 2026 Mar 04.

Searching smarter, not harder: leveraging AI to enhance literature searches for theory-driven reviews-A methodological case study.

R Hunter, A Booth, L Wood.

   BACKGROUND: Integrating artificial intelligence (AI) into literature searching has the potential to enhance research synthesis by improving the identification of conceptually rich or otherwise difficult-to-locate evidence. Theoretical or conceptual literature reviews, including realist reviews, often involve resource-intensive searches because they aim to trace nuanced ideas, mechanisms, or conceptual relationships across multiple sources. This case study illustrates the use of AI-powered tools to support and streamline such literature searching, using a realist review as an example.
METHODS: We applied AI tools-Scite and Undermind-in the context of a realist review to facilitate the identification of relevant studies. Seed papers and key informant papers guided the search, and a novel classification system (grandparent, parent, and child papers) was used to systematically organise studies for developing and refining theoretical constructs. Transparent screening procedures and decision-making frameworks were employed to ensure methodological rigour and reproducibility.
RESULTS: The integration of AI tools supported the retrieval of conceptually relevant literature and helped manage complex datasets. The classification system enabled structured organisation of studies, supporting iterative testing and refinement of theoretical constructs. The workflow demonstrated flexibility and adaptability, suggesting potential applicability beyond realist review.
CONCLUSIONS: Our findings suggest that AI-powered tools can support literature searching, particularly in identifying conceptually relevant studies. However, these tools do not replace the critical interpretive work required by researchers. Human judgement remains essential to assess relevance, evaluate nuanced concepts, and make informed decisions throughout the search process, with AI serving as a valuable adjunct rather than a substitute.

Keywords:  Artificial intelligence; Evidence appraisal; Literature screening; Literature searches; Realist reviews; Review methodology

DOI:  https://doi.org/10.1186/s12874-026-02814-3
Cochrane Evid Synth Methods. 2026 Mar;4(2): e70060

Using Large Language Models to Address Contextual Questions in Systematic Reviews.

Susanne Hempel, Kimny Sysawang, Haley K Holmer, Erin Tokutomi, Suchitra Iyer, Zhen Wang, Edi Kuhn, Mohammad Hassan Murad.

   Objectives: Systematic evidence reviews (SERs) produced by the U.S. Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Program use contextual questions to provide context and background information on the topic. There is currently no standardized approach to address contextual questions in systematic reviews. This study explored the use of publicly available large language models (LLMs) in addressing contextual questions.
Study Design: Using a set of 20 published and 5 yet to be published SERs, we selected one contextual question per report and used it as a prompt to elicit answers from an LLM (ChatGPT, Bard, Claude, or Perplexity). Two independent reviewers rated the results using a priori established evaluation criteria (https://osf.io/4k3cu/), comparing the response in the SER to LLM-generated responses. The study was guided by six research questions addressing feasibility, validity of content, validity of structure, mistakes, congruence between responses, and incremental validity of using LLMs to address contextual questions.
Results: Using minimal prompt engineering produced relevant responses and documented the feasibility of LLM-generated answers to contextual questions. Responses differed in content and format and are not reproducible (e.g., LLMs update regularly), but LLMs were able to produce articulate, clinically plausible, and well-structured responses. We detected few factual errors, contradictions, and no instance of suspected bias, but citations supporting LLM-generated responses could often not be produced or could not be verified ('confabulations'). Congruence with human generated responses varied, with LLM-generated responses providing more background on the topic and SERs providing more nuanced answers in response to the contextual question. Results regarding incremental validity were mixed and may depend on the tool.
Conclusion: LLMs are potentially helpful in addressing contextual questions in systematic reviews but human expertise remains essential for using the generated information in a meaningful way.

Keywords:  artificial intelligence; context; large language models; systematic reviews

DOI:  https://doi.org/10.1002/cesm.70060
J Clin Epidemiol. 2026 Feb 27. pii: S0895-4356(26)00083-1. [Epub ahead of print] 112208

Artificial intelligence and large language models for interview transcription in qualitative research: competency, politeness and ethical implications.

Livia Puljak.



Keywords:  Accountability; Artificial Intelligence; Bias; Data Privacy; Ethics; Informed Consent; Natural Language Processing; Qualitative Research; Research; Speech Recognition

DOI:  https://doi.org/10.1016/j.jclinepi.2026.112208
J Wound Ostomy Continence Nurs. 2026 Mar-Apr 01;53(2):53(2): 94-101

Using Explainable Artificial Intelligence in a Systematic Literature Review of Pressure Injury Prevention: Lessons Learned.

Joshua Morriss, Laura E Edsberg, Jill Cox, Virginia Capasso.

  Artificial intelligence (AI) is rapidly transforming health care by augmenting clinical decision-making and enabling clinicians and researchers to perform literature searches, including systematic literature reviews (SLR). This article describes the methods used to develop an AI-generated SLR, and the lessons learned by the research team (comprised of PI content and AI experts) while completing this project. To generate the SLR, a proprietary explainable AI (XAI) platform was used incorporating generative-discriminative algorithms and reinforcement learning with human feedback. The following research question was posed: "What are best practices for pressure injury prevention in hospitalized patients?" Content experts defined and iteratively refined search parameters and exclusion criteria. The XAI screened 1414 records, following the Preferred Reporting Items for Systematic reviews and Meta-Analyses 2020 guidelines. After study selection, content experts reviewed the draft SLR for citation accuracy, synthesis quality, and clinical validity. This process yielded 110 studies. Among these, 33 studies were originally excluded but were re-incorporated after content expert input. The AI-generated SLR paper contained multiple citation errors and misinterpretations. Narrative quality was mechanical, with unsupported generalizations and factual inaccuracies. We found that content experts are critical to determine the correct search terms and interpret AI-generated results. Similarly, collaboration with AI experts is necessary to improve understanding of AI applications. The need for a detailed review of any AI-generated SLR is essential to ensure evidence fidelity.

Keywords:  Explainable artificial intelligence; Pressure Injury prevention; Pressure injury; Systematic literature review

DOI:  https://doi.org/10.1097/WON.0000000000001259
Nat Med. 2026 Mar 03.

LLM-assisted systematic review of large language models in clinical medicine.

Sully F Chen, Anton Alyakin, Andreas Seas, Eunice Yang, Joanne J Choi, Jin Vivian Lee, Amelia L Chen, Pranav I Warman, Rochelle T Bitolas, Robert J Steele, Daniel A Alber, Eric K Oermann.

Clinical evaluations of large language models (LLMs) have rapidly expanded since 2022, yet their evidence base remains opaque. The overwhelming volume of studies creates challenges for manual curation and review. However, LLMs themselves offer the scalability and capability to evaluate the ever-growing evidence base. This LLM-assisted review identified 4,609 peer-reviewed studies in clinical medicine between January 2022 and September 2025, equating to roughly 3.2 papers per day. Only 1,048 studies used real-world patient data and of these only 19 were prospective randomized trials; most addressed simulated scenarios (n = 1,857) or exam-style tasks (n = 1,704). ChatGPT and related OpenAI models constitute 65.7% of evaluated models, with Gemini/Bard a distant second constituting 13.1% of evaluated models. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval, and education and assessment simulation. Across 1,046 head-to-head comparisons, LLMs outperformed humans in 33% of comparisons, with a strong dependency on task realism and level of training. At least 25% of studies had sample sizes less than 30. Despite the growth of LLMs in medicine, rigorous, patient-centered evidence remains scarce, underscoring the need for larger prospective trials before clinical adoption.

DOI: https://doi.org/10.1038/s41591-026-04229-5
Glob Reg Health Technol Assess. 2026 Jan-Dec;13:13 55-57

Use of artificial intelligence to draft a mini-HTA report on a new medical device belonging to class IIb-III.

Andrea Messori, Sabrina Beltramini, Paola Crosasso, Claudia Fruttero, Melania Rivano, Marco Chiumente, Maria C Silvani.

  The rapid evolution of artificial intelligence (AI) in the pharmaceutical and medical device (MD) sectors has prompted interest in its potential role in supporting health technology assessment (HTA). This editorial presents an innovative project aimed at facilitating and expanding HTA activities for high-risk MDs (Class IIb-III) in Italy, where structured HTA processes for MDs are inconsistently implemented. The project centers on a freely accessible AI-based web tool designed to generate preliminary mini-HTA reports. The tool operates through two steps: users provide essential device information via an online form, and ChatGPT produces a structured draft report, including PICO statements, coverage of the nine EUnetHTA domains, and a preliminary summary of relevant PubMed evidence. Although these AI-generated reports are imperfect and require expert verification and refinement, they offer substantial practical advantages by reducing the initial workload and enabling rapid production of a first draft-within minutes rather than hours. The project includes detailed operational instructions and real application examples, such as an artificial iris device, presented in supplementary appendices. Future developments include the release of an English-language version to support broader international use. While AI cannot replace expert judgment, the editorial highlights its value as an accelerative tool that can streamline early HTA steps and promote more systematic evaluation of MDs across Italian regions. Continued iterative use is expected to improve system performance and enhance integration into HTA workflows.

Keywords:  Artificial intelligence; ChatGPT; Medical devices; Rapid mini-HTA report

DOI:  https://doi.org/10.33393/grhta.2026.3691
Qual Health Res. 2026 Mar 06. 10497323261417237

AI and Qualitative Health Research: Working Through a Necessary Grieving Process.

Jordan Sibeoni, Laurence Verneuil, Emilie Manolios, Jean-Pierre Meunier, Christian Ponsard, Jeanne Mathé, Anne Revah-Lévy.

  Qualitative health research has been shaken by the rapid uptake of artificial intelligence (AI), especially large language models. Drawing on Kübler-Ross's five-stage grief heuristic, we articulate a provocative, yet constructive, map of the field's current tensions (denial, anger, negotiation, depression, acceptance) around AI's pros and cons. We argue that what is at stake is not simply efficiency but our very identity of qualitative researchers: reflexivity, intersubjectivity, temporality, and the role of researcher subjectivity. We propose concrete practices compatible with qualitative research's epistemic and ethical commitments: collective prompt-writing, "coding retreats" for critical oversight of outputs, explicit disclosures, and transparency about what is (and is not) delegated to machines. Rather than reject or romanticize AI, we advocate a rigorous, ethically grounded co-working with it, that safeguards slowness, presence, and dialogical sense-making. Our contribution is to reframe AI as a catalyst for renewed reflexivity and methodological clarity, while warning against the erosion of embodied and collective thinking when research becomes "alone-with-AI." We conclude with actionable recommendations for reviewers, editors, and researchers to evaluate AI-assisted manuscripts without abandoning qualitative health research's core: careful attention to meaning, situated ethics, and intersubjective critique.

Keywords:  artificial intelligence; epistemology; ethics; large language models; methodology; qualitative health research; reflexivity

DOI:  https://doi.org/10.1177/10497323261417237
Qual Health Res. 2026 Mar;36(2-3): 140-144

Addressing the Special Issue: Intersections (Existing, Emerging, and Imagined) Between Artificial Intelligence and Qualitative Health Research.

Johanna Creswell Báez, James Salvo, Jessica Nina Lester.

  The articles in this special issue explore the intersections between artificial intelligence (AI) and qualitative health research at a moment of rapid technological expansion and heightened methodological debate. The contributions engage AI not as something to be adopted or rejected but as a focus of critical inquiry that raises epistemological, methodological, and ethical questions for qualitative scholars. Across diverse perspectives, the articles foreground reflexivity, methodological development, and responsible approaches to AI use in clinical settings. The special issue adopts a "big-tent" approach, bringing together varied perspectives that are often in tension, yet productively in conversation. Published amid an accelerating AI hype cycle and increasing institutional pressures to adopt technological solutions, this collection affirms qualitative health research as a vital space for critical dialogue and methodological innovation. The contributions collectively center the interpretive and value-based commitments that have long defined qualitative inquiry, engaging with AI critically and reflexively rather than on its own terms.

Keywords:  Artificial Intelligence; Generative AI; Qualitative Health Research; Qualitative Inquiry; Research Methods

DOI:  https://doi.org/10.1177/10497323261417532a
Expert Rev Pharmacoecon Outcomes Res. 2026 Mar 05.

Prevention, artificial intelligence, and mathematics in Health economics: a framework for dynamic value modelling under long-term uncertainty.

Lieke Maas, Ayshe Yaylali, Mickael Hiligsmann.



Keywords:  Health economics; artificial intelligence; health technology assessment; mathematics; opportunities; prevention; real-world evidence

DOI:  https://doi.org/10.1080/14737167.2026.2642648
Ann Acad Med Singap. 2026 Feb 23.

Talk to the bot: A scoping review on using AI-powered transcription tools in qualitative research.

Ying Jun Lim, Ngiap Chuan Tan.



Keywords:  artificial intelligence; interview; qualitative research; tool; transcription

DOI:  https://doi.org/10.47102/annals-acadmedsg.2025220