bims-arines 2026-06-28 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–06–28
ten papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Comparative landscape of artificial intelligence-assisted systematic literature reviews.
Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study - CORRIGENDUM.
Supporting Literature Reviews: A Comparison Between Human and Generative Artificial Intelligence Screening for a Scoping Review.
Closing the screening gap but not the writing gap: a two-topic evaluation of LLMs for systematic reviews and meta-analyses in hepatology.
Artificial Intelligence in Health Technology Assessment Submissions: A Targeted Review of Global Policy and Practice.
Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis.
Using ChatGPT for thematic analysis of qualitative interviews in cultural research: a methodological investigation.
The Role of Generative Artificial Intelligence in the Analysis of Qualitative Data Compared With Human-Led Analysis.
Integration of large language models and evidence-based Chinese medicine: A scoping review.
Trends in the use of adult-specific preference-weighted health-related quality of life instruments in clinical trials over the past 50 years: a protocol for a meta-research study using deep learning-based natural language processing and large language models.

Expert Rev Pharmacoecon Outcomes Res. 2026 Jun 25.

Comparative landscape of artificial intelligence-assisted systematic literature reviews.

Ákos Bernard Józwiak, László Balkányi, Judit Hagymásy.



Keywords:  agentic AI; artificial intelligence; evidence synthesis; large language models; regulatory science; systematic literature review

DOI:  https://doi.org/10.1080/14737167.2026.2695957

Res Synth Methods. 2026 Jun 22. 1

Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study - CORRIGENDUM.

Till J Adam, Salma A S Abosabie, Max Dittmer, Elise Wolf, Sara A Abosabie, Clara Behnke, Felix Baier, Annabelle Weickmann, Ludwig Köser, Christoph U Correll, Niklas Rutsch.

DOI: https://doi.org/10.1017/rsm.2026.10104

Comput Inform Nurs. 2026 Jun 24.

Supporting Literature Reviews: A Comparison Between Human and Generative Artificial Intelligence Screening for a Scoping Review.

Tami H Wyatt, Heather Carter-Templeton, Jordan Wrigley, Martin Kang, Rosemary Kennedy, Gregory L Alexander, Nancy Beale, Jan Nick, Safiye Sahin, Rachel Alexander.

  Scoping reviews are used to map the literature within a field or discipline, summarize existing evidence, and identify gaps in knowledge. Conducting a scoping review is often labor-intensive, requiring significant human resources. Artificial intelligence (AI) tools may offer efficiencies in stages of the review process; however, their accuracy and impact on rigor remain uncertain. This study used a retrospective cross-sectional agreement design to compare title and abstract screening decisions made by ChatGPT 3.5 with decisions made by human reviewers. Of the 3154 articles initially retrieved, 3148 were screened by both the research team and ChatGPT 3.5, with 6 articles excluded due to incomplete data or upload errors. During title and abstract screening, the human research team excluded 2661 articles (84.5%), whereas ChatGPT 3.5 excluded 1533 articles (48.7%). Our findings suggest that although AI-assisted screening may reduce time by filtering out a portion of irrelevant studies early in the process, these efficiencies must be balanced against the depth of understanding gained through review among the human team. Furthermore, the dialogue and consensus-building among research team members may be diminished when AI tools are used. This reduction in scholarly engagement may limit opportunities for critical appraisal, learning, and deeper comprehension of the evidence.
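The exclusion percentages quoted in this abstract can be re-derived from its counts. The short Python check below is an illustrative sketch, not part of the study; the variable and function names are assumptions. It compares marginal exclusion rates only, because the abstract gives no per-article agreement figures.

```python
# Counts taken from the abstract: 3148 records were screened by both
# the human team and ChatGPT 3.5.
SCREENED = 3148
EXCLUDED_HUMAN = 2661
EXCLUDED_CHATGPT = 1533


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


human_rate = percent(EXCLUDED_HUMAN, SCREENED)
chatgpt_rate = percent(EXCLUDED_CHATGPT, SCREENED)
# Difference between the two exclusion counts (hypothetical helper name).
exclusion_gap = EXCLUDED_HUMAN - EXCLUDED_CHATGPT

print(f"Human exclusion rate:   {human_rate}%")
print(f"ChatGPT exclusion rate: {chatgpt_rate}%")
print(f"Difference in excluded records: {exclusion_gap}")
```

The difference in counts shows how many more records ChatGPT 3.5 would have passed on to the next stage than the human team did. It says nothing about which individual records the two disagreed on.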

Keywords:  Home care model; Large language models; Scoping review; emerging technologies

DOI:  https://doi.org/10.1097/CIN.0000000000001587

NPJ Gut Liver. 2026;3(1): 21

Closing the screening gap but not the writing gap: a two-topic evaluation of LLMs for systematic reviews and meta-analyses in hepatology.

Yuntao Zou, Iris Kim, Nan Gao, Michelle Li, Mi-Ok Kim, Jin Ge.

  Systematic reviews are essential but labor-intensive. We evaluated LLM-assisted literature screening and drafting in two hepatology topics: carvedilol in compensated cirrhosis and anticoagulation in portal vein thrombosis. For each topic, we searched PubMed, Cochrane, and EMBASE. A few-shot prompt with explicit inclusion/exclusion criteria was used to screen titles and abstracts, with results compared to manual review. Included studies were then processed using a retrieval-augmented LLM to generate ten automated systematic review and meta-analysis drafts per topic, which were evaluated by a separate judge LLM for PRISMA 2020 compliance against human reviews. Screening performance: After deduplication (703 and 370 records), LLM-assisted screening showed high agreement with manual review (sensitivity 86-93%, specificity 96-99%) while reducing screening time to 3 and 2 h versus 62 and 30 h manually. Drafting performance: RAG-enabled LLMs generated structured manuscripts with variable PRISMA 2020 compliance: 100% for titles, 91% for introductions, 75-80% for methods, and 68-75% for results, with downstream weaknesses in abstracts and discussions (<65%). LLM-based PRISMA scoring closely matched human review (ICC ≈ 0.90). LLM-assisted screening was highly accurate, reducing workload by >90%, but automated drafting was reliable mainly for titles and introductions, requiring human oversight to prevent errors and hallucinations.
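The ">90%" workload reduction stated above follows from the screening times in the abstract. The Python sketch below is illustrative only: the dictionary keys and function name are assumptions, and the topics are labeled by their order in the abstract rather than by name.

```python
# Screening hours reported in the abstract, in the order the two topics
# are listed. Each value is (manual_hours, llm_hours).
SCREENING_HOURS = {
    "first topic": (62, 3),
    "second topic": (30, 2),
}


def time_reduction(manual_hours: float, llm_hours: float) -> float:
    """Percentage of manual screening time saved, rounded to one decimal."""
    return round(100 * (1 - llm_hours / manual_hours), 1)


reductions = {
    topic: time_reduction(manual, llm)
    for topic, (manual, llm) in SCREENING_HOURS.items()
}

for topic, saved in reductions.items():
    print(f"{topic}: {saved}% less screening time")
```

Both topics come out above 90%, which matches the abstract's summary claim.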

Keywords:  Diseases; Health care; Medical research

DOI:  https://doi.org/10.1038/s44355-026-00068-w

Value Health Reg Issues. 2026 Jun 23. pii: S2212-1099(26)00073-7. [Epub ahead of print] 101658

Artificial Intelligence in Health Technology Assessment Submissions: A Targeted Review of Global Policy and Practice.

Nishu Gaind, Matthew Badin, Johanna Jacob, Thomas Haugli-Stephens, Luka Ivkovic, Mir-Masoud Pourrahmat, Mir Sohail Fazeli, Eon Ting.

OBJECTIVES: Artificial intelligence (AI) has the potential to revolutionize healthcare, including health technology assessments (HTAs). Although its application in HTA is still emerging, AI holds promise for enhancing evidence generation, dossier development, and review quality and efficiency. This study examines the landscape of AI/machine learning use and acceptance by HTA agencies.
METHODS: A review of guidance documents, policy statements, and reports on AI use across HTA agencies in 17 countries (England, United States, Australia, Canada, France, Germany, Italy, Spain, Scotland, Belgium, The Netherlands, Sweden, Denmark, Finland, Norway, Japan, and Singapore) and the European Network for Health Technology Assessment/Joint Clinical Assessment was conducted on October 1, 2025. A supplementary search of Embase, bibliographies of previous reviews, and gray literature was also completed.
RESULTS: Thirty-seven publications, including documents from 9 HTA agencies, were identified after screening 1309 abstracts. Among those providing guidance on AI/machine learning in HTA submissions, the National Institute for Health and Care Excellence, Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen, Canada's Drug Agency (CDA-AMC), Haute Autorité de Santé, Norwegian Institute of Public Health, Belgium Health Care Knowledge Centre, and European Network for Health Technology Assessment referenced AI use in systematic literature reviews, data extraction, evidence synthesis, health economic modeling, real-world evidence, and internal operations, emphasizing human oversight, ethical governance, tool evaluation, and pilot testing. CDA-AMC has also developed an evaluation instrument for AI search tools to monitor and assess evolving technologies. Quebec's Institut National d'Excellence en Santé et Services Sociaux has created a GPT-4-based screening tool to assist study screening.
CONCLUSIONS: This review underscores the evolving yet inconsistent integration of AI into HTA submissions. The National Institute for Health and Care Excellence and CDA-AMC stand out as the only HTA agencies with a clear position statement and implementation strategy for AI.

Keywords:  artificial intelligence; generative AI; health technology assessment; large language models; machine learning; systematic reviews

DOI:  https://doi.org/10.1016/j.vhri.2026.101658

Bioinformatics. 2026 Jun 20. pii: btag418. [Epub ahead of print]

Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis.

Dhylan Patel, Antoine D Lain, Avish Vijayaraghavan, Nazanin Faghih-Mirzaei, Monica N Mweetwa, Meiqi Wang, Tim Beck, Joram M Posma.

MOTIVATION: Manual curation of biomedical literature is slow and error-prone, and while large language models trained on general texts have been shown to be useful for text summarisation, these methods lack the domain-specific expertise required to perform this task accurately. Here we describe the creation of the first microbiome-specific text corpus, use it to train deep learning algorithms for named-entity recognition (NER) and entity linking (EL), and demonstrate their use in meta-analysing microbiome literature.
RESULTS: The training and validation set (n = 1,410) contained a total of 90,150 annotations (both long form and abbreviations). The trained models were evaluated on the gold-standard test set (n = 288), which had an inter-annotator agreement rate of 99.52% for NER and 88.31% for EL. Our fine-tuned BioBERT model achieved an F1-score of 96% for NER, surpassing a rule- and dictionary-based annotation pipeline (94%). For EL, the accuracy obtained by the deep learning models greatly surpassed that of the pipeline (91% vs 69%). Evaluated on the entire available literature (n = 6,927) across 14 domains, our models annotate an entire full-text document in only 7 seconds.
AVAILABILITY: All code for automatic annotation and model training is available from GitHub and Zenodo, with instructions on how to deploy the model on new text. The redistributable, annotated training set and unannotated test set are available from Zenodo, and the redistributable, human-labelled test set is hosted as a benchmark on Codabench for evaluation of NER only and of NER+EL. The annotated documents for all available literature are hosted separately at Zenodo.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Keywords:  AI-assisted literature review; biomedical literature corpus; deep learning; microbiome; named entity normalisation; named entity recognition; natural language processing; text mining

DOI:  https://doi.org/10.1093/bioinformatics/btag418

Asian J Psychiatr. 2026 Jun 23. pii: S1876-2018(26)00244-3. [Epub ahead of print] 122: 105071

Using ChatGPT for thematic analysis of qualitative interviews in cultural research: a methodological investigation.

Madeha Umer, Muqaddas Asif, Siqi Xue, Brett D M Jones, Cindy-Lee Dennis, Farooq Naeem, Benoit H Mulsant, Muhammad Ishrat Husain.

   OBJECTIVE: The use of artificial intelligence (AI) tools in clinical and psychotherapy research is gaining increasing attention. This study explores the application of a large language model (LLM), ChatGPT version 4.5, as a coding assistant for qualitative data in psychotherapy research.
METHODS: Twenty-four semi-structured qualitative interviews were conducted with 12 patients with bipolar disorder and 12 family caregivers participating in a study of a culturally adapted psychotherapy intervention in Pakistan. Interviews were transcribed in Roman Urdu and analyzed using ChatGPT, which was prompted to generate English-language thematic codes and interpretations. Outputs were critically reviewed by a bilingual qualitative researcher for cultural relevance and semantic accuracy.
RESULTS: ChatGPT produced efficient, structured codes and surface-level thematic interpretations. However, it frequently missed emotional nuances, idiomatic expressions, and culturally embedded meanings. Human oversight was essential for ensuring interpretive depth and contextual validity.
CONCLUSION: The use of ChatGPT as an AI-assisted tool for qualitative analysis offers potential for enhancing efficiency in psychotherapy research but remains limited in capturing cultural and emotional subtleties. AI-human collaboration represents a promising but still evolving approach for cross-cultural qualitative research in mental health.

Keywords:  Artificial Intelligence (AI); Cultural Adaptation; Large Language Models (LLMs); Psychotherapy Research; Qualitative Research

DOI:  https://doi.org/10.1016/j.ajp.2026.105071

Cureus. 2026 May;18(5): e109620

The Role of Generative Artificial Intelligence in the Analysis of Qualitative Data Compared With Human-Led Analysis.

Jasmin Dhanoa, Mark Lee, Sonaina Chopra, Quang Ngo, Elif Bilgic.

Introduction: Generative artificial intelligence (GenAI) has been widely adopted across multiple fields and is beginning to be integrated into research, specifically in qualitative and mixed-methods designs. Currently, GenAI can be used for data familiarization and analysis. However, approaches that integrate GenAI with human analysis are still relatively new, and no studies in medical education have explored this approach. The overarching purpose of this study is to compare GenAI-led and human-led thematic analyses of qualitative data and to explore strategies that can enhance GenAI-led thematic analysis, thereby providing insights into how GenAI and human-led analyses can complement each other.
Methods: A GenAI platform (Microsoft 365 Copilot; Microsoft Corporation, Redmond, Washington, USA) was used to conduct reflexive thematic analysis and generate themes from a qualitative research dataset of 23 interview transcripts collected in 2024. The GenAI analysis was conducted through an iterative process of exploring the functions of Copilot, optimizing data input, and investigating prompting strategies. The quality of the GenAI analysis was explored by comparing its output to the human-led analysis.
Results: Overall, we found that, through effective prompting strategies, Copilot was able to create a thematic table, providing a comprehensive view and summary of the data. However, at times, Copilot could not use the entirety of a large prompt. Additionally, a comparison of the Copilot-generated and human-generated codebooks showed that Copilot took a more interpretive analytical approach than the human-led analysis, which used a qualitative descriptive approach.
Conclusion: Since the use of GenAI to support qualitative analysis is new, we caution readers to explore the functions of the GenAI platform they use and understand the prompting strategies that yield the optimal analytical approach and output for their objectives. Specifically, it is important to reflect on the types of qualitative analysis that GenAI can support and to consider reflexivity and potential biases throughout the research process.

Keywords:  emotions; generative artificial intelligence; interviews; postgraduate medical education; qualitative analysis

DOI:  https://doi.org/10.7759/cureus.109620

Integr Med Res. 2026 Sep;15(3 Part B): 101349

Integration of large language models and evidence-based Chinese medicine: A scoping review.

Yuanyuan Yao, Hui Liu, Daoze Yang, Xufei Luo, Honghao Lai, Zhe Wang, Yaolong Chen, Zhaoxiang Bian.

   Background: Large language models (LLMs) have attracted increasing attention in medical research and clinical practice and have been applied to processes related to evidence-based medicine (EBM). However, the extent of their integration with evidence-based Chinese medicine (CM) remains unclear.
Methods: We systematically searched PubMed, Web of Science, China National Knowledge Infrastructure (CNKI), and Wanfang Data from 30 November 2022 to 31 January 2026, with supplementary searches conducted in Google Scholar. Studies were included if they applied LLMs to EBM processes within a CM context or investigated LLMs in CM using established evidence-based research designs. Descriptive analysis summarized study characteristics, and findings were mapped according to the evidence ecosystem framework.
Results: A total of 12 studies published between 2023 and 2025 were included. Most studies integrated LLMs into different stages of the EBM workflow within a CM context. At the evidence generation stage, studies explored the role of LLMs in identifying research priorities. At the evidence synthesis stage, LLM performance was evaluated in literature screening, data extraction, and risk-of-bias assessment. At the evidence translation stage, studies evaluated the performance of LLMs in guideline-related question answering and recommendation generation. At the evidence implementation stage, LLMs combined with knowledge graphs or retrieval-augmented generation were used to develop intelligent question-answering systems based on CM guidelines or standards.
Conclusion: Existing studies suggest that LLMs have begun to be explored across multiple stages of evidence-based CM research and show potential for improving evidence synthesis efficiency and supporting knowledge translation and application.
Protocol registration: Open Science Framework (https://osf.io/ztbd5/overview).

Keywords:  Chinese medicine; Evidence-based medicine; Large language model; Scoping review

DOI:  https://doi.org/10.1016/j.imr.2026.101349

BMJ Open. 2026 Jun 22. 16(6): e118609

Trends in the use of adult-specific preference-weighted health-related quality of life instruments in clinical trials over the past 50 years: a protocol for a meta-research study using deep learning-based natural language processing and large language models.

Sarun Srikhom, Nancy Devlin, Nhung Nghiem, Sandra Nolte, Vu Vo, An Tran-Duy.

   BACKGROUND: Health technology assessment bodies increasingly emphasise the importance of preference-weighted health-related quality of life (HRQoL) evidence. However, such measures are often absent in clinical trial publications. It is not yet clear how frequently clinical trials have incorporated these measures over the past five decades, how the use of preference-weighted HRQoL instruments has evolved over time, and how trends differ across disease areas, countries and global regions. This study aims to (1) assess changes over time in the proportions of clinical trials using each preference-weighted HRQoL instrument in adults, and (2) model secular trends in the adoption of these instruments across disease areas, countries and regions. The study will provide a comprehensive, systematic assessment of the use of preference-weighted HRQoL instruments in clinical trials since 1976 and develop a scalable approach for large-scale evidence synthesis.
METHODS: We will identify clinical trials involving humans published in English since 1976 through systematic searches of MEDLINE, Embase, Cochrane Library and Web of Science. We will focus on generic preference-weighted HRQoL instruments for adults, including EQ-5D-3L, EQ-5D-5L, Short Form 6 Dimensions, 12-Item Short Form Health Survey (SF-12), Health Utility Index 2, Health Utility Index 3, Assessment of Quality of Life (AQoL) series (AQoL-4D, AQoL-6D, AQoL-7D, AQoL-8D), Quality of Well-Being Scale (QWB), QWB Self-Administered (QWB-SA), 15D and Patient-Reported Outcomes Measurement Information System (PROMIS) with the Preference Scoring System (PROPr). Screening and data extraction will be automated using a natural language processing (NLP) pipeline or large language models (LLMs). To determine the most accurate approach, we will benchmark NLP and LLM performance against a manually curated reference dataset of 5000 randomly sampled articles reviewed independently by three reviewers. Model performance will be evaluated using classification metrics including accuracy, recall and F1-score. Annual counts and proportions of trials using each instrument will be calculated, stratified by disease area, country and region. Trends will be modelled using basis-splines (B-splines) with 2 or 3 degrees of freedom and Bayesian spline regression to estimate secular changes in both absolute numbers and proportions of instrument use over time.
ETHICS AND DISSEMINATION: This study uses only published literature and does not involve human participants or individual-level data. All results will be reported in aggregate form, with no identifiable information. Formal ethics approval is therefore not required. Findings will be disseminated via peer-reviewed publications and conference presentations, and aggregated data and analysis code will be made publicly available to support transparency and reproducibility.
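The classification metrics named in the methods (accuracy, recall and F1-score) are typically computed from a confusion matrix of model decisions against the reference labels. The Python sketch below uses the standard definitions. The function name and the counts in the example are hypothetical and do not come from the protocol.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives,
    true negatives, judged against a manually curated reference set.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only (not data from the protocol).
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When relevant articles are rare, accuracy can look high even if many are missed, which is one reason to report recall and F1-score alongside it.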

Keywords:  Artificial Intelligence; Clinical Trial; Machine Learning; Natural Language Processing; Quality of Life; Systematic Review

DOI:  https://doi.org/10.1136/bmjopen-2026-118609