bims-arines 2026-06-14 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–06–14
eighteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Using full agreement across multiple large language models for title-and-abstract screening in systematic reviews: a proof-of-concept.
Artificial Intelligence (AI) Readiness to Support Evidence Synthesis by Workflow: Findings From a Review of Reviews.
Comparing supervised machine learning and large language models in title-abstract screening.
Artificial intelligence versus human consensus: A concordance analysis in the screening of studies for evidence synthesis in physical activity and sport.
Validation of Synthesa AI, a Large Language Model-Based Screening Tool for Systematic Reviews: Results From 9 Pharmacologic Studies.
Show Your Work: Verbatim Evidence Requirements and Automated Assessment of Large Language Models for Biomedical Text Processing of Trial Eligibility Criteria.
Evaluating Large Language Models for Automated Evidence Synthesis in Neuroimaging AI: A Multi-Model Benchmark.
Artificial Intelligence Applications Versus Manual Methods For Literature Retrieval: A Comparative Analysis.
RETRACTION: Human-in-the-Loop Artificial Intelligence System for Systematic Literature Review: Methods and Validations for the AutoLit Review Software.
Automating methodological quality assessment in orthopedic systematic reviews using large language models.
An Open-Source Systematic Reviews Integrated System (OSSYRIS) - Streamlining Processes and Standardising Data Structures.
Can Artificial Intelligence Replicate Human Qualitative Analysis?
The Role of Large Language Models in Scientific Research.
Topic-Aware Summarization of Lived Health Care Experiences: Large Language Model Evaluation Study.
Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models.
A Retrieval-Augmented Natural Language Interface for Data Description and Meta-Analysis in the Pathogens-in-Foods (PIF) Database.
Artificial intelligence in food safety.
Updating CMV protocols in lung transplant patients: a single-center case study modeling use of generative AI for antimicrobial stewardship protocol development and economic impact analysis.

Syst Rev. 2026 Jun 12. pii: 191. [Epub ahead of print]15(1):

Using full agreement across multiple large language models for title-and-abstract screening in systematic reviews: a proof-of-concept.

Frederic Hilkenmeier, Merle Stoltenberg, Christian Stierle.

   BACKGROUND: The exponential growth of scientific literature poses significant challenges for conducting systematic reviews, particularly in the labor-intensive title-and-abstract screening phase. This study examines the feasibility of using multiple large language models (LLMs) for title-and-abstract screening in systematic reviews.
METHODS: We propose and evaluate a full-agreement approach using three commercially available LLMs from different model families (ChatGPT, Gemini, and Claude), in which automated classification decisions are only accepted when all three models assign the same label. This approach was examined across six datasets to assess its effectiveness. A structured workflow was developed to support implementation without requiring specialized technical expertise. Additionally, a stop criterion was introduced to ensure that LLM-based classification is only applied when predefined performance thresholds are met.
RESULTS: Across the six datasets, approximately four in five abstracts received full agreement across models and could therefore be classified automatically. For this subset of abstracts, classification performance was consistently higher than that of previous automated approaches using LLMs, with statistically significant improvements in all prespecified performance metrics.
CONCLUSIONS: A full agreement approach across three LLMs may offer a promising and conservative strategy for title-and-abstract screening in systematic reviews by automating concordant decisions while reserving human review for discordant cases. As a proof-of-concept, the present findings support this approach as a possible workflow for reducing screening burden in the face of continued growth in the scientific literature, although its broader generalizability remains to be established.
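
The full-agreement rule described in METHODS can be illustrated with a minimal Python sketch. The function names, label strings, and record identifiers below are hypothetical and are not taken from the paper; the sketch only shows the decision logic in which a label is accepted automatically when every model agrees and the record is otherwise left for human screening.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def full_agreement_label(labels: Sequence[str]) -> Optional[str]:
    """Return the shared label if all models agree, otherwise None."""
    if not labels:
        raise ValueError("at least one model label is required")
    first = labels[0]
    return first if all(label == first for label in labels) else None


def triage(records: Dict[str, Sequence[str]]) -> Tuple[Dict[str, str], List[str]]:
    """Split records into auto-classified decisions and a manual-review queue."""
    auto: Dict[str, str] = {}
    manual: List[str] = []
    for record_id, labels in records.items():
        decision = full_agreement_label(labels)
        if decision is None:
            manual.append(record_id)   # discordant: reserve for human reviewers
        else:
            auto[record_id] = decision  # concordant: accept automatically
    return auto, manual


# Hypothetical labels from three models for three abstracts.
example = {
    "rec1": ["include", "include", "include"],
    "rec2": ["exclude", "exclude", "include"],
    "rec3": ["exclude", "exclude", "exclude"],
}
auto, manual = triage(example)
```

In this toy input, two of three records are classified automatically and one is routed to manual review. The stop criterion mentioned in the abstract is not modeled here.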

Keywords:  Abstract classification; Evidence synthesis; Large language models; Machine learning; Systematic reviews

DOI:  https://doi.org/10.1186/s13643-026-03228-4
Campbell Syst Rev. 2026 Jun;22(2): 18911803261454702

Artificial Intelligence (AI) Readiness to Support Evidence Synthesis by Workflow: Findings From a Review of Reviews.

Zijing Wei, Luyanda Ngongoma, Jose Cols, Arina L Bogdan, Ariel Lin, Claire Zhang, Yue Su, Nuno de Jesus Ximenes, Chloe Zhu, Yoav Ackerman, Heather L Bullock, Juhua Hu, Yanfang Su.

   Background: Evidence synthesis is crucial for informing evidence-based practice across various fields. However, the traditional methodology is resource-intensive, and its findings can be outdated before publication. There is a growing trend toward integrating automation and artificial intelligence (AI) approaches into evidence synthesis to enhance efficiency, but standardized adoption is still pending.
Objective: The goal of this study is to identify peer-reviewed evidence documenting AI readiness for evidence synthesis.
Methods: We searched MEDLINE, Embase, and Global Index Medicus in May 2025 to identify review articles that evaluated evidence synthesis tools. Relevant study reviews and tool reviews published in English between January 2020 and May 2025 were included in our review of reviews. Tool features and performance metrics were extracted according to stages of the evidence synthesis workflow, including search, screening, appraisal, extraction, and synthesis.
Results: We included 21 studies in our review of reviews and identified 46 evidence synthesis tools. Nine tools supported all five stages of the evidence synthesis workflow, among which DistillerSR covered the most workflow-supporting features (19 out of 21). Ten of the identified tools reported sensitivity rates for AI-powered title/abstract screening, all of which achieved ≥95% sensitivity in at least one configuration. Reported sensitivity rates of EPPI-Reviewer, Research Screener and SWIFT-Active Screener consistently reached the 95% threshold with varying degrees of automation.
Conclusion: This review found peer-reviewed evidence supporting AI readiness for human-supervised automation of title/abstract screening. However, evidence documenting AI readiness for other evidence synthesis tasks remains limited. DistillerSR and EPPI-Reviewer demonstrated the broadest feature support and strong evidence for AI-powered title/abstract screening. Our study highlights the potential of AI to improve efficiency while maintaining high sensitivity in the screening stage. AI-powered screening may serve as a critical first step toward scaling rapid reviews into living evidence syntheses.

Keywords:  artificial intelligence (AI); evidence synthesis; randomized controlled trials (RCTs); readiness; sensitivity; workflow

DOI:  https://doi.org/10.1177/18911803261454702
Syst Rev. 2026 Jun 09. pii: 190. [Epub ahead of print]15(1):

Comparing supervised machine learning and large language models in title-abstract screening.

Marco F Aigner, Matthias Ganzinger, Pascal Probst, Moritz Rinckens, Thomas M Pausch.

   BACKGROUND: Systematic reviews require reviewers to decide on the eligibility of large numbers of articles derived from database searches. To accelerate review conduct as ever more literature is published, past studies proposed automating the title/abstract-screening step with either supervised machine learning or large language models. Because prior studies mainly compared results within the same model family, we directly compared common TF-IDF-based supervised baselines and a zero-shot, criteria-prompted, open-weight large language model on the same data to discuss whether, and in which scenarios, they are feasible for review screening automation.
METHODS: We predicted the eligibility of labeled articles by four supervised machine learning models (Naïve Bayes, support vector machine, random forest, logistic regression) and one large language model (Llama-3.1-8B-Instruct). Articles were labeled with eligibility as decided by human reviewers in six systematic reviews. We evaluated the performance by binary confusion matrices and calculated recall, specificity, precision, F1-score, and accuracy over a thousand bootstrap samples each. We compared these results to a reported performance of 0.86 (recall) and 0.79 (specificity) in single human reviewers.
RESULTS: Model performance varies greatly between the datasets. Except for Naïve Bayes, recall and specificity are more closely aligned in the supervised machine learning models than in Llama. Averaged across all datasets, Llama matches human recall and the Naïve Bayes classifier exceeds it, while both fall behind human specificity. Conversely, logistic regression, random forest, and support vector machine fall behind human recall while all three exceed human specificity.
CONCLUSIONS: Both supervised machine learning and large language models achieve recalls close to or above those of human reviewers. The supervised machine learning models achieve a higher harmonic mean of recall and specificity, while the Llama model is more sensitive. Considering the reliance on training data and the all-or-nothing automation with supervised machine learning, this study's results warrant their use in the extension of pre-existing, non-critical systematic reviews. In contrast, because large language models decide on articles individually and provide comprehensive, discussable reasoning, they may be used in tandem with human reviewers, while the performance of ensembles of large language models is yet to be analyzed.
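
The evaluation described in METHODS (metrics derived from a binary confusion matrix and summarized over bootstrap samples) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names, the percentile summary, the seed, and the toy labels are hypothetical, and the authors' actual resampling code is not given in the abstract.

```python
import random


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = eligible, 0 = not eligible)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    # Undefined ratios (zero denominator) are reported as NaN.
    return num / den if den else float("nan")


def screening_metrics(y_true, y_pred):
    """Recall, specificity, precision, F1-score, and accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


def bootstrap_metrics(y_true, y_pred, n_boot=1000, seed=0):
    """Mean and 2.5th/97.5th percentiles of each metric over resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    draws = {}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample = screening_metrics([y_true[i] for i in idx],
                                   [y_pred[i] for i in idx])
        for name, value in sample.items():
            if value == value:  # skip NaN from degenerate resamples
                draws.setdefault(name, []).append(value)
    summary = {}
    for name, values in draws.items():
        values.sort()
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        summary[name] = (sum(values) / len(values), low, high)
    return summary


# Hypothetical human labels and model predictions for eight articles.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
point = screening_metrics(y_true, y_pred)
summary = bootstrap_metrics(y_true, y_pred)
```

On the toy data, the point estimates are recall 2/3, specificity 0.8, and accuracy 0.75; the bootstrap summary gives a rough sense of their spread.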

Keywords:  Large language model; Supervised machine learning; Systematic review; Title/abstract-screening

DOI:  https://doi.org/10.1186/s13643-026-03199-6
J Bodyw Mov Ther. 2026 Jul;pii: S1360-8592(26)00060-4. [Epub ahead of print]47 371-378

Artificial intelligence versus human consensus: A concordance analysis in the screening of studies for evidence synthesis in physical activity and sport.

Sebastián Rodríguez, Catalina León-Prieto, María Fernanda Rodríguez-Jaime.

   OBJECTIVE: To assess the agreement between ChatGPT Plus (GPT 4.1) and human consensus during the screening of studies in four different evidence synthesis projects.
METHODS: A comparative design was used to analyze the degree of agreement between ChatGPT Plus (GPT 4.1) and human reviewers in the study selection process for two systematic reviews with meta-analyses, one scoping review, and one literature review. Human screening was performed independently using the Rayyan platform, while the artificial intelligence was provided with predefined eligibility criteria and protocols. Screening decisions were compared using Cohen's kappa coefficient, sensitivity, and specificity, using Stata 18.
RESULTS: In the systematic reviews with meta-analyses (SR1 and SR2), agreement was high (κ = 0.73 and 0.86), with sensitivity ≥0.88 and specificity ≥0.99, indicating high reliability in excluding irrelevant studies. In contrast, in the scoping review (SR3) and the literature review (NR1), agreement was moderate (κ = 0.56 and 0.59), with lower positive predictive values (≤0.52), suggesting a higher risk of overdetection.
CONCLUSION: Overall, the results suggest that AI-assisted screening may serve as a reliable support tool or triage aid in reviews with well-defined inclusion criteria, rather than fully replacing manual screening. However, in more exploratory or interpretive contexts, human oversight remains necessary to ensure the accuracy of the selection process.
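
For readers unfamiliar with the agreement statistics named in METHODS, the following Python sketch computes Cohen's kappa, sensitivity, specificity, and positive predictive value from a 2x2 table of AI versus human-consensus decisions. The counts are hypothetical and are not data from the study, which used Stata 18 rather than Python.

```python
def agreement_stats(tp, fp, fn, tn):
    """Agreement statistics from a 2x2 table.

    tp: included by both AI and human consensus
    fp: included by AI only
    fn: included by human consensus only
    tn: excluded by both
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal include/exclude proportions.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Hypothetical screening counts for 1000 records (not from the paper).
stats = agreement_stats(tp=40, fp=5, fn=5, tn=950)
```

With these illustrative counts, kappa is about 0.88 even though raw agreement is 99%, which shows how kappa discounts the agreement expected by chance when most records are excluded.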

Keywords:  Artificial intelligence; Consensus; Evidence-based practice; Literature screening; Systematic reviews as topic

DOI:  https://doi.org/10.1016/j.jbmt.2026.04.008
J Cardiovasc Pharmacol. 2026 Jun 01. 87(6): 387-393

Validation of Synthesa AI, a Large Language Model-Based Screening Tool for Systematic Reviews: Results From 9 Pharmacologic Studies.

Lefteris Teperikidis, Christos Trampoukis, Kyiakos Polymenakos.

   ABSTRACT: Systematic review screening underpins the evidence base for pharmacology and drug development but remains burdensome, error-prone, and resource-intensive. Synthesa AI, a large language model-based abstract screening tool, was developed to streamline this process by providing a transparent and prompt-driven framework for abstract screening. In this validation study, Synthesa AI was tested across 17 benchmark meta-analyses on 9 therapeutic domains relevant to pharmacology and clinical pharmacotherapy. The tool screened 270,626 abstracts retrieved from PubMed and Scopus. Synthesa AI successfully identified all 163 benchmark-included studies, achieving a sensitivity of 100% (95% confidence interval: 97.7%-100.0%) and a specificity of 99.4% (95% confidence interval: 99.37%-99.42%). Importantly, it reduced reviewer workload by 91.7%, with only 1797 abstracts requiring manual review. Beyond replication, the tool identified 32 additional relevant studies that had been missed in the original reviews, representing a 19.6% increase in evidence yield. These findings highlight the potential of Synthesa AI to enhance pharmacological evidence synthesis by improving the reproducibility and comprehensiveness of systematic reviews used to evaluate drug efficacy, safety, and therapeutic positioning. Synthesa AI represents a transformative solution for living systematic reviews and large-scale evidence integration, offering a rigorous and efficient alternative to traditional human-led screening in pharmacology research.
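
The reported figures can be checked with a short calculation. The abstract does not state which interval method was used; the Python sketch below assumes a Wilson score interval, which happens to reproduce the reported lower bound of 97.7% for 163 of 163 studies. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Sensitivity: all 163 benchmark-included studies were identified.
lower, upper = wilson_interval(163, 163)

# Evidence yield: 32 additional studies relative to the 163 benchmark studies.
extra_yield = 32 / 163
```

Under this assumption the lower bound is roughly 0.977 and the upper bound is 1.0, and 32/163 corresponds to the 19.6% increase quoted in the abstract.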

Keywords:  abstract screening, validation; large language models; synthesa

DOI:  https://doi.org/10.1097/FJC.0000000000001768
Cureus. 2026 May;18(5): e108666

Show Your Work: Verbatim Evidence Requirements and Automated Assessment of Large Language Models for Biomedical Text Processing of Trial Eligibility Criteria.

Paul Windisch, Julia Weyrich, Fabio Dennstädt, Daniel R Zwahlen, Robert Förster, Christina Schröder.

  Introduction: Large language models (LLMs) are used for biomedical text processing, but decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable quote affects performance for trial eligibility-scope classification from abstracts.
Methods: We used 200 randomized controlled trials and provided models with the title and abstract. Trials were labeled with whether they allowed for the inclusion of patients with localized and/or metastatic disease. Flagship models from three vendors (OpenAI, Google, and Anthropic) were queried in two conditions: Label-only and label plus a verbatim supporting quote. Models could abstain if they deemed the abstract to not contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings, and a separate judge step used an LLM to rate whether each quote supported the assigned label.
Results: Evidence requirements modestly reduced coverage, i.e., non-invalid non-abstained outputs (GPT-5.2 86.2% to 84.3%, Gemini 3 flash preview 98.3% to 92.8%, Claude Opus 4.5 96.0% to 94.5%) by increasing abstentions and, for Gemini, invalid outputs. Macro-F1 remained high but changed by model (slight gains for GPT-5.2 and Gemini, decrease for Claude). Labels were stable across repetitions (Fleiss' kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage.
Conclusion: Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
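
The mechanical quote check and the coverage measure described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the function names, the "abstain" and "invalid" label strings, the rejection of empty quotes, and the sample abstract are all assumptions.

```python
def quote_is_verbatim(quote: str, abstract: str) -> bool:
    """True if the quote is a non-empty exact substring of the abstract.

    No normalization is applied, so differences in case, spacing, or
    wording make the quote fail the check.
    """
    return bool(quote) and quote in abstract


def coverage(outputs) -> float:
    """Share of outputs that are neither abstentions nor invalid."""
    usable = [o for o in outputs if o["label"] not in ("abstain", "invalid")]
    return len(usable) / len(outputs)


# Hypothetical abstract text and model outputs.
abstract = ("Patients with metastatic castration-resistant prostate cancer "
            "were randomized to receive the study drug or placebo.")
outputs = [
    {"label": "metastatic",
     "quote": "Patients with metastatic castration-resistant prostate cancer"},
    {"label": "metastatic",
     "quote": "patients with metastatic disease"},  # paraphrase, not verbatim
    {"label": "abstain", "quote": ""},
]
flags = [quote_is_verbatim(o["quote"], abstract)
         for o in outputs if o["label"] != "abstain"]
cov = coverage(outputs)
```

A passing substring check shows only that the quote exists in the source. Whether it supports the label is a separate question, which the study addressed with an LLM judge step.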

Keywords:  citation; evidence; explainability; large language models; natural language processing; reproducibility

DOI:  https://doi.org/10.7759/cureus.108666
J Clin Med. 2026 May 30. pii: 4230. [Epub ahead of print]15(11):

Evaluating Large Language Models for Automated Evidence Synthesis in Neuroimaging AI: A Multi-Model Benchmark.

Umid Sulaimanov, Nafiye Sanlier, Ariorad Moniri, Behman Demir, Yerkebulan Serikkanov, Ahmed Rasim Bayramoglu, Maryam Sabah Al-Jebur, Melih Yucel Sanlier, Ugur Erginoglu, Erkin Otles, Simon Gashaw Ammanuel, Abdullah Keles, Ufuk Erginoglu, Mustafa Kemal Baskaya.

  Background: Data extraction for systematic reviews is highly resource-intensive. This study evaluated four frontier large language models (LLMs) on complex structured metadata extraction from specialized neuroimaging artificial intelligence (AI) literature to determine their performance in automated evidence synthesis.
Methods: We compared Google Gemini 3 Pro Preview, Anthropic Claude Opus 4.5, Perplexity Sonar Pro, and OpenAI GPT 5.2. Using a standardized prompt, each model extracted 22 variables from 91 peer-reviewed neuroimaging AI articles. The variables were stratified into low-, medium-, and high-complexity tiers. The performance was measured via the exact-match accuracy against a consensus-based expert ground truth.
Results: The overall exact-match accuracy was moderate. Gemini 3 Pro Preview achieved the highest overall rate (56.4%), followed by Sonar Pro (52.1%), Claude Opus 4.5 (51.3%), and GPT 5.2 (46.5%). Gemini significantly outperformed all other models (p < 0.001). The performance declined dramatically as the variable complexity increased. Across models, the accuracy was 88.9-92.9% for low-complexity categorical fields, 47.0-63.3% for medium-complexity text extraction, and 2.7-15.5% for high-complexity variables requiring clinical judgment or multi-section synthesis. The most common type of error was misclassification. All four models scored 0% on the main performance metric, but this reflected a representational mismatch with the ground truth rather than extraction failure, indicating that the exact-match accuracy underestimates the true semantic performance.
Conclusions: Frontier LLMs can effectively automate the retrieval of simple categorical data, but have serious difficulties with methodological variables that are complex. Although extraction can be fully automated for low-complexity fields, human review remains essential for context-dependent variables that require clinical judgment.
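
The point that exact-match accuracy can understate semantic performance is easy to see in a small Python sketch. The field values and the light normalization step below are hypothetical illustrations, not the study's scoring code.

```python
def exact_match_accuracy(predicted, gold, normalize=False):
    """Fraction of fields where the extracted value equals the ground truth."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    def prep(value: str) -> str:
        # Optional light normalization: trim, collapse spaces, ignore case.
        return " ".join(value.split()).casefold() if normalize else value

    hits = sum(prep(p) == prep(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Hypothetical ground truth and model extractions for three fields.
gold = ["MRI", "Accuracy", "3D CNN"]
predicted = ["MRI", "accuracy ", "3D convolutional neural network"]

strict = exact_match_accuracy(predicted, gold)
relaxed = exact_match_accuracy(predicted, gold, normalize=True)
```

Strict matching scores 1 of 3 here and light normalization scores 2 of 3. The third field still fails although it names the same architecture, which is the kind of representational mismatch the abstract describes.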

Keywords:  artificial intelligence; benchmarking; evidence synthesis; information extraction; large language models; neuroimaging

DOI:  https://doi.org/10.3390/jcm15114230
West J Nurs Res. 2026 Jun 10. 1939459261451723

Artificial Intelligence Applications Versus Manual Methods For Literature Retrieval: A Comparative Analysis.

Jenny O'Rourke, Matthew Byrne, Ginger Schroers.

   BACKGROUND: Artificial intelligence (AI), particularly generative and large language models, is being used in nursing education, practice, and scholarly writing. Generative AI applications have been specifically examined for their use in conducting literature reviews, with evidence supporting reduced production time of scholarly work. However, there has been limited investigation of their accuracy in identifying references for a literature review.
OBJECTIVE: The purpose of this study was to compare the citations in human-written literature reviews with citations generated by AI literature review applications.
METHODS: Using a comparative exploratory design, references from 4 human-written literature reviews, 2 published and 2 unpublished, on 4 different topics, were compared to references derived from 2 AI literature applications, Consensus and Elicit. Three prompting strategies were utilized, including prompts generated using ChatGPT-4. Agreement between the AI and human references was evaluated.
RESULTS: The percent of agreement between AI and human generated reference lists ranged from 0% to 63.6%. The Consensus application had a greater overall mean rate of match (21.3%) as compared to Elicit (3.7%). The use of a ChatGPT-4 prompt did not significantly impact results, and there were no differences based on published or unpublished literature reviews.
CONCLUSION: The 2 literature-based applications examined in this study offered a glimpse of their potential use and limitations. The use of an AI literature review application may support but not replace human work.

Keywords:  artificial intelligence; large language models; literature search; nursing; nursing research

DOI:  https://doi.org/10.1177/01939459261451723
Cochrane Evid Synth Methods. 2026 Jul;4(4): e70086

RETRACTION: Human-in-the-Loop Artificial Intelligence System for Systematic Literature Review: Methods and Validations for the AutoLit Review Software.

[This retracts the article DOI: 10.1002/cesm.70059.].

DOI: https://doi.org/10.1002/cesm.70086
J Orthop Surg (Hong Kong). 2026 May-Aug;34(2): 10225536261459518

Automating methodological quality assessment in orthopedic systematic reviews using large language models.

Yu-Jui Huang, Kai-Cheng Chang, Ying-Chen Kuo, Cheng-Chen Tai, Cheng-Pang Yang.

  Background: Systematic reviews represent the foundation of evidence-based orthopedic practice, yet their methodological rigor relies heavily on accurate and consistent methodological quality assessment. This step remains time-consuming, labor-intensive, and prone to subjectivity. Recent advances in large language models (LLMs) suggest potential for automating parts of evidence synthesis.
Purpose: This study examined whether LLMs can perform AMSTAR-1-based methodological quality assessments in orthopedic systematic reviews with accuracy comparable to human experts.
Methods: Ten sports medicine knee reviews were analyzed using three LLMs (GPT-4o, GPT-5, and GPT Consensus), and their binary responses were compared against expert AMSTAR-1 ratings from a published umbrella review (110 decisions). An external validation set of four reviews published between 2022 and 2025 was included to assess generalizability and safeguard against information leakage.
Results: Agreement with human reviewers reached 87% for GPT-4o, 89% for GPT-5, and 90% for GPT Consensus; all models achieved 84% agreement in the validation set. Concordance was strongest for structured, explicitly reported domains such as a priori design, literature search, and study characteristics, and lowest for judgment-based items including grey literature inclusion, publication bias, and conflict of interest.
Conclusions: Although LLMs cannot yet replace human reviewers, they can serve as reliable adjunct tools to enhance efficiency, transparency, and reproducibility in systematic review workflows within orthopedic research.

Keywords:  artificial intelligence; evidence-based; large language model; sports

DOI:  https://doi.org/10.1177/10225536261459518
Cochrane Evid Synth Methods. 2026 Jul;4(4): e70088

An Open-Source Systematic Reviews Integrated System (OSSYRIS) - Streamlining Processes and Standardising Data Structures.

Xavier Bosch-Capblanch, Christian Auer, A S M Sayem, Guillaume Deschamps, Salvador Camacho, Luís Segura, Salem Al-Aidroos, George Tsey Sabblah, Kaspar Wyss.

   Introduction: Carrying out a systematic review (SR) of the literature entails a high workload and encompasses a variety of very different tasks. The emergence of artificial intelligence tools has brought further opportunities to improve the efficiency and reliability of SRs. SR processes can be optimised to the extent that integration and interoperability of software tools across production stages are progressively implemented. A key stage is data extraction, which can be challenging due to the large number of data items to consider and the variability of studies' reporting styles, which heavily complicates data processing and analyses. We report the development of a software platform that integrates processes across all types of SR tasks, including overviews of SRs, is open source, and addresses the challenges of data extraction through the standardisation of data structures: the "Open-Source SYstematic Reviews Integrated System" (OSSYRIS).
Methods: We established a series of criteria to select the software integrated in OSSYRIS: few applications, covering all SRs production processes, inter-operable and open source. After several trials, we selected Zotero as reference manager, KoboToolbox XLSForms for screening and data extraction and R for analyses and reporting. We integrated all components using Application Programming Interfaces (API) in R. For the data extraction form, we identified content items from our own experience and from the Cochrane handbook. OSSYRIS has been piloted and used in several SRs and overviews carried out by the authors.
Results: In OSSYRIS, references are manually imported in Zotero and are integrated into XLSForms in KoboToolbox, which are used for online screening by reviewers. R automatically downloads the screening results from KoboToolbox and updates the status of the references in Zotero as 'irrelevant', 'included', 'excluded', or 'unclear'. R automatically produces the figure with the PRISMA flow of studies and reference lists by status, for reporting. Data extraction is manually done using another XLSForm structured in sections: study characteristics, participants, intervention or exposure, outcomes, results and conclusion. Data extraction is standardised by using pre-coded data items, filtering data items according to relevance criteria and modularising data structures. Results of studies are entered using a data structure consistent with the information on the type of outcomes entered in a preceding section of the form. Items that require a decision based on certain criteria, such as the type of study or the risk of bias assessments, are not filled in by reviewers; rather, reviewers enter the criteria and OSSYRIS internal algorithms issue the specific type of study design or the risk of bias assessments, based on those criteria. XLSForms provide additional functionalities to ensure data integrity. R automatically produces the characteristics of included studies and other analytical outputs for reporting. Standardisation and modularity facilitate adapting the form for different types of SR.
Conclusions: OSSYRIS provides an open source, integrated system to carry out SRs. Our work may support the promotion of open source and free tools to conduct SRs bringing together a community of practice to further improve it, within Cochrane and beyond.

Keywords:  data extraction; integration; open‐source; systematic reviews

DOI:  https://doi.org/10.1002/cesm.70088
J Surg Res. 2026 Jun 08. pii: S0022-4804(26)00254-4. [Epub ahead of print]324 243-249

Can Artificial Intelligence Replicate Human Qualitative Analysis?

Grayson P Stinger, Jamaica Westfall-Snyder, Stewart R Carter, Sarah A Hayek, Ryan K Shabahang, Katelyn A Young, Mohsen M Shabahang, Christie L Buonpane.

   INTRODUCTION: Qualitative research is essential in surgical education for exploring complex social phenomena. However, thematic analysis is time-intensive, requires methodological expertise, and is inherently subject to interpretive bias. Artificial intelligence (AI) has increasingly been explored as a tool to support qualitative analysis, though its role in interpreting complex, abstract concepts remains unclear. The aim of this study was to evaluate whether generative AI can perform qualitative thematic analysis of abstract concepts by comparing AI-generated themes with human-derived themes.
METHODS: We conducted a secondary comparative analysis using transcripts from two previously completed thematic analyses examining how general surgery residency applicants define wellness and engagement. Human-derived themes were generated using an inductive immersion-crystallization approach. The same deidentified transcripts were then analyzed by ChatGPT version 4.0 using a single prompt to generate four themes per dataset without iterative refinement. Human- and AI-generated themes were compared descriptively for conceptual overlap, alignment and interpretive consistency.
RESULTS: A total of 117 applicants were interviewed. Visual mapping demonstrated substantial conceptual overlap between human- and AI-generated themes for both wellness and engagement, with no unique or contradictory themes identified by AI. Human analysts tended to generate discrete themes separating individual- and group-level constructs. In contrast, AI-generated themes integrated these constructs into broader, relational constructs while preserving core thematic content.
CONCLUSIONS: Generative AI demonstrated meaningful alignment with human thematic analysis. When used as a complementary analytic tool with appropriate human oversight, AI may enhance efficiency and accessibility of qualitative methods in surgical education research without replacing interpretive judgment.

Keywords:  Artificial intelligence; ChatGPT; Generative AI; Qualitative analysis; Thematic analysis

DOI:  https://doi.org/10.1016/j.jss.2026.04.020
Rofo. 2026 Jun 08.

The Role of Large Language Models in Scientific Research.

Tobias Lindner, Marc-André Weber, Mathias Manzke.

Background: Large language models (LLMs) are increasingly being incorporated in scientific research, transforming the landscape across disciplines. Their potential spans various stages of the research process - from automating literature reviews and generating research questions to analyzing complex data sets and synthesizing or evaluating manuscripts. However, their implementation presents challenges, especially for researchers with limited experience using AI.
Materials and Methods: This review describes the potential of LLMs to support and enhance systematic literature searches in scientific research. It analyzes their capabilities and addresses practical and ethical challenges, particularly those concerning scientific integrity, transparency, and reproducibility.
Conclusion: Combining LLMs with human expertise offers a promising avenue to accelerate innovation, drive scientific discovery, and ultimately improve healthcare outcomes. Nevertheless, responsible and informed use is essential to maintain rigorous, ethical, and trustworthy scientific practices.
Key Points:
· LLMs can accelerate systematic literature searches through automation.
· Self-hosted models offer better control, data protection, and domain-specific customization options.
· Ethical challenges require transparency, quality assurance, and responsible use of AI.
Citation Format: Lindner T, Weber MA, Manzke M. The Role of Large Language Models in Scientific Research. Rofo 2026; DOI 10.1055/a-2868-7797.

DOI: https://doi.org/10.1055/a-2868-7797
JMIR Med Inform. 2026 Jun 11. 14 e85960

Topic-Aware Summarization of Lived Health Care Experiences: Large Language Model Evaluation Study.

Maneesh Bilalpur, Megan E Hamm, Young Ji Lee, Natasha G Norman, Kathleen M Mctigue, Yanshan Wang.

   Background: Existing work to understand adults' health care experiences has focused on the analysis of patient feedback provided as written responses to after-visit surveys or social media discourse. Often, such written feedback has been studied using natural language processing techniques, such as topic detection and sentiment analysis, to provide coarse-grained insights. Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in health care outcomes and avenues for improvement. In addition, studying health care experiences using natural language processing techniques has been limited to patients. The experiences of stakeholders, such as caregivers and health care providers, remain underexplored.
Objective: We extract fine-grained insights from health care experiences through narratives collected from patients, caregivers, and health care providers using large language models (LLMs). Topic detection, together with hierarchical summarization of long-form stories from individuals, offers fine-grained insights. Furthermore, the study demonstrates that generated summaries can be evaluated using the LLM-as-a-judge framework and validates the outcomes through comparisons with 2 domain experts.
Methods: Fifty automatically transcribed stories of African American experiences were used to identify topics in their experiences using the latent Dirichlet allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. The generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT-4 model; its reliability was validated against the original story summaries by 2 domain experts.
Results: Whisper-based automatic transcription of audio narrations achieved a Levenshtein score of 6%. Twenty-six topics were identified using LDA and labeled using the LLM in the 50 African American stories. The GPT-4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate-to-high agreement (Bennett S-score of 0.65 or higher). Our approach identified African American experience-relevant topics, such as health behaviors, interactions with medical team members, caregiving, and symptom management, among others. Such insights could help researchers learn efficiently from unstructured datasets, leveraging the communicative power of storytelling.
Conclusions: The use of LDA and LLMs to identify and summarize the experiences of African American individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby improving health outcomes.
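
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Bennett's S for two raters, S = (k·Po − 1)/(k − 1), where Po is the observed proportion of agreement and k the number of rating categories. The function name, the 5-point scale, and the example ratings are illustrative assumptions; the abstract does not state the rating scale or give the underlying ratings.

```python
from collections.abc import Sequence


def bennett_s(ratings_a: Sequence, ratings_b: Sequence, n_categories: int) -> float:
    """Bennett's S for two raters: chance-corrected agreement that assumes
    every one of the n_categories is equally likely by chance."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")
    if n_categories < 2:
        raise ValueError("at least two rating categories are required")
    # Observed agreement: share of items on which the two raters coincide.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
    return (n_categories * observed - 1) / (n_categories - 1)


# Hypothetical ratings on an assumed 5-point scale (not data from the study).
llm_judge = [1, 2, 3, 3, 5]
expert = [1, 2, 3, 4, 5]
print(bennett_s(llm_judge, expert, n_categories=5))
```

Because S depends only on the observed agreement and the number of categories, it is not affected by skewed rating distributions in the way Cohen's kappa can be.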

Keywords:  health disparities; large language models; natural language processing; text summarization; topic modeling; unstructured qualitative data

DOI:  https://doi.org/10.2196/85960
medRxiv. 2026 Jun 04. pii: 2026.06.03.26354854. [Epub ahead of print]

Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models.

Puranjani Das, Jodi Schneider, Evan Mayo-Wilson, Halil Kilicoglu, Joe D Menke, Dongin Nam, Kiran Ninan, Jean-Pierre Oberste, Ang Michael Troy, Xiangji Ying, Arthur W Holt, Neil R Smalheiser.

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons. First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take uncertainty into account and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss that design as well as those that exhibit it.
Materials and Methods: Therefore, we carried out a limited evaluation of TM that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs: cohort, case-control, cross-sectional, and case report.
Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority of cases (86-100%). Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few cases (0-21%). The most confident predictions of TM were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, in which both TM and NLM labels showed high error rates of both omission and commission.
Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
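
The sampling strategy described in the abstract (taking the most confident model predictions that disagree with the NLM assignment) can be sketched roughly as follows. This is an illustration only, not the authors' code: the `Article` record, its field names, the function name, and the toy scores are all hypothetical.

```python
from typing import NamedTuple


class Article(NamedTuple):
    pmid: str           # hypothetical identifier
    tm_score: float     # model's predictive score for one study design
    nlm_assigned: bool  # whether NLM indexing assigned that design


def confident_disagreements(articles, n=100):
    """Return (highest-scored articles NLM did NOT tag,
               lowest-scored articles NLM DID tag), n of each."""
    not_tagged = [a for a in articles if not a.nlm_assigned]
    tagged = [a for a in articles if a.nlm_assigned]
    # Likely NLM omissions: the model is most confident the design is present.
    likely_omissions = sorted(not_tagged, key=lambda a: a.tm_score, reverse=True)[:n]
    # Likely NLM commissions: the model is most confident the design is absent.
    likely_commissions = sorted(tagged, key=lambda a: a.tm_score)[:n]
    return likely_omissions, likely_commissions


# Toy data, invented for illustration.
sample = [
    Article("a", 0.99, False),
    Article("b", 0.50, False),
    Article("c", 0.01, True),
    Article("d", 0.95, True),
    Article("e", 0.97, False),
]
omissions, commissions = confident_disagreements(sample, n=2)
print([a.pmid for a in omissions], [a.pmid for a in commissions])
```

Each selected article would then go to independent human annotators, as in the study, to decide which of the two labels is correct.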

DOI: https://doi.org/10.64898/2026.06.03.26354854
J Food Prot. 2026 Jun 09. pii: S0362-028X(26)00134-1. [Epub ahead of print] 100829

A Retrieval-Augmented Natural Language Interface for Data Description and Meta-Analysis in the Pathogens-in-Foods (PIF) Database.

Lucas Ribeiro Silva, Ursula Gonzales-Barron, Vasco Cadavez.

Food-safety occurrence databases are increasingly important for surveillance, evidence appraisal, and quantitative risk assessment, yet their routine analytical use remains constrained by the need for database literacy and statistical programming. Building on the curated and harmonized Pathogens-in-Foods (PIF) database, we developed and evaluated a retrieval-augmented natural-language interface designed to support grounded querying and reproducible evidence synthesis. The system includes two complementary modes: an Open Chat Mode for exploratory, tool-mediated interrogation of the database and a Guided Meta-Analysis Mode that couples structured user input to a deterministic R-based analytical pipeline. Evaluation included four compact language models: Phi-4 Mini (3.8B), DeepSeek-R1 Tool-Calling (14B), Cogito (14B), and Qwen 3 (8B), together with Gemini 2.5 Pro as a larger proprietary baseline model. Within a 10-query benchmark, all models achieved 100% tool selection accuracy and retrieval correctness; for the five argument-bearing queries, all models also achieved 100% argument extraction F1-score, indicating reliable grounding of database operations for the evaluated query set. In a guided case study on Toxoplasma in meat and meat products (153 records from 65 studies), the system achieved 100% numerical concordance and high visual informativeness; the highest report quality index was 93% with Qwen 3 (8B). Performance differences across models arose primarily from the factual precision and economy of their written interpretations rather than from failures in tool execution. These findings support hybrid, evidence-grounded analytical interfaces built on curated data resources and deterministic statistical backends as practical tools for accelerating surveillance-oriented evidence synthesis in food protection.

Keywords:  data visualization; food safety; meta-analysis; natural language interface; open-weight language models; tool-using agents

DOI:  https://doi.org/10.1016/j.jfp.2026.100829
NPJ Sci Food. 2026 Jun 11.

Artificial intelligence in food safety.

Floor van Meer, Masami Takeuchi, Phillis E Ochieng, Raffaella Tavelli, Arjen Gerssen, Bas H M van der Velden.

Artificial Intelligence (AI) is revolutionizing many fields including food safety. In food safety, the available data generated within the agrifood systems can potentially feed into AI applications. This review presents an overview of the use of AI in different fields of food safety. The study introduces a framework of AI in food safety, focusing on the research domain, application context, data collection method, and AI technique. The AI tool ASReview, which uses active learning to select top-priority papers, was used to conduct the study in a systematic way. Over 150 peer-reviewed journal publications were reviewed and described according to this framework. The review concludes with a summary of the challenges for AI in food safety and an outlook of future opportunities for AI in food safety.

DOI: https://doi.org/10.1038/s41538-026-00925-1
Infect Control Hosp Epidemiol. 2026 Jun 08. 1-8

Updating CMV protocols in lung transplant patients: a single-center case study modeling use of generative AI for antimicrobial stewardship protocol development and economic impact analysis.

Kyle T Enriquez, Augusto Dulanto Chiang, Casey Smiley, Milner Staub.

OBJECTIVE: Antimicrobial Stewardship Programs (ASPs) need healthcare economic analyses to support and inform ASP strategies. This work aimed to determine whether widely available artificial intelligence (AI) platforms like Microsoft Copilot™ could facilitate healthcare economics analyses for ASPs without dedicated healthcare economics support.
DESIGN: AI (Microsoft Copilot™) was prompted to develop a cytomegalovirus prophylaxis protocol for lung transplant recipients using only PubMed-indexed articles. Copilot™ was then prompted to produce probabilistic samples of simulated patients from aggregate statistics of a 165-patient cohort from Vanderbilt University Medical Center and to analyze cost-effectiveness across four distinct cytomegalovirus prophylaxis protocols, including its own.
SETTING: Tertiary care academic medical center, including outpatient and inpatient environments.
PATIENTS OR PARTICIPANTS: Simulated patient data was developed via random, single-blind, probabilistic selection from pre-defined aggregate cohort statistics.
RESULTS: The AI-generated prophylaxis protocol was evidence-based without hallucination, but this conservative protocol relied on outdated evidence and was associated with significant increases in expected per-patient cost (mean +$4740, P < .01) compared to recent guideline-based and institutional protocols. AI independently identified and executed sensitivity analyses, which revealed that in this simplified model, letermovir use had a large impact on expected per-patient cost.
CONCLUSIONS: The AI-proposed protocol was less cost-effective, but data suggest that careful prompting can provide appropriate PubMed-indexed literature to support ASP protocol development. Additionally, Copilot™ provided a thorough cost-effectiveness analysis comparing all potential and existing protocols. With appropriate oversight, AI and Microsoft Copilot™ can conduct healthcare economic analyses suitable for ASP strategic planning and implementation.

DOI: https://doi.org/10.1017/ice.2026.10480