bims-arines 2026-04-19 papers

bims-arines

Biomed News

on AI in evidence synthesis

Issue of 2026–04–19
eighteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD

Comparing the performance of narrow vs. broad search strategies when using machine learning-based software for title/abstract screening.
The Promise of Artificial Intelligence to Aid in Systematic Reviews: An Examination of Scopus AI.
Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study.
Optimizing document retrieval using massive text embeddings and LLM prompt engineering.
Zero-shot interpretable biomedical literature appraisal with generative large language models.
Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract Screening.
Transforming evidence synthesis: A systematic review of the evolution of automated meta-analysis in the age of AI.
Evaluating automated or artificial intelligence search tools for evidence synthesis.
Strategizing AI utilization for psychological literature screening: A comparative analysis of machine learning algorithms and key factors to consider.
Letter regarding "The role of reviewers in the era of systematic reviews and meta-analysis: A practical guide for researchers".
Response to the Letter regarding "The role of reviewers in the era of systematic reviews and meta-analysis: A practical guide for researchers".
Classification of Cochrane Plain Language Summaries by Conclusiveness Using Transformer-Based Models and ChatGPT: Retrospective Observational Study.
Utilizing Artificial Intelligence to create narrative literature reviews.
Comparing five generative AI chatbots' answers to LLM-generated clinical questions with medical information scientists' evidence summaries.
From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs.
Scenario-based evaluation of large language models for reference accuracy in dermatology: literature retrieval on latent tuberculosis in psoriasis patients on anti-IL-17/23 therapy.
AI Methods for Implementation Science (AIM-IS): developing a framework, toolkit, and reporting standard for the responsible use of AI in implementation practice and research.
Are large language models consistent with the ASPS and AAPS guidelines? A comparison of AI chatbot recommendations and plastic surgery clinical guidance.

J Med Libr Assoc. 2026 Apr 01. 114(2): 105-115

Comparing the performance of narrow vs. broad search strategies when using machine learning-based software for title/abstract screening.

Michelle Swab.

   Objectives: To retrospectively evaluate workload implications and recall performance of narrower or broader database search strategies when using active learning screening tools.
Method: A convenience sample of 10 completed reviews was used to assess search strategy performance in ASReview LAB, an open-source systematic review software tool. For each review, a single database search strategy was selected and then revised to either broaden (n = 9) or narrow (n = 1) the scope. Results from both the more sensitive (broader) and more precise (narrower) search strategies were labeled as relevant or irrelevant based on inclusion in the completed review. The labeled result sets were uploaded into the ASReview LAB simulation module, which mimics the process of human screening. Metrics such as number of records screened to reach recall of 95% or more were recorded. The effects of three different stopping rules on workload and recall were also explored.
Results: For quantitative systematic reviews, the difference in absolute screening time required to reach 95% recall between broader or narrower search strategies was minimal (≤35 minutes). In contrast, for qualitative systematic reviews and other review types, broader search strategies led to increased workload. With respect to stopping rules, the time-based stopping heuristic resulted in substantial workload increases when broader search strategies were employed.
Conclusion: Time savings achieved through the use of semi-automated screening tools may not always offset additional screening time required by broader, more sensitive search strategies. Librarians and information specialists should consider a variety of factors when determining the appropriate balance between search sensitivity and specificity in the context of semi-automated screening tools.

Keywords:  AI; Screening Tools; evidence synthesis as topic; machine learning; search strategy development; systematic review as topic

DOI:  https://doi.org/10.5195/jmla.2026.2286
J Adv Pract Oncol. 2026 Mar 04. 1-7

The Promise of Artificial Intelligence to Aid in Systematic Reviews: An Examination of Scopus AI.

Lydia T Madsen, John C Carter, Sheeba Cantanelli, Christine Hong, Angela Bazzell.

Systematic reviews are a critical tool in oncology practice to facilitate informed clinical decision-making, synthesize current research, and guide practice policy. To facilitate early exploration of a literature query or topic, Scopus Artificial Intelligence (AI), which was introduced in 2024 and is subscription-based, provides a new tool for researchers and providers to access current data or begin a systematic review topic exploration. The following article is intended to familiarize advanced practice providers (APPs) with both the recently released AI tool of Scopus AI and associated AI interface capabilities with literature search methodologies. Scopus AI is embedded within the extensive resources of Scopus, an established search engine database. Scopus AI simplifies a topic search by allowing a user to enter the question or phrase in natural language, or ordinary spoken or written language. It then translates the query into a vector and/or keyword search. Scopus AI summarizes the output results to include bullet points, numbered highlights, and conclusions. Associated citations, with internal URL links to articles embedded within the Scopus database, allow for confidence in the output summary. Pivotal or landmark study foundational document citations are also listed. The utilization of AI tools can aid APP researchers and clinicians to expedite steps in the systematic review process. Multiple tools are available to assist the researcher; Scopus AI is one of the tools that can be used to assist in streamlining specific aspects such as the initial tasks and literature search steps of the systematic review development process.

DOI: https://doi.org/10.6004/jadpro.2026.17.7.10
Res Synth Methods. 2026 Apr 14. 1-18

Prompt engineering of large language models for paper screening in medical meta-analyses and systematic reviews: A prospective comparative study.

Till J Adam, Salma A S Abosabie, Max Dittmer, Elise Wolf, Sara A Abosabie, Clara Behnke, Felix Baier, Annabelle Weickmann, Ludwig Köser, Christoph U Correll, Niklas Rutsch.

  Interest in large language models (LLMs) as a tool for meta-analyses and systematic reviews (MA/SRs). We prospectively developed 515 unique prompts by predefined screening-related categories and tested with open-access LLMs (Llama, Mistral) against four gold-standard MA/SRs from different medical fields published after the LLMs' training cut-offs, using a Python-based pipeline. Heterogeneity between prompts was quantified, and hypothetical workload/cost reduction with top-performing prompts calculated. Across 12,360 pipeline runs, LLMs versus MA/SRs reached average recall/sensitivity = 83.6 ± 17.0%, precision = 18.5 ± 15.6%, specificity = 36.6 ± 23.7% F1-score = 27.6 ± 17.2%, and accuracy = 61.1 ± 11.0%. F1-scores were significantly higher when prompts focused on methods (0.78 ± 0.40%), explicitly mentioned MA/SR screening (0.81 ± 0.37%), included the comparison MA/SR's title (5.64 ± 0.37%) or selection criteria (8.05 ± 0.68%), and with more LLM parameters (70b = 4.48 ± 0.31%, 123b = 7.77 ± 0.31%), but lower when screening abstracts instead of titles (-3.67 ± 0.28%). In LLM-base preselection, top-performing F1-score prompts (recall/sensitivity = 72.2%, specificity = 66.1%, precision = 28.6%) would reduce screening demands by 34.5%-37.5%, saving 8.4-8.8 weeks of work and 17,592-18,552. Recall/sensitivity increased with less MA/SR information contrasting F1-score results, which highlights a recall/sensitivity-precision/specificity trade-off. F1-score increased with detailed MA/SR information, while recall/sensitivity increased with shorter, zeroshot prompts. We provide the first prospectively assessed prompt engineering framework for early-stage LLM-based paper screening across medical fields. The publicly available Python pipeline and full prompt list used here support further development of LLM-based evidence synthesis.

Keywords:  evidence-based medicine; large language models; meta-analysis as topic; prompt engineering; systematic reviews as topic

DOI:  https://doi.org/10.1017/rsm.2026.10093
Syst Rev. 2026 Apr 14.

Optimizing document retrieval using massive text embeddings and LLM prompt engineering.

Goran Mitrov, Boris Stanoev, Vladimir Trajkovik, Biljana Risteska Stojkoska, Lasko Basnarkov, Petre Lameski, Martin Kampel, Eftim Zdravevski.

   BACKGROUND: The rapid expansion of digital data poses a unique challenge for retrieving relevant and insightful information efficiently. In particular, the increasing volume of scientific publications has made literature reviews time-consuming. The emergence of large language models (LLMs) offers new opportunities to streamline this process.
METHODS: This paper explores the use of generative artificial intelligence (GenAI) for query reformulation and evaluates the performance of nine massive text embedding models, varying in size and fine-tuning strategies, in the context of document retrieval. We apply multiple prompt engineering techniques to evaluate the ability of LLMs to generate effective queries, comparing them with human-crafted queries. These are used to retrieve documents utilizing nine embedding models. The evaluation is across five datasets using metrics such as recall, average precision, and rank-based measures.
RESULTS: Results show that embedding models fine-tuned for semantic similarity consistently outperform general-purpose models, with UAE Large proving most robust across diverse domains. Furthermore, queries generated using zero-shot and few-shot prompting techniques often surpass the performance of human-formulated queries.
CONCLUSION: These findings highlight the value of integrating LLMs and massive text embeddings to reduce manual effort in literature reviews. GenAI provides a reliable starting point for query formulation, with human input reserved for refinement when needed.

Keywords:  Automated surveys; Document retrieval; Information retrieval; LLMs; Massive text embeddings; Prompt engineering; Systematic review automation; Vector indexes

DOI:  https://doi.org/10.1186/s13643-026-03155-4
JAMIA Open. 2026 Apr;9(2): ooag043

Zero-shot interpretable biomedical literature appraisal with generative large language models.

Fangwen Zhou, Muhammad Afzal, Ashirbani Saha, Rick Parrish, R Brian Haynes, Alfonso Iorio, Cynthia Lokker.

   Objective: This study aims to apply 2 decoder-based Generative Pre-trained Transformer (GPT) models (GPT-4o and GPT-o3-mini) in automating the methodological appraisal of randomized controlled trials (RCTs), under a variety of prompt designs, and to compare their performance to a fine-tuned encoder-only BioLinkBERT model.
Materials and Methods: A stratified random sample of 800 articles from the McMaster Premium LiteratUre Service and Clinical Hedges databases was appraised using 2 prompting schemes: (1) classifier (independent assessment) and (2) verifier (validation of BioLinkBERT) considering either the title and abstract (TIAB) or the full text of an article. Performance was primarily evaluated against human assessments using Matthews correlation coefficient (MCC). Bootstrapping over 1000 iterations was used to estimate 95% CIs.
Results: GPT-4o as a classifier with full text demonstrated comparable performance (MCC 0.429; 95% CI, 0.387-0.470) to BioLinkBERT (MCC, 0.466; 95% CI, 0.409-0.519), drastically outperforming the best GPT-o3-mini scheme (MCC, 0.272; 95% CI, 0.211-0.334). GPT-4o as a verifier with full text showed similar performance (MCC, 0.391; 95% CI, 0.335-0.444). GPT models provided transparent criterion-specific justifications. Performance using TIAB alone markedly decreased for GPT models (MCC, ≤0.100), highlighting dependency on detailed methodological information.
Discussion: GPT-4o effectively automates RCT critical appraisal with comparable performance to specialized fine-tuned models when provided full text, enhancing interpretability and transparency through explicit justifications. Limitations in abstract-level detail suggest complementary roles for fine-tuned models when full texts are unavailable. Future studies should optimize goal-specific prompting to further facilitate adoption in clinical knowledge translation workflows.

Keywords:  GPT; deep learning; evidence-based medicine; explainable AI; natural language processing; text classification

DOI:  https://doi.org/10.1093/jamiaopen/ooag043
Cochrane Evid Synth Methods. 2026 May;4(3): e70082

Batch Size Effects on Mid-2025 State-of-the-Art Large Language Model Performance in Automated Title and Abstract Screening.

Petter Fagerberg, Oscar Sallander, Kim Vikhe Patil, Anders Berg, Anastasia Nyman, Natalia Borg, Thomas Lindén.

   Background: Manual abstract screening is a primary bottleneck in evidence synthesis. Emerging evidence suggests that large language models (LLMs) can automate this task, but their performance when processing multiple references simultaneously in "batches" is uncertain.
Objectives: To evaluate the classification performance of four state-of-the-art LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and GPT-5 mini) in predicting reference eligibility across a wide range of batch sizes for a systematic review of randomized controlled trials.
Methods: We used a gold-standard dataset of 790 references (93 considered relevant) from a published Cochrane Review on stem cell treatment for acute myocardial infarction. Using the public APIs for each model, batches of 1 to 790 references were submitted to classify each as "Include" or "Exclude." Performance was assessed using sensitivity and specificity, with internal validation conducted through 10 repeated runs for each model-batch combination.
Results: Gemini 2.5 Pro was the most robust model, successfully processing the full 790-reference batch. In contrast, GPT-5 failed at batches ≥400, while GPT-5 mini and Gemini 2.5 Flash failed at the 790-reference batch. Overall, all models demonstrated strong performance within their operational ranges, with two notable exceptions: Gemini 2.5 Flash showed low initial sensitivity at batch 1, and GPT-5 mini's sensitivity degraded at higher batch sizes (from 0.88 at batch 200 to 0.48 at batch 400). At a practical batch size of 100, Gemini 2.5 Pro achieved the highest sensitivity (1.00, 95% CI 1.00-1.00), whereas GPT-5 delivered the highest specificity (0.98, 95% CI 0.98-0.98).
Conclusion: State-of-the-art LLMs can effectively screen multiple abstracts per prompt, moving beyond inefficient single-reference processing. However, performance is model-dependent, revealing trade-offs between sensitivity and specificity. Therefore, batch size optimization and strategic model selection are important parameters for successful implementation.

Keywords:  AI; ChatGPT; Gemini; artificial intelligence; diagnostic test accuracy; large language models; literature screening; meta‐analysis; systematic review; validation

DOI:  https://doi.org/10.1002/cesm.70082
Res Synth Methods. 2026 May;17(3): 403-450

Transforming evidence synthesis: A systematic review of the evolution of automated meta-analysis in the age of AI.

Lingbo Li, Anuradha Mathrani, Teo Susnjak.

  Exponential growth in scientific literature has heightened the demand for efficient evidence-based synthesis, driving the rise of the field of automated meta-analysis (AMA) powered by natural language processing and machine learning. This PRISMA systematic review introduces a structured framework for assessing the current state of AMA, based on screening 13,216 papers (2006-2024) and analyzing 61 studies across diverse domains. Findings reveal a predominant focus on automating data processing (52.5%), such as extraction and statistical modeling, while only 16.4% address advanced synthesis stages. Just one study (approximately 2%) explored preliminary full-process automation, highlighting a critical gap that limits AMA's capacity for comprehensive synthesis. Despite recent breakthroughs in large language models and advanced AI, their integration into statistical modeling and higher-order synthesis, such as heterogeneity assessment and bias evaluation, remains underdeveloped. This has constrained AMA's potential for fully autonomous meta-analysis (MA). From our dataset spanning medical (67.2%) and non-medical (32.8%) applications, we found that AMA has exhibited distinct implementation patterns and varying degrees of effectiveness in actually improving efficiency, scalability, and reproducibility. While automation has enhanced specific meta-analytic tasks, achieving seamless, end-to-end automation remains an open challenge. As AI systems advance in reasoning and contextual understanding, addressing these gaps is now imperative. Future efforts must focus on bridging automation across all MA stages, refining interpretability, and ensuring methodological robustness to fully realize AMA's potential for scalable, domain-agnostic synthesis.

Keywords:  AI-driven meta-analysis; automated evidence synthesis; automated meta-analysis (AMA); large language models for meta-analysis; scalable meta-analysis; systematic reviews

DOI:  https://doi.org/10.1017/rsm.2025.10065
J Med Libr Assoc. 2026 Apr 01. 114(2): 171-172

Evaluating automated or artificial intelligence search tools for evidence synthesis.

Robin Featherstone.

  To advance information retrieval science for producing evidence syntheses at Canada's Drug Agency, the Research Information Services team developed a replicable process to evaluate automated or artificial intelligence (AI) search tools. The team inventoried 51 tools in the fall of 2023 and built a flexible evaluation instrument to inform adoption decisions and enable comparison between tools. Building on this foundational evaluation work, the team further conducted a comparative analysis on three top-ranked tools in the fall of 2024. The investigation confirmed that these automated or AI tools have inconsistent and variable performance for the range of information retrieval tasks performed by Information Specialists at Canada's Drug Agency. Implementation recommendations from this study informed a "fit for purpose" approach where Information Specialists leverage automated or AI search tools for specific tasks or project types.

Keywords:  Artificial Intelligence; Automation; Generative Artificial Intelligence; Information Sciences; Information Storage and Retrieval; Large Language Models; Review Literature as Topic

DOI:  https://doi.org/10.5195/jmla.2026.2341
Res Synth Methods. 2026 May;17(3): 451-482

Strategizing AI utilization for psychological literature screening: A comparative analysis of machine learning algorithms and key factors to consider.

Lars König, Steffen Zitzmann, Martin Hecht.

  With the rapid growth of scholarly literature, efficient artificial intelligence (AI)-aided abstract screening tools are becoming increasingly important. This study evaluated 10 different machine learning (ML) algorithms used in AI-aided screening tools for ordering abstracts according to their estimated relevance. We focused on assessing their performance in terms of the number of abstracts required to screen to achieve a sufficient detection rate of relevant articles. Our evaluation included articles screened with diverse inclusion and exclusion criteria. Crucially, we examined how characteristics of the screening data-such as the proportion of relevant articles, the overall frequency of abstracts, and the amount of training data-impacted algorithm effectiveness. Our findings provide valuable insights for researchers across disciplines, highlighting key factors to consider when selecting an ML algorithm and determining a stopping point for AI-aided screening. Specifically, we observed that the algorithm combining the logistic regression (LR) classifier with the sentence-bidirectional encoder representations from transformers (SBERT) feature extractor outperformed other algorithms, demonstrating both the highest efficiency and the lowest variability in performance. Nonetheless, the algorithm's performance varied across experimental conditions. Building on these findings, we discuss the results and provide practical recommendations to assist users in the AI-aided screening process.

Keywords:  AI-aided literature screening; machine learning; meta-analysis; stopping criteria; systematic reviews

DOI:  https://doi.org/10.1017/rsm.2025.10053
Biomol Biomed. 2026 Apr 13.

Letter regarding "The role of reviewers in the era of systematic reviews and meta-analysis: A practical guide for researchers".

Himel Mondal.

This correspondence addresses three significant concerns regarding the current peer review process for systematic reviews and meta-analyses. First, while artificial intelligence tools can enhance language and readability, their implementation necessitates transparent disclosure and diligent human oversight, as AI-generated content may contain errors, fabricated references, or misleading interpretations. Second, an overreliance on text similarity reports may promote unnecessary paraphrasing of standardized methodological descriptions, leading to unclear or convoluted phrasing without enhancing scientific originality. Third, the verification of references has increasingly burdened reviewers due to inaccurate citations and repeated security barriers encountered during source verification, which further prolongs the review process and exacerbates reviewer fatigue. We contend that journals and publishers should enhance editorial screening, utilize responsible similarity and reference-checking tools, provide clearer guidelines for systematic review and meta-analysis methods sections, and improve access systems to facilitate efficient and reliable peer review.

DOI: https://doi.org/10.17305/bb.2026.14264
Biomol Biomed. 2026 Apr 13.

Response to the Letter regarding "The role of reviewers in the era of systematic reviews and meta-analysis: A practical guide for researchers".

Emir Begagić, Faruk Skenderi, Semir Vranić.

This response to the letter expands the discussion on the evolving demands of peer review for systematic reviews and meta-analyses. We emphasize that the main concern surrounding artificial intelligence is not its limited and disclosed use for language support, but undisclosed application and insufficient human verification, which may compromise citation accuracy, interpretation, and overall trustworthiness. We also argue that similarity reports should be interpreted contextually, particularly in evidence syntheses where standardized methodological language is unavoidable, and that low similarity does not necessarily exclude manuscript manipulation. Finally, we highlight reference verification as a central research-integrity challenge that should not rest on peer reviewers alone. Preserving the credibility of evidence synthesis requires shared responsibility across authors, reviewers, editors, and publishers.

DOI: https://doi.org/10.17305/bb.2026.14271
JMIR Med Inform. 2026 Apr 14. 14 e72657

Classification of Cochrane Plain Language Summaries by Conclusiveness Using Transformer-Based Models and ChatGPT: Retrospective Observational Study.

Antonija Mijatović, Luka Ursić, Nensi Bralić, Ružica Bandić, Barbara Ćaćić, Ivan Buljan, Ana Marušić.

   Background: Cochrane plain language summaries (PLSs) aim to make systematic review findings more accessible to the general public. However, inconsistencies in how conclusions are presented may impact comprehension and decision-making. Classifying PLSs based on conclusiveness can improve clarity and facilitate informed health decisions.
Objective: This study aimed to develop and evaluate deep learning language models for the classification of PLSs according to 3 levels of conclusiveness (conclusive, inconclusive, and unclear) and to compare their performance with a general-purpose large language model (GPT-4o).
Methods: We used a publicly available dataset containing 4405 Cochrane PLSs of systematic reviews published until 2019, already classified by humans according to 9 categories of conclusiveness regarding the intervention's effectiveness or safety. We merged these categories into 3 classes based on the strength of conclusiveness: conclusive, inconclusive, and unclear. For the fine-tuning, we used Scientific Bidirectional Encoder Representations from Transformers (SciBERT), a pretrained language model trained on 1.14 million papers primarily from the health sciences, and Longformer, a transformer model designed specifically to process long documents. The script was developed using the Python programming language and the PyTorch framework. We computed evaluation metrics using the scikit-learn machine learning library and determined the area under the curve of the receiver operating characteristic (AUCROC) to measure the model performance in balancing sensitivity and specificity. We also analyzed a separate set of 213 PLSs and compared the predictions of our pretrained models with both manual verification and outputs generated by ChatGPT.
Results: The model based on SciBERT achieved a balanced accuracy of 56.6%. The AUCROC was 0.91 for "conclusive," 0.67 for "inconclusive," and 0.75 for "unclear" conclusiveness classes. The Longformer-based model had a balanced accuracy of 60.9%, with AUCROCs of 0.86 for "conclusive," 0.67 for "inconclusive," and 0.72 for "unclear" conclusiveness classes. Both models underperformed compared with ChatGPT, which demonstrated higher accuracy (74.2%), better precision and recall, and a higher Cohen κ (0.57).
Conclusions: Fine-tuning 2 transformer-based language models showed mixed results in classifying Cochrane PLSs by conclusiveness, likely due to semantic overlap and subtle linguistic differences. Despite satisfactory internal test metrics, the fine-tuned models failed to generalize to newly published PLSs, where performance dropped to near-chance levels. These findings suggest that general-purpose large language models like GPT-4o may currently offer more reliable results for practical classification tasks in biomedical applications.

Keywords:  Longformer; PLS; SciBERT; Scientific Bidirectional Encoder Representations from Transformers ; fine-tuning; large language models; plain language summary

DOI:  https://doi.org/10.2196/72657
Einstein (Sao Paulo). 2026 ;pii: S1679-45082026000101407. [Epub ahead of print]24 eRW1165

Utilizing Artificial Intelligence to create narrative literature reviews.

Auro Del Giglio, Mateus Uérlei Pereira da Costa.

This study explores the potential impact of Artificial Intelligence on narrative literature reviews in academic research. The literature review process involves finding, analyzing, and synthesizing relevant literature and is crucial for situating new research within existing knowledge. The integration of Artificial Intelligence tools, specifically Large Language Models such as the Generative Pre-Trained Transformer series, can significantly improve the efficiency and effectiveness of this process. This paper outlines the steps involved in conducting a literature review and examines how Artificial Intelligence tools can aid in identifying research gaps, organizing and analyzing retrieved articles, and writing the review. In the literature review process, we provide examples of both free and commercially available Artificial Intelligence software to demonstrate their potential applications.

DOI: https://doi.org/10.31744/einstein_journal/2026RW1165
J Med Libr Assoc. 2026 Apr 01. 114(2): 94-104

Comparing five generative AI chatbots' answers to LLM-generated clinical questions with medical information scientists' evidence summaries.

Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Jing Su, Dario A Giuse, Poppy A Krump, Nunzia B Giuse.

   Objective: To compare answers to clinical questions between five publicly available large language model (LLM) chatbots and information scientists.
Methods: LLMs were prompted to provide 45 PICO (patient, intervention, comparison, outcome) questions addressing treatment, prognosis, and etiology. Each question was answered by a medical information scientist and submitted to five LLM tools: ChatGPT, Gemini, Copilot, DeepSeek, and Grok-3. Key elements from the answers provided were used by pairs of information scientists to label each LLM answer as in Total Alignment, Partial Alignment, or No Alignment with the information scientist. The Partial Alignment answers were also analyzed for the inclusion of additional information.
Results: The entire LLM set of answers, 225 in total, were assessed as being in Total Alignment 20.9% of the time (n=47), in Partial Alignment 78.7% of the time (n=177), and in No Alignment 0.4% of the time (n=1). Kruskal-Wallis testing found no significant performance difference in alignment ratings between the five chatbots (p=0.46). An analysis of the partially aligned answers found a significant difference in the number of additional elements provided by the information scientists versus the chatbots per Wilcoxon-Rank Sum testing (p=0.02).
Discussion: Five chatbots did not differ significantly in their alignment with information scientists' evidence summaries. The analysis of partially aligned answers found both chatbots and information scientists included additional information, with information scientists doing so significantly more often. An important next step will be to assess the additional information, both from the chatbots and the information scientists for validity and relevance.

Keywords:  LLMs; Large Language Models; artificial intelligence; biomedical informatics; chatbots; evidence synthesis; generative AI; information science; library science

DOI:  https://doi.org/10.5195/jmla.2026.2333
BMC Med Res Methodol. 2026 Apr 14.

From chaos to clarity: schema-constrained AI for auditable biomedical evidence extraction from full-text PDFs.

Pouria Mortezaagha, Joseph Shaw, Bowen Sun, Arya Rahgozar.

   BACKGROUND: Biomedical evidence synthesis depends on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles. These variables are predominantly embedded in complex scientific PDFs that interleave multi-column text, tables, figures, and captions, making manual abstraction time-intensive, error-prone, and increasingly impractical at the scale of contemporary systematic reviews. Despite advances in layout-aware and multimodal document models, end-to-end extraction systems suitable for evidence synthesis remain constrained by limited throughput, OCR error propagation, and insufficient auditability.
METHODS: We propose a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into page-level and caption-aware chunks, and processed asynchronously under explicit concurrency and rate-limiting controls. A high-accuracy OCR model is guided by multiple domain-specific schemas covering bibliographic metadata, study design, populations, laboratory assays, timing and thresholds, clinical outcomes, and diagnostic performance. Chunk-level outputs are deterministically merged into study-level records using controlled vocabularies, conflict-aware handling of scalar fields, set-based aggregation of list-valued fields, and sentence-level evidence capture to enable traceability and post-hoc audit.
RESULTS: Applied to a corpus of 734 biomedical articles on direct oral anticoagulant (DOAC) level measurement, the pipeline processed all documents without manual intervention while maintaining stable throughput. Schema-constrained extraction exhibited strong internal consistency, with sentence-level provenance populated for nearly all supported decisions. Iterative schema and prompt refinement yielded substantial improvements in extraction fidelity, particularly for outcome definitions, assay classification, and global coagulation testing. Outputs included reproducible CSV/Parquet datasets and caption-aware multimodal markdown reconstructions supporting efficient expert review.
CONCLUSIONS: Schema-constrained AI extraction enables scalable and auditable extraction of structured evidence from heterogeneous scientific PDFs. By combining deterministic chunking, asynchronous orchestration, controlled vocabularies, sentence-level provenance, and aggregated analytical outputs, the proposed pipeline aligns modern document understanding capabilities with the transparency, reproducibility, and reliability demands of biomedical evidence synthesis.

Keywords:  Auditability; Biomedical evidence synthesis; Biomedical natural language processing; Multimodal document analysis; OCR-based document understanding; Provenance-aware extraction; Schema-constrained AI; Scientific PDF information extraction; Systematic reviews

DOI:  https://doi.org/10.1186/s12874-026-02847-8
Cutan Ocul Toxicol. 2026 Apr 16. 1-6

Scenario-based evaluation of large language models for reference accuracy in dermatology: literature retrieval on latent tuberculosis in psoriasis patients on anti-IL-17/23 therapy.

Nihal Altunisik, Sibel Altunisik Toplu, Dursun Turkmen.

   BACKGROUND: Large language models (LLMs) could accelerate clinical literature searches, but their reliability is compromised by "hallucinations" generating false references. This study compared three general-purpose LLMs using a standardized dermatology literature retrieval prompt for reference accuracy, relevance, and hallucination rates.
METHODS: A clinical scenario on latent tuberculosis management in psoriasis patients on IL-17/23 inhibitors was defined. To establish a reference standard, references (n=74) from the two most recent and comprehensive systematic reviews on the topic were screened. These two reviews were selected as they represented the most current and complete syntheses of evidence on this clinical question; using their reference lists ensured a focused, expert-validated foundation for evaluating LLM outputs. This process yielded 16 studies directly addressing the scenario. Each LLM (ChatGPT, Gemini, Deepseek-V3.2) was prompted to list 15 recent specific references. The 45 retrieved references were manually validated as: "True and Relevant," "True but Irrelevant/General," or "False/Hallucination." Distributions were compared using Pearson's chi-square test.
RESULTS: A significant difference was found between models (p<0.010). ChatGPT listed 80.0% (12/15) correct and relevant references with no hallucinations. Gemini produced 80.0% (12/15) hallucinations, while Deepseek-V3.2 generated 100.0% fictional references. Notably, 4 references ChatGPT found correct were valid articles overlooked in the predefined pool; these were verified as relevant, indicating the reference standard may not have been exhaustive.
CONCLUSION: LLM performance varies considerably with high hallucination risk. Findings highlight caution and independent verification. Future research should test advanced query techniques and hybrid systems integrating LLMs with academic databases.

Keywords:  ChatGPT; Deepseek-V3.2; Gemini; Large language models; artificial intelligence; dermatology; hallucinations; latent tuberculosis; literature review; psoriasis

DOI:  https://doi.org/10.1080/15569527.2026.2656177
Implement Sci. 2026 Apr 14.

AI Methods for Implementation Science (AIM-IS): developing a framework, toolkit, and reporting standard for the responsible use of AI in implementation practice and research.

Guillaume Fontaine, Susan Michie, Rinad S Beidas, Elvin Geng, Christine Fahim, Byron J Powell, Vivian Welch, James Thomas, Jeffery Chan, Samira Abbasgholizadeh-Rahimi, France Légaré, Janna Hastings, Sylvie D Lambert, Justin Presseau, Sharon E Straus, Ruopeng An, Ashrita Saran, Natalie Taylor.

   BACKGROUND: Artificial intelligence (AI), including machine learning, natural language processing, and large language models, may support implementation practice and research in tasks such as evidence synthesis, determinant assessment, strategy selection, monitoring, adaptation, and theory development. However, these applications of AI do not form a single, uniform category. They span a continuum from practice-facing applications that support local implementation work to research- and methods-facing applications that support evidence generation and synthesis. The guidance on how to classify, evaluate, and report these uses of AI remains limited. The AI Methods for Implementation Science (AIM-IS) program aims to develop, validate, and maintain a suite of products to guide the responsible use of AI across implementation practice, implementation research, and bridging use cases.
METHODS: AIM-IS is a multi-phase, multi-method methodological development program. The unit of analysis is the AI-for-implementation use case: a specific AI capability supporting a defined implementation practice or research task within a workflow, decision point, and governance context. Phase 1 is a living scoping review mapping published AI use cases in implementation science, including how they are evaluated and what risks they raise. Phase 2 is a qualitative interview study with implementation researchers, practitioners, AI experts, community members, and data infrastructure and governance experts to refine use cases and identify feasibility constraints, outcome priorities, and reporting needs. Phase 3 will integrate findings from Phases 1 and 2 to develop the draft AIM-IS products, including a framework, a taxonomy of use cases, guardrails for responsible use, a practical guide, outcome domains, and reporting items. Phase 4 will use an eDelphi process and consensus meeting to refine and finalize these products. Phase 5 will conduct usability testing to improve clarity and ease of use, resulting in the finalized AIM-IS products. AIM-IS is informed by implementation science, sociotechnical systems, equity, and responsible AI frameworks, and includes a living-update approach to support ongoing refinement.
DISCUSSION: The AIM-IS program will deliver a suite of products, including a framework, toolkit and reporting standard, to support the specification, governance, evaluation, and reporting of AI in implementation science. Together, these products aim to strengthen transparency, comparability, accountability, and attention to equity in how AI is used by implementation practitioners and researchers over time.
REGISTRATION: Open Science Framework, March 15, 2026: https://doi.org/10.17605/OSF.IO/BX35K.

Keywords:  Artificial intelligence; Generative AI; Implementation practice; Implementation research; Large language models; Machine learning; Methodology; Reporting guideline

DOI:  https://doi.org/10.1186/s13012-026-01503-5
J Plast Reconstr Aesthet Surg. 2026 Mar 16. pii: S1748-6815(26)00146-4. [Epub ahead of print]116 215-222

Are large language models consistent with the ASPS and AAPS guidelines? A comparison of AI chatbot recommendations and plastic surgery clinical guidance.

Raeesa Kabir, Savannah C Braud, Chandler S Hinson, Rahim S Nazerali.

   INTRODUCTION: Assessing the ability of AI chatbots to provide information consistent with clinical guidelines is essential for evaluating the accuracy of the information that patients may receive. We evaluated the ability of three widely used chatbots to reference and respond to clinical questions in alignment with the American Society of Plastic Surgeons' (ASPS) clinical guidelines.
METHODS: Evidence-based clinical practice guidelines from ASPS and the American Association of Plastic Surgeons (AAPS) were used to develop prompts for ChatGPT-4, Meta Llama 3.1, and Microsoft Copilot. Reviewers determined if the chatbots' answer aligned with the ASPS guidelines. Any reference to ASPS by the chatbots was recorded. Descriptive statistics were used for data analysis.
RESULTS: Forty-nine total recommendations from five clinical guidelines were included: reduction mammoplasty, autologous breast reconstruction, breast-implant associated anaplastic large cell lymphoma, eyelid surgery, and reconstruction after skin cancer. Copilot cited ASPS recommendations most frequently (Copilot: 67.3%, Llama: 34.7%, ChatGPT: 16.3%; p<0.0001) and had the highest rate of ASPS- and AAPS-aligned responses (Copilot: 79.6%, Llama: 73.5%, ChatGPT: 69.4%; p>0.05). Among the misaligned responses, neutral responses were most common with no significant differences among the chatbots (Copilot: 60%, Llama: 69.2%, ChatGPT: 40%; p=0.62).
CONCLUSION: In our study, up to 30% of chatbot responses did not align with ASPS and AAPS guidance. These results indicate a need for advocacy from plastic surgery societies regarding patient reliance on AI chatbots and training AI models specific to the specialty.

Keywords:  AI; AI chatbots; AI in plastic surgery; Artificial intelligence; LLM; Large language models

DOI:  https://doi.org/10.1016/j.bjps.2026.03.009