bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-05-10
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Clin Epidemiol. 2026 Apr 30:112309. pii: S0895-4356(26)00184-8. [Epub ahead of print]
       OBJECTIVE: To compare accuracy, precision, recall, F1 and time spent using commercial tools to identify physiotherapy trials based on title and abstract, compared with a human approach.
    STUDY DESIGN: This study compared two approaches for title and abstract screening of 10,793 newly published records. In the reference standard human approach, two reviewers independently screened records using pre-specified rules to assess relevance to physiotherapy. A third person resolved disagreements. We evaluated three LLMs (gpt-4o, gpt-4.5, gpt-4-turbo) within two commercial, web-based tools (ChatGPT and Copilot). Outcomes were accuracy (the proportion of records that the model correctly identified as relevant or irrelevant), precision (the proportion of records identified as relevant that were also considered relevant by the human approach), recall (the proportion of all actually relevant records that the model successfully identified), F1 (the harmonic mean of precision and recall) and time spent. Exploratory analyses compared the performance of the commercial tools with local approaches, including local LLM implementations, machine learning and natural language processing.
    RESULTS: Commercial tools showed comparable performance across all metrics (ChatGPT vs Copilot: accuracy: 83% vs 86%; precision: 44% vs 48%; recall: 88% vs 87%; F1: 59% vs 62%). The total time spent using commercial tools with a labelled dataset was equivalent to 37% of the time required for the human-only screening process. Exploratory analysis showed that the API-based implementation had comparable performance (accuracy: 82%; precision: 42%; recall: 93%; F1: 58%). However, LLM-based models demonstrated lower performance than other local, custom-adapted automation approaches such as machine learning and natural language processing.
    CONCLUSION: This proof-of-concept study demonstrates that commercial web-based LLMs may have sufficient accuracy to support title and abstract screening and substantially reduce the time to identify field-specific trials. However, alternative approaches, including machine learning or natural language processing, could achieve screening performance similar to or slightly higher than that of commercial tools, yet they require a series of pre-processing steps for implementation.
    Keywords:  large language model; physiotherapy; rehabilitation
    DOI:  https://doi.org/10.1016/j.jclinepi.2026.112309
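The screening metrics reported above (accuracy, precision, recall, F1) all derive from a 2x2 confusion matrix of model decisions against the human reference standard. A minimal sketch of the arithmetic, using illustrative counts rather than the study's data:

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts.

    tp: relevant records the model flagged as relevant
    fp: irrelevant records the model flagged as relevant
    tn: irrelevant records the model flagged as irrelevant
    fn: relevant records the model missed
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # how many flagged records were truly relevant
    recall = tp / (tp + fn)      # how many relevant records were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```

For example, 88 true positives, 112 false positives and 12 false negatives give precision 0.44, recall 0.88 and F1 of roughly 0.59, the same shape as the ChatGPT figures reported above.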
  2. Front Artif Intell. 2026;9:1745928
      Generative artificial intelligence (GenAI) chatbots powered by large language models (LLMs) are becoming increasingly integrated into health and medical research workflows, offering researchers new tools to enhance efficiency, support innovation, and assist with knowledge translation. Although their use in health and medical research is expanding rapidly, the practical application of these tools across the broader health and medical research landscape remains complex and evolving. Health and medical researchers often engage with complex study designs, theoretical frameworks, and population needs, all of which demand thoughtful, effective and responsible use of AI tools. This 10-chapter guide serves as a practical, evidence-informed resource for health and medical researchers to engage effectively and responsibly with GenAI chatbots through the practice of prompt engineering: the design of clear, structured, and purposeful prompts that guide GenAI chatbot outputs. It presents strategies to improve prompt quality and adapt GenAI chatbot interactions to the varied methodological and disciplinary contexts found across health and medical research. The article outlines a structured framework for how GenAI chatbots can be applied throughout the research cycle, including research question development, study design, literature searching, querying for appropriate reporting guidelines and appraisal tools, quantitative and qualitative data analysis, writing and dissemination, and implementation. AI-generated content should be treated as a preliminary draft and must always be reviewed, verified against credible sources, and aligned with disciplinary standards. Risks such as hallucinated content, embedded biases, and ethical challenges are addressed, particularly in sensitive or high-stakes settings. Transparency in AI use and researcher accountability are essential.
While GenAI chatbots have the potential to expand access to research support and foster innovation, they cannot replace critical thinking, methodological rigour, or contextual understanding. Instead, they should augment, not replace, human expertise. This guide encourages effective and responsible use of GenAI chatbots and supports their thoughtful integration into the health and medical research process.
    Keywords:  GenAI chatbot; artificial intelligence; chatbot; generative artificial intelligence; medical research; prompt engineering; research process; scientific process
    DOI:  https://doi.org/10.3389/frai.2026.1745928
  3. J Clin Med. 2026 Apr 08;15(8):2830. [Epub ahead of print]
      Background: Periprosthetic joint infection (PJI) remains a devastating complication following arthroplasty. Systematic reviews of PJI provide essential evidence to inform clinical practice; however, the screening process remains labor-intensive. Recent advancements in large language models (LLMs) offer potential for automating literature screening, though evaluation of current generation models is needed. Methods: This validation study evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro in replicating the title/abstract and full-text screening stages of a published systematic review on intraosseous versus intravenous antibiotic prophylaxis in total joint arthroplasty. Title/abstract screening was performed on 165 articles, followed by a full-text eligibility assessment of 26 articles. Accuracy, sensitivity, specificity, and Cohen's kappa (κ) were calculated against human screening decisions as the gold standard. Results: In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1-92.7%) and specificity (98.6-99.3%), while GPT-5 demonstrated the highest sensitivity (84.6-96.1%). In full-text screening, Gemini 2.5 Pro showed the most consistent performance across repeated evaluations (κ = 0.839 in both trials), whereas GPT-5 Pro exhibited marked intra-model variability (κ = 0.399 to 0.920). Conclusions: Current-generation LLMs achieve near-human accuracy in systematic review screening for PJI research, though substantial intra-model variability underscores the continued need for human oversight in systematic review workflows.
    Keywords:  artificial intelligence assisted screening; large language model; periprosthetic joint infection; systematic review; total joint arthroplasty
    DOI:  https://doi.org/10.3390/jcm15082830
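The Cohen's kappa values reported above measure agreement between two sets of screening decisions after correcting for agreement expected by chance. A minimal sketch of the calculation, on illustrative label lists rather than the study's screening decisions:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined when expected agreement is 1 (both raters give one constant,
    identical label); this sketch does not guard against that edge case.
    """
    n = len(rater1)
    # Observed proportion of items on which the two raters agree
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement from each rater's marginal label frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum((c1[label] / n) * (c2[label] / n) for label in c1 | c2)
    return (p_obs - p_exp) / (1 - p_exp)
```

Kappa of 1.0 indicates perfect agreement and 0.0 indicates agreement no better than chance, which is why the stable 0.839 and the 0.399-to-0.920 swing reported above read so differently.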
  4. Prev Vet Med. 2026 Apr 28;254:106908. pii: S0167-5877(26)00127-3. [Epub ahead of print]
      Evidence synthesis is essential for summarizing existing knowledge and identifying research gaps in animal health, but study screening is resource-intensive and time-consuming. This study evaluated the feasibility of a rule-based large language model (LLM) screening framework developed to support protocol-aligned study selection in a veterinary scoping review on feed additives in calves. The framework was compared with consensus decisions from three independent human reviewers across two screening stages: title-and-abstract screening and full-text screening. Agreement between framework-generated and human consensus decisions was assessed descriptively, using human consensus as the operational reference standard. At title-and-abstract screening, overall agreement was 96.8% (211/218 records), with seven discordant decisions. At full-text screening, overall agreement was 97.5% (39/40 records), with one discordant exclusion. Discrepancies in stage 1 reflected ambiguity or incomplete reporting, as well as differences in eligibility criteria interpretation relative to the intended screening scope. In stage 2, discrepancies occurred when the framework diverged from human decisions because information was distributed across narrative and tabular elements that required integration. Operational screening time was substantially shorter for the LLM screening framework than for the human screening process. These findings support the feasibility of a structured, rule-based LLM screening framework as a decision-support tool for veterinary evidence synthesis when implemented with predefined eligibility criteria, documented prompts, and human oversight.
    Keywords:  animal health; evidence synthesis; large language models; scoping review; study screening; veterinary medicine
    DOI:  https://doi.org/10.1016/j.prevetmed.2026.106908
  5. Implement Sci Commun. 2026 May 06.
     INTRODUCTION: Thematic coding helps researchers characterize intervention implementation in embedded pragmatic clinical trials (ePCTs), particularly interventions for older adults with dementia and care partners. However, manual coding is time-consuming, requiring multiple researchers. Because implementation science relies on systematic identification of determinants, barriers, and facilitators, advances in Artificial Intelligence (AI), specifically large language models (LLMs), may meaningfully automate this process and accelerate implementation evaluations within ePCTs. We developed and tested an automated algorithm using GPT-4o and GPT-4o-mini to achieve human-level performance in coding interview transcripts.
    METHODS: We created a Python-based system that uses LLMs to process and code semi-structured interview transcripts about implementation challenges in translating dementia interventions into healthcare systems. The system matches excerpts to an existing codebook. Multiple iterations, including expert review, were used to refine accuracy and efficiency.
    RESULTS: The LLM consistently coded more excerpts than humans. In the third iteration (V3), the LLM captured 61.7% of human-coded excerpts, with matching rates reaching as high as 72.6% for individual transcripts. Matching was higher for descriptive codes, 63.7%, than interpretive codes, 57.7%. The LLM identified 206 correctly coded excerpts that human coders missed. In the fourth iteration (V4), GPT-4o outperformed GPT-4o-mini: descriptive code matching reached 89% (e.g. "Site Characteristics"), compared to 69% for GPT-4o-mini with the R1+R2 85% threshold. GPT-4o showed a weak but positive correlation (r = 0.230) between transcript word count and matching agreement, while GPT-4o-mini showed a moderate but negative correlation (r = -0.452). The LLM workflow yielded a 97% reduction in time and a 99% reduction in cost per transcript.
    CONCLUSION: This study compared an LLM-powered workflow with human coding for thematic analysis. The LLM aligned strongly with human coders. While error rates necessitate human oversight, the time and cost reductions, and the ability to identify missed excerpts, make it a potentially reliable supplementary tool. Although ePCTs and implementation science share complementary goals, they differ in focus; this flexible approach enhances efficiency and scalability with acceptable accuracy. It streamlines the qualitative research workflow from outlining to the analysis of implementation processes in real-world settings and may accelerate existing implementation approaches while minimizing implementation resources.
    Keywords:  Artificial intelligence (AI); Embedded pragmatic clinical trials (ePCTs); Older persons with dementia; Qualitative analysis automation; Thematic coding
    DOI:  https://doi.org/10.1186/s43058-026-00953-8
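The correlations between transcript word count and matching agreement reported above are Pearson coefficients. A minimal sketch of the computation, on hypothetical data rather than the study's transcripts:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation denominators
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one, so r = 0.230 above is a weak positive trend and r = -0.452 a moderate negative one.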
  6. bioRxiv. 2026 Apr 30. pii: 2026.01.09.697335. [Epub ahead of print]
      The rapid expansion of biomedical literature demands automated summarization tools that can reliably condense research articles into concise, accurate summaries. We benchmarked 62 text summarization methods, ranging from frequency-based and TextRank extractors to encoder-decoder models (EDMs) and large language models (LLMs), on 1,000 biomedical abstracts with author-generated highlights as reference summaries. Models were evaluated using a composite suite of lexical, semantic, and factual metrics, including ROUGE, BLEU, METEOR, embedding-based similarity, and factuality scores. Our results indicate that general-purpose language models (LMs) achieve the highest overall performance across lexical and semantic dimensions, outperforming both reasoning-oriented and domain-specific models. Notably, medium-sized models often outperform frontier-scale counterparts, suggesting an optimal balance between model capacity and computational efficiency. Statistical extractive methods consistently lag behind neural approaches. These findings provide a systematic reference for selecting biomedical summarization tools and highlight that broad pretraining remains more effective than narrow domain adaptation for generating high-quality scientific summaries.
    DOI:  https://doi.org/10.64898/2026.01.09.697335
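ROUGE, one of the lexical metrics in the benchmark above, scores n-gram overlap between a candidate summary and a reference. A minimal ROUGE-1 (unigram) sketch; this illustrates only the simplest member of the metric family, not the study's full evaluation suite:

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram ROUGE: precision, recall and F1 of clipped token overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate token counts at most as often as it occurs in the reference
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / sum(cand.values())  # overlap relative to candidate length
    recall = overlap / sum(ref.values())      # overlap relative to reference length
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

For instance, candidate "the cat sat" against reference "the cat sat down" gives precision 1.0 and recall 0.75: every candidate token appears in the reference, but one reference token goes uncovered.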
  7. Acta Cardiol. 2026 May 06:1-7.
       BACKGROUND: Heart failure (HF) remains a major cause of morbidity and mortality worldwide. Large language models (LLMs) such as ChatGPT are emerging as potential clinical decision support tools, but their adherence to specialty guidelines is not well characterised.
    OBJECTIVES: To evaluate the accuracy and guideline concordance of ChatGPT-5 in managing real-world HF scenarios compared with the 2023 European Society of Cardiology (ESC) and 2022 American College of Cardiology (ACC)/American Heart Association (AHA)/Heart Failure Society of America (HFSA) recommendations.
    METHODS: Thirty-eight anonymised HF clinical vignettes spanning reduced, mildly reduced, and preserved ejection fraction phenotypes and varied New York Heart Association (NYHA) classes were presented to ChatGPT-5. Two board-certified cardiologists independently graded each response for concordance with guideline recommendations using a 4-point scale (3 = fully concordant, 2 = partially concordant, 1 = discordant, 0 = unsafe/harmful). Discrepancies were adjudicated by a third reviewer. Descriptive statistics summarised performance and inter-rater agreement.
    RESULTS: Of the 38 responses, 20 (53%) were fully concordant, 4 (11%) partially concordant, 8 (21%) discordant, and 6 (16%) unsafe/harmful. Most inaccuracies involved vague drug titration guidance, incomplete device therapy recommendations, or omission of guideline-directed medical therapy (GDMT). Unsafe suggestions occurred in complex device or advanced therapy decisions. Inter-rater agreement was high.
    CONCLUSIONS: ChatGPT-5 showed moderate concordance with ESC and ACC/AHA/HFSA HF guidelines, indicating potential value as a tool for knowledge synthesis and preliminary clinical support. However, its outputs require expert validation, and safe clinical integration will depend on future models incorporating guideline-based frameworks, real-time data, and rigorous physician oversight.
    Keywords:  ChatGPT-5; Heart failure; clinical vignettes
    DOI:  https://doi.org/10.1080/00015385.2026.2668803