bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025–04–20
eight papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Neurosurgery. 2025 Apr 14.
       BACKGROUND AND OBJECTIVES: The body of neurosurgical literature has grown exponentially, with publication rates increasing year over year. Manual screening of abstracts for systematic review creation and guideline formation has become an arduous process because of the sheer volume of literature. Natural language processing, particularly large language models (LLMs), has shown promise in automating the abstract screening process. We evaluated whether Gemini Pro and ChatGPT, two LLMs, can automate the screening of abstracts for a guideline created by the Congress of Neurological Surgeons.
    METHODS: We developed novel pipelines using Gemini Pro and ChatGPT-4o-mini to screen abstracts for guideline creation. We tested our pipeline on abstracts retrieved using the EMBASE search term provided in a Congress of Neurological Surgeons guideline on Chiari I malformations, for a single population, intervention, comparison, and outcome question. We used only two inclusion/exclusion criteria and input a simplified version of the research question under investigation.
    RESULTS: Of the 1764 abstracts generated from the search, 22 were manually chosen to be relevant for guideline creation. Using Gemini Pro, 1043 articles were correctly excluded and only 1 was incorrectly excluded, resulting in a sensitivity of 95% and a specificity of 60%. Using ChatGPT-4o-mini, 1066 articles were correctly excluded, but only 4 articles were correctly included, resulting in a sensitivity of 18% and a specificity of 95%. Both pipelines completed the screening process in under 1 hour.
    CONCLUSION: We have developed novel LLM pipelines to automate abstract screening for neurosurgical guideline creation. This technology can reduce the time necessary for abstract screening processes from several weeks to a few hours. While further validation is required, this process could pave the way for evidence-based guidelines to be continuously updated in real time across medical fields.
    Keywords:  Artificial intelligence; Guideline creation; Large language models; Natural language processing
    DOI:  https://doi.org/10.1227/neu.0000000000003450
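    A minimal worked check of the Gemini Pro figures above, in Python, assuming the standard confusion-matrix definitions of sensitivity and specificity and that all 1742 non-relevant abstracts were screened (an illustrative sketch, not the authors' pipeline):

        # Counts reported in the abstract (Gemini Pro arm).
        total_abstracts = 1764
        relevant = 22                                  # manually judged relevant
        irrelevant = total_abstracts - relevant        # 1742

        true_negatives = 1043                          # correctly excluded
        false_negatives = 1                            # incorrectly excluded
        true_positives = relevant - false_negatives    # 21
        false_positives = irrelevant - true_negatives  # 699 left for human review

        sensitivity = true_positives / (true_positives + false_negatives)
        specificity = true_negatives / (true_negatives + false_positives)
        print(f"sensitivity = {sensitivity:.0%}")      # 95%
        print(f"specificity = {specificity:.0%}")      # 60%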
  2. Clin Child Fam Psychol Rev. 2025 Apr 18.
      Systematic and meta-analytic reviews provide gold-standard evidence but are static and quickly become outdated. Here we provide performance data on a new software platform, LitQuest, that uses artificial intelligence technologies to (1) accelerate screening of titles and abstracts from library literature searches, and (2) provide a software solution for enabling living systematic reviews by maintaining a saved AI algorithm for updated searches. Performance testing was based on LitQuest data from seven systematic reviews. LitQuest efficiency was estimated as the proportion (%) of the total yield of an initial literature search (titles/abstracts) that needed human screening prior to reaching the in-built stop threshold. LitQuest algorithm performance was measured as work saved over sampling (WSS) for a certain recall. LitQuest accuracy was estimated as the proportion of incorrectly classified papers in the rejected pool, as determined by two independent human raters. On average, around 36% of the total yield of a literature search needed to be human screened prior to reaching the stop-point. However, this ranged from 22 to 53% depending on the complexity of language structure across papers included in specific reviews. Accuracy was 99% at an interrater reliability of 95%, and 0% of titles/abstracts were incorrectly assigned. Findings suggest that LitQuest can be a cost-effective and time-efficient solution for supporting living systematic reviews, particularly for rapidly developing areas of science. Further development of LitQuest is planned, including facilitated full-text data extraction and community-of-practice access to living systematic review findings.
    Keywords:  Accuracy; Artificial intelligence; Efficiency; Machine learning; Systematic reviews
    DOI:  https://doi.org/10.1007/s10567-025-00519-5
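    The abstract reports LitQuest performance as work saved over sampling (WSS) without giving the formula; a minimal sketch of the definition commonly used in screening-prioritisation studies, WSS@R = (TN + FN)/N - (1 - R), with hypothetical counts:

        def wss_at_recall(true_neg: int, false_neg: int, n_total: int, recall: float = 0.95) -> float:
            """Work saved over sampling at target recall R: the share of records a
            reviewer avoids screening, relative to screening everything, while still
            capturing proportion R of the relevant records."""
            return (true_neg + false_neg) / n_total - (1.0 - recall)

        # Hypothetical review of 5000 records: stopping early leaves 3000 true
        # negatives and 2 false negatives unscreened.
        print(f"WSS@95 = {wss_at_recall(3000, 2, 5000):.1%}")   # 55.0%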
  3. Environ Evid. 2025 Apr 15. 14(1): 5
      Conducting systematic reviews (SRs) in environmental science is challenging due to diverse methodologies, terminologies, and study designs across disciplines. A major limitation is that inconsistent application of eligibility criteria in evidence screening affects the reproducibility and transparency of SRs. To explore the potential role of artificial intelligence (AI) in applying eligibility criteria, we developed and evaluated an AI-assisted evidence-screening framework using a case study SR on the relationship between stream fecal coliform concentrations and land use and land cover (LULC). The SR incorporates publications from hydrology, ecology, public health, landscape, and urban planning, reflecting the interdisciplinary nature of environmental research. We fine-tuned the ChatGPT-3.5 Turbo model with expert-reviewed training data for title, abstract, and full-text screening of 120 articles. The AI model demonstrated substantial agreement with expert reviewers at title/abstract review and moderate agreement at full-text review, and maintained internal consistency, suggesting its potential for structured screening assistance. The findings provide a structured framework for applying eligibility criteria consistently, improving evidence-screening efficiency, reducing labor and costs, and informing the integration of large language models (LLMs) in environmental SRs. Combining AI with domain knowledge provides an exploratory step toward evaluating the feasibility of AI-assisted evidence screening, especially for diverse, large-volume, and interdisciplinary bodies of studies. Additionally, AI-assisted screening has the potential to provide a structured approach for managing disagreement among researchers with diverse domain knowledge, though further validation is needed.
    Keywords:  Eligibility criteria; Fine tuning; Interdisciplinary study; Large Language model; Literature review; Literature screening
    DOI:  https://doi.org/10.1186/s13750-025-00358-5
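    Fine-tuning a GPT-3.5 Turbo model, as described above, takes chat-formatted JSONL training records; a minimal sketch of one screening record in that format, with hypothetical criteria and labels rather than the wording used in the case-study review:

        import json

        # One hypothetical training record for title/abstract screening; the
        # criteria wording and label are illustrative only.
        record = {
            "messages": [
                {"role": "system",
                 "content": "You screen titles and abstracts for a systematic review. "
                            "Reply INCLUDE or EXCLUDE with a one-line reason."},
                {"role": "user",
                 "content": "Criteria: empirical study relating stream fecal coliform "
                            "levels to land use or land cover.\nTitle: ...\nAbstract: ..."},
                {"role": "assistant",
                 "content": "INCLUDE - reports coliform concentrations by watershed land cover."},
            ]
        }

        with open("screening_train.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")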
  4. J Biomed Inform. 2025 Apr 16. pii: S1532-0464(25)00048-6. [Epub ahead of print] 104819
       OBJECTIVE: Scientific publications are essential for uncovering insights, testing new drugs, and informing healthcare policies. Evaluating the quality of these publications often involves assessing their Risk of Bias (RoB), a task traditionally performed by human reviewers. The goal of this work is to create a dataset and develop models that allow automated RoB assessment in clinical trials.
    METHODS: We use data from the Cochrane Database of Systematic Reviews (CDSR) as ground truth to label open-access clinical trial publications from PubMed. This process enabled us to develop training and test datasets specifically for machine reading comprehension and RoB inference. Additionally, we created extractive (RoBInExt) and generative (RoBInGen) Transformer-based approaches to extract relevant evidence and classify the RoB effectively.
    RESULTS: RoBIn was evaluated across various settings and benchmarked against state-of-the-art methods, including large language models (LLMs). In most cases, the best-performing RoBIn variant surpassed traditional machine learning and LLM-based approaches, achieving an AUROC of 0.83.
    CONCLUSION: This work addresses RoB assessment in clinical trials by introducing RoBIn, a pair of Transformer-based models for RoB inference and evidence retrieval that outperform traditional models and LLMs, demonstrating their potential to improve efficiency and scalability in clinical research evaluation. We also introduce an automatically annotated public dataset that can support future research on automated RoB assessment.
    Keywords:  Classification models; Deep learning; Evidence-based medicine; Machine reading comprehension; Natural language processing; Risk of bias; Systematic reviews
    DOI:  https://doi.org/10.1016/j.jbi.2025.104819
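    The extractive variant (RoBInExt) is described only at a high level; the underlying machine-reading-comprehension setup can be illustrated with an off-the-shelf extractive question-answering model from Hugging Face (a generic SQuAD-tuned checkpoint standing in for, not reproducing, the authors' model):

        from transformers import pipeline

        # Generic extractive QA as an illustration of evidence retrieval for a
        # risk-of-bias question; this public checkpoint is a stand-in, not RoBIn.
        qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

        trial_text = (
            "Participants were randomly assigned using a computer-generated sequence; "
            "allocation was concealed with sealed opaque envelopes, and outcome "
            "assessors were blinded to group assignment."
        )
        result = qa(question="How was the allocation sequence generated?", context=trial_text)
        print(result["answer"], result["score"])   # extracted evidence span + confidence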
  5. Surg Endosc. 2025 Apr 18.
       BACKGROUND: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models (LLMs) have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison.
    METHODS: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.
    RESULTS: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.
    CONCLUSIONS: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.
    Keywords:  Appendicitis; ChatGPT; Clinical practice guideline; Generative AI; Large language models; Surgery
    DOI:  https://doi.org/10.1007/s00464-025-11723-3
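    The study's prompts are not reproduced in the abstract; a minimal sketch of the kind of PICO-to-search-syntax prompt the METHODS describe, with hypothetical wording:

        # Hypothetical prompt template for one of the tasks described above
        # (turning a PICO question into a draft search syntax); the wording is
        # illustrative and not the prompt used in the study.
        PICO = {
            "population": "adult patients with acute appendicitis",
            "intervention": "laparoscopic appendectomy",
            "comparison": "antibiotic-first (non-operative) management",
            "outcome": "treatment failure within 1 year",
        }

        prompt = (
            "You are assisting with a surgical clinical practice guideline.\n"
            f"Population: {PICO['population']}\n"
            f"Intervention: {PICO['intervention']}\n"
            f"Comparison: {PICO['comparison']}\n"
            f"Outcome: {PICO['outcome']}\n"
            "Draft a PubMed search syntax combining MeSH terms and free-text "
            "synonyms with Boolean operators, one concept block per line."
        )
        print(prompt)   # paste into the LLM interface of choice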
  6. Wellcome Open Res. 2024. 9: 402
       Background: Developing behaviour change interventions able to tackle major challenges such as non-communicable diseases or climate change requires effective and efficient use of scientific evidence. The Human Behaviour-Change Project (HBCP) aims to improve evidence synthesis in behavioural science by compiling intervention reports and annotating them with an ontology to train information extraction and prediction algorithms. The HBCP used smoking cessation as the first 'proof of concept' domain but intends to extend its methodology to other behaviours. The aims of this paper are to (i) assess the extent to which methods developed for annotating smoking cessation intervention reports were generalisable to a corpus of physical activity evidence, and (ii) describe the steps involved in developing this second HBCP corpus.
    Methods: The development of the physical activity corpus involved: (i) reviewing the suitability of smoking cessation codes already used in the HBCP, (ii) defining the selection criteria and scope, (iii) identifying and screening records for inclusion, and (iv) annotating intervention reports using a code set of 200+ entities from the Behaviour Change Intervention Ontology.
    Results: Stage 1 highlighted the need to modify the smoking cessation behavioural outcome codes for application to physical activity. One hundred physical activity intervention reports were reviewed, and 11 physical activity experts were consulted to inform the adapted code set. Stage 2 involved narrowing down the scope of the corpus to interventions targeting moderate-to-vigorous physical activity. In stage 3, 111 physical activity intervention reports were identified, which were then annotated in stage 4.
    Conclusions: Smoking cessation annotation methods developed as part of the HBCP were mostly transferable to the physical activity domain. However, the codes applied to behavioural outcome variables required adaptations. This paper can help anyone interested in building a body of research to develop automated evidence synthesis methods in physical activity or for other behaviours.
    Keywords:  classification system; evidence synthesis automation; exercise; movement behaviours; ontology; systematic review; taxonomy
    DOI:  https://doi.org/10.12688/wellcomeopenres.21664.2
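    A minimal sketch of how a single annotation linking a span of an intervention report to a Behaviour Change Intervention Ontology entity might be represented; the document name, offsets, label, and identifier shown are hypothetical placeholders, not actual ontology codes:

        # Hypothetical annotation record; the entity label and ID are placeholders,
        # not actual Behaviour Change Intervention Ontology identifiers.
        annotation = {
            "document": "intervention_report_0042.pdf",
            "span": {"start": 1180, "end": 1235,
                     "text": "participants received weekly step-count feedback by SMS"},
            "entity": {"label": "feedback on behaviour",
                       "ontology_id": "BCIO:XXXXXXX"},   # placeholder ID
            "annotator": "coder_1",
        }
        print(annotation["entity"]["label"])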
  7. J Clin Epidemiol. 2025 Apr 16. pii: S0895-4356(25)00122-2. [Epub ahead of print] 111789
       BACKGROUND: Living guideline maintenance is underpinned by manual approaches to evidence retrieval, limiting long-term sustainability. Our study aimed to evaluate the feasibility of using only PubMed, Embase, OpenAlex, or Semantic Scholar to automatically retrieve the articles included in a high-quality international guideline, the 2023 International Polycystic Ovary Syndrome (PCOS) Guidelines.
    METHODS: The digital object identifiers (DOIs) and PubMed IDs (PMIDs) of articles included after full-text screening in the 2023 International PCOS Guidelines were extracted. These IDs were used to automatically retrieve article metadata from all tested databases. A title-only search was then conducted for articles that were not initially retrievable. The extent and overlap of coverage were determined for each database. An exploratory analysis of the risk of bias of unretrievable articles was then conducted for each database.
    RESULTS: OpenAlex had the best coverage (98.6%), followed by Semantic Scholar (98.3%), Embase (96.8%) and PubMed (93.0%). However, 90.5% of all articles were retrievable from all four databases. All articles that were not retrievable from OpenAlex and Semantic Scholar were either assessed as medium or high risk of bias. In contrast, both Embase and PubMed missed articles that were of high quality (low risk of bias).
    CONCLUSION: OpenAlex should be considered as a single source for automated evidence retrieval in living guideline development, owing to its high coverage and low risk of missing high-quality articles. These insights are being leveraged as part of transitioning the 2023 International PCOS Guidelines to a living format.
    Keywords:  Evidence synthesis; evidence retrieval; learning health systems; living evidence; living guidelines
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.111789
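    A minimal sketch of the DOI-based lookup described in the METHODS, against the public OpenAlex works-by-DOI endpoint (error handling reduced to the essentials; an unresolved DOI would fall through to the title-only fallback search):

        import requests

        def fetch_openalex_work(doi: str) -> dict | None:
            """Retrieve a work's metadata from OpenAlex by DOI; None if not indexed."""
            url = f"https://api.openalex.org/works/https://doi.org/{doi}"
            resp = requests.get(url, timeout=30)
            if resp.status_code == 404:
                return None            # candidate for the title-only fallback search
            resp.raise_for_status()
            return resp.json()

        work = fetch_openalex_work("10.1016/j.jclinepi.2025.111789")
        if work:
            print(work["display_name"], work["publication_year"])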
  8. J Biomed Inform. 2025 Apr 15. pii: S1532-0464(25)00054-1. [Epub ahead of print] 104825
       OBJECTIVE: Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of the models for classifying the methodological rigor of randomized controlled trials is necessary to identify the more robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics.
    METHODS: Seven transformer-based language models were fine-tuned on the title and abstract of 42,575 articles from 2003 to 2023 in McMaster University's Premium LiteratUre Service database under different configurations. The studies reported in the articles addressed questions related to treatment, prevention, or quality improvement for which randomized controlled trials are the gold standard with defined criteria for rigorous methods. Models were evaluated on the validation set using 12 schemes and metrics, including optimization for cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy, among others. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested in hold-out and external datasets.
    RESULTS: A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes. Three BioLinkBERT models outperformed the others on 8 of the 12 schemes. BioBERT, BiomedBERT, and SciBERT were best on 1, 1, and 2 schemes, respectively. While model performance remained robust on the hold-out test set, it declined on external datasets. Class-weight adjustments improved performance in most instances.
    CONCLUSION: BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics and threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternate imbalance strategies, and examine training on full-text articles.
    Keywords:  Critical appraisal; Deep learning; Encoder-only transformer; Evidence-based medicine; Natural language processing; Text classification
    DOI:  https://doi.org/10.1016/j.jbi.2025.104825
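    The twelve evaluation schemes are not enumerated in the abstract; a minimal sketch of the threshold-tuning step for one threshold-dependent criterion, using scikit-learn and synthetic scores rather than the PLUS data:

        import numpy as np
        from sklearn.metrics import roc_auc_score, brier_score_loss, balanced_accuracy_score

        rng = np.random.default_rng(0)
        # Synthetic validation labels/scores standing in for the PLUS articles (1 = rigorous).
        y_true = rng.integers(0, 2, size=1000)
        y_prob = np.clip(0.35 * y_true + rng.normal(0.35, 0.2, size=1000), 0.0, 1.0)

        print("AUROC:", roc_auc_score(y_true, y_prob))
        print("Brier score:", brier_score_loss(y_true, y_prob))

        # Threshold tuning: pick the cut-off that maximises balanced accuracy,
        # one of several reasonable threshold-dependent criteria.
        thresholds = np.linspace(0.05, 0.95, 19)
        best = max(thresholds,
                   key=lambda t: balanced_accuracy_score(y_true, (y_prob >= t).astype(int)))
        print("Chosen threshold:", round(float(best), 2))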