bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-12-21
nine papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Syst Rev. 2025 Dec 18.
       BACKGROUND: Artificial intelligence (AI) can greatly enhance efficiency in systematic literature reviews and meta-analyses, but its accuracy in screening titles/abstracts and full-text articles is uncertain.
    OBJECTIVES: This study evaluated the performance metrics (sensitivity, specificity) of a GPT-4 AI program, Review Copilot, against human decisions (gold standard) in screening titles/abstracts and full-text articles from four published systematic reviews/meta-analyses.
    RESEARCH DESIGN: In this validation study, participant data from four already-published systematic literature reviews, comprising observational studies and randomized controlled trials, were used to compare Review Copilot against human decision-making (gold standard) in screening titles/abstracts and full-text articles. Review Copilot runs on OpenAI's GPT-4. Sensitivity, specificity, and balanced accuracy of title/abstract and full-text screening were compared between Review Copilot's include/exclude decisions and the human decisions.
    RESULTS: Review Copilot's sensitivity and specificity for title/abstract screening were 99.2% and 83.6%, respectively, and 97.6% and 47.4% for full-text screening. The average agreement between two runs was 95.4%, with a kappa statistic of 0.83. Review Copilot completed screening in one-quarter of the time taken by human reviewers.
    CONCLUSIONS: AI use in systematic reviews and meta-analyses is inevitable. Health researchers must understand these technologies' strengths and limitations to ethically leverage them for research efficiency and evidence-based decision-making in health.
    DOI:  https://doi.org/10.1186/s13643-025-02997-8
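    A minimal illustration of the metrics reported in the entry above (not the Review Copilot code): sensitivity, specificity, and balanced accuracy computed from a 2x2 table of AI versus human screening decisions, plus Cohen's kappa for agreement between two AI runs. The function names and counts are hypothetical placeholders, not data from the study.

      def screening_metrics(tp, fp, fn, tn):
          # tp/fn: human-included records the AI included/excluded
          # fp/tn: human-excluded records the AI included/excluded
          sensitivity = tp / (tp + fn)
          specificity = tn / (tn + fp)
          balanced_accuracy = (sensitivity + specificity) / 2
          return sensitivity, specificity, balanced_accuracy

      def cohens_kappa(run1, run2):
          # Chance-corrected agreement between two binary decision lists,
          # e.g., two runs of the same AI screener (1 = include, 0 = exclude).
          n = len(run1)
          observed = sum(x == y for x, y in zip(run1, run2)) / n
          p1, p2 = sum(run1) / n, sum(run2) / n
          expected = p1 * p2 + (1 - p1) * (1 - p2)
          return (observed - expected) / (1 - expected)

      print(screening_metrics(tp=90, fp=40, fn=10, tn=360))        # hypothetical counts
      print(cohens_kappa([1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]))  # hypothetical runs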
  2. J Clin Transl Sci. 2025;9(1):e241
      ASReview is a software tool that can potentially reduce the workload of literature screening in systematic reviews by ranking the retrieved records. We assessed the tool's feasibility, advantages, and limitations when using it to populate a database of cancer immunotherapy trials. ASReview is easy to use, and it efficiently identified relevant records. It may save resources compared with traditional systematic reviews using two human reviewers. Predefined procedures are necessary to maintain a transparent and reproducible workflow. Limitations include that adding references to existing projects is difficult and that the algorithm learns from every decision, even when this may not be appropriate.
    Keywords:  ASReview; artificial intelligence; evidence synthesis; review software; study selection
    DOI:  https://doi.org/10.1017/cts.2025.10173
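    An illustrative sketch of the active-learning loop behind screening prioritization tools such as ASReview (this is not ASReview's own API): a classifier is retrained after each screening decision and the remaining records are re-ranked by predicted relevance. The example records and labels are hypothetical.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB

      records = ["immunotherapy trial in metastatic melanoma ...",
                 "checkpoint inhibitor phase II study ...",
                 "surgical technique case report ...",
                 "cost analysis of outpatient clinics ..."]
      labels = {0: 1, 2: 0}   # records screened so far: 1 = include, 0 = exclude

      X = TfidfVectorizer().fit_transform(records)

      def rank_unlabeled(X, labels):
          # Train on the labeled records, then rank the rest by inclusion probability.
          labeled = sorted(labels)
          clf = MultinomialNB().fit(X[labeled], [labels[i] for i in labeled])
          unlabeled = [i for i in range(X.shape[0]) if i not in labels]
          scores = clf.predict_proba(X[unlabeled])[:, 1]
          return sorted(zip(unlabeled, scores), key=lambda t: -t[1])

      # Screen the top-ranked record next, add its label, and repeat.
      print(rank_unlabeled(X, labels))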
  3. Psychol Bull. 2025 Oct;151(10):1280-1306
      Psychological science requires reliable measures. Within systematic literature reviews, reliability hinges on high interrater agreement during data extraction. Yet, the extraction process has been time-consuming. Efforts to accelerate the process using technology had shown limited success until generative artificial intelligence (genAI), particularly large language models (LLMs), accurately extracted variables from medical studies. Nonetheless, for psychological researchers, it remains unclear how to use genAI for data extraction, given the range of tested variables, the medical context, and the variability in accuracy. We systematically assessed extraction accuracy and error patterns across domains in psychology by comparing genAI-extracted and human-extracted data from 22 systematic review databases published in Psychological Bulletin. Eight LLMs extracted 312,329 data points from 2,179 studies on 186 variables. LLM extractions achieved unacceptable accuracy on all metrics for 20% of variables. For 46% of variables, accuracy was acceptable for some metrics and unacceptable for others. LLMs reached acceptable but not high accuracy on all metrics in 15%, high but not excellent in 8%, and excellent accuracy in 12% of variables. Accuracy varied most between variables, less between systematic reviews, and least between LLMs. Moderator analyses using a hierarchical logistic regression, hierarchical linear model, and meta-analysis revealed that accuracy was higher for variables describing studies' context and moderator variables compared to variables for effect size calculation. Also, accuracy was higher in systematic reviews with more detailed variable descriptions and positively correlated with model sizes. We discuss directions for investigating ways to use genAI to accelerate data extractions while ensuring meaningful human control.
    DOI:  https://doi.org/10.1037/bul0000501
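    A minimal sketch of the kind of accuracy analysis described above: genAI-extracted values are compared against the human-extracted reference values and accuracy is summarized per variable. The variable names and values are hypothetical, not data from the reviewed databases.

      import pandas as pd

      extractions = pd.DataFrame({
          "variable": ["sample_size", "sample_size", "mean_age", "mean_age"],
          "human":    [120,           85,            34.2,       41.0],
          "llm":      [120,           88,            34.2,       41.0],
      })

      extractions["correct"] = extractions["human"] == extractions["llm"]
      print(extractions.groupby("variable")["correct"].mean())   # accuracy per variable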
  4. JBI Evid Synth. 2025 Dec 19.
       OBJECTIVE: The objective of this scoping review will be to chart the available evidence on user experience and adoption of automation and artificial intelligence (AI) technologies for evidence synthesis.
    INTRODUCTION: Evidence syntheses are crucial for informing health care practice and policy; however, they are constrained by the ever-increasing volume of research and labor-intensive methods. With reviews often taking over a year to complete, automation and AI offer promising solutions by streamlining evidence synthesis workflows. However, while these technologies may offer significant time savings, their adoption depends on usability, trustworthiness, and workflow integration, elements that are currently poorly understood.
    ELIGIBILITY CRITERIA: This review will include primary research articles, all types of reviews, expert opinions, and gray literature that discuss user experience and/or adoption of automation and AI technologies for evidence synthesis across all disciplines.
    METHODS: Following JBI scoping review methodology, the search strategy will identify published and unpublished evidence sources using a 3-step process. An initial exploratory search of PubMed was conducted to identify relevant keywords and terms. This will be followed by searches of PubMed, Web of Science Core Collection, Scopus, ProQuest Central, and ACM Digital Library databases, as well as online gray literature sources to identify eligible studies. A date limit of October 2015 will be applied to the searches, with no language limitations. Three reviewers will independently screen, select, and extract data from relevant evidence sources. Data extraction and analysis will be charted and mapped through the lenses of 4 distinct frameworks: Unified Theory of Acceptance and Use of Technology (UTAUT), RE-AIM, Human-AI Interaction (HAI), and user experience (UX) principles.
    REVIEW REGISTRATION: OSF https://doi.org/10.17605/OSF.IO/AYQJC.
    Keywords:  artificial intelligence; human computer interaction; systematic reviews; technology acceptance; usability
    DOI:  https://doi.org/10.11124/JBIES-25-00236
  5. J Cardiovasc Pharmacol. 2025 Nov 25.
      Colchicine has been studied as an anti-inflammatory treatment for cardiovascular prevention, but findings from randomized trials have been inconsistent. This meta-analysis evaluated the efficacy and safety of colchicine in reducing major adverse cardiovascular events (MACE) and its individual components, using ChatGPT as an assistant throughout the process. Randomized trials of colchicine for cardiovascular prevention were systematically identified, and data extraction, risk of bias assessment, and meta-analyses were performed with ChatGPT under human supervision. The primary outcome was MACE, while secondary outcomes included myocardial infarction (MI), stroke, revascularization, cardiovascular mortality, and all-cause mortality. Eleven trials involving 30,888 patients were included. Colchicine significantly reduced MACE (risk ratio 0.75, 95% CI 0.63-0.88), though no significant effects were observed for MI, stroke, cardiovascular mortality, or all-cause mortality. In addition to its clinical findings, this study illustrates the potential of ChatGPT to assist in systematic reviews and meta-analyses by automating screening, data extraction, bias assessment, and statistical code generation. This integration reduced researcher time by over 70% while maintaining accuracy through human validation. Overall, colchicine appears to lower the risk of MACE, although the results of the CLEAR trial have lowered the certainty of this finding; the study also highlights the feasibility and efficiency gains of using large language models in evidence synthesis workflows.
    DOI:  https://doi.org/10.1097/FJC.0000000000001780
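    A minimal sketch of the kind of statistical code an LLM assistant might be asked to generate for such an analysis: inverse-variance random-effects (DerSimonian-Laird) pooling of log risk ratios. The event counts below are hypothetical and are not the trial data from this meta-analysis.

      import math

      # (events_treatment, n_treatment, events_control, n_control) per trial
      trials = [(30, 500, 45, 500), (12, 250, 20, 250), (55, 1000, 70, 1000)]

      log_rr, var = [], []
      for a, n1, c, n2 in trials:
          log_rr.append(math.log((a / n1) / (c / n2)))
          var.append(1 / a - 1 / n1 + 1 / c - 1 / n2)   # variance of log risk ratio

      w = [1 / v for v in var]                          # fixed-effect weights
      fixed = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum(w)
      q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rr))
      c_dl = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
      tau2 = max(0.0, (q - (len(trials) - 1)) / c_dl)   # between-trial variance

      w_re = [1 / (v + tau2) for v in var]              # random-effects weights
      pooled = sum(wi * yi for wi, yi in zip(w_re, log_rr)) / sum(w_re)
      se = math.sqrt(1 / sum(w_re))
      print(f"RR {math.exp(pooled):.2f} (95% CI {math.exp(pooled - 1.96 * se):.2f}-"
            f"{math.exp(pooled + 1.96 * se):.2f})")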
  6. JBI Evid Synth. 2025 Dec 19.
       OBJECTIVE: The objective of this investigation was to evaluate the agreement rate on judgments made using the Murad tool by different systematic review teams.
    INTRODUCTION: Evaluating the methodological quality of case reports and case series is challenging, but some tools do exist for this purpose. We leveraged the presence of studies that have been evaluated by different systematic review teams to assess the inter-consensus agreement among different teams on Murad tool domains.
    METHODS: Using a back-citation method, we identified systematic reviews that used the Murad tool and retrieved all the included primary studies. We selected studies that were assessed by more than 1 systematic review team. We calculated observed agreement and Gwet's agreement coefficient on judgments made about each signaling question.
    RESULTS: We identified 982 systematic reviews that cited the Murad tool and collectively cited 59,080 references. The final dataset comprised 81 duplicated case reports and case series assessed by more than 1 systematic review team. Overall, the signaling questions had very high observed agreement, with 5 of the 8 questions having agreement over 75%. The signaling questions with the highest agreement addressed the adequacy of the ascertainment of the exposure (coefficient 0.959), ascertainment of the outcome (coefficient 0.829), presence of a dose-response gradient (perfect agreement), and clarity of reporting (coefficient 0.755).
    CONCLUSION: The current study demonstrates overall high agreement among different systematic review teams that used the Murad tool for appraisal of the same case series and case reports. Leveraging duplicated studies across systematic reviews is a feasible way to retrospectively assess the reliability of the tool.
    Keywords:  Murad tool; inter-consensus agreement; inter-rater reliability; methodological quality; risk of bias
    DOI:  https://doi.org/10.11124/JBIES-25-00225
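    A minimal sketch (not the authors' code) of observed agreement and Gwet's AC1 for two review teams making yes/no judgments on the same signaling question; the judgment lists below are hypothetical.

      def gwet_ac1(team_a, team_b):
          # Two raters, binary judgments (1 = yes, 0 = no).
          n = len(team_a)
          p_obs = sum(x == y for x, y in zip(team_a, team_b)) / n   # observed agreement
          pi = (sum(team_a) / n + sum(team_b) / n) / 2              # mean "yes" prevalence
          p_exp = 2 * pi * (1 - pi)                                 # AC1 chance agreement
          return p_obs, (p_obs - p_exp) / (1 - p_exp)

      print(gwet_ac1([1, 1, 1, 0, 1, 1], [1, 1, 1, 0, 0, 1]))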
  7. Medicine (Baltimore). 2025 Dec 12;104(50):e46216
      With the rapid rise of artificial intelligence tools, applications like ChatPDF are seen as promising for supporting academic tasks in neurosurgery, such as literature review, summarization, and question generation. However, the accuracy and relevance of such tools remain to be critically assessed. This study assesses ChatPDF's accuracy in interpreting neurosurgical research articles, aiming to identify its strengths and limitations. Articles from the 10 highest-ranked neurosurgical journals were reviewed by selecting the first original research article from each journal's 2023 volume. Ten detailed questions were independently generated by 2 researchers based on each article's content. Each article was then uploaded to ChatPDF, which generated its own questions and provided responses to both its questions and those posed by the researchers. Responses were categorized as completely correct, partially correct, or incorrect. Source reliability was also evaluated to determine ChatPDF's performance. ChatPDF achieved an overall accuracy rate of 89% across 100 questions, with 89% of responses classified as completely correct, 5% as partially correct, and 6% as incorrect. Source reliability averaged 83%, although variability was noted, particularly in journals such as the Journal of Neurosurgery: Spine and Neurosurgery Clinics, which showed lower reliability rates. ChatPDF demonstrated substantial accuracy and potential as a supplementary tool for neurosurgical literature review. However, limitations such as inconsistent source reliability and the lack of visual content analysis highlight the need for ongoing refinement. While promising, ChatPDF should be used alongside manual verification to ensure comprehensive and accurate literature interpretation in neurosurgical research.
    Keywords:  ChatPDF; academic research tools; artificial intelligence; neurosurgical literature analysis; performance assessment
    DOI:  https://doi.org/10.1097/MD.0000000000046216
  8. J Clin Epidemiol. 2025 Dec 16. pii: S0895-4356(25)00443-3. [Epub ahead of print] 112110
    OBJECTIVE: To implement a semi-automated approach to facilitate rating the Grading of Recommendations Assessment, Development and Evaluation (GRADE) certainty of evidence (CoE) for indirect and network meta-analysis (NMA) estimates.
    METHODS: We developed and implemented algorithms for generating automated ratings for the CoE for indirect and network estimates in two living NMAs of rheumatoid arthritis treatment. At the indirect stage, inputs included CoE ratings for direct estimates and the contribution matrix. Intransitivity ratings were assigned based on the indirectness ratings of the two direct estimates with the highest percent contribution. An online tool (customized to our project) facilitated assessment of imprecision on the network estimate. Automated ratings were reviewed by two independent experts.
    RESULTS: Across 1,306 indirect comparisons, the contribution matrix identified the dominant branches of evidence regardless of whether a single first-order loop was present (80%) or not. The reviewers agreed with all automated CoE ratings for incoherence (n=34), network estimates (n=34), and imprecision (n=1,447). They agreed with the automated intransitivity algorithm except when the total contribution of the top-two direct estimates was low (e.g., <50%, which occurred in 38% of the estimates).
    CONCLUSION: Automated approaches facilitated CoE ratings for indirect and network estimates. Further work is required to define appropriate algorithms for intransitivity.
    Keywords:  Bayesian; GRADE; Living Systematic reviews; Network meta-analysis; Randomized Controlled Trials (RCTs); Rheumatoid arthritis; Semi-automation
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112110
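    An illustrative sketch of the intransitivity rule described above (not the authors' implementation): the two direct comparisons contributing most to an indirect estimate determine the rating, and estimates where those two contributions sum to less than 50% are flagged for manual review. The comparison names, ratings, and percentages are hypothetical.

      RATING_ORDER = ["not serious", "serious", "very serious"]

      def rate_intransitivity(contributions, indirectness):
          # contributions: {direct comparison: % contribution to the indirect estimate}
          # indirectness: {direct comparison: GRADE indirectness rating}
          top_two = sorted(contributions, key=contributions.get, reverse=True)[:2]
          combined = sum(contributions[c] for c in top_two)
          # The more serious of the two ratings drives the intransitivity rating.
          rating = max((indirectness[c] for c in top_two), key=RATING_ORDER.index)
          return rating, combined < 50   # flag low-contribution estimates for review

      contrib = {"A-B": 48.0, "A-C": 22.0, "B-C": 30.0}
      indirect = {"A-B": "not serious", "A-C": "serious", "B-C": "not serious"}
      print(rate_intransitivity(contrib, indirect))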
  9. JAMIA Open. 2025 Dec;8(6):ooaf156
       Objectives: The objective of this study was to develop and test natural language processing (NLP) methods for screening and, ultimately, predicting the cancer relevance of peer-reviewed publications.
    Materials and Methods: Two datasets were used: (1) manually curated publications labeled for cancer relevance, co-authored by members of The University of Kansas Cancer Center (KUCC) and (2) a derived dataset containing cancer-related abstracts from American Association for Cancer Research journals and noncancer-related abstracts from other medical journals. Two text encoding methods were explored: term frequency-inverse document frequency (TF-IDF) vectorization and various BERT embeddings. These representations served as inputs to 3 supervised machine learning classifiers: Support Vector Classification (SVC), Gradient Boosting Classification, and Multilayer Perceptron (MLP) neural networks. Model performance was evaluated by comparing predictions to the "true" cancer-relevant labels in a withheld test set.
    Results: All machine learning models performed best when trained and tested within the derived dataset. Across the datasets, SVC and MLP both exhibited strong performance, with F1 scores as high as 0.976 and 0.997, respectively. BioBERT embeddings resulted in slightly higher metrics when compared to TF-IDF vectorization across most models.
    Discussion: Models trained on the derived data performed very well internally; however, weaker performance was noted when these models were tested on the KUCC dataset. This finding highlights the subjective nature of cancer-relevance determinations. In contrast, KUCC-trained models had high predictive performance when tested on the derived dataset's classifications, showing that models trained on the KUCC dataset may be suitable for wider cancer-relevance prediction.
    Conclusions: Overall, our results suggest that NLP can effectively automate the classification of cancer-relevant publications, enhancing research productivity tracking; however, great care should be taken in selecting the appropriate data, text representation, and machine learning approach.
    Keywords:  BioBERT embeddings; NLP; Support Vector Classification; TF-IDF vectorization; academic research centers; biomedical text classification; natural language processing; publication curation; supervised machine learning
    DOI:  https://doi.org/10.1093/jamiaopen/ooaf156
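    A minimal sketch of one of the pipelines described above: TF-IDF features feeding a Support Vector Classifier, evaluated with F1 on a withheld test set. The abstracts and labels are hypothetical placeholders, not the KUCC or derived datasets.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import SVC
      from sklearn.pipeline import make_pipeline
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import f1_score

      abstracts = ["tumor immune microenvironment in lung cancer ...",
                   "checkpoint blockade improves melanoma survival ...",
                   "hip fracture rehabilitation outcomes ...",
                   "hypertension management in primary care ..."] * 25
      labels = [1, 1, 0, 0] * 25   # 1 = cancer-relevant, 0 = not

      X_train, X_test, y_train, y_test = train_test_split(
          abstracts, labels, test_size=0.2, stratify=labels, random_state=0)

      model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
      model.fit(X_train, y_train)
      print("F1:", f1_score(y_test, model.predict(X_test)))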