bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-12-14
six papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Front Artif Intell. 2025;8:1662202
       Background: Systematic literature reviews (SLRs) are critical to health research and decision-making but are often time- and labor-intensive. Artificial intelligence (AI) tools like large language models (LLMs) provide a promising way to automate these processes.
    Methods: We conducted a systematic literature review on the cost-effectiveness of adult pneumococcal vaccination and prospectively assessed the performance of our AI-assisted review platform, Intelligent Systematic Literature Review (ISLaR) 2.0, compared to expert researchers.
    Results: ISLaR demonstrated high accuracy (0.87 for full-text screening; 0.86 for data extraction), precision (0.88; 0.86), and sensitivity (0.91; 0.98) across screening and data extraction tasks, but lower specificity (0.79; 0.42), especially when extracting data from tables. The platform reduced abstract and full-text screening time by over 90% compared to human reviewers.
    Conclusion: The platform has strong potential to reduce reviewer workload but requires further development.
    Keywords:  artificial intelligence; data extraction; health technology assessment; large language models; reviewer workload; systematic literature review
    DOI:  https://doi.org/10.3389/frai.2025.1662202
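    A brief editorial illustration: the abstract reports accuracy, precision, sensitivity, and specificity without the underlying confusion-matrix counts, so here is a minimal Python sketch of how these metrics relate; the counts are placeholders, not the study's data.

      # Standard screening/extraction metrics from confusion-matrix counts.
      # Placeholder counts for illustration only; not ISLaR's data.
      def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
          return {
              "accuracy":    (tp + tn) / (tp + fp + tn + fn),
              "precision":   tp / (tp + fp),  # flagged includes that were correct
              "sensitivity": tp / (tp + fn),  # true includes that were caught (recall)
              "specificity": tn / (tn + fp),  # true excludes correctly rejected
          }

      # High sensitivity with low specificity (as reported for table extraction)
      # means few missed records but many false positives for humans to weed out.
      print(screening_metrics(tp=90, fp=40, tn=30, fn=5))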
  2. JMIR AI. 2025 Dec 10;4:e80247
       Background: Systematic literature reviews (SLRs) build the foundation for evidence synthesis, but they are exceptionally demanding in terms of time and resources. While recent advances in artificial intelligence (AI), particularly large language models, offer the potential to accelerate this process, their use introduces challenges to transparency and reproducibility. Reporting guidelines such as the PRISMA-AI (Preferred Reporting Items for Systematic Reviews and Meta-Analyses-Artificial Intelligence Extension) primarily focus on AI as a subject of research, not as a tool in the review process itself.
    Objective: To address the gap in reporting standards, this study aimed to develop and propose a discipline-agnostic checklist extension to the PRISMA 2020 statement. The goal was to ensure transparent reporting when AI is used as a methodological tool in evidence synthesis, fostering trust in the next generation of SLRs.
    Methods: The proposed checklist, named PRISMA-trAIce (PRISMA-Transparent Reporting of Artificial Intelligence in Comprehensive Evidence Synthesis), was developed through a systematic process. We conducted a literature search to identify established, consensus-based AI reporting guidelines (eg, CONSORT-AI [Consolidated Standards of Reporting Trials-Artificial Intelligence] and TRIPOD-AI [Transparent Reporting of a Multivariable Prediction Model of Individual Prognosis or Diagnosis-Artificial Intelligence]). Relevant items from these frameworks were extracted, analyzed, and thematically synthesized to form a modular checklist that integrates with the PRISMA 2020 structure.
    Results: The primary result of this work is the PRISMA-trAIce checklist, a comprehensive set of reporting items designed to document the use of AI in SLRs. The checklist covers the entire structure of an SLR, from title and abstract to methods and discussion, and includes specific items for identifying AI tools, describing human-AI interaction, reporting performance evaluation, and discussing limitations.
    Conclusions: PRISMA-trAIce establishes an important framework to improve the transparency and methodological integrity of AI-assisted systematic reviews, enhancing the trust required for their responsible application in evidence synthesis. We present this work as a foundational proposal, explicitly inviting the scientific community to join an open science process of consensus building. Through this collaborative refinement, we aim to evolve PRISMA-trAIce into a formally endorsed guideline, thereby ensuring the collective validation and scientific rigor of future AI-driven research.
    Keywords:  AI; PRISMA; Preferred Reporting Items for Systematic Reviews and Meta-Analyses; artificial intelligence; evidence synthesis; large language models; reporting guideline; systematic literature review; transparency
    DOI:  https://doi.org/10.2196/80247
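    Purely as an illustration of the reporting domains named above (identifying AI tools, describing human-AI interaction, reporting performance evaluation, and discussing limitations), here is a hypothetical Python sketch of an AI-use disclosure captured as structured metadata; the field names are invented and are not the actual PRISMA-trAIce items.

      # Hypothetical structured record for disclosing AI use in a review.
      # Field names are illustrative only; the real reporting items are
      # defined by the PRISMA-trAIce checklist itself.
      from dataclasses import dataclass, asdict
      import json

      @dataclass
      class AIUseDisclosure:
          review_stage: str       # e.g., title/abstract screening
          tool_and_version: str   # identifying the AI tool
          human_oversight: str    # describing human-AI interaction
          performance_eval: str   # reporting performance evaluation
          limitations: str        # discussing limitations

      d = AIUseDisclosure(
          review_stage="title/abstract screening",
          tool_and_version="ExampleScreener v1.0 (placeholder name)",
          human_oversight="all machine exclusions verified by a human reviewer",
          performance_eval="sensitivity/specificity vs dual human screening on a pilot set",
          limitations="not validated outside the pilot topic area",
      )
      print(json.dumps(asdict(d), indent=2))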
  3. JMIR Form Res. 2025 Dec 10;9:e77707
       Background: The Archive of German-Language General Practice (ADAM) stores about 500 paper-based doctoral theses published from 1965 to the present. Although the theses have been grouped into categories, no deeper systematic information extraction (IE) has yet been performed. Recently developed large language models (LLMs) such as ChatGPT are thought to have the potential to help with IE from medical documents. However, there are concerns about LLM hallucinations, and their use on older doctoral theses has not yet been reported.
    Objective: The aim of this study is to assess whether LLMs, specifically GPT-4o and Gemini-1.5-Flash, can help extract information from paper-based doctoral theses in ADAM.
    Methods: We randomly selected 10 doctoral theses published between 1965 and 2022. After preprocessing, we ran two different LLM pipelines, one using an OpenAI model and one using a Google model, to extract dissertation characteristics and generate uniform abstracts. For comparison, one pooled human-written abstract was produced, and blinded raters evaluated the LLM-generated abstracts against the human-written ones. Bidirectional encoder representations from transformers (BERT) scores were calculated as the evaluation metric.
    Results: Relevant dissertation characteristics and keywords could be extracted for all theses (n=10): institute name and location, thesis title, author name(s), and publication year. GPT-4o generated an abstract for all but one thesis, while Gemini-1.5-Flash provided abstracts in all cases (n=10). The modality of abstract generation had no effect on raters' evaluations (nonparametric Kruskal-Wallis test for independent groups, P=.44). Generating abstracts with LLMs was estimated to be 24-36 times faster than writing them by hand. Evaluation metrics showed moderate-to-high semantic similarity (mean BERT-based F1-score: 0.72 for GPT-4o and 0.71 for Gemini). Translation from German into English did not result in a loss of information (n=10).
    Conclusions: An accumulating body of unpublished doctoral theses makes it difficult to extract relevant evidence. Recent advances in LLMs like ChatGPT have raised expectations for text mining, but their use for IE from "historic" medical documents had not previously been reported. This feasibility study suggests that both models (GPT-4o and Gemini-1.5-Flash) helped to accurately simplify and condense doctoral theses into relevant information; the LLM-generated abstracts were perceived as similar to human-generated ones, were semantically similar, and took about one-thirtieth of the time to create. This pilot study demonstrates the feasibility of a regular office-scanning workflow and the use of general-purpose LLMs to extract relevant information and produce accurate abstracts from ADAM doctoral theses. Taken together, this information could help researchers better search the family medicine literature of the last 60 years and develop current research questions.
    Keywords:  AI; ChatGPT; GPT-4o; Gemini; artificial intelligence; doctoral thesis; family medicine
    DOI:  https://doi.org/10.2196/77707
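    The evaluation metric above is a BERT-based similarity score; assuming a setup like the open-source bert-score package (whether the authors used this exact implementation is not stated), a minimal sketch of such a comparison follows, with invented placeholder texts.

      # BERT-based semantic similarity between an LLM-generated and a
      # human-written abstract, assuming the `bert-score` package
      # (pip install bert-score). Texts are invented placeholders.
      from bert_score import score

      llm_abstracts   = ["The thesis examines antibiotic prescribing in general practice."]
      human_abstracts = ["This dissertation analyses antibiotic prescriptions by family doctors."]

      # Returns per-pair precision, recall, and F1; the paper reports mean F1.
      P, R, F1 = score(llm_abstracts, human_abstracts, lang="en", verbose=False)
      print(f"mean BERT F1: {F1.mean().item():.2f}")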
  4. Expert Rev Pharmacoecon Outcomes Res. 2025 Dec 11.
      
    Keywords:  Artificial Intelligence; Climate Resilience; Health Economics; Health Equity; Low- and Middle-Income Countries; Outcomes Research
    DOI:  https://doi.org/10.1080/14737167.2025.2603951
  5. Perspect Med Educ. 2025;14(1):882-890
       Introduction: It is estimated that large language models (LLMs), including ChatGPT, are already widely used in academic paper writing. This study examined whether certain words and phrases reported as frequently used by LLMs have increased in frequency in the medical literature, comparing their trends with those of common academic expressions.
    Methods: A structured literature review identified 135 potentially AI-influenced terms from 15 studies documenting LLM vocabulary patterns. For comparison, 84 common academic phrases in medical research served as controls. PubMed records from 2000 to 2024 were analyzed to track the frequency of these terms. Usage trends were normalized using a modified Z-score transformation.
    Results: Of the 135 potentially AI-influenced terms, 103 showed meaningful increases (modified Z-score ≥3.5) in 2024. Terms with the highest increases included "delve," "underscore," "primarily," "meticulous," and "boast." The linear mixed-effects model revealed significantly higher usage of potentially AI-influenced terms compared to controls (β = 0.655, p < 0.001). Notably, these terms began increasing in 2020, preceding ChatGPT's 2022 release, with marked acceleration in 2023-2024.
    Discussion: Certain words and phrases have become more common in medical literature since ChatGPT's introduction. However, the use of these terms tended to increase before 2022, indicating the possibility that the emergence of LLMs amplified existing trends rather than creating entirely new patterns. By understanding which terms are overused by AI, medical educators and researchers can promote better editing of AI-assisted drafts and maintain diverse vocabulary across scientific writing.
    DOI:  https://doi.org/10.5334/pme.1929
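    The abstract flags terms whose usage reached a modified Z-score of at least 3.5 in 2024. A common form of the modified Z-score (Iglewicz-Hoaglin) uses the median and the median absolute deviation with a 0.6745 constant; assuming that variant, here is a minimal Python sketch with invented yearly frequencies.

      # Modified Z-score in the Iglewicz-Hoaglin form (median/MAD, 0.6745
      # constant, flag at >= 3.5). Whether the paper used this exact variant
      # is an assumption; the per-year frequencies below are invented.
      import statistics

      def modified_z(values: list[float]) -> list[float]:
          med = statistics.median(values)
          mad = statistics.median(abs(v - med) for v in values)
          if mad == 0:
              return [0.0] * len(values)
          return [0.6745 * (v - med) / mad for v in values]

      freq = [1.0, 1.1, 1.2, 1.1, 1.3, 2.8, 5.9]   # relative frequency, 2018-2024
      z = modified_z(freq)
      print([round(v, 2) for v in z])               # the 2024 value exceeds 3.5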
  6. J Med Internet Res. 2025 Dec 08;27:e77110
       Background: Health research that uses predictive and generative artificial intelligence (AI) is rapidly growing. As in traditional clinical studies, the way in which AI studies are conducted can introduce systematic errors. Translating this AI evidence into clinical practice and research requires critical appraisal tools for clinical decision-makers and researchers.
    Objective: This study aimed to identify existing tools for the critical appraisal of clinical studies that use AI and to examine the concepts and domains these tools explore. The research question was framed using the Population-Concept-Context (PCC) framework. Population (P): AI clinical studies; Concept (C): tools for critical appraisal and associated constructs such as quality, reporting, validity, risk of bias, and applicability; Context (C): clinical practice. In addition, studies on bias classification and chatbot assessment were included.
    Methods: We searched medical and engineering databases (MEDLINE, Embase, CINAHL, PsycINFO, and IEEE) from inception to April 2024. We included clinical primary research presenting tools for critical appraisal. Classical reviews and systematic reviews were included in the first phase of screening and excluded in the second phase, after new tools had been identified by forward snowballing. We excluded nonhuman, computer, and mathematical research, as well as letters, opinion papers, and editorials. We used Rayyan (Qatar Computing Research Institute) for screening. Data extraction was performed by two reviewers, and discrepancies were resolved through discussion. The protocol was registered in advance in the Open Science Framework. We adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) and the PRISMA-S (PRISMA-Search) extension for reporting literature searches in systematic reviews.
    Results: We retrieved 4393 unique records for screening. After excluding 3803 records, 119 were selected for full-text screening. From these, 59 were excluded. After inclusion of 10 studies identified via other methods, a total of 70 records were finally included. We found 46 tools (26 guides for reporting AI studies, 16 tools for critical appraisal, 2 for study quality, and 2 for risk of bias). Nine papers focused on bias classification or mitigation. We found 15 chatbot assessment studies or systematic reviews of chatbot studies (6 and 9, respectively), which form a very heterogeneous group.
    Conclusions: The results depict a landscape of evidence tools in which reporting tools predominate, followed by critical appraisal tools, with only a few tools for risk of bias. The mismatch between concepts of bias in AI and in epidemiology should be considered in critical appraisal, especially regarding fairness and bias mitigation in AI. Finally, chatbot assessment studies represent a vast and evolving field in which progress in design, reporting, and critical appraisal is necessary and urgent.
    Keywords:  artificial intelligence; critical appraisal tools; reporting guides; risk of bias; scoping review
    DOI:  https://doi.org/10.2196/77110
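    As a quick arithmetic check on the counts reported above, a short Python sketch; all numbers are taken directly from the abstract.

      # Consistency check on the screening flow and tool tallies stated above.
      tools = {"reporting guides": 26, "critical appraisal": 16,
               "study quality": 2, "risk of bias": 2}
      assert sum(tools.values()) == 46              # 46 tools in total

      included = 119 - 59 + 10                      # full-text kept + other methods
      assert included == 70                         # 70 records finally included
      print(f"{included} records included; {sum(tools.values())} tools")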