bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-09-14
thirteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. JMIR AI. 2025 Sep 11. 4 e68097
       Background: Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.
    Objective: Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluations, using ChatGPT (GPT-4).
    Methods: We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently extracted 60 data elements using manual coding and compared them with the responses from ChatGPT to 420 queries spanning 7 iterations.
    Results: ChatGPT's accuracy improved as prompts were refined, showing improvements of 33% and 23% between the initial and final iterations for extracting study settings and behavioral components, respectively. In the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. However, in the final iteration, ChatGPT extracted 43 (71.7%) of the 60 data elements, showing better performance in extracting explicitly stated study settings (28/30, 93.3%) than in extracting subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations.
    Conclusions: Our findings underscore LLMs' utility in extracting basic as well as explicit data in SLRs by using effective prompts. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.
    Keywords:  evidence synthesis; generative artificial intelligence; human-AI collaboration; large language models; systematic reviews
    DOI:  https://doi.org/10.2196/68097
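    As a concrete (if simplified) illustration of the evaluation above, the Python sketch below scores LLM-extracted data elements against the manual reference coding; exact matching after normalization is an assumption for illustration, not the authors' matching rule:

        # Minimal sketch: score LLM answers against manual (reference) coding.
        def extraction_accuracy(llm_answers: list[str], manual_codes: list[str]) -> float:
            matches = sum(a.strip().lower() == m.strip().lower()
                          for a, m in zip(llm_answers, manual_codes))
            return matches / len(manual_codes)

        # The paper reports 26/60 correct with the initial prompts and 43/60
        # after seven iterations of prompt refinement.
        print(extraction_accuracy(["agent-based model", "UK"],
                                  ["Agent-based model", "UK"]))  # 1.0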
  2. Am J Physiol Heart Circ Physiol. 2025 Sep 12.
      The exponential growth in academic publishing, exceeding 2 million papers annually as of 2023, has rendered traditional systematic review methods unsustainable [1-3]. These conventional approaches typically require 6-24 months for completion [4-6], creating critical delays between evidence availability and clinical implementation. While existing automation tools demonstrate workload reductions of 30-72.5%, their machine learning dependencies create barriers to immediate implementation [7-10]. Additionally, direct AI screening methods involve substantial computational costs, lack real-time adaptability, suffer from inconsistent performance across different research domains, and provide no clear audit trail for regulatory compliance. We present a one-week systematic review acceleration protocol using rule-based automation in which artificial intelligence (AI) assists with code generation. Researchers define screening criteria, then use AI language models (Claude, ChatGPT) as coding assistants. This protocol employs a two-phase screening process: (1) rule-based title/abstract screening and (2) rule-based full-text analysis, while adhering to established systematic review guidelines such as Cochrane methodology and PRISMA reporting [5, 6]. The rule-based system provides immediate implementation with complete transparency, while a validation framework guides researchers in systematically testing screening sensitivity to minimize false negatives and ensure comprehensive study capture; meta-analysis and statistical synthesis remain manual processes requiring human expertise. We demonstrate the protocol's application through a case study examining cardiac fatty acid oxidation in heart failure with preserved ejection fraction (HFpEF) [11] and validate it through a separate review examining e-cigarette versus traditional cigarette cardiopulmonary effects, which successfully processed 3,791 records [12]. This protocol represents a substantial advancement in systematic review methodology, making high-quality evidence synthesis more accessible across a broad range of scientific disciplines.
    Keywords:  artificial intelligence in research; digital research methods; machine learning alternatives; scientific automation; systematic reviews
    DOI:  https://doi.org/10.1152/ajpheart.00374.2025
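    A minimal sketch of the rule-based title/abstract screening phase described above; the inclusion and exclusion patterns below are hypothetical examples, not the authors' criteria:

        import re

        INCLUDE_PATTERNS = [r"\bheart failure\b", r"\bHFpEF\b"]    # topic rules
        EXCLUDE_PATTERNS = [r"\bcase report\b", r"\beditorial\b"]  # design rules

        def screen(record: dict) -> str:
            """Return 'include', 'exclude', or 'maybe' for one record."""
            text = f"{record.get('title', '')} {record.get('abstract', '')}"
            if any(re.search(p, text, re.I) for p in EXCLUDE_PATTERNS):
                return "exclude"
            if any(re.search(p, text, re.I) for p in INCLUDE_PATTERNS):
                return "include"
            return "maybe"  # borderline records go to a human reviewer

        records = [{"title": "Fatty acid oxidation in HFpEF", "abstract": ""}]
        print([screen(r) for r in records])  # ['include']

    Because the rules are plain code, every screening decision can be re-run and inspected, which is the transparency and audit-trail argument the authors make.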
  3. J Evid Based Med. 2025 Sep 11. e70067
     BACKGROUND: Formulating evidence-based recommendations for practice guidelines is a complex process that requires substantial expertise. Artificial intelligence (AI) is promising in accelerating the guideline development process. This study evaluates the feasibility of leveraging five large language models (LLMs), namely ChatGPT-3.5, Claude-3 Sonnet, Bard, ChatGLM-4, and Kimi Chat, to generate recommendations based on structured evidence, assesses their concordance, and explores the potential of AI in guideline development.
    METHODS: The general and specific prompts were drafted and validated. We searched PubMed to include evidence-based guidelines related to health and lifestyle. We randomly selected one recommendation from every included guideline as the sample and extracted the evidence base supporting the selected recommendations. The prompts and evidence were fed into five LLMs to generate structured recommendations.
    RESULTS: ChatGPT-3.5 demonstrated the highest proficiency in comprehensively extracting and synthesizing evidence to formulate novel insights. Bard consistently adhered to existing guideline principles, aligning its algorithm with these tenets. Claude generated fewer topical recommendations, focusing instead on evidence analysis and mitigating irrelevant information. ChatGLM-4 exhibited a balanced approach, combining evidence extraction with adherence to guideline principles. Kimi showed potential in generating concise and targeted recommendations. Among the six generated recommendations, average consistency ranged from 50% to 91.7%.
    CONCLUSION: The findings of this study suggest that LLMs hold immense potential in accelerating the formulation of evidence-based recommendations. LLMs can rapidly and comprehensively extract and synthesize relevant information from structured evidence, generating recommendations that align with the available evidence.
    Keywords:  ChatGPT; evidence‐based decision‐making; guideline; large language model
    DOI:  https://doi.org/10.1111/jebm.70067
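    The abstract does not specify how consistency was computed; one simple interpretation, sketched below in Python, is the share of models agreeing with the most common recommendation direction:

        from collections import Counter

        def consistency(directions: list[str]) -> float:
            """Share of models agreeing with the modal recommendation."""
            top_count = Counter(directions).most_common(1)[0][1]
            return top_count / len(directions)

        # e.g., five models' recommendation directions for one question
        print(consistency(["for", "for", "for", "against", "for"]))  # 0.8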
  4. Curr Opin Ophthalmol. 2025 Sep 15.
       PURPOSE OF REVIEW: Traditional health economic analysis is essential for guiding healthcare decision-making but is hindered by slow, resource-intensive processes. This review examines how recent advancements in artificial intelligence can automate and accelerate the core components of health economic analysis, from evidence generation to economic modeling and regulatory submissions, and explores the implications of this transformation for ophthalmology.
    RECENT FINDINGS: Recent proof-of-concept studies demonstrate that artificial intelligence can automate systematic literature reviews with high accuracy, significantly reducing screening times while matching or exceeding the sensitivity of human reviewers. In economic modeling, artificial intelligence systems can now autonomously write and adapt complex simulation code from textual descriptions, replicating the results of published models with near-perfect fidelity. Furthermore, to ensure rigor, new reporting guidelines like ELEVATE-GenAI are emerging alongside proactive regulatory position statements from health technology assessment agencies like NICE. While direct applications in ophthalmology remain in their early stages, these combined developments signal a transformative potential to accelerate the cost-effectiveness assessment of emerging sight-saving technologies.
    SUMMARY: Artificial intelligence-driven automation represents a paradigm shift in health economic analysis, enabling evaluations that once took months to be completed in a fraction of the time. This capability is particularly critical for ophthalmology's rapidly evolving technological landscape, enabling dynamic assessment of innovations from artificial intelligence-powered diagnostics and robotic surgical systems to novel gene therapies and advanced pharmaceuticals. Although challenges remain regarding analytical validity, bias amplification, and regulatory acceptance, the integration of artificial intelligence promises to accelerate evidence-based adoption of sight-saving technologies through responsive, context-specific economic insights.
    Keywords:  artificial intelligence; economic modeling; health economics; large language models; systematic review automation
    DOI:  https://doi.org/10.1097/ICU.0000000000001173
  5. Cochrane Evid Synth Methods. 2025 Sep;3(5): e70045
      To inform implementation recommendations for novel or emerging technologies, Research Information Services at Canada's Drug Agency conducted a multimodal research project involving a literature review, a retrospective comparative analysis, and a focus group on 3 Artificial Intelligence (AI) or automation tools for information retrieval (AI search tools): Lens.org, SpiderCite, and Microsoft Copilot. For the comparative analysis, the customary information retrieval practices used at Canada's Drug Agency served as our reference standard for comparison, and we used the eligible studies of 7 completed projects to measure tool performance. For searches conducted with our usual practice approaches and with each of the 3 tools, we calculated sensitivity/recall, number needed to read (NNR), time to search and screen, unique contributions, and the likely impact of the unique contributions on the projects' findings. Our investigation confirmed that AI search tools have inconsistent and variable performance for the range of information retrieval tasks performed at Canada's Drug Agency. Implementation recommendations from this study informed a "fit for purpose" approach where Information Specialists leverage AI search tools for specific tasks or project types.
    Keywords:  artificial intelligence; biomedical technology assessment; generative artificial intelligence; information science; information storage and retrieval; large language models; review literature as topic
    DOI:  https://doi.org/10.1002/cesm.70045
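    Two of the performance measures above have standard definitions; the small Python sketch below (with made-up record IDs) shows how they follow from a tool's result set and the reference-standard set of eligible studies:

        def retrieval_metrics(retrieved: set, eligible: set) -> dict:
            found = retrieved & eligible
            recall = len(found) / len(eligible) if eligible else 0.0
            # NNR: records read per eligible study found
            nnr = len(retrieved) / len(found) if found else float("inf")
            return {"sensitivity/recall": recall, "NNR": nnr}

        print(retrieval_metrics({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r5"}))
        # sensitivity/recall ~ 0.67, NNR = 2.0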
  6. J Biomed Inform. 2025 Sep 09. pii: S1532-0464(25)00140-6. [Epub ahead of print] 104911
      The exponential growth of biomedical literature has rendered traditional screening methods inefficient and unsustainable, making knowledge discovery akin to finding a needle in a haystack. While recent advances in artificial intelligence (AI) offer new opportunities for rapid literature retrieval, many clinicians and researchers lack familiarity with these tools. In this study, we optimized LitSuggest, a user-friendly, code-free AI-based literature screening system, and established a standardized operational workflow. Using the field of organoid-based bone tissue engineering as a case study, the optimized system achieved an accuracy of 98.83%, precision of 76.19%, recall of 83.33%, and an F1-score of 79.60%, while reducing manual screening workload by over 90%. Furthermore, we innovatively integrated correlation scoring into literature analysis, revealing that China and the United States are leading contributors to bone organoid regeneration research, and that complex and genetic disease organoid models hold significant research potential. This AI-driven approach enables researchers to focus on high-value literature, improving efficiency while guiding future research in bone organoid regeneration and broader biomedical fields.
    Keywords:  Bone organoid; Literature screening; Machine learning; Scoring-based literature analysis
    DOI:  https://doi.org/10.1016/j.jbi.2025.104911
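    The reported F1-score is consistent with the reported precision and recall, as a one-line check shows:

        precision, recall = 0.7619, 0.8333
        f1 = 2 * precision * recall / (precision + recall)
        print(f"{f1:.4f}")  # 0.7960 -> the 79.60% F1-score reported above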
  7. Qual Health Res. 2025 Sep 08. 10497323251365211
      The launch of ChatGPT in November 2022 accelerated discussions and research into whether base large language models (LLMs) could increase the efficiency of qualitative analysis phases or even replace qualitative researchers. Reflexive thematic analysis (RTA) is a commonly used method for qualitative text analysis that emphasizes the researcher's subjectivity and reflexivity to enable a situated, in-depth understanding of knowledge generation. Researchers appear optimistic about the potential of LLMs in qualitative research; however, questions remain about whether base models can meaningfully contribute to the interpretation and abstraction of a dataset. The primary objective of this study was to explore how LLMs may support an RTA of an interview text from health science research. Secondary objectives included identifying recommended prompt strategies for similar studies, highlighting potential weaknesses or challenges, and fostering engagement among qualitative researchers regarding these threats and possibilities. We provided the interview file to an offline LLM and conducted a series of tests aligned with the phases of RTA. Insights from each test guided refinements to the next and contributed to the development of a recommended prompt strategy. At this stage, base LLMs provide limited support and do not increase the efficiency of RTA. At best, LLMs may identify gaps in the researchers' perspectives. Realizing the potential of LLMs to inspire broader discussion and deeper reflections requires a well-defined strategy and the avoidance of misleading prompts, self-referential responses, misguided translations, and errors. In conclusion, high-quality RTA requires a human researcher, a comprehensive familiarization phase, and methodological competence to preserve epistemological integrity.
    Keywords:  artificial intelligence; exploratory study; health science; large language model; qualitative analysis; reflexive thematic analysis; reflexivity
    DOI:  https://doi.org/10.1177/10497323251365211
  8. J Eval Clin Pract. 2025 Sep;31(6): e70272
       RATIONALE: Systematic reviews are essential for evidence-based healthcare decision-making. While it is relatively straightforward to quantitatively assess random errors in systematic reviews, as these are typically reported in primary studies, the assessment of biases often remains narrative. Primary studies seldom provide quantitative estimates of biases and their uncertainties, resulting in systematic reviews rarely including such measurements. Additionally, evidence appraisers often face time constraints and technical challenges that prevent them from conducting quantitative bias assessments themselves. Given that multiple biases and random errors collectively skew the point estimate from the truth, it is important to incorporate comprehensive quantitative methods of uncertainty in systematic reviews. These methods should integrate random errors and biases into a unified measure of uncertainty and be easily accessible to evidence appraisers, preferably through user-friendly software.
    AIMS AND OBJECTIVES: To address this need, we propose a posterior mixture model and introduce AppRaise, a free, web-based interactive software designed to implement this approach.
    METHODS: We showcase its application through a health technology assessment (HTA) report on the effectiveness of continuous glucose monitoring in reducing A1c levels among individuals with type 1 diabetes.
    RESULTS: Applying the AppRaise software to the HTA report revealed a high level of certainty (86% probability) that continuous glucose monitoring would, on average, result in a reduction in A1c levels compared with self-monitoring of blood glucose among Ontarians with type 1 diabetes. These findings were similar to other quantitative bias-adjusted approaches in systematic reviews.
    CONCLUSION: AppRaise can be utilized as a standalone tool or as a complement to validate the quality of evidence assessed using qualitative-based scoring methods. This approach is also useful for assessing the sensitivity of parameter estimates to potential biases introduced by primary studies.
    Keywords:  AppRaise; decision‐making; health technology assessment; posterior mixture model; quantifying uncertainty; systematic review
    DOI:  https://doi.org/10.1111/jep.70272
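    The paper itself defines the posterior mixture model; purely as a generic illustration of the underlying idea (mixing bias scenarios into the posterior alongside random error and reading off the probability of benefit), a Monte Carlo sketch with entirely invented numbers is given below. This is not the AppRaise model:

        import random

        def prob_benefit(mean_diff, se, bias_scenarios, n=100_000):
            """P(true effect < 0) when the observed effect may carry additive bias.
            bias_scenarios: (weight, bias_mean, bias_sd) tuples, weights summing to 1."""
            hits = 0
            for _ in range(n):
                _, b_mean, b_sd = random.choices(
                    bias_scenarios, weights=[w for w, _, _ in bias_scenarios])[0]
                effect = random.gauss(mean_diff, se) - random.gauss(b_mean, b_sd)
                hits += effect < 0
            return hits / n

        # hypothetical A1c difference of -0.3 (SE 0.2) under two bias scenarios
        print(prob_benefit(-0.3, 0.2, [(0.7, 0.0, 0.05), (0.3, -0.1, 0.1)]))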
  9. Patterns (N Y). 2025 Jul 11. 6(7): 101318
      ASReview LAB v.2 introduces an advancement in AI-assisted systematic reviewing by enabling collaborative screening with multiple experts ("a crowd of oracles") using a shared AI model. The platform supports multiple AI agents within the same project, allowing users to switch between fast general-purpose models and domain-specific, semantic, or multilingual transformer models. Leveraging the SYNERGY benchmark dataset, performance has improved significantly, showing a 24.1% reduction in loss compared to version 1 through model improvements and hyperparameter tuning. ASReview LAB v.2 follows user-centric design principles and offers reproducible, transparent workflows. It logs key configuration and annotation data while balancing full model traceability with efficient storage. Future developments include automated model switching based on performance metrics, noise-robust learning, and ensemble-based decision-making.
    Keywords:  active learning; crowdsourcing; data-driven screening; hyperparameter optimization; machine learning; multiagent systems; open-source software; reproducibility; systematic reviews; transparency
    DOI:  https://doi.org/10.1016/j.patter.2025.101318
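    For readers new to active-learning screeners, the toy Python loop below captures the general pattern (train on the labels so far, surface the most-likely-relevant record to a reviewer, repeat); it is generic logic with toy data, not ASReview LAB's implementation:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        docs = ["systematic review of statins", "statin trial in adults",
                "gardening tips", "cooking with herbs", "statins and lipids"]
        labels = {0: 1, 2: 0}  # records already screened (1 = relevant)

        X = TfidfVectorizer().fit_transform(docs)
        while len(labels) < len(docs):
            model = LogisticRegression().fit(X[list(labels)], list(labels.values()))
            unlabeled = [i for i in range(len(docs)) if i not in labels]
            # show the most-likely-relevant record to the next reviewer
            best = max(unlabeled, key=lambda i: model.predict_proba(X[i])[0, 1])
            labels[best] = int("statin" in docs[best])  # stand-in for a human decision

    In the "crowd of oracles" setup, multiple reviewers would supply the labels while sharing the one model.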
  10. JMIR Med Inform. 2025 Sep 08. 13 e76252
       BACKGROUND: Primary liver cancer, particularly hepatocellular carcinoma (HCC), poses significant clinical challenges due to late-stage diagnosis, tumor heterogeneity, and rapidly evolving therapeutic strategies. While systematic reviews and meta-analyses are essential for updating clinical guidelines, their labor-intensive nature limits timely evidence synthesis.
    OBJECTIVE: This study proposes an automated literature screening workflow powered by large language models (LLMs) to accelerate evidence synthesis for HCC treatment guidelines.
    METHODS: We developed a tripartite LLM framework integrating Doubao-1.5-pro-32k, Deepseek-v3, and DeepSeek-R1-Distill-Qwen-7B to simulate collaborative decision-making for study inclusion and exclusion. The system was evaluated across 9 reconstructed datasets derived from published HCC meta-analyses, with performance assessed using accuracy, agreement metrics (κ and prevalence-adjusted bias-adjusted κ), recall, precision, F1-scores, and computational efficiency parameters (processing time and cost).
    RESULTS: The framework demonstrated good performance, with a weighted accuracy of 0.96 and substantial agreement (prevalence-adjusted bias-adjusted κ=0.91), achieving high weighted recall (0.90) but modest weighted precision (0.15) and F1-scores (0.22). Computational efficiency varied across datasets (processing time: 248-5850 s; cost: US $0.14-$3.68 per dataset).
    CONCLUSIONS: This LLM-driven approach shows promise for accelerating evidence synthesis in HCC care by reducing screening time while maintaining methodological rigor. Key limitations related to clinical context sensitivity and error propagation highlight the need for reinforcement learning integration and domain-specific fine-tuning. LLM agent architectures with reinforcement learning offer a practical path for streamlining guideline updates, though further optimization is needed to improve specialization and reliability in complex clinical settings.
    Keywords:  hepatocellular carcinoma; large language model; literature screening; methodology; treatment
    DOI:  https://doi.org/10.2196/76252
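    Prevalence-adjusted bias-adjusted kappa, one of the agreement metrics named above, has a simple closed form (PABAK = 2 * p_o - 1, with p_o the observed agreement), sketched here on made-up screening decisions:

        def pabak(llm_decisions: list[int], reference: list[int]) -> float:
            p_o = sum(a == b for a, b in zip(llm_decisions, reference)) / len(reference)
            return 2 * p_o - 1

        # e.g., 19 matching include/exclude calls out of 20 -> PABAK = 0.9
        print(pabak([1] * 10 + [0] * 10, [1] * 10 + [0] * 9 + [1]))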
  11. Adv Pharm Bull. 2025 Jul;15(2): 467-473
       Purpose: This study explores the potential of generative AI models to aid experts in developing scripts for pharmacokinetic (PK) models, with a focus on constructing a two-compartment population PK model using data from Hosseini et al.
    Methods: The free generative AI tools ChatGPT v3.5, Gemini v2.0 Flash, and Microsoft Copilot could help PK professionals, even those without programming experience, learn the programming languages and skills needed for PK modeling. To evaluate these free AI tools, PK models were created in RStudio, covering key tasks in pharmacometrics and clinical pharmacology, including model descriptions, input requirements, results, and code generation, with a focus on reproducibility.
    Results: ChatGPT demonstrated superior performance compared to Copilot and Gemini, highlighting strong foundational knowledge, advanced concepts, and practical skills, including PK code structure and syntax. Validation indicated high accuracy in estimated and simulated plots, with minimal differences in clearance (Cl) and volume of distribution (Vc and Vp) compared to reference values. The metrics showed absolute fractional error (AFE), absolute average fractional error (AAFE), and mean percentage error (MPE) values of 0.99, 1.14, and -1.85, respectively.
    Conclusion: These results show that generative AI can effectively extract PK data from literature, build population PK models in R, and create interactive Shiny apps for visualization, with expert support.
    Keywords:  Artificial intelligence; ChatGPT; Generative models; Modelling; Pharmacokinetics
    DOI:  https://doi.org/10.34172/apb.025.43852
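    The error metrics named in the results have common pharmacometric definitions (the paper's exact formulas may differ); a small Python sketch with hypothetical predicted and observed values:

        import math

        def afe(pred, obs):   # absolute fractional error (geometric mean fold error)
            return 10 ** (sum(math.log10(p / o) for p, o in zip(pred, obs)) / len(obs))

        def aafe(pred, obs):  # absolute average fractional error
            return 10 ** (sum(abs(math.log10(p / o)) for p, o in zip(pred, obs)) / len(obs))

        def mpe(pred, obs):   # mean percentage error
            return 100 * sum((p - o) / o for p, o in zip(pred, obs)) / len(obs)

        pred, obs = [1.1, 0.9, 1.05], [1.0, 1.0, 1.0]
        print(afe(pred, obs), aafe(pred, obs), mpe(pred, obs))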
  12. Nurs Outlook. 2025 Sep 10. pii: S0029-6554(25)00197-6. [Epub ahead of print] 73(6): 102544
       BACKGROUND: Large language models (LLMs) provide significant potential benefits for nursing practice and research. Advanced natural language processing capabilities can effectively analyze text in qualitative studies. However, systematic exploration of their application contexts and efficacy in nursing remains limited.
    PURPOSE: This review aimed to conduct a scoping review of LLMs' application in qualitative nursing research, exploring current scenarios, effects, assessment methods, and challenges to provide a reference for future development.
    METHODS: Relevant literature was sourced from 11 databases up to April 2025, and the review followed the Joanna Briggs Institute scoping review methodology and PRISMA-ScR reporting standards. The search terms for this review included "nurs*," "large language model*," "qualitative study*," and "qualitative analys*."
    FINDINGS: We included 11 studies after reviewing 2,478 articles. The application scenarios of LLMs in qualitative nursing research included topic generation, role-playing, and interview question generation. LLM outputs demonstrated moderate-to-high similarity to human outputs in theme generation and superior text analysis efficiency but performed poorly in applying theoretical frameworks, generating interview questions, and developing codebooks.
    DISCUSSION: This review systematically outlined the applications and limitations of LLMs in qualitative nursing research. Although LLMs hold great potential, their application is still in its infancy.
    CONCLUSION: Future research needs to address issues such as analysis depth, simulation accuracy, technical limitations, and evaluation tools.
    Keywords:  Large language models; Nursing research; Qualitative research; Qualitative studies; Scoping review
    DOI:  https://doi.org/10.1016/j.outlook.2025.102544
  13. Cochrane Evid Synth Methods. 2025 Sep;3(5): e70049
       Introduction: Information retrieval is essential for evidence synthesis, but developing search strategies can be labor-intensive and time-consuming. Automating these processes would be of benefit and interest, though it is unclear if Information Specialists (IS) are willing to adopt artificial intelligence (AI) methodologies or how they currently use them. In January 2025, the NIHR Innovation Observatory and NIHR Methodology Incubator for Applied Health and Care Research co-sponsored the inaugural CORE Information Retrieval Forum, where attendees discussed AI's role in information retrieval.
    Methods: The CORE Information Retrieval Forum hosted a Knowledge Café. Participation was voluntary, and attendees could choose one of six event-themed discussion tables, one of which focused on AI. To support each discussion, a QR code linking to a virtual collaboration tool (Padlet; padlet.com) and a poster in the exhibition space were available throughout the day for attendee contributions.
    Results: The CORE Information Retrieval Forum was attended by 131 IS from nine different types of organizations, with most from the UK and ten countries represented overall. Among the six discussion points available in the Knowledge Café, the AI table was the most popular, receiving the highest number of contributions (n = 49). Following the Forum, contributions to the AI topic were categorized into four themes: critical perception (n = 21), current uses (n = 19), specific tools (n = 2), and training wants/needs (n = 7).
    Conclusions: While there are critical perspectives on the integration of AI in the IS space, this is not due to a reluctance to adapt and adopt but from a need for structure, education, training, ethical guidance, and systems to support the responsible use and transparency of AI. There is interest in automating repetitive and time-consuming tasks, but attendees reported a lack of appropriate supporting tools. More work is required to identify the suitability of currently available tools and their potential to complement the work conducted by IS.
    Keywords:  artificial intelligence; evidence synthesis; generative AI; information retrieval; information specialist; large language models; literature search
    DOI:  https://doi.org/10.1002/cesm.70049