bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025–11–16
fourteen papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Value Health Reg Issues. 2025 Nov 11. 53: 101539. pii: S2212-1099(25)00465-0. [Epub ahead of print]
       OBJECTIVES: To evaluate the performance of Claude 3.7 Sonnet in automating data extraction for systematic literature reviews (SLRs).
    METHODS: An artificial intelligence (AI) extraction model based on the Claude 3.7 Sonnet large language model was developed through a structured process, including targeted training using a master data list and selected full-text articles. The master data list enhanced the model's contextual knowledge, guiding data extraction. Seven full-text articles from 4 oncology-focused treatment efficacy and safety SLRs were used for early testing and iterative refinement through error analysis. Model performance was then evaluated using 20 full-text articles, drawn from the same SLRs but not used for model development, and benchmarked against human extractions. Evaluation metrics included precision, recall, and F1 score. Extraction time was also compared across 3 different approaches: AI model-only, hybrid (AI model with human verification), and traditional human extraction.
    RESULTS: The AI model extracted 117 889 data points across 106 variables, achieving an overall precision of 98.2%, recall of 96.6%, and F1-score of 97.4%. Extraction performance was highest for Study Characteristics (precision: 97.7%, recall: 98.7%) and Participant Characteristics (precision: 97.3%, recall: 98.7%). Outcome data showed 96.4% recall and 98.7% precision. Intervention Characteristics achieved 97.5% precision and 94.6% recall. Extraction using the AI model alone averaged 4.5 minutes per article, compared with 64.5 minutes with the hybrid approach and approximately 240 minutes with traditional human extraction.
    CONCLUSIONS: The Claude 3.7 Sonnet-based model demonstrated strong performance, supporting efficient and reliable AI-driven data extraction in oncology SLRs, with potential for broader applicability.
    Keywords:  artificial intelligence (AI); data extraction; large language models (LLMs); performance evaluation; systematic reviews
    DOI:  https://doi.org/10.1016/j.vhri.2025.101539
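    As a quick illustration of the evaluation metrics reported above (precision, recall, and F1 over extracted data points judged against human extractions), here is a minimal Python sketch; the counts are hypothetical placeholders, not the study's raw data.

        def extraction_metrics(tp: int, fp: int, fn: int) -> dict:
            """Precision, recall, and F1 for extracted data points judged against a human reference."""
            precision = tp / (tp + fp)   # correct extractions / all extractions made
            recall = tp / (tp + fn)      # correct extractions / all data points that should be extracted
            f1 = 2 * precision * recall / (precision + recall)
            return {"precision": precision, "recall": recall, "f1": f1}

        # Hypothetical counts for illustration only:
        print(extraction_metrics(tp=960, fp=18, fn=34))
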
  2. Environ Evid. 2025 Nov 12. 14(1): 21
      Artificial intelligence (AI) is increasingly being explored as a tool to optimize and accelerate various stages of evidence synthesis. A persistent challenge is that environmental evidence syntheses remain predominantly monolingual (English), leading to biased results and misinforming cross-scale policy decisions. AI offers a promising opportunity to incorporate non-English-language evidence into the screening stage of evidence syntheses and to move beyond the current monolingual focus. Using a corpus of Spanish-language peer-reviewed papers on biodiversity conservation interventions, we developed and evaluated text classifiers using supervised machine learning models. Our best-performing model achieved 100% recall, meaning no relevant papers (n = 9) were missed, and filtered out over 70% (n = 867) of negative documents based only on the title and abstract of each paper. The text was encoded using a pre-trained multilingual model, and class weights were used to deal with a highly imbalanced dataset (0.79% relevant). This research therefore offers an approach to reducing the manual, time-intensive effort required for document screening in evidence syntheses, with minimal risk of missing relevant studies. It highlights the potential of multilingual large language models and class weights to train a lightweight non-English-language classifier that can effectively filter irrelevant texts using only a small labelled non-English corpus. Future work could build on our approach to develop a multilingual classifier that enables the inclusion of any non-English scientific literature in evidence syntheses.
    Keywords:  Biodiversity conservation; Evidence synthesis; Explainable AI; Language barriers; Multilingual language model; Natural language processing; Non-English; SHAP
    DOI:  https://doi.org/10.1186/s13750-025-00370-9
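    A minimal sketch of the kind of pipeline described above: titles and abstracts are embedded with a pre-trained multilingual sentence encoder, and a class-weighted classifier handles the heavily imbalanced screening labels. The encoder name, library choices, and toy data are assumptions for illustration, not the authors' code.

        from sentence_transformers import SentenceTransformer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import recall_score

        # Toy Spanish-language title+abstract texts with screening labels (1 = relevant).
        texts = ["Efectos de la restauración de humedales sobre aves acuáticas ...",
                 "Informe anual de actividades administrativas ..."]
        labels = [1, 0]

        # Assumed pre-trained multilingual encoder; any comparable model would do.
        encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        X = encoder.encode(texts)

        # Class weights counteract the extreme imbalance (under 1% relevant in the study).
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(X, labels)

        # Screening prioritizes recall: no relevant paper should be filtered out.
        print("recall:", recall_score(labels, clf.predict(X)))
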
  3. J Laparoendosc Adv Surg Tech A. 2025 Nov 04.
      Background: Smoking is associated with higher complication and recurrence rates in ventral and inguinal hernia repairs, but evidence is fragmented. This study evaluated the efficacy of AI-based large language models (LLMs) for identifying literature on the impact of smoking on hernia repairs. Methods: ChatGPT 4.0, ChatGPT 4o, Microsoft Copilot, and Google Gemini were instructed to search PubMed, Embase, and Scopus for retrospective/prospective studies and randomized controlled trials regarding smoking's effects on ventral and inguinal hernia repairs. The models' outputs were cross-checked against previous systematic reviews to assess accuracy. Results: The artificial intelligence (AI) tools generated 24 citations, of which only nine (37.5%) proved valid and relevant. Thirteen (54.2%) were fabricated references, and two (8.3%) cited studies that did not match the specified criteria. Additionally, the AIs identified two studies missed by previous systematic reviews but overlooked 35 (79.5%) of the studies recognized by those reviews. Conclusions: Although LLMs can quickly compile potentially relevant references, they are prone to fabricating or omitting crucial studies. Human verification remains essential for conducting reliable, comprehensive literature searches in systematic reviews and meta-analyses.
    Keywords:  artificial intelligence; hernia repair; large language models; smoking
    DOI:  https://doi.org/10.1177/10926429251393122
  4. Campbell Syst Rev. 2025 Dec;21(4): e70074
      Evidence synthesists are ultimately responsible for their evidence synthesis, including the decision to use artificial intelligence (AI) and automation, and to ensure adherence to legal and ethical standards. Cochrane, the Campbell Collaboration, JBI, and the Collaboration for Environmental Evidence support the aims of the Responsible use of AI in evidence SynthEsis (RAISE) recommendations, which provide a framework for ensuring responsible use of AI and automation across all roles within the evidence synthesis ecosystem. Evidence synthesists developing and publishing syntheses with Cochrane, the Campbell Collaboration, JBI, and the Collaboration for Environmental Evidence can use AI and automation as long as they can demonstrate that it will not compromise the methodological rigor or integrity of their synthesis. AI and automation in evidence synthesis should be used with human oversight. Any use of AI or automation that makes or suggests judgements should be fully and transparently reported in the evidence synthesis report. AI tool developers should proactively ensure their AI systems or tools adhere to the RAISE recommendations so we have clear, transparent, and publicly available information to inform decisions about whether an AI system or tool could and should be used in evidence synthesis.
    DOI:  https://doi.org/10.1002/cl2.70074
  5. Ann Pediatr Endocrinol Metab. 2025 Oct;30(5): 229-241
      The integration of large language models (LLMs) in academic research has transformed traditional research methodologies. This review investigates the current state, applications, and limitations of LLMs, particularly ChatGPT, in medical and scientific research. I performed a systematic review of recent literature and LLM development reports on artificial intelligence-assisted research tools, including commercial LLM services (GPT-4o, Claude 3, Gemini Pro) and specialized research platforms (Genspark, Scispace). I evaluated their performance, applications, and limitations across stages of the research process. Recent advancements in LLMs show potential for improving research efficiency, particularly in literature review, data analysis, and manuscript preparation. Performance comparisons revealed varying strengths: GPT-4o and o1 performed best overall, Claude 3 in writing and coding, and Gemini Pro in multimodal processing. It is therefore important to choose and use each model according to its advantages. However, hallucination risks, inherent biases, plagiarism, and privacy issues remain significant concerns. The emergence of Retrieval-Augmented Generation models and specialized research tools has improved accuracy and access to current information. LLMs offer effective support for research productivity, but they should serve as complementary tools rather than primary research drivers. The successful application of these tools depends on a thorough understanding of their limitations, strict adherence to ethical guidelines, and preservation of researcher autonomy.
    Keywords:  Artificial intelligence; Biomedical research; Generative pre-trained transformer; Natural language processing; Research ethics
    DOI:  https://doi.org/10.6065/apem.2550028.014
  6. J Cancer Educ. 2025 Nov 09.
      
    Keywords:  Artificial intelligence; ChatGPT; Natural language processing; Pathology; Radiation oncology
    DOI:  https://doi.org/10.1007/s13187-025-02783-z
  7. Endocrinol Metab (Seoul). 2025 Oct;40(5): 659-667
      Research applying machine learning and deep learning has become increasingly common in medicine. However, for clinicians lacking Python programming skills, conducting such research has often been an intractable task, even when ample data were available. The emergence of 'vibe coding' in 2025 has substantially lowered this barrier to entry. This review defines vibe coding, provides a taxonomy of its available tools, and illustrates its practical application through several use cases. Vibe coding is a goal-oriented process in which the user focuses on the desired outcome, issuing natural language directives for environment setup, functionality specification, and output format. The generative artificial intelligence (AI) then produces and refines the underlying code through an interactive feedback loop. Tools such as generative AI platforms (e.g., ChatGPT, Gemini, Claude), graphical user interface-based agents (e.g., Memex, Replit), AI-augmented editors (e.g., Cursor, Visual Studio Code), and command-line interface (CLI) agents (e.g., Gemini CLI, Codex CLI, Claude Code) are available. Demonstrative case studies using publicly accessible datasets illustrate how clinicians can generate and refine Python scripts for classification tasks with minimal coding expertise. Researchers are encouraged to select an accessible tool and gain hands-on experience with real-world data. The adoption of these tools by clinicians, residents, and medical students may promote broader engagement with machine learning and accelerate medical research.
    Keywords:  Artificial intelligence; Biomedical research; Deep learning; Machine learning; Physicians
    DOI:  https://doi.org/10.3803/EnM.2025.2675
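    To make the use case above concrete, here is the kind of minimal classification script a vibe-coding session might produce from a natural language prompt; the dataset and model choices are illustrative assumptions, not taken from the review's case studies.

        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        # A publicly accessible tabular dataset stands in for clinical data.
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y)

        model = RandomForestClassifier(n_estimators=200, random_state=42)
        model.fit(X_train, y_train)

        # Report discrimination on the held-out split.
        print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
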
  8. Digit Health. 2025 Jan-Dec;11: 20552076251384604
       Objectives: This study aims to evaluate the stylistic and structural equivalence of Artificial Intelligence (AI)-generated summaries, particularly those by Large Language Models (LLMs) like ChatGPT, compared to traditional human-generated case summaries in neuro-oncological board decisions. The primary goal is to explore the stylistic alignment between AI-generated and human-authored summaries from board meeting audio recordings.
    Methods: The study compares 30 traditional human-generated case summaries with 30 AI-generated summaries based on board meeting audio recordings. Two expert raters, blinded to the source of the summaries, evaluated a total of 60 cases. A Likert scale was used to assess the plausibility, linguistic style, evidence adherence, and reference accuracy of the summaries.
    Results: The results indicated that both LLM-generated and human-reviewed summaries demonstrated consistently high performance across all criteria evaluated. The general plausibility ratings were comparable (LLM: 4.7, Human: 4.73, P = .959). Linguistic style ratings also showed similarity (LLM: 4.87, Human: 4.97, P = .512). In terms of adherence to evidence, the means were close (LLM: 4.8, Human: 4.87, P = .541). Reference accuracy was slightly higher for AI-generated summaries (LLM: 4.97, Human: 4.9, P = .664). These findings were consistent with the results from Rater 2, and statistical analysis using Kendall's tau showed no significant differences between methods (P > .05).
    Conclusion: The study finds that LLM-generated summaries can effectively emulate the style and structure of human-authored ones, indicating their promise as an additional tool in neuro-oncology. These AI models can enhance documentation quality and serve as valuable support in clinical settings. While further research is necessary to explore broader applications, LLMs offer exciting potential as a complement to traditional decision-making processes.
    Keywords:  Neurooncology; artificial intelligence; clinical decision making; evidence-based recommendations; large language models; neurosurgery
    DOI:  https://doi.org/10.1177/20552076251384604
  9. Cardiovasc Diagn Ther. 2025 Oct 31. 15(5): 1107-1112
      Artificial intelligence (AI) has emerged as a widely used tool for writing, including in scientific research and publications. While its application to cardiovascular research is the focus of numerous studies, the policies related to its use for manuscript writing are rapidly evolving and not well understood. We sought to compare the policies of high-impact cardiovascular journals regarding AI for manuscript writing assistance and assess the prevalence of its use. Cardiovascular medicine journals with an SCImago Journal Rank (SJR) ≥3 and h-index ≥100 were screened for an AI policy. Journal policies were assessed for author disclosure requirements, standardization of the disclosure section and language, and AI detection software used during the submission process. Each journal with an AI policy that required disclosure of its use was systematically searched to evaluate the prevalence of articles disclosing its use for writing assistance from January 2023 to August 2025. The number of publications with AI disclosure and publication characteristics were recorded. Seventeen journals met inclusion criteria and were screened for an AI policy, of which 14 journals (82%) contained such a policy. Among these, three journals (18%) had an AI policy that required disclosure, but that was not specific to AI use for manuscript writing. One journal (6%) did not require disclosure. The remaining three journals (18%) did not have any AI policy. None of the journals mandated a dedicated AI disclosure section or provided authors with standardized disclosure language. Fifteen journals (88%) used identifiable AI detection software, while only one posted this information publicly. Among the 14 journals with an AI disclosure policy, 11 AI-disclosing works were found. ChatGPT was the most common AI tool used (n=9, 82%). Journal policies regarding AI use for manuscript writing assistance vary widely, and therefore, there is a growing need for standardization. The prevalence of articles disclosing the use of AI was profoundly low across all journals evaluated, with significant variation in how AI use was disclosed. Having clear and consistent policies across journals and requiring authors to disclose their use of AI for manuscript writing is essential to uphold transparency and maintain medical research integrity.
    Keywords:  Artificial intelligence (AI); cardiovascular journals; disclosure policies; manuscript writing
    DOI:  https://doi.org/10.21037/cdt-2025-381
  10. Cureus. 2025 Oct;17(10): e94259
      Among generative artificial intelligence (AI) tools, ChatGPT (OpenAI, San Francisco, CA, USA) has seen rapid adoption in the education sector due to its accessibility and versatility. While it supports a wide range of academic tasks, it also has significant limitations. Its difficulty in fully understanding the context of the conversations can lead to vague or inaccurate responses, raising questions about its reliability in educational and healthcare settings, where contextual accuracy is paramount. This narrative review aims to explore the challenges and prospects of using ChatGPT as a support tool in health science teaching and learning processes at the university level.  A review of the scientific literature on the use of generative artificial intelligence was conducted in the main databases between April and May of 2025. Studies published between January 2023 and March 27, 2025, were included to access the most up-to-date information available. The studies had to focus on the use of generative AI in higher education settings, specifically in the field of health sciences. A total of 18 documents met the inclusion criteria and were selected for analysis.  The results reveal that ChatGPT is widely used in undergraduate health education for tasks such as providing writing support, helping with content comprehension, generating quizzes, creating clinical simulations, and designing curricula. However, challenges such as misinformation, ethical concerns, and overreliance on AI were frequently noted. Additionally, disparities in access and lack of formal training for both students and educators were revealed to be significant barriers.  In conclusion, ChatGPT has significant potential to improve teaching and learning in health sciences education by providing personalized support, real-time feedback, and resource creation. However, effectively integrating ChatGPT into health sciences education requires paying special attention to ethical standards, equitable access, and developing digital literacy to ensure it complements, rather than replaces, fundamental human expertise.
    Keywords:  artificial intelligence; chatgpt; higher education; impact; student
    DOI:  https://doi.org/10.7759/cureus.94259
  11. Nature. 2025 Nov 13.
      
    Keywords:  Computer science; Machine learning
    DOI:  https://doi.org/10.1038/d41586-025-03379-9
  12. Nature. 2025 Nov 14.
      
    Keywords:  Ethics; Machine learning; Technology
    DOI:  https://doi.org/10.1038/d41586-025-03701-5
  13. JMIR Ment Health. 2025 Nov 12. 12: e80371
       BACKGROUND: Mental health researchers are increasingly using large language models (LLMs) to improve efficiency, yet these tools can generate fabricated but plausible-sounding content (hallucinations). A notable form of hallucination involves fabricated bibliographic citations that cannot be traced to real publications. Although previous studies have explored citation fabrication across disciplines, it remains unclear whether citation accuracy in LLM output systematically varies across topics within the same field that differ in public visibility, scientific maturity, and specialization.
    OBJECTIVE: This study aims to examine the frequency and nature of citation fabrication and bibliographic errors in GPT-4o (Omni) outputs when generating literature reviews on mental health topics that varied in public familiarity and scientific maturity. We also tested whether prompt specificity (general vs specialized) influenced fabrication or accuracy rates.
    METHODS: In June 2025, GPT-4o was prompted to generate 6 literature reviews (~2000 words; ≥20 citations) on 3 disorders representing different levels of public awareness and research coverage: major depressive disorder (high), binge eating disorder (moderate), and body dysmorphic disorder (low). Each disorder was reviewed at 2 levels of specificity: a general overview (symptoms, impacts, and treatments) and a specialized review (evidence for digital interventions). All citations were extracted (N=176) and systematically verified using Google Scholar, Scopus, PubMed, WorldCat, and publisher databases. Citations were classified as fabricated (no identifiable source), real with errors, or fully accurate. Fabrication and accuracy rates were compared by disorder and review type by using chi-square tests.
    RESULTS: Across the 6 reviews, GPT-4o generated 176 citations; 35 (19.9%) were fabricated. Among the 141 real citations, 64 (45.4%) contained errors, most frequently incorrect or invalid digital object identifiers. Fabrication rates differed significantly by disorder (χ²(2)=13.7; P=.001), with higher rates for binge eating disorder (17/60, 28%) and body dysmorphic disorder (14/48, 29%) than for major depressive disorder (4/68, 6%). While fabrication did not differ overall by review type, stratified analyses showed higher fabrication for specialized versus general reviews of binge eating disorder (11/24, 46% vs 6/36, 17%; P=.01). Accuracy rates also varied by disorder (χ²(2)=11.6; P=.003), being lowest for body dysmorphic disorder (20/34, 59%) and highest for major depressive disorder (41/64, 64%). Accuracy rates differed by review type within some disorders, including higher accuracy for general reviews of major depressive disorder (26/34, 77% vs 15/30, 50%; P=.03).
    CONCLUSIONS: Citation fabrication and bibliographic errors remain common in GPT-4o outputs, with more than half of citations being fabricated or inaccurate. Reliability systematically varied by disorder familiarity and prompt specificity, with greater risks in less visible or specialized mental health topics. These findings highlight the need for careful prompt design, rigorous human verification of all model-generated references, and stronger journal and institutional safeguards to protect research integrity as LLMs are integrated into academic practice.
    Keywords:  AI; academic research; artificial intelligence; citations; large language models; mental health; psychiatry
    DOI:  https://doi.org/10.2196/80371
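    As a quick illustration of the chi-square comparison reported above, the per-disorder fabrication counts can be checked with a standard test of independence; this is a sketch of mine, not the authors' analysis code.

        from scipy.stats import chi2_contingency

        # Columns: fabricated vs. real citations, as reported in the abstract.
        counts = [[ 4, 64],   # major depressive disorder (4/68 fabricated)
                  [17, 43],   # binge eating disorder (17/60 fabricated)
                  [14, 34]]   # body dysmorphic disorder (14/48 fabricated)

        chi2, p, dof, expected = chi2_contingency(counts)
        print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3f}")   # ≈ χ²(2)=13.7, P=.001
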
  14. Eur J Cardiothorac Surg. 2025 Nov 11. pii: ezaf394. [Epub ahead of print]
      
    Keywords:  artificial intelligence; fabrication; human; research; writing
    DOI:  https://doi.org/10.1093/ejcts/ezaf394