bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-07-13
five papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cureus. 2025 Jun;17(6): e85338
      Artificial intelligence (AI) has the potential to transform healthcare, medical education, and research. Large language models (LLMs) have gained attention for their ability to improve qualitative research by automating data analysis, coding, and thematic interpretation. While prior research has evaluated LLMs' performance in qualitative studies, clear guidelines on their implementation remain scarce. This manuscript offers detailed methods, with instructions and prompts, for using LLMs in qualitative analysis, providing a clear, step-by-step, practical approach. We developed a customized generative pre-trained transformer (Custom-GPT) based on Braun and Clarke's six-step thematic analysis framework. The performance of the model was evaluated across three datasets, comparing its outputs with manually generated codes and themes. Triangulation was conducted using Google's NotebookLM. Across the three datasets, the model generated consistent thematic structures that aligned closely with manual coding. However, slight variability in responses, the lack of explanations for the AI's decisions, and the need for repeated prompting during the process were the main challenges. Additional human interventions were required between steps to refine outputs and ensure methodological integrity. LLMs offer promising opportunities to enhance qualitative thematic analysis. However, their limitations emphasize the necessity of human oversight throughout the process. This report highlights the importance of integrating AI tools responsibly, emphasizing methodological rigor, and developing clear guidelines for AI-assisted qualitative research. Future research should explore ethical frameworks, domain-specific LLMs, and advanced prompt engineering techniques to optimize AI's role in qualitative analysis.
    Keywords:  ai-assisted research; artificial intelligence (ai); braun and clarke’s framework; custom-gpt; large language models (llms); qualitative research; thematic analysis
    DOI:  https://doi.org/10.7759/cureus.85338
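    Editor's note: as a rough illustration of the kind of prompting the authors describe, the Python sketch below sends one interview excerpt to a chat-completion model with instructions loosely based on Braun and Clarke's phases. It is not the authors' Custom-GPT; the model name, prompt wording, and output format are assumptions, and a human analyst would still review and refine the output between steps, as the abstract emphasizes.

      # Illustrative sketch only -- not the authors' Custom-GPT.
      # Assumes the OpenAI Python client and an API key in the environment.
      from openai import OpenAI

      client = OpenAI()

      CODING_PROMPT = """You are assisting with thematic analysis following
      Braun and Clarke. For the interview excerpt below:
      1. Familiarise yourself with the text.
      2. Propose initial codes (short labels, each with a supporting quote).
      3. Suggest candidate themes that group those codes.
      Only use material present in the excerpt.

      Excerpt:
      {excerpt}
      """

      def code_excerpt(excerpt: str, model: str = "gpt-4o") -> str:
          """Return the model's proposed codes and candidate themes for one excerpt."""
          response = client.chat.completions.create(
              model=model,  # model name is an assumption
              messages=[{"role": "user",
                         "content": CODING_PROMPT.format(excerpt=excerpt)}],
              temperature=0,  # reduce the run-to-run variability noted in the abstract
          )
          return response.choices[0].message.content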
  2. Psychol Methods. 2025 Jul 10.
      Independent human double screening of titles and abstracts is a critical step to ensure the quality of systematic reviews and of the meta-analyses they contain. However, double screening is a resource-demanding procedure that slows the review process. To alleviate this issue, we evaluated the use of OpenAI's generative pretrained transformer (GPT) application programming interface (API) models as an alternative to human second screeners of titles and abstracts. We did so by developing a new benchmark scheme for interpreting the performance of automated screening tools against common human screening performance in high-quality systematic reviews and by conducting three large-scale experiments on three psychological systematic reviews with different levels of complexity. Across all experiments, we show that the GPT API models can perform on par with, and in some cases even better than, typical human screening performance in terms of detecting relevant studies, while also showing high exclusion performance. In addition, we introduce multiprompt screening, in which one concise prompt is written per inclusion/exclusion criterion in a review, and show that it can be a valuable tool to support screening in highly complex review settings. To consolidate future implementation, we develop a reproducible workflow and a set of tentative guidelines for when and when not to use GPT API models as independent second screeners of titles and abstracts. Moreover, we present the R package AIscreenR to standardize the suggested application. Our aim is ultimately to make GPT API models acceptable as independent second screeners within high-quality systematic reviews, such as the ones published in Psychological Bulletin. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
    DOI:  https://doi.org/10.1037/met0000769
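    Editor's note: the sketch below illustrates the multiprompt idea described above, i.e., asking one concise question per inclusion criterion and combining the answers into a screening decision. It is written in Python for illustration and is not the AIscreenR package (which is an R package); the criteria, model name, and decision rule are assumptions.

      # Illustrative multiprompt title/abstract screening -- not AIscreenR.
      from openai import OpenAI

      client = OpenAI()  # assumes an OpenAI API key in the environment

      CRITERIA = [  # hypothetical criteria for illustration
          "The study uses a randomized or quasi-experimental design.",
          "Participants are school-aged children or adolescents.",
          "An eligible psychological or educational outcome is reported.",
      ]

      PROMPT = """Inclusion criterion: {criterion}

      Title and abstract:
      {record}

      Does the record meet this criterion? Answer with one word: Yes, No, or Unclear."""

      def screen_record(record: str, model: str = "gpt-4o-mini") -> str:
          """Screen one title/abstract record with one prompt per criterion."""
          answers = []
          for criterion in CRITERIA:
              response = client.chat.completions.create(
                  model=model,
                  messages=[{"role": "user",
                             "content": PROMPT.format(criterion=criterion, record=record)}],
                  temperature=0,
              )
              answers.append(response.choices[0].message.content.strip().lower())
          # Conservative rule: exclude only if some criterion clearly fails;
          # "Unclear" keeps the record for human review.
          return "exclude" if any(a.startswith("no") for a in answers) else "include"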
  3. Nature. 2025 Jul;643(8071): 329-331
    Keywords:  Biodiversity; Machine learning; Medical research; Research data
    DOI:  https://doi.org/10.1038/d41586-025-02069-w
  4. JMIR Form Res. 2025 Jul 08;9: e72815
       Background: Qualitative research appraisal is crucial for ensuring credible findings but faces challenges due to human variability. Artificial intelligence (AI) models have the potential to enhance the efficiency and consistency of qualitative research assessments.
    Objective: This study aims to evaluate the performance of 5 AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, and Claude 3 Opus) in assessing the quality of qualitative research using 3 standardized tools: Critical Appraisal Skills Programme (CASP), Joanna Briggs Institute (JBI) checklist, and Evaluative Tools for Qualitative Studies (ETQS).
    Methods: AI-generated assessments of 3 peer-reviewed qualitative papers in health and physical activity-related research were analyzed. The study examined systematic affirmation bias, interrater reliability, and tool-dependent disagreements across the AI models. Sensitivity analysis was conducted to evaluate the impact of excluding specific models on agreement levels.
    Results: Results revealed a systematic affirmation bias across all AI models, with "Yes" rates ranging from 75.9% (145/191; Claude 3 Opus) to 85.4% (164/192; Claude 3.5). GPT-4 diverged significantly, showing lower agreement ("Yes": 115/192, 59.9%) and higher uncertainty ("Cannot tell": 69/192, 35.9%). Proprietary models (GPT-3.5 and Claude 3.5) demonstrated near-perfect alignment (Cramer V=0.891; P<.001), while open-source models showed greater variability. Interrater reliability varied by assessment tool, with CASP achieving the highest baseline consensus (Krippendorff α=0.653), followed by JBI (α=0.477), and ETQS scoring lowest (α=0.376). Sensitivity analysis revealed that excluding GPT-4 increased CASP agreement by 20% (α=0.784), while removing Sonar Huge improved JBI agreement by 18% (α=0.561). ETQS showed marginal improvements when excluding GPT-4 or Claude 3 Opus (+9%, α=0.409). Tool-dependent disagreements were evident, particularly in ETQS criteria, highlighting AI's current limitations in contextual interpretation.
    Conclusions: The findings demonstrate that AI models exhibit both promise and limitations as evaluators of qualitative research quality. While they enhance efficiency, AI models struggle with reaching consensus in areas requiring nuanced interpretation, particularly for contextual criteria. The study underscores the importance of hybrid frameworks that integrate AI scalability with human oversight, especially for contextual judgment. Future research should prioritize developing AI training protocols that emphasize qualitative epistemology, benchmarking AI performance against expert panels to validate accuracy thresholds, and establishing ethical guidelines for disclosing AI's role in systematic reviews. As qualitative methodologies evolve alongside AI capabilities, the path forward lies in collaborative human-AI workflows that leverage AI's efficiency while preserving human expertise for interpretive tasks.
    Keywords:  CASP checklist; Critical Appraisal Skills Programme; ETQS; Evaluative Tools for Qualitative Studies; JBI checklist; Joanna Briggs Institute; affirmation bias; artificial intelligence; human-AI collaboration; interrater agreement; large language models; qualitative research appraisal; systematic reviews
    DOI:  https://doi.org/10.2196/72815
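    Editor's note: the agreement statistics quoted above can be reproduced in outline with a few lines of Python, shown below for Krippendorff's alpha over nominal ratings. The ratings matrix is invented for illustration, not the study's data, and the krippendorff package is just one of several implementations.

      # Krippendorff's alpha for nominal ratings ("Yes"/"No"/"Cannot tell")
      # given by several models to the same checklist items.
      # The ratings below are invented placeholders (pip install krippendorff).
      import numpy as np
      import krippendorff

      CODES = {"Yes": 0, "No": 1, "Cannot tell": 2}

      ratings = [  # rows = raters (AI models), columns = checklist items
          ["Yes", "Yes", "No", "Cannot tell", "Yes"],
          ["Yes", "Yes", "No", "Yes", "Yes"],
          ["Yes", "No", "No", "Cannot tell", "Cannot tell"],
      ]
      reliability_data = np.array(
          [[CODES[answer] for answer in rater] for rater in ratings], dtype=float
      )

      alpha = krippendorff.alpha(
          reliability_data=reliability_data,
          level_of_measurement="nominal",
      )
      print(f"Krippendorff's alpha = {alpha:.3f}")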
  5. Cochrane Evid Synth Methods. 2025 Jul;3(4): e70035
      Systematic review findings are typically disseminated via static outputs, such as scientific manuscripts, which can limit their accessibility and usability for diverse audiences. Interactive data dashboards transform systematic review data into dynamic, user-friendly visualizations, allowing deeper engagement with evidence synthesis findings. We propose a workflow for creating interactive dashboards to display evidence synthesis results, comprising three key phases: planning, development, and deployment. Planning involves defining the dashboard objectives and key audiences, selecting the appropriate software (e.g., Tableau or R Shiny), and preparing the data. Development includes designing a user-friendly interface and specifying interactive elements. Lastly, deployment focuses on making the dashboard available to users and conducting user testing. Throughout all phases, we emphasize seeking and incorporating interest-holder input and aligning dashboards with the intended audience's needs. To demonstrate this workflow, we provide two examples from previous systematic reviews. The first dashboard, created in Tableau, presents findings from a meta-analysis to support a U.S. Preventive Services Task Force recommendation on lipid disorder screening in children, while the second utilizes R Shiny to display data from a scoping review on the 4-day school week among K-12 students in the U.S. Both dashboards incorporate interactive elements to present complex evidence tailored to different interest-holders, including non-research audiences. Interactive dashboards can enhance the utility of evidence syntheses by providing a user-friendly tool for interest-holders to explore data relevant to their specific needs. This workflow can be adapted to create interactive dashboards in flexible formats to increase the use and accessibility of systematic review findings.
    Keywords:  R; dashboard; evidence synthesis; shiny; tableau
    DOI:  https://doi.org/10.1002/cesm.70035
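    Editor's note: the paper's own examples use Tableau and R Shiny; as a rough Python analogue of the planning, development, and deployment phases, the Streamlit sketch below filters an invented summary table interactively. All column names and values are placeholders, not data from the reviews described above.

      # Minimal interactive dashboard sketch (Python/Streamlit stand-in for
      # the Tableau and R Shiny examples). Run with: streamlit run dashboard.py
      import pandas as pd
      import streamlit as st

      # Planning/data preparation: a tiny, invented summary table standing in
      # for extracted systematic-review results.
      data = pd.DataFrame({
          "outcome": ["Outcome A", "Outcome A", "Outcome B", "Outcome B"],
          "subgroup": ["Children", "Adolescents", "Children", "Adolescents"],
          "effect_size": [0.21, 0.35, -0.05, 0.10],
          "n_studies": [6, 4, 3, 5],
      })

      # Development: a simple interface with one interactive filter.
      st.title("Evidence synthesis dashboard (illustrative sketch)")
      outcome = st.selectbox("Outcome", sorted(data["outcome"].unique()))
      filtered = data[data["outcome"] == outcome]

      st.dataframe(filtered)
      st.bar_chart(filtered.set_index("subgroup")["effect_size"])

      # Deployment: host the script (e.g., on an internal server) so that
      # interest-holders can explore the data without touching the analysis code.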