bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-03-15
eleven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Prosthet Dent. 2026 Mar 12. pii: S0022-3913(26)00090-9. [Epub ahead of print]
       STATEMENT OF PROBLEM: Systematic reviews (SRs) are time-consuming and resource-intensive processes. Whether large language models (LLMs) can improve the process is unclear.
    PURPOSE: The purpose of this study was to evaluate the accuracy and reliability of 4 LLMs (GPT-4, Gemini, Claude, and Elicit) in performing SR tasks (full-text screening, data extraction, and risk of bias assessment) at 3 different sequential periods of time (0, 15, and 30 days).
    MATERIAL AND METHODS: A comprehensive systematic search was conducted across 5 databases in December 2024, with 59 articles evaluated for screening (2 used for pilot) and 31 for data extraction (2 used for pilot). A 3-pronged prompting strategy was used, including persona-based initialization, few-shot learning, and structured population, intervention, control, outcome (PICO) criteria. Performance was assessed through 3 repeated evaluations at 2-week intervals by measuring accuracy and reliability using standard metrics (accuracy, precision, F1-score, sensitivity, and specificity) against expert assessments, data extraction quality on a 0 to 5 scale, and risk of bias agreement via the Cohen kappa, with statistical analysis using Kruskal-Wallis and Dunn post-hoc tests (α=.05).
    RESULTS: In full-text screening, Claude achieved the highest sensitivity at 97%, while Claude and Elicit both showed strong overall performance with 86% accuracy and 87% F1-scores. All models maintained sensitivity above 90%. For data extraction, GPT-4 consistently performed best with median scores of 5.0, while Claude and Gemini showed similar capabilities. Significant differences only appeared in labeling and modeling tasks during Week 1 (P=.04). Risk of bias assessment agreement with experts varied from 55% to 90% across different criteria.
    CONCLUSIONS: LLMs show potential for SR efficiency (especially for data extraction) but require human oversight because of variable performance across models and tasks.
    DOI:  https://doi.org/10.1016/j.prosdent.2026.02.009
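    The screening metrics reported above are standard binary-classification quantities. As a rough illustration (not the study's code or data), a minimal Python sketch for computing them from paired expert and LLM include/exclude decisions could look like this:

      def screening_metrics(expert, llm):
          # expert and llm are parallel lists of 1 (include) / 0 (exclude) decisions
          tp = sum(1 for e, m in zip(expert, llm) if e == 1 and m == 1)
          tn = sum(1 for e, m in zip(expert, llm) if e == 0 and m == 0)
          fp = sum(1 for e, m in zip(expert, llm) if e == 0 and m == 1)
          fn = sum(1 for e, m in zip(expert, llm) if e == 1 and m == 0)
          sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on includes
          specificity = tn / (tn + fp) if tn + fp else 0.0
          precision = tp / (tp + fp) if tp + fp else 0.0
          accuracy = (tp + tn) / len(expert)
          f1 = (2 * precision * sensitivity / (precision + sensitivity)
                if precision + sensitivity else 0.0)
          return {"sensitivity": sensitivity, "specificity": specificity,
                  "precision": precision, "accuracy": accuracy, "f1": f1}

      expert = [1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical expert decisions
      llm    = [1, 1, 0, 0, 0, 1, 1, 0]   # hypothetical LLM decisions
      print(screening_metrics(expert, llm))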
  2. Scand J Public Health. 2026 Mar 13. 14034948261423410
       AIM: Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). There are several AI options to assist with title/abstract screening; however, options for full text screening are limited. The objective of this study was to evaluate the performance of a custom generative pretrained transformer (cGPT) for full text screening.
    METHODS: A cGPT powered by OpenAI's ChatGPT-4o was tested with subsets of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT as an autonomous and an assistant reviewer. Cohen's kappa was used to assess interrater agreement.
    RESULTS: For the inclusion/exclusion decision, the human-human kappa scores ranged from 0.87 to 0.96, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.59 to 0.67) and assistant cGPT-human pairings (0.62 to 0.72). For exclusion reason classification, the human-human kappa scores ranged from 0.71 to 0.78, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.47 to 0.53) and assistant cGPT-human pairings (0.52 to 0.63).
    CONCLUSIONS: The assistant cGPT outperformed the autonomous cGPT. An assistant cGPT could speed up systematic reviewing in a sufficiently reliable manner; however, further research is needed to establish standardized thresholds for practical use. Improved speed of systematic reviewing has implications for directing timely public health policy decisions.
    Keywords:  OpenAI; article filtration; custom GPT; full text screening; large language models; systematic review
    DOI:  https://doi.org/10.1177/14034948261423410
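    Both this entry and the previous one report chance-corrected agreement via Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's marginal frequencies. A minimal Python sketch with illustrative (not study) labels:

      from collections import Counter

      def cohens_kappa(rater_a, rater_b):
          n = len(rater_a)
          p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n           # observed agreement
          marg_a, marg_b = Counter(rater_a), Counter(rater_b)
          categories = set(rater_a) | set(rater_b)
          p_e = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)  # chance agreement
          return (p_o - p_e) / (1 - p_e)

      human = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
      cgpt  = ["include", "exclude", "include", "include", "exclude", "exclude"]
      print(round(cohens_kappa(human, cgpt), 2))   # 0.67 on these illustrative labels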
  3. J Am Med Inform Assoc. 2026 Mar 09. pii: ocag014. [Epub ahead of print]
       OBJECTIVE: Automated literature screening in biomedical research is often hindered by domain shifts and scarcity of labeled data, which limit model accuracy and generalizability. While large language models (LLMs) perform well in zero-shot settings, they often fail to capture complex, domain-specific reasoning patterns. To address this limitation, this study investigates whether an interactive, weakly supervised learning framework combining GPT (generative pre-trained transformer)'s fine-tuning adaptability with DeepSeek's reasoning capabilities can improve literature screening performance across biomedical domains.
    MATERIALS AND METHODS: We developed an active learning framework that leverages model disagreement between GPT-4o and DeepSeek to improve literature screening performance. This process began with a labeled corpus of 6331 articles on large language models, from which a model disagreement analysis was performed to identify cases where GPT-4o misclassified and DeepSeek produced correct predictions. Three GPT variants (GPT-4o, GPT-4o-mini, and GPT-4.1-nano) were fine-tuned under standard supervised learning settings using these disagreement-based samples. Fine-tuning prompts incorporated classification labels and, when available, rationale traces generated by DeepSeek to provide reasoning-augmented weak supervision. Model performance was evaluated on an independent benchmark set of 291 annotated articles across 10 topic queries in cancer immunotherapy and LLMs in medicine, using standard evaluation metrics, with recall as the primary measure.
    RESULTS: Fine-tuning GPT models using disagreement-based examples significantly improved performance. GPT-4o-mini achieved the best overall results after fine-tuning, with the highest F1 score (0.93, P < .001) and recall (0.95, P < .001). Across the biomedical topics, fine-tuned models consistently outperformed their zero-shot counterparts without increasing reviewer workload.
    DISCUSSION: These findings demonstrate the effectiveness of disagreement-driven active learning in enhancing GPT-based biomedical literature screening. Lightweight models like GPT-4o-mini benefit most from targeted, reasoning-enriched training, highlighting their suitability for scalable deployment.
    CONCLUSION: This study introduces an interactive active learning framework that leverages fine-tuned LLMs with reasoning capabilities to enhance literature screening. The approach offers a scalable solution for more efficient and reliable information retrieval in systematic reviews.
    Keywords:  DeepSeek; active learning; generative pre-trained transformer (GPT); literature screening; reasoning; supervised learning
    DOI:  https://doi.org/10.1093/jamia/ocag014
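    The core of the framework described above is selecting fine-tuning examples where GPT-4o was wrong but DeepSeek was right, paired with DeepSeek's rationale as weak supervision. A minimal sketch of that selection step, with hypothetical field names ("gpt4o_label", "deepseek_label", "deepseek_rationale") standing in for whatever schema the authors actually used:

      def select_disagreement_examples(corpus):
          # corpus: list of dicts, one per screened article, with model and gold labels
          selected = []
          for article in corpus:
              gpt_wrong = article["gpt4o_label"] != article["gold_label"]
              deepseek_right = article["deepseek_label"] == article["gold_label"]
              if gpt_wrong and deepseek_right:
                  selected.append({
                      "prompt": article["title_abstract"],
                      "label": article["gold_label"],
                      # attach the rationale trace when DeepSeek produced one
                      "rationale": article.get("deepseek_rationale", ""),
                  })
          return selected

    The selected records would then be formatted into supervised fine-tuning prompts containing the label and, when available, the rationale.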
  4. J Clin Epidemiol. 2026 Mar 06. pii: S0895-4356(26)00086-7. [Epub ahead of print] 112211
       OBJECTIVE: To examine how continually updated, living evidence and gap maps (L-EGMs) with an online presence report planned update schedules, retirement plans, living status, use of automation across review stages, and the methodological guidance cited to support their conduct and reporting.
    STUDY DESIGN AND SETTING: A cross-sectoral scoping review of digital L-EGM interfaces, which act as foundational support tools for decision-makers by providing a visual and interactive summary of all available evidence, as well as evidence gaps. Targeted searches were conducted in Google search engine (June 2022 and June/September 2025), Web of Science Core Collection and MEDLINE (January 2026), supplemented by records from a methodology review and additional EGMs found through supporting documentation of included maps, or known to the research team.
    RESULTS: Forty-four L-EGMs, predominantly health sector-related, met the eligibility criteria. Half of the digital interfaces cited big picture review guidance in their associated documentation, 11% cited (living) systematic review guidance, and 39% did not cite any overarching synthesis typology or living evidence synthesis guidance. 57% reported a fixed update schedule with planned update frequencies varying from daily to every two years (median: 1 month), but most did not clarify whether schedules differed across update stages. Only 14% reported retirement plans and 39% indicated whether the L-EGM is still living. Automation or semi-automation was reported in 70% of L-EGMs. 59% used it for searching, 45% for screening and 30% for coding, with many reporting automation across multiple stages. In addition, three L-EGMs used a natural language processing-based risk of bias assessment tool. 25% of L-EGMs reported context-specific automation performance metrics or validation approaches.
    CONCLUSIONS: We identified a growing body of digital, living EGMs that use automation, especially machine learning, and that could be systematically reviewed. Reviewers and methodologists should further assess the potential for automating EGMs, their actual living mode parameters, methodological changes, and how these are reported across web-based versus conventional outputs, working toward a consensus on map-specific guidance. Until such guidance is established, authors of living EGMs can follow existing recommendations for responsible automation, living systematic reviews and other living evidence syntheses.
    Keywords:  Artificial intelligence; Evidence map; Living evidence synthesis; Living systematic review; Machine learning; Systematic map
    DOI:  https://doi.org/10.1016/j.jclinepi.2026.112211
  5. J Nurs Scholarsh. 2026 Mar;58(2): e70076
       INTRODUCTION: Systematic reviews (SRs) require comprehensive, reproducible searches, yet developing search strategies is resource-intensive and demands specialized expertise. Generative AI (GAI) offers potential to streamline this process, but empirical evaluations of GAI-assisted SR searching remain scarce. The objectives of this study are to demonstrate a step-by-step process for developing a custom ChatGPT-based chatbot to support SR search strategy development and to evaluate its performance.
    DESIGN: A cross-sectional evaluation study.
    METHODS: We used ChatGPT-4.0 to create a chatbot designed to mimic a medical librarian, generating PICO-informed searches. Its knowledge base was augmented with two methodological references. After pilot testing, we refined its instructions. For evaluation, we randomly sampled 50 Cochrane SRs published in 2024. Standardized P-I-O prompts produced database-ready queries for PubMed and Embase. The primary outcome was per-review success rate, summarized by median and interquartile range. A sensitivity analysis was conducted.
    RESULTS: Pilot testing achieved a retrieval rate of 41/49 (83.7%). In the main sample (1169 studies; median 13.5 studies per SR), the chatbot identified a median of 67.4% of included studies (IQR: 43.1%-88.4%). When limited to indexed studies (n = 1114), retrieval rose to 72.0% (IQR: 46.0%-92.5%). Lower performance was observed when outcomes were absent from the abstracts or interventions had many lexical variants.
    CONCLUSIONS: A GAI-based chatbot can rapidly generate SR searches (~67%-72% identification), serving as a useful starting point but not a replacement for expert-led approaches. Integration of librarian expertise, structured prompts, and controlled vocabularies may improve performance. Further benchmarking and transparent reporting are needed to guide adoption.
    Keywords:  database searching; generative artificial intelligence; large language model; systematic review
    DOI:  https://doi.org/10.1111/jnu.70076
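    The primary outcome above is a per-review retrieval rate: the share of a review's included studies that the chatbot-generated search finds, summarized by median and IQR. A minimal Python sketch with illustrative numbers (not the study's data):

      import statistics

      def retrieval_rate(included_ids, retrieved_ids):
          included = set(included_ids)
          return len(included & set(retrieved_ids)) / len(included)

      # one hypothetical per-review rate for each sampled systematic review
      rates = [0.88, 0.43, 0.67, 0.92, 0.55, 0.71, 0.30, 0.81]
      q1, med, q3 = statistics.quantiles(rates, n=4)
      print(f"median {med:.1%} (IQR {q1:.1%}-{q3:.1%})")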
  6. Qual Health Res. 2026 Mar 07. 10497323261425889
      Artificial intelligence (AI) technologies are rapidly expanding in qualitative health research and often promise improved efficiency or novel discoveries. However, this promise has yet to be realized, and serious ethical issues emerge, ranging from the use of AI videoconferencing technologies to conduct interviews to AI transcription services and AI-augmented qualitative analysis tools. These ethical dilemmas are not always obvious and require careful consideration of the ramifications of integrating these technologies into the research process. These concerns are relevant to researchers at all stages of experience, from emerging scholars to more practiced researchers, but are particularly significant in training new scholars who are early adopters of AI technologies. To trace the ethical issues surrounding AI in the practice of qualitative health research, we map the specific values of autonomy, privacy, validity, and equity to highlight decision points and provide a framework for navigating ethical use of AI tools.
    Keywords:  artificial intelligence; ethics; qualitative research; technology
    DOI:  https://doi.org/10.1177/10497323261425889
  7. Knee Surg Sports Traumatol Arthrosc. 2026 Mar 11.
       PURPOSE: The purpose of this study was to evaluate whether Chat Generative Pre-trained Transformer (ChatGPT; Version 5.1) can reproduce frequentist meta-analytic calculations with an accuracy comparable to established statistical software in orthopaedic research.
    METHODS: In this methodological comparison study, data from two previously published orthopaedic meta-analyses with identical statistical architectures were used as reference standards. Between-study variance (τ²) was estimated using the Sidik-Jonkman method, and uncertainty was quantified using the Hartung-Knapp adjustment for the random-effects models, while common-effect models assume τ² = 0. Original data extraction tables were provided to ChatGPT-5.1, which was instructed to perform the same analyses. ChatGPT-generated pooled mean differences, confidence intervals and heterogeneity statistics (I², τ², p values) were compared with verified reference results obtained using the meta and metafor packages in R.
    RESULTS: Across seven evaluated outcomes, ChatGPT-5.1 reproduced the direction of effects in all cases. Deviations compared with reference meta-analyses were classified as minor in three outcomes (43%), moderate in one outcome (14%) and major in three outcomes (43%). Agreement was highest in low-heterogeneity settings, whereas substantial deviations occurred in outcomes with pronounced between-study heterogeneity, particularly under random-effects models.
    CONCLUSION: ChatGPT-5.1 demonstrates emerging capability to approximate frequentist meta-analytic calculations, particularly in low-heterogeneity settings. However, its tendency to underestimate between-study variability and to deviate in complex random-effects scenarios limits its reliability as a standalone tool. At present, large language models may support exploratory analyses but cannot fully replace dedicated statistical software for meta-analyses in orthopaedic research.
    LEVEL OF EVIDENCE: Level III.
    Keywords:  artificial intelligence; heterogeneity; meta-research; orthopaedic research; statistical validation
    DOI:  https://doi.org/10.1002/ksa.70379
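    The calculations being reproduced are standard frequentist random-effects formulas: a Sidik-Jonkman estimate of τ², inverse-variance pooling, and a Hartung-Knapp confidence interval on a t distribution with k-1 degrees of freedom. A minimal Python sketch under those assumptions, with illustrative inputs (the study itself used the meta and metafor packages in R):

      import numpy as np
      from scipy import stats

      def sj_tau2(yi, vi):
          # Sidik-Jonkman estimator of between-study variance
          k = len(yi)
          tau2_0 = np.sum((yi - yi.mean()) ** 2) / k            # crude initial estimate
          w = 1.0 / (vi / tau2_0 + 1.0)
          mu = np.sum(w * yi) / np.sum(w)
          return np.sum(w * (yi - mu) ** 2) / (k - 1)

      def random_effects_hk(yi, sei, alpha=0.05):
          yi, vi = np.asarray(yi, float), np.asarray(sei, float) ** 2
          k = len(yi)
          tau2 = sj_tau2(yi, vi)
          w = 1.0 / (vi + tau2)                                 # random-effects weights
          mu = np.sum(w * yi) / np.sum(w)
          var_hk = np.sum(w * (yi - mu) ** 2) / ((k - 1) * np.sum(w))   # Hartung-Knapp variance
          half = stats.t.ppf(1 - alpha / 2, df=k - 1) * np.sqrt(var_hk)
          mu_fixed = np.sum(yi / vi) / np.sum(1 / vi)
          q = np.sum((yi - mu_fixed) ** 2 / vi)                 # Cochran's Q
          i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
          return {"pooled": mu, "ci": (mu - half, mu + half), "tau2": tau2, "I2": i2}

      yi = [1.8, 2.4, 0.9, 1.5, 2.1]     # illustrative mean differences
      sei = [0.5, 0.7, 0.4, 0.6, 0.8]    # illustrative standard errors
      print(random_effects_hk(yi, sei))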
  8. PLOS Digit Health. 2026 Mar;5(3): e0001263
      Patient experiences and perspectives are essential for shaping patient-centered healthcare. While large language models (LLMs) in healthcare are typically applied to specific clinical or patient-facing tasks, they have not been used for qualitative patient preference assessment, which often relies on thematic analysis to understand patient views expressed in interviews or focus groups. LLMs show initial promise for performing inductive thematic analysis of healthcare interview or focus group transcripts, yet no empirical studies have investigated LLMs to facilitate qualitative patient preference assessment. We employed the open-source Hermes-3-Llama-3.1-70B LLM to perform inductive thematic analysis on focus group transcripts from a previously published qualitative patient preference assessment study using three optimized prompt frameworks, and evaluated semantic similarity of LLM-generated themes against human-analyzed themes using the Sentence-T5-XXL language embedding model. Sentence-level theme similarity was assessed using Jaccard similarity coefficients (0-1 range), computing coefficient scores across a broad range of discrete cosine similarity thresholds. We further evaluated LLM themes for similarity in lexical diversity and reading grade-level metrics and benchmarked semantic similarity results against published similarity thresholds previously used with qualitative healthcare data. All prompt frameworks generated themes with median Jaccard similarity coefficients with human-analyzed themes between 0.46 and 0.64, indicating moderate semantic overlap. Our best-performing framework, which was instructed to pursue thematic saturation, scored closest to human-analyzed themes on all reading grade-level metrics and demonstrated 12% higher semantic overlap with human-analyzed themes compared with published benchmarks. Our worst-performing framework produced themes with moderate semantic overlap and hallucinated findings not identified in the human-analyzed themes. We demonstrate that LLMs can perform inductive thematic analysis of qualitative patient preference data, producing themes substantively similar in content and style to human-analyzed themes when augmented with sufficient domain-specific context. While LLMs may augment thematic analysis, the contextual nature of qualitative analysis remains a challenge requiring collaborative LLM frameworks integrating human expertise.
    DOI:  https://doi.org/10.1371/journal.pdig.0001263
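    The comparison above rests on matching LLM theme sentences to human theme sentences by embedding cosine similarity and computing a Jaccard-style overlap at each threshold. A minimal sketch of one plausible operationalization (the paper's exact matching rule may differ), assuming embeddings are already produced by a sentence-embedding model; the arrays below are random stand-ins, not real theme embeddings:

      import numpy as np

      def cosine_matrix(a, b):
          a = a / np.linalg.norm(a, axis=1, keepdims=True)
          b = b / np.linalg.norm(b, axis=1, keepdims=True)
          return a @ b.T

      def jaccard_at_threshold(llm_emb, human_emb, threshold):
          sims = cosine_matrix(llm_emb, human_emb)
          covered_human = int((sims.max(axis=0) >= threshold).sum())   # human sentences with an LLM match
          unmatched_llm = int((sims.max(axis=1) < threshold).sum())    # LLM sentences with no human match
          return covered_human / (human_emb.shape[0] + unmatched_llm)

      rng = np.random.default_rng(0)
      llm_emb = rng.normal(size=(12, 768))                             # stand-in "LLM theme" embeddings
      human_emb = llm_emb[:10] + 0.5 * rng.normal(size=(10, 768))      # partially overlapping "human" set
      for t in (0.6, 0.8, 0.95):
          print(t, round(jaccard_at_threshold(llm_emb, human_emb, t), 2))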
  9. Drug Saf. 2026 Mar 11.
      Many high-stakes artificial intelligence (AI) applications target low-prevalence events, where apparent accuracy can conceal limited real-world value. Relevant AI models range from expert-defined rules and traditional machine learning to generative large language models (LLMs) constrained for classification. As the effort and expertise required to develop modern AI decrease, there is a risk that organizations devote too little time to understanding their limitations and sources of error. We outline key dimensions for critical appraisal of AI in rare-event recognition, including problem framing and test set design, prevalence-aware statistical evaluation, robustness assessment, and integration into human workflows. In addition, we propose an approach to structured case-level examination (SCLE), to complement statistical performance evaluation, and a set of considerations to guide procurement or development of AI models for rare-event recognition. We instantiate the framework in pharmacovigilance, drawing on three studies: rule-based retrieval of pregnancy-related reports, duplicate detection combining machine learning with probabilistic record linkage, and automated redaction of person names using an LLM. We highlight pitfalls specific to the rare-event setting, including optimism from unrealistic class balance and a lack of difficult positive controls in test sets, and show how cost-sensitive targets align model performance with operational value. While grounded in pharmacovigilance practice, the principles generalize to domains where positives are scarce and error costs may be asymmetric.
    DOI:  https://doi.org/10.1007/s40264-026-01649-7
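    The prevalence point above can be made concrete with the standard relationship PPV = sens*prev / (sens*prev + (1 - spec)*(1 - prev)): holding sensitivity and specificity fixed, the positive predictive value collapses as positives become rare. A minimal Python sketch with illustrative numbers:

      def ppv(sensitivity, specificity, prevalence):
          # positive predictive value from sensitivity, specificity and prevalence
          tp = sensitivity * prevalence
          fp = (1 - specificity) * (1 - prevalence)
          return tp / (tp + fp)

      for prev in (0.5, 0.05, 0.001):        # same classifier, falling prevalence
          print(f"prevalence {prev:>6}: PPV = {ppv(0.95, 0.95, prev):.3f}")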
  10. Science. 2026 Mar 12. 391(6790): 1090-1091
      New tests gauge whether large language models can use their troves of knowledge to actually make discoveries.
    DOI:  https://doi.org/10.1126/science.aeh1091
  11. Nature. 2026 Mar;651(8105): 550
      
    Keywords:  Machine learning; Mathematics and computing; Society; Technology
    DOI:  https://doi.org/10.1038/d41586-026-00775-7