bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-03-29
eleven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. J Clin Epidemiol. 2026 Mar 20. pii: S0895-4356(26)00123-X. [Epub ahead of print] 112248
     OBJECTIVE: To determine the accuracy of using one reviewer versus two to screen titles and abstracts for progressing records to full-text screening, in three reviews of the effectiveness of interventions for chronic primary low back pain. Secondary objectives included computing inter-rater reliability, describing misclassified records and reviewer performance across reviews, and conducting sensitivity analyses limited to English records and to falsely excluded records.
    DESIGN AND SETTING: One reviewer screened titles and abstracts using standardized eligibility criteria, and the results were compared to consensus screening by two reviewers. We computed sensitivity, specificity, and positive (PPV) and negative predictive values (NPV) with 95% confidence intervals (CIs), using the two-reviewer consensus as the comparator. We calculated inter-rater reliability, the proportion of misclassified citations, and the reasons for misclassification. We conducted sensitivity analyses by restricting the analysis to English records.
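    As a concrete illustration of the measures described above, here is a minimal Python sketch (hypothetical counts, not the study's data or code) that computes sensitivity, specificity, PPV, and NPV with Wald 95% CIs, plus Cohen's kappa, from a single 2x2 table of single-reviewer decisions against the two-reviewer consensus:
```python
# Illustrative sketch with hypothetical counts; not the authors' code.
import math

def screening_metrics(tp, fp, fn, tn):
    """Accuracy of single-reviewer screening vs. dual-reviewer consensus.
    tp: both include; fp: single includes, consensus excludes;
    fn: single excludes, consensus includes; tn: both exclude."""
    def prop_ci(num, den):
        p = num / den
        se = math.sqrt(p * (1 - p) / den)  # Wald interval
        return round(p, 3), round(max(0.0, p - 1.96 * se), 3), round(min(1.0, p + 1.96 * se), 3)

    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n  # observed agreement
    # chance agreement from each rater's marginal totals
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": prop_ci(tp, tp + fn),
        "specificity": prop_ci(tn, tn + fp),
        "PPV": prop_ci(tp, tp + fp),
        "NPV": prop_ci(tn, tn + fn),
        "kappa": round((p_obs - p_exp) / (1 - p_exp), 2),
    }

# hypothetical counts for one review
print(screening_metrics(tp=53, fp=60, fn=35, tn=750))
```
    As the conclusions below note, kappa alone can mislead when the include/exclude classes are imbalanced, which is why the accuracy measures are reported alongside it.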
    RESULTS: The sensitivity of one reviewer ranged from 48.8% to 66.3% and the specificity from 88.0% to 93.3%. The PPV ranged from 40.6% to 51.8% and the NPV from 93.6% to 95.0%. Inter-rater reliability ranged from 0.39 to 0.50. Between 5.0% and 6.3% of records were misclassified as false negatives by a single reviewer. Reasons for misclassification were primarily related to the assessment of relevant interventions and comparators, such as whether the intervention could be isolated. Our sensitivity analysis showed that screening English records only, compared to all languages, improved sensitivity and PPV, with no change in specificity or NPV.
    CONCLUSIONS: Using a single reviewer to screen titles and abstracts may lead to the exclusion of eligible records in rapid reviews of the literature. We caution against using kappa alone as an indicator of screening quality, as it is influenced by classification imbalances, and we suggest also reporting accuracy measures to describe potential differences between reviewers' screening classifications.
    PLAIN LANGUAGE SUMMARY: This study investigated whether one reviewer can accurately screen research articles for inclusion in a systematic review, compared to the usual approach of having two people screen. This was tested in three reviews of common treatments for chronic primary low back pain. The single reviewer who screened titles and abstracts was likely to miss relevant articles that were identified as relevant by two reviewers. However, the single reviewer was good at correctly excluding irrelevant articles. Between 5% and 6% of eligible articles were incorrectly excluded by the single reviewer. Most mistakes happened when the single reviewer was uncertain about a treatment's eligibility. Limiting screening to English-language articles slightly improved screening accuracy but did not eliminate the risk of missing relevant research. Because AI was used to translate Chinese studies into English, further research on the usefulness of this approach is warranted. In summary, restricting screening to one reviewer may save time, but it increases the probability that important evidence will be overlooked. Researchers should be cautious about relying on a single reviewer and should use additional quality assurance to limit bias.
    Keywords:  accuracy; meta-research; rapid review; reliability; screening; systematic review
    DOI:  https://doi.org/10.1016/j.jclinepi.2026.112248
  2. Res Synth Methods. 2026 Mar 24. 1-14
      The exponential growth of scientific literature poses increasing challenges for evidence synthesis. Systematic reviews (SRs) usually rely on keyword-based database searches, which are limited by inconsistent terminology and indexing delays. Citation searching (identifying studies that cite or are cited by known relevant articles) offers a complementary route to uncover additional evidence but remains poorly automated and poorly integrated into screening workflows. We developed BibliZap, an open-source, fully automated citation-searching tool built on Lens.org data, performing multi-level forward and backward citation searches with relevance-based ranking. Its performance was evaluated across 66 published SRs, comparing five approaches: (1) PubMed-only searches; (2) PubMed followed by BibliZap restricted to the top 500 ranked results; (3) PubMed followed by full BibliZap screening; and (4-5) two exploratory early-stop strategies in which BibliZap was initiated after identifying the first or the first three relevant PubMed records. The primary outcome was sensitivity, with secondary assessments of screening workload and precision. When used after PubMed screening, BibliZap increased mean sensitivity from 75% to 97%, achieving complete recall in over half of the reviews. Screening only the top 500 outputs still allowed over 90% of reviews to reach or exceed 80% recall. BibliZap recovered a median of three additional included articles per review not retrieved by PubMed, while adding a median of 6,450 additional records. Citation searching via BibliZap enhances the completeness of evidence retrieval in SRs based on restricted database searches and supports transparent, scalable workflows adaptable to rapid and exploratory review contexts.
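    The multi-level forward and backward expansion the abstract describes can be pictured with a short sketch; `fetch_citing` and `fetch_cited` are hypothetical stand-ins for a citation-data backend such as Lens.org, and the link-count ranking is one plausible proxy for BibliZap's relevance-based ranking, not its actual algorithm:
```python
# Illustrative sketch only; not BibliZap's implementation.
from collections import Counter

def citation_search(seeds, fetch_citing, fetch_cited, levels=2):
    """Multi-level forward (citing) + backward (cited) expansion from a seed
    set, ranking candidates by how many citation links tie them to the
    expanding graph. fetch_* are placeholder callables: pid -> list of pids."""
    seeds = set(seeds)
    frontier, seen = set(seeds), set(seeds)
    score = Counter()
    for _ in range(levels):
        nxt = set()
        for pid in frontier:
            for neighbour in fetch_citing(pid) + fetch_cited(pid):
                score[neighbour] += 1  # one more link into the seed graph
                if neighbour not in seen:
                    nxt.add(neighbour)
        seen |= nxt
        frontier = nxt
    # best-ranked first; in the study, screening only the top 500 ranked
    # outputs already recovered most of the records PubMed had missed
    return [(pid, s) for pid, s in score.most_common() if pid not in seeds]
```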
    Keywords:  automation; bibliographic search; citation searching; evidence synthesis; open-source tool; systematic review
    DOI:  https://doi.org/10.1017/rsm.2026.10079
  3. BMC Med Res Methodol. 2026 Mar 28.
      
    Keywords:  Artificial Intelligence; Concerns; Evidence synthesis; Horizon Scanning; Systematic Reviews
    DOI:  https://doi.org/10.1186/s12874-026-02844-x
  4. Bioengineering (Basel). 2026 Mar 20. pii: 365. [Epub ahead of print]13(3):
      Background: Large Language Models (LLMs) are reshaping medical research workflows. Objective: This narrative review synthesizes evidence on LLM applications across systematic reviews, scientific writing, and clinical research. Methods: We reviewed literature from 2023-2025 examining LLM applications in medical research, identified through PubMed, Scopus, Web of Science, arXiv, medRxiv, and Google Scholar. Studies reporting empirical findings, methodological evaluations, or systematic analyses of LLM applications were included; editorials and commentaries without empirical data were excluded. Results: In systematic reviews, LLMs achieve 80-94% data extraction accuracy and a 40% reduction in screening workload, but show only slight-to-moderate agreement (κ = 0.16-0.43) in risk-of-bias assessment. In scientific writing, hallucination rates of 47-55% for fabricated references and over 90% prevalence of demographic bias require rigorous verification. For clinical research, LLMs assist with statistical coding and protocol development but require human validation. Critically, excessive reliance on automated tools may cause cognitive offloading that compromises analytical capabilities. Conclusions: LLMs are powerful but unstable tools requiring constant verification. Success depends on maintaining human-in-the-loop approaches that preserve critical thinking while leveraging AI efficiency.
    Keywords:  ChatGPT; GPT-4; artificial intelligence; evidence synthesis; large language models; medical research; prompt engineering; systematic review
    DOI:  https://doi.org/10.3390/bioengineering13030365
  5. Reports (MDPI). 2026 Mar 18. pii: 90. [Epub ahead of print]9(1):
      Background/Objectives: This study developed and evaluated a BERT-assisted literature screening workflow to support meta-analyses of postradiotherapy complications in nasopharyngeal carcinoma patients. The aim was to automate key screening steps to improve downstream screening efficiency and consistency while minimizing time and bias during manual review. Materials and Methods: A bidirectional encoder representations from transformers (BERT) model was integrated into a standard systematic review pipeline for studies on postradiotherapy complications in nasopharyngeal carcinoma. The workflow combined automated BERT-based classification with manual verification and followed PRISMA and PICOS guidelines for literature identification, screening, and eligibility assessment. Model training involved hyperparameter tuning and comparison of different optimizers to maximize screening performance against a manually curated reference set, with particular attention to discrimination (AUC) and processing time. Results: From an initial corpus of 6496 records, the combined automated and manual workflow identified 23 eligible studies for meta-analysis. The included studies showed substantial heterogeneity (I² = 86.85%), supporting the use of a random-effects model to pool outcomes. The BERT model optimized with an Adagrad optimizer achieved an AUC of 0.77 for relevant-study classification and reduced screening time to 1142 s. As a downstream application, a meta-analysis of the 23 included studies was conducted, in which a random forest model evaluated across those studies achieved an AUC of 0.92 under a fixed-effect analysis for predicting postradiotherapy complications. Conclusions: Integrating BERT into the literature screening phase of meta-analysis for postradiotherapy nasopharyngeal carcinoma complications markedly improved screening efficiency while maintaining acceptable classification performance. This workflow demonstrates the feasibility of transformer-based assistance for systematic reviews and provides a foundation for developing disease-specific, AI-augmented evidence synthesis pipelines in oncology.
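    For readers unfamiliar with this kind of pipeline, a minimal sketch of fine-tuning a BERT classifier with Adagrad for title/abstract relevance, using the Hugging Face transformers library; this is an assumption-laden outline, not the authors' code, and the example texts are placeholders:
```python
# Illustrative sketch; not the authors' pipeline. Toy inputs are placeholders.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # relevant vs. not relevant
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-3)  # optimizer per the paper

def encode(texts):
    return tokenizer(texts, truncation=True, padding=True,
                     max_length=512, return_tensors="pt")

def train_step(texts, labels):
    model.train()
    out = model(**encode(texts), labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def relevance_scores(texts):
    model.eval()
    return torch.softmax(model(**encode(texts)).logits, dim=-1)[:, 1].tolist()

# toy batch; in practice, train against the manually curated reference set
train_step(["IMRT toxicity in nasopharyngeal carcinoma...", "Unrelated record..."], [1, 0])
print(roc_auc_score([1, 0], relevance_scores(["NPC radiotherapy complications...",
                                              "Out-of-scope abstract..."])))
```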
    Keywords:  BERT; artificial intelligence; complications; meta-analysis; nasopharyngeal carcinoma; natural language processing; radiotherapy
    DOI:  https://doi.org/10.3390/reports9010090
  6. J Glaucoma. 2026 Mar 19.
       PRCIS: DeepSeek, a biomedically enriched AI model, achieved the highest accuracy in generating PubMed citations for glaucoma research, outperforming general-purpose models and highlighting the necessity of human oversight to mitigate AI-related citation errors.
    PURPOSE: This study evaluated the accuracy and reliability of four artificial intelligence (AI) models (ChatGPT [OpenAI GPT-3.5], Copilot [GitHub/Microsoft], DeepSeek [DeepSeek AI], and Gemini [Google AI]) in generating PubMed citations for glaucoma research. It aimed to assess the potential of AI tools for academic reference generation and to identify their limitations, particularly in specialized fields such as ophthalmology.
    METHODS: Thirty-five standardized clinical paragraphs from The Review of Ophthalmology (4th edition) were used to test citation accuracy. Each model was instructed to generate AMA 11-style PubMed citations. Citations were evaluated for accuracy, DOI matching, and clinical relevance. An expert review validated the outputs and classified them as "Fully Cited," "Partially Cited," or "Not Cited."
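    A first pass of this DOI-matching step can be automated before expert review; a minimal sketch (an assumption about workflow, not the study's protocol) using the public Crossref REST API to check that a model-generated DOI resolves and that its registered title matches the claimed one:
```python
# Illustrative sketch; not the study's protocol. The example DOI/title are placeholders.
import re
import requests

def normalize(s):
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def doi_matches_title(doi, claimed_title):
    """True if the DOI is registered in Crossref and its title matches the one
    the AI attached to it; a real DOI paired with the wrong article is a
    common citation-hallucination failure mode."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if r.status_code != 200:
        return False  # unregistered DOI: likely fabricated
    titles = r.json()["message"].get("title", [])
    # exact match after normalization; fuzzier matching may be needed in practice
    return any(normalize(t) == normalize(claimed_title) for t in titles)

print(doi_matches_title("10.1000/example-doi", "A claimed article title"))
```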
    RESULTS: DeepSeek, a biomedically enriched model, outperformed the others, with an accuracy of 92.0%. Copilot and Gemini achieved moderate accuracies of 66.7% and 25.8%, respectively, while ChatGPT achieved the lowest citation accuracy at 19.4%. Frequent errors included DOI mismatches, incorrect journal names, and irrelevant references. Expert review confirmed that even the best model produced citation errors, emphasizing the need for human oversight. We interpret this apparent advantage cautiously, as model details, updates, and changes in underlying data may influence performance.
    CONCLUSION: AI models, particularly biomedically enriched tools such as DeepSeek, can accelerate citation drafting, but citation hallucinations and metadata errors remain common. AI should serve as a decision-support tool for reference retrieval and formatting, not a substitute for rigorous manual verification before submission.
    Keywords:  Artificial Intelligence; Citation Accuracy; Glaucoma Disease; Ophthalmology; PubMed Citations
    DOI:  https://doi.org/10.1097/IJG.0000000000002716
  7. Clin Spine Surg. 2026 Mar 09.
      Artificial intelligence (AI) represents a paradigm-shifting technology that empowers computers and software to emulate human intelligence by processing vast amounts of data. Its utilization continues to expand across diverse domains. AI software leverages data to discern patterns, enhancing the efficiency and effectiveness of various tasks. This paper reviews six prominent AI platforms: Elicit, Scite, Trinka, SciSpace, Scholarcy, and Litmaps. The study aims to explore their applications in literature composition and their potential to streamline the entire process. Despite the various benefits of AI applications, they are most effective when used synergistically with human expertise rather than as replacements for it.
    Keywords:  AI; artificial intelligence; clinical research; machine learning
    DOI:  https://doi.org/10.1097/BSD.0000000000002017
8. Digit Health. 2026 Jan-Dec;12:20552076261435085
       Objectives: This study presents a protocol that integrates conversational artificial intelligence into qualitative data analysis to support rapid, decision-oriented descriptive analysis in public health settings. The protocol was developed during an applied project with Hamilton County Public Health that analyzed interviews with the next of kin of overdose decedents to inform local strategies. The objective is to describe the protocol, its safeguards for data familiarization and human verification, and its practical application in a real-world case.
    Methods: Evaluators designed and tested manual coding, intentional artificial intelligence-assisted coding, and conversational artificial intelligence within ATLAS.ti, selecting the conversational approach for the protocol. The protocol requires a mandatory pre-analysis familiarization phase that includes reading a stratified subset of transcripts and drafting immersion memos. Analysts then pose structured natural language queries tied to prespecified research questions. All outputs are treated as proposals and undergo required human verification, including confirmation of quoted evidence and contextual review. Theme-level benchmarking compared independent human synthesis with conversational artificial intelligence outputs.
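    The "confirmation of quoted evidence" step lends itself to a simple automated assist; a minimal sketch (our assumption of one way to do it, not the published protocol) that flags AI-cited quotes that do not occur verbatim in the transcript:
```python
# Illustrative sketch; one possible quote-verification assist, not the protocol itself.
import re

def _norm(text):
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_quotes(quotes, transcript):
    """Map each quote the AI cited to whether it appears in the transcript
    (case- and whitespace-insensitive substring match); False entries must be
    re-checked by the analyst in context."""
    haystack = _norm(transcript)
    return {q: _norm(q) in haystack for q in quotes}

transcript = "He had been doing well for months.   Then everything changed."
print(verify_quotes(["then everything changed", "he refused treatment"], transcript))
# {'then everything changed': True, 'he refused treatment': False}
```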
    Results: Conversational artificial intelligence produced rapid descriptive findings anchored to verifiable text, enabling efficient auditing through embedded links. Theme-level comparison showed conceptual overlap between human and artificial intelligence outputs, with transparent documentation of areas of divergence. The protocol supported rapid training of local personnel and sustained in-house analysis capacity.
    Conclusion: The protocol formalizes a pragmatic workflow for question-led, top-down descriptive analysis using conversational artificial intelligence with mandatory human oversight. It is not intended to replace interpretive or theory-generating approaches but offers a transparent and scalable option for time-sensitive, decision-focused qualitative work.
    Keywords:  ATLAS.ti simulation; Artificial intelligence; capacity building; conversational AI; local government; qualitative data analysis
    DOI:  https://doi.org/10.1177/20552076261435085
  9. Prev Sci. 2026 Mar 26.
      Qualitative data pose a challenge for prevention science and public health: they are critical for explaining the context of communities, health, and behavior, yet collecting and analyzing qualitative data with traditional methods is time-intensive and requires extensive training. As artificial intelligence (AI) models have improved, there is growing interest in using AI to code qualitative data quickly and reliably. This study compares the similarities and differences in methods and results between AI-assisted qualitative analysis and traditional qualitative content analysis, using data collected during the development of a city- and county-based food plan. In total, 2820 community comments were collected across 43 community events in 27 zip codes across the region between March 2023 and January 2024. AI-assisted analysis was completed using a combination of a transcription app (Post-ItⓇ), GPT-4 Plus, and GPT for Sheets, with oversight from a public health practitioner. Traditional qualitative content analysis was completed by two trained coders who carried out codebook development, reliability analysis, and full content coding. Both methods used deductive codes to represent key aspects of the food system and generated inductive codes to represent areas not covered by the deductive food system codes. AI-assisted methods and traditional content analysis produced similar deductive coding results, while inductive coding results were less comparable across methods. Given that qualitative data have become a central part of prevention science, we believe that, with careful consideration and intentional oversight, AI-assisted methods have the potential to strengthen our ability to process large amounts of qualitative data.
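    As a hedged sketch of what the deductive-coding step might look like in code (the model name, codebook entries, and prompt wording are placeholders, not the authors' setup), using the OpenAI Python SDK:
```python
# Illustrative sketch; model, codes, and prompt wording are placeholders.
from openai import OpenAI

CODEBOOK = ["food access", "food production", "food waste", "nutrition education"]
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def code_comment(comment):
    """Ask the model to assign one deductive code from a fixed codebook;
    outputs still require practitioner oversight, as the study emphasizes."""
    prompt = ("Assign the single best code from this list to the comment, or "
              f"reply 'other' if none fits. Codes: {', '.join(CODEBOOK)}.\n"
              f"Comment: {comment}")
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

print(code_comment("We need more grocery stores on the east side."))  # hypothetical comment
```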
    Keywords:  AI-assisted coding; Artificial intelligence; Content analysis; GPT; Qualitative analysis
    DOI:  https://doi.org/10.1007/s11121-026-01904-4
  10. Eur Thyroid J. 2026 Mar 20. pii: ETJ-25-0385. [Epub ahead of print]
       Introduction: Artificial Intelligence (AI) chatbots are increasingly used in medicine, but their reliability in scenarios with multiple management options is unclear. Indeterminate thyroid nodules and low- to low-intermediate-risk papillary thyroid carcinoma (PTC) represent such cases.
    Methods: In a nationwide web-based survey, 201 members of the Hellenic Endocrine Society evaluated 12 clinical vignettes on indeterminate thyroid nodules and low- to low-intermediate-risk PTC. Their responses were compared with those generated by four conversational AI models (ChatGPT, Gemini, Copilot, DeepSeek) at two time points 11 months apart; DeepSeek was assessed only at the second time point. Chatbot outputs were assessed for agreement with endocrinologists' predominant answers, concordance with the most guideline-consistent options (American and European Thyroid Association recommendations), temporal stability, and inter-model agreement.
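    The inter-model agreement reported in the Results below reduces to a simple pairwise computation; a minimal sketch with hypothetical answer sets (the letters are invented, not the survey data):
```python
# Illustrative sketch; the answer letters below are invented, not the survey data.
from itertools import combinations

def pairwise_agreement(answers):
    """answers: model name -> chosen option per vignette. Returns the share of
    vignettes on which each model pair gave the same answer."""
    return {(a, b): sum(x == y for x, y in zip(answers[a], answers[b])) / len(answers[a])
            for a, b in combinations(answers, 2)}

answers = {                         # hypothetical choices (A-D) for 12 vignettes
    "ChatGPT": list("ABACDBACDABD"),
    "Gemini":  list("ABCDDBACCABD"),
    "Copilot": list("BBACDBBCDACD"),
}
for pair, agree in pairwise_agreement(answers).items():
    print(pair, f"{agree:.0%}")
```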
    Results: Alignment between chatbots and endocrinologists' predominant responses was limited, reaching at most 25% across scenarios. In contrast, concordance with the most guideline-consistent options was higher, up to 83% (10/12 scenarios) depending on the model and time point. Across 12 scenarios, ChatGPT, Gemini, and Copilot changed their responses in 4, 7, and 5 scenarios, respectively, with some updates moving closer to, and others further from, guideline-based answers. Inter-model agreement ranged from 33% to 67%, indicating substantial variability among chatbots.
    Conclusion: AI chatbots show evolving but inconsistent performance in complex thyroid management scenarios. While guideline concordance can be relatively high, substantial variability across models, limited temporal reproducibility, and poor alignment with clinical practice highlight the need for ongoing longitudinal evaluation before safe integration into clinical decision-making.
    Keywords:  Artificial intelligence; Chatbots; Clinical decision-making; Papillary thyroid cancer; Survey; Thyroid nodules
    DOI:  https://doi.org/10.1530/ETJ-25-0385
  11. Neuropsychol Rehabil. 2026 Mar 26. 1-20
      Qualitative research offers important insight into the lived experiences of individuals with traumatic brain injury (TBI), particularly in the chronic phase, where standardized measures may not fully capture subjective adaptation. This methodological case report examines the feasibility and interpretive value of combining human and artificial intelligence (AI)-assisted thematic analysis to explore subjective experience following cognitive rehabilitation in chronic severe TBI. A semi-structured qualitative interview was conducted with a 30-year-old male, 52 months post severe TBI, following 40 hours of cognitive rehabilitation. The de-identified transcript was analyzed using human-coded thematic analysis and exploratory AI-assisted thematic analysis (Grok 3, xAI). AI output was treated as preliminary and interpreted under full human oversight. Standardized psychosocial measures (BRIEF-A, NeuroQOL) were reported descriptively for contextual background. Human-coded analysis identified affective metacognition, perceived readiness for educational re-engagement, and persisting challenges. AI-assisted analysis showed substantial thematic convergence while highlighting broader interpretive patterns related to self-concept and existential reflection. Standardized self-report demonstrated partial convergence with the qualitative findings and divergence in social domains. Dual human-AI thematic analysis is feasible for qualitative neurorehabilitation research and may extend thematic exploration when conducted under careful human interpretation. Qualitative interviews captured aspects of subjective experience not fully reflected in standardized measures.
    Keywords:  Artificial intelligence; Cognitive rehabilitation; Lived experience; Qualitative research; Thematic analysis; Traumatic brain injury
    DOI:  https://doi.org/10.1080/09602011.2026.2643276