bims-arines Biomed News
on AI in evidence synthesis
Issue of 2026-01-18
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. BMC Med Res Methodol. 2026 Jan 14.
      
    Keywords:  ChatGPT-4; Claude 3 Opus; Data extraction; Evidence synthesis; Large language models
    DOI:  https://doi.org/10.1186/s12874-025-02729-5
  2. JAMIA Open. 2026 Feb;9(1):ooaf098
       Objectives: The surge in publications increases the screening time required to maintain high-quality literature reviews. One of the most time-consuming phases is title and abstract screening. Machine learning tools have semi-automated this process for systematic reviews, with limited success for scoping reviews. ChatGPT, a chatbot based on a large language model, might support scoping review screening by identifying key concepts and themes. We hypothesize that ChatGPT outperforms the semi-automated tool Rayyan, increasing efficiency at acceptable costs while maintaining a low type II error rate.
    Materials and Methods: We conducted a retrospective study using human screening decisions on a scoping review of 15 307 abstracts as a benchmark. A training set of 100 abstracts was used for prompt engineering with ChatGPT and for training Rayyan. Screening decisions for all abstracts were obtained via an application programming interface for ChatGPT and manually for Rayyan. We calculated performance metrics, including accuracy, sensitivity, and specificity, with Stata.
    Results: ChatGPT 4.0 made decisions on 15 306 abstracts, vastly outperforming Rayyan. Compared with human researchers' decisions, ChatGPT 4.0 demonstrated an accuracy of 68%, specificity of 67%, sensitivity of 88%-89%, a negative predictive value of 99%, and an 11% false negative rate. Workload savings were 64%, at reasonable cost.
    Discussion and Conclusion: This study demonstrated ChatGPT's potential in the first phase of the literature appraisal process for scoping reviews. However, human oversight remains paramount. Additional research on ChatGPT's parameters, prompts, and screening scenarios is necessary to validate these results and to develop a standardized approach.
    Keywords:  ChatGPT; artificial intelligence; automation; large language model; scoping review; screening
    DOI:  https://doi.org/10.1093/jamiaopen/ooaf098
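    A minimal sketch of this kind of workflow is given below, assuming the OpenAI Python client; the model name, prompt wording, and <topic> placeholder are illustrative assumptions and do not reflect the authors' protocol, which used prompt engineering on a 100-abstract training set and computed the metrics in Stata. The metric definitions follow the standard confusion-matrix formulas against the human reference decisions.

      # Hypothetical sketch: LLM-based include/exclude decisions and the
      # screening metrics reported above. Model name, prompt wording, and
      # the <topic> placeholder are assumptions for illustration only.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def screen_abstract(title: str, abstract: str) -> bool:
          """Return True if the model votes to include the record."""
          prompt = (
              "You are screening records for a scoping review on <topic>. "
              "Reply with exactly one word, INCLUDE or EXCLUDE.\n\n"
              f"Title: {title}\nAbstract: {abstract}"
          )
          reply = client.chat.completions.create(
              model="gpt-4o",  # assumed stand-in for the ChatGPT 4.0 used in the study
              messages=[{"role": "user", "content": prompt}],
              temperature=0,   # deterministic output keeps decisions reproducible
          )
          return reply.choices[0].message.content.strip().upper().startswith("INCLUDE")

      def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
          """Standard confusion-matrix metrics against human reference decisions."""
          return {
              "accuracy": (tp + tn) / (tp + fp + tn + fn),
              "sensitivity": tp / (tp + fn),        # share of relevant records found
              "specificity": tn / (tn + fp),
              "negative_predictive_value": tn / (tn + fn),
              "false_negative_rate": fn / (fn + tp),
          }

    In practice, the model's decisions would be tallied against the human benchmark to obtain the tp/fp/tn/fn counts, from which accuracy, sensitivity, specificity, negative predictive value, and the false negative rate follow directly.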
  3. Nature. 2026 Jan 14.
      
    Keywords:  Machine learning; Politics; Society
    DOI:  https://doi.org/10.1038/d41586-026-00147-1
  4. Nature. 2026 Jan;649(8097):529
      
    Keywords:  Machine learning; Scientific community; Technology
    DOI:  https://doi.org/10.1038/d41586-026-00049-2
  5. Nature. 2026 Jan;649(8097):560-561
      
    Keywords:  Computer science; Machine learning
    DOI:  https://doi.org/10.1038/d41586-025-04090-5
  6. Med Sci Monit. 2026 Jan 17;32:e950916
      BACKGROUND: We suggest that testing a large language model (LLM) chatbot on the accuracy of the references it provides could be a powerful, quantifiable way to rate its inherent degree of misinformation, since bibliographic data can be verified directly. Given the growing reliance on artificial intelligence (AI) tools in academic research and clinical decision-making, such a rating could be extremely useful.
    MATERIAL AND METHODS: We compared 3 versions of ChatGPT and 3 versions of Gemini by asking them to provide references on 25 highly cited topics in otorhinolaryngology (those with "guidelines" in the title). Answers were sought on 3 consecutive days to assess the variability and consistency of responses. In total, the 6 chatbots returned 1947 references, which were checked against PubMed, Web of Science, and Google Scholar and rated for accuracy. Ratings were based on correct authorship, complete bibliographic details, and proper DOI numbers.
    RESULTS: Common discrepancies were wrong author names and erroneous DOI numbers. Across the 6 chatbots, ChatGPT-4.1 (with web search enabled) achieved the best accuracy, with a score of 51%, followed by Gemini 2.5 Pro at 41%. The 2 versions with a web search facility performed better than the 4 versions without. Topics with higher citation counts had lower error rates, suggesting that more widely disseminated scientific findings yield more accurate references.
    CONCLUSIONS: Our findings provide a solid benchmark for rating AI-driven bibliographic retrieval and underline the need for further refinement before these tools can be reliably integrated into academic and clinical applications.
    DOI:  https://doi.org/10.12659/MSM.950916
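    As a rough illustration of one verification step that the authors performed by hand, the sketch below checks whether a chatbot-supplied DOI resolves and whether its registered metadata matches the claimed title and first author. It queries Crossref's public REST API as a freely accessible stand-in; the study itself verified references against PubMed, Web of Science, and Google Scholar, and the function name and matching heuristics here are illustrative assumptions.

      # Hypothetical sketch: verify a chatbot-supplied reference by resolving
      # its DOI via the Crossref REST API (a stand-in for the manual checks
      # against PubMed, Web of Science, and Google Scholar described above).
      import requests

      def check_reference(doi: str, claimed_title: str, claimed_first_author: str) -> dict:
          """Crude accuracy check: DOI resolution, title overlap, first-author surname."""
          resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
          if resp.status_code != 200:
              return {"doi_resolves": False}
          meta = resp.json()["message"]
          registered_title = (meta.get("title") or [""])[0].lower()
          authors = meta.get("author") or []
          registered_surname = authors[0].get("family", "").lower() if authors else ""
          title_ok = bool(registered_title) and (
              claimed_title.lower() in registered_title
              or registered_title in claimed_title.lower()
          )
          author_ok = bool(registered_surname) and registered_surname in claimed_first_author.lower()
          return {
              "doi_resolves": True,
              "title_matches": title_ok,
              "first_author_matches": author_ok,
          }

    Run over every returned reference, checks of this kind would yield a per-chatbot accuracy score comparable in spirit, though not in method, to the one reported in the study.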
  7. Sci Rep. 2026 Jan 12;16(1):1304
      Although general-purpose artificial intelligence (GPAI) is widely expected to accelerate scientific discovery, its practical limits in biomedicine remain unclear. We assess this potential by developing a framework of GPAI capabilities across the biomedical research lifecycle. Our scoping literature review indicates that current GPAI could deliver a speed increase of around 2x, whereas future GPAI could facilitate strong acceleration of up to 25x for physical tasks and 100x for cognitive tasks. However, achieving these gains may be severely limited by factors such as irreducible biological constraints, research infrastructure, data access, and the need for human oversight. Our expert elicitation with eight senior biomedical researchers revealed skepticism regarding the strong acceleration of tasks such as experiment design and execution. In contrast, strong acceleration of manuscript preparation, review and publication processes was deemed plausible. Notably, all experts identified the assimilation of new tools by the scientific community as a critical bottleneck. Realising the potential of GPAI will therefore require more than technological progress; it demands targeted investment in shared automation infrastructure and systemic reforms to research and publication practices.
    DOI:  https://doi.org/10.1038/s41598-025-32583-w