bims-helfai Biomed News
on AI in health care
Issue of 2026-02-01
29 papers selected by
Sergei Polevikov



  1. Am J Manag Care. 2026 Jan;32(1): 34-40
       OBJECTIVES: To evaluate the association between perceived and actual changes in physician documentation time (DocTime) following implementation of an artificial intelligence (AI) scribe and to determine whether physicians with higher baseline DocTime experience greater reductions in DocTime from AI scribe use.
    STUDY DESIGN: Retrospective assessment of AI scribe use among 310 ambulatory physicians across specialties who chose to adopt a commercial tool at a large academic medical center. We utilized data from a postimplementation user feedback survey and electronic health record audit log measures of scribe use and DocTime.
    METHODS: We used an ordered logit model to assess adjusted associations between perceived and actual changes in DocTime in the 12 weeks after AI scribe adoption for the 252 physicians (81.3%) with survey data. Multivariate regression models assessed whether baseline DocTime modified the relationship between level of AI scribe use (percentage of weekly encounters) and DocTime.
    RESULTS: Although the majority of physicians perceived reductions in DocTime (86.5%) following AI scribe adoption, there was no overall association between perceived reductions and actual changes in DocTime (OR, 0.975; P = .144). In multivariate models, higher levels of AI scribe use were associated with lower DocTime. For each additional 10% of encounters with AI scribe use, DocTime decreased by just over 30 seconds per scheduled hour (P < .001). This effect was modified by baseline DocTime, with less-efficient physicians realizing the majority of time savings.
    CONCLUSIONS: Although most physicians perceived DocTime reductions from AI scribe use, those realizing the majority of actual time savings were those with higher relative baseline DocTime.
    DOI:  https://doi.org/10.37765/ajmc.2026.89869
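    A minimal sketch of the two analyses described in the METHODS of entry 1, assuming a hypothetical per-physician table with a perceived-change rating, audit-log DocTime change, baseline DocTime, scribe-use share, and specialty; the file, column names, and exact model specifications are assumptions rather than the authors' code.

```python
# Sketch only: ordered logit (perceived vs. actual DocTime change) and an
# effect-modification model (scribe use x baseline DocTime). Column names
# and the specification are hypothetical, not the study's actual code.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("physician_doctime.csv")  # hypothetical per-physician file

# Ordered logit: 5-level perceived-change rating vs. actual change in DocTime
ologit = OrderedModel(
    df["perceived_change"],            # ordinal survey response (1-5)
    df[["actual_doctime_change"]],     # audit-log change, minutes per scheduled hour
    distr="logit",
).fit(method="bfgs", disp=False)
print(ologit.summary())

# Effect modification: does baseline DocTime change the scribe-use effect?
ols = smf.ols(
    "doctime ~ scribe_use_pct * baseline_doctime + C(specialty)", data=df
).fit()
print(ols.params[["scribe_use_pct", "scribe_use_pct:baseline_doctime"]])
```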
  2. J Adv Nurs. 2026 Jan 26.
       AIM: To identify the experiences and perceptions of healthcare professionals on artificial intelligence in healthcare.
    DESIGN: Systematic literature review of qualitative studies and meta-aggregation.
    DATA SOURCES: CINAHL, PubMed, Scopus, Medic and ProQuest were systematically searched on December 9, 2024.
    RESULTS: Twenty-six studies were included in the review, of which 25 were analysed using meta-aggregation, and the results of one study were reported narratively. A total of 185 findings were identified from the included studies that addressed the research question. These findings were aggregated into 33 categories and then into five synthesised findings as follows: (1) Perceived benefits of AI in healthcare; (2) Perceived impact of AI on professional roles and workforce dynamics; (3) Perceived impacts of AI in communication and interaction; (4) Perceived challenges of AI related to technical, financial and systemic factors; (5) Perceived ethical, cultural and regulatory considerations regarding the use of AI.
    CONCLUSION: While AI holds significant potential to enhance efficiency and improve patient outcomes, it is essential to address the concerns raised by healthcare professionals regarding workforce dynamics, communication and ethical considerations.
    IMPLICATIONS FOR THE PROFESSION: The results can inform and support the implementation of AI in healthcare and the development of AI-related education and training to meet the demands of future healthcare work.
    REPORTING METHOD: The review was conducted and reported in accordance with the PRISMA guidelines.
    PATIENT OR PUBLIC CONTRIBUTION: None.
    TRIAL REGISTRATION: PROSPERO (CRD1073200).
    Keywords:  artificial intelligence; healthcare professionals; meta‐aggregation; perceptions; qualitative literature review; systematic review
    DOI:  https://doi.org/10.1111/jan.70500
  3. Am J Emerg Med. 2026 Jan 19;102: 90-97. pii: S0735-6757(26)00038-0. [Epub ahead of print]
       PURPOSE: Artificial intelligence systems known as large language models are being evaluated for clinical decision support, yet their role in emergency and primary care remains limited. Physicians in these settings often encounter ear, nose, and throat conditions where diagnostic uncertainty, unnecessary testing, and inappropriate referrals contribute to patient risk and healthcare inefficiency. This study compared the performance of advanced large language models with physicians in diagnosis, management, and referral across common and high-acuity otolaryngologic scenarios.
    METHODS: Twelve clinical vignettes representing routine and urgent presentations were developed and validated by otolaryngologists. One hundred practicing physicians in family medicine and emergency medicine, including residents and attending physicians, completed all vignettes by providing a diagnosis, management plan, and referral decision. Four large language models (Gemini-2.0, ChatGPT-4.0, ChatGPT-5, and OpenEvidence) were tested using identical prompts. Model outputs were anonymized, randomized, and rated by a blinded expert panel using the Quality Analysis of Medical Artificial Intelligence tool, which assesses accuracy, clarity, completeness, sourcing, relevance, and usefulness.
    RESULTS: Physicians achieved mean diagnostic accuracy of 91.6% and management accuracy of 87.9%. In non-urgent cases, 30.4% of responses represented inappropriate referral. Only half recognized the need for urgent referral in a cerebrospinal fluid leak scenario. Large language models demonstrated comparable diagnostic and management accuracy with higher referral appropriateness.
    CONCLUSIONS: Large language models showed consistent, guideline-concordant reasoning in simulated emergency and primary-care otolaryngology cases. Their potential lies in supporting, not replacing, clinical judgment through responsible integration and real-world validation.
    Keywords:  Artificial intelligence; Clinical decision support; Diagnostic accuracy; Large language models; Otolaryngology; Referral patterns
    DOI:  https://doi.org/10.1016/j.ajem.2026.01.029
  4. Sociol Health Illn. 2026 Feb;48(2): e70150
      Algorithmic technologies such as machine learning, generative artificial intelligence (GenAI) and automated decision-making have become one of the frontiers of contemporary technoscientific innovation in healthcare. However, algorithmic technologies can never be seen in isolation from the networks in which they are embedded. Not only are they woven into situated sociotechnical assemblages of human and nonhuman entities (tools, objects and other technologies), but their entanglements also reach into regulatory institutions and markets. This paper conceptualises GenAI in healthcare 'in the making' at the rapidly changing intersection of three spheres: regulatory, market and healthcare delivery. Our study, conducted in conjunction with two nongovernmental social justice organisations, explores how this intersection is currently 'motored' by data justice concerns on the one hand and data capitalist objectives on the other. We draw health sociologists' attention to the technopolitics and market interests that lie behind AI promissories and implementations in healthcare. More importantly, we contribute to collective thinking around how we may steer this dynamic towards the empowerment of civic society, dynamic regulation and a push for public value, rather than enrichment of the few.
    Keywords:  data capitalism; data justice; generative AI; healthcare; markets; sociotechnical assemblages
    DOI:  https://doi.org/10.1111/1467-9566.70150
  5. J Med Internet Res. 2026 Jan 21.
       BACKGROUND: Artificial intelligence (AI) tools are widely and freely available for clinical use. Understanding hospitalists' real-world adoption patterns in the absence of organizational endorsement is essential for healthcare institutions to develop governance frameworks and optimize AI integration.
    OBJECTIVE: The objective of this study was to investigate hospitalist use of AI, examining the AI platforms being utilized, frequency of use, and clinical contexts of application. We hypothesized that AI use is more common among younger, less experienced hospitalists, albeit at an overall low frequency.
    METHODS: An anonymous online survey was distributed via email to all 70 hospitalists (physicians, nurse practitioners, physician assistants) providing direct patient care at a large urban academic tertiary care hospital. Demographic data, the AI platform used (if any), purpose(s) for AI use, and frequency of use were collected. The CHERRIES checklist was used for creating, testing, administering, and reporting the results of the survey. The chi-square test was used where possible; when expected cell values were low, Fisher's exact test was used instead. The Friedman test and pairwise Wilcoxon signed-rank tests were used to analyze differences in the frequency of AI use across tasks (a code sketch follows this entry). Likert-scale responses to frequency questions (Never, Rarely, Sometimes, Often, Always) were converted to ordinal values (1-5, respectively) to facilitate analysis.
    RESULTS: Of 70 providers, 54 (77.1%) responded to the survey. No significant differences in AI usage were observed across shift type, years of practice, time allocation to hospitalist duties, sex, age, or provider designation, contrary to our hypothesis. Overall, 36 of 54 respondents (66.7%, 95% CI 53.4%-77.8%) reported using AI in clinical practice. OpenEvidence was the most used platform (28/54, 51.9%), far exceeding general-purpose tools like ChatGPT (4/54, 7.4%), suggesting a preference for medical-specific platforms. Among non-users, the primary concerns were AI accuracy and preference for established resources. The most common applications were answering miscellaneous clinical questions (32/36, 88.9%), generating differential diagnoses (31/36, 86.1%), and determining management options (31/36, 86.1%), with much lower use for patient education materials (16/36, 44.4%). There was a statistically significant difference in the frequency of AI use across these clinical scenarios (Friedman test chi-square statistic 37.596, df 4, P<.001). Pairwise comparisons using the Wilcoxon signed-rank test revealed significant differences between use for answering miscellaneous questions and use for confirming a suspected diagnosis (P=.003) and for generating patient education materials (P=.004). Most respondents reported using AI for under 25% of clinical encounters across all use cases.
    CONCLUSIONS: Two-thirds of hospitalists organically adopted AI despite the absence of institutional oversight. AI is predominantly used as a supplementary decision support tool, with a preference for a medical-specific platform. Healthcare institutions must develop governance frameworks, validation protocols, and educational initiatives to ensure safe and effective AI deployment in clinical practice.
    DOI:  https://doi.org/10.2196/85973
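    For entry 5, a minimal sketch of the Friedman test followed by pairwise Wilcoxon signed-rank comparisons on Likert frequency ratings coded 1-5, as described in the METHODS above. The respondent-by-task matrix below is a random placeholder, the task names are illustrative, and no multiple-comparison correction is shown.

```python
# Friedman test across AI-use tasks, then pairwise Wilcoxon signed-rank tests,
# on Likert frequency ratings coded 1-5. The data below are random placeholders.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

tasks = ["misc_questions", "differential_dx", "management", "confirm_dx", "pt_education"]
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(36, len(tasks)))  # rows = AI-using respondents

stat, p = friedmanchisquare(*[ratings[:, j] for j in range(len(tasks))])
print(f"Friedman chi-square={stat:.3f}, df={len(tasks) - 1}, P={p:.4f}")

# Pairwise comparisons of the most-used task against each other task
for j in range(1, len(tasks)):
    w, pw = wilcoxon(ratings[:, 0], ratings[:, j])
    print(f"{tasks[0]} vs {tasks[j]}: W={w:.1f}, P={pw:.4f}")
```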
  6. Am J Manag Care. 2026 Jan 01. 32(1): e25-e30
       OBJECTIVES: To estimate the prevalence of ambient artificial intelligence (AI) documentation tool adoption among US hospitals using Epic electronic health record (EHR) systems and to identify hospital characteristics associated with adoption.
    STUDY DESIGN: Cross-sectional observational study of US hospitals using Epic.
    METHODS: Among a national sample of US hospitals using Epic, we assessed ambient AI adoption using Epic Showroom (June 2025) to identify eligible ambient applications and health systems that had implemented or were implementing these applications. We linked adoption data to hospital characteristics from the American Hospital Association Annual Survey (2012-2023; most recent response per hospital) and estimated multivariable logistic regression models with robust SEs clustered at the domain level, reporting adjusted predicted probabilities (margins).
    RESULTS: Among 6561 US hospitals, 2784 (42.4%) were Epic users. Among Epic hospitals, 62.6% adopted ambient AI. In adjusted analyses, adoption was higher across workload quartiles (61.7% in quartile [Q] 1 vs 73.1% in Q4; P = .003) and among hospitals in the top operating margin quartiles (58.0% in Q1 vs 67.6% in Q4; P = .001 vs Q1). Adoption was higher among metropolitan hospitals (64.7% vs 54.3% in nonmetropolitan hospitals; P = .012) and nonprofit hospitals (70.2% vs 28.8% in for-profit hospitals; P < .001).
    CONCLUSIONS: Ambient AI documentation tools were widely adopted among US hospitals using Epic EHR systems, with adoption associated with workload, financial performance, ownership, and select structural characteristics. These patterns suggest potential for uneven diffusion across hospitals and underscore the need for research on impacts on clinician outcomes, care quality, and equity.
    DOI:  https://doi.org/10.37765/ajmc.2026.89876
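    A rough sketch of the adjusted-adoption analysis in entry 6: a hospital-level logistic regression with SEs clustered on the Epic domain and adjusted predicted probabilities ("margins") obtained by averaging counterfactual predictions. The file, column names, and covariate set are hypothetical; the Epic Showroom and AHA linkage steps are not reproduced.

```python
# Hospital-level logit of ambient-AI adoption with cluster-robust SEs and
# adjusted predicted probabilities by workload quartile (hypothetical columns).
import pandas as pd
import statsmodels.formula.api as smf

hosp = pd.read_csv("epic_hospitals.csv")  # hypothetical linked Showroom/AHA file

fit = smf.logit(
    "adopted_ambient_ai ~ C(workload_quartile) + C(margin_quartile) + metro + nonprofit",
    data=hosp,
).fit(cov_type="cluster", cov_kwds={"groups": hosp["epic_domain"]}, disp=False)

# Margins: average predicted probability with every hospital set to quartile q
for q in sorted(hosp["workload_quartile"].unique()):
    print(q, fit.predict(hosp.assign(workload_quartile=q)).mean())
```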
  7. JAMIA Open. 2026 Feb;9(1): ooag004
       Objectives: To assess the performance of a reasoning large language model (LLM) in identifying medication errors in medical incident reports.
    Materials and Methods: OpenAI's O4-mini LLM was adapted using prompt engineering on 75 000 anonymized incident reports from the Västmanland region of Sweden (2019-2024). To guide the prompt design, we used a subset of 2434 reports, which were manually reclassified by pharmacists as medication-related or not. For validation, 200 reports (January 2024-March 2024) were independently classified by 2 pharmacists to establish a reference classification. Moreover, the LLM performed binary classification, with concordance rates measured against the expert consensus.
    Results: The LLM achieved a concordance rate of 96.0% (192/200; 95% CI, 92.3-98.3) with expert classification. Eight cases (4.0%) showed disagreements, primarily due to linguistic ambiguity or context-dependent interpretation. Five cases involved pharmacists classifying reports as non-medication-related, while the LLM classified them as medication-related, with the reverse in 3 cases. Subcategorization accuracy was 76.5%.
    Discussion: The LLM showed expert-level performance, outperforming existing automated methods. Thus, its integration into incident reporting systems might improve the efficiency, accuracy, and consistency of patient safety monitoring.
    Conclusion: This validated AI-driven method can be integrated directly into clinical informatics workflows, enabling healthcare organizations to rapidly and consistently identify medication errors, ultimately enhancing patient safety outcomes.
    Keywords:  artificial intelligence; incident reporting; medication errors; natural language processing; patient safety
    DOI:  https://doi.org/10.1093/jamiaopen/ooag004
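    A small sketch of the agreement calculation reported in entry 7: concordance of the LLM's binary labels with the pharmacist reference plus an exact (Clopper-Pearson) binomial 95% CI, reproducing the 192/200 = 96.0% style of result. The label arrays are placeholders, and the choice of CI method is an assumption.

```python
# Concordance between LLM labels and the pharmacist consensus, with an exact
# binomial 95% CI (placeholder label arrays; 1 = medication-related).
import numpy as np
from statsmodels.stats.proportion import proportion_confint

llm_labels = np.array([1] * 120 + [0] * 72 + [1] * 5 + [0] * 3)   # placeholder
ref_labels = np.array([1] * 120 + [0] * 72 + [0] * 5 + [1] * 3)   # placeholder

agree = int((llm_labels == ref_labels).sum())
n = len(ref_labels)
low, high = proportion_confint(agree, n, alpha=0.05, method="beta")  # Clopper-Pearson
print(f"concordance = {agree}/{n} = {agree / n:.1%} (95% CI {low:.1%}-{high:.1%})")
```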
  8. Digit Health. 2026 Jan-Dec;12: 20552076251411230
       Objective: Over the past decade, the German tele-emergency medical system (tele-EMS) has undergone continuous expansion. This growth has introduced a range of innovations that have transformed the daily work of tele-EMS physicians. At the same time, it has also brought new challenges, including parallel rescue operations, supra-regional deployments, and an increasing number of patient cases. To address these issues, the utilisation of an artificial intelligence (AI) system developed specifically for tele-EMS physicians was investigated.
    Methods: As part of a qualitative study, 11 tele-EMS physicians were interviewed to understand their perspective on the implementation of AI in the field of tele-emergency medicine. The interview questionnaire covered a range of topics, including the requirements and concerns of tele-EMS physicians regarding the use of the specific AI system, as well as their willingness to work with this system in the future.
    Results: The results of the study reveal that, despite certain concerns and fears, tele-EMS physicians are generally positive about the implementation of AI technology in prehospital tele-emergency medicine. When designed effectively, the system is considered potentially suitable for reducing the workload of tele-EMS physicians and improving the quality of patient care.
    Conclusions: This study addresses a significant gap in the field of telemedicine research by examining perceptions of tele-EMS physicians regarding the implementation of AI in prehospital tele-emergency medicine, while also outlining critical ethical considerations related to AI integration in tele-emergency care. Furthermore, it provides a set of items for a qualitative interview study that can be easily adapted for use with other medical technologies.
    Keywords:  Telemedicine; clinical decision support system; emergency medicine; ethical considerations; ethics; tele-emergency physician
    DOI:  https://doi.org/10.1177/20552076251411230
  9. JMIR AI. 2026 Jan 26;5: e80448
       BACKGROUND: Overcrowding in the emergency department (ED) is a growing challenge, associated with increased medical errors, longer patient stays, higher morbidity, and increased mortality rates. Artificial intelligence (AI) decision support tools have shown potential in addressing this problem by assisting with faster decision-making regarding patient admissions; yet many studies neglect to focus on the clinical relevance and practical applications of these AI solutions.
    OBJECTIVE: This study aimed to evaluate the clinical relevance of an AI model in predicting patient admission from the ED to hospital wards and its potential impact on reducing the time needed to make an admission decision.
    METHODS: A retrospective study was conducted using anonymized patient data from St. Antonius Hospital, the Netherlands, from January 2018 to September 2023. An Extreme Gradient Boosting AI model was developed and tested on these 154,347 visits to predict admission decisions (a code sketch follows this entry). The model was evaluated using data segmented into 10-minute intervals, which reflected real-world applicability. The primary outcome measured was the reduction in decision-making time between the AI model and the admission decision made by the clinician. Secondary outcomes analyzed the performance of the model across various subgroups, including patient age, medical specialty, classification category, and time of day.
    RESULTS: The AI model demonstrated a precision of 0.78 and a recall of 0.73, with a median time saving of 111 (IQR 59-169) minutes for true positive predicted patients. Subgroup analysis revealed that older patients and certain specialties such as pulmonology benefited the most from the AI model, with time savings of up to 90 minutes per patient.
    CONCLUSIONS: The AI model shows significant potential to reduce the time to admission decisions, alleviate ED overcrowding, and improve patient care. The model offers the advantage of always providing weighted advice on admission, even when the ED is under pressure. Future prospective studies are needed to assess the impact in the real world and further enhance the performance of the model in diverse hospital settings.
    Keywords:  AI; artificial intelligence; clinical impact; emergency department; health care
    DOI:  https://doi.org/10.2196/80448
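    The sketch referenced in the METHODS of entry 9: an Extreme Gradient Boosting classifier trained on tabular ED visit features with a chronological train/test split, reporting precision and recall. Feature names, the file, and the split date are assumptions; the 10-minute interval evaluation and the time-saving calculation are not reproduced.

```python
# XGBoost admission-prediction sketch: chronological split, precision/recall.
# File name, features, and split date are hypothetical placeholders.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score

visits = pd.read_csv("ed_visits.csv", parse_dates=["arrival_time"])
features = ["age", "triage_level", "arrival_hour", "heart_rate", "num_prior_admissions"]

train = visits[visits["arrival_time"] < "2023-01-01"]
test = visits[visits["arrival_time"] >= "2023-01-01"]

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(train[features], train["admitted"])

pred = model.predict(test[features])
print("precision:", precision_score(test["admitted"], pred))
print("recall:   ", recall_score(test["admitted"], pred))
```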
  10. Front Med (Lausanne). 2025;12: 1709413
       Introduction: Multimodal large language models (LLMs) that can interpret clinical text and images are emerging as potential decision-support tools, yet their accuracy on standardized cases and how it compares with human performance across different difficulty levels remains largely unclear. This study aimed to rigorously evaluate the performance of four leading LLMs on the 200-item New England Journal of Medicine (NEJM) Image Challenge.
    Methods: We assessed OpenAI o4-mini-high, Claude 4 Opus, Gemini 2.5 Pro, and Qwen 3, and benchmarked the top model against three medical students (Years 5-7) and an internal-medicine attending physician under identical test conditions. Additionally, we characterized the dominant error types for OpenAI o4-mini-high and tested prompt engineering strategies for potential correction.
    Results: Our results suggest that OpenAI o4-mini-high achieved the highest overall accuracy of 94%. Its performance remained consistently high across easy, moderate, and difficult cases. The human accuracies in this cohort ranged from 38.5% for three medical students to 70.5% for an attending physician, all significantly lower than OpenAI o4-mini-high. An analysis of OpenAI o4-mini-high's 12 errors revealed that most (83.3%) were outputs reflecting lapses in diagnostic logic rather than input processing. Notably, simple prompting techniques like chain-of-thought and few-shot learning corrected over half of these initial errors.
    Conclusion: Within the context of this standardized challenge, a leading multimodal LLM delivered high diagnostic accuracy that surpassed the scores of both peer models and the recruited human participants. However, these results should be interpreted as evidence of pattern recognition capabilities rather than human-like clinical understanding. While further validation on real-world data is warranted, these findings support the potential utility of LLMs in educational and standardized settings, highlighting that most residual errors are due to logic gaps that can be partly mitigated by refined user prompting, emphasizing the importance of human-AI interaction for maximizing reliability.
    Keywords:  NEJM image challenge; artificial intelligence in medicine; clinical decision support; medical education; multimodal large language models
    DOI:  https://doi.org/10.3389/fmed.2025.1709413
  11. Proc AAAI Conf Artif Intell. 2024 Mar 25. 38(20): 22021-22030
      The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. MedAlign is provided under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
    DOI:  https://doi.org/10.1609/aaai.v38i20.30205
  12. Blood Cancer Discov. 2026 Jan 26. OF1-OF5
      In this commentary, we open the debate on what can be expected from artificial intelligence (AI) in the diagnosis of hematologic cancers. We discuss the key factors that make AI solutions robust, trustworthy, and, above all, generalizable, with particular emphasis on the importance of dataset quality in shaping the performance and effectiveness of AI models.
    DOI:  https://doi.org/10.1158/2643-3230.BCD-25-0443
  13. J Physician Assist Educ. 2026 Jan 28.
       INTRODUCTION: Artificial intelligence (AI) tools, such as customized chatbots, present an opportunity to enhance active learning in physician assistant (PA) education. This study evaluated student perceptions and outcomes associated with incorporating an AI chatbot designed to promote self-directed learning, retrieval practice, and individualized feedback within a didactic clinical medicine course.
    METHODS: Faculty developed a course-specific chatbot (GPT-4o, OpenAI) integrated with lecture materials, open-access medical texts, and sample PANCE-style questions. The chatbot generated customized multiple-choice questions and provided tutoring functions for targeted remediation. Utilization data ("impressions") were collected across modules, and students completed an anonymous Qualtrics survey assessing perceptions of usefulness and engagement. Quantitative data were analyzed descriptively, and qualitative responses underwent thematic analysis.
    RESULTS: Sixty-three of 70 eligible students (90%) completed at least one survey item. Nearly all respondents (95%) used the chatbot, and 95% agreed or strongly agreed that it improved examination preparation and information retention. Most students identified practice questions and individualized feedback as the most beneficial features of the chatbot. Student use of AI increased significantly over the course, and qualitative analysis revealed themes of enhanced active recall, self-assessment, and confidence. Faculty noted reduced time demands for tutoring and quiz creation following initial tool development.
    DISCUSSION: Integration of a customized AI chatbot within PA education promoted active learning, self-efficacy, and AI literacy while ultimately reducing faculty workload. Early exposure to AI-based tools may not only enhance student readiness for didactic and summative assessments but also cultivate the skills necessary to use similar technologies effectively and responsibly in clinical practice.
    DOI:  https://doi.org/10.1097/JPA.0000000000000742
  14. Digit Health. 2026 Jan-Dec;12: 20552076261418897
       Purpose: Synthetic data has emerged as a promising solution to overcome the shortage of clinical datasets needed for training healthcare artificial intelligence (AI) models. This study examined how synthetic data can support AI development in Africa's healthcare by analyzing its technical performance, fidelity limitations, and governance implications within low-resource health systems.
    Methods: A Critical Literature Review was conducted on scholarly and technical literature focused on the use of synthetic data for AI in healthcare across African settings. Databases searched included Scopus, Web of Science, PubMed, and Google Scholar. Thematic analysis identified trends in synthetic data generation, fidelity, domain adaptation, and adoption challenges in African healthcare AI.
    Results: Drawing on interdisciplinary evidence, the analysis demonstrates how addressing technical challenges, improving synthetic data fidelity, leveraging domain adaptation techniques, and confronting practical adoption barriers are critical to enhancing the reliability and applicability of synthetic data for AI-driven healthcare in Africa. Four themes emerged from the analysis. First, hybrid synthetic-real datasets consistently outperform synthetic-only models. Second, fidelity gaps introduced bias risk and misclassification. Third, domain adaptation remains underused in low-resource contexts. Fourth, infrastructure gaps, weak regulation, and clinician skepticism hindered the adoption of synthetic data.
    Conclusion: Synthetic data can enhance AI-enabled healthcare in Africa if it is embedded within regulatory frameworks, validated through hybrid modeling, and supported by investment in infrastructure and capacity-building. This study highlights the intersection of synthetic data, healthcare AI, data fidelity, domain adaptation, and governance considerations in African health systems, underscoring the need for robust health technology assessment processes.
    Keywords:  Africa; Synthetic data; artificial intelligence; eHealth systems; health technology assessment; healthcare policy
    DOI:  https://doi.org/10.1177/20552076261418897
  15. Diagnostics (Basel). 2026 Jan 11;16(2): 232. [Epub ahead of print]
      Background/Objectives: Chest radiography is the primary first-line imaging tool for diagnosing pneumothorax in pediatric emergency settings. However, interpretation under clinical pressures such as high patient volume may lead to delayed or missed diagnosis, particularly for subtle cases. This study aimed to evaluate the diagnostic performance of ChatGPT-5, a multimodal large language model, in detecting and localizing pneumothorax on pediatric chest radiographs using multiple prompting strategies. Methods: In this retrospective study, 380 pediatric chest radiographs (190 pneumothorax cases and 190 matched controls) from a tertiary hospital were interpreted using ChatGPT-5 with three prompting strategies: instructional, role-based, and clinical-context. Performance metrics, including accuracy, sensitivity, specificity, and conditional side accuracy, were evaluated against an expert-adjudicated reference standard. Results: ChatGPT-5 achieved an overall accuracy of 0.77-0.79 and consistently high specificity (0.96-0.98) across all prompts, with stable reproducibility. However, sensitivity was limited (0.57-0.61) and substantially lower for small pneumothoraces (American College of Chest Physicians [ACCP]: 0.18-0.22; British Thoracic Society [BTS]: 0.41-0.46) than for large pneumothoraces (ACCP: 0.75-0.79; BTS: 0.85-0.88). The conditional side accuracy exceeded 0.96 when pneumothorax was correctly detected. No significant differences were observed among prompting strategies. Conclusions: ChatGPT-5 showed consistent but limited diagnostic performance for pediatric pneumothorax. Although the high specificity and reproducible detection of larger pneumothoraces reflect favorable performance characteristics, the unacceptably low sensitivity for subtle pneumothoraces precludes it from independent clinical interpretation and underscores the necessity of oversight by emergency clinicians.
    Keywords:  ChatGPT; chest radiograph; large language model; pediatric emergency medicine; pneumothorax; prompt engineering
    DOI:  https://doi.org/10.3390/diagnostics16020232
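    For entry 15, a small sketch of how the reported detection and localization metrics (accuracy, sensitivity, specificity, and conditional side accuracy among correctly detected cases) can be computed from per-radiograph outputs; the DataFrame layout and column names are hypothetical.

```python
# Detection and side-localization metrics for pneumothorax reads
# (hypothetical columns: pred_ptx, true_ptx, pred_side, true_side).
import pandas as pd

reads = pd.read_csv("chatgpt5_pneumothorax_reads.csv")  # hypothetical export

tp = ((reads.pred_ptx == 1) & (reads.true_ptx == 1)).sum()
tn = ((reads.pred_ptx == 0) & (reads.true_ptx == 0)).sum()
fp = ((reads.pred_ptx == 1) & (reads.true_ptx == 0)).sum()
fn = ((reads.pred_ptx == 0) & (reads.true_ptx == 1)).sum()

accuracy = (tp + tn) / len(reads)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Conditional side accuracy: correct laterality among correctly detected cases
detected = reads[(reads.pred_ptx == 1) & (reads.true_ptx == 1)]
side_accuracy = (detected.pred_side == detected.true_side).mean()

print(accuracy, sensitivity, specificity, side_accuracy)
```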
  16. Nature. 2026 Jan;649(8099): 1139-1146
      Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
    DOI:  https://doi.org/10.1038/s41586-025-09962-4
  17. Nurs Clin North Am. 2026 Mar;61(1): 101-111. pii: S0029-6465(25)00079-9. [Epub ahead of print]
      The artificial intelligence (AI) revolution has already begun. AI scribes are charting. Chatbots are offering limited psychotherapy services. Researchers are trying to use AI to improve suicide risk prediction, psychiatric diagnosis, medication management, and health professional education. However, this technology is not without its faults. Human oversight and continued research remain crucial.
    Keywords:  Artificial intelligence; Machine learning; Mental health
    DOI:  https://doi.org/10.1016/j.cnur.2025.09.011
  18. JMIR AI. 2026 Jan 27.
       BACKGROUND: Large language model (LLM)-based chatbots have rapidly emerged as tools for digital mental health (MH) counseling. However, evidence on their methodological quality, evaluation rigor, and ethical safeguards remains fragmented, limiting interpretation of clinical readiness and deployment safety.
    OBJECTIVE: This systematic review aimed to synthesize the methodologies, evaluation practices, and ethical/governance frameworks of LLM-based chatbots developed for MH counseling and to identify gaps affecting validity, reproducibility, and translation.
    METHODS: We searched Google Scholar, PubMed, IEEE Xplore, and ACM Digital Library for studies published between January 2020 and May 2025. Eligible studies reported original development or empirical evaluation of LLM-driven MH counseling chatbots. We excluded studies that did not involve LLM-based conversational agents, were not focused on counseling or supportive MH communication, or lacked evaluable system outputs or outcomes. Screening and data extraction were conducted in Covidence following PRISMA 2020 guidance. Study quality was appraised using a structured traffic-light framework across five methodological domains (design, dataset reporting, evaluation metrics, external validation, and ethics), with an overall judgment derived across domains. We used narrative synthesis with descriptive aggregation to summarize methodological trends, evaluation metrics, and governance considerations.
    RESULTS: Twenty studies met inclusion criteria. GPT-based models (GPT-2/3/4) were used in 45% (9/20) of studies, while 90% (18/20) used fine-tuning or domain adaptation with models such as LLaMA, ChatGLM, or Qwen. Reported deployment types were not mutually exclusive; standalone applications were most common (90%, 18/20), and some systems were also implemented as virtual agents (20%, 4/20) or delivered via existing platforms (10%, 2/20). Evaluation approaches were frequently mixed, with qualitative assessment (65%, 13/20), such as thematic analysis or rubric-based scoring, often complemented by quantitative language metrics (90%, 18/20), including BLEU, ROUGE, or perplexity. Quality appraisal indicated consistently low risk for dataset reporting and evaluation metrics, but recurring limitations were observed in external validation and reporting on ethics and safety, including incomplete documentation of safety safeguards and governance practices. No included study reported registered randomized controlled trials or independent clinical validation in real-world care settings.
    CONCLUSIONS: LLM-based MH counseling chatbots show promise for scalable and personalized support, but current evidence is limited by heterogeneous study designs, minimal external validation, and inconsistent reporting of safety and governance practices. Future work should prioritize clinically grounded evaluation frameworks, transparent reporting of model and prompt configurations, and stronger validation using standardized outcomes to support safe, reliable, and regulatory-ready deployment.
    DOI:  https://doi.org/10.2196/80348
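    Since the review in entry 18 notes that BLEU, ROUGE, and perplexity are the most common quantitative language metrics in these studies, here is a minimal sketch of computing BLEU and ROUGE for a chatbot reply against a reference counselor response. The example strings are illustrative, and the nltk and rouge-score packages are assumed.

```python
# BLEU and ROUGE-1/ROUGE-L for a chatbot reply vs. a reference response
# (illustrative strings; pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "It sounds like you have been feeling overwhelmed, and that is understandable."
candidate = "It sounds like you feel overwhelmed; that is completely understandable."

bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```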
  19. Ann Plast Surg. 2026 Jan 26.
       BACKGROUND: Patients often use Google as a source of quick medical information, although the accuracy and clarity of search results can vary. ChatGPT has emerged as an alternative tool capable of providing conversational and potentially more reliable medical information. This study compares the readability, accuracy, and completeness of responses generated by ChatGPT with those obtained using Google for common patient questions regarding craniosynostosis and cleft palate.
    METHODS: The terms "Craniosynostosis" and "Cleft Palate" were entered into Google, and the top 10 associated questions for each-identified using the "People Also Ask" tool-were recorded. Each question was then entered into both Google and ChatGPT, and the responses from each were recorded. The ease of readability for each response was determined by the Flesch-Kincaid instrument. Blinded reviewers evaluated accuracy and completeness using a 3-point scale (1 = fully incorrect, 2 = partially incorrect, 3 = correct). Reviewer scores were averaged, and comparisons between platforms were evaluated using t tests.
    RESULTS: A total of 20 questions yielded 40 unique responses. For cleft palate queries, Google responses had significantly lower reading levels than ChatGPT (9.95 vs 13.22, P = 0.006). No significant difference in readability was observed for craniosynostosis responses (14.66 vs 14.73, P = 0.467). Across all questions, ChatGPT responses were significantly more complete (2.60 vs 1.86, P < 0.0001) and more accurate (2.78 vs 2.09, P < 0.0001) than Google responses. These differences persisted when each condition was analyzed separately.
    CONCLUSION: ChatGPT provides more accurate and comprehensive information than Google for common patient questions about craniosynostosis without sacrificing readability. Patients can use this information to inform their future searches in order to obtain the most accurate information about their diagnoses. Further studies evaluating the information learned by patients from both search engines can help clinicians guide patients toward resources that best fit their individual care.
    Keywords:  ChatGPT; artificial intelligence; cleft palate; craniosynostosis; patient information
    DOI:  https://doi.org/10.1097/SAP.0000000000004652
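    A brief sketch of the readability comparison in entry 19: Flesch-Kincaid grade level for matched Google and ChatGPT answers followed by a t-test. The answer snippets are illustrative, the textstat package is assumed, and the paired form of the t-test is an assumption (the abstract states only that t tests were used).

```python
# Flesch-Kincaid grade level per answer, then a paired t-test across platforms
# (illustrative snippets; pip install textstat scipy).
import textstat
from scipy.stats import ttest_rel

google_answers = [
    "A cleft palate is an opening in the roof of the mouth.",
    "Surgery usually happens within the first two years of life.",
    "Feeding can be harder for babies with a cleft palate.",
]
chatgpt_answers = [
    "A cleft palate is a congenital opening in the roof of the mouth that forms early in pregnancy.",
    "Surgical repair is typically performed within the first one to two years of life.",
    "Infants with a cleft palate may need specialized bottles to feed effectively.",
]

google_fk = [textstat.flesch_kincaid_grade(t) for t in google_answers]
chatgpt_fk = [textstat.flesch_kincaid_grade(t) for t in chatgpt_answers]

t_stat, p_val = ttest_rel(google_fk, chatgpt_fk)
print(f"Google mean grade {sum(google_fk)/len(google_fk):.2f} vs "
      f"ChatGPT {sum(chatgpt_fk)/len(chatgpt_fk):.2f} (P={p_val:.3f})")
```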
  20. Cureus. 2025 Dec;17(12): e100323
      The integration of artificial intelligence (AI) in healthcare is rapidly progressing from diagnostic support to predictive analytics. However, most existing clinical AI systems remain limited to generating predictions without translating them into individualised, actionable decisions for patients. This abstract highlights the need for transitioning from predictive AI models to patient-centric decision intelligence, which contextualizes multimodal patient data to support personalized clinical decision-making. Conventional AI models rely primarily on structured data and image-based inputs. In reality, patient care is influenced by diverse variables, including free-text clinical notes, voice biomarkers, wearable sensor output, social determinants of health, environmental exposures, and physician reasoning. Multimodal deep learning platforms integrating these heterogeneous data streams can deliver context-aware recommendations, reduce diagnostic delays, and support dynamic, individualized treatment plans. Moving toward Decision-Intelligence Healthcare Systems (DIHS) would enable explainable risk stratification, adaptive therapeutics, and real-time bedside guidance. Ethical imperatives (algorithmic transparency, dataset diversity, secure interoperability, and shared clinical accountability) must guide deployment to ensure equity and physician trust. The next evolution of AI in medicine lies in decision intelligence, where algorithms function as ethical, explainable, and context-sensitive clinical thinking partners. Such systems can advance precision medicine, particularly in low-resource settings, by improving the quality of care, reducing error rates, and supporting equitable access to personalized treatment.
    Keywords:  artificial intelligence (ai); clinical data; clinical data management; healthcare digital ethics; multimodal data; personalized precision medicine
    DOI:  https://doi.org/10.7759/cureus.100323
  21. IEEE Trans Affect Comput. 2025 Oct-Dec;16(4): 2668-2679
      Modern affective computing systems rely heavily on datasets with human-annotated emotion labels for both training and evaluation. However, human annotations are expensive to obtain, sensitive to study design, and difficult to quality control, because of the subjective nature of emotions. Meanwhile, Large Language Models (LLMs) have shown remarkable performance on many Natural Language Understanding tasks, emerging as a promising tool for text annotation. In this work, we analyze the complexities of emotion annotation in the context of LLMs, focusing on GPT-4 as a leading model. In our experiments, GPT-4 achieves high ratings in a human evaluation study, painting a more positive picture than previous work, in which human labels served as the only ground truth. On the other hand, we observe differences between human and GPT-4 emotion perception, underscoring the importance of human input in annotation studies. To harness GPT-4's strength while preserving human perspective, we explore two ways of integrating GPT-4 into emotion annotation pipelines, showing its potential to flag low-quality labels, reduce the workload of human annotators, and improve downstream model learning performance and efficiency. Together, our findings highlight opportunities for new emotion labeling practices and suggest the use of LLMs as a promising tool to aid human annotation.
    Keywords:  Annotation; Crowdsourcing; Emotion Recognition; LLMs
    DOI:  https://doi.org/10.1109/taffc.2025.3584775
  22. Nat Commun. 2026 Jan 29.
      As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to only occur as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing to memorization as much as 0.8 of an exact duplicate and even heavily modified sequences contributing substantially to memorization. Despite models displaying significant reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. We finally show fuzzy duplicates to be ubiquitous in real-world data, untouched by deduplication techniques. In this work, we show memorization to be a complex, mosaic process, with real-world implications for privacy, confidentiality, model utility and evaluation.
    DOI:  https://doi.org/10.1038/s41467-026-68603-0
  23. J Med Internet Res. 2026 Jan 27;28: e76130
     Background: Living evidence (LE) synthesis refers to the method of continuously updating systematic evidence reviews to incorporate new evidence. It has emerged to address the limitations of the traditional systematic review process, particularly the absence of or delays in publication updates. The emergence of COVID-19 accelerated progress in the field of LE synthesis, and currently, the applications of artificial intelligence (AI) in LE synthesis are expanding rapidly. However, in which phases of LE synthesis AI should be used remains an unanswered question.
    Objective: This study aims to (1) document the phases of LE synthesis where AI is used and (2) investigate whether AI improves the efficiency, accuracy, or utility of LE synthesis.
    Methods: We searched Web of Science, PubMed, the Cochrane Library, Epistemonikos, the Campbell Library, IEEE Xplore, medRxiv, COVID-19 Evidence Network to support Decision-making, and McMaster Health Forum. We used Covidence to facilitate the monthly screening and extraction processes to maintain the LE synthesis process. Studies that used or developed AI or semiautomated tools in the phases of LE synthesis were included.
    Results: A total of 24 studies were included, including 17 on LE syntheses, with 4 involving tool development, and 7 on living meta-analyses, with 3 involving tool development. First, a total of 34 AI or semiautomated tools were involved, comprising 12 AI tools and 22 semiautomated tools. The most frequently used AI or semiautomated tools were machine learning classifiers (n=5) and the Living Interactive Evidence synthesis platform (n=3). Second, 20 AI or semiautomated tools were used for the data extraction or collection and risk of bias assessment phase, and only 1 AI tool was used for the publication update phase. Third, 3 studies demonstrated the improvement in efficiency achieved based on time, workload, and conflict rate metrics. Nine studies applied AI or semiautomated tools in LE synthesis, obtaining a mean recall rate of 96.24%, and 6 studies achieved a mean F1-score of 92.17%. Additionally, 8 studies reported precision values ranging from 0.2% to 100%.
    Conclusions: AI and semiautomated tools primarily facilitate data extraction or collection and risk of bias assessment. The use of AI or semiautomated tools in LE synthesis improves efficiency, leading to high accuracy, recall, and F1-scores, while precision varies across tools.
    Keywords:  accuracy; artificial intelligence; efficiency; living evidence synthesis; phases; semiautomated tools; utility
    DOI:  https://doi.org/10.2196/76130
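    The review in entry 23 reports recall, precision, and F1 for AI-assisted screening; below is a minimal sketch of how such metrics can be computed from a tool's include/exclude decisions against human reference decisions. The label lists are placeholders and scikit-learn is assumed.

```python
# Precision, recall, and F1 of an AI screening tool's include/exclude calls
# against human reference decisions (placeholder labels; 1 = include).
from sklearn.metrics import precision_recall_fscore_support

human_decisions = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
tool_decisions  = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_decisions, tool_decisions, average="binary"
)
print(f"precision={precision:.1%}, recall={recall:.1%}, F1={f1:.1%}")
```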
  24. Front Med (Lausanne). 2025;12: 1685419
       Background: Precise pre-procedural localisation of ventricular ectopic (VE) foci shortens mapping time, reduces fluoroscopy, and improves ablation success. Large language models such as ChatGPT offer instant, free-text clinical support; however, their accuracy in ECG-based VE localisation is unknown.
    Methods: In this single-centre pilot study, we assessed the diagnostic accuracy of ChatGPT in 50 consecutive adults (average age: 43 ± 14 years; 58% women) scheduled for first-time VE ablation. ChatGPT served as the index test, and invasive electroanatomical mapping during the ablation served as the reference standard. A blinded electrophysiologist converted each index 12-lead ECG into a structured textual description of QRS morphology. ChatGPT-4o (temperature 0.2) was then tasked with assigning one of five anatomical origins (RVOT, LVOT, papillary muscle, fascicular, and epicardial). Predictions were compared with electro-anatomical mapping during catheter ablation, and agreement was measured using Cohen's κ (κ).
    Results: Electro-anatomical mapping identified 30 RVOT, 11 LVOT, 4 papillary, 1 fascicular, and 4 epicardial foci. ChatGPT correctly localised 17/50 cases (34%), yielding an overall Cohen's κ of -0.02 (95% CI -0.18 to 0.14). Sensitivity/specificity was 40%/55% for the RVOT and 36%/62% for the LVOT; no fascicular or epicardial origins were correctly predicted. The performance of ChatGPT did not differ based on the presence of structural heart disease (p = 0.43). The duration of the procedure and the acute ablation success rate (96%) were unaffected by the accuracy of ChatGPT.
    Conclusion: Free-text querying of ChatGPT failed to provide clinically meaningful VE localisation, performing no better than chance and markedly below published ECG-based algorithms. This likely reflects the model's lack of domain-specific training and its reliance on purely text-based reasoning without direct access to ECG signals. Current general-purpose language models should not be relied upon for procedural planning in VE ablation; future work must integrate multimodal training and domain-specific optimisation before LLMs can augment electrophysiology practice.
    Keywords:  ChatGPT; artificial intelligence; catheter ablation; electrocardiogram; ventricular ectopy
    DOI:  https://doi.org/10.3389/fmed.2025.1685419
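    A minimal sketch of the agreement analysis in entry 24: overall accuracy and Cohen's κ between ChatGPT's predicted ventricular ectopy origin and the mapping-confirmed origin across the five anatomical classes. The label lists are illustrative placeholders and scikit-learn is assumed.

```python
# Accuracy and Cohen's kappa between predicted and mapping-confirmed VE origin
# (illustrative placeholder labels across the five anatomical classes).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

classes = ["RVOT", "LVOT", "papillary", "fascicular", "epicardial"]
mapped    = ["RVOT", "RVOT", "LVOT", "papillary", "RVOT", "epicardial"]  # reference standard
predicted = ["RVOT", "LVOT", "LVOT", "RVOT", "fascicular", "RVOT"]       # model output

accuracy = sum(m == p for m, p in zip(mapped, predicted)) / len(mapped)
kappa = cohen_kappa_score(mapped, predicted, labels=classes)

print(f"accuracy={accuracy:.0%}, Cohen's kappa={kappa:.2f}")
print(confusion_matrix(mapped, predicted, labels=classes))
```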
  25. Biomed Eng Lett. 2026 Jan;16(1): 1-10
      Recent advances in generative artificial intelligence (AI) have accelerated the development of foundation models: large-scale, pre-trained systems capable of learning across modalities and tasks with minimal supervision. In the radiology domain, where annotated data are limited and heterogeneous, generative AI plays a critical role not only in enabling self-supervised learning and synthetic data generation, but also in addressing core engineering challenges such as scalability, multimodal alignment, and data diversity. This review examines how generative models, ranging from VAEs to diffusion and autoregressive frameworks, serve as both the algorithmic and architectural backbone of medical foundation models. We explore hybrid designs that optimize sample quality, efficiency, and control, alongside representation learning techniques like masked autoencoding and contrastive learning. Further, we describe the design and training strategies of multimodal large language models (MLLMs), which integrate visual, textual, and structured clinical data for applications including report generation, segmentation, and clinical reasoning. Through case studies of models such as Med-CLIP, RetFound, M3D-LaMed, and Med-Gemini, we illustrate how generative AI enables scalable, adaptable, and privacy-conscious AI systems in medicine. Finally, we discuss ongoing challenges, including hallucination, generalization, and regulatory constraints, and highlight future directions for engineering trustworthy and deployable medical AI infrastructures.
    DOI:  https://doi.org/10.1007/s13534-025-00517-0
  26. J Clin Nurs. 2026 Jan 27.
      
    Keywords:  artificial intelligence; clinical decision‐making; healthcare technology; nursing informatics; systematic review
    DOI:  https://doi.org/10.1111/jocn.70200
  27. J Vis Exp. 2026 Jan 09.
      Recent advancements in large language models (LLMs) have led to notable improvements in abstractive summarization quality. However, hallucination, especially entity-level hallucination in which non-existent or incorrect entities are introduced, remains a critical challenge. In this work, we propose a reward-driven fine-tuning framework for summarization models using the Entity Hallucination Index (EHI) as a guiding metric. The methodology here begins with generating initial summaries from pre-trained models such as Flan-T5, DistilBART, and Mistral (or other popular LLMs) on structured transcript datasets (XSUM). We compute EHI by extracting named entities from both generated summaries and gold references, evaluating precision, and penalizing fabricated entities. The fine-tuning process is guided by reinforcement learning, where EHI serves as the reward signal. We adopt a REINFORCE-style update mechanism to optimize the summarization model towards maximizing entity faithfulness (a code sketch follows this entry). Experiments demonstrate that models fine-tuned with EHI achieve lower hallucination rates without compromising informativeness. Furthermore, we show that EHI-guided models generalize better on out-of-domain summarization tasks, suggesting enhanced robustness. The approach here offers a practical direction for improving factuality in summarization, emphasizing the critical role of accurate entity representation.
    DOI:  https://doi.org/10.3791/68962
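    As referenced in the abstract above, here is a schematic sketch of an Entity Hallucination Index (entity precision of the generated summary against the reference, penalizing fabricated entities) and of a REINFORCE-style loss that uses EHI as the reward. The use of spaCy for named-entity extraction, the reward baseline, and the function names are assumptions; the sampling and training loop are omitted.

```python
# Entity Hallucination Index (EHI) as entity precision, and a REINFORCE-style
# loss using EHI as the reward. spaCy NER and the baseline are assumptions;
# requires `python -m spacy download en_core_web_sm`.
import spacy
import torch

nlp = spacy.load("en_core_web_sm")

def entity_hallucination_index(summary: str, reference: str) -> float:
    """Fraction of summary entities that also appear in the reference."""
    summary_ents = {ent.text.lower() for ent in nlp(summary).ents}
    reference_ents = {ent.text.lower() for ent in nlp(reference).ents}
    if not summary_ents:
        return 1.0  # no entities generated, so nothing fabricated
    return len(summary_ents & reference_ents) / len(summary_ents)

def reinforce_loss(log_prob_sum: torch.Tensor, summary: str, reference: str,
                   baseline: float = 0.5) -> torch.Tensor:
    """REINFORCE update: raise the log-probability of high-EHI summaries."""
    reward = entity_hallucination_index(summary, reference)
    return -(reward - baseline) * log_prob_sum  # log-prob of the sampled summary
```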