bims-helfai Biomed News
on AI in health care
Issue of 2026-01-11
fifteen papers selected by
Sergei Polevikov



  1. NEJM AI. 2025 Dec;2(12):
       BACKGROUND: Ambient artificial intelligence (AI) scribes record patient encounters and rapidly generate visit notes, representing a promising solution to documentation burden and physician burnout. However, the scribes' impacts have not been examined in randomized clinical trials.
    METHODS: In this parallel three-group pragmatic randomized clinical trial, 238 outpatient physicians representing 14 specialties were assigned 1:1:1 via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week) to one of two AI scribe applications, Microsoft Dragon Ambient eXperience (DAX) Copilot or Nabla, or to a usual-care control group from November 4, 2024, to January 3, 2025. The primary outcome was the change from baseline in log writing time-in-note. Secondary end points measured by surveys included the Mini-Z 2.0, a four-item physician task load (PTL) measure, and Professional Fulfillment Index - Work Exhaustion (PFI-WE) scores to evaluate aspects of burnout, work environment, and stress, plus targeted questions addressing safety, accuracy, and usability.
    RESULTS: DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% (95% confidence interval [CI], -17.2% to -1.8%; P=0.02) decrease in time-in-note versus the control group, whereas DAX users exhibited no significant change versus the control group (-1.7%; 95% CI, -9.4% to +5.9%; P=0.66). Increases in total Mini-Z (scale 10-50; DAX +2.83 [95% CI, +1.28 to +4.37]; Nabla +2.69 [95% CI, +1.14 to +4.23]) and reductions in PTL (scale 0-400; DAX -39.9 [95% CI, -71.9 to -7.9]; Nabla -31.7 [95% CI, -63.8 to +0.4]) and PFI-WE (scale 0-4; DAX -0.32 [95% CI, -0.55 to -0.08]; Nabla -0.23 [95% CI, -0.46 to +0.01]) scores suggest improvement for users of either scribe versus the control. One grade 1 (mild) adverse event was reported, while clinically significant inaccuracies were noted "occasionally" on five-point Likert questions (DAX 2.7 [95% CI, 2.4 to 3.0]; Nabla 2.8 [95% CI, 2.6 to 3.0]).
    CONCLUSIONS: Nabla reduced time-in-note versus the control. Both DAX and Nabla resulted in potential improvements in burnout, task load, and work exhaustion, but these secondary end point findings need confirmation in larger, multicenter trials. Clinicians reported that performance was similar across the two distinct platforms, and occasional inaccuracies observed in either scribe require ongoing vigilance. (Funded by the University of California, Los Angeles, Department of Medicine and others; ClinicalTrials.gov number, NCT06792890.).
    DOI:  https://doi.org/10.1056/aioa2501000
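    A minimal illustration of the scale behind the percentages reported above: if writing time is analyzed on the natural-log scale, a between-group difference beta corresponds to a percent change of (exp(beta) - 1) x 100. This is a sketch of that reporting convention only; the beta value below is hypothetical and this is not the authors' analysis code.

        import math

        def log_diff_to_pct_change(beta: float) -> float:
            """Convert a between-group difference on the natural-log scale
            to a percentage change on the original time scale."""
            return (math.exp(beta) - 1.0) * 100.0

        # Hypothetical value: a log-scale difference of -0.10 corresponds to
        # roughly a 9.5% reduction, similar in magnitude to the Nabla-vs-control
        # estimate reported above.
        print(f"{log_diff_to_pct_change(-0.10):.1f}%")  # -> -9.5%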
  2. Int J Nurs Stud. 2025 Dec 19. pii: S0020-7489(25)00332-3. [Epub ahead of print]176 105322
       BACKGROUND: Clinical documentation is essential for safe, high-quality care but has become increasingly complex, contributing to clinician burnout. Large language models offer potential to ease documentation by generating summaries, structuring data, and ensuring compliance. However, concerns remain regarding accuracy, bias, privacy, and regulatory risks.
    OBJECTIVE: To map the current literature on large language model applications in clinical documentation, evaluating their benefits, limitations, and ethical considerations.
    INFORMATION SOURCES: Five electronic databases (i.e., PubMed, Scopus, CINAHL, Cochrane Library, and IEEE Xplore) covering peer-reviewed literature published in English between January 2009 and August 2025.
    METHODS: This scoping review followed Arksey and O'Malley's framework and was reported in accordance with PRISMA-ScR guidelines. Screening, data extraction, and quality appraisal were conducted independently by multiple reviewers using Joanna Briggs Institute tools. Findings were synthesized using descriptive and narrative approaches.
    RESULTS: Forty-one studies met inclusion criteria, most originating from the United States. Large language models were primarily applied to clinical note generation, discharge summaries, and provider-patient encounter documentation. Key evaluation metrics included content accuracy, linguistic quality, and summarization performance. Large language models demonstrated potential to improve documentation efficiency and readability, with some studies reporting up to 40% time savings. However, concerns about factual inaccuracies, hallucinations, and reduced performance in complex cases were common. Clinician perceptions were mixed. Some found notes generated by large language models helpful and well-structured, while others raised concerns about reliability, liability, and loss of clinical nuance. Ethical challenges included data privacy, security, and algorithmic bias, with varying levels of compliance across settings.
    CONCLUSIONS: Large language models hold significant promise for enhancing clinical documentation by improving efficiency, standardization, and clarity. However, their safe and effective use requires rigorous attention to accuracy, ethical safeguards, and clinician trust. Integration must support, rather than supplant, clinical reasoning and patient-centered care. Co-design with clinicians, real-world evaluation, and artificial intelligence literacy are essential to ensure that these technologies augment, not erode, professional judgment and care quality.
    REGISTRATION: Open Science Framework Registries (https://osf.io/m4h3q).
    Keywords:  Artificial intelligence; Documentation; Electronic health records; Large language models
    DOI:  https://doi.org/10.1016/j.ijnurstu.2025.105322
  3. BMJ Health Care Inform. 2026 Jan 09. pii: e101400. [Epub ahead of print]33(1):
       OBJECTIVES: To examine primary care physicians' attitudes regarding artificial intelligence (AI) use for administrative clinical tasks.
    METHODS: Web-based survey with US physicians in family medicine or internal medicine (N=420, response rate 5.13%). Two hypothetical AI tools for administrative clinical activities were described. We examined physicians' attitudes towards the AI tools; associations with practice years, exposure to AI, use case, and stakeholder type were evaluated using generalised estimating equations.
    RESULTS: Participants were on average 49.6 years (SD=12.5) and 56.7% men (238/420). Physicians with fewer practice years were more likely to endorse the tools' benefits (OR 1.70-1.96), the tools' benefits outweighing risks (OR 1.79-2.06) and their openness to use (OR 1.63-1.83), and were less likely to endorse disclosure of AI use (OR 0.60 (95% CI 0.36 to 0.998)). Physicians with AI exposure were more likely to agree the tools' benefits outweighed their risks (OR 1.51 (95% CI 1.06 to 2.16)). Physicians were more likely to endorse the tools' benefit to physicians (OR 4.94 (95% CI 4.16 to 5.86)) and physicians' openness to using them (OR 3.53 (95% CI 2.97 to 4.20)) than they were to endorse their benefit to patients and patients' openness. Physicians rated an AI tool for notes generation as more beneficial than one for billing assistance (OR 1.73 (95% CI 1.39 to 2.16)).
    DISCUSSION: Although the findings are preliminary, US primary care physicians' attitudes toward AI for clinical administration varied by practice years, prior exposure to AI, use case and stakeholder type.
    CONCLUSION: Our findings highlight opportunities to develop training and implementation strategies in service of advancing safe and effective integration of administrative AI tools in primary care.
    Keywords:  Artificial intelligence; Primary Health Care
    DOI:  https://doi.org/10.1136/bmjhci-2024-101400
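    The odds ratios above come from generalised estimating equations fitted to clustered survey responses. A minimal sketch of that style of model, assuming hypothetical toy data and column names (endorse, practice_years, ai_exposure, physician_id); it is not the authors' analysis code.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        import statsmodels.formula.api as smf

        # Toy data: each physician rates two hypothetical AI tool vignettes,
        # so responses are clustered within physicians.
        rng = np.random.default_rng(0)
        n_physicians = 200
        df = pd.DataFrame({
            "physician_id": np.repeat(np.arange(n_physicians), 2),
            "endorse": rng.integers(0, 2, size=2 * n_physicians),  # 1 = endorses the tool's benefits
            "practice_years": np.repeat(rng.integers(1, 40, size=n_physicians), 2),
            "ai_exposure": np.repeat(rng.integers(0, 2, size=n_physicians), 2),
        })

        # Binomial GEE with an exchangeable working correlation for the repeated
        # ratings; exponentiated coefficients are odds ratios.
        fit = smf.gee(
            "endorse ~ practice_years + ai_exposure",
            groups="physician_id",
            data=df,
            family=sm.families.Binomial(),
            cov_struct=sm.cov_struct.Exchangeable(),
        ).fit()
        print(np.exp(fit.params))       # odds ratios
        print(np.exp(fit.conf_int()))   # 95% confidence intervals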
  4. Clin Anat. 2026 Jan 10.
      Artificial intelligence is among the most rapidly developing branches of technology. It has proven to be a helpful tool in various fields, including medicine. Significant advances in the development of new language models prompt an evaluation of their effectiveness across various areas of medicine, including anatomy. This study aimed to assess the effectiveness of artificial intelligence in solving theoretical anatomy exams designed for medical students. The study utilized 555 multiple-choice questions (150 in Polish and 405 in English) sourced from past anatomy exams for the medical program. The models tested included: ChatGPT-4o mini, ChatGPT-4o, DeepSeek, Copilot, Gemini, and two Polish models: Bielik and PLLum. Each question was asked only once. For analysis purposes, the questions were categorized by type and by the anatomical structure they addressed. Out of 555 questions, ChatGPT-4o mini answered 394 correctly (71%), ChatGPT-4o - 461 (83.1%), DeepSeek - 427 (76.9%), Copilot - 442 (79.6%), Gemini - 439 (78.8%), Bielik - 166 (29.9%), and PLLum - 222 (40.0%). The language models performed poorest on multiple-answer questions (37.6%) and best on questions concerning the function of a given organ (75%). Most of the tested language models are capable of independently passing the exam, which should serve as a warning to teaching staff supervising students during exams and assessments. Properly formulated questions can currently hinder students relying on artificial intelligence from passing, but ongoing AI advancements may result in even higher pass rates in the future.
    Keywords:  anatomy; anatomy teaching; artificial intelligence; large language models; tests
    DOI:  https://doi.org/10.1002/ca.70076
  5. JMIR Med Inform. 2026 Jan 03.
       Objective: We employed the free artificial intelligence (AI) tool Google NotebookLM®, powered by the large language model (LLM) Gemini 2.0, to construct a medical decision-making aid for diagnosing and managing airway diseases, and then evaluated its functionality and performance in the clinical workflow.
    Methods: After loading the tool with relevant published clinical guidelines for these diseases, we evaluated the feasibility of the system with respect to its behavior, capabilities, and potential, and constructed simulated cases that the system was then used to solve. The test and simulation questions were designed by a pulmonologist, and the appropriateness (focusing on accuracy and completeness) of the AI responses was judged independently by three pulmonologists. The system was then deployed in an emergency department (ED) setting, where it was tested by medical staff (n=20) to assess how it affected the process of clinical consultation. Feedback was collected through a questionnaire.
    Results: Most (58/84=66.7%) of the specialists' ratings of the AI responses were above average. Inter-rater reliability was moderate for accuracy (intraclass correlation coefficient [ICC]=0.612, P<.001) and good for completeness (ICC=0.773, P<.001). When deployed in the ED, the system provided reasonable answers and enhanced staff literacy about these diseases. The reduction in time spent in consultation did not reach statistical significance across all participants (Kolmogorov-Smirnov D=.223, P=.237), but suggested a favorable effect when only physicians' responses were analyzed.
    Conclusions: This system is customizable, cost-efficient, and accessible to clinicians and allied professionals without any computer coding experience for managing airway diseases. It provides convincing guideline-based recommendations, increases staff medical literacy, and potentially saves physicians' time spent on consultation. It warrants further evaluation in other medical disciplines and healthcare environments.
    DOI:  https://doi.org/10.2196/78567
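    For the agreement and timing statistics in the entry above, a minimal sketch of how an intraclass correlation coefficient and a two-sample Kolmogorov-Smirnov comparison might be computed, using hypothetical ratings and consultation times; pingouin and scipy are an assumed tooling choice, not the authors' pipeline.

        import numpy as np
        import pandas as pd
        import pingouin as pg
        from scipy import stats

        rng = np.random.default_rng(1)

        # Hypothetical ratings: three pulmonologists scoring 28 AI responses on a 1-5 scale.
        ratings = pd.DataFrame({
            "item": np.tile(np.arange(28), 3),
            "rater": np.repeat(["A", "B", "C"], 28),
            "score": rng.integers(1, 6, size=84),
        })
        icc = pg.intraclass_corr(data=ratings, targets="item", raters="rater", ratings="score")
        print(icc[["Type", "ICC", "pval"]])

        # Hypothetical consultation times (minutes) with and without the tool,
        # compared with a two-sample Kolmogorov-Smirnov test.
        time_without = rng.normal(12.0, 3.0, size=20)
        time_with = rng.normal(10.5, 3.0, size=20)
        d_stat, p_value = stats.ks_2samp(time_without, time_with)
        print(f"KS D = {d_stat:.3f}, p = {p_value:.3f}")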
  6. JAMIA Open. 2025 Oct;8(5): ooaf126
       Objectives: To evaluate ChatGPT's ability to perform thematic analysis of medical Best Practice Advisory (BPA) free-text comments and identify prompt engineering strategies that optimize performance.
    Materials and Methods: We analyzed 778 BPA comments from a pilot AI-enabled clinical deterioration intervention at Stanford Hospital, categorized as reasons for deterioration (Category 1) and care team actions (Category 2). Prompt engineering strategies (role, context specification, stepwise instructions, few-shot prompting, and dialogue-based calibration) were tested on a 20% random subsample to determine the best-performing prompt. Using that prompt, ChatGPT conducted deductive coding on the full dataset followed by inductive analysis. Agreement with human coding was assessed as inter-rater reliability (IRR) using Cohen's Kappa (κ).
    Results: With structured prompts and calibration, ChatGPT achieved substantial agreement with human coding (κ = 0.76 for Category 1; κ = 0.78 for Category 2). Baseline agreement was higher for Category 1 than Category 2, reflecting differences in comment type and complexity, but calibration improved both. Inductive analysis yielded 9 themes, with ChatGPT-generated themes closely aligning with human coding.
    Discussion: ChatGPT can accelerate qualitative analysis, but its rigor depends heavily on prompt engineering. Key strategies included role and context specification, pulse-check calibration, and safeguard techniques, which enhanced reliability and reproducibility.
    Conclusion: This study demonstrates the feasibility of ChatGPT-assisted thematic analysis and introduces a structured approach for applying LLMs to qualitative analysis of clinical free-text data, underscoring prompt engineering as a methodological lever.
    Keywords:  artificial intelligence; informatics; large language models; prompt engineering; qualitative research; thematic analysis
    DOI:  https://doi.org/10.1093/jamiaopen/ooaf126
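    Agreement between ChatGPT-assigned and human-assigned codes in the study above is summarised with Cohen's kappa. A minimal sketch of that computation, assuming two parallel lists of hypothetical category labels (not the study's BPA comments):

        from sklearn.metrics import cohen_kappa_score

        # Hypothetical deductive codes assigned to the same ten comments.
        human_codes = ["sepsis", "respiratory", "sepsis", "other", "respiratory",
                       "sepsis", "other", "respiratory", "sepsis", "other"]
        llm_codes   = ["sepsis", "respiratory", "sepsis", "other", "sepsis",
                       "sepsis", "other", "respiratory", "sepsis", "other"]

        kappa = cohen_kappa_score(human_codes, llm_codes)
        print(f"Cohen's kappa = {kappa:.2f}")  # values of 0.61-0.80 are conventionally read as substantial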
  7. Syst Rev. 2026 Jan 03.
       BACKGROUND: Systematic reviews are fundamental to evidence-based medicine, but the process of screening studies is time-consuming and prone to errors, especially when conducted by a single reviewer. False exclusions of relevant studies can significantly impact the quality and reliability of reviews. Artificial intelligence (AI) tools have emerged as secondary reviewers in detecting these false exclusions, yet empirical evidence comparing their performance is limited.
    METHODS: This study protocol outlines a comprehensive evaluation of four AI tools (ASReview, DistillerSR Artificial Intelligence System [DAISY], Evidence for Policy and Practice Information [EPPI]-Reviewer, and Rayyan) in their capacity to act as secondary reviewers during single-reviewer title and abstract screening for systematic reviews. Utilizing a database of single-reviewer screening decisions from two published systematic reviews, we will assess how effective AI tools are at detecting false exclusions while assisting single-reviewer screening compared to the dual-reviewer reference standard. Additionally, we aim to determine the overall screening performance of AI tools in assisting single-reviewer screening.
    DISCUSSION: This research seeks to provide valuable insights into the potential of AI-assisted screening for detecting falsely excluded studies during single screening. By comparing the performance of multiple AI tools, we aim to guide researchers in selecting the most effective assistive technologies for their review processes.
    SYSTEMATIC REVIEW REGISTRATION: (Open Science Framework): https://osf.io/dky26.
    Keywords:  AI tools; Falsely excluded studies; Rapid reviews
    DOI:  https://doi.org/10.1186/s13643-025-03031-7
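    The protocol above evaluates AI tools by how many falsely excluded records they can flag relative to the dual-reviewer reference standard. A minimal sketch of that evaluation logic on hypothetical boolean decision lists (not the authors' planned analysis code):

        def false_exclusion_detection(single_excluded, ai_flagged, dual_included):
            """Among records excluded by the single reviewer, find those the
            dual-reviewer standard would include (false exclusions) and the
            proportion of them flagged by the AI tool."""
            false_exclusions = [i for i, (exc, inc) in enumerate(zip(single_excluded, dual_included))
                                if exc and inc]
            caught = [i for i in false_exclusions if ai_flagged[i]]
            sensitivity = len(caught) / len(false_exclusions) if false_exclusions else float("nan")
            return len(false_exclusions), len(caught), sensitivity

        # Toy example with 8 records: the single reviewer excluded records 2, 4, 5 and 7;
        # the dual-reviewer standard includes records 0, 3, 4, 6 and 7, so records 4 and 7
        # are false exclusions, and the AI tool flags only record 4.
        single_excluded = [False, False, True, False, True, True, False, True]
        dual_included   = [True, False, False, True, True, False, True, True]
        ai_flagged      = [False, False, False, False, True, False, False, False]

        print(false_exclusion_detection(single_excluded, ai_flagged, dual_included))  # -> (2, 1, 0.5)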
  8. Insights Imaging. 2026 Jan 05. 17(1): 6
       OBJECTIVES: The use of AI is gaining relevance in healthcare, yet there is limited information on how patients view it. The aim of our study was to assess patients' views on the use of AI in healthcare with an on-site questionnaire.
    MATERIALS AND METHODS: Patients in our tertiary hospital with a diagnostic imaging appointment were invited to complete a paper-based questionnaire between December 2022 and October 2023. We asked about socio-demographic data, experience, knowledge, and their opinion on the use of AI in healthcare, focusing on the fields of (1) diagnostics, (2) therapy, and (3) triage.
    RESULTS: Out of a total of 198 patients (mean age 49.41 ± 17.6 years, 99 female), 91.5% stated that they expected benefits from the implementation of AI in healthcare, although 73.4% rated their knowledge of AI as moderate to none. The majority of patients were in favour of using AI in diagnostics (87.2%) and therapy (73.1%), while only 28.2% approved its use in patient triage. 84.0% wanted to be informed about the use of AI in at least one of the mentioned areas. Participants with higher education, higher self-assessed knowledge of AI and personal experience with AI showed greater approval for AI in healthcare.
    CONCLUSION: The patients we surveyed have a rather open attitude towards AI in healthcare, with differentiated views depending on the topic: they favour the use of AI especially in diagnostics and, to a lesser extent, for therapy support, but reject its use for triage.
    CRITICAL RELEVANCE STATEMENT: Overall, the results emphasise the need for widespread efforts to address patient concerns about AI in healthcare, including enhancing understanding and acceptance while protecting marginalised groups. This will help clinical radiology to adopt AI more effectively.
    KEY POINTS: There is limited information on patients' views of AI in healthcare, often focused on specific groups, limiting generalizability. Patients are open to AI in healthcare, supporting its use in diagnostics and therapy, but rejecting its use for triage. Overall, patients want to be informed about AI usage and participants with higher education and AI experience showed more approval.
    Keywords:  Artificial intelligence; Machine learning; Patient; Questionnaire; Survey
    DOI:  https://doi.org/10.1186/s13244-025-02159-3
  9. J Med Syst. 2026 Jan 09. 50(1): 6
      Artificial intelligence (AI) and large language models (LLMs) such as ChatGPT-5 are increasingly applied in medical education. However, their potential role in clinical simulation remains largely unexplored. This descriptive proof-of-concept study aimed to examine ChatGPT-5's ability to synthesize and generate educational content related to clinical simulation, focusing on the coherence, factual accuracy, and understandability of its outputs. Seven exploratory questions covering conceptual, historical, and technological aspects of clinical simulation were submitted to ChatGPT-5. Each query was regenerated three times to assess consistency. Responses were independently evaluated by multiple reviewers using a five-point Likert scale for content quality and accuracy, and the Patient Education Materials Assessment Tool (PEMAT) for understandability. Authenticity of AI-generated references was verified through PubMed and Google Scholar. ChatGPT-5 produced coherent and organized responses reflecting major milestones and trends in clinical simulation. Approximately 80% of cited references were verifiable, while some inconsistencies indicated residual fabrication. The average agreement score for accuracy and coherence was 4 ("agree"), suggesting generally acceptable quality. PEMAT analysis showed that content was structured and clear but occasionally used complex terminology, limiting accessibility. Within the exploratory scope of this proof-of-concept study, ChatGPT-5 demonstrated potential as a supportive tool for synthesizing information about clinical simulation. Nonetheless, interpretive depth, citation reliability, and pedagogical adaptation require further refinement. Future research should assess the integration of LLMs into immersive simulation environments under robust ethical and educational frameworks.
    Keywords:  Artificial intelligence (MeSH); Chatbot; Education, Medical; ChatGPT
    DOI:  https://doi.org/10.1007/s10916-025-02334-5
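    One step described above is verifying whether AI-generated references actually exist in PubMed. A minimal sketch of that idea using the public NCBI E-utilities esearch endpoint; the quoted-title query strategy is an assumption for illustration, not the authors' procedure.

        import requests

        ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

        def pubmed_title_hits(title: str) -> int:
            """Return the number of PubMed records whose title matches the quoted phrase."""
            params = {"db": "pubmed", "term": f'"{title}"[Title]', "retmode": "json"}
            response = requests.get(ESEARCH, params=params, timeout=10)
            response.raise_for_status()
            return int(response.json()["esearchresult"]["count"])

        # A citation with zero (or very few) hits would be a candidate fabricated reference
        # and would still need manual checking, e.g. in Google Scholar.
        print(pubmed_title_hits("A framework for simulation-based medical education"))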
  10. Br J Psychiatry. 2026 Jan 08. 1-3
      Artificial intelligence is increasingly being used in medical practice to complete tasks that were previously completed by the physician, such as visit documentation, treatment plans and discharge summaries. As artificial intelligence becomes a routine part of medical care, physicians increasingly trust and rely on its clinical recommendations. However, there is concern that some physicians, especially those younger and less experienced, will become over-reliant on artificial intelligence. Over-reliance on it may reduce the quality of clinical reasoning and decision-making, negatively impact patient communications and raise the potential for deskilling. As artificial intelligence becomes a routine part of medical treatment, it is imperative that physicians recognise the limitations of artificial intelligence tools. These tools may assist with basic administrative tasks but cannot replace the uniquely human interpersonal and reasoning skills of physicians. The purpose of this feature article is to discuss the risks of physician deskilling based on increasing reliance on artificial intelligence.
    Keywords:  Artificial intelligence; deskilling; over-reliance
    DOI:  https://doi.org/10.1192/bjp.2025.10496
  11. J Bone Joint Surg Am. 2026 Jan 05.
       BACKGROUND: Artificial intelligence (AI) in orthopaedics is shifting from passive interfaces in which a surgeon queries a large language model to an era of active participation in which a surgeon empowers a software platform to automate certain tasks on their behalf. The emerging new paradigm called agentic AI involves agents that move beyond decision support tools to becoming semi-autonomous collaborators in research, clinical, and rehabilitation tasks.
    PURPOSE: The purpose of this review is to summarize how recent advances (April 2022 to October 2025) in automation, prediction, and augmentation agents are poised to transform the practice of orthopaedics and to outline the conceptual, technical, and ethical foundations of this transition.
    RECENT FINDINGS: An agent is software that can process information and act independently to execute a set of defined tasks. It can seek knowledge, ask for help, deploy other software, and learn from its actions. Automation, prediction and augmentation agents can be leveraged in multi-agent and federated-learning architectures, working together to create coordinated ecosystems that can manage complex tasks and that improve with clinical use. Collectively, the output of such ecosystems is referred to as agentic AI. However, regulatory and ethical concerns underscore the need for transparency, equity, and the preservation of human agency within these frameworks.
    SUMMARY: Agentic AI marks a transition from passive tools that merely assist clinicians to autonomous systems that act alongside them. The success of this technology in orthopaedics will depend on the depth of human-machine collaboration these systems enable and how well they align computational precision with the enduring human art of restoring motion and health.
    DOI:  https://doi.org/10.2106/JBJS.25.01497
  12. Int J Med Inform. 2026 Jan 06. pii: S1386-5056(26)00010-9. [Epub ahead of print]209 106270
       OBJECTIVES: This study aimed to assess the feasibility and practical utility of using large language models (LLMs) for Logical Observation Identifiers Names and Codes (LOINC) mapping to standardise healthcare data in the field of laboratory medicine. We evaluated the accuracy and applicability of three LLMs-ChatGPT-4.0 (OpenAI), Gemini 1.5 (Google DeepMind), and Perplexity AI (Perplexity.ai)-in mapping laboratory test items, which typically require considerable institutional-level standardisation efforts.
    METHODS: A total of 75 representative laboratory test items, including 55 clinical chemistry and 20 hematology tests commonly used in clinical practice, were selected. Six board-certified clinical pathologists independently mapped each test item to its appropriate LOINC code. A consensus mapping was established by the experts and used as the gold standard. Each LLM's output was compared to this consensus, and the results were categorised as complete match (CM), partial match (PM), or mismatch (MM) based on agreement with the reference.
    RESULTS: Overall paired ordinal analyses demonstrated a significant difference in LOINC code-mapping performance among the three models, with Gemini performing significantly worse than both ChatGPT-4.0 and Perplexity AI, and no significant difference between ChatGPT-4.0 and Perplexity AI. ChatGPT-4.0 achieved the highest CM rate in clinical chemistry (58.2%), whereas Perplexity AI performed best in hematology (55.0%). Gemini showed the highest MM rates, particularly in hematology (80.0%), while partial matches were largely attributable to method-related discrepancies rather than fully incorrect mappings.
    CONCLUSION: Structured inputs, localisation to domestic laboratory practices, and expert oversight are critical to improving the reliability of LLM-generated LOINC mappings. While LLMs can reduce workload by generating candidate mappings, human validation remains essential to ensure clinical accuracy. Future improvements should focus on algorithmic refinement, error feedback integration, and adaptation to diverse laboratory settings to enhance accuracy and generalisability in real-world practice.
    Keywords:  Clinical chemistry; Hematology; Interoperability; Large language models; Logical observation identifiers
    DOI:  https://doi.org/10.1016/j.ijmedinf.2026.106270
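    The LOINC study above scores each model suggestion as a complete match (CM), partial match (PM), or mismatch (MM) against the expert consensus code. A minimal sketch of that scoring idea, using a simplified rule in which a partial match shares the analyte/component but differs on another axis such as method or specimen; this is an illustrative assumption, since the paper's exact criteria are not given in the abstract.

        def classify_mapping(suggested_code: str, consensus_code: str,
                             suggested_component: str, consensus_component: str) -> str:
            """Toy CM/PM/MM classification against a consensus LOINC code."""
            if suggested_code == consensus_code:
                return "CM"   # identical LOINC code
            if suggested_component == consensus_component:
                return "PM"   # same analyte, different method/specimen/other axis
            return "MM"       # different analyte entirely

        # Hypothetical example around serum glucose (consensus 2345-7,
        # Glucose [Mass/volume] in Serum or Plasma).
        print(classify_mapping("2345-7", "2345-7", "Glucose", "Glucose"))    # CM
        print(classify_mapping("2339-0", "2345-7", "Glucose", "Glucose"))    # PM: glucose in blood
        print(classify_mapping("718-7", "2345-7", "Hemoglobin", "Glucose"))  # MM: different analyte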
  13. JAMIA Open. 2026 Feb;9(1): ooaf123
       Objective: To explore the use of large language models (LLMs) to assist in developing new agent-based disease-specific patient journey models.
    Materials and Methods: We focus on Synthea, an open-source synthetic health data generator, with the goal of developing models in less time and with reduced expertise, expanding model diversity, and improving synthetic patient data quality. We apply a 4-stage methodology: (1) using an LLM to extract disease information from authoritative medical sources, (2) using an LLM to create an initial Synthea-compatible model, (3) validating that model through 2-level assessment (structural/syntax validation and requirements satisfaction), and (4) using an LLM to iteratively refine the model based on validation feedback.
    Results: Using hyperthyroidism as an example, we tested Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. While the LLMs generated initial models that varied widely in quality, all 3 demonstrated significant improvement in requirement fulfillment scores through successive iterations, with final requirement fulfillment scores approaching 100% for Claude and Gemini. However, evaluation by human experts revealed various structural deficits in the final models.
    Discussion: LLMs can assist in creating patient journey models when combined with structured methodology and authoritative medical knowledge sources. Iterative improvement was shown to be essential in creating models meeting stated requirements. Limitations include frequent medical code inaccuracies, model isolation without comorbidity considerations, and remaining requirements for clinical expertise and human oversight.
    Conclusion: LLMs can serve as valuable assistive tools for synthetic health data model development when used within structured, iterative frameworks, although at the time of this writing (mid-2024) LLMs require continued human expertise and validation rather than fully autonomous operation. In principle, this conclusion is not limited to Synthea and could be applied to other agent-oriented patient journey frameworks.
    Keywords:  FHIR; Synthea; disease model; large language models; synthetic health data
    DOI:  https://doi.org/10.1093/jamiaopen/ooaf123
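    The validation stage described above includes a structural/syntax check of LLM-generated Synthea modules, which are JSON state machines. A minimal sketch of such a check, using a deliberately simplified rule set (valid JSON, an "Initial" state, and direct transitions that point at defined states); this is an illustration, not Synthea's own validator.

        import json

        def validate_module(module_json: str) -> list[str]:
            """Return structural problems found in a Synthea-style module (toy rules only)."""
            errors = []
            module = json.loads(module_json)            # raises on a syntax error
            states = module.get("states", {})
            if "Initial" not in states:
                errors.append("missing an 'Initial' state")
            for name, state in states.items():
                target = state.get("direct_transition")
                if target is not None and target not in states:
                    errors.append(f"state '{name}' transitions to undefined state '{target}'")
            return errors

        # Hypothetical miniature module with one broken transition.
        example = json.dumps({
            "name": "Hyperthyroidism (toy)",
            "states": {
                "Initial": {"type": "Initial", "direct_transition": "Onset"},
                "Onset": {"type": "ConditionOnset", "direct_transition": "Missing_State"},
            },
        })
        print(validate_module(example))  # -> ["state 'Onset' transitions to undefined state 'Missing_State'"]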
  14. J Med Internet Res. 2026 Jan 08. 28 e74304
       BACKGROUND: Structured medication reviews (SMRs) are an essential component of medication optimization, especially for patients with multimorbidity and polypharmacy. However, the process remains challenging due to the complexities of patient data, time constraints, and the need for coordination among health care professionals (HCPs). This study explores HCPs' perspectives on the integration of artificial intelligence (AI)-assisted tools to enhance the SMR process, with a focus on the potential benefits of and barriers to adoption.
    OBJECTIVE: This study aims to identify the key user requirements for AI-assisted tools to improve the efficiency and effectiveness of SMRs, specifically for patients with multimorbidity, complex polypharmacy, and frailty.
    METHODS: A qualitative study was conducted involving focus groups and semistructured interviews with HCPs and patients in the United Kingdom. Participants included physicians, pharmacists, clinical pharmacologists, psychiatrists from primary and secondary care, a policy maker, and patients with multimorbidity. Data were analyzed using a hybrid inductive and deductive thematic analysis approach to identify themes related to AI-assisted tool functionality, workflow integration, user-interface visualization, and usability in the SMR process.
    RESULTS: Four major themes emerged from the analysis: innovative AI potential, optimizing electronic patient record visualization, functionality of the AI tool for SMRs, and facilitators of and barriers to AI tool implementation. HCPs identified the potential of AI to support patient identification and prioritization of those at risk of medication-related harm. AI-assisted tools were viewed as essential in detecting prescribing gaps, drug interactions, and patient risk trajectories over time. Participants emphasized the importance of presenting patient data in an intuitive format, with a patient interface for shared decision-making. Suggestions included color-coding blood results, highlighting critical medication reviews, and providing timelines of patient medical histories. HCPs stressed the need for AI tools to integrate seamlessly with existing electronic patient record systems and provide actionable insights without overwhelming users with excessive notifications or "pop-up" alerts. Factors influencing the uptake of AI-assisted tools included the need for user-friendly design, evidence of tool effectiveness (though some were skeptical about the predictive accuracy of AI models), and addressing concerns around digital exclusion.
    CONCLUSIONS: The findings highlight the potential for AI-assisted tools to streamline and optimize the SMR process, particularly for patients with multimorbidity and complex polypharmacy. However, successful implementation depends on addressing concerns related to workflow integration, user acceptance, and evidence of effectiveness. User-centered design is crucial to ensure that AI-assisted tools support HCPs in delivering high-quality, patient-centered care while minimizing cognitive overload and alert fatigue.
    Keywords:  AI; artificial intelligence; health technology; medicine optimization; risk stratification; structured medication reviews
    DOI:  https://doi.org/10.2196/74304
  15. Ophthalmol Sci. 2026 Feb;6(2): 101007
       Purpose: To assess the quality of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart.
    Design: Retrospective cross-sectional study.
    Participants: One hundred sixty-five patients who submitted ophthalmology-related questions via Epic MyChart.
    Methods: Questions asked by ophthalmology clinic patients related to the subspecialties of glaucoma, retina, and cornea via the Epic MyChart at a single institution were evaluated. Nonclinical questions were excluded. Each question was submitted to ChatGPT-4o twice, first without limitations and then after priming the large language model (LLM) to respond at a sixth-grade reading level. The ChatGPT-4o output and subsequent conversations were graded by 2 independent ophthalmologist reviewers as "accurate and complete," "incomplete," or "unacceptable" with respect to the quality of the output. A third subspecialist reviewer provided adjudication in cases of disagreement. Readability of the ChatGPT-4o output was assessed using the Flesch-Kincaid Grade Level and other readability indices.
    Main Outcome Measures: Quality and readability of answers generated by ChatGPT-4o.
    Results: Two hundred eighty-five queries asked by 165 patients were analyzed. Overall, 220 (77%) responses were graded as accurate and complete, 49 (17%) as incomplete, and 16 (6%) as unacceptable. The initial 2 reviewers agreed in 87% of the responses generated by ChatGPT-4o. The overall mean Flesch-Kincaid reading grade level was 12.1 ± 2.1. When asked to respond at a sixth-grade reading level, 242 (85%) responses were graded as accurate and complete, 38 (13%) were incomplete, and 5 (2%) were graded as unacceptable.
    Conclusions: Chat Generative Pre-Trained Transformer-4 Omni usually provides accurate and complete answers to the questions posed by patients to their glaucoma, retina, and cornea subspecialists. A substantial proportion of the responses were, however, graded as incomplete or unacceptable. Chat Generative Pre-Trained Transformer-4 Omni responses required a 12th-grade education level as assessed by Flesch-Kincaid and other readability indices, which may make them difficult for many patients to understand; however, when prompted to do so, the LLM can generate responses at a sixth-grade reading level without a compromise in response quality. Chat Generative Pre-Trained Transformer-4 Omni can potentially be used to answer clinical ophthalmology questions posed by patients; however, additional refinement will be required prior to implementation of such an approach.
    Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
    Keywords:  Artificial intelligence; ChatGPT-4o; Ophthalmology; Patient questions; Readability
    DOI:  https://doi.org/10.1016/j.xops.2025.101007
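    Readability in the study above is reported as Flesch-Kincaid grade levels. A minimal sketch of how such scores can be computed with the textstat package; this is an assumed tooling choice, since the paper does not state which implementation was used.

        import textstat

        # Hypothetical patient-facing response text, not an actual study output.
        response = (
            "Glaucoma is a disease that damages the nerve at the back of your eye. "
            "Eye drops lower the pressure inside the eye and help protect your sight."
        )

        print(f"Flesch-Kincaid grade level: {textstat.flesch_kincaid_grade(response):.1f}")
        print(f"Flesch reading ease: {textstat.flesch_reading_ease(response):.1f}")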