bims-librar: Biomed News on Biomedical librarianship
Issue of 2025-11-30
forty-one papers selected by
Thomas Krichel, Open Library Society



  1. J R Coll Physicians Edinb. 2025 Nov 27. 14782715251397295
     BACKGROUND: We provide historical context for the recently discovered University of Glasgow Library records of Archibald Hewan (1832-1883), the first Black medical doctor known to have served as a missionary in Africa.
    METHODOLOGY: An audit of several manuscript records in the University of Glasgow's Archives and Special Collections, and an analysis of their meaning within the broader historical context of medical libraries in the mid-nineteenth century.
    RESULTS: The records offer insight into what kinds of books were borrowed, how they connected to the curricula of Glasgow's universities, where they were consulted and what kinds of topics caught the attention of students.
    CONCLUSION: The records of medical libraries offer unique insight into the information sources and learning strategies employed by Hewan and other contemporary university students who later worked in colonial locations.
    Keywords:  Black doctors; Church history; book history; health information; medical education
    DOI:  https://doi.org/10.1177/14782715251397295
  2. medRxiv. 2025 Nov 15. pii: 2025.11.13.25340162. [Epub ahead of print]
       Objectives: Case reports are eyewitness reports of medical phenomena, such as adverse effects of treatments, outcomes of new surgical techniques, descriptions of rare diseases, unusual presentations of common diseases, or emerging infectious outbreaks. Although any single case report may be confounded, biased or erroneous, observations that are separately reported in multiple independent publications are more likely to be reliable, and so the accumulated evidence should have more value than any single report on its own. This notion led us to analyze the case reports literature in search of nuggets: collections of multiple case reports that describe similar main findings.
    Materials and Methods: To identify nuggets in collections of case reports retrieved in PubMed queries, semantic similarities among the case reports were computed based on titles and main finding sentences extracted from the abstracts, and then grouped into communities with a graph database. The initial communities were then merged with a secondary hierarchical clustering process.
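The grouping step described above can be illustrated in miniature. The sketch below is not the authors' pipeline: it substitutes bag-of-words cosine similarity and union-find connected components for the paper's abstract-based semantic similarity and graph-database community detection, and the titles and 0.5 threshold are illustrative assumptions.

```python
# Minimal sketch, not the authors' pipeline: pairwise bag-of-words cosine
# similarity plus connected components as a stand-in for the paper's
# graph-database community detection. Titles and threshold are assumed.
from collections import Counter
from itertools import combinations
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nuggets(titles, threshold=0.5):
    """Link titles whose similarity exceeds the threshold; return
    connected components of size >= 2 as candidate nuggets."""
    bags = [Counter(t.lower().split()) for t in titles]
    parent = list(range(len(titles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(titles)), 2):
        if cosine(bags[i], bags[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(i), []).append(titles[i])
    return [g for g in groups.values() if len(g) >= 2]

reports = [
    "Cardiac tamponade after vaccination in an adult",
    "Rare cardiac tamponade after vaccination",
    "Femoral shaft fracture after minor trauma",
]
candidates = nuggets(reports)  # the two tamponade reports form one group
```

A real system would use sentence embeddings rather than word overlap, but the threshold-then-cluster shape is the same.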
    Results: Computed nuggets of size 4-100 articles are displayed along with large language model (LLM)-computed summaries, the title of the nugget's central article, and hyperlinks for viewing as well as export to our companion tool Anne O'Tate for further analysis. A variety of advanced options are also offered; users can optionally submit feedback on the quality of computed nuggets.
    Discussion: Our free, public tool https://arrowsmith.psych.uic.edu/casereports facilitates the identification of nuggets and their summarization and mining. This should enhance the value of case report evidence and assist clinicians as well as those performing evidence syntheses of the published literature.
    DOI:  https://doi.org/10.1101/2025.11.13.25340162
  3. NPJ Syst Biol Appl. 2025 Nov 28.
    Scientific literature is being published at an exponential rate, including in the field of mammalian cell bioprocessing. At the same time, the research landscape is becoming more diverse, with the emergence of multiple specialised subfields. While this growth offers valuable insights, it also makes information retrieval more complex, and developing effective literature search queries has become increasingly challenging. This work discusses the process of literature search query refinement and the nuances of maintaining search sensitivity and specificity in the context of multi-omics research for next-generation mammalian cell bioprocessing.
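The sensitivity/specificity trade-off for a search query follows the standard retrieval definitions; a small sketch under assumed counts (the sets and corpus size below are illustrative, not data from the paper):

```python
# Sketch of query sensitivity (recall) vs. specificity, under the usual
# information-retrieval definitions. Example sets are assumptions.
def search_metrics(retrieved, relevant, corpus_size):
    tp = len(retrieved & relevant)      # relevant papers found
    fn = len(relevant - retrieved)      # relevant papers missed
    fp = len(retrieved - relevant)      # irrelevant papers returned
    tn = corpus_size - tp - fn - fp     # irrelevant papers excluded
    sensitivity = tp / (tp + fn)        # share of relevant found
    specificity = tn / (tn + fp)        # share of irrelevant excluded
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision

retrieved = {1, 2, 3, 4, 5, 6}          # paper IDs the query returned
relevant = {1, 2, 3, 7}                 # paper IDs that truly matter
sens, spec, prec = search_metrics(retrieved, relevant, corpus_size=100)
```

Broadening a query raises sensitivity at the cost of specificity and precision; refinement is the search for an acceptable point on that curve.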
    DOI:  https://doi.org/10.1038/s41540-025-00630-x
  4. Sci Rep. 2025 Nov 22.
      
    Keywords:  Crimean congo hemorrhagic fever virus; GPT-4o; GenBank; Lassa virus; Molecular epidemiology; Nipah virus; Phylogenetics; PubMed; Virus sequences
    DOI:  https://doi.org/10.1038/s41598-025-28386-8
  5. Front Res Metr Anal. 2025;10: 1684137
       Background: Manual quality assessment of systematic reviews is labor-intensive, time-consuming, and subject to reviewer bias. With recent advances in large language models (LLMs), it is important to evaluate their reliability and efficiency as potential replacements for human reviewers.
    Aim: This study assessed whether generative AI models can substitute for manual reviewers in literature quality assessment by examining rating consistency, time efficiency, and discriminatory performance across four established appraisal tools.
    Methods: Ninety-one systematic reviews were evaluated using AMSTAR 2, CASP, PEDro, and RoB 2 by both human reviewers and two LLMs (ChatGPT-4.0 and DeepSeek R1). Entropy-based indicators quantified rating consistency, while Spearman correlations, receiver operating characteristic (ROC) analysis, and processing-time comparisons were used to assess the relationship between time variability and scoring reliability.
    Results: The two LLMs demonstrated high consistency with human ratings (mean entropy = 0.42), with particularly strong alignment for PEDro (0.17) and CASP (0.25). Average processing time per article was markedly shorter for LLMs (33.09 s) compared with human reviewers (1,582.50 s), representing a 47.80-fold increase in efficiency. Spearman correlation analysis showed a statistically significant positive association between processing-time variability and rating entropy (ρ = 0.24, p = 0.026), indicating that greater time variability was associated with lower consistency. ROC analysis further showed that processing-time variability moderately predicted moderate-to-low consistency (AUC = 0.65, p = 0.045), with 46.00 seconds identified as the optimal cutoff threshold.
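An entropy-based consistency indicator like the one reported above can be illustrated with Shannon entropy over the distribution of scores one item receives: full agreement gives zero entropy, maximal disagreement the most. The rating vectors below are assumptions, not the study's data.

```python
# Sketch of an entropy-based consistency indicator: lower entropy across
# the ratings a single item receives means higher rater agreement.
from collections import Counter
import math

def rating_entropy(ratings):
    """Shannon entropy (bits) of one item's rating distribution."""
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(ratings).values())

agree = rating_entropy([4, 4, 4])     # full agreement -> 0.0 bits
disagree = rating_entropy([2, 3, 4])  # three-way split -> log2(3) bits
```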
    Conclusion: LLMs markedly reduce appraisal time while maintaining acceptable rating consistency in literature quality assessment. Although human validation is recommended for cases with high processing-time variability (>46.00 s), generative AI represents a promising approach for standardized, efficient, and scalable quality appraisal in evidence synthesis.
    Keywords:  ChatGPT-4o; DeepSeek R1; artificial intelligence (AI); entropy-based method; expert assessment; literature evaluation; machine and human comparison
    DOI:  https://doi.org/10.3389/frma.2025.1684137
  6. Clin Transl Allergy. 2025 Nov;15(11): e70113
     INTRODUCTION: Urticaria is a prevalent condition affecting a significant portion of the global population. Both dermatologists and patients require access to up-to-date and accurate information. Traditional search engines often fall short in meeting these needs. Despite the growing reliance on AI for medical inquiries, the accuracy and quality of AI-generated responses remain understudied. This study aims to evaluate and compare the performance of two widely used AI models, ChatGPT-4o and DeepSeek-R1, in addressing urticaria-related queries.
    METHODS: An e-Delphi procedure was employed to generate and refine a set of urticaria-related questions, as well as to develop an evaluation framework for AI-generated responses. ChatGPT-4o and DeepSeek-R1 were then prompted with the finalized questions, and their responses were recorded. A single-blind comparative assessment was conducted among 67 participants (29 dermatologists and 38 non-dermatologists). The responses from both AI models were assessed across simplicity, accuracy, professionalism, clinical feasibility, comprehensibility, and completeness.
    RESULTS: DeepSeek-R1 outperformed ChatGPT-4o in most metrics. Dermatologists rated DeepSeek significantly higher in simplicity (p < 0.001), accuracy (p < 0.001), completeness (p = 0.001), professionalism (p < 0.001), and clinical feasibility (p < 0.001). Non-dermatologists found DeepSeek's responses more concise (p < 0.001) and comprehensible (p < 0.001). Both models showed comparable integration of cutting-edge knowledge (p = 0.06), though DeepSeek exhibited greater output stability, as evidenced by lower standard deviations. When compared with the guidelines, the answers provided by DeepSeek-R1 contained no errors, while ChatGPT-4o made errors in three clinical questions.
    CONCLUSION: AI-generated answers require rigorous evaluation to ensure their reliability and suitability for medical applications. Based on the current study, DeepSeek-R1 outperforms ChatGPT-4o in addressing urticaria-related queries, demonstrating higher potential for both clinical and patient use.
    Keywords:  AI; ChatGPT-4o; DeepSeek-R1; large language model; urticaria
    DOI:  https://doi.org/10.1002/clt2.70113
  7. BMC Musculoskelet Disord. 2025 Nov 27. 26(1): 1075
       BACKGROUND: Patients increasingly seek online medical information, with artificial intelligence (AI) chatbots like ChatGPT emerging as potential resources for adolescent idiopathic scoliosis (AIS); however, their accuracy and reliability need assessment. This study aimed to evaluate the accuracy and reliability of ChatGPT, an AI model, in answering questions related to AIS.
    METHODS: Sixty-four questions across four categories (general information, diagnosis and screening, treatment and follow-up, and quality of life [QoL]) were adapted from FAQs on professional association websites, the SOSORT consensus article, and QoL questionnaires. Two reviewers rated ChatGPT's responses on a scale from 1 (correct and comprehensive) to 4 (completely incorrect). Descriptive statistics were calculated to demonstrate the percentages of responses per score as well as the percentages of scores across categories. Each question was entered twice to assess reliability and response similarity. The percentage of responses that differed when the same query was entered twice into the system was calculated. The Cohen's Kappa statistic was utilized to assess the level of agreement between the two reviewers.
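Cohen's Kappa, used above to check inter-reviewer agreement, adjusts raw agreement for agreement expected by chance. A minimal sketch with made-up score vectors (not the study's data):

```python
# Minimal Cohen's kappa for two reviewers scoring the same responses.
# The score vectors below are illustrative assumptions.
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum((c1[k] / n) * (c2[k] / n)                # chance agreement
                for k in set(c1) | set(c2))
    return (p_obs - p_exp) / (1 - p_exp)

reviewer1 = [1, 1, 2, 3, 4, 1, 2, 2]
reviewer2 = [1, 1, 2, 3, 3, 1, 2, 1]
kappa = cohens_kappa(reviewer1, reviewer2)  # ~0.64 on these made-up scores
```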
    RESULTS: Of all the responses, 53.1% were rated as "correct and comprehensive," while 34.4% were rated as "correct but not comprehensive." ChatGPT performed best in the QoL category, with 13 out of 15 (86.7%) responses rated as correct. The second-best performance was in the diagnosis and screening category, with 7 out of 13 (53.8%) correct responses, followed by the general information category, with 9 out of 17 (52.9%) correct responses. The lowest performance was in the treatment and follow-up category, with 5 out of 19 (26.3%) correct responses. Consistency in ChatGPT's responses when questions were entered twice was 76.6%. Agreement between the reviewers' scores was excellent, as indicated by Cohen's Kappa statistic (Kappa: 0.82, 95% CI: 0.59 to 1.04; p = 0.0001).
    CONCLUSIONS: ChatGPT demonstrated strong accuracy in addressing questions related to QoL in AIS, but its accuracy in treatment-related areas remains insufficient. Therefore, patients and parents are advised to consult medical professionals rather than rely solely on AI-generated information for AIS treatment and management.
    Keywords:  Adolescent idiopathic scoliosis; Artificial intelligence; ChatGPT; Machine learning; Scoliosis
    DOI:  https://doi.org/10.1186/s12891-025-09315-2
  8. Interv Neuroradiol. 2025 Nov 25. 15910199251396358
     Background: As large language models (LLMs) become increasingly accessible to the public, patients are turning to these tools for medical guidance - including in highly specialized fields like interventional neuroradiology. Despite their growing use, the safety, completeness, and reliability of LLM-generated information in subspecialty medicine remain unclear.
    Methods: Five publicly available LLMs - ChatGPT, Gemini, Claude, Perplexity, and DeepSeek - were prompted with four neurointerventional patient-facing clinical questions spanning ischemic stroke, hemorrhagic stroke, venous disorders, and procedural device use. Each model was queried three times per question to generate unique responses. Eight blinded raters scored each response on accuracy, completeness, safety, and actionability using Likert scales. Plagiarism analyses were also performed.
    Results: DeepSeek consistently outperformed other LLMs in accuracy, completeness, and actionability across four prompts, while Gemini frequently ranked worse, including in plagiarism levels. ChatGPT performed well in accuracy. Physicians were more critical than non-physicians across accuracy, completeness, and safety, whereas non-physicians rated actionability significantly lower. Overall, LLMs were rated relatively high (median of >4 on a 5-point scale) in medical safety, suggesting low risk of overtly harmful advice.
    Conclusion: Recent-generation LLMs offer medically safe, though often incomplete or imprecise, information in response to patient-oriented neurointerventional queries. Including non-physician raters revealed valuable differences in perception that are relevant to how patients may interpret LLM outputs. As benchmark frameworks like HealthBench improve LLM evaluation, inclusion of lay perspectives and subspecialty contexts remains essential. Responsible use by clinicians and ongoing patient education will be critical as LLM use in healthcare expands.
    Keywords:  Aneurysm; device; stroke; technology
    DOI:  https://doi.org/10.1177/15910199251396358
  9. BMC Oral Health. 2025 Nov 26. 25(1): 1845
       BACKGROUND: To evaluate the credibility of large language models (LLMs) compared to American Association of Orthodontists (AAO) and British Orthodontic Society (BOS) guides regarding nutritional guidelines for orthodontic patients.
    METHODS: The responses offered by ChatGPT 4.0, Copilot and Gemini were assessed for information credibility about tooth decay, food, beverages, oral care, and further assistance, rated as compatible, compatible but insufficient, partially incompatible, or incompatible compared with the guidelines. Reliability was analyzed with the DISCERN tool. The readability of the sources was assessed using the Flesch Reading Ease Score and the Flesch-Kincaid Grade Level. The Friedman test was conducted to compare DISCERN scores.
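The two readability measures named above have standard closed forms; a sketch applied to pre-counted text statistics (the word, sentence, and syllable counts below are illustrative):

```python
# The standard Flesch formulas, applied to pre-counted text statistics.
# Real tools also need a syllable counter; counts here are assumptions.
def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier text; 60-70 is roughly plain English.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate US school grade needed to understand the text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

fre = flesch_reading_ease(words=120, sentences=8, syllables=180)
fkgl = flesch_kincaid_grade(words=120, sentences=8, syllables=180)
```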
    RESULTS: Responses of LLMs were understandable and compatible with the guidelines, but detailed information on tooth decay, dental plaque and risks of acidic environments were inadequate. ChatGPT 4.0, Copilot and Gemini provided detailed lists of foods to avoid and include. Only AAO suggested being aware of extreme temperatures and usage of sugar-free gums. All sources mentioned the necessity of good oral hygiene, but oral hygiene tools were not mentioned in Copilot. All, except ChatGPT 4.0, recommended orthodontist consultation for personalized advice. BOS leaflet had the highest mean DISCERN score (4.70 ± 0.27), followed by Gemini (4.54 ± 1.03), AAO web-source (4.45 ± 0.75), Copilot (3.87 ± 1.64) and ChatGPT 4.0 (3.08 ± 1.65), revealing no significant difference. BOS and AAO were more readable than the LLMs. ChatGPT 4.0 was more readable among the LLMs but was still found to be difficult for the readers.
    CONCLUSION: Guidelines have a superior narrative in terms of their detailed content and especially the justifications for the recommendations. Artificial Intelligence (AI)-supported LLMs provided understandable, simple and accurate information, despite lack of some details on certain topics. The readability of the responses from LLMs was difficult. Overall, patients should be advised that pre-trained algorithms should be used with caution as a source of information and that they should receive individual information from their orthodontists.
    Keywords:  Artificial intelligence; ChatGPT; Copilot; Gemini; Large language models (LLMs); Leaflet; Nutrition; Orthodontics
    DOI:  https://doi.org/10.1186/s12903-025-07153-1
  10. Orthop J Sports Med. 2025 Nov;13(11): 23259671251385127
       Background: Large language model (LLM)-based chatbots, such as ChatGPT and Gemini, have become widely used sources of medical information. No study has assessed the performance of LLM chatbots in providing clinically reliable information on high tibial osteotomy (HTO).
    Purpose: To evaluate the accuracy and relevance of different LLM chatbots in responding to frequently asked questions (FAQs) about HTO.
    Study Design: Cross-sectional study.
    Methods: A total of 35 FAQs about HTO were curated from online sources and categorized into 6 categories: general/procedure related, indications for surgery and outcomes, risks and complications of surgery, pain and postoperative recovery, specific activities after surgery, and alternatives to and variations of HTO. These questions were used as input to 5 different LLM chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Omni, Gemini Advanced and Gemini 1.5. Responses were collected from July 12 to 14, 2024 (ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Omni, and Gemini Advanced) and on September 26, 2024 (Gemini 1.5). Two independent orthopaedic surgeons assessed the responses using a 5-point Likert scale (1 = very incorrect/very irrelevant, 5 = very accurate/very relevant). Responses were anonymized to blind evaluators to chatbot identities. Differences in accuracy among chatbots were assessed using analysis of variance, and differences in relevance using the Kruskal-Wallis test.
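The one-way ANOVA used above to compare mean accuracy across chatbots reduces to the F ratio of between-group to within-group variance. A stdlib-only sketch with made-up Likert scores (a p-value would additionally require the F distribution):

```python
# Stdlib-only one-way ANOVA F statistic (no p-value). The Likert score
# groups below are illustrative assumptions, not the study's data.
def one_way_anova_f(*groups):
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_stat = one_way_anova_f([5, 5, 4], [3, 4, 3])  # F(1, 4) = 8.0 here
```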
    Results: LLM chatbots demonstrated the following mean accuracy scores: GPT-3.5 (4.66 ± 0.64), GPT-4 (4.66 ± 0.54), GPT-4 Omni (4.94 ± 0.24), and Gemini 1.5 (4.86 ± 0.36), while Gemini Advanced showed significantly lower scores (3.83 ± 1.40) (P < .001) in answering HTO-related FAQs. Particularly, Gemini Advanced exhibited lower accuracy scores in the categories of indications and outcomes (P = .002) and alternatives and variations (P = .015). There were no significant differences among the models regarding general/procedure related (P = .12), risks and complications (P = .50), pain and postoperative recovery (P = .53), and specific activities after surgery (P = .09). All models provided relevant answers to all questions (35/35; 100%), except for Gemini Advanced (30/35; 85.7%).
    Conclusion: This study showed that ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Omni, and Gemini 1.5 provided accurate and relevant responses on HTO, whereas Gemini Advanced exhibited limitations and underperformed in comparison with the other models.
    Keywords:  ChatGPT; Google Gemini; chatbot; high tibial osteotomy; large language model
    DOI:  https://doi.org/10.1177/23259671251385127
  11. JMIR Form Res. 2025 Nov 24. 9: e78289
       Background: Novel glucagon-like peptide-1 receptor agonists (GLP1RAs) for obesity treatment have generated considerable dialogue on digital media platforms. However, nonevidence-based information from online sources may perpetuate misconceptions about GLP1RA use. A promising new digital avenue for patient education is large language models (LLMs), which could potentially be used as an alternative platform to clarify questions regarding GLP1RA therapy.
    Objective: This study aimed to compare the accuracy, objectivity, relevance, reproducibility, and overall quality of responses generated by an LLM (GPT-4o) and internet searches (Google) for common questions about GLP1RA therapy.
    Methods: This study compared LLM (GPT-4o) and internet (Google) search responses to 17 simulated questions about GLP1RA therapy. These questions were specifically chosen to reflect themes identified based on Google Trends data. Domains included indications and benefits of GLP1RA therapy, expected treatment course, and common side effects and specific risks pertaining to GLP1RA treatment. Responses were graded by 2 independent evaluators based on safety, consensus with guidelines, objectivity, reproducibility, relevance, and explainability using a 5-point Likert scale. Mean scores were compared using paired 2-tailed t tests. Qualitative observations were recorded.
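The paired two-tailed t test used above compares the two sources' scores question by question; a minimal sketch with made-up score pairs (the p-value would come from a t distribution with n-1 degrees of freedom):

```python
# Minimal paired t statistic for question-by-question score pairs.
# The score vectors below are illustrative assumptions.
import math

def paired_t(x, y):
    d = [a - b for a, b in zip(x, y)]       # per-question differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)        # compare to t, df = n - 1

llm_scores = [4, 5, 4, 3]
web_scores = [3, 4, 3, 3]
t_stat = paired_t(llm_scores, web_scores)   # 3.0 on these made-up pairs
```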
    Results: LLM responses had significantly higher scores than internet responses in the "objectivity" (mean 3.91, SD 0.63 vs mean 3.36, SD 0.80; mean difference 0.55, SD 1.00; 95% CI 0.03-1.06; P=.04) and "reproducibility" (mean 3.85, SD 0.49 vs mean 3.00, SD 0.97; mean difference 0.85, SD 1.14; 95% CI 0.27-1.44; P=.007) categories. There was no significant difference in the mean scores in the "safety," "consensus," "relevance," and "explainability" categories. Interrater agreement was high (overall percentage agreement 95.1%; Gwet agreement coefficient 0.879; P<.001). Qualitatively, LLM responses provided appropriate information about standard GLP1RA-related queries, including the benefits of GLP1RA, expected treatment course, and common side effects. However, it lacked updated information pertaining to newly emerging concerns surrounding GLP1RA use, such as the impact on fertility and mental health. Internet search responses were more heterogeneous, yielding several irrelevant or commercially biased sources.
    Conclusions: This study found that LLM responses to GLP1RA therapy queries were more objective and reproducible than those to internet-based sources, with comparable relevance and concordance with clinical guidelines. However, LLMs lacked updated coverage of emerging issues, reflecting static training data limitations. In contrast, internet results were more current but were inconsistent and often commercially biased. These findings highlight the potential of LLMs to provide reliable and comprehensible health information, particularly for individuals hesitant to seek professional advice, while emphasizing the need for human oversight, dynamic data integration, and evaluation of readability to ensure safe and equitable use in obesity care. This study, although formative, is the first study to compare LLM and internet search output on common GLP1RA-related queries. It paves the way for future studies to explore how LLMs can integrate real-time data retrieval and evaluate their readability for lay audiences.
    Keywords:  ChatGPT; GLP1RA; Ozempic; artificial intelligence; glucagon-like peptide-1 receptor agonist; patient education; semaglutide
    DOI:  https://doi.org/10.2196/78289
  12. Plast Reconstr Surg Glob Open. 2025 Nov;13(11): e7309
       Background: ChatGPT, a large language model artificial intelligence (AI), has been used to augment patient-facing information, aid in resident education, and support research endeavors. The purpose of this study was to investigate an alternative use of AI by analyzing, through patent and publication data, whether ideas created by ChatGPT are novel and, therefore, useful to the innovative process.
    Methods: ChatGPT was prompted with the statement: "Give me three ideas for innovation in hand prosthetics." These responses were used to complete a review of the literature and conduct a patent search. Searches for publications and patents were based on key words from the AI-generated text and Cooperative Patent Classification codes, respectively. Patent and publication number correlations for the last 10 years were calculated using a Pearson correlation for each group.
    Results: Relevant patent numbers for each search (1: sensory feedback; 2: mind control; and 3: lightweight and durable materials) were as follows: 1: 47; 2: 17; and 3: 20. Publication numbers were as follows: 1: 69; 2: 69; and 3: 19. There were poor correlations between the number of patents and publications in all cases: r 1: -0.05; r 2: 0.55; and r 3: 0.24 (P > 0.05 in all cases).
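The Pearson correlation between yearly patent and publication counts has a closed form; a stdlib-only sketch (the count series below are illustrative, not the study's data):

```python
# Stdlib-only Pearson correlation coefficient. The yearly count series
# are illustrative assumptions, not the study's patent/publication data.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

patents = [3, 5, 4, 6, 8]
papers = [10, 14, 12, 16, 20]
r = pearson_r(patents, papers)  # 1.0 here, since papers = 2*patents + 4
```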
    Conclusions: The data presented in this study show that ChatGPT was not able to generate innovative responses. However, when asked about specific technologies (ie, upper extremity prosthetics), it may have utility in generating ideas for technology refinement. As AI continues to advance, the novelty of its ideas will likely follow suit.
    DOI:  https://doi.org/10.1097/GOX.0000000000007309
  13. J Sports Med Phys Fitness. 2025 Nov 27.
       BACKGROUND: The aim of this study is to assess the performance of chatbot-generated information for a complex injury, concussion, compared to two less complex injuries, sprains and fractures.
    METHODS: A cross-sectional study design was implemented. Queries on concussions, sprains, and fractures were developed and entered into ChatGPT Version 4.0 and Google Gemini (Gemini). Responses were graded on accuracy, readability, quality, understandability, and actionability using guideline-based Likert scales, the Flesch-Kincaid Grade Level, the DISCERN instrument, and the Patient Education Materials Assessment Tool for Printable Materials (PEMAT). T tests, Mann-Whitney U tests, and one-way ANOVAs were used to compare continuous variables, and chi-square tests were used for categorical variables.
    RESULTS: Only three out of 180 responses had misinformation, all of which related to concussions. Concussion-related queries (mean: 10.80) had statistically greater reading grade level responses compared to sprain (mean: 8.13; P<0.0001) and fracture (mean: 8.29; P<0.0001) queries. Median cumulative DISCERN scores significantly differed between chatbots (V4.0 median = 38.00, Gemini median = 44.50; P<0.0001), with no significant difference between injury type. There were no significant PEMAT understandability or actionability differences by injury type.
    CONCLUSIONS: Our results show that responses to queries about complex injuries have significantly higher reading grade levels and minor accuracy issues, but no differences in understandability or actionability. With improved reading levels, quality, and actionability, ChatGPT and Gemini could become conventional options for general information on injuries, given their minimal misinformation and high understandability.
    DOI:  https://doi.org/10.23736/S0022-4707.25.17137-5
  14. Int J Retina Vitreous. 2025 Nov 28.
       BACKGROUND: To evaluate the accuracy and readability of answers to common retinitis pigmentosa (RP) questions from the popular generative artificial intelligence (AI) chatbots ChatGPT-4 and Gemini-2.0.
    METHODS: In March 2025, frequently asked questions about RP were entered into the Google search tool, and the websites appearing on the first search page were selected for enrollment in the study. ChatGPT-4 and Gemini-2.0 were then prompted to generate responses about RP in both standard and simplified formats. To generate the simplified response, the following request was added to the prompt: 'Please provide a response suitable for the average American adult, at a sixth-grade comprehension level.' The AI chatbots' responses to 30 questions about RP frequently asked by patients were evaluated by two ophthalmologists using a five-point Likert scale, with scores ranging from 1 to 5. Additionally, 8 readability indices, including the Average Reading Level Consensus Calculator (ARLC), Automated Readability Index (ARI), Flesch Reading Ease (FRE), Gunning Fog Index (GFOG), Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CL), Simple Measure of Gobbledygook (SMOG), and FORCAST Readability Formula (FRF), were calculated using an online calculator, Readabilityformulas.com, to assess the ease of comprehension of each answer.
    RESULTS: No significant difference in accuracy was found between standard and simplified AI chatbot responses (p = 0.557, p = 0.090). Almost all readability indices suggest that standard AI chatbot responses require a higher level of education for comprehension, whereas simplified responses require a lower level. Although Gemini-2.0 standard responses were more readable than ChatGPT-4 standard responses according to ARI, GFOG and FRF scores (p = 0.014, p = 0.040, and p = 0.001), Gemini-2.0 simplified responses were more readable than ChatGPT-4 simplified responses solely according to FRF scores (p = 0.016).
    CONCLUSIONS: This study shows that ChatGPT-4 and Gemini-2.0 can provide patients with an avenue to access comprehensive and accurate information about RP, tailored to their educational level.
    Keywords:  Artificial intelligence; ChatGPT-4; Gemini-2.0; Readability; Retinitis pigmentosa
    DOI:  https://doi.org/10.1186/s40942-025-00772-4
  15. J Voice. 2025 Nov 21. pii: S0892-1997(25)00466-7. [Epub ahead of print]
       OBJECTIVES/HYPOTHESIS: The objective of this study is to evaluate ChatGPT's responses in addressing common inquiries about voice disorders across two time slots.
    METHODS: In this exploratory study, 30 frequently asked questions about voice disorders were gathered from a licensed clinical speech-language pathologist specialized in voice disorders and from reputable online patient education sources. These questions were entered into GPT-4o mini at two different time slots (ie, November 2024 and April 2025), using a customized prompt that directed the model to act as a specialized voice-assistance chatbot, referred to as "VoiceHelp." The authors conducted independent evaluations of ChatGPT's responses, focusing on accuracy, potential harm and its extent, alignment with medical consensus, empathy, and overall quality. The readability of the responses was assessed with the Flesch Reading Ease Score (FRES), Gunning Fog Scale Level (GFSL), and Dale-Chall Score (D-CS), as well as word count, sentence count, words per sentence, and characters per word.
    RESULTS: Most generated responses (91.7%) were free from inaccurate or inappropriate content; 92.5% were rated as harmless, and 80% were consistent with the consensus. Although 38.3% of responses lacked empathy, the majority (92.5%) were scored between acceptable and very good in overall quality. The average scores for FRES, GFSL, D-CS, word count, sentence count, words per sentence, and characters per word at time slot 1 were 37.08, 15.76, 10.05, 117.53, 6.90, 17.48, and 5.47, respectively, indicating a high level of reading complexity for a general audience. The corresponding scores at time slot 2 were 45.09, 15.04, 9.26, 266.20, 13.6, 20.23, and 5.14, respectively.
    CONCLUSIONS: ChatGPT consistently provided accurate and informative responses to common questions on voice disorders; however, the readability level of its responses was relatively low for the general public. This limitation appeared to be improved in the more recent version of the model. Further research is warranted before recommending ChatGPT as a reliable source of medical information for voice-disordered patients.
    Keywords:  ChatGPT; Dysphonia; Patient education; Voice; Voice disorders
    DOI:  https://doi.org/10.1016/j.jvoice.2025.10.044
  16. Aesthet Surg J. 2025 Nov 26. pii: sjaf249. [Epub ahead of print]
       BACKGROUND: Concerns regarding information inaccuracy when using general-purpose large language models have prompted the quest for alternative tools. OpenEvidence has emerged as a healthcare-focused large language model trained exclusively on data from peer-reviewed medical literature.
    OBJECTIVES: This study compared the quality, accuracy, and readability of aesthetic surgery patient education materials generated by OpenEvidence and ChatGPT.
    METHODS: A standardized prompt requesting comprehensive postoperative discharge instructions for twenty of the most common aesthetic surgery procedures was entered into OpenEvidence and ChatGPT-5. Outputs were evaluated using four validated assessment tools: the DISCERN instrument for information quality (1-5), the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) for information understandability and actionability (0-100), the Flesch-Kincaid scale for estimated grade level (fifth grade to professional level) and reading ease (0-100), and a Likert scale for citation accuracy (1-4).
    RESULTS: OpenEvidence scored significantly higher than ChatGPT-5 in DISCERN (3.3 ± 0.4 vs. 1.7 ± 0.4, p<0.001) and the citation accuracy scale (2.4 ± 1.3 vs. 1.5 ± 0.7, p=0.007). Scores were comparable among both tools in PEMAT-P understandability (71 ± 5 vs. 69 ± 0, p=0.3) and actionability (52 ± 12 vs. 54 ± 5, p=0.6), as well as on the Flesch-Kincaid Grade Level (9.3 ± 1.0 vs. 9.2 ± 0.6, p=0.8) and the Flesch Reading Ease Score (40.0 ± 6.6 vs. 41.0 ± 5.5, p=0.6).
    CONCLUSIONS: OpenEvidence generated materials of significantly higher quality and reliability than ChatGPT, suggesting it may serve as a more reliable alternative for patient education in aesthetic surgery practice.
    DOI:  https://doi.org/10.1093/asj/sjaf249
  17. J AAPOS. 2025 Nov 20. pii: S1091-8531(25)00591-9. [Epub ahead of print] 104693
     BACKGROUND: Parental health literacy significantly affects pediatric ophthalmology follow-up care and adherence to treatment regimens. Yet patient education materials (PEMs) often exceed the American Medical Association's recommended 6th-grade reading level. Large language models (LLMs) can improve the readability of PEMs without sacrificing quality. This study evaluated the baseline readability, quality, and accuracy of PEMs from the American Association for Pediatric Ophthalmology and Strabismus (AAPOS) and assessed how LLMs may improve these PEMs.
    METHODS: This cross-sectional study analyzed 111 PEMs from the AAPOS website. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL) and Simple Measure of Gobbledygook (SMOG). Quality and understandability were evaluated using the DISCERN and the Patient Education Materials Assessment Tool (PEMAT), respectively. Accuracy was assessed using the Likert misinformation scale. Each PEM was separately rewritten by ChatGPT-4 and Gemini Advanced after initial analysis. Changes were analyzed.
    RESULTS: Baseline PEMs were written on average at a 9th-grade reading level (SMOG, 9.0 ± 1.6; FKGL, 9.6 ± 2.1), with only 3.6% meeting the 6th-grade recommendation. ChatGPT-4 rewrites improved readability to a 7th-grade level without compromising quality, while Gemini Advanced rewrites met the 6th-grade threshold but showed modestly reduced quality (DISCERN: 3; P < 0.001). Both models enhanced understandability (ChatGPT-4, 90.9%; Gemini Advanced, 91.3%; [P < 0.001]), and their rewrites contained no misinformation (Likert = 1).
    CONCLUSIONS: AAPOS PEMs were high in quality and accurate at baseline, but written at a high school level. As supplemental tools, LLMs can improve PEMs' readability and understandability. PEMs should be thoroughly reviewed by physicians to ensure optimal safety and education.
    DOI:  https://doi.org/10.1016/j.jaapos.2025.104693
  18. Children (Basel). 2025 Oct 23. pii: 1432. [Epub ahead of print]12(11):
    Background: The vast majority of patients seeking information on allergic conditions use the internet as a source of health information. The aim of our study was to assess the quality of patient information on allergic rhinitis available on the internet. Methods: Three hundred websites, found through the most widely used search engines, were evaluated using the modified Ensuring Quality Information for Patients (EQIP) instrument. Results: Eighty-five websites were assessed after the exclusion of duplicates and websites in languages other than English. Websites that scored higher than 21 (above the 75th percentile) were categorized as high-score sites. Websites developed by health professionals tended to have higher scores. The EQIP scores of the websites ranged between 5 and 26 out of a total of 34 points, with a median of 16.5 points. Conclusions: The quality of patient information on allergic rhinitis on the internet is poor, and existing websites present insufficient information.
    Keywords:  EQIP tool; allergic rhinitis; children; digital health; health literacy; internet; patient information; quality of life
    DOI:  https://doi.org/10.3390/children12111432
  19. ANZ J Surg. 2025 Nov 28.
     BACKGROUND: Total mesorectal excision (TME) for rectal cancer is associated with significant morbidity. The 'Watch and Wait' approach, following a complete clinical response (cCR) to total neoadjuvant therapy (TNT), is increasingly considered an alternative to surgery. The Internet offers extensive resources, but the quality of those relating to 'Watch and Wait' is unknown. This study evaluates the availability and quality of online resources on 'Watch and Wait' as an option for patients with rectal cancer.
    METHODS: A Google search was conducted using the phrases 'patient information watch and wait rectal cancer' and 'patient information non-operative management rectal cancer'. The first 50 results of each search were assessed. Relevant sites meeting inclusion criteria were evaluated using the DISCERN instrument, which rates the quality of published health information on treatment choices.
    RESULTS: Of 100 sites reviewed, three were duplicates. Fourteen sites provided dedicated patient-oriented information. Among non-dedicated sites, there were 63 scientific articles, 7 blogs, 6 resources for surgeons, 3 medical news articles, 2 videos, and 2 blocked sites. Of the 14 dedicated websites, 5 (35.7%) had been updated within the last 2 years; 8 (57.1%) were associated with hospitals and clinics, and 6 (42.9%) with government or non-profit organizations. Most sites detailed the benefits of non-operative management, but 10 (71.4%) omitted uncertainties or risks. Only two (14.3%) were deemed 'high-quality' by DISCERN criteria.
    CONCLUSION: Online patient resources on 'Watch and Wait' for rectal cancer are limited and often of poor quality. High-quality websites should be identified and recommended to patients wishing to seek further information on this topic.
    Keywords:  information; internet; patient orientated; rectal cancer; watch and wait
    DOI:  https://doi.org/10.1111/ans.70392
  20. J Clin Med. 2025 Nov 11. pii: 7990. [Epub ahead of print]14(22):
      Background: Hidradenitis suppurativa (HS) is a chronic inflammatory disorder characterized by recurrent nodules, abscesses, and sinus tracts in apocrine gland-bearing areas. Surgery plays a key role in moderate-to-severe disease. As patients increasingly rely on the internet for decision-making, the quality of online information on HS surgery requires critical evaluation. Previous studies have shown poor quality and limited coverage of surgical aspects. This study systematically assesses publicly available websites on the surgical and reconstructive management of HS, quantifies their quality using the modified Ensuring Quality Information for Patients (mEQIP) tool, and identifies areas needing improvement to support informed decisions. Methods: Google, Bing, and Yahoo were searched using five HS surgery-related keywords. The first 50 results per keyword and engine were collected (n = 750), and 214 websites met the inclusion criteria. Sites were categorized by provenance (practitioners, hospitals, healthcare portals, professional societies, encyclopedias) and assessed using the 36-item mEQIP checklist. High quality was defined as ≥23/36 (75th percentile). Comparisons were made by publication era (pre-/post-COVID-19) and source type. Results: The mean mEQIP score was 21.7; only 51 websites (23.8%) met the high-quality threshold. No significant difference emerged between pre- and post-COVID publications. Healthcare portals scored highest (22.8), followed by practitioners (21.5) and hospital sites (21.2); professional societies (19.7) and encyclopedias (17.3) performed worst. Major deficiencies included limited discussion of surgical risks, quality-of-life outcomes, and postoperative care. Conclusions: Online resources on HS surgery are frequently incomplete and omit essential details on risks, recurrence, and reconstructive options. 
Surgeons should direct patients toward vetted sources, and professional societies should develop accessible, evidence-based patient guidelines.
    Keywords:  hidradenitis suppurativa excision; hidradenitis suppurativa reconstructive techniques; hidradenitis suppurativa surgery; hidradenitis suppurativa surgical management; hidradenitis suppurativa surgical treatment
    DOI:  https://doi.org/10.3390/jcm14227990
  21. Rural Remote Health. 2025 Nov;25(4): 9579
       INTRODUCTION: To date, there has been no evaluation of the quality of agricultural safety websites. The aim of this study was to evaluate the quality of agricultural safety websites through the assessment of content, accountability and readability.
    METHODS: An internet search of sites was conducted using Google and the terms 'agri* safety', 'farm* safety' and 'farm* injury prevention'. Content was assessed using a standardised checklist evaluating the number of hazards addressed (completeness) and the average number of recommended control levels from the Hierarchy of Risk Controls (accuracy). Accountability was assessed through the presence of JAMA benchmarks. Readability was assessed using validated scoring systems, including Simple Measure of Gobbledygook, Flesch Reading Ease Score and Flesch-Kincaid Grade Level.
    RESULTS: Of the 13 websites included in the analysis, six were categorised as government websites, four as non-profit organisations and three as professional websites. Government website content scored higher in completeness and accuracy in comparison to other website categories. Motorcycles and water bodies were infrequently addressed. The assessment of accountability revealed that most websites (69%) did not attribute their recommendations. When using Flesch-Kincaid Grade Level scoring, only two websites (15%) met the recommendation of below grade 8 equivalent readability.
    CONCLUSION: There is opportunity to improve the quality of agricultural safety websites. Recommendations involve addressing more hazards and improving the use of the Hierarchy of Risk Controls, in addition to increasing the attribution of recommendations and overall readability. Further research should evaluate other potential sources of information for farmers, such as online videos.
    Keywords:   farming; information dissemination; internet; occupational exposure; safety; agriculture
    DOI:  https://doi.org/10.22605/RRH9579
  22. Foot Ankle Spec. 2025 Nov 25. 19386400251388030
    Background: Patients frequently turn to online resources to better understand their diagnoses and treatment options. While prior research has shown that orthopaedic patient education materials (PEMs) often exceed the recommended reading level, little is known about the readability of information specifically related to foot and ankle conditions. This study aimed to assess whether online PEMs for Hallux Rigidus are written at or below the recommended sixth-grade reading level. We hypothesized that the readability of these materials would exceed this threshold. Methods: A Google search was conducted using the terms "Hallux Rigidus patient information" and "Big Toe Arthritis patient information." The first 25 websites from each search were analyzed. The Readability Scoring System Plus was used to assess the Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index, Automated Readability Index, and Linsear Write scores. Descriptive statistics were reported. Results: For "Hallux Rigidus patient information" and "Big Toe Arthritis patient information" respectively, the results were: Average Reading Level 10.2 (±1.7) and 10.3 (±1.9), Flesch-Kincaid Reading Ease score 9.95 (±2.3) and 10.1 (±2.7), Gunning Fog score 54.6 (±10.9) and 55.6 (±11.5), Flesch-Kincaid Grade Level 11.4 (±1.7) and 11.2 (±1.8), Coleman-Liau 9.41 (1.9) and 9.39 (2.3), SMOG 11.1 (2.0) and 11.1 (2.0), Automated Readability Index 9.17 (1.5) and 9.28 (1.7), and Linsear Write 74.5 (7.8) and 74.4 (7.9). None of the analyzed PEMs met the recommended sixth-grade reading level. Of the 50 websites reviewed, 11 provided general health information. No significant difference was found in readability between clinical practice patient information and general health information. Conclusion: Accessible and appropriately written PEMs are crucial for improving health literacy and ensuring patients understand their diagnoses and treatment options. Our findings indicate that online information about Hallux Rigidus is written at a level too advanced for the average patient, highlighting the need for more readable resources. Level of Evidence: IV.
    Keywords:  Hallux Rigidus; online health information; patient education material; readability
    DOI:  https://doi.org/10.1177/19386400251388030
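Many of the entries in this issue report Flesch-Kincaid, Flesch Reading Ease, and SMOG scores against the recommended sixth-grade threshold. For reference, these are the standard published formulas, computed from word, sentence, and syllable counts. A minimal sketch (function names are illustrative; syllable and polysyllable counting is assumed to be done upstream, which is where tools in practice differ):

```python
import math

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (0-100; higher scores mean easier text)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade: 3.1291 + 1.0430*sqrt(30 * polysyllables/sentences)."""
    return 3.1291 + 1.0430 * math.sqrt(30 * polysyllables / sentences)
```

For example, a text of 100 words in 10 sentences with 130 syllables scores a Flesch-Kincaid Grade Level of 3.65 and a Reading Ease of 86.705, comfortably within the sixth-grade recommendation; the higher grade levels reported above correspond to longer sentences and more polysyllabic vocabulary.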
  23. J Vasc Interv Radiol. 2025 Nov 26. pii: S1051-0443(25)00767-5. [Epub ahead of print] 107927
     PURPOSE: This retrospective, cross-sectional study assesses whether the readability of Spanish- and English-language online uterine artery embolization (UAE) patient education resources significantly exceeds the recommended 6th-grade reading level, and examines potential disparities in readability and quality between the two languages to inform efforts to mitigate disparities in fibroid care utilization.
    MATERIALS AND METHODS: The first 100 Google search results for Spanish and English search terms for UAE were collected and website content was extracted. Content readability scores were calculated using three validated metrics for Spanish and six for English. Credibility was measured using JAMA criteria. Website technical quality was assessed using WooRank, a website performance algorithm. Mann-Whitney U testing was used for comparisons of readability. Differences in content length, credibility, and technical quality were interpreted using descriptive statistics.
    RESULTS: Sixty-seven Spanish-language and 53 English-language websites were included in the analysis. The readability of Spanish-language sites ranged from 9th-10th grade on two scales and was categorized as "slightly difficult" on all readability scales by a Spanish-speaking researcher. The mean readability of English-language content ranged from 11th grade to college level on all readability scales. Adherence to JAMA benchmark criteria was poor, with mean scores of 1.10 and 1.23 on a 4-point scale for Spanish and English websites, respectively. The mean WooRank scores for Spanish and English sites were 53.99 and 61.7, respectively, indicating moderate performance.
    CONCLUSION: Spanish and English-language UAE online patient education information is limited in quality and written above the recommended 6th grade level.
    DOI:  https://doi.org/10.1016/j.jvir.2025.107927
  24. Digit Health. 2025 Jan-Dec;11: 20552076251393294
     Background: The study aimed to evaluate the readability of COVID-19 health science information released by the Provincial Health Commissions of China and to analyze problems with current health information webpages, in order to suggest directions for developing highly readable health information related to public health events.
    Methods: The main research data were obtained from health articles on COVID-19 published on the websites of 27 Provincial Health Commissions of China. Two researchers independently evaluated the readability of webpages using an Online Health Information Readability Evaluation Tool across seven dimensions. In addition, we applied linguistic analysis techniques to assess readability based on textual features.
    Results: Readability scores ranged from 0.49 to 0.98 for 92 included articles, with an average of 0.78 ± 0.10. Among the evaluation dimensions, organizational layout obtained the highest score, whereas visual assistance received the lowest score. No statistically significant differences were observed in readability level across different regions or information topics (p > 0.05). For text factors of readability, the mean value of difficulty coefficient (scored on a 1-200 scale, where a value greater than 30 indicated relatively high reading difficulty) was 63.56 ± 10.79, suggesting the articles were challenging to read. The correlation coefficient between the webpage readability score and the text difficulty coefficient was -0.535 (p < 0.001).
    Conclusions: The overall readability of the COVID-19 health information published by China's Provincial Health Commissions is acceptable, but measures to improve public understanding are still needed. Health information producers should better match the general public's literacy level by using shorter sentences, providing specific behavioral guidance with clear visual aids, and adopting question-and-answer formats, which will help enhance the dissemination of COVID-19 health information in China.
    Keywords:  COVID-19; coronavirus; health department; online health information; public health; readability
    DOI:  https://doi.org/10.1177/20552076251393294
  25. Oral Surg Oral Med Oral Pathol Oral Radiol. 2025 Nov 01. pii: S2212-4403(25)01277-5. [Epub ahead of print]
       BACKGROUND: Oral potentially malignant disorders (OPMDs) are abnormalities with an elevated risk of transforming into oral squamous cell carcinoma (OSCC). Despite the visibility of early lesions, OSCC is frequently diagnosed at advanced stages due to low public awareness. YouTube offers potential for oral health education, but the reliability of OPMD-related content remains uncertain.
    METHODS: This study systematically assessed the content, reliability, and educational value of 100 English-language YouTube videos on OPMDs and oral cancer. Two independent evaluators analyzed videos using a 6-domain content rubric, JAMA benchmarks, DISCERN, and the Global Quality Scale (GQS). Inter-observer agreement was calculated.
    RESULTS: Of the 100 videos screened, 36 met inclusion criteria. The mean view count was 6,741.9 ± 14,315.5, with limited engagement (interaction index = 1.9 ± 2.3). The highest content scores were observed for Definition and Classification (4.47 ± 0.63) and Diagnostic Approaches (4.07 ± 1.05), while Patient Education ranked lowest (2.57 ± 1.09). Overall video quality was moderate (GQS = 2.94 ± 0.75; DISCERN = 47.29 ± 15.55; JAMA = 2.16 ± 0.67; VIQI = 14.13 ± 2.13). Strong correlations existed among quality indices (ρ = 0.70-0.75, P < .001), but engagement did not correlate with quality.
    CONCLUSIONS: OPMD-related YouTube content frequently lacks comprehensive patient education, and credible sources are not always the most viewed. Developing accurate, accessible, and engaging educational videos remains essential.
    DOI:  https://doi.org/10.1016/j.oooo.2025.10.017
  26. Aust Endod J. 2025 Nov 25.
    This study aimed to evaluate the quality, reliability, and educational value of YouTube videos related to irrigation activation in endodontics. A total of 70 videos were analyzed using the keyword 'irrigation activation methods'. Videos were assessed using JAMA, DISCERN and GQS. Additional popularity metrics were recorded. Statistical analyses were performed using the Kruskal-Wallis and Spearman correlation tests (p < 0.05). The majority of videos demonstrated low content quality (68.6%, GQS ≤ 2) and weak accuracy (47%, JAMA ≤ 1). Longer videos were associated with higher quality scores, while popularity metrics did not show significant correlations with educational quality. A weak but statistically significant correlation was found between DISCERN and II (r = 0.261, p = 0.029). Positive correlations among JAMA, DISCERN, and GQS confirmed the consistency of these scales. Overall, the findings indicate that the quality of YouTube videos on irrigation activation varies considerably. While video length may enhance content quality by allowing more comprehensive explanations, popularity indicators are not reliable measures of educational accuracy.
    DOI:  https://doi.org/10.1111/aej.70041
  27. Arthroplast Today. 2025 Dec;36: 101891
       Background: Optimal perioperative nutrition affects outcomes in total joint arthroplasty (TJA), with malnutrition linked to increased complications. While YouTube is a popular platform for patient education, the quality of videos on perioperative nutrition is unknown. This study evaluated the quality and educational value of videos using established and TJA-specific scoring systems.
    Methods: A systematic YouTube search with 11 arthroplasty and nutrition-related keywords was performed, excluding sponsored and non-English content. Two reviewers recorded view count, duration, upload age, health-system affiliation, and presenter credentials, then graded quality with Journal of the American Medical Association criteria, Global Quality Score, modified DISCERN, and a novel Joint Replacement Nutrition Score (JRNS). Interrater reliability was evaluated using intraclass correlation coefficients.
    Results: Of 98 videos identified, 43 met inclusion criteria. Mean view count was 34,751 (range, 2-470,475). Mean duration was 11.2 minutes (range, 0.5-51.4). Health system affiliation was present in 41.9% (18/43), and 32.6% (14/43) were authored by physicians. Quality scores were: Journal of the American Medical Association 2.77 (range, 1-4), Global Quality Score 3.07 (range, 1-5), modified DISCERN 2.83 (range, 1-4), and JRNS 4.64 (range, 0-11), with high interrater reliability (intraclass correlation coefficient range, 0.717-0.922). Quality did not differ by health system affiliation or presenter credentials.
    Conclusions: YouTube videos on TJA perioperative nutrition were generally low to moderate quality, omitting key topics like individualized nutrition, increased caloric needs, and evidence-based supplementation. The novel JRNS demonstrated interrater reliability and highlighted content gaps across available videos. Nonphysician professionals produced some of the most informative videos. The lack of correlation between view count and quality emphasizes the need for higher-quality content to reach patients. Interdisciplinary collaboration is needed to develop comprehensive educational resources on perioperative nutrition for TJA.
    Keywords:  Nutrition; Patient education; Perioperative nutrition; Total joint arthroplasty; YouTube
    DOI:  https://doi.org/10.1016/j.artd.2025.101891
  28. Clin Shoulder Elb. 2025 Sep;28(3): 306-316
       BACKGROUND: YouTube is a widely accessible platform that facilitates the rapid dissemination of both evidence-based and potentially misleading health-related information. This study assesses the educational quality, reliability, and comprehensiveness of the most-viewed YouTube videos about scapular dyskinesis.
    METHODS: A systematic search was conducted on YouTube using the keywords "scapular dyskinesia" and "scapular dyskinesis." The top 100 videos for each keyword were screened for inclusion, and the metrics, sources, and content of the included videos were analyzed. Video quality and reliability were assessed using the Global Quality Scale and the modified DISCERN scale, respectively. In addition, a newly developed, non-validated tool, Scapular Dyskinesis Specific Scoring, was used to provide a condition-specific content assessment.
    RESULTS: The analysis revealed that 48.1% of the videos were low quality, and 62.0% lacked reliability. Videos produced by health-related websites exhibited superior quality. Content focusing on treatment and diagnostic approaches demonstrated significantly higher quality than other content categories (P<0.001). A correlation analysis indicated that the Video Power Index did not correlate significantly with reliability, quality, or comprehensiveness scores. Additionally, a simple regression analysis revealed that the video upload time negatively affected the quality, reliability, and comprehensiveness metrics.
    CONCLUSIONS: Most YouTube videos on scapular dyskinesis were of low quality, lacked reliability, and failed to provide comprehensive and accurate information. Furthermore, high-quality and reliable content tended to receive relatively low engagement and user preference scores. These findings underscore the urgent need for well-structured, evidence-based, and regularly updated YouTube content about scapular dyskinesis. Level of evidence: IV.
    Keywords:   Patient education; Quality; Reliability; Scapular dyskinesis; YouTube
    DOI:  https://doi.org/10.5397/cise.2025.00346
  29. Phys Sportsmed. 2025 Nov 27.
       OBJECTIVES: The objective of this study was to evaluate the content, reliability and quality of YouTube® videos related to dynamic balance exercise training.
    METHODS: 'Dynamic balance exercises' was searched on YouTube in English in August 2025, and a total of 91 videos were watched. The videos were categorized based on their content features and source of upload. The reliability of the information was assessed using the modified DISCERN (mDISCERN) tool, while video quality was evaluated through the Global Quality Scale (GQS) and the JAMA benchmark criteria. Two physiotherapists with expertise in sports rehabilitation independently reviewed each video. In cases of discrepancy, a third independent evaluator provided the final judgment to ensure objectivity. (ClinicalTrials.gov Identifier NCT07117734).
    RESULTS: The findings indicate that among the 91 exercise videos focusing on dynamic balance exercises, 69 (76%) were classified as useful, while 22 (24%) contained inaccurate information. mDISCERN, GQS and JAMA scores exhibited statistically significant differences based on the source of the video (p = 0.001, p = 0.001 and p = 0.001 respectively). It was observed that videos uploaded by healthcare providers demonstrated greater quality and reliability. Additionally, the linear regression analysis revealed no significant associations between the GQS, mDISCERN, and JAMA scores and the Video Power Index (VPI). Inter-rater reliability, assessed using Cohen's kappa, showed moderate agreement for mDISCERN (0.503), GQS (0.549), and JAMA (0.528).
    CONCLUSION: While the majority of videos were useful, a portion still contained misleading information. Commonly used metrics such as VPI and view ratio do not necessarily reflect content accuracy. Therefore, paying attention to the credentials or professional background of video creators may help users access higher-quality and more reliable content.
    Keywords:  Exercise; digital health; quality; reliability; youtube
    DOI:  https://doi.org/10.1080/00913847.2025.2597183
  30. J Cancer Educ. 2025 Nov 27.
    Transurethral bladder tumor resection (TURBT) is the standard initial surgical procedure for non-muscle-invasive bladder cancer and a core competency for urologists. Online surgical preparation increasingly relies on YouTube, but the educational quality of TURBT videos has not been systematically assessed. We performed a systematic YouTube search (March 1, 2025) using "transurethral resection of bladder tumor," "bladder tumor removal," and "TURBT." Standard-format procedure videos of 2-30 minutes were included; nonprocedural, promotional, and Shorts content was excluded. Video characteristics, technical details, and source (academic vs. individual urologist) were recorded; all videos were independently reviewed by 2 urologists. Educational quality was evaluated using the Global Quality Score (GQS; 1-5) and a 12-item TURBT Checklist Score (TURBT-CS) covering perioperative elements. Thirty-six videos met criteria (10 academic [27.8%]). Half were 1080p and half had audio narration, but subtitles were uncommon (13.9%). Academic videos achieved higher GQS and TURBT-CS (p = 0.001) and greater engagement (views, likes; VPI p = 0.034) yet lower image-quality scores (p = 0.003); audio narration was more frequent (p = 0.003); no significant differences were seen in duration, upload time, subtitles, visual aids, energy source, or resection technique (all p > 0.05). GQS and TURBT-CS correlated positively with engagement metrics (all p < 0.05); interobserver agreement was excellent (ICC 0.857 TURBT-CS; 0.853 GQS; both p < 0.001). The educational quality of YouTube TURBT videos is heterogeneous; academically sourced content is generally superior and more engaging, though technical image quality may lag. Standardized, validated checklists and increased academic oversight could enhance the educational utility of open-access surgical videos.
    Keywords:  Bladder tumor removal; Cancer education; Surgical education; Transurethral resection of bladder tumor (TURBT); YouTube videos
    DOI:  https://doi.org/10.1007/s13187-025-02790-0
  31. J Maxillofac Oral Surg. 2025 Dec;24(6): 1785-1792
       Objectives: To evaluate the reliability and quality of YouTube videos focused on coronectomy using DISCERN, Video Information and Quality Index (VIQI), and Global Quality Scale (GQS) tools.
    Study design: Two reviewers independently identified 53 videos for final analysis and classified based on their quality and content relevance using DISCERN, VIQI, and GQS.
    Results: Most videos were targeted at the general public (81.1%) and uploaded by non-profit organizations or professional doctors (45.3%). The average DISCERN score was 37.31 ± 8.95 (poor quality), and the average VIQI score was 9.92 ± 3.29 (fair quality). Longer videos showed higher DISCERN scores (P = 0.036) and received more engagement (likes: P = 0.024; comments: P < 0.001). Non-patient uploaders produced videos with higher VIQI scores (P < 0.001), whereas patient-uploaded videos generated more comments (P = 0.001). Higher content quality was correlated with more likes (p = 0.013).
    Conclusions: Coronectomy-related YouTube videos generally lack sufficient information and reliability. Videos by professionals were higher in quality, whereas those by patients generated more engagement. High-content videos showed superior quality and popularity, but were limited in number. Dental professionals need to improve accessible educational content, influencing patient acceptance of this surgical alternative.
    Keywords:  Coronectomy; Education of patients; Mandibular third molar; Social media; Video recording
    DOI:  https://doi.org/10.1007/s12663-025-02749-0
  32. Front Digit Health. 2025;7: 1623247
     Background: The proliferation of short video platforms has transformed public health communication, yet the quality of medical information shared on these platforms remains inconsistent. Osteoarthritis (OA), a prevalent and burdensome chronic condition, is frequently featured in online health content. However, the reliability of such information has not been systematically evaluated across major Chinese short video platforms. This study aimed to assess and compare the quality and reliability of OA-related health information on TikTok and Bilibili, and to examine the influence of uploader type and user engagement metrics on content quality.
    Methods: In this cross-sectional study, a total of 189 OA-related videos were collected from TikTok (n = 96) and Bilibili (n = 93) using a standardized search strategy. Four validated instruments-the Journal of the American Medical Association (JAMA) benchmarks, modified DISCERN (mDISCERN), Global Quality Score (GQS), and Health on the Net Code (HONcode)-were used for video assessment. Each video was independently rated by two trained reviewers. Differences in quality scores were compared across platforms and uploader types (health professionals vs. non-professionals). Spearman correlation analysis was conducted to explore associations between video quality and engagement metrics (likes, comments, shares, favorites).
    Results: TikTok videos exhibited significantly higher median scores on JAMA (2.4 vs. 2.1, P = 0.001), GQS (3.0 vs. 3.0, P = 0.006), and HONcode (11.0 vs. 9.3, P = 0.005) compared to Bilibili. No significant difference was observed for mDISCERN scores. Videos uploaded by healthcare professionals had significantly higher GQS (P = 0.004) and HONcode scores (P = 0.010) than those from non-professionals. User engagement metrics were positively correlated with content quality, particularly on TikTok (e.g., likes vs. JAMA, r = 0.732, P < 0.001).
    Conclusions: OA-related videos on TikTok demonstrate higher overall quality and reliability compared to Bilibili, especially when created by healthcare professionals. User engagement metrics are positively associated with information quality, underscoring the importance of expert-led digital health communication. These findings highlight the need for platform-level interventions to promote trustworthy content and improve the digital health information ecosystem.
    Keywords:  Bilibili; TikTok; health communication; osteoarthritis; patient education; quality of health information; social media
    DOI:  https://doi.org/10.3389/fdgth.2025.1623247
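The Spearman correlation analysis described in the methods above (engagement metrics vs. quality scores) can be sketched in a few lines. This is an illustrative, self-contained implementation, not the authors' code; the `likes` and `jama` values below are invented for demonstration. Spearman's rho is the Pearson correlation of the two rank vectors, with tied values receiving the average of their positions:

```python
from statistics import mean

def rankdata(values):
    """Assign 1-based ranks; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-video like counts and JAMA-style quality scores.
likes = [120, 4500, 300, 9800, 50, 2100]
jama = [1.5, 3.0, 2.0, 3.5, 1.0, 2.5]
rho = spearman_rho(likes, jama)  # rho is approximately 1.0: ranks coincide
```

Because the correlation is computed on ranks, it captures any monotone engagement-quality relation (like the reported likes-vs-JAMA r = 0.732) without assuming linearity.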
  33. Gastroenterol Nurs. 2025 Nov-Dec 01;48(6): 436-443
      Social media is constantly evolving. This study assessed 150 TikTok videos related to colorectal cancer (CRC) using the hashtags #colonoscopy, #coloncancer, and #coloncancerawareness, measured video quality and engagement metrics, and examined the potential effect of influencers. Two independent raters used the Global Quality Scale (GQS) tool to rate video quality. The study found that videos posted by healthcare professionals (M = 2.89, SD = 1.19) had significantly higher GQS scores (p < .001) than those from personal content creators (M = 1.79, SD = 0.90). Videos with music, including popular trending sounds, had significantly more views than videos with dialogue alone (p = .05). This study sheds light on the substantial role of influencers in increasing engagement. The hashtags #coloncancer and #coloncancerawareness were more likely to be used after a young influencer raised awareness and later died from colon cancer; the relationship was significant, χ2(2) = 6.84, p = .033, and the associated videos had a statistically significantly higher average GQS rating (p = .012). Future research should examine TikTok and influencer collaborations as effective strategies for raising awareness of CRC among young, diverse minority populations at heightened risk, ensuring communication efforts resonate with these communities.
    DOI:  https://doi.org/10.1097/SGA.0000000000000908
  34. Digit Health. 2025 Jan-Dec;11: 20552076251398464
       Background: Alzheimer's disease (AD) poses a significant public health challenge to China's aging population. Patients and their families increasingly turn to short-video platforms such as Douyin and Bilibili for information. However, there is currently a lack of systematic analysis regarding the quality and reliability of advertising content on these platforms, creating a critical gap in understanding this emerging information ecosystem.
    Aim: To systematically evaluate the quality and reliability of AD-related videos on Douyin and Bilibili, and to analyze the relationship between content themes, upload sources, and user engagement metrics.
    Methods: Using "Alzheimer's disease" as the keyword, we retrieved the top 100 videos from each platform. Videos were categorized by uploader type and content. Two qualified researchers assessed their reliability and quality using the JAMA benchmarks, the modified DISCERN instrument (mDISCERN), and the Global Quality Score (GQS). Data analysis employed nonparametric statistical methods, and correlation and ordinal logistic regression analyses were applied to examine factors that may influence video quality.
    Results: This study analyzed a total of 171 videos. Results indicate that compared to Douyin, videos on the Bilibili platform scored higher across multiple quality evaluation metrics (GQS: 2.0 (1.0-2.0) vs. 1.0 (1.0-2.0); mDISCERN: 2.0 (2.0-2.0) vs. 2.0 (2.0-2.0); JAMA: 2.0 (1.0-2.0) vs. 1.0 (1.0-2.0); p < 0.001). This disparity may be attributed to Bilibili's longer video format, which allows for more in-depth content, and its user base that tends to favor detailed, knowledge-oriented media. Regarding uploader identity, videos posted by professionals (e.g. physicians) demonstrated superior quality compared to nonprofessional sources (e.g. patients). However, patient-uploaded videos exhibited stronger engagement metrics (e.g. likes, comments). Content-wise, videos focusing on disease prevention and treatment consistently achieved the highest overall quality (all comparisons p < 0.05). Correlation analysis indicated that while interaction metrics showed strong internal correlations, they did not significantly correlate with JAMA, mDISCERN, or GQS scores. Ordered logistic regression analysis indicates that uploader identity, content classification, and presentation format are the three key factors influencing video quality.
    Conclusion: This study reveals a pronounced "quality-dissemination paradox" in AD content across mainstream short-video platforms: while scientifically rigorous content published by medical professionals receives high quality ratings, it significantly underperforms in user engagement metrics compared to nonprofessional content centered on patient narratives and lived experiences. This highlights a severe disconnect between scientific rigor and public participation within algorithmic dissemination ecosystems. To address this, platforms should optimize algorithms to enhance the visibility of authoritative content, encourage collaboration between professional and nonprofessional creators to boost content appeal, and strengthen health media literacy education for the public, particularly older adults, to improve their ability to discern information.
    Keywords:  Alzheimer's disease; Bilibili; Public health; cross-sectional study; short videos
    DOI:  https://doi.org/10.1177/20552076251398464
  35. Front Public Health. 2025;13: 1652579
       Background: Chronic renal failure is projected to be one of the fastest-growing causes of death among non-communicable diseases by 2040. TikTok has emerged as a major platform for disseminating health-related videos. However, the reliability and quality of Chinese videos related to chronic renal failure on TikTok remain unclear.
    Methods: We systematically searched and screened videos related to chronic renal failure from the Chinese version of TikTok. Two independent raters assessed the reliability and quality of the videos using two validated evaluation tools: the DISCERN instrument and the Global Quality Score (GQS). Moreover, the correlation between the reliability and quality of the videos and their characteristics (duration, likes, comments, shares, and number of followers) was further investigated.
    Results: After searching and screening, a total of 78 eligible videos were ultimately included for analysis. According to their sources, 94.87% were uploaded by medical professionals. The median DISCERN and GQS scores were 39 (IQR 37-46.25) and 3 (IQR 2.75-4), respectively, indicating that videos related to chronic renal failure on TikTok were unreliable and of mediocre quality, falling mainly in the poor (42.31%) and moderate (44.87%) categories. The reliability and quality of the videos were positively correlated with video duration (r = 0.384, p = 0.001; r = 0.469, p < 0.01) and showed no statistically significant correlation with popularity or number of followers. Consequently, given their unreliability and low quality, these Chinese videos related to chronic renal failure on TikTok cannot provide patients with accurate assessments and are unsuitable as a source of medical knowledge.
    Keywords:  DISCERN tool; Global Quality Score; TikTok; chronic renal failure; content analysis; health information quality; online health information; video quality
    DOI:  https://doi.org/10.3389/fpubh.2025.1652579
  36. Can J Sch Psychol. 2025 Dec;40(3-4): 213-226
      Social media platforms such as Pinterest are a popular medium for locating and consuming health and mental health information, as well as educational resources to assist struggling learners. Despite parents and educators being frequent consumers of education-related information on Pinterest, no studies to date have explored the accuracy of intervention information for dyslexia on Pinterest-linked web pages, meaning that the extent to which it aligns with evidence-based practice and the science of reading is unclear. This study reviewed online information about interventions for dyslexia from 41 Pinterest-linked web pages to evaluate accountability, presentation, alignment with evidence-based practice, and readability using a set of standardized criteria. The quality of intervention information was generally poor, with websites meeting less than 10% of the standardized criteria. Further, most information was published by unspecified authors or authors without formal experience providing evidence-based interventions for dyslexia. Most sites also neglected to reference their sources or recommend follow-up with a professional. These findings suggest that psychologists should steer educators away from Pinterest as a resource and towards more reliable websites. Possibilities for future research and practical implications for school psychologists are discussed.
    Keywords:  Pinterest; dyslexia; evidence-based practice; intervention; learning disability
    DOI:  https://doi.org/10.1177/08295735251387494
  37. Psychooncology. 2025 Dec;34(12): e70342
       INTRODUCTION: Patients with cancer increasingly engage in online health information seeking (OHIS), yet the impact thereof on their anxiety and uncertainty remains unclear. This study aimed to: (1) examine how, when, and why patients engage in OHIS before and after oncological consultations; (2) identify patient characteristics (sociodemographic, medical, psychological) associated with OHIS; and (3) explore the relationship between OHIS, state anxiety, and uncertainty.
    METHODS: Patients with various cancer diagnoses and at various phases of care completed three self-report questionnaires: before (T0), directly after (T1), and 2 weeks after (T2) their outpatient consultation.
    RESULTS: Half (50%) of patients (n = 281) engaged in OHIS. Commonly sought topics included physical complaints (T0: 57%, T2: 51%), chances of recovery after treatment and life expectancy (T0: 48%, T2: 47%), and common treatments (T0: 43%, T2: 33%). A stronger monitoring coping style, higher levels of trait anxiety, higher educational levels, and early phase-of-care were significantly associated with OHIS (all p < 0.01). Age, gender, health literacy, or uncertainty intolerance were not associated with OHIS (all p > 0.05). Seekers reported more uncertainty than non-seekers (p < 0.001), but OHIS was not significantly associated with state anxiety (p = 0.642).
    CONCLUSION: One in two patients engaged in OHIS, particularly those who are recently diagnosed, highly educated, generally anxious or have a stronger monitoring coping style. Clinicians should not be concerned that patients' OHIS will increase patients' anxiety, as this study found no such association. As OHIS was associated with uncertainty, future research should explore whether addressing OHIS in consultations reduces uncertainty.
    Keywords:  anxiety; cancer; communication; consultation; longitudinal; oncology; online health information seeking; patient reported outcomes; uncertainty
    DOI:  https://doi.org/10.1002/pon.70342
  38. BMC Complement Med Ther. 2025 Nov 24. 25(1): 431
       BACKGROUND: Using a large cross-national dataset (N ≈ 23,000), this study investigates the relationship between several aspects of online health information-seeking and the use of traditional, complementary, and integrative medicine (TCIM), as well as the belief that TCIM is better than conventional medicine in Western societies. It also examines how perceptions of the internet as a valuable tool to guide health decisions and perceived reliability of online information relate to TCIM use and beliefs about its superiority.
    METHODS: Ordinal logistic regression models were used to assess the association between online health information-seeking behavior, perceived usefulness, and reliability of online health information, and two outcomes: TCIM use and belief in TCIM superiority over conventional medicine. Analyses were based on data from the 2021 ISSP module on Health and Healthcare, restricted to Western countries.
    RESULTS: Findings reveal a significant, graded association between more frequent online health information-seeking and both higher TCIM use and stronger belief that TCIM is better than conventional medicine. Those who perceived the internet as helpful in verifying doctors' advice or evaluating symptoms also had significantly higher odds of TCIM use and belief in its superiority. Notably, respondents expressing uncertainty about distinguishing reliable online health information showed the highest odds of TCIM use and belief in its superiority. Those agreeing it was difficult also had elevated odds, though less pronounced.
    CONCLUSIONS: This study reveals that online health information-seeking is significantly linked to TCIM adoption and belief in its superiority over conventional medicine, including among individuals who express uncertainty about the reliability of online information. We suggest that the profiles of internet-engaged complementary medicine users are not uniform and may consist of both astute seekers who make an independently informed choice to use TCIM, as well as vulnerable users, potentially overwhelmed by misinformation. This study highlights the need to integrate TCIM into institutional healthcare frameworks, develop legal standards for TCIM use, promote digital health literacy, and improve doctor-patient communication.
    Keywords:  Alternative medicine; Complementary and integrative; Digital health; Internet use; Online health Information-Seeking; TCIM; Traditional
    DOI:  https://doi.org/10.1186/s12906-025-05167-4
  39. medRxiv. 2025 Oct 15. pii: 2025.10.14.25338014. [Epub ahead of print]
       Background: People with disabilities (PWDs) face disparities in the healthcare system that lead to poorer health outcomes. A lack of health information and accessible communication with healthcare professionals is linked to these health inequities. PWDs report lower health literacy, lower technology use, and different access needs that limit effective health-related communication. Due to the broad spectrum of disabilities, the barriers PWDs face in accessing health information vary greatly.
    Methods: We conducted a secondary analysis of existing data from the nationally representative Health Information National Trends Survey (HINTS) conducted in 2024. We estimated adjusted odds ratios (aORs) using logistic regression models that included disability status, for five different cancer-related health information outcomes among US civilian, non-institutionalized adults (weighted n = 250,488,318).
    Results: Primary findings indicated that PWDs-especially those with multiple disabilities, chronic pain, or deafness-had disparities in their health information seeking experiences compared to people without disabilities. People with multiple disabilities had higher odds of reporting frustration and difficulty understanding health information, as well as not searching for health information in the first place.
    Conclusion: People with disabilities experience barriers to seeking health information, but these barriers differ by type of disability. This study is novel in its ability to compare different types of disabilities across different health information outcomes, but true disability representation is likely not possible due to inaccessible survey design. The findings in this study highlight the need for accessible health information, accessible surveys, and more interventions that include PWDs in public health programming.
    DOI:  https://doi.org/10.1101/2025.10.14.25338014
  40. Sensors (Basel). 2025 Nov 12;25(22). pii: 6899. [Epub ahead of print]
      Sports tracking produces large, unstructured trajectory datasets. The search and retrieval of interesting plays are essential parts of their analysis. Since annotations are sparse, similarity search remains the standard technique. It relies on learned lower-dimensional representations for its computational feasibility. Siamese networks learn dimensionality reduction from pairwise distances. However, complete training datasets are impractical to compute due to their combinatorial nature and the cost of distance calculations. Sub-sampling sacrifices representation quality for speed, leading to less meaningful search results. We propose the novel sampling technique Pairwise Diverse and Uncertain Gradient (PairDUG), which exploits the model's gradient signals to select representative and informative pairs for training. A broad experimental study implements the method for large-scale basketball and American football datasets. The results show that PairDUG at least halves the required compute time while maintaining, or even improving, retrieval quality, and outperforms other baseline methods. Furthermore, our evaluation shows that the selected pairs' gradient signals exhibit greater magnitude, diversity, and stability than those of any other method. This work represents a foundational contribution to pairwise distance learning. Hence, future work will transfer the method not only to other sports, such as soccer, but also to complex trajectory datasets outside the sports domain.
    Keywords:  active sampling; information retrieval; machine learning; position data; representation learning
    DOI:  https://doi.org/10.3390/s25226899
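The pairwise-distance learning setup that the abstract above builds on can be illustrated with a toy sketch. This is not the authors' implementation: it trains a shared ("Siamese") linear encoder so that distances between embeddings match the original pairwise distances, using the full set of pairs for simplicity; the data, dimensions, and learning rate are invented, and PairDUG's gradient-informed pair selection is only indicated in a comment.

```python
import math
import random

random.seed(0)

# Toy stand-ins for trajectory feature vectors: 12 "plays" in 4-D,
# to be embedded in 1-D by a shared (Siamese) linear encoder w.
X = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(12)]
w = [0.1, 0.1, 0.1, 0.1]

def dist(a, b):
    """Euclidean distance in the original feature space."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def embed(x):
    """Both branches of the Siamese pair share the same weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))

pairs = [(i, j) for i in range(12) for j in range(i + 1, 12)]
target = {(i, j): dist(X[i], X[j]) for i, j in pairs}  # pairwise ground truth

def mse():
    """How well embedding distances match the original distances."""
    return sum((abs(embed(X[i]) - embed(X[j])) - target[(i, j)]) ** 2
               for i, j in pairs) / len(pairs)

before = mse()
lr = 0.01
for _ in range(200):
    grad = [0.0] * 4
    # Full batch here for simplicity; a sub-sampling scheme like PairDUG
    # would instead select pairs with diverse, high-magnitude gradients.
    for i, j in pairs:
        d = embed(X[i]) - embed(X[j])
        err = abs(d) - target[(i, j)]        # per-pair loss = err ** 2
        sign = 1.0 if d >= 0 else -1.0
        for k in range(4):
            grad[k] += 2.0 * err * sign * (X[i][k] - X[j][k]) / len(pairs)
    w = [wk - lr * gk for wk, gk in zip(w, grad)]
after = mse()  # gradient descent shrinks the distance-matching error
```

The combinatorial cost the abstract mentions is visible even here: 12 items already yield 66 pairs, and the quadratic growth of `pairs` and `target` is exactly what makes informed sub-sampling necessary at scale.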