bims-librar Biomed News
on Biomedical librarianship
Issue of 2026–06–21
thirty-six papers selected by
Thomas Krichel, Open Library Society



  1. Sci Rep. 2026 Jun 17.
      Libraries and archives hold large collections of medieval manuscripts that are of cultural importance to the regions they serve. These collections are often highly studied for their written material as a snapshot of how life was at the time of writing. By studying this material, codicologists can estimate a moderately accurate date for the objects, but it requires significant time, expertise, and effort to achieve these dates. Limitations regarding sample collection and preparation prevent large-scale and robust scientific investigation of these culturally significant collections. Recent advancements in the scientific community around non-destructive and non-invasive analysis have helped open the door to study these collections. This study attempts to utilize External Reflectance Fourier Transform Infrared (ER-FTIR) spectroscopy as a tool for molecular decay (MD) dating. By working closely with codicologists and conservators, it is possible to build an MD dating tool by looking at the chemical differences in leather as it ages. With an estimated uncertainty of ± 66 years, this study aims to prove that it is possible to have an accurate dating tool that is fast, easy to interpret, and non-destructive/non-invasive to cultural heritage collections.
    DOI:  https://doi.org/10.1038/s41598-026-58309-0
  2. Campbell Syst Rev. 2026 Jun;22(2): 18911803261449731
      Prompt engineering is the formation of queries or instructions (prompts) that are deployed in large language models. These prompts are often underscored by frameworks, designed to give structure and encourage robust answers. Discussions in recent information specialists' networks and events have highlighted on multiple occasions that information specialists are well placed to undertake prompt engineering tasks. However, there is little published information outlining why and how information specialists are best placed for these tasks, and the universal understanding between information specialists has not filtered out to the wider research synthesis community, so progress in this area is slow. Here, we discuss the parallels between information specialist tasks and large language model engineering tasks and demonstrate that the parallels run deeper than just prompts. There are strong similarities between information retrieval and context engineering, prompt engineering and vibe coding. In the briefest sense, we can consider context engineering to be like a search platform, prompt engineering like a structured search strategy, and vibe coding like a search engine input. Knowledge sharing and dissemination of these core concepts amongst information specialists and research synthesists will drive methods development, particularly with the rise of large language models in synthesis automation, give potential for continual professional development courses and e-learning to be developed, and expand the roles of information specialists. To initiate progress in this area, we discuss the anticipated future direction of information specialist roles.
    Keywords:  artificial intelligence; evidence synthesis; information retrieval; prompt engineering
    DOI:  https://doi.org/10.1177/18911803261449731
  3. Res Synth Methods. 2026 Jun 17. 1-26
      This study investigated information retrieval of preprint records in the context of evidence synthesis work and compared 12 sources used to discover preprints. Identification of grey literature is often required or recommended in evidence synthesis guidance, and preprints are categorized as grey literature. The purpose of this work is to inform how and where to search for preprints to maximize coverage (through exploration of preprint server coverage across aggregators and databases) while balancing search efficiency. Authors selected aggregators and databases hosting two or more preprint servers and then tested search functionality and extracted characteristics and features. Authors analyzed and compared the selected sources, tabulated the number of essential features, and created comparison tables reflecting database and aggregator features. The study protocol was registered in Open Science Framework registries. Preprint aggregators and databases differ in their content coverage and in how well they support the design of a comprehensive and reproducible search strategy. Limitations such as character or word limits for queries, limited advanced search operators, and missing export functionality affect the usability of aggregators for evidence synthesis searches. Ongoing updates to search interfaces and functionality and differing approaches to versioning make it challenging to study discovery of preprints across sources. The recommendations and scenarios in this article will assist searchers engaged in evidence synthesis to make informed decisions about where to search for preprints.
    Keywords:  evidence synthesis; information retrieval; preprints; systematic searching
    DOI:  https://doi.org/10.1017/rsm.2026.10101
  4. JAMIA Open. 2026 Jun;9(3): ooag078
       Objectives: This study evaluated the quality and trustworthiness of large language model (LLM)-generated scientific and plain language summaries (PLS) from clinical oncology literature, focusing on faithfulness (absence of hallucinations), relevance, and readability.
    Materials and Methods: Ten LLM-generated scientific summaries and PLS from the INSIDE (artificial INtelligence to Support Informed DEcision making) prostate cancer dataset were used. For comparison, expert-written PLS from the BioLaySumm dataset were used. A panel of 5 LLMs and 3 human experts verified faithfulness. Verification was performed on original facts and facts modified with varying levels of error (subtle, moderate, contradictory). Readability was assessed using Flesch-Kincaid Reading Ease (FRE) scores.
    Results: Fact verification against the summaries was ∼100%, confirming accurate fact extraction. LLM panel vs human panel agreement was substantial (kappa 0.67), outperforming agreement among the interhuman (0.43 [95% CI, 0.34-0.52]) and inter-LLM (0.40 [0.38-0.42]) panels. Large language model scientific summaries showed high faithfulness (88.9% [88.0-89.8]) and low hallucinations (9.6% [6.5-12.7]) compared to human-written PLS (61.6% [60.1-63.1] faithfulness; 40.6% [37.8-43.4] hallucinations). The LLMs detected errors sensitively with scores decreasing as fact modifications became more severe. Finally, LLM-generated PLS were more readable than human-written versions (FRE 42.3 [interquartile range, IQR 35.27-49.41] vs 28.8 [IQR 21.02-36.18]).
    Discussion: A panel of LLMs reliably assessed the faithfulness of scientific summaries to their original source and thus can help increase reliability for clinical use. The lower faithfulness in human-written PLS likely reflects extrinsic hallucinations added for context.
    Conclusion: The study demonstrates a novel approach to automatically assess the quality and trustworthiness of LLM-generated scientific and PLS via faithfulness, relevance, and readability.
    Keywords:  artificial intelligence; hallucinations; large language models; oncology literature; scientific summaries
    DOI:  https://doi.org/10.1093/jamiaopen/ooag078
  5. BMC Oral Health. 2026 Jun 16.
       BACKGROUND: Artificial intelligence (AI) chatbots are increasingly used in healthcare to provide health-related information and answer patient questions. However, their reliability in specialized dental fields such as restorative dentistry remains insufficiently evaluated. This study aimed to evaluate and compare the accuracy and consistency of responses regarding dental bleaching generated by five artificial intelligence chatbot systems: ChatGPT-3.5 and ChatGPT-4 (OpenAI), Bing Chat (Microsoft), Gemini (Google), and Claude-Instant (Anthropic).
    METHODS: Fifteen frequently asked questions about dental bleaching, identified by restorative dentistry specialists, were categorized as undergraduate- or specialist-level questions. All questions were submitted to five artificial intelligence chatbots in both Turkish and English. Each question was asked three times per day over three consecutive days using standardized prompts. Responses were independently evaluated by two experts using a five-point Likert scale, and mean scores were calculated. A three-way ANOVA was conducted to assess the effects of chatbot type, knowledge level, and question language on response accuracy. Inter-rater agreement between evaluators was assessed using Cohen's kappa coefficient. Statistical significance was set at p < 0.05.
    RESULTS: Chatbot type had a significant effect on response accuracy (p < 0.001, η² = 0.405). ChatGPT-4 showed the highest accuracy, followed by Claude-Instant and GPT-3.5, whereas Gemini and especially Bing Chat demonstrated significantly lower performance (p < 0.001). Question language and knowledge level showed no significant main effects (p > 0.05). Significant interactions were observed between chatbot type and knowledge level and between chatbot type and language (p < 0.001). No significant differences were observed across days or time periods (p > 0.05).
    CONCLUSIONS: The accuracy of chatbot-generated information regarding dental bleaching depends strongly on the specific AI model used. Advanced large language models, particularly ChatGPT-4, generate more accurate and consistent responses than other evaluated systems. AI chatbots should therefore not be considered interchangeable sources of clinical information, and their outputs should be interpreted cautiously and verified with professional guidance.
    CLINICAL RELEVANCE: These findings highlight the importance of critically evaluating AI-generated health information and emphasize that chatbot responses should not replace professional clinical consultation.
    Keywords:  Cross-lingual evaluation; Health information reliability; Large language models; Multilingual natural language processing
    DOI:  https://doi.org/10.1186/s12903-026-08852-z
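The inter-rater agreement step named in the Methods of item 5 can be illustrated with a minimal sketch of Cohen's kappa for two raters. The abstract does not say whether an unweighted or weighted kappa was used, so the unweighted form here is an assumption; the function name `cohens_kappa` and the ratings are invented for illustration and are not data from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same score.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)

# Invented five-point Likert scores for eight responses (illustration only).
expert_1 = [5, 4, 4, 3, 5, 2, 4, 5]
expert_2 = [5, 4, 3, 3, 5, 2, 4, 4]
print(round(cohens_kappa(expert_1, expert_2), 3))
```

For ordinal Likert data a weighted kappa is often preferred because it gives partial credit for near-misses; the unweighted version above treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement.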
  6. Appl Clin Inform. 2026 Jun 19.
      Background: Clinicians frequently face questions that require rapid, evidence-based answers. Artificial intelligence (AI) tools are increasingly used for this purpose, yet their reliability for clinical decision-making remains uncertain. This study compared two generative large language model (LLM) systems (ChatGPT and Gemini) and a retrieval-supported clinical platform (OpenEvidence) to determine which provides the most reliable, clear, and clinically applicable information in obstetrics, gynecology, and urogynecology.
    Methods: A cross-sectional comparative design was used to evaluate ChatGPT (GPT-5), Gemini (Gemini 2.5), and the retrieval-supported platform OpenEvidence. Twenty-four clinical questions across three subspecialties were independently assessed by two blinded specialists using the Expert-Adapted DISCERN (EA-DISCERN) tool, which rates 12 quality domains on a five-point scale. Mean ± SD scores were compared across systems and clinical domains using repeated-measures analysis.
    Results: OpenEvidence achieved the highest mean total score (54.0 ± 2.3), outperforming Gemini (50.3 ± 2.4) and ChatGPT (48.7 ± 2.4) (p < 0.001). OpenEvidence scored significantly higher in evidence-based domains: clinical accuracy, guideline consistency, completeness, transparency, and reliability across all fields. Gemini ranked between the two, showing a modest advantage over ChatGPT in rationale explanation and evidence transparency, while both generative models scored higher in language fluency and readability. Overall, total EA-DISCERN scores ranked OpenEvidence highest, followed by Gemini, then ChatGPT. Inter-rater reliability for the total score, measured as ICC(2,1) with absolute agreement, was 0.391.
    Conclusions: OpenEvidence provided more guideline-aligned and transparent responses, whereas ChatGPT and Gemini were generally more fluent and readable. For OB/GYN clinicians, retrieval-supported platforms may be more suitable for point-of-care verification, while generative models should be used more cautiously and with clinician oversight.
    DOI:  https://doi.org/10.1055/a-2899-0123
  7. Knee. 2026 Jun 17. pii: S0968-0160(26)00196-1. [Epub ahead of print]62 104516
       BACKGROUND: Patients increasingly seek health information prior to and following total knee arthroplasty (TKA), often using generative AI chatbots integrated into search engines. As these tools grow in popularity, their accuracy and utility in postoperative education for complex medical topics like TKA remain unclear. This study assessed four leading AI models (ChatGPT, Gemini, Copilot, and Grok) on their ability to answer common patient questions three months following TKA.
    METHODS: Four frequently asked questions (FAQs) were chosen based on surgeon experience in a high-volume total joint practice, covering return to sport, range of motion, persistent symptoms, and scar healing. Questions were input into each AI platform using a cleared browser. Ten orthopedic surgeons rated responses on a 1-4 scale (1 = Excellent, no clarification needed; 4 = Unsatisfactory, needing substantial clarification) and selected preferred answers. Mean scores, standard deviation (SD), and vote counts were analyzed using ANOVA for consistency and consensus.
    RESULTS: Significant score variations were noted for three questions (1, 2, and 4). Gemini consistently scored lowest (best), with mean scores of 1.4-1.6 and low variability (SD: 0.48-0.70), earning the most first place votes (7/10) for three questions. ChatGPT ranked second, followed by Grok and Copilot, showing higher (worse) scores and variability. Subjective questions like symptom normalization showed greater rater disagreement.
    CONCLUSION: Generative AI shows potential for postoperative education, but response quality differs across platforms. Gemini had the highest consistency and clinical alignment, supporting cautious integration of AI in patient communication with ongoing surgeon oversight for safety and accuracy.
    Keywords:  Artificial Intelligence (AI); Total Knee Arthroplasty (TKA)
    DOI:  https://doi.org/10.1016/j.knee.2026.104516
  8. Rev Bras Ortop (Sao Paulo). 2026 Apr;61(2): s00461820460
       Objective: Artificial intelligence (AI) tools based on natural language, such as ChatGPT 4.1 mini (OpenAI Group PBC) and Gemini 2.5 Flash (Alphabet Inc.), are used by patients as sources of medical information. The current study aimed to evaluate and compare the quality and readability of responses provided by these AIs, in Brazilian Portuguese, regarding rotator cuff surgery.
    Methods: The present cross-sectional, descriptive, and comparative study followed qualitative and quantitative approaches. A total of 24 frequently asked patient questions were used, classified according to Rothwell. Each question was entered individually into both platforms, and only the first response was considered. The quality assessment used the DISCERN instrument, developed by the University of Oxford and the British Library, and the Journal of the American Medical Association (JAMA) benchmark criteria. Readability was estimated using Análise de Legibilidade Textual (ALT, "Text Readability Analysis", in Portuguese) software, validated for Brazilian Portuguese. The statistical analyses included the Wilcoxon and Friedman tests, repeated-measures analysis of variance (ANOVA), and the Conover post-hoc test with Bonferroni correction.
    Results: ChatGPT achieved a mean DISCERN score of 58.7 ± 4.0, and Gemini, 56.3 ± 3.5, with no significant difference (p = 0.174), but with a maximum effect size (rank-biserial correlation [rrb] = 1.0). Both models showed a mean readability corresponding to 13.3 years of schooling (p = 1.000). No response met the JAMA benchmark criteria. Value-based questions achieved the highest quality scores, whereas policy-related questions were the most complex in terms of readability. The correlation between quality and readability was moderate (ρ = 0.73; p = 0.099).
    Conclusion: ChatGPT 4.1 mini and Gemini 2.5 Flash do not yet provide adequate medical information in Brazilian Portuguese regarding editorial reliability, quality, and textual accessibility for the general public.
    Keywords:  ChatGPT; artificial intelligence; health education; information technology; language models; rotator cuff
    DOI:  https://doi.org/10.1055/s-0046-1820460
  9. JCO Clin Cancer Inform. 2026 Apr;10(2): e2600114
       PURPOSE: Artificial intelligence (AI) is increasingly integrated into cancer care and accessible to patients, yet the quality and accessibility of patient-facing information on this topic are poorly characterized. We evaluated the quality, readability, and AI safety concept coverage of publicly available Webpages and YouTube videos that patients are likely to encounter when searching to learn about AI in cancer care.
    METHODS: We conducted a cross-sectional analysis of digital content identified using Google Trends-derived search terms on August 6, 2025. Two independent reviewers screened the first 170 Google and 150 YouTube results for patient-facing content relevant to AI in cancer care, with discrepancies resolved by a third reviewer. Information quality was assessed using DISCERN (scale 1-5; ≥4 = high quality). Webpage readability was evaluated using Flesch-Kincaid (FK), Gunning Fog (GF), and Simplified Measure of Gobbledygook (SMOG) indices. Coverage of AI safety concepts (hallucination risk, clinician oversight, bias/equity, transparency) was assessed.
    RESULTS: Of the 320 resources screened, 52 Webpages (31%) and 30 videos (20%) met inclusion criteria. Median DISCERN scores were 2.5 (IQR, 2.5-4.0) for Webpages and 2.25 (IQR, 1.5-3.0) for videos, indicating overall low-quality information. Only 17 Webpages (33%) and seven videos (23%) were of high quality. Median Webpage readability corresponded to college level across all indices (FK 12.5, GF 15.0, SMOG 14.3), exceeding the sixth- to eighth-grade reading level recommended by the American Medical Association and National Institutes of Health. Most Webpages addressed clinician oversight (79%) and transparency (79%), but few discussed hallucination risk (15%).
    CONCLUSION: Patient-facing online information about AI in cancer care is limited, low in quality, difficult to read, and frequently omits safety concepts, highlighting an urgent need for accessible, high-quality resources.
    DOI:  https://doi.org/10.1200/CCI-26-00114
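The three grade-level indices named in item 9 (Flesch-Kincaid Grade Level, Gunning Fog, SMOG) follow standard published formulas, sketched below. This is a minimal illustration that takes pre-computed counts as input; counting syllables and "complex" words requires a tokenizer and is left out. The function names and the sample counts are hypothetical and are not taken from the study.

```python
import math

def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog(sentences, polysyllables):
    """SMOG grade, normalising the polysyllable count to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Hypothetical counts for one webpage excerpt (illustration only).
words, sentences, syllables, polysyllables = 100, 5, 150, 10

scores = {
    "FK": fk_grade(words, sentences, syllables),
    "GF": gunning_fog(words, sentences, polysyllables),
    "SMOG": smog(sentences, polysyllables),
}
for name, grade in scores.items():
    # Eighth grade is the upper end of the recommended range cited above.
    flag = "above" if grade > 8 else "within"
    print(f"{name}: grade {grade:.1f} ({flag} the eighth-grade ceiling)")
```

All three indices rise with longer sentences and longer words, which is why they tend to agree on whether a text overshoots the sixth- to eighth-grade target even though their absolute values differ.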
  10. Front Cell Dev Biol. 2026;14: 1768270
       Purpose: This study compares ChatGPT, DeepSeek, and Google Search in addressing cartilage repair-related questions across two domains, cartilage tissue engineering (CTE) and cartilage repair surgery (CRS), using a dual-axis framework that integrates classification, blinded quality scoring, and readability analysis.
    Methods: Google Search was queried for the top 20 frequently asked questions (FAQs) in each domain (CTE in 2023, CRS in 2024). The identical Top-10 Google-derived FAQs per domain were subsequently submitted to all three platforms, namely Google, ChatGPT (GPT-4 API), and DeepSeek (V3 API), enabling a matched three-way comparison. Questions and answer sources were classified using a modified Rothwell taxonomy. Answer quality was independently evaluated by three blinded raters using the Accuracy-Safety-Hallucination (ASH) framework. Readability was assessed via the Flesch-Kincaid formula.
    Results: In the CTE domain, DeepSeek achieved the highest Accuracy (median 5.00, IQR 4.67-5.00) and significantly outperformed Google (median 4.00, Bonferroni-corrected p = 0.036), while ChatGPT (median 3.67) did not differ significantly from either platform. In the CRS domain, both ChatGPT (median 5.00) and DeepSeek (median 5.00) significantly outperformed Google (median 4.17; p = 0.024 and p = 0.045, respectively), with Safety significantly favoring both LLMs (Cochran's Q p = 0.018). ChatGPT and DeepSeek did not differ significantly in Accuracy in either domain. Readability analysis paradoxically favored Google (Grade Level 12.6-13.2 vs. 15.7-17.4 for LLMs), attributable to extreme snippet brevity inflating formulaic scores rather than genuine comprehensibility.
    Conclusion: ChatGPT and DeepSeek outperform Google Search in accuracy and safety, yet their value lies in complementary functional roles rather than direct competition. ChatGPT's policy- and education-oriented framing and strong CRS safety profile position it as a practical tool for patient education. DeepSeek's technical depth and academically concentrated sourcing make it better suited for clinical decision support and research. Google offers the highest readability and closely mirrors patient concerns but carries measurable safety risks in surgical contexts. These findings advocate for stakeholder-specific AI tool matching rather than one-size-fits-all recommendations.
    Keywords:  ChatGPT; cartilage surgery; cartilage tissue engineering; deepseek; machine learning
    DOI:  https://doi.org/10.3389/fcell.2026.1768270
  11. Eur J Pediatr. 2026 Jun 13. pii: 498. [Epub ahead of print]185(7):
      This study aims to comparatively evaluate the quality, reliability, and readability of responses generated by ChatGPT 5.2 and Gemini 3 Pro regarding feeding management and oral health guidance for children with cleft lip and palate (CLP). A cross-sectional design was used. Both models were asked 20 questions on feeding and oral-dental care in infants and children with CLP. Response quality was assessed using the Global Quality Score (GQS) and the CLEAR tool, reliability with the modified-DISCERN (m-DISCERN), and readability with the Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL). The mean GQS was significantly higher for Gemini 3 Pro than for ChatGPT 5.2 (4.51 ± 0.29 vs. 3.76 ± 0.33; p < 0.001). The CLEAR tool score was likewise significantly higher in Gemini 3 Pro compared with ChatGPT 5.2 (22.95 ± 0.94 vs. 19.40 ± 1.47; p < 0.001), and the m-DISCERN scores were also significantly higher for Gemini 3 Pro (4.0 [4.0-4.0]) than for ChatGPT 5.2 (3.0 [2.0-3.0]; p < 0.001). The FRES values were significantly higher for Gemini 3 Pro compared with ChatGPT 5.2 (49.50 ± 10.33 vs. 23.60 ± 14.48; p < 0.001), whereas the FKGL values were significantly lower (9.77 ± 1.54 vs. 13.44 ± 2.81; p < 0.001). Correlation analysis revealed strong positive correlations between GQS and CLEAR (r = 0.753) and between GQS and m-DISCERN (r = 0.740), while FKGL showed significant negative correlations with all quality and reliability measures (-0.566 ≤ r ≤ -0.484; all p < 0.001).
    CONCLUSION: Gemini 3 Pro outperformed ChatGPT 5.2 across content quality, reliability, and readability. Although both models can support guidance in CLP care, AI chatbot outputs should be used as complementary tools alongside professional clinical guidance.
    WHAT IS KNOWN: • Large language model (LLM) chatbots are increasingly used as sources of health information for patients and families. • Existing studies on cleft lip and palate largely focus on single-model evaluations.
    WHAT IS NEW: • This study provides the first comparative analysis of ChatGPT 5.2 and Gemini 3 Pro for feeding management and oral-health guidance in CLP care. • Gemini 3 Pro demonstrated higher content quality, reliability, and readability than ChatGPT 5.2, indicating model-dependent variability in chatbot performance.
    Keywords:  Artificial intelligence; Chatbots; Cleft lip and palate; Feeding guidance; Oral health
    DOI:  https://doi.org/10.1007/s00431-026-07143-7
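Item 11 reports higher Flesch Reading Ease (FRES) alongside lower FKGL for the same model. A minimal sketch of the standard FRES formula shows why the two move in opposite directions: both depend on words per sentence and syllables per word, but FRES subtracts those terms, so a higher score means easier text. The function name and the counts below are hypothetical and are not taken from the study.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Hypothetical counts for two 100-word answers (illustration only).
plain = flesch_reading_ease(words=100, sentences=5, syllables=150)
dense = flesch_reading_ease(words=100, sentences=4, syllables=180)

print(f"plain answer FRES: {plain:.1f}")
print(f"dense answer FRES: {dense:.1f}")
```

The denser answer has longer sentences and more syllables per word, so it scores lower on FRES and would score higher on any grade-level index.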
  12. Front Public Health. 2026;14: 1833611
       Objective: This study aims to clarify the impact of Large Language Models (LLMs) and health education content categories on generated text quality (patient education appropriateness and overall quality) and readability, providing empirical evidence for the standardized application of LLMs-assisted health communication.
    Methods: Five mainstream models (Doubao, DeepSeek, Wenxin Yiyan, Gemini and GPT-5) were selected to generate 100 texts (20 per model, 20 per theme) across five health education categories: disease cognition dimension, etiology and risk factors dimension, diagnosis and examination dimension, treatment and management dimension, and prevention and prognosis dimension. Text quality was assessed using the Chinese version of the Patient Education Material Readability Assessment Scale (C-PEMAT) and the Global Quality Scale (GQS), while readability was measured via seven metrics including the Automated Readability Index (ARI) and the Flesch Reading Ease Score (FRES). Correlation analyses were used to explore relationships among indicators.
    Results: Our analysis revealed clear hierarchical performance across five large language models: GPT-5 achieved the highest scores in both patient education appropriateness (C-PEMAT: 11.10 ± 2.40) and overall text quality (GQS: 5.00 [4.00, 5.00]). GPT-5 exhibited significantly higher GQS scores than all other models (χ² = 66.52, p < 0.001), while Wenxin Yiyan ranked lowest in core quality (GQS: 1.00 [1.00, 2.00]). Content categories exhibited differentiated readability but stable quality: texts on "Prevention and Prognosis" and "Treatment and Management" yielded the highest C-PEMAT scores, whereas "Etiology and Risk Factors" texts showed weaker reading fluency. Correlation analysis confirmed that quality and readability were largely independent, though subtle associations emerged, including a weak positive link between FRES and GQS. In the factual-accuracy assessment, 19.0% of responses contained factual inaccuracies, while no response was judged to contain potentially clinically harmful misinformation. Significant between-model differences were observed in factual accuracy scores.
    Conclusion: This study demonstrates significant hierarchical performance among LLMs in health science text creation. Different health education themes show partial indicator variation but stable overall quality. Notably, quality and readability are relatively independent (with weak correlations), providing empirical evidence for understanding LLMs in health popularization.
    Keywords:  artificial intelligence; gestational hypertension; large language models; online medical information; quality; readability
    DOI:  https://doi.org/10.3389/fpubh.2026.1833611
  13. JMIR Med Inform. 2026 Jun 18. 14 e93054
       Background: Although large language models (LLMs) show potential for patient education, their accuracy, usability, and comprehensibility lack validation in high-risk pediatric anesthesia. Rigorous evaluation is therefore essential prior to widespread clinical use in perioperative parental anesthesia education.
    Objective: This study aims to evaluate the accuracy, reliability, and readability of responses generated by 5 LLMs to parental inquiries regarding pediatric anesthesia, and to assess their suitability for clinical use in perioperative caregiver education.
    Methods: Two expert anesthesiologists identified 33 parental questions on pediatric anesthesia by screening authoritative resources and Google Trends. On December 14, 2025, these questions were submitted to 5 LLMs (DeepSeek-V3.2, ChatGPT-5, Gemini 2.5 Flash, Copilot, and Perplexity) via official web interfaces with default settings and zero-shot prompting, with each query in a separate conversation. Responses were standardized for blinded assessment. Two pediatric anesthesiologists with ≥10 years of clinical experience independently evaluated accuracy and reliability using the 4-point Likert accuracy scale, DISCERN, Ensuring Quality Information for Patients (EQIP), Journal of the American Medical Association (JAMA) benchmark, and Global Quality Score (GQS). After text preprocessing, readability was evaluated using 6 algorithms (Automated Readability Index [ARI], Flesch Reading Ease Score [FRES], Gunning Fog Index [GFI], Flesch-Kincaid Grade Level [FKGL], Coleman-Liau Index [CL], and the Simple Measure of Gobbledygook [SMOG]) via an online calculator. Interrater reliability was analyzed using the intraclass correlation coefficient (ICC); differences across models were assessed with the Kruskal-Wallis H test; and deviations from the sixth-grade benchmark were evaluated using 1-sample Wilcoxon signed-rank tests (P<.05 considered significant).
    Results: All 5 LLMs demonstrated high clinical accuracy (>90%; P=.12), with Gemini reaching 100%. Nevertheless, safety risks and content hallucinations were still observed. Excluding Gemini and Copilot, the remaining 3 models (ChatGPT, DeepSeek, and Perplexity) each produced unsafe content in 3.03% (n=1) of the 33 queries. Hallucinations were detected in all models except Gemini, with DeepSeek and Perplexity showing the highest hallucination rate (3/33, 9.09%). Furthermore, Perplexity showed superior reliability on DISCERN (median 41; P<.05), yet no model achieved a "good" rating. Gemini achieved the highest EQIP (median 66.67%; P<.05) despite lower GQS (median 3). Transparency was universally poor (JAMA median ≤1), with DeepSeek and ChatGPT showing a "floor effect." ChatGPT had superior readability, but all models exceeded the recommended sixth-grade complexity level.
    Conclusions: In this study, 5 LLMs generally provided clinically accurate information when responding to parental questions about pediatric anesthesia. However, limitations were also identified, including hallucinated content, safety-related deficiencies, limited source transparency, and readability levels exceeding recommended standards. Therefore, LLM-generated information should be interpreted with caution and should not replace clinician guidance.
    Keywords:  accuracy; large language model; patient education; pediatric anesthesia; readability; reliability
    DOI:  https://doi.org/10.2196/93054
  14. Rev Bras Ortop (Sao Paulo). 2026 Apr;61(2): s00461820461
       Objective: Artificial intelligence (AI) tools based on natural language, such as ChatGPT 4.1 mini (OpenAI Group PBC) and Gemini 2.5 Flash (Alphabet Inc.), are used by patients as sources of medical information. The current study aimed to evaluate and compare the quality and readability of responses provided by these AIs, in Brazilian Portuguese, regarding rotator cuff surgery.
    Methods: The present cross-sectional, descriptive, and comparative study followed qualitative and quantitative approaches. A total of 24 frequently asked patient questions were used, classified according to Rothwell. Each question was entered individually into both platforms, and only the first response was considered. The quality assessment used the DISCERN instrument, developed by the University of Oxford and the British Library, and the Journal of the American Medical Association (JAMA) benchmark criteria. Readability was estimated using Análise de Legibilidade Textual (ALT, "Text Readability Analysis", in Portuguese) software, validated for Brazilian Portuguese. The statistical analyses included the Wilcoxon and Friedman tests, repeated-measures analysis of variance (ANOVA), and the Conover post-hoc test with Bonferroni correction.
    Results: ChatGPT achieved a mean DISCERN score of 58.7 ± 4.0, and Gemini, 56.3 ± 3.5, with no significant difference (p = 0.174), but with a maximum effect size (rank-biserial correlation [rrb] = 1.0). Both models showed a mean readability corresponding to 13.3 years of schooling (p = 1.000). No response met the JAMA benchmark criteria. Value-based questions achieved the highest quality scores, whereas policy-related questions were the most complex in terms of readability. The correlation between quality and readability was moderate (ρ = 0.73; p = 0.099).
    Conclusion: ChatGPT 4.1 mini and Gemini 2.5 Flash do not yet provide adequate medical information in Brazilian Portuguese regarding editorial reliability, quality, and textual accessibility for the general public.
    Keywords:  ChatGPT; artificial intelligence; health education; information technology; language models; rotator cuff
    DOI:  https://doi.org/10.1055/s-0046-1820461
  15. J Fr Ophtalmol. 2026 Jun 15. pii: S0181-5512(26)00149-X. [Epub ahead of print]49(7): 104923
       PURPOSE: To evaluate the appropriateness and readability of responses generated by large language model (LLM) chatbots to frequently asked patient questions regarding intravitreal anti-vascular endothelial growth factor (anti-VEGF) therapy.
    METHODS: Forty patient-centered anti-VEGF-related questions were developed by retinal specialists and posed in English to six LLM chatbots (ChatGPT-4.0, ChatGPT-5.2, Google Gemini 3, Microsoft Copilot, Grok 4, and Manus 1.6 Lite) under identical conditions. Responses were recorded verbatim and anonymized. Two ophthalmologists evaluated clinical appropriateness using a three-point Likert scale. Readability was assessed using five validated indices, and text length and time-based parameters were analyzed.
    RESULTS: None of the responses were classified as inappropriate. Gemini 3 demonstrated the highest rate of appropriate responses (97.5%), followed by ChatGPT-5.2 (90%), ChatGPT-4.0 (87.5%), and Manus 1.6 Lite (87.5%), while Copilot and Grok 4 showed lower appropriateness due to a higher proportion of partially appropriate responses (P=0.033). Significant differences were observed across all readability indices (P<0.001). Gemini 3 achieved the highest Flesch Reading Ease scores, indicating better patient accessibility, whereas Grok 4 produced more complex texts requiring higher educational levels. Manus 1.6 Lite generated the longest and most information-dense responses, while Gemini 3 demonstrated a more balanced profile between informational depth and readability.
    CONCLUSIONS: While LLM chatbots generally provide clinically appropriate information on intravitreal anti-VEGF therapy, substantial model-dependent differences exist in readability and communication quality. LLMs should therefore be used as physician-supervised tools to support patient education rather than as standalone information sources.
    Keywords:  Anti-VEGF intravitréen; Grands modèles de langage; Large language models; Lisibilité; Ophtalmologie; Ophthalmology; Patient education; Readability; Éducation des patients; Intravitreal anti-VEGF
    DOI:  https://doi.org/10.1016/j.jfo.2026.104923
  16. Am J Health Syst Pharm. 2026 Jun 19. pii: zxag183. [Epub ahead of print]
       PURPOSE: This study aimed to evaluate the completeness and readability of patient education materials available through 3 widely used drug information databases and assess their potential impact on patient comprehension.
    METHODS: A cross-sectional comparative analysis of patient education materials for 50 commonly prescribed medications was conducted across 3 different drug information databases. Medications were selected from diverse therapeutic classes, such as cardiovascular, endocrine, mental health, and others. Completeness was evaluated using the American Society of Health-System Pharmacists guidelines, including indications, dosing, administration, adverse effects, drug interactions, and precautions. Readability was analyzed using validated tools, such as the Simplified Measure of Gobbledygook (SMOG) and Flesch-Kincaid scale. Subgroup analysis examined variation by therapeutic class.
    RESULTS: All 3 databases were assigned high completeness scores, with drug interactions, lifestyle, and diet being the most common missing components. Readability scores varied significantly across databases. Database 1 produced materials closest to the National Institutes of Health/American Medical Association-recommended 6th-grade reading level, whereas databases 2 and 3 had higher literacy demands, producing materials from the 7th-grade to college level. It was also observed that education materials include numerous specific drug names. For instance, database 2 and database 3 often provided comprehensive medication lists for potential drug interactions, which increased readability scores and grade levels. When these lists were removed, the readability scores and grade levels decreased.
    CONCLUSION: While most essential information was included in the patient education materials, readability remained a barrier to patient comprehension. Balancing clarity with completeness is essential for supporting adherence and safety.
    Keywords:  drug information services; health literacy; patient education; pharmacy
    DOI:  https://doi.org/10.1093/ajhp/zxag183
  17. Sci Rep. 2026 Jun 16.
      Avascular necrosis of the femoral head (ANFH) is a progressive orthopedic disorder characterized by a significant rate of impairment. With the proliferation of short-video platforms in health communication, apprehensions have emerged over the quality and credibility of medical information shared through these media. Nevertheless, short videos specifically addressing ANFH remain largely unexamined. This research utilized a cross-sectional approach to examine 177 videos pertaining to ANFH on TikTok and Bilibili. The recorded data encompassed video sources, content themes, expression forms, and interaction metrics, with quality assessment performed according to four criteria. The overall video quality was average (GQS = 3, mDISCERN = 3, JAMA = 3, VIQI = 12). Healthcare personnel constituted the predominant uploaders (72.88%), with chief physicians representing 32.77%. Videos primarily showcased monologues (58.19%), with video content focused on treatment (71.75%) and symptoms (68.93%). Surgical operations were the predominant therapeutic modality, accounting for 36.72%. TikTok videos had a shorter median duration of 56 s but elicited higher interaction (P < 0.001) and exhibited enhanced performance in JAMA and VIQI scores (P < 0.001). Videos produced by attending/registered physicians demonstrated superior quality, while Q&A-format videos revealed improved reliability compared to monologues. Interaction metrics showed weak-to-moderate associations with quality scores overall, with relatively stronger correlations observed between interaction metrics and GQS and VIQI scores, particularly on TikTok. The general quality of short videos on ANFH is average, with considerable variations among platforms and uploaders. Efforts must concentrate on fortifying early preventive education, harmonizing surgical and conservative treatment methods, and instituting expert review and certification systems to improve the scientific precision and distribution efficacy of orthopedic health information on social media.
    Keywords:  Avascular necrosis of the femoral head; Bilibili; Social media; TikTok
    DOI:  https://doi.org/10.1038/s41598-026-58333-0
  18. Cureus. 2026 May;18(5): e109084
       BACKGROUND: Nitrous oxide inhalation sedation is a cornerstone of behavior management in pediatric dentistry, widely used to reduce anxiety and enhance cooperation in child dental patients. Online health information significantly influences parental understanding and decision-making regarding their children's dental treatment. However, the readability of Arabic web-based information on this topic remains unexplored.
    OBJECTIVE: To assess the readability of Arabic web-based information on nitrous oxide sedation in pediatric dentistry using the Gunning Fog Index (GFI), with implications for parental understanding.
    METHODS: A cross-sectional study was conducted in May 2026. The first 100 search results were screened from Google, Yahoo, and Bing using four Arabic search phrases. Eligible Arabic-language websites were analyzed using the online Gunning Fog Index calculator. Descriptive statistics were calculated, and readability was stratified by search engine and content source. The exact Arabic search terms used were "أكسيد النيتروز للأطفال" (nitrous oxide for children), "أكسيد النيتروز" (nitrous oxide), "الغاز الضاحك" (laughing gas), and "التخدير باستنشاق أكسيد النيتروز" (nitrous oxide inhalation sedation). The searches used standard browser settings (Google Chrome in incognito mode; Google LLC, Mountain View, CA, USA) to minimize personalization effects.
    RESULTS: Thirty-two websites met the inclusion criteria. The mean GFI was 9.92 ± 4.42 (range: 2.00-19.00), corresponding to a high school sophomore reading level. Only 12.5% (4/32) of websites met the American Medical Association's (AMA) recommended reading level (GFI ≤6). Most content originated from private dental practices (65.6%, 21/32). Public journals demonstrated significantly higher GFI scores (mean: 16.00 ± 3.46) compared to private practices (9.32 ± 3.96).
    CONCLUSION: Arabic web-based information on nitrous oxide sedation for pediatric dental patients exceeds recommended readability levels for parental education materials. Pediatric dentists and content creators should simplify online health information to improve parental comprehension.
    Keywords:  health literacy; inhalation sedation; nitrous oxide; online health information; parental understanding; pediatric dentistry; readability
    DOI:  https://doi.org/10.7759/cureus.109084
  19. Endocr Pract. 2026 Jun 15. pii: S1530-891X(26)01037-2. [Epub ahead of print]
       OBJECTIVES: Patient education can improve patients' health literacy, participation in shared decision-making, and quality of life. Contemporary guidelines, including the 2025 American Thyroid Association Guidelines for differentiated thyroid cancer (DTC), emphasize high-quality, accessible written information and survivorship-focused counseling as core components of care. This exploratory cross-sectional study performed a multimodal evaluation of DTC patient information booklets (PIBs).
    METHODS: Ten PIBs (five in English, five in French) were assessed by six healthcare professionals for: (i) information content across seven themes from a systematic review of unmet patient needs; (ii) quality of treatment information using DISCERN; (iii) understandability and actionability using PEMAT-P; and (iv) readability using four standard indices (Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook and Flesch Reading Ease formulas).
    RESULTS: Interrater agreement was excellent. Global content scores ranged from 25% to 70%. General information, diagnosis, treatment, and follow-up were mostly addressed, whereas psychosocial concerns, care coordination, and complementary medicine were consistently underrepresented. Three PIBs scored "excellent", four "good", and three "fair" on DISCERN. Eight PIBs were sufficiently understandable and five sufficiently actionable on PEMAT-P. Readability levels were systematically higher than recommended. Content scores correlated positively with booklet length, DISCERN scores, and PEMAT-P actionability.
    CONCLUSIONS: Assessed DTC PIBs show a persistent mismatch with patient-identified needs, particularly in psychosocial, care-coordination, and complementary-medicine domains, alongside variable quality of treatment information, understandability, and actionability. Readability remains uniformly more complex than recommended. These findings are useful for the development and clinical use of patient information materials.
    Keywords:  DISCERN; actionability; differentiated thyroid cancer; patient information material; quality of life; readability; understandability
    DOI:  https://doi.org/10.1016/j.eprac.2026.06.003
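The four readability indices named in the methods above can be sketched as plain functions of text counts. This is a minimal illustration using the standard English-language formulas; the function names are hypothetical, the counts (words, sentences, syllables, complex or polysyllabic words) are assumed to be supplied by the caller, and the abstract does not say which software or language adaptations the authors used for the French booklets.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index; complex words are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(polysyllables: int, sentences: int) -> float:
    """Simple Measure of Gobbledygook, normalized to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


if __name__ == "__main__":
    # Hypothetical counts for a short passage, for illustration only.
    print(round(flesch_reading_ease(100, 5, 150), 2))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
    print(round(gunning_fog(100, 5, 10), 2))
    print(round(smog(10, 5), 2))
```

All four indices rise or fall mainly with sentence length and the share of long words, which is why they tend to agree on whether a booklet exceeds a recommended grade level.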
  20. J Cancer Educ. 2026 Jun 18.
      Phytotherapy is widely used by cancer patients as a complementary and alternative medicine approach. With the increasing reliance on the internet for health-related information, concerns regarding the quality, reliability, and readability of online phytotherapy content have become more prominent. This study aimed to evaluate the readability and quality of web-based information on phytotherapy for cancer patients using validated assessment tools and to identify specific deficiencies in content quality. A descriptive cross-sectional analysis was conducted using the Google search engine with four predefined search terms related to phytotherapy and oncology. The first 50 websites for each term were screened, yielding 200 websites, of which 99 met the inclusion criteria. Websites were categorized by source type and visibility. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index, SMOG, and Coleman-Liau Index. Content quality was evaluated using the JAMA benchmark criteria and the DISCERN instrument, including item-level analysis. Non-parametric statistical tests were applied where appropriate. The median FKGL score was 9.3, indicating that most content required a high reading level. The median JAMA score was 4, while the median DISCERN score was 55, reflecting moderate but variable quality. Item-level analysis revealed that critical aspects such as treatment risks, benefits, uncertainties, and consequences of no treatment were frequently insufficiently addressed. Commercial websites demonstrated lower DISCERN scores compared with non-commercial sources. No significant differences were observed between first-page and subsequent search results. Online phytotherapy information for cancer patients is characterized by moderate quality, high readability demands, and important deficiencies in key domains necessary for informed decision-making. In the evolving landscape of AI-assisted health information retrieval, these limitations may have broader implications, highlighting the need for accurate, evidence-based, and accessible online resources.
    Keywords:  Cancer; Online health information; Phytotherapy; Readability
    DOI:  https://doi.org/10.1007/s13187-026-02926-w
  21. Cureus. 2026 May;18(5): e109113
      Objective: This study aimed to evaluate the accuracy, completeness, and clinical usefulness of YouTube (Alphabet Inc., Mountain View, CA) videos about cannabinoid hyperemesis syndrome (CHS). Methods: This cross-sectional study assessed YouTube videos that were identified using standardized CHS-related search terms and predefined eligibility criteria. Trained reviewers used an author-developed CHS content checklist and a global usefulness scale for each video, with any discrepant or complex cases resolved by medical toxicologists. Outcomes included accuracy across key CHS domains, overall clinical usefulness, and the association between video quality and basic engagement metrics. Results: A total of 97 videos were analyzed, with a mean length of 9.8 minutes and a cumulative view count of 795,264. Only 25.8% of videos were rated as useful and 2.1% as exemplary, whereas 52.6% were rated not useful and 19.6% as misleading due to missing essential content and/or unsubstantiated claims. Personal testimonial videos were common and often combined accurate symptom descriptions with speculative etiologies and non-evidence-based management advice. Engagement metrics showed little meaningful association with reviewer-rated accuracy or usefulness, and several of the most-viewed videos contained substantial misinformation. Interrater agreement for key classifications was substantial (κ = 0.78). Conclusions: CHS-related information on YouTube shows considerable variation in quality, and most videos provide incomplete or inaccurate guidance regarding diagnosis and management. Patients relying on these videos may encounter persuasive narratives that normalize symptoms or promote ineffective or harmful management strategies rather than encouraging cannabis cessation and medical evaluation. Clinicians should anticipate that patients have likely been exposed to such content, directly address misconceptions, and guide them toward vetted educational resources. High-quality, expert-developed CHS content is needed to improve the reliability of information available on social media platforms.
    Keywords:  cannabinoid hyperemesis syndrome; cannabis use; effects of social media; mixed methods research; qualitative content analysis
    DOI:  https://doi.org/10.7759/cureus.109113
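The interrater agreement statistic reported above (κ = 0.78) can be illustrated with a short sketch. It assumes Cohen's kappa for two raters, which the abstract does not confirm; the function name and the example ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical ratings of four videos by two reviewers.
    reviewer_1 = ["useful", "useful", "not useful", "not useful"]
    reviewer_2 = ["useful", "not useful", "not useful", "not useful"]
    print(cohens_kappa(reviewer_1, reviewer_2))
```

Kappa discounts the agreement two raters would reach by chance, so it is lower than raw percentage agreement whenever one label dominates.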
  22. J Clin Orthop Trauma. 2026 Sep;80 103519
       Background: Scaphoid fractures are the most common carpal fractures and often affect young, active individuals. Patients frequently turn to online media for medical guidance, where quality and transparency vary widely.
    Purpose: To evaluate the transparency, comprehensibility, and educational value of YouTube videos related to scaphoid fractures using established and condition-specific assessment tools.
    Methods: The fifty most-viewed English-language YouTube videos related to scaphoid fractures were analyzed for content source, engagement metrics, and quality using the JAMA benchmark criteria, a 4-point Likert comprehensibility scale, and a Scaphoid Fracture-Specific Score (SFS-SS).
    Results: Physicians produced over half of the videos (50%), followed by non-physician health professionals (24%) and trainers/therapists (12%). The median JAMA transparency score was 2, the median comprehensibility score was 2, and the median SFS-SS was 8 out of 16. Physician-created videos had higher JAMA scores (p < 0.001) but lower comprehensibility than non-physician sources (p = 0.016).
    Conclusion: YouTube videos related to scaphoid fractures demonstrated limited transparency, modest comprehensibility, and incomplete educational coverage. Physician-created videos were common but did not consistently provide more comprehensive patient-oriented information. Improved online educational content may require clearer authorship, evidence citation, and patient-centered communication.
    Level of evidence: Level V.
    Keywords:  Hand surgery; Online health information; Patient education; Scaphoid fracture; YouTube
    DOI:  https://doi.org/10.1016/j.jcot.2026.103519
  23. Knee. 2026 Jun 15. pii: S0968-0160(26)00209-7. [Epub ahead of print]62 104529
       BACKGROUND: YouTube is widely used by patients seeking preoperative information before total knee arthroplasty (TKA). As online video content may influence patient expectations and postoperative satisfaction, reliable tools are needed to assess educational quality. Advances in artificial intelligence (AI) enable large-scale evaluation of digital medical content. This study assessed the educational quality of the most viewed YouTube videos on TKA and compared evaluations by orthopedic surgeons and AI models.
    METHODS: On December 10, 2025, YouTube was searched using the term "total knee arthroplasty." The 61 most viewed videos were screened; 50 met inclusion criteria. Two orthopedic surgeons independently evaluated each video using DISCERN, JAMA benchmark criteria, Global Quality Score (GQS), Patient Education Materials Assessment Tool (PEMAT), and a novel Total Knee Arthroplasty-Specific Scoring System (TKA-SS). Interobserver reliability was calculated using intraclass correlation coefficients (ICC). AI evaluations were performed using transcript-based and multimodal large language models. Correlations between human and AI-generated scores and absolute differences were analyzed.
    RESULTS: Educational quality ranged from low to moderate with substantial heterogeneity. Interobserver reliability was excellent for most instruments, particularly DISCERN (ICC = 0.985), but low for PEMAT-understandability. Significant positive correlations were observed between surgeon reference and AI-generated scores across all instruments (all p < 0.001), strongest for TKA-SS. Mean absolute differences were small, and no significant differences were found between AI models.
    CONCLUSION: Widely viewed TKA-related YouTube videos demonstrate predominantly low-to-moderate educational quality. AI-generated assessments show meaningful agreement with surgeon evaluations and may serve as a scalable tool for preliminary content stratification rather than definitive quality assessment.
    Keywords:  Artificial intelligence; Health information quality; Patient education; Total knee arthroplasty; YouTube
    DOI:  https://doi.org/10.1016/j.knee.2026.104529
  24. Digit Health. 2026 Jan-Dec;12:12 20552076261460940
       Objective: Social media has become an important channel for obtaining information about vertigo. Benign paroxysmal positional vertigo (BPPV) is the most common peripheral vestibular disorder, yet the quality and reliability of BPPV-related content on social media platforms remain limited. Inaccurate information may mislead patients or encourage inappropriate self-management. This study quantitatively assessed the quality and reliability of BPPV-related short videos on Douyin (the Chinese version of TikTok) and Bilibili.
    Methods: On September 28, 2025, videos about BPPV on Douyin and Bilibili were retrieved, and various key information such as video types, content, parameters, and user participation was extracted. The Global Quality Scale (GQS) and modified DISCERN (mDISCERN) were used to evaluate video quality and reliability, and non-parametric testing, correlation analysis, and multivariable logistic regression were used for data analysis.
    Results: The quality and reliability of BPPV videos on Douyin and Bilibili were rated as fair quality (median GQS = 3) and low reliability (median mDISCERN = 2). Doctors were the main uploaders, but their scores were close to the overall average. User engagement (likes, collections, comments, shares) was highly correlated but had no relationship with video quality. The duration of videos with high user participation was 60-180 seconds on Douyin and 150-300 seconds on Bilibili. Longer duration independently predicted higher video quality (OR = 1.003, P = 0.024).
    Conclusions: The quality and reliability of BPPV short videos on Douyin and Bilibili were suboptimal. Improving this situation requires the joint efforts of doctors, patients, and social platforms.
    Keywords:  Bilibili; TikTok; benign paroxysmal positional vertigo; online health information; short videos; social media
    DOI:  https://doi.org/10.1177/20552076261460940
  25. PLoS One. 2026 ;21(6): e0350430
       BACKGROUND: There is a concerning trend of misinformation in healthcare-related content on social media. Recent studies have examined themes and narratives about Crohn's disease but have not quantitatively assessed the accuracy and quality of content on Instagram Reels. Our aim was to assess the quality and accuracy of Instagram Reels about Crohn's disease and examine differences in content by type of creator, from medical professionals to lay individuals.
    METHODS: Seventy-eight top-viewed English-language Instagram Reels tagged with "#crohns" were evaluated. Videos were categorized by creator and content type. Two reviewers evaluated each video for accuracy and quality using an adapted harm/benefit score and the Journal of the American Medical Association (JAMA) benchmark criteria, respectively.
    RESULTS: Seventeen percent of videos were created by medical professionals and 83% by non-medical users. Educational content was significantly more common among medical professionals than other content creators (62% vs 23%; P = 0.005). No significant correlation was found between engagement metrics and either JAMA or harm/benefit scores. Medical professionals had significantly higher JAMA scores than non-medical users (2.5 vs 2, P < 0.001), but there was no significant difference in harm/benefit scores between groups (0 vs 0, P = 0.9601). Videos offering medical advice had the lowest median harm/benefit score (-1), with frequent misinformation noted. Forty-two percent of harmful videos were created by medical professionals.
    CONCLUSIONS: The average Instagram Reel about Crohn's disease was of moderate quality and neutral impact. Accuracy or quality was unrelated to video popularity. While videos by medical professionals had higher JAMA scores, this did not correspond to greater accuracy. Medical advice videos by medical professionals were not more accurate than those by non-medical creators, and multiple harmful videos were created by medical professionals, underscoring the need for critical evaluation of Crohn's disease-related social media content.
    DOI:  https://doi.org/10.1371/journal.pone.0350430
  26. Front Med (Lausanne). 2026 ;13 1868829
       Background: Xiaohongshu has become an important platform for the public to obtain eye health information, but the quality and uploader-related differences of dry eye educational videos remain unclear.
    Methods: In this cross-sectional study, Xiaohongshu's "Video" section was searched on April 16, 2026, using the Chinese keyword "" (dry eye). The first 200 videos generated by the default sorting algorithm were screened. After excluding duplicate, promotional, off-topic, and reposted content, 136 videos were included. As all eligible videos were uploaded by individual accounts, comparisons were performed between non-medical individual users and individual medical users. Video characteristics, content coverage, and engagement metrics were extracted, and video quality was assessed using DISCERN, PEMAT-A/V, and the Global Quality Score.
    Results: Among the 136 included videos, 108 were uploaded by non-medical individual users and 28 by individual medical users. Video content was mainly concentrated on lifestyle recommendations (87.5%) and treatment or management strategies (62.5%), whereas topics such as definition, diagnosis, classification, and follow-up were less frequently covered. Compared with videos uploaded by individual medical users, those uploaded by non-medical individual users had a higher prevalence of incorrect or potentially misleading information (57.4% vs. 10.7%; OR = 11.23, 95% CI: 3.20-39.47). In addition, videos uploaded by individual medical users scored significantly higher than those from non-medical individual users in DISCERN total score, PEMAT understandability, PEMAT actionability, and GQS score, with all comparisons reaching statistical significance (all p < 0.001). Correlation analysis indicated that the number of covered content domains was positively associated with the DISCERN total score, GQS score, and PEMAT understandability.
    Conclusion: Dry eye-related videos on Xiaohongshu varied substantially in quality and mainly emphasized lifestyle recommendations and treatment management, with limited coverage of systematic medical knowledge. In this cross-sectional sample, videos uploaded by individual medical users were associated with higher quality scores, broader content coverage, and a lower prevalence of incorrect or potentially misleading information than videos from non-medical individual users. These findings suggest the potential value of platform-level quality control, improved mechanisms for identifying and recommending professional content, and greater participation of ophthalmic healthcare professionals in standardized science communication.
    Keywords:  Xiaohongshu; dry eye; health information quality; patient education; short videos
    DOI:  https://doi.org/10.3389/fmed.2026.1868829
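The odds ratio reported above can be worked through as a 2x2 table calculation. This is a sketch under stated assumptions: the cell counts (62 of 108 non-medical videos and 3 of 28 medical videos containing misleading information) are reconstructed from the reported percentages rather than given in the abstract, the interval is assumed to be a Wald interval on the log scale, and the function name is hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a, b: exposed group with / without the outcome
    c, d: comparison group with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Assumed counts: 62/108 (57.4%) non-medical, 3/28 (10.7%) medical.
    or_, lo, hi = odds_ratio_wald_ci(62, 46, 3, 25)
    print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With these assumed counts the sketch gives an OR of about 11.23 with a 95% CI of roughly 3.20 to 39.47, consistent with the reported figures. The wide interval reflects the small number of misleading videos in the medical group.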
  27. Sci Rep. 2026 Jun 15.
      Short-video platforms, specifically TikTok and Bilibili, are widely utilized for health information seeking in China. However, the quality of chemotherapy-related content on these channels remains unverified. This study compares the reliability and quality of chemotherapy videos across these two platforms, evaluating the impact of uploader identity, video characteristics, and user engagement metrics. A total of 188 chemotherapy-related videos were retrieved from TikTok (n = 89) and Bilibili (n = 99). Content quality and reliability were independently assessed using the Global Quality Scale (GQS) and modified DISCERN (mDISCERN). Intra-platform interactions, video length, and uploader profiles were cross-sectionally analyzed using localized stratified Spearman's rank correlation and fully adjusted multivariate Poisson regression models. Overall information quality was suboptimal. Significant baseline disparities were observed between platforms: while TikTok dominated in interaction metrics (P < .001), Bilibili achieved significantly higher reliability (mDISCERN, P < .001) and overall quality scores (GQS, P = .034). Certified oncologists were the primary contributors (46.28%), producing significantly higher-quality content than patients (P < .001). Localized stratified analysis exposed a platform-specific popularity paradox: within TikTok, GQS scores were significantly negatively correlated with likes (r = -.211, P = .047) and comments (r = -.269, P = .011), whereas this quality-engagement trade-off was absent within Bilibili, where mDISCERN was positively correlated with collections (r = .221, P = .028) and shares (r = .249, P = .013). Fully adjusted Poisson regression confirmed that micro-level engagement counts and video length possessed no independent predictive capacity (P > .20), whereas the macro-platform identity of Bilibili emerged as an independent positive predictor of information reliability (RR = 1.388, 95% CI 1.069-1.801, P = .0138). A critical socio-technical divide exists within the digital health landscape, manifested as a localized popularity paradox affecting engagement-first networks like TikTok. Fully adjusted estimations reveal that the overarching macro-platform architecture itself, rather than micro-level clip features or standalone video length, serves as the independent determinant of clinical reliability. Platforms must transition from traffic-oriented metrics toward professional, quality-weighted recommendation system interventions to effectively bridge the gap between evidence-based clinical rigor and public digital accessibility.
    Keywords:  Bilibili; Chemotherapy; Health communication; Short-video platforms; Social media; TikTok/Douyin
    DOI:  https://doi.org/10.1038/s41598-026-56880-0
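The rate ratio, confidence interval, and P value reported above can be checked against one another with a short sketch. It assumes the interval is a Wald interval computed on the log scale, as is usual for Poisson regression output, though the abstract does not state this; the function name is hypothetical.

```python
import math


def wald_stats_from_ratio_ci(estimate, lower, upper, z_crit=1.959964):
    """Recover SE, z, and two-sided P from a ratio estimate and its 95% CI."""
    # On the log scale the CI spans 2 * z_crit standard errors.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    # Two-sided normal tail probability via the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


if __name__ == "__main__":
    se, z, p = wald_stats_from_ratio_ci(1.388, 1.069, 1.801)
    print(f"SE(log RR) = {se:.3f}, z = {z:.2f}, P = {p:.4f}")
```

Under that assumption the recovered P value is close to the reported .0138, which suggests the three reported numbers are internally consistent.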
  28. Medicine (Baltimore). 2026 Jun 12. 105(24): e49308
      Diabetic nephropathy (DN) is a common, life-threatening complication of diabetes, contributing to the global disease burden. With the advent of video platforms, health information is being more widely disseminated. However, the quality of such content varies widely, which may influence the public's perception. This study aimed to evaluate the upload sources, content, and characteristics of DN-related videos on TikTok and Bilibili and to explore descriptive associations between video quality scores and selected video characteristics. This cross-sectional content analysis included 166 DN-related videos. Video quality was assessed using the Global Quality Scale (GQS), modified DISCERN (mDISCERN), and Journal of the American Medical Association (JAMA) benchmark criteria. Descriptive subgroup and correlation analyses were performed to examine cross-sectional associations between video quality scores and selected video attributes. No multivariable adjustment was performed. In unadjusted cross-sectional comparisons, TikTok videos showed higher observed engagement counts at the time of data collection than Bilibili videos, whereas no statistically significant differences were observed in video duration or quality indicators after correction for multiple comparisons. In unadjusted descriptive subgroup comparisons, videos uploaded by experts showed more favorable results in selected quality-related measures, particularly GQS and JAMA, than videos uploaded by individual users. No clear association was observed between video quality and snapshot engagement metrics recorded at the time of retrieval. This study identified descriptive differences in the presentation and dissemination patterns of DN-related health information across TikTok and Bilibili. Because the analyses were observational, cross-sectional, and unadjusted for potential confounders such as video length and content type, the observed differences between platforms and uploader types should be interpreted as descriptive associations only rather than independent effects.
    Keywords:  diabetic nephropathy; health information quality; public health; social media
    DOI:  https://doi.org/10.1097/MD.0000000000049308
  29. Medicine (Baltimore). 2026 Jun 12. 105(24): e49206
      Short-video platforms are important sources of cerebral infarction-related health information, but the completeness, reliability, and platform differences of such content remain unclear. This study compared engagement, topic coverage, and information quality of cerebral infarction-related videos on TikTok and Bilibili. Using the keyword "cerebral infarction," we screened the top 150 search results from each platform. Of 300 retrieved videos, 289 were included (TikTok: n = 146; Bilibili: n = 143). Video duration and engagement metrics were extracted, and uploaders were classified as doctors of health professions (DHPs), non-doctors of health professions (NDHPs), or individual users. Topic coverage was coded as not covered, partially covered, or completely covered. Information quality and reliability were assessed using GQS, mDISCERN, JAMA benchmarks, and VIQI. Mann-Whitney U, Kruskal-Wallis with Dunn-Bonferroni post hoc tests, and Spearman correlation analyses were performed. Content coverage was uneven. Epidemiology was absent in 85.8% of videos and completely covered in only 0.3%, whereas diagnosis was absent in 69.6% and completely covered in 10.0%. Etiology and clinical manifestations were more frequently addressed, with complete coverage in 61.2% and 37.7%, respectively. Bilibili videos were longer than TikTok videos, whereas TikTok videos showed higher likes, collections, comments, and shares (all P < .001). Platform differences were observed across all 4 quality instruments. TikTok videos had higher JAMA scores, whereas Bilibili videos had higher mDISCERN and VIQI scores; GQS also differed significantly despite a median score of 3.0 on both platforms. DHP-uploaded videos had higher quality scores than NDHP- or individual-user videos across all 4 instruments (all P < .001). Platform-stratified correlations between engagement and quality scores were weak, ranging from r = -0.257 to r = 0.139 on TikTok and from r = 0.066 to r = 0.298 on Bilibili. Cerebral infarction-related videos on TikTok and Bilibili showed modest overall quality and gaps in key actionable domains, particularly epidemiology and diagnosis. Engagement was a poor proxy for information quality. Stronger credibility labeling, source disclosure, and quality-aware recommendation strategies may improve short-video health information.
    Keywords:  Bilibili; TikTok; cerebral infarction; health communication; information quality; social media
    DOI:  https://doi.org/10.1097/MD.0000000000049206
  30. Sci Rep. 2026 Jun 18.
      Short-form video platforms are increasingly used to obtain information about chronic obstructive pulmonary disease (COPD), but the quality and reliability of COPD-related content across Chinese platforms remain unclear. We evaluated 228 COPD-related videos from Douyin, Kwai, and Bilibili using the Journal of the American Medical Association (JAMA) benchmark criteria, Global Quality Scale, modified DISCERN (mDISCERN), and Patient Education Materials Assessment Tool. Video characteristics, creator identity, verification status, content theme, presentation format, and visible engagement metrics were also analyzed. Video quality differed significantly across platforms. Douyin showed the highest transparency, reliability, and overall quality scores. Kwai showed the highest visible engagement but had lower overall quality and actionability scores. Bilibili had the longest videos and the highest understandability scores. Organization verified accounts generally achieved higher quality scores than individual verified or unverified accounts, although this finding should be interpreted cautiously because of the small number of such accounts. Visible engagement metrics, including likes, comments, saves, and shares, were not significantly correlated with medical quality scores. Together, these findings suggest that visible popularity is a poor proxy for medical quality. COPD-related digital health communication may benefit from more accessible evidence-based content, clearer source identification, and closer oversight of high-risk therapeutic claims.
    Keywords:  Chronic obstructive pulmonary disease; Digital health literacy; Health information quality; Short-form video platforms; Social media; TikTok
    DOI:  https://doi.org/10.1038/s41598-026-57615-x
  31. J Clin Neurosci. 2026 Jun 13. pii: S0967-5868(26)00295-X. [Epub ahead of print]152 112144
       BACKGROUND: The increasing reliance on mobile internet for health information necessitates a critical evaluation of content quality. This study aimed to systematically assess the quality of Alzheimer's Disease (AD) treatment-related short videos on TikTok, a leading platform for health information dissemination.
    METHOD: A total of 100 CE treatment videos from TikTok, retrieved on December 20, 2025, were comprehensively evaluated using established assessment tools. Specifically, the Journal of American Medical Association (JAMA) benchmark criteria and the modified Decision-making Information Support Criteria for Evaluating the Reliability of Non-randomised Studies (mDIS) score were used to evaluate the reliability of the video content. The Global Quality Score (GQS) was used to assess the overall quality, and the Patient Education Materials Assessment Tool for Audio Visual Content (PEMAT-A/U) was used to evaluate understandability and actionability.
    RESULTS: Neurologists were identified as primary contributors of high-quality content, while videos on experimental treatments like deep cervical lymphovenous anastomosis (LVA) generally exhibited lower quality. Videos from emerging first-tier cities and those uploaded by top-tier creators demonstrated superior audience engagement and often higher content quality. A significant positive correlation was found between video duration, audience engagement metrics, and content quality scores.
    CONCLUSIONS: Neurologists play a crucial role in providing reliable AD treatment information on short video platforms. There is an urgent need to improve the quality of content on experimental treatments and to encourage longer, well-referenced videos. Platforms should enhance content moderation and explicitly label experimental therapies to ensure accurate and trustworthy public health education regarding AD.
    Keywords:  Alzheimer’s disease; Global Quality Score; Journal of American Medical Association; Modified DISCERN; Patient Education Materials Assessment Tool; Quality; Short Videos
    DOI:  https://doi.org/10.1016/j.jocn.2026.112144
  32. J Thorac Dis. 2026 May 31. 18(5): 469
       Background: Chronic obstructive pulmonary disease (COPD) continues to pose a significant global health burden, while social media platforms such as TikTok and Bilibili have become important sources of public health information. However, the quality and reliability of video content on these platforms remain unclear. Therefore, this study aimed to evaluate the quality and reliability of COPD-related short videos on TikTok and Bilibili in China.
    Methods: A cross-sectional study was conducted on 275 COPD-related short videos on TikTok and Bilibili. Video characteristics, uploader types, content themes, and presentation formats were extracted. Video quality and reliability were evaluated using the Global Quality Score (GQS), the modified Decision-making Information Support Criteria for Evaluating the Reliability of Non-randomised Studies (mDISCERN) tool, the Journal of the American Medical Association (JAMA) benchmark criteria, and the Video Information and Quality Index (VIQI). Correlation analysis was further performed among video metrics and quality scores.
    Results: Compared to Bilibili, TikTok was more popular, although the length of the videos on TikTok was shorter than that of the videos on Bilibili (P<0.001). Videos on Bilibili had significantly higher GQS scores (P<0.001) and mDISCERN scores (P=0.01) than those on TikTok. Videos from medical practitioners and science communicators generally exhibited higher quality and reliability compared to those from general users across most assessment tools. Medical practitioners scored higher on the JAMA criteria compared to science communicators, with no significant differences observed in the other assessment tools. Negative correlations were found between GQS scores and engagement metrics (likes, comments, and shares), while positive correlations were found between VIQI scores and all four engagement metrics.
    Conclusions: COPD-related videos on TikTok and Bilibili were deficient in quality and reliability. Enhancing the educational value of health information on both platforms requires greater involvement from medical practitioners and science communicators, alongside improved platform-level content regulation.
    Keywords:  Chronic obstructive pulmonary disease (COPD); information quality; public health; social media
    DOI:  https://doi.org/10.21037/jtd-2026-0582
  33. JMIR Form Res. 2026 Jun 18. 10 e82923
       Background: The clinical diagnosis rate of adenoid hypertrophy (AH) in children has increased in recent years, drawing growing attention from parents. Short-video platforms such as Bilibili, TikTok, and Xiaohongshu host a large volume of educational content on this condition. However, the quality and reliability of this information remain unclear.
    Objective: This study aimed to evaluate the completeness, understandability, actionability, reliability, and overall quality of short videos on AH across Bilibili, TikTok, and Xiaohongshu and to explore factors associated with these quality metrics, including uploader characteristics and engagement indicators.
    Methods: We collected 220 videos (Bilibili: n=90, 40.9%; TikTok: n=63, 28.6%; and Xiaohongshu: n=67, 30.5%) using newly registered accounts. Two independent reviewers evaluated video quality using a 6-item content completeness scale (score range 0-12), the Patient Education Materials Assessment Tool for Audiovisual Materials, the modified DISCERN instrument, and the Global Quality Scale (GQS). Interrater reliability was high (Cohen κ=0.77-0.993). Completeness assessed essential informational components of AH. As data were nonnormally distributed, results are presented as median (IQR). Cross-platform comparisons were conducted using the Kruskal-Wallis H test with post hoc Mann-Whitney U tests (with Bonferroni correction). Spearman correlation was used to explore associations between video characteristics (ie, duration and engagement metrics) and quality outcomes. Stepwise linear regression identified independent predictors of overall quality (GQS).
    Results: Video duration differed significantly across platforms (Bilibili: median 113.5, IQR 66.5-271.5 seconds; TikTok: median 73, IQR 44-100 seconds; and Xiaohongshu: median 63, IQR 41-127.5 seconds; P<.001). Bilibili videos demonstrated higher completeness than videos on the other 2 platforms (Bilibili: median 2, IQR 1.5-4.0; TikTok: median 1.5, IQR 0.5-2.0; and Xiaohongshu: median 1.5, IQR 0.5-2.8; P<.001); overall differences were observed for understandability and reliability, but pairwise comparisons did not reach statistical significance after Bonferroni correction. Xiaohongshu videos showed greater actionability than TikTok videos (P=.011). Medical professionals (n=158, 71.8%) had higher understandability than nonprofessionals (n=158, 81.8% vs n=62, 66.7%; P=.001). Video duration positively correlated with completeness (ρ=0.64, 95% CI 0.56-0.71; P<.001). Shares showed weak positive correlations with completeness and actionability. Stepwise regression identified understandability (using the Patient Education Materials Assessment Tool-Understandability) as the strongest independent predictor of overall quality (GQS), followed by actionability, video duration, and uploader type; engagement metrics and platform did not enter the final model.
    Conclusions: The quality of AH-related videos on Chinese short-video platforms is generally suboptimal. Bilibili offers higher completeness, while Xiaohongshu excels in actionability and interactivity. Understandability is the strongest predictor of overall quality, surpassing uploader type and engagement metrics. To improve online health information, platforms should move beyond engagement-based algorithms, and health care professionals should prioritize clear, actionable content.
    Keywords:  adenoid hypertrophy; health information; quality assessment; reliability; short videos
    DOI:  https://doi.org/10.2196/82923
  34. BMC Urol. 2026 Jun 17.
       BACKGROUND: Prostatitis, particularly chronic prostatitis/chronic pelvic pain syndrome (CP/CPPS), remains a common yet challenging condition in urological practice, largely due to its heterogeneous etiology, diagnostic complexity, and high levels of patient anxiety. Increasingly, patients seek information about prostatitis from short-video platforms; however, the quality and clinical relevance of such content remain unclear.
    METHODS: A cross-sectional content analysis was conducted to evaluate prostatitis-related videos on TikTok and Bilibili. Video characteristics, engagement metrics, uploader type, and content themes were extracted. Information quality and reliability were assessed using the Global Quality Scale (GQS) and the modified DISCERN (mDISCERN) instrument, while content completeness was evaluated using a six-domain clinical information framework. Comparisons were performed across platforms and uploader categories, and correlations between engagement metrics and quality indicators were analyzed.
    RESULTS: A total of 223 videos were included. Overall information quality and reliability were moderate. Content distribution was heavily skewed toward treatment-related information, whereas disease awareness, diagnostic principles, and preventive or long-term management content were markedly underrepresented. Videos uploaded by healthcare professionals demonstrated significantly higher quality, reliability, and content completeness than those from non-professional sources. TikTok videos achieved higher overall quality and reliability scores compared with Bilibili. Engagement metrics showed only weak correlations with information quality and reliability indicators.
    CONCLUSION: Prostatitis-related information on short-video platforms is characterized by moderate quality and substantial structural imbalance, with an overemphasis on treatment and insufficient coverage of diagnostic reasoning and long-term management. From a urological clinical perspective, incomplete or misleading online information may shape patient expectations and has the potential to complicate clinical communication and shared decision-making, particularly regarding diagnostic evaluation, antibiotic expectations, and long-term management. Greater involvement of urologists in structured digital patient education and the promotion of clinically accurate content are needed to mitigate misinformation and improve patient understanding.
    Keywords:  Bilibili; Digital health; Health information quality; Prostatitis; Short video platforms; TikTok
    DOI:  https://doi.org/10.1186/s12894-026-02225-y
  35. J Health Commun. 2026 Jun 15. 1-11
      Past research has typically examined online health information seeking behaviors without differentiating the types of seeking occurring across diverse online platforms. However, the breadth of sources and rising popularity of new online venues such as generative AI applications have accentuated the need to understand platform-specific online seeking. This study investigated the antecedents and correlates of platform-specific cervical cancer information seeking among Black and White women ages 21 to 65 in the U.S. (for whom routine cervical cancer screening is recommended). Findings revealed that health websites and search engines were the most commonly used sources, followed by e-consultations, social media, generative AI, and virtual assistants. Black women used each of these platforms more frequently than White women; increased frequency was explained by predictors informed by the theory of planned behavior. Cervical cancer information seeking from health websites, search engines, generative AI, and virtual assistants was positively associated with cervical cancer screening guideline adherence among Black but not White respondents.
    Keywords:  Cervical cancer; cancer screening; information seeking; online sources
    DOI:  https://doi.org/10.1080/10810730.2026.2689090
  36. J Med Internet Res. 2026 Jun 16. 28 e82081
       BACKGROUND: Online health information seeking (OHIS) has become a central component of chronic disease management within an increasingly interactive, algorithm-mediated digital ecosystem. For individuals with diabetes, ongoing self-management demands create sustained needs for accessible, actionable health information. Although prior reviews have described general information-seeking behaviors, few have integrated technological evolution, multilevel determinants, and equity considerations specific to diabetes.
    OBJECTIVE: This scoping review maps patterns of OHIS among individuals with diabetes, identifies the types of information sought, synthesizes the multilevel determinants of OHIS, and explores temporal shifts across major phases of digital health development.
    METHODS: This scoping review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) and Preferred Reporting Items for Systematic Reviews and Meta-Analyses literature search extension (PRISMA-S) reporting guidelines and was guided by the Sample, Phenomenon of Interest, Design, Evaluation, Research type framework. Five electronic databases (PubMed, Scopus, Web of Science, CINAHL, and Embase) were systematically searched for English-language empirical studies from inception to May 4, 2026. Eligible studies included empirical research investigating OHIS behaviors among individuals with type 1 diabetes, type 2 diabetes, or gestational diabetes. Data were extracted using a standardized charting form and synthesized descriptively. Determinants were organized according to the Social Ecological Model, and qualitative findings were analyzed using content analysis. Studies were stratified into 3 periods reflecting shifts in digital infrastructure: early web environments (2002-2010), expansion of social media and mobile technologies (2011-2018), and integrated digital and artificial intelligence (AI)-enabled ecosystems (2019-2025).
    RESULTS: Eighty-one studies from 32 countries met the inclusion criteria. The use of digital sources diversified over time. Early studies emphasized search engines and institutional websites, whereas later studies increasingly reported engagement with social media platforms and online communities. Mobile health apps and generative AI chatbots appeared in recent publications, although evidence on AI use remained limited. The most frequently sought content included self-management and lifestyle guidance, general diabetes knowledge, and treatment-related information. Determinants of OHIS operated across multiple levels. At the individual level, younger age, greater educational attainment, higher income, and better eHealth literacy were associated with increased engagement, while psychological factors such as perceived knowledge gaps and a desire for autonomy motivated searching. Interpersonal influences included peer support and clinician communication. Organizational and environmental factors encompassed health care access, digital infrastructure, information quality, and platform characteristics. Persistent disparities were observed among older adults and socioeconomically disadvantaged groups.
    CONCLUSIONS: This review synthesizes OHIS among individuals with diabetes through the lenses of technological evolution, multilevel determinants, and digital health equity. Unlike previous reviews that focused on specific platforms or general information-seeking behaviors, it maps the transition from web-based resources to social media and emerging AI-enabled ecosystems. This temporally informed synthesis advances understanding of digital engagement in diabetes self-management, identifies key evidence gaps, and informs clinical, organizational, and policy strategies to promote equitable access to trustworthy online health information.
    Keywords:  behavior; diabetes; digital; health; information; online; seeking
    DOI:  https://doi.org/10.2196/82081