bims-librar Biomed News
on Biomedical librarianship
Issue of 2025-01-05
eleven papers selected by
Thomas Krichel, Open Library Society



  1. Turk J Ophthalmol. 2024 Dec 31. 54(6): 313-317
       Objectives: To assess the appropriateness and readability of large language model (LLM) chatbots' answers to frequently asked questions about refractive surgery.
    Materials and Methods: Four commonly used LLM chatbots were asked 40 questions that patients frequently ask about refractive surgery. The appropriateness of the answers was evaluated by two experienced refractive surgeons. Readability was evaluated with five different indices.
    Results: Based on the responses generated by the LLM chatbots, 45% (n=18) of the answers given by ChatGPT 3.5 were correct, while this rate was 52.5% (n=21) for ChatGPT 4.0, 87.5% (n=35) for Gemini, and 60% (n=24) for Copilot. In terms of readability, the responses of all LLM chatbots were very difficult to read, requiring a university-level education.
    Conclusion: These LLM chatbots, which are finding a place in our daily lives, can occasionally provide inappropriate answers. Although all were difficult to read, Gemini was the most successful LLM chatbot in terms of generating appropriate answers and was relatively better in terms of readability.
    Keywords:  Artificial intelligence; ChatGPT; Copilot; Gemini; chatbots; refractive surgery FAQs
    DOI:  https://doi.org/10.4274/tjo.galenos.2024.28234
  2. Cureus. 2024 Nov;16(11): e74876
       INTRODUCTION: Artificial intelligence (AI) plays a significant role in creating brochures on radiological procedures for patient education. Thus, this study aimed to evaluate the responses generated by ChatGPT (San Francisco, CA: OpenAI) and Google Gemini (Mountain View, CA: Google LLC) on abdominal ultrasound, abdominal CT scan, and abdominal MRI.
    METHODOLOGY: A cross-sectional study was conducted over one week in June 2024 to evaluate the quality of patient information brochures produced by ChatGPT 3.5 and Google Gemini 1.5 Pro. The study assessed variables including word count, sentence count, average words per sentence, average syllables per sentence, grade level, and ease score using the Flesch-Kincaid calculator. Similarity percentage was evaluated using Quillbot (Chicago, IL: Quillbot Inc.), and reliability was measured using the modified DISCERN score. Statistical analysis was conducted using R version 4.3.2 (Vienna, Austria: R Foundation for Statistical Computing).
    RESULTS: There was no significant difference between the chatbots in sentence count (p=0.8884), average words per sentence (p=0.1984), average syllables per sentence (p=0.3868), ease score (p=0.1812), similarity percentage (p=0.8110), or reliability score (p=0.6495). However, ChatGPT had a statistically significantly higher word count (p=0.0409) and grade level (p=0.0482) than Google Gemini. P-values <0.05 were considered significant.
    CONCLUSIONS: Both ChatGPT and Google Gemini demonstrate the ability to generate content of comparable readability and reliability. Nevertheless, the noticeable disparities in word count and grade level underscore a crucial area for improvement in customizing content to accommodate varying levels of patient literacy.
    Keywords:  abdominal ct; abdominal ultrasound; artificial intelligence; chatgpt; educational tool; google gemini; mri of abdomen; patient education brochure
    DOI:  https://doi.org/10.7759/cureus.74876
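    For readers unfamiliar with the Flesch-Kincaid metrics used in the study above, the sketch below shows how the two scores are typically computed from word, sentence, and syllable counts. It is a minimal illustration, not the calculator the authors used: the syllable counter is a naive vowel-group heuristic and the sample brochure sentence is invented.
        import re

        def count_syllables(word: str) -> int:
            # Naive heuristic: count runs of vowels; dedicated calculators use
            # dictionary- or rule-based syllable counts instead.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def flesch_metrics(text: str) -> tuple[float, float]:
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            words = re.findall(r"[A-Za-z']+", text)
            syllables = sum(count_syllables(w) for w in words)
            wps = len(words) / len(sentences)          # average words per sentence
            spw = syllables / len(words)               # average syllables per word
            ease = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease
            grade = 0.39 * wps + 11.8 * spw - 15.59    # Flesch-Kincaid Grade Level
            return ease, grade

        # Hypothetical brochure text, for illustration only.
        print(flesch_metrics("An abdominal ultrasound is a painless scan. It uses sound waves to picture your organs."))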
  3. Turk J Ophthalmol. 2024 Dec 31. 54(6): 330-336
       Objectives: This study compared the readability of patient education materials from the Turkish Ophthalmological Association (TOA) retinopathy of prematurity (ROP) guidelines with those generated by large language models (LLMs). The ability of GPT-4.0, GPT-4o mini, and Gemini to produce patient education materials was evaluated in terms of accuracy and comprehensiveness.
    Materials and Methods: Thirty questions from the TOA ROP guidelines were posed to GPT-4.0, GPT-4o mini, and Gemini. Their responses were then reformulated using the prompts "Can you revise this text to be understandable at a 6th-grade reading level?" (P1 format) and "Can you make this text easier to understand?" (P2 format). The readability of the TOA ROP guidelines and the LLM-generated responses was analyzed using the Ateşman and Bezirci-Yılmaz formulas. Additionally, ROP specialists evaluated the comprehensiveness and accuracy of the responses.
    Results: The TOA brochure was found to have a reading level above the 6th-grade level recommended in the literature. Materials generated by GPT-4.0 and Gemini had significantly greater readability than the TOA brochure (p<0.05). Adjustments made in the P1 and P2 formats improved readability for GPT-4.0, while no significant change was observed for GPT-4o mini and Gemini. GPT-4.0 had the highest scores for accuracy and comprehensiveness, while Gemini had the lowest.
    Conclusion: GPT-4.0 appeared to have greater potential for generating more readable, accurate, and comprehensive patient education materials. However, when integrating LLMs into the healthcare field, regional medical differences and the accuracy of the provided information must be carefully assessed.
    Keywords:  Retinopathy of prematurity; large language models; patient education; readability
    DOI:  https://doi.org/10.4274/tjo.galenos.2024.58295
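    The Ateşman index referenced above is a Flesch-style readability formula adapted for Turkish. A minimal sketch follows, assuming the commonly cited coefficients (these should be verified against Ateşman's original publication); the Bezirci-Yılmaz formula, which additionally weights words by syllable length, is not reproduced here.
        def atesman_readability(avg_syllables_per_word: float, avg_words_per_sentence: float) -> float:
            # Ateşman readability for Turkish text (higher scores = easier to read),
            # using the commonly cited coefficients.
            return 198.825 - 40.175 * avg_syllables_per_word - 2.610 * avg_words_per_sentence

        # Example: text averaging 2.6 syllables per word and 9 words per sentence.
        print(atesman_readability(2.6, 9.0))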
  4. PLoS One. 2025 ;20(1): e0316635
       BACKGROUND AND PURPOSE: The most widely used social media platform for video content is YouTube™. The present study evaluated the quality of information on YouTube™ about artificial intelligence (AI) in dentistry.
    METHODS: This cross-sectional study used YouTube™ (https://www.youtube.com) to search for videos. The terms used for the search were "artificial intelligence in dentistry," "machine learning in dental care," and "deep learning in dentistry." The accuracy and reliability of the information source were assessed using the DISCERN score. The quality of the videos was evaluated using the modified Global Quality Score (mGQS) and the Journal of the American Medical Association (JAMA) score.
    RESULTS: The analysis of 91 YouTube™ videos on AI in dentistry revealed insights into video characteristics, content, and quality. On average, videos were 22.45 minutes long and received 1715.58 views and 23.79 likes. The topics were mainly centered on general dentistry (66%), with radiology (18%), orthodontics (9%), prosthodontics (4%), and implants (3%). DISCERN and mGQS scores were higher for videos uploaded by healthcare professionals and for educational content videos (P<0.05). DISCERN scores correlated strongly with video source (0.75) and with JAMA scores (0.77), while the correlation between video content and mGQS was moderate (0.66).
    CONCLUSION: YouTube™ hosts informative and moderately reliable videos on AI in dentistry. Dental students, dentists, and patients can use these videos to learn about and teach others about artificial intelligence in dentistry. Professionals should upload more videos to enhance the reliability of the content.
    DOI:  https://doi.org/10.1371/journal.pone.0316635
  5. BMC Oral Health. 2025 Jan 02. 25(1): 9
       BACKGROUND: The use of ChatGPT in the field of health has recently gained popularity. In the field of dentistry, ChatGPT can provide services in areas such as dental education and patient education. The aim of this study was to evaluate the quality, readability, and originality of pediatric patient/parent information and academic content produced by ChatGPT in the field of pediatric dentistry.
    METHODS: A total of 60 questions, consisting of pediatric patient/parent questions and academic questions, were asked of ChatGPT for each topic (dental trauma, fluoride, and tooth eruption/oral health). The modified Global Quality Scale (scored from 1, poor quality, to 5, excellent quality) was used to evaluate the quality of the answers, and the Flesch Reading Ease and Flesch-Kincaid Grade Level were used to evaluate readability. A similarity index was used to compare the quantitative similarity of the answers given by the software with guidelines and academic references in different databases.
    RESULTS: The evaluation of answer quality revealed an average score of 4.3 ± 0.7 for pediatric patient/parent questions and 3.7 ± 0.8 for academic questions, indicating a statistically significant difference (p < 0.05). Academic questions regarding dental trauma received the lowest scores (p < 0.05). However, no significant differences were observed in readability and similarity between ChatGPT answers for different question groups and topics (p > 0.05).
    CONCLUSIONS: In pediatric dentistry, ChatGPT provides quality information to patients/parents. However, its answers are difficult for patients/parents to read, and although the similarity rate is acceptable, ChatGPT needs to be improved in order to interact with people more efficiently and fluently.
    Keywords:  Artificial intelligence; Fluorides; Oral health; Pediatric dentistry; Public health informatics; Tooth eruption; Tooth injuries
    DOI:  https://doi.org/10.1186/s12903-024-05393-1
  6. Cureus. 2024 Nov;16(11): e74698
       INTRODUCTION: With advances in AI and machine learning, platforms like OpenAI's ChatGPT are emerging as educational resources. While these platforms offer easy access and user-friendliness due to their personalized conversational responses, concerns about the accuracy and reliability of their information persist. As one of the most common surgical procedures performed by plastic surgeons worldwide, breast reduction surgery (BRS) offers relief for the physical and emotional burdens of large breasts. However, like any surgical procedure, it can raise a multitude of questions and anxiety.
    METHODS: To evaluate the quality of medical information provided by ChatGPT in response to common patient inquiries about breast reduction surgery, we developed a 15-question questionnaire with typical patient questions about BRS. These questions were presented to ChatGPT, and the answers were compiled and presented to five board-certified plastic surgeons. Each specialist categorized each response as (1) Appropriate (the response accurately reflects current medical knowledge and best practices for BRS); (2) No, not thorough (the response lacks sufficient detail to be a helpful educational resource); or (3) No, inaccurate (the response contains misleading or incorrect information).
    RESULTS: A total of 75 survey responses were obtained, with five experts each analyzing 15 answers from ChatGPT. Of these, 69 (92%) responses were determined to be accurate. However, six (8%) responses were concerning to our experts: four (5.3%) lacked detail, and two (2.7%) were found to be inaccurate. Chi-square analysis revealed no statistical significance in the distribution of responses categorized as "accurate" versus "not thorough/inaccurate," and "not thorough" versus "inaccurate" (p=0.778 and p=0.306, respectively).
    CONCLUSION: While ChatGPT can provide patients with basic background knowledge on BRS and empower patients to ask more informed questions during consultations, it should not replace the consultation and expert guidance of a board-certified plastic surgeon.
    Keywords:  artificial intelligence (ai); breast reduction surgery (brs); chatgpt; machine learning; patient education
    DOI:  https://doi.org/10.7759/cureus.74698
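    The chi-square comparison reported above is not fully specified in the abstract, so the sketch below only illustrates the general form of such a test: a contingency table of rating categories by rater, analyzed with scipy. The per-surgeon counts are hypothetical placeholders chosen to match the published totals (69 accurate, 6 other), not the study's data.
        from scipy.stats import chi2_contingency

        # Hypothetical 2x5 table: rating category (rows) by surgeon (columns).
        observed = [
            [14, 13, 14, 14, 14],  # rated appropriate (hypothetical split of the 69)
            [1,  2,  1,  1,  1],   # rated not thorough or inaccurate (hypothetical split of the 6)
        ]
        chi2, p, dof, expected = chi2_contingency(observed)
        print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")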
  7. Hand Surg Rehabil. 2024 Dec 27. pii: S2468-1229(24)00513-9. [Epub ahead of print] 102073
       INTRODUCTION: ChatGPT has been increasingly utilized to create, simplify, and revise hand surgery patient education materials. While significant research has examined the quality and readability of ChatGPT-derived hand surgery patient education, the patient perspective has not previously been evaluated. This study compared patient-reported clarity and readability grades, as well as patient preferences, for carpal tunnel surgery educational information from medical education websites and ChatGPT.
    METHODS: Patients without a history of carpal tunnel release surgery at two orthopaedic hand surgery outpatient clinics were asked to complete an anonymous survey which gathered demographic information and included a blinded educational passage on carpal tunnel release surgery from ChatGPT, WebMD, or Mayo Clinic. Patients graded the blinded passages regarding clarity, readability, length, likeliness to recommend to others, and overall satisfaction with the education material.
    RESULTS: There were no significant differences in clarity (p = 0.682), readability (p = 0.328), or likeliness to recommend to others (p = 0.106) between the different educational sources. When stratified by age, younger patients (under 55) were more likely to recommend Mayo Clinic over other resources (p = 0.002). When further stratified to include only those who reported previously using websites for healthcare information, patients tended to have a higher likelihood of recommending Mayo Clinic compared to other sources, but this was not a statistically significant difference.
    CONCLUSIONS: There were no differences in clarity, readability, or preference ratings between patient education materials that were produced by ChatGPT, WebMD and Mayo Clinic. However, while ChatGPT-generated materials are comparable in quality based on patient ratings, younger patients may still favor well-established sources for medical education. This information regarding patient preferences provides valuable insights for hand surgeons when selecting suitable educational resources for their patients.
    Keywords:  ChatGPT; Mayo Clinic; WebMD; carpal tunnel release; patient education; patient preference
    DOI:  https://doi.org/10.1016/j.hansur.2024.102073
  8. Cureus. 2024 Dec;16(12): e76526
       Introduction: The internet is a crucial source of health information, including cancer-related topics, but the quality and reliability of these resources can vary, affecting patient decision-making.
    Objectives: This study aimed to evaluate the quality of thyroid cancer-related websites in the Arabic language, using the DISCERN tool, and explore the content and sources provided by different types of websites.
    Methods: A total of 78 websites were included after excluding 21 based on predefined criteria (e.g., duplicates, non-functional uniform resource locators (URLs)). The websites were categorized into commercial, non-profit, and individual types. Two independent reviewers assessed the websites using the DISCERN tool. Interrater agreement was measured using the k-score. A one-way analysis of variance (ANOVA) was used to compare DISCERN scores across website types, and Spearman's rank correlation was used to analyze the relationship between website ranking and DISCERN scores.
    Results: Almost all websites included a definition of thyroid cancer. Additionally, 15 websites (19.2%) covered the definition, clinical presentation, risk factors, diagnosis, and treatment, while 14 websites (17.9%) offered only clinical presentation, diagnosis, and treatment, and 11 websites (14.1%) offered other combinations of similar content. However, there was a lack of information regarding prognosis and predictors of outcomes following thyroid cancer surgery. The average overall DISCERN score for the 78 websites was 42.65 ± 12.35. Statistically significant differences were found in DISCERN scores across website types, with non-profit websites scoring the highest (38.93 ± 14.12), followed by commercial (37.67 ± 10.34) and individual websites (28.63 ± 10.02). A significant negative correlation was also found between website rank and DISCERN scores (Spearman's r = -0.38, p < 0.0001).
    Conclusion: The study found that non-profit websites provide higher-quality thyroid cancer information compared to commercial and individual sites. Website ranking also affects content quality, emphasizing the importance of patients assessing online resources critically. Health organizations are encouraged to improve the visibility and quality of trustworthy information.
    Keywords:  arabic; discern instrument; online health information; quality; thyroid cancer
    DOI:  https://doi.org/10.7759/cureus.76526
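    The comparisons reported above combine a one-way ANOVA across website types with a Spearman rank correlation between search ranking and DISCERN score. A minimal sketch of that analysis pattern follows; the per-website scores, group sizes, and rank assignments are invented placeholders, not the study's data.
        from scipy.stats import f_oneway, spearmanr

        # Hypothetical per-website total DISCERN scores, grouped by website type.
        nonprofit  = [55, 48, 42, 31, 39]
        commercial = [40, 35, 46, 28, 33]
        individual = [22, 31, 40, 25, 26]

        f_stat, p_anova = f_oneway(nonprofit, commercial, individual)  # one-way ANOVA across types

        # Correlation between search-result rank (1 = top hit) and DISCERN score.
        ranks = list(range(1, 16))
        scores = nonprofit + commercial + individual
        rho, p_rho = spearmanr(ranks, scores)

        print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.3f}; Spearman rho={rho:.2f}, p={p_rho:.3f}")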
  9. Int J Gen Med. 2024 ;17 6487-6493
       Purpose: Arabic is the primary language used in the Middle East, where sickle cell disease (SCD) is prevalent. This study aims to quantify Arabic web educational materials for patients with SCD and provide a descriptive standardized assessment.
    Methods and Materials: This retrospective, descriptive study analyzed Arabic websites on SCD using the DISCERN instrument and the JAMA benchmark.
    Results: We evaluated the quality and reliability of 27 Arabic SCD-related websites. Regarding website content, all 27 (100%) defined sickle cell disease, whereas 25 (96.30%) and 24 (92.59%) illustrated its manifestations and treatments, respectively. However, only 12 (44.44%) discussed prevention of the disease through premarital genetic screening and counseling. According to the DISCERN score, 11 (40.74%) websites were of low quality, while 16 (59.26%) were of moderate quality. On the other hand, the JAMA score revealed that only 2 (7.41%) websites were of high reliability, while the majority, 25 (92.59%), were of low reliability. Additionally, analysis revealed a weak positive correlation between the DISCERN and JAMA scores (correlation coefficient of 0.19). There were no statistically significant differences in the DISCERN and JAMA scores between websites on the first page of the search results and those on other pages (p = 0.941 and 0.359, respectively).
    Conclusion: Empowering patients with comprehensive knowledge about various disease aspects is a pivotal component in the effective management of SCD and, consequently, improving its outcomes. Regrettably, there is a notable scarcity of credible and high-quality written web-based health resources available in Arabic despite significant advancements in other clinical aspects of SCD. Augmenting the existing online resources in Arabic patients' native language could yield substantial enhancements in patient care across various dimensions.
    Keywords:  Arabic language; education; sickle cell disease
    DOI:  https://doi.org/10.2147/IJGM.S495248
  10. Comput Inform Nurs. 2025 Jan 01. pii: e01217. [Epub ahead of print]43(1):
      This descriptive study aimed to investigate the content, quality, and reliability of YouTube videos related to endotracheal tube aspiration. YouTube was searched using the keywords "endotracheal aspiration" and "endotracheal tube aspiration," and 22 videos were included in the study. The content of the selected videos was assessed using the Endotracheal Tube Aspiration Skill Form, their reliability was measured using the DISCERN survey, and their quality was measured using the Global Quality Scale. Of the 22 videos that met the inclusion criteria, 18 (81.8%) were educational, and four (18.2%) were product promotional videos. In pairwise comparisons, the coverage score for open aspiration was higher for educational videos than for product promotion videos (P < .005). Useful videos had higher reliability and quality scores than misleading videos (P < .05). In addition, the reliability and quality scores of videos uploaded by official institutions were significantly higher than those of videos uploaded by individual users (P < .05). The study found that the majority of the endotracheal tube aspiration training videos reviewed were published by individual users, and a significant proportion of these videos had low reliability and quality.
    DOI:  https://doi.org/10.1097/CIN.0000000000001217
  11. BMC Public Health. 2024 Dec 30. 24(1): 3606
       BACKGROUND: At present, the participation rate in cancer screening is still not ideal, and the lack of screening information or misunderstanding of information is an important factor hindering cancer screening behaviour. Therefore, a systematic synthesis of information needs related to cancer screening is critical.
    METHODS: On July 23, 2024, we searched the Cochrane Library, MEDLINE (Ovid), Embase, EBSCO, PsycINFO, Scopus, ProQuest, PubMed, Web of Science, and CINAHL to collect qualitative or mixed-methods studies on the information needs of cancer screening. We also searched for grey literature on OpenGrey and Google. Data were synthesised using Sandelowski and Barroso's framework. A top-down approach was adopted to group and synthesise the codes and then generate analytical themes.
    RESULTS: A total of 37 studies were included. The analysis covered the content of cancer screening-related information needs, cancer-specific information needs, requirements and preferences for information, and factors associated with information-seeking behaviour. Based on the event timeline, we summarised the information needs of prospective screening participants into four themes. Their information needs focus on disease risk factors, signs and symptoms, the importance of screening, the benefits and harms of screening, the detailed screening process, and screening results and their explanation. Regarding cancer-specific information needs, we summarised the specific information needs for cervical, breast, colorectal, and lung cancer. Drawing on the Comprehensive Model of Information Seeking, we synthesised the requirements and preferences for information under the themes of editorial tone, communication potential, recommended information channels, and recommended sources. The information-seeking behaviours of screening participants are mainly passive attention and active searching. The most common factors leading to passive attention are demographic factors and fear of cancer, while the most common reason for actively searching for information is a lack of information.
    CONCLUSIONS: The list of information needs identified in this review can serve as a reference for health professionals and information service providers before carrying out screening-related work to help cancer screening participants obtain valuable information.
    Keywords:  Cancer screening; Information needs; Information-seeking behaviour; Systematic review
    DOI:  https://doi.org/10.1186/s12889-024-21096-2