bims-arines Biomed News
on AI in evidence synthesis
Issue of 2025-11-02
seven papers selected by
Farhad Shokraneh, Systematic Review Consultants LTD



  1. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70058
       Introduction: Systematic literature reviews (SLRs) of randomized clinical trials (RCTs) underpin evidence-based medicine but can be limited by the intensive resource demands of data extraction. Recent advances in accessible large language models (LLMs) hold promise for automating this step; however, testing across different outcomes and disease areas remains limited.
    Methods: This study developed prompt engineering strategies for GPT-4o to extract data from RCTs across three disease areas: non-small cell lung cancer, endometrial cancer and hypertrophic cardiomyopathy. Prompts were iteratively refined during the development phase, then tested on unseen data. Performance was evaluated via comparison to human extraction of the same data, using F1 scores, precision, recall and percentage accuracy.
    Results: The LLM was highly effective for extracting study and baseline characteristics, often equaling human performance, with test F1 scores exceeding 0.85. Complex efficacy and adverse event data proved more challenging, with test F1 scores ranging from 0.22 to 0.50. Transferability of prompts across disease areas was promising but varied, highlighting the need for disease-specific refinement.
    Conclusion: Our findings demonstrate the potential of LLMs, guided by rigorous prompt engineering, to augment the SLR process. However, human oversight remains essential, particularly for complex and nuanced data. As these technologies evolve, continued validation of AI tools will be necessary to ensure accuracy and reliability and to safeguard the quality of evidence synthesis.
    Keywords:  artificial intelligence; data extraction; large‐language model; prompt engineering; systematic literature review
    DOI:  https://doi.org/10.1002/cesm.70058
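    A minimal Python sketch of the field-level scoring used to judge extraction quality in entry 1, assuming an exact-match rule after simple normalisation; the field names and values are invented and this is not the authors' pipeline.
      def normalise(value):
          # Collapse whitespace and lower-case so trivial formatting differences are not counted as errors.
          return " ".join(str(value).split()).lower()

      def field_level_scores(llm_fields, human_fields):
          # Human extraction is the reference; an LLM value is a true positive only if it matches after normalisation.
          tp = sum(1 for k, v in human_fields.items()
                   if k in llm_fields and normalise(llm_fields[k]) == normalise(v))
          fp = sum(1 for k in llm_fields
                   if k not in human_fields or normalise(llm_fields[k]) != normalise(human_fields[k]))
          fn = sum(1 for k in human_fields
                   if k not in llm_fields or normalise(llm_fields[k]) != normalise(human_fields[k]))
          precision = tp / (tp + fp) if tp + fp else 0.0
          recall = tp / (tp + fn) if tp + fn else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          return precision, recall, f1

      # Invented example: two baseline-characteristic fields match, one does not.
      human = {"n_randomised": "451", "median_age": "63", "ecog_0_1_pct": "94"}
      llm = {"n_randomised": "451", "median_age": "63", "ecog_0_1_pct": "49"}
      print(field_level_scores(llm, human))  # -> roughly (0.67, 0.67, 0.67)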
  2. Cochrane Evid Synth Methods. 2025 Nov;3(6): e70059
       Introduction: While artificial intelligence (AI) tools have been utilized for individual stages within the systematic literature review (SLR) process, no tool has previously been shown to support each critical SLR step. In addition, the need for expert oversight has been recognized to ensure the quality of SLR findings. Here, we describe a complete methodology for utilizing our AI SLR tool with human-in-the-loop curation workflows, as well as AI validations, time savings, and approaches to ensure compliance with best review practices.
    Methods: SLRs require completing Search, Screening, and Extraction from relevant studies, with meta-analysis and critical appraisal where relevant. We present a full methodological framework for completing SLRs using our AutoLit software (Nested Knowledge). This system integrates AI models into the central SLR steps: Search strategy generation, Dual Screening of Titles/Abstracts and Full Texts, and Extraction of qualitative and quantitative evidence. The system also offers manual Critical Appraisal and Insight drafting and fully automated Network Meta-analysis. Validations comparing AI performance with experts are reported and, where relevant, time savings and 'rapid review' alternatives to the SLR workflow.
    Results: Search strategy generation with the Smart Search AI can turn a Research Question into full Boolean strings with 76.8% and 79.6% Recall in two validation sets. Supervised machine learning tools can achieve 82-97% Recall in reviewer-level Screening. Population, Interventions/Comparators, and Outcomes (PICOs) extraction achieved an F1 of 0.74; accuracies for study type, location, and size were 74%, 78%, and 91%, respectively. Time savings of 50% in Abstract Screening and 70-80% in qualitative extraction were reported. Extraction of user-specified qualitative and quantitative tags and data elements remains exploratory and requires human curation for SLRs.
    Conclusion: AI systems can support high-quality, human-in-the-loop execution of key SLR stages. Transparency, replicability, and expert oversight are central to the use of AI SLR tools.
    Keywords:  artificial intelligence; evidence synthesis; human‐in‐the‐loop; meta‐analysis; systematic literature review
    DOI:  https://doi.org/10.1002/cesm.70059
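    A small, generic illustration of the human-in-the-loop idea in entry 2 (not Nested Knowledge's actual logic): records the screening model is confident about are advanced or excluded automatically, and everything else is queued for a human reviewer. The thresholds, record structure, and probabilities are invented.
      from dataclasses import dataclass

      @dataclass
      class Record:
          record_id: str
          title: str
          include_probability: float  # from a supervised screening model

      def route(records, include_at=0.95, exclude_at=0.05):
          # Split records into auto-include, auto-exclude, and human-review queues.
          auto_include, auto_exclude, needs_human = [], [], []
          for r in records:
              if r.include_probability >= include_at:
                  auto_include.append(r)
              elif r.include_probability <= exclude_at:
                  auto_exclude.append(r)
              else:
                  needs_human.append(r)
          return auto_include, auto_exclude, needs_human

      records = [
          Record("r1", "RCT of drug A vs placebo", 0.98),
          Record("r2", "Narrative review of drug A", 0.02),
          Record("r3", "Conference abstract, unclear design", 0.40),
      ]
      inc, exc, human = route(records)
      print(len(inc), len(exc), len(human))  # -> 1 1 1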
  3. J Clin Epidemiol. 2025 Oct 27. pii: S0895-4356(25)00360-9. [Epub ahead of print] 112027
      Systematic reviews are widely recognized as the cornerstone of evidence-based health decision-making. Individual studies are synthesized into coherent summaries, providing clarity amid the complexity of modern scientific research. However, the systematic review model is facing a crisis. The rapid increase in the number of reviews has paradoxically undermined their utility, with many becoming outdated, redundant, or of low quality. Despite methodological advancements and the introduction of rapid and living reviews, the evidence ecosystem remains fragmented and inefficient. In this paper, it is argued that current reform efforts fall short because the crisis is not addressed holistically, and the concept of "sustainable knowledge" is proposed to frame evidence synthesis through the lens of sustainability. Lessons learned during the COVID-19 pandemic and emerging innovations are drawn upon to reframe the problem as systemic, calling for a reconsideration of the entire lifecycle of systematic reviews, including their creation, updating, and application in practice. Stronger networks of collaboration are encouraged, alongside careful use of automation and artificial intelligence where genuine value is added. It is suggested that academic incentives be reshaped so that quality and relevance are prioritized over the sheer number of publications. It is proposed that by adopting sustainability as a guiding principle, systematic reviews can better fulfill their purpose of providing timely, high-quality, and actionable evidence for health decision-making.
    Keywords:  Evidence synthesis; Sustainable knowledge; Systematic reviews
    DOI:  https://doi.org/10.1016/j.jclinepi.2025.112027
  4. J Med Internet Res. 2025 Oct 29;27: e79379
       BACKGROUND: Large language models (LLMs) coupled with real-time web retrieval are reshaping how clinicians and patients locate medical evidence, and as major search providers fuse LLMs into their interfaces, this hybrid approach might become the new "gateway" to the internet. However, open-web retrieval exposes models to nonprofessional sources, risking hallucinations and factual errors that might jeopardize evidence-based care.
    OBJECTIVE: We aimed to quantify the impact of guideline-domain whitelisting on the answer quality of 3 publicly available Perplexity web-based retrieval-augmented generation (RAG) models and compare their performance using a purpose-built, biomedical literature RAG system (OpenEvidence).
    METHODS: We applied a validated 130-item question set derived from the American Academy of Neurology (AAN) guidelines (65 factual and 65 case-based). Perplexity Sonar, Sonar-Pro, and Sonar-Reasoning-Pro were each queried 4 times per question with open-web retrieval and again with retrieval restricted to aan.com and neurology.org ("whitelisted"). OpenEvidence was queried 4 times. Two neurologists, blinded to condition, scored each response (0=wrong, 1=inaccurate, and 2=correct); any disagreements that arose were resolved by a third neurologist. Ordinal logistic models were used to assess the influence of question type and source category (AAN or neurology vs nonprofessional) on accuracy.
    RESULTS: From the 3640 LLM answers that were rated (interrater agreement: κ=0.86), correct-answer rates were as follows (open vs whitelisted, respectively): Sonar, 60% vs 78%; Sonar-Pro, 80% vs 88%; and Sonar-Reasoning-Pro, 81% vs 89%; for OpenEvidence, the correct-answer rate was 82%. A Friedman test on modal scores across the 7 configurations was significant (χ²₆=73.7; P<.001). Whitelisting improved mean accuracy on the 0 to 2 scale by 0.23 for Sonar (95% CI 0.12-0.34), 0.08 for Sonar-Pro (95% CI 0.01-0.16), and 0.08 for Sonar-Reasoning-Pro (95% CI 0.02-0.13). Including ≥1 nonprofessional source halved the odds of a higher rating in Sonar (odds ratio [OR] 0.50, 95% CI 0.37-0.66; P<.001), whereas citing an AAN or neurology document doubled it (OR 2.18, 95% CI 1.64-2.89; P<.001). Furthermore, factual questions outperformed case vignettes across Perplexity models (ORs ranged from 1.95, 95% CI 1.28-2.98 [Sonar + whitelisting] to 4.28, 95% CI 2.59-7.09 [Sonar-Reasoning-Pro]; all P<.01) but not for OpenEvidence (OR 1.44, 95% CI 0.92-2.27; P=.11).
    CONCLUSIONS: Restricting retrieval to authoritative neurology domains yielded a clinically meaningful 8 to 18 percentage-point gain in correctness and halved output variability, upgrading a consumer search assistant to a decision-support-level tool that at least performed on par with a specialized literature engine. Lightweight source control is therefore a pragmatic safety lever for maintaining continuously updated, web-based RAG-augmented LLMs fit for evidence-based neurology.
    Keywords:  artificial intelligence; evidence-based medicine; information retrieval; large language models; medical guidelines; neurology
    DOI:  https://doi.org/10.2196/79379
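    A minimal sketch of the domain-whitelisting idea in entry 4: retrieved sources are kept only if their domain is on an allow-list (here aan.com and neurology.org, as in the study) before being passed to the generation step. This is a generic post-retrieval filter with invented URLs and data structures, not necessarily how the study itself restricted retrieval.
      from urllib.parse import urlparse

      ALLOWED_DOMAINS = {"aan.com", "neurology.org"}  # guideline domains used in entry 4

      def is_whitelisted(url):
          # Keep exact matches and subdomains of the allowed domains.
          host = urlparse(url).netloc.lower()
          return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

      def whitelist_sources(retrieved):
          # Drop any retrieved source whose domain is not on the allow-list.
          return [s for s in retrieved if is_whitelisted(s["url"])]

      retrieved = [  # invented examples
          {"url": "https://www.aan.com/Guidelines/example", "snippet": "..."},
          {"url": "https://n.neurology.org/content/example", "snippet": "..."},
          {"url": "https://somehealthblog.example.com/post", "snippet": "..."},
      ]
      print([s["url"] for s in whitelist_sources(retrieved)])  # keeps only the first two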
  5. J Health Econ Outcomes Res. 2025;12(2): 154-162
       Introduction: Economic evaluations are essential for informed healthcare decision-making but often face challenges due to inconsistent reporting and methodological complexity. Large Language Models (LLMs) offer a scalable alternative for evaluating adherence to reporting standards. Building on Hileas, a previously developed tool, this study assesses the accuracy of LLM-generated evaluations compared with human reviewers, aiming to quantify reliability, identify limitations, and advance automated but assistive quality-assessment methods in health economic research.
    Methods: In all, 110 peer-reviewed economic evaluation papers were evaluated using the CHEERS checklist through structured LLM prompts and scored by 2 human reviewers on a 0-4 ordinal scale. Interrater agreement and LLM performance were measured using Cohen's kappa, sensitivity, specificity, and area under the curve. LLM outputs were compared against human consensus ratings, and usability of the review platform was assessed with the System Usability Scale.
    Results: Among 2860 item-level evaluations, 25.3% showed disagreement between human reviewers, with generally low interrater reliability (kappa=-0.07 to 0.43). Compared with human consensus, the LLM achieved 72.3% to 94.7% agreement, with areas under the curve up to 0.96 but variable performance across checklist items. At the paper level, LLM-assigned CHEERS scores (median, 17) were consistently lower than human-reviewed scores (median, 18-21).
    Conclusion: This study demonstrated an exploratory proof-of-concept application of LLMs to research quality evaluation. Our results suggest that the LLM was generally able to provide well-reasoned evaluations that closely aligned with human assessments, although with some limitations in fully supporting its judgments.
    Keywords:  CHEERS; Large Language Models; artificial intelligence; publication quality
    DOI:  https://doi.org/10.36469/001c.145214
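    A hedged sketch of the agreement statistics named in entry 5: Cohen's kappa between two human reviewers and an AUC for an LLM's confidence on a single (invented) CHEERS item. The study scored papers on a 0-4 ordinal scale; binary labels are used here only to keep the illustration short.
      from sklearn.metrics import cohen_kappa_score, roc_auc_score

      # Invented item-level ratings for one checklist item across ten papers (1 = adequately reported).
      reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
      reviewer_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
      consensus  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

      # Invented LLM confidence scores that the item is adequately reported.
      llm_scores = [0.9, 0.7, 0.2, 0.8, 0.1, 0.95, 0.85, 0.4, 0.6, 0.9]

      print("kappa:", cohen_kappa_score(reviewer_a, reviewer_b))  # interrater agreement
      print("AUC  :", roc_auc_score(consensus, llm_scores))       # LLM vs. human consensus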
  6. Front Artif Intell. 2025;8: 1658316
       Background: The integration of generative artificial intelligence (AI), particularly large language models (LLMs), into medical statistics offers transformative potential. However, it also introduces risks of erroneous responses, especially in tasks requiring statistical rigor.
    Objective: To evaluate the effectiveness of various prompt engineering strategies in guiding LLMs toward accurate and interpretable statistical reasoning in biomedical research.
    Methods: Four prompting strategies (zero-shot, explicit instruction, chain-of-thought, and hybrid) were assessed using artificial datasets involving descriptive and inferential statistical tasks. Outputs from GPT-4.1 and Claude 3.7 Sonnet were evaluated using Microsoft Copilot as an LLM-as-a-judge, with human oversight.
    Results: Zero-shot prompting was sufficient for basic descriptive tasks but failed in inferential contexts due to lack of assumption checking. Hybrid prompting, which combines explicit instructions, reasoning scaffolds, and format constraints, consistently produced the most accurate and interpretable results. Evaluation scores across four criteria (assumption checking, test selection, output completeness, and interpretive quality) confirmed the superiority of structured prompts.
    Conclusion: Prompt design is a critical determinant of output quality in AI-assisted statistical analysis. Hybrid prompting strategies should be adopted as best practice in medical research to ensure methodological rigor and reproducibility. Additional testing with newer models, including Claude 4 Sonnet, Claude 4 Opus, o3 mini, and o4 mini, confirmed the consistency of results, supporting the generalizability of findings across both Anthropic and OpenAI model families. This study highlights prompt engineering as a core competency in AI-assisted medical research and calls for the development of standardized prompt templates, evaluation rubrics, and further studies across diverse statistical domains to support robust and reproducible scientific inquiry.
    Keywords:  AI-assisted data analysis; LLM-as-a-judge; evaluation frameworks; large language models; medical research; prompt engineering; statistical assumption checking; statistical reasoning
    DOI:  https://doi.org/10.3389/frai.2025.1658316
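    An illustrative sketch (not the authors' actual prompts) of the hybrid strategy evaluated in entry 6, combining explicit instructions, a chain-of-thought reasoning scaffold, and output format constraints for a two-group comparison task; all wording is invented.
      def hybrid_prompt(outcome, group_a, group_b):
          # Explicit instructions: the task and the data context.
          instructions = (
              f"You are assisting with the statistical analysis of a biomedical dataset.\n"
              f"Compare {outcome} between {group_a} and {group_b}."
          )
          # Chain-of-thought scaffold: force assumption checking before test selection.
          reasoning_scaffold = (
              "Work step by step:\n"
              "1. State and check the assumptions (normality, equal variances, independence).\n"
              "2. Choose the appropriate test (e.g. t-test vs Mann-Whitney U) and justify it.\n"
              "3. Report the test statistic, p-value, and an effect size with a 95% CI."
          )
          # Format constraints: structured, auditable output.
          format_constraints = (
              "Return your answer in three labelled sections: ASSUMPTIONS, TEST CHOICE, RESULTS.\n"
              "Do not report a p-value without naming the test it came from."
          )
          return "\n\n".join([instructions, reasoning_scaffold, format_constraints])

      print(hybrid_prompt("systolic blood pressure", "the treatment arm", "the control arm"))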