JMIR Form Res. 2026 Feb 12;10:e69707
Background: Annotated bibliographies summarize literature, but training, experience, and time are needed to create concise yet accurate annotations. Summaries generated by artificial intelligence (AI) can save human resources, but AI-generated content can also contain serious errors.
Objective: To determine the feasibility of using AI as an alternative to human annotators, we explored whether ChatGPT can generate annotations with characteristics comparable to those of human-written annotations.
Methods: We had 2 humans and 3 versions of ChatGPT (3.5, 4, and 5) independently write annotations for the same set of 15 publications. We collected data on word count and Flesch Reading Ease (FRE). Two assessors, masked to the source of each annotation, independently evaluated (1) capture of the main points, (2) presence of errors, and (3) whether the annotation discussed both the quality of the article and its context within the broader literature. We evaluated agreement and disagreement between the assessors and used descriptive statistics and assessor-stratified binary and cumulative mixed-effects logit models to compare annotations written by ChatGPT and humans.
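As an illustration of the assessor-stratified analysis, the sketch below fits a mixed-effects logistic regression separately for each assessor, modeling a binary outcome (eg, presence of any error) as a function of annotation source with a random intercept for each publication. This is a minimal sketch under stated assumptions: the paper does not report its statistical software, and the data file and column names (annotation_ratings.csv, assessor, source, publication, has_error) are illustrative, not the authors' code. The cumulative (ordinal) models would require an ordinal mixed-model routine, which is not shown here.

```python
# Minimal sketch (assumed data layout): assessor-stratified mixed-effects
# logistic regression comparing human- vs ChatGPT-written annotations,
# with a random intercept for each of the 15 publications.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Assumed columns: assessor, source ("human" or "chatgpt"),
# publication (1-15), has_error (0/1). File name is illustrative.
df = pd.read_csv("annotation_ratings.csv")

for assessor, sub in df.groupby("assessor"):
    # Fixed effect for annotation source (ChatGPT as reference category),
    # random intercept for publication.
    model = BinomialBayesMixedGLM.from_formula(
        "has_error ~ C(source, Treatment(reference='chatgpt'))",
        {"publication": "0 + C(publication)"},
        sub,
    )
    result = model.fit_vb()  # variational Bayes fit
    print(f"Assessor {assessor}")
    print(result.summary())  # exponentiate the source coefficient for an OR
```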
Results: On average, humans wrote annotations that were shorter (mean 90.2, SD 36.8 words vs mean 113, SD 16 words) and easier to read (FRE score: mean 15.3, SD 12.4 vs mean 5.76, SD 7.32) than those written by ChatGPT. Our assessments of agreement and disagreement revealed that one assessor was consistently stricter than the other; however, the assessor-stratified models of main points, errors, and quality/context led to similar qualitative conclusions. There was no statistically significant difference in the odds of presenting a better summary of the main points between ChatGPT- and human-generated annotations for either assessor (Assessor 1: OR 0.96, 95% CI 0.12-7.71; Assessor 2: OR 1.64, 95% CI 0.67-4.06). However, for both assessors, human annotations had lower odds of containing one or more types of errors than ChatGPT annotations (Assessor 1: OR 0.31, 95% CI 0.09-1.02; Assessor 2: OR 0.10, 95% CI 0.03-0.33). On the other hand, human annotations also had lower odds of summarizing the paper's quality and context than ChatGPT annotations (Assessor 1: OR 0.11, 95% CI 0.03-0.33; Assessor 2: OR 0.03, 95% CI 0.01-0.10). That said, ChatGPT's summaries of quality and context were sometimes inaccurate.
Conclusions: Rapidly learning a body of scientific literature is a vital yet daunting task that AI tools may make more efficient. In our study, ChatGPT quickly generated concise summaries of academic literature and addressed article quality and context more consistently than humans. However, ChatGPT's discussion of quality and context was not always accurate, and ChatGPT annotations contained more errors. Annotated bibliographies that are generated by AI and carefully verified by humans may thus be an efficient way to provide a rapid overview of the literature. More research is needed to determine the extent to which prompt engineering can reduce errors and improve chatbot performance.
Keywords: ChatGPT; annotated bibliography; artificial intelligence; evidence synthesis; information management; large language model