Ann Intern Med. 2025 Nov 04.
Gerald Gartlehner, Shannon Kugley, Karen Crotty, Meera Viswanathan, Andreea Dobrescu, Barbara Nussbaumer-Streit, Graham Booth, Jonathan R Treadwell, Jung Min Han, Jesse Wagner, Eric A Apaydin, Erin L Coppola, Margaret Maglione, Rainer Hilscher, Robert Chew, Meagan Pilar, Bryan Swanton, Leila C Kahwati.
BACKGROUND: Data extraction is a critical but error-prone and labor-intensive task in evidence synthesis. Unlike other artificial intelligence (AI) technologies, large language models (LLMs) do not require labeled training data for data extraction.
OBJECTIVE: To compare an AI-assisted versus a traditional, human-only data extraction process.
DESIGN: Study within reviews (SWAR) using a prospective, parallel-group comparison with blinded data adjudicators.
SETTING: Workflow validation within 6 ongoing systematic reviews of interventions under real-world conditions.
INTERVENTION: Initial data extraction using an LLM (Claude, versions 2.1, 3.0 Opus, and 3.5 Sonnet) verified by a human reviewer.
MEASUREMENTS: Concordance, time on task, accuracy, sensitivity, positive predictive value, and error analysis.
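For readers less familiar with these measures, the following is a minimal sketch of the textbook definitions of accuracy, sensitivity, and positive predictive value computed from a 2x2 classification against a reference standard. The abstract does not state how extracted data elements were assigned to the four cells, so the mapping described in the comments is an assumption, and the counts in the example are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """2x2 counts against a reference standard (cell mapping is an assumption)."""
    tp: int  # data present in the reference standard and extracted correctly
    fp: int  # data extracted although the reference standard has none or differs
    fn: int  # data present in the reference standard but missed
    tn: int  # no data to extract, and none extracted


def accuracy(c: Counts) -> float:
    # Share of all data elements handled correctly.
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def sensitivity(c: Counts) -> float:
    # Share of elements that should have been extracted and were.
    return c.tp / (c.tp + c.fn)


def positive_predictive_value(c: Counts) -> float:
    # Share of extracted elements that were correct.
    return c.tp / (c.tp + c.fp)


# Hypothetical counts for illustration only; not data from the study.
example = Counts(tp=80, fp=2, fn=10, tn=8)
print(f"accuracy={accuracy(example):.3f}")
print(f"sensitivity={sensitivity(example):.3f}")
print(f"ppv={positive_predictive_value(example):.3f}")
```

Under these definitions, missed data items lower sensitivity, whereas incorrectly extracted items lower positive predictive value.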
RESULTS: The 6 systematic reviews in the SWAR yielded 9341 data elements from 63 studies. Concordance between the 2 methods was 77.2% (95% CI, 76.3% to 78.0%). Compared with the reference standard, the AI-assisted approach had an accuracy of 91.0% (CI, 90.4% to 91.6%) and the human-only approach an accuracy of 89.0% (CI, 88.3% to 89.6%). Sensitivities were 89.4% (CI, 88.6% to 90.1%) and 86.5% (CI, 85.7% to 87.3%), respectively, with positive predictive values of 99.2% (CI, 99.0% to 99.4%) and 98.9% (CI, 98.6% to 99.1%). Incorrect data were extracted in 9.0% (CI, 8.4% to 9.6%) of AI-assisted cases and 11.0% (CI, 10.4% to 11.7%) of human-only cases, with corresponding proportions of major errors of 2.5% (CI, 2.2% to 2.8%) versus 2.7% (CI, 2.4% to 3.1%). Missed data items were the most frequent error type in both approaches. The AI-assisted method reduced data extraction time by a median of 41 minutes per study.
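As a rough plausibility check on the reported intervals, the sketch below computes a Wilson score interval for a single proportion. The abstract does not say which interval method the authors used, so the choice of the Wilson method is an assumption. The sketch also assumes that all 9341 data elements form the denominator for concordance, and it reconstructs the count of concordant elements from the rounded percentage.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: every one of the 9341 data elements enters the denominator.
n_elements = 9341
# Approximate count rebuilt from the rounded 77.2% figure, not a reported number.
n_concordant = round(0.772 * n_elements)

low, high = wilson_ci(n_concordant, n_elements)
print(f"concordance 95% CI: {low:.1%} to {high:.1%}")
```

With a denominator this large, the interval is narrow, under one percentage point on either side of the estimate, which is consistent with the width of the intervals reported above. The same function could be applied to the accuracy or error proportions under the same assumptions.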
LIMITATIONS: Assessing concordance and classifying errors required subjective judgment. Consistently tracking time on task was challenging.
CONCLUSION: Data extraction assisted by AI may offer a viable, more efficient alternative to human-only methods.
PRIMARY FUNDING SOURCE: Agency for Healthcare Research and Quality and RTI International.