Int J Med Inform. 2026 Jun 02. pii: S1386-5056(26)00256-X. [Epub ahead of print] 218: 106516
OBJECTIVE: This study evaluated model-configuration and stopping-rule decisions when using active learning-based title-and-abstract screening in health technology evidence syntheses.
METHODS: We conducted retrospective simulations using seven pre-labelled datasets from systematic, scoping, and overview reviews in health technology. Simulations were implemented with ASReview Makita and compared lightweight configurations that paired one-hot encoding or term frequency-inverse document frequency (TF-IDF) features with naive Bayes, logistic regression, random forest, and support vector machine (SVM) classifiers. Performance was evaluated using normalised recall regret ("loss"), work saved over sampling at 95% (WSS@95) and 100% recall (WSS@100), early recall, and K%-consecutive-irrelevant stopping rules. Repeated simulations and exploratory dataset-level analyses were conducted for the highest-ranked configuration.
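As an illustration of the WSS metric named above, the sketch below computes work saved over sampling from a simulated screening order. It assumes the commonly used definition (fraction of records left unscreened when the target recall is reached, minus 1 minus the recall level); the abstract does not state the exact formula used, and the function name `wss_at_recall` and the toy labels are hypothetical.

```python
import math


def wss_at_recall(labels_in_screening_order, recall_level=0.95):
    """Work saved over sampling at a given recall level.

    labels_in_screening_order: 1 for relevant, 0 for irrelevant, listed in
    the order the active learning model presented the records.
    Assumed definition: (N - n_screened) / N - (1 - recall_level).
    """
    n_total = len(labels_in_screening_order)
    n_relevant = sum(labels_in_screening_order)
    if n_relevant == 0:
        raise ValueError("dataset contains no relevant records")

    # Number of relevant records needed to reach the target recall
    # (small tolerance guards against floating-point overshoot).
    target = math.ceil(recall_level * n_relevant - 1e-9)

    found = 0
    for n_screened, label in enumerate(labels_in_screening_order, start=1):
        found += label
        if found >= target:
            return (n_total - n_screened) / n_total - (1 - recall_level)
    # Unreachable: the loop always reaches the target by the last record.
    raise RuntimeError("target recall was not reached")


# Toy example: 10 records, 3 relevant, all found after 4 records screened.
toy_order = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
wss100 = wss_at_recall(toy_order, recall_level=1.0)
wss95 = wss_at_recall(toy_order, recall_level=0.95)
```

In this toy ordering, six of ten records are left unscreened at full recall, so WSS@100 is 0.6; real datasets with thousands of records behave far less neatly.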
RESULTS: SVM + TF-IDF (with bigrams) had the most favourable overall performance, with an average loss of 0.08 (95% CI 0.06 to 0.09), WSS@95 of 0.70 (95% CI 0.59 to 0.79), and WSS@100 of 0.50 (95% CI 0.30 to 0.69). At a fixed 7% consecutive-irrelevant stopping rule, all datasets reached at least 95% recall in the main analysis, with mean recall of 98%. In repeated simulations, the fixed 7% rule achieved mean recall of 97%; however, one very low-prevalence dataset did not reach 95% recall until K = 33%. Exploratory analyses suggested that relevant-record prevalence, textual similarity among relevant records, and abstract completeness may help explain variation in model performance and stopping-rule reliability, although these analyses were hypothesis-generating.
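A K%-consecutive-irrelevant stopping rule of the kind evaluated here can be sketched as follows. This is a minimal illustration that assumes K is expressed as a percentage of the total number of records in the dataset, which the abstract does not state explicitly; the function names and toy labels are hypothetical.

```python
import math


def stop_position(labels_in_screening_order, k_percent):
    """Number of records screened when the stopping rule fires.

    Screening stops once the run of consecutive irrelevant records (label 0)
    reaches k_percent of the dataset size (assumed interpretation of K%).
    If the rule never fires, every record is screened.
    """
    n_total = len(labels_in_screening_order)
    threshold = max(1, math.ceil(k_percent / 100 * n_total - 1e-9))

    run = 0
    for n_screened, label in enumerate(labels_in_screening_order, start=1):
        run = 0 if label else run + 1  # a relevant record resets the run
        if run >= threshold:
            return n_screened
    return n_total


def recall_at_stop(labels_in_screening_order, k_percent):
    """Share of all relevant records found before the rule stops screening."""
    n_screened = stop_position(labels_in_screening_order, k_percent)
    found = sum(labels_in_screening_order[:n_screened])
    return found / sum(labels_in_screening_order)


# Toy example: 20 records, 5 relevant, one of them surfacing late.
toy_order = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
early_recall = recall_at_stop(toy_order, k_percent=20)  # stops before the late record
late_recall = recall_at_stop(toy_order, k_percent=30)   # keeps going past it
```

The toy ordering shows the trade-off in miniature: a smaller K stops sooner but can miss a late relevant record, whereas a larger K finds it at the cost of more screening.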
CONCLUSION: Active learning-based screening reduced workload in these health technology datasets, but its use requires explicit implementation choices. SVM + TF-IDF (with bigrams) was the most pragmatic initial configuration, and a 7% consecutive-irrelevant rule was a useful stopping heuristic. However, stopping decisions should depend on the review's tolerance for missed studies, dataset quality, topic heterogeneity, and available safeguards, rather than on a fixed threshold alone.
Keywords: ASReview; Computational simulation; Evidence-based health science; Systematic review