Intern Emerg Med. 2026 Feb 26.
BACKGROUND: Large language models (LLMs) are increasingly used in biomedical research for statistical support, yet their reliability in selecting appropriate tests and generating correct software commands remains insufficiently evaluated. This study compared the performance of ChatGPT-5, Claude, and DeepSeek in identifying statistical tests and generating corresponding Stata 15 commands.
METHODS: Thirty-two examples were adapted from the UCLA Institute for Digital Research and Education Stata tutorial. Each model was tested twice independently using standardized prompts. Responses were classified using a four-level taxonomy: COR (reference-equivalent; no deviation), SYN (minor syntactic deviation; low risk), ALT (alternative valid specification; low risk), and CMM (conceptual mismatch with potential inferential impact; high risk). Accuracy was defined as the proportion of outputs with no or low-risk deviations, calculated as (COR + SYN + ALT)/32. Model comparisons used Fisher's exact test, and reproducibility across rounds was assessed with McNemar's and Fisher's exact tests.
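As an illustration of the comparison procedure, Fisher's exact test on a 2 × 2 table of low-risk versus high-risk outputs for two models can be run with Stata's immediate tabulate command (a minimal sketch; the counts below, 30/32 vs. 29/32 low-risk outputs, are hypothetical and serve only to show the command form):

    * Fisher's exact test: rows = models, columns = low-risk vs. high-risk outputs
    * (cell counts hypothetical, for illustration only)
    tabi 30 2 \ 29 3, exact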
RESULTS: All three models correctly identified the statistical test in all 32 examples (100% accuracy in both rounds). For Stata command generation, accuracy was high and comparable across models (round 1: ChatGPT = 90.6%, Claude = 93.8%, DeepSeek = 93.8%; round 2: ChatGPT = 90.6%, Claude = 96.9%, DeepSeek = 87.5%; all p > 0.05). High-risk deviations were rare (≤ 12.5% in any model-round combination). Between-round reproducibility was excellent (ChatGPT = 100%, Claude = 96.9%, DeepSeek = 93.8%; all p > 0.05).
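For Claude, the reported figures (30/32 low-risk outputs in round 1, 31/32 in round 2, 96.9% concordance) uniquely determine the paired 2 × 2 cells a = 30, b = 0, c = 1, d = 1, so the reproducibility check can be reproduced with Stata's immediate matched-pairs command (a sketch; the cell breakdown is derived here from the abstract's totals, not reported by the study):

    * McNemar's test on paired round-1/round-2 classifications
    * (a = low-risk in both rounds, b = round 1 only, c = round 2 only,
    *  d = neither; cells derived, not reported)
    mcci 30 0 1 1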
CONCLUSION: ChatGPT-5, Claude, and DeepSeek demonstrated high accuracy and reproducibility in structured statistical reasoning tasks, with only rare high-risk deviations that could affect statistical inference. These findings support the use of advanced LLMs as complementary tools for applied statistical reasoning.
Keywords: AI; Accuracy; Artificial intelligence; ChatGPT; Claude; Comparison; DeepSeek; Large language model (LLM); Performance; Reproducibility; Scientific writing; Stata; Statistical analysis