J Korean Med Sci. 2026 Jun 22. 41(24):e168
BACKGROUND: Social media platforms such as X (formerly Twitter) are increasingly used by journals, authors, and institutions to promote newly published research. Well-designed posts can enhance visibility, accelerate knowledge translation, and increase altmetric attention. However, creating accurate and policy-compliant content is time-intensive. Large language models (LLMs) offer a potential solution, yet systematic evaluations of their performance in post-publication promotion remain limited.
METHODS: We conducted a blinded, crossed, offline evaluation of four LLMs: GPT-5 (OpenAI), Gemini 2.5 Pro (Google DeepMind), Grok-3 (xAI), and Perplexity Pro (Perplexity AI), tasked with generating X-style posts (≤ 260 characters) for 36 open access articles from The Lancet Public Health, The Lancet Planetary Health, and Annual Review of Public Health. Posts were generated using a standardized system and user prompt. A single blinded rater scored outputs using a five-domain rubric (factual accuracy, clarity, policy compliance, call-to-action quality, structure/metadata; maximum score 10). Secondary measures included character count, hashtag use, and readability (Flesch-Kincaid Grade Level). General linear models with Bonferroni-adjusted post hoc tests and non-parametric analyses were applied.
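The secondary measures named above (character count, hashtag use, and Flesch-Kincaid Grade Level) can be illustrated with a minimal sketch. The abstract does not state which software was used, so everything here is an assumption for illustration: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and the sentence splitter treats any run of ".", "!" or "?" as a boundary (so URLs in a post would distort the result). The grade formula itself is the standard one: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.

```python
import re

MAX_CHARS = 260  # character limit for posts, as stated in the Methods


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def post_metrics(post: str) -> dict:
    """Collect the secondary measures for one generated post."""
    return {
        "characters": len(post),
        "within_limit": len(post) <= MAX_CHARS,
        "hashtags": len(re.findall(r"#\w+", post)),
        "fk_grade": round(flesch_kincaid_grade(post), 1),
    }


if __name__ == "__main__":
    example = "New study links air quality to child health. Read more. #PublicHealth"
    print(post_metrics(example))
```

Dedicated readability tools use dictionary-based syllable counts and will give somewhat different grades from this heuristic, so scores from the sketch should not be compared directly with the values reported in the Results.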
RESULTS: All four models achieved perfect factual accuracy and no policy violations. Mean total quality scores differed significantly by model (P < 0.001). GPT-5 (9.60) and Perplexity Pro (9.60) performed best, followed by Gemini 2.5 Pro (9.47), while Grok-3 scored lower (8.80). Domain analyses showed that Grok-3 underperformed in call-to-action quality (1.40 vs. ≥ 1.97 in the other models, P < 0.001) and produced significantly shorter posts (median 194 characters, P < 0.001). Perplexity Pro scored highest for policy compliance, while GPT-5 and Gemini 2.5 Pro achieved superior structural scores. Readability (Flesch-Kincaid Grade Level) varied: GPT-5, 8.9 (7.3-9.2), and Perplexity Pro, 7.3 (6.5-8.8), generated more complex outputs, whereas Gemini 2.5 Pro, 5.1 (4.8-6.5), and Grok-3, 4.5 (3.6-6.3), produced more accessible posts.
CONCLUSION: LLMs can reliably generate accurate and policy-compliant social media posts for research promotion, with differences in style and readability that may inform audience targeting. GPT-5, Gemini 2.5 Pro, and Perplexity Pro produced high-quality outputs, while Grok-3 underperformed across several domains. These findings highlight the potential of LLMs as scalable first-draft tools for post-publication promotion that could improve the reach and accessibility of scientific research. Careful model selection, tailored to audience and communication goals, together with human oversight, remains essential.
Keywords: Knowledge Management; Large Language Models; Public Health; Publishing; Scholarly Communication; Social Media