Am J Hum Genet. 2025 Dec 04. pii: S0002-9297(25)00431-8. [Epub ahead of print] 112(12): 3030-3045
Large-scale biobanks provide comprehensive electronic health records (EHRs) that capture detailed clinical phenotypes, potentially enhancing disease prediction. However, traditional polygenic risk score (PRS) methods rely on simplified phenotype definitions or predefined trait sets, limiting their ability to represent the complex structures embedded within EHRs. To address this gap, we introduce EHR-embedding-enhanced PRS (EEPRS), leveraging phenotype embeddings derived from EHRs to improve PRSs using only genome-wide association study (GWAS) summary statistics. Employing embedding methods such as Word2Vec and GPT, we conducted EHR-embedding-based GWASs and identified a cardiovascular cluster via hierarchical clustering of genetic correlations. Across 41 traits in the UK Biobank, EEPRS consistently outperformed single-trait PRSs, particularly within this cluster. PRS-based phenome-wide association studies further demonstrated robust associations between EHR-embedding-based PRS and circulatory system diseases. We then developed EEPRS_optimal, a data-adaptive method that uses cross-validation to select the best embedding, yielding additional improvements. We also developed MTAG_EEPRS for multi-trait PRSs, which further improved prediction accuracy compared to single-trait PRSs and MTAG_PRS. Finally, we validated the benefits of EEPRS in the All of Us cohort for seven selected diseases. Overall, EEPRS represents a robust and interpretable framework, enhancing single-trait and multi-trait PRSs by integrating EHR embeddings.
Keywords: EHRs; GWAS; PRS; PheWAS; electronic health records; genome-wide association study; multi-trait analysis; phenome-wide association study; phenotype embeddings; polygenic risk score
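The abstract describes EEPRS_optimal only as a data-adaptive method that uses cross-validation to select the best embedding. The following is a minimal sketch of that general idea and not the authors' implementation: it assumes individual-level data (the abstract says the actual method needs only GWAS summary statistics), scores each candidate by held-out R^2 from an ordinary least-squares fit, and uses simulated inputs. The names `cv_r2`, `select_embedding`, `embedding_A`, and `embedding_B` are hypothetical.

```python
import numpy as np


def cv_r2(X, y, n_folds=5, seed=0):
    """Mean held-out R^2 of an ordinary least-squares fit of y on X."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Add an intercept column to both design matrices.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ beta
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))


def select_embedding(trait_prs, embedding_prs, y, n_folds=5, seed=0):
    """Pick the embedding whose PRS adds the most held-out R^2.

    trait_prs:      1-D array, single-trait PRS per individual (assumed given)
    embedding_prs:  dict of name -> 1-D array, one embedding-based PRS each
    y:              1-D array, target phenotype
    """
    scores = {}
    for name, e_prs in embedding_prs.items():
        X = np.column_stack([trait_prs, e_prs])
        scores[name] = cv_r2(X, y, n_folds=n_folds, seed=seed)
    best = max(scores, key=scores.get)
    return best, scores


# Simulated illustration only: embedding_A carries signal, embedding_B is noise.
rng = np.random.default_rng(1)
n = 2000
trait_prs = rng.normal(size=n)
informative = rng.normal(size=n)
uninformative = rng.normal(size=n)
y = 0.3 * trait_prs + 0.5 * informative + rng.normal(size=n)

best, scores = select_embedding(
    trait_prs,
    {"embedding_A": informative, "embedding_B": uninformative},
    y,
)
print(best, scores)
```

Using the same fold assignment for every candidate keeps the comparison between embeddings paired, so differences in score reflect the embedding rather than the split.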