bioRxiv. 2025 Dec 18. pii: 2025.09.18.676787. [Epub ahead of print]
The rapid expansion of protein sequence databases has far outpaced experimental structure determination, leaving many sequences unannotated, particularly remote homologs with low sequence identity. Because protein folds are more conserved and more functionally informative than sequences alone, structural information offers a powerful lens for analysis. Here, we introduce a generative, structure-aware framework that integrates geometric encoding and coevolutionary constraints to map, cluster, and design protein sequences. Our approach employs the 3D interaction (3Di) alphabet to convert local residue geometries into compact, 20-state discrete representations. Using ProstT5, we enable bidirectional translation between amino acid sequences and 3Di representations, facilitating sensitive homology detection and structure-guided sequence generation. We then augment the latent generative landscape methodology by combining 3Di-based alignments with direct coupling analysis (DCA) and variational autoencoders (VAEs), so that clustering, annotation, and design are all informed by structure. This integrative framework enhances the detection of coevolutionary signals and enables rational sampling of structural variants, even without functional labels. We demonstrate the utility of our method across diverse protein families, including globins, kinases, and malate dehydrogenases, achieving improved contact prediction, homology inference, and sequence generation. Together, our approach offers a quantitative, generative view of protein structure space, advancing studies of protein evolution and design.
Significance Statement: Protein sequence databases are growing far faster than our ability to experimentally determine structures, leaving much of protein space poorly annotated, especially for distant homologs. Because protein structure is more conserved and informative than sequence alone, new approaches are needed to exploit structural signals at scale. We present a generative framework that integrates compact structural representations with evolutionary constraints to map, cluster, and design protein sequences. By combining geometric encoding with coevolutionary modeling, our approach enables sensitive homology detection, improved inference of structural contacts, and rational exploration of sequence space without requiring functional labels. This work provides a quantitative bridge between protein sequence and structure, advancing our ability to interpret protein evolution and guide protein design.
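As a rough illustration of the kind of coevolutionary signal mentioned above, the sketch below scores column pairs of a small alignment by mutual information. This is a deliberately simpler covariation measure than DCA (it does not separate direct from indirect couplings) and is not the authors' pipeline. The function names and the toy alignment are hypothetical, and the strings are invented rather than real 3Di or ProstT5 output. The same calculation applies to any aligned strings over a discrete alphabet, whether amino acids or a 20-state structural alphabet.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings over any discrete alphabet.
    """
    n = len(alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    counts_ij = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def top_covarying_pairs(alignment, k=3):
    """Return the k column pairs with the highest mutual information."""
    length = len(alignment[0])
    scores = {
        (i, j): column_mutual_information(alignment, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy alignment (invented strings, not real data):
# columns 0 and 2 vary together, column 1 is constant,
# and column 3 varies independently of the others.
toy_alignment = ["dvaq", "dval", "pvlq", "pvll"]

best_pair, best_score = top_covarying_pairs(toy_alignment, k=1)[0]
print(best_pair, best_score)
```

In practice, covariation scores of this kind are computed on large alignments with sequence reweighting and pseudocounts; the toy example omits both for brevity.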