Genome Res. 2021 Jul 19. pii: gr.274563.120. [Epub ahead of print]
High-throughput sequencing-based assays measure, genome-wide, different biochemical activities pertaining to gene regulation. These activities include transcription factor (TF)-DNA binding, enhancer activity, open chromatin, and more. A major goal is to understand the underlying sequence components, or motifs, that can explain the measured activity. It is usually not one motif, but a combination of motifs bound by cooperatively acting proteins, that confers activity to such regions. Furthermore, regions can be diverse, governed by different combinations of TFs/motifs. Current approaches do not take this combinatorial diversity into account. We present a new statistical framework, cisDiversity, which models regions as diverse modules characterized by combinations of motifs, while simultaneously learning the motifs themselves. Because cisDiversity does not rely on knowledge of motifs, modules, cell type, or organism, it is general enough to be applied to regions reported by most high-throughput assays. For example, in enhancer predictions resulting from different assays (GRO-cap, STARR-seq, and those measuring chromatin structure), cisDiversity discovers distinct modules and combinations of TF binding sites, some specific to the assay. From protein-DNA binding data, cisDiversity identifies potential cofactors of the profiled TF, while from ATAC-seq data it identifies tissue-specific regulatory modules. Finally, analysis of single-cell ATAC-seq data suggests that regions open in one cell state encode information about future states, with certain modules staying open and others closing at the next time point.