Gigascience. 2025 Oct 20. pii: giaf126. [Epub ahead of print]
BACKGROUND: Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a valuable framework for analyzing omics data and modeling regulatory interactions between genes and proteins. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods, resulting in a plethora of open-source software for downloading, pre-processing, and analyzing those data. However, for an end-to-end analysis of regulatory networks, a coherent and reusable workflow is essential to integrate all relevant packages into a robust pipeline.
FINDINGS: We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omic data, such as RNA-seq and methylation, are (i) downloaded, (ii) pre-processed, and (iii) analyzed to infer regulatory network models with the Network Zoo. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here, we demonstrate how the pipeline can be used to investigate the differences between colon cancer subtypes attributed to epigenetic mechanisms. Lastly, we provide a database of pre-generated networks for the 10 most common cancer types that can be readily accessed by the public.
CONCLUSIONS: tcga-data-nf is a complete, yet flexible and extensible, framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools for analyzing TCGA data.
Keywords: Cancer; Gene Regulatory Network; NetworkDataCompanion; Nextflow; The Cancer Genome Atlas; reproducibility