Methods Mol Biol. 2023;2629:271-303
Proteins are the functional molecules for almost all cellular and biological processes. They are also the targets of most drugs. Proteins are subject to complex, multilevel regulation, so their abundance levels do not correlate well with their mRNA expression levels. The structure, activity, and functional roles of proteins are affected by posttranslational modifications (PTMs), which are even less correlated with mRNA expression levels than protein abundances are. Comprehensive characterization of proteomic data is critical for understanding the molecular and cellular mechanisms of biological systems and for developing new therapeutics. Current large-scale proteomic profiling technologies, such as mass spectrometry, provide relative identification of peptides and proteins, with data vulnerable to outliers, batch effects, and nonrandom missingness. To support high-quality proteomic data analysis, we first introduce a data preprocessing and quality control pipeline that includes normalization, outlier detection and removal, batch effect identification and handling, and missing data imputation. We then describe several statistical methods that leverage well-processed proteomic data to generate scientific discoveries, especially when integrated with genomics and transcriptomics. These methods cover association analysis, network construction, clustering, and cell-type deconvolution. To demonstrate these methods, we use the proteogenomic data from the lung squamous cell carcinoma study of the Clinical Proteomic Tumor Analysis Consortium and provide sample code for data access and analyses.
Keywords: Integrative proteogenomic analysis; Mass spectrometry; Preprocessing and quality control; Proteomics
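As a rough illustration of two of the preprocessing steps named in the abstract (normalization and missing data imputation), the following is a minimal sketch and not the chapter's own code. It assumes a proteins-by-samples matrix of log2 abundances with `None` marking missing values, and it uses per-sample median centering and per-protein minimum imputation, which are common simple choices and are assumptions here rather than the methods the authors describe. The function names `median_center`, `impute_min`, and `preprocess` are hypothetical.

```python
from statistics import median


def median_center(sample):
    """Subtract the sample median from each observed value (None = missing)."""
    observed = [v for v in sample if v is not None]
    m = median(observed)
    return [None if v is None else v - m for v in sample]


def impute_min(matrix):
    """Fill missing values in each protein (row) with that row's minimum observed value.

    This is a simple stand-in for imputation under nonrandom (left-censored)
    missingness; it is an illustrative assumption, not the chapter's method.
    """
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = min(observed) if observed else None  # all-missing rows stay missing
        imputed.append([fill if v is None else v for v in row])
    return imputed


def preprocess(matrix):
    """Median-center each sample (column), then impute each protein (row)."""
    n_proteins = len(matrix)
    n_samples = len(matrix[0])
    # Work column by column for normalization, then restore row-major layout.
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    centered_cols = [median_center(col) for col in columns]
    centered = [
        [centered_cols[j][i] for j in range(n_samples)] for i in range(n_proteins)
    ]
    return impute_min(centered)


# Toy log2 abundance matrix: 3 proteins x 3 samples, with two missing values.
toy = [
    [10.0, 12.0, None],
    [11.0, 13.0, 9.0],
    [12.0, None, 10.0],
]
processed = preprocess(toy)
```

In practice, outlier detection and batch effect handling would sit between these two steps, and the imputation method should be chosen to match the missingness mechanism in the data.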