Gigascience. 2025 Oct 30. pii: giaf139. [Epub ahead of print]
  
BACKGROUND: While cell-free DNA (cfDNA) is a promising biomarker for cancer diagnosis and monitoring, there is limited agreement on optimal cfDNA collection and extraction protocols, or on analysis pipelines for the resulting cfDNA sequencing data. In this paper, we address the latter by studying how various bioinformatics preprocessing choices affect derived genetic and epigenetic cfDNA features, and how the observed feature differences influence the downstream task of distinguishing healthy from cancer cfDNA samples.
RESULTS: Using low-pass whole-genome cfDNA sequencing data from 20 lung cancer and 20 healthy samples, we assessed the influence of various preprocessing settings, such as read trimming, filtering of secondary alignments, and choice of genome build, as well as practices such as downsampling or selecting for short fragments, on derived cfDNA features including cfDNA fragment size, fragment end motifs, copy number alterations, and nucleosome footprints. Our results demonstrate that the analyzed features are robust to common preprocessing choices but exhibit variable sensitivity to sequencing coverage. Fragment length statistics and end motifs are the least affected by low coverage, whereas nucleosome footprint analysis is highly sensitive to it. Our findings confirm that selecting for shorter fragments enhances cancer-specific signals; however, because it removes data, it also reduces signal overall. Interestingly, we find that fragment end motif analysis benefits the most from in silico size selection. We also observe that filtering of low-quality and secondary alignments and the choice of genome build result in slight improvements in cancer classification performance based on nucleosome coverage and copy number features.
CONCLUSIONS: Altogether, we conclude that cfDNA analysis is minimally affected by different bioinformatics preprocessing settings; however, we describe some synergistic effects between analytical approaches that can be leveraged to improve cancer detection.
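
As an illustration of the in silico size selection and fragment end motif analysis mentioned in the abstract, the following is a minimal Python sketch and not the authors' pipeline. The `Fragment` record, the function names, the 90-150 bp window, and the 4-mer motif length are all assumptions chosen for the example; the abstract does not state the actual cut-offs or tools used.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical minimal fragment record (not from the paper)."""
    length: int           # fragment (template) length in bp
    five_prime_seq: str   # bases at the 5' end of the fragment


def size_select(fragments: Iterable[Fragment],
                min_len: int = 90,
                max_len: int = 150) -> List[Fragment]:
    """Keep fragments with length in [min_len, max_len].

    The default window is an assumed example, not the paper's setting.
    """
    return [f for f in fragments if min_len <= f.length <= max_len]


def end_motif_frequencies(fragments: Iterable[Fragment],
                          k: int = 4) -> Dict[str, float]:
    """Relative frequency of the k-mer at the 5' end of each fragment.

    Fragments with fewer than k bases or with non-ACGT characters
    (for example N) in the motif are skipped.
    """
    counts: Counter = Counter()
    for f in fragments:
        motif = f.five_prime_seq[:k].upper()
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}
```

Computing `end_motif_frequencies(size_select(fragments))` and comparing it with `end_motif_frequencies(fragments)` is one way to check, on one's own data, how much size selection shifts the motif profile. Note that the size-selected profile rests on fewer fragments, which is the trade-off the abstract describes.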