Fragmentomic Analysis of cfDNA for Cancer Detection and Subtyping.
Early detection and monitoring of cancer have been shown to be
crucial to the long-term survival of patients (1). Numerous studies have
shown that cfDNA (cell-free DNA) from a patient's blood sample can be
used to detect cancer and determine the cancer subtype (2).
ctDNA (circulating tumour DNA) is a subset of cfDNA present in the blood
of individuals living with cancer. Certain biomarkers, such as the
length of the DNA fragment, methylation, and other DNA modifications,
allow scientists to distinguish ctDNA within the pool of cfDNA and
determine its cell of origin. One of the most prominent biomarkers,
reported by Katsman et al., is the higher ratio of short
mono-nucleosome and di-nucleosome cfDNA as compared to normal cfDNA in
cancer patients (5). The objective of the CfDNAfragmentomics R package is to
analyse sequenced data from input data files, compute their
nucleosome ratios and fragmentation patterns, and classify the patient
cfDNA as potentially cancerous or not. To build on the research
done by Katsman et al. and others in the field, a Wilcoxon test can be
performed to assess the significance of the difference in the
nucleosome ratios between the patient and control data. Additionally,
this package contains Python scripts to plot nucleosome coverage
and to perform fragmentation analysis on larger datasets. The Wilcoxon
rank sum test, also known as the Mann-Whitney U test, is a
non-parametric test used here to determine whether the cfDNA fragment
lengths of the patient differ significantly from those of the control,
and hence whether the patient data contains the cancer biomarker. The
test compares two independent samples, the healthy control cfDNA data
and the patient cfDNA data, in order to infer whether the population of
cfDNA molecules in the patient is potentially cancer
positive or negative. Both control and patient data passed to the
functions are assumed to be reads for the specific loci corresponding to
the cancer type of interest. Hence, both the control and patient data
are required in the analysis to ensure the loci being analysed
correspond to the specific cancer type of interest. This package is
novel since no other R package performs statistical
analysis on patient and control cfDNA to assess the likelihood of cancer
in the patient cfDNA with an R Shiny interface as well as Python scripts
for command-line analyses that require more memory.
The biological data analysed by CfDNAfragmentomics is cfDNA (cell-free
DNA): short DNA fragments that have entered the blood through cell
apoptosis or necrosis. These fragments have variable lengths in the range
of 150-400 base pairs (3) and are sequenced using
Illumina sequencing or Oxford Nanopore Technology. The sequenced reads are
provided to the CfDNAfragmentomics package as .bed or .txt files, in the
format described in the assumptions.
The CfDNAfragmentomics package was developed using R version 4.2.2 (2022-10-31 ucrt), Platform: x86_64-w64-mingw32/x64 (64-bit), and Running under: Windows 10 x64 (build 22000).
To install the latest version of the package:
require("devtools")
devtools::install_github("Yasamin-Nourijelyani/CfDNAfragmentomics")
library("CfDNAfragmentomics")
To run the Shiny app:
runShinyCfDNAfragmentomics()
To list the functions available in the package:
ls("package:CfDNAfragmentomics")
For help on a function, use a command such as:
?nucleosomeRatio
-
R Analysis function: nucleosomeRatio: Fragment sizes of cfDNA molecules are potential cancer biomarkers (4). Hence, to find whether a patient data file contains cancer information, it can be compared with a control dataset of known healthy cfDNA fragments. The U-test is a non-parametric test used to determine the significance of the difference between the control and patient cfDNA fragment lengths, and hence whether the patient data contains the cancer biomarker. The test compares two independent samples, the control and patient data, to infer whether the population of cfDNA molecules in the patient is potentially cancer positive or negative. The function takes as input cfDNA read data frames from the patient and controls, which can be read from .bed or .txt files, together with the p-value thresholds considered significant for the U-test analysis of mononucleosomes and dinucleosomes. It returns a list containing a real number, ttest_mono_pvalue, a real number, ttest_di_pvalue, and a boolean value, cancerous. ttest_mono_pvalue is the p-value calculated for the difference in mononucleosome lengths between patient and controls, and ttest_di_pvalue is the p-value calculated for the difference in dinucleosome lengths. cancerous is TRUE if the patient input shows significantly shorter mononucleosome and dinucleosome sizes than the control with respect to the input thresholds, and FALSE otherwise. Note that even if the p-values show significance, the patient is not considered cancerous unless the patient cfDNA lengths are shorter.
-
nuc_ratio.py: In the nuc_ratio.py file, the function nucleosome_ratio takes as input a sample cfDNA file and a healthy control bed file to compare the mono-nucleosome and di-nucleosome fragment lengths. Using the patient and healthy bed files, we find the cfDNA fragment lengths and separate them into mononucleosome and dinucleosome arrays based on their lengths. Then, the sample and control mononucleosome lengths, and the sample and control dinucleosome lengths, are compared using the Wilcoxon rank sum test.
-
nucleosome_occupancy.py: We designed a Python script to aggregate fragment coverage across regions of the genome. The script takes as input a bed file containing cfDNA start and end intervals and a set of transcription factor binding site (TFBS) loci. For every TFBS, we aggregate the fragment coverage in a window around the midpoint of the site. The size of the window is specified by the user as an argument (e.g. 1000 bp). We then use the aggregated coverage vector to draw the nucleosome occupancy as a line plot, smoothed using a Savitzky-Golay filter.
-
R Plotting functions: nucleosomeDensityPlot: A visualization function that generates a density plot showing nucleosome fragment lengths for control and patient data to visually compare the mono-nucleosome and di-nucleosome fragment length densities. This function takes as input cfDNA read dataframes from patient and controls, which can be read from .bed or .txt files.
The fragment length analysis in the Python script can be run from the command line:
python <./location/nuc_ratio.py> <control_data.tsv> -s <sample_data.tsv>
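The comparison behind this analysis can be sketched in plain Python. This is a minimal illustration, not the code in nuc_ratio.py: the length windows MONO_RANGE and DI_RANGE, the function names, the result keys, and the use of the median to decide whether patient fragments are "shorter" are all assumptions made for the example. The p-value uses the normal approximation to the rank sum statistic with a tie correction and no continuity correction.

```python
from math import sqrt
from statistics import NormalDist, median

# Hypothetical length windows in bp; the README only states the overall 100-400 bp range.
MONO_RANGE = (100, 220)
DI_RANGE = (221, 400)


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) p-value, normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * in_x
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0  # all values identical, no evidence of a difference
    z = (u - mu) / sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def in_range(lengths, bounds):
    low, high = bounds
    return [length for length in lengths if low <= length <= high]


def classify_sample(patient_lengths, control_lengths, alpha_mono=0.05, alpha_di=0.05):
    """Flag the sample only if both classes are significantly different AND shorter."""
    result = {}
    significant = True
    shorter = True
    for name, bounds, alpha in (("mono", MONO_RANGE, alpha_mono),
                                ("di", DI_RANGE, alpha_di)):
        patient = in_range(patient_lengths, bounds)
        control = in_range(control_lengths, bounds)
        pvalue = rank_sum_pvalue(patient, control)
        result[name + "_pvalue"] = pvalue
        significant = significant and pvalue <= alpha
        # Assumption: "shorter" is judged by comparing medians.
        shorter = shorter and median(patient) < median(control)
    result["cancerous"] = significant and shorter
    return result
```

Calling classify_sample(patient_lengths, control_lengths) with two lists of fragment lengths returns a dictionary with the two p-values and the cancerous flag. A sample whose lengths differ significantly but are longer than the control is not flagged.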
The nucleosome occupancy analysis in the Python script can be run from the command line:
python <./location/nucleosome_occupancy.py> -t <./location/TFBS_loc.tsv> -w <Window size> -n <name> <./locationdata/gene.hg38.frag.bed>
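The aggregation step of the occupancy analysis can be sketched as follows. This is an illustration under stated assumptions rather than the code in nucleosome_occupancy.py: the function name aggregate_coverage is hypothetical, intervals are taken to be half-open BED-style (chrom, start, end) tuples, and the window is assumed to be centred on the midpoint of each site. Smoothing is left out to keep the sketch dependency-free; a Savitzky-Golay filter such as scipy.signal.savgol_filter could be applied to the returned vector.

```python
from collections import defaultdict


def aggregate_coverage(fragments, sites, window=1000):
    """Sum fragment coverage over windows centred on each site midpoint.

    fragments: iterable of (chrom, start, end) cfDNA intervals.
    sites: iterable of (chrom, start, end) binding site intervals.
    Returns a list of length `window`; index 0 is midpoint - window // 2.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in fragments:
        by_chrom[chrom].append((start, end))

    half = window // 2
    # Difference array: +1 where a fragment enters the window, -1 where it leaves.
    diff = [0] * (window + 1)
    for chrom, site_start, site_end in sites:
        midpoint = (site_start + site_end) // 2
        win_start = midpoint - half
        win_end = win_start + window
        # Linear scan per site; fine for a sketch, slow for genome-scale input.
        for start, end in by_chrom.get(chrom, ()):
            low = max(start, win_start)
            high = min(end, win_end)
            if low < high:  # fragment overlaps the window
                diff[low - win_start] += 1
                diff[high - win_start] -= 1

    coverage = []
    running = 0
    for delta in diff[:window]:
        running += delta
        coverage.append(running)
    return coverage
```

The difference array keeps the cost of each overlapping fragment constant regardless of its length, and the final prefix sum converts the entry and exit marks into per-base coverage.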
Assumptions:
-
The sequenced data passed to the Shiny app is in the form of BED or txt files so that it can be parsed and analysed. Files do not contain a header.
-
The input to the functions is a data frame for each of the control and patient data with at least 3 columns: the first column is a string giving the chromosome of the cfDNA read, for instance “chr1”; the second column contains integer values indicating the start location of the read; and the third column contains integer values indicating the end location of the read. This is also the format of the BED and txt files passed to the Shiny app.
-
The length of the fragments to analyse is within the range of 100-400 base pairs, to avoid using a large amount of memory and to ensure only short cfDNA fragments are analysed.
-
It is also assumed that the data is real human data containing both mono-nucleosome and di-nucleosome length fragments, sequenced on Illumina or Oxford Nanopore Technology sequencers, to ensure data viability.
-
Both control and patient data passed to the functions are assumed to be reads for the specific loci corresponding to the cancer type of interest. Hence, both the control and patient data are required in the analysis to ensure the loci being analysed correspond to the specific cancer type of interest.
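The input format assumed above can be illustrated with a short parsing sketch. It is not part of the package: the function name read_fragment_lengths is hypothetical, and fragment length is taken as end minus start, which assumes half-open BED coordinates.

```python
def read_fragment_lengths(lines, min_len=100, max_len=400):
    """Parse headerless BED-like lines and keep lengths within [min_len, max_len]."""
    lengths = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Columns: chromosome, start, end; any further columns are ignored.
        start, end = int(fields[1]), int(fields[2])
        length = end - start
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path) as handle: lengths = read_fragment_lengths(handle)`, where `path` is a placeholder for a .bed or .txt file.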
The author of the package is Yasamin Nouri Jelyani.
The nucleosomeRatio function, written by the author, makes use of
dplyr to select rows from the data frame that correspond to
mono-nucleosome or di-nucleosome sizes. Using the stats wilcox.test
function, it performs the Wilcoxon test on the patient versus the
control cfDNA sizes. The test is applied to the mono-nucleosome and
di-nucleosome cfDNA lengths, and the significance of the differences is
used to determine whether the patient data can be
categorized as cancerous cfDNA. The nucleosomeDensityPlot function,
written by the author, makes use of the ggplot2 R package to plot the
size distribution of the cfDNA for patient and control data. The Python
script nucleosome_occupancy.py was written with significant contribution
from Jonathan Broadbent. This
package contains significant contributions by Dr. Jared Simpson and
Jonathan Broadbent, who are the supervisors of this package development.
Contributions include the suggestion to use non-parametric tests
instead of parametric ones. The idea for the nucleosome coverage plot
algorithm is also from Jonathan Broadbent. The Python scripts
nuc_ratio.py and p.py were written by the author.
-
Cree, I. A., Uttley, L., Buckley Woods, H., Kikuchi, H., Reiman, A., Harnan, S., Whiteman, B. L., Philips, S. T., Messenger, M., Cox, A., Teare, D., Sheils, O., Shaw, J., & UK Early Cancer Detection Consortium (2017). The evidence base for circulating tumour DNA blood-based biomarkers for the early detection of cancer: a systematic mapping review. BMC cancer, 17(1), 697. https://doi.org/10.1186/s12885-017-3693-7
-
Lo, Y. M. D., Han, D. S. C., Jiang, P., & Chiu, R. W. K. (2021). Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science (American Association for the Advancement of Science), 372(6538), 144–. https://doi.org/10.1126/science.aaw3616
-
Anna-Lisa Doebley, Minjeong Ko, Hanna Liao, A. Eden Cruikshank, Caroline Kikawa, Katheryn Santos, Joseph Hiatt, Robert D. Patton, Navonil De Sarkar, Anna C.H. Hoge, Katharine Chen, Zachary T. Weber, Mohamed Adil, Jonathan Reichel, Paz Polak, Viktor A. Adalsteinsson, Peter S. Nelson, Heather A. Parsons, Daniel G. Stover, David MacPherson, Gavin Ha (2021). Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. medRxiv 2021.08.31.21262867. https://doi.org/10.1101/2021.08.31.21262867
-
Katsman, E., Orlanski, S., Martignano, F., Fox-Fisher, I., Shemer, R., Dor, Y., Zick, A., Eden, A., Petrini, I., Conticello, S. G., & Berman, B. P. (2022). Detecting cell-of-origin and cancer-specific methylation features of cell-free DNA from Nanopore sequencing. Genome biology, 23(1), 158. https://doi.org/10.1186/s13059-022-02710-1
-
Cristiano, S., Leal, A., Phallen, J., Fiksel, J., Adleff, V., Bruhm, D. C., Jensen, S. Ø., Medina, J. E., Hruban, C., White, J. R., Palsgrove, D. N., Niknafs, N., Anagnostou, V., Forde, P., Naidoo, J., Marrone, K., Brahmer, J., Woodward, B. D., Husain, H., van Rooijen, K. L., … Velculescu, V. E. (2019). Genome-wide cell-free DNA fragmentation in patients with cancer. Nature, 570(7761), 385–389. https://doi.org/10.1038/s41586-019-1272-6
-
Wang H (2022). cfDNAPro: cfDNAPro Helps Characterise and Visualise Whole Genome Sequencing Data from Liquid Biopsy. R package version 1.2.0, https://github.com/hw538/cfDNAPro.
-
Alkodsi A, Meriranta L, Pasanen A, Leppä S (2020). “ctDNAtools: An R package to work with sequencing data of circulating tumor DNA.” bioRxiv.
-
Puranachot P (2022). cfdnakit : an R package for fragmentation analysis of cfDNA and copy-number alteration calling. R package version 0.0.1, https://github.com/Pitithat.pu/cfdnakit.
-
Morgan M, Pagès H, Obenchain V, Hayden N (2022). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 2.12.0, https://bioconductor.org/packages/Rsamtools.
-
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan M, Carey V (2013). “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology, 9. doi: 10.1371/journal.pcbi.1003118, http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118.
-
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org
-
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
-
Wickham, H. and Bryan, J. (2019). R Packages (2nd edition). Newton, Massachusetts: O’Reilly Media. https://r-pkgs.org/
This package was developed by Yasamin Nouri Jelyani for BCB430: Research
Course at the University of Toronto, Toronto, Canada, under the
supervision of Professor Jared Simpson and Jonathan Broadbent, using
teachings of Professor Anjali Silva. CfDNAfragmentomics welcomes issues,
enhancement requests, and other contributions. To submit an issue, use
the GitHub issues.
Many thanks to those who provided feedback to improve this package.