Add RNA Seq normalization methods #136
docs/rnaseq_normalization.ipynb
Outdated
"\n",
"#### RPKM:\n",
"\n",
"RPKM (Reads per kilobase million) normalization at first determines a scaling factor, by calculating the sum of all reads in a sample and dividing that number by 1,000,000. That scaling factor is used to calculate RPM (Reads per million), by dividing the read counts for each sample with it, normalizing for sequencing depth. To get RPKM and normalize for gene length, RPM values are divided by genelength in kilobases. RPKM is applied by using the `RNASeq.rpkms` function.\n"
Here it would be beneficial to add the formula for RPKM.
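As a sketch of what such a formula could look like, the steps described in the notebook text can be condensed as follows. The symbols are assumptions introduced here and do not appear in the PR: $c_i$ is the read count of gene $i$, $L_i$ its length in base pairs, and $N = \sum_j c_j$ the total number of reads in the sample.

$$
\mathrm{RPKM}_i = \frac{c_i}{\dfrac{L_i}{10^{3}} \cdot \dfrac{N}{10^{6}}} = \frac{10^{9} \, c_i}{N \, L_i}
$$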
"\n",
"#### TPM:\n",
"\n",
"What differentiates TPM (Transcripts per kilobase million) from RPKM is the order of operations. To calculate TPM values, data gets normalized for gene length first. This is achieved by calculating RPK values (reads per kilobase), by dividing the read counts by genelength in kilobases. The sum of all RPK values is divided by 1,000,000, to get a scaling factor. Finally, TPM values are calculated by dividing the RPK values by the scaling factor, also normalizing for sequencing depth.\n",
Here it would be beneficial to add the formula for TPM.
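One possible way to write it, following the order of operations in the notebook text. The symbols are assumptions introduced here: $c_i$ is the read count of gene $i$ and $L_i$ its length in base pairs, so that $\mathrm{RPK}_i = c_i / (L_i / 10^{3})$.

$$
\mathrm{TPM}_i = \frac{\mathrm{RPK}_i}{\dfrac{1}{10^{6}} \sum_j \mathrm{RPK}_j} = 10^{6} \cdot \frac{c_i / L_i}{\sum_j c_j / L_j}
$$

Written this way, it is visible that the TPM values of one sample always sum to $10^{6}$.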
"cell_type": "markdown",
"metadata": {},
"source": [
"The effects of both normalizations becomes apparent when comparing the relation of the samples "
Not really, at least not at first glance. I would suggest setting the y-axes of the RPKM and TPM plots to the same range, which would show more clearly that the TPM values are lower than the RPKM values. As the plot stands now, the values look identical until one looks at the axes.
Additionally, this plot would really benefit from a 4th chart showing the gene length of each gene.
docs/rnaseq_normalization.ipynb
Outdated
"metadata": {},
"source": [
"## RPKM & TPM\n",
"RNA-Seq is a transcriptomics technique, that quantifies RNA molecules in a biological sample. When dealing with RNA-sequencing data, normalization is needed to correct technical biases. RPKM and TPM are two metrics that normalize for gene length and sequencing depth. RNA-Sequencing data needs to be normalized for gene length, because longer genes show greater read counts when expressed at the same level and for sequencing depth, as deeper sequencing depth produces more read counts per gene.\n",
I think this introduction understates the complexity of these datasets a little. I would suggest adding a few more sentences about the method, e.g. that it is high-throughput and can quantify the full transcriptome.
open System.Collections.Generic

module RNASeq =
At least the public functions and types should have XML documentation, so that readers get context about what they do without having to browse the documentation page.
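A minimal sketch of what such XML documentation could look like. The function `readsPerKilobase` is a hypothetical helper made up for illustration and is not taken from the PR:

```fsharp
/// <summary>
/// Converts a raw read count to reads per kilobase (RPK).
/// Hypothetical helper, shown only to illustrate the doc comment layout.
/// </summary>
/// <param name="geneLength">Gene length in base pairs.</param>
/// <param name="readCount">Raw number of reads mapped to the gene.</param>
/// <returns>The read count divided by the gene length in kilobases.</returns>
let readsPerKilobase (geneLength: float) (readCount: float) : float =
    readCount / (geneLength / 1000.0)
```

Doc comments in this form are shown by editor tooltips, so the units of each parameter are visible at the call site.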
testCase "RPKM" (fun _ ->
    Expect.equal
        (RNASeq.rpkms testInSeq
         |> Array.ofSeq)
You can use Expect.sequenceEqual instead of casting to arrays here.
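A rough sketch of that suggestion, assuming an F# script that loads Expecto from NuGet. The `doubleAll` function is a hypothetical stand-in for the function under test, since the signature of `RNASeq.rpkms` and the contents of `testInSeq` are not shown in this conversation:

```fsharp
#r "nuget: Expecto"
open Expecto

// Hypothetical stand-in for the function under test; it returns a lazy seq,
// just like a normalization function might.
let doubleAll (xs: seq<float>) : seq<float> =
    xs |> Seq.map (fun x -> x * 2.0)

let tests =
    testCase "sequence comparison without Array.ofSeq" (fun _ ->
        // Compares both sequences element by element, so neither side
        // has to be converted to an array first.
        Expect.sequenceEqual
            (doubleAll [ 1.0; 2.0; 3.0 ])
            [ 2.0; 4.0; 6.0 ]
            "Sequences should match element by element")
```

On a mismatch, the failure message points at the first differing position, which is usually easier to read than a diff of two printed arrays.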
Looks good so far. The documentation page could use some more information, see the individual comments. Here are some relevant sources:
Nice!
Small nitpick and we can merge this:
The y-axis titles on the plot should differ from each other, as not all of the charts show 'Read Counts'. To do that, you can set the axis title on the individual charts before creating the grid. Also, a little more space would improve readability and keep the axes from overlapping (see Chart.withSize).
🥳
Thank you for contributing to BioFSharp. Please take the time to tell us a bit more about your PR.
Please list the changes introduced in this PR
Description
This contribution adds RPKM and TPM normalization, as well as unit tests and documentation. RPKM and TPM are metrics for normalized RNA-sequencing data.
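For readers of this thread, the difference between the two metrics can be sketched on toy data. This is not the code from the PR: the `genes` list, the tuple layout, and the functions `rpkm` and `tpm` are assumptions made up for illustration, and they follow the order of operations described in the notebook text.

```fsharp
// Hypothetical toy input: (gene id, gene length in base pairs, raw read count).
let genes =
    [ ("geneA", 2000.0, 10.0)
      ("geneB", 4000.0, 30.0)
      ("geneC", 1000.0, 5.0) ]

/// RPKM: normalize for sequencing depth first, then for gene length.
let rpkm (data: (string * float * float) list) =
    // Scaling factor: total reads in the sample divided by one million.
    let perMillion = (data |> List.sumBy (fun (_, _, count) -> count)) / 1e6
    data
    |> List.map (fun (id, length, count) ->
        let rpm = count / perMillion
        id, rpm / (length / 1000.0))

/// TPM: normalize for gene length first, then for sequencing depth.
let tpm (data: (string * float * float) list) =
    // Reads per kilobase for every gene.
    let rpk =
        data
        |> List.map (fun (id, length, count) -> id, count / (length / 1000.0))
    // Scaling factor: sum of all RPK values divided by one million.
    let perMillion = (rpk |> List.sumBy snd) / 1e6
    rpk |> List.map (fun (id, value) -> id, value / perMillion)

let rpkmValues = rpkm genes
let tpmValues = tpm genes
```

Because the TPM scaling factor is built from the RPK values themselves, the TPM values of a sample sum to one million by construction, while the sum of RPKM values generally differs from sample to sample.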
[Required] please make sure you checked that
[Optional]