Add RNA Seq normalization methods #136

ScheidTo · 2024-05-06T06:26:30Z

Thank you for contributing to BioFSharp. Please take the time to tell us a bit more about your PR.

Please list the changes introduced in this PR

added RPKM Normalization
added TPM Normalization

Description
This contribution adds RPKM and TPM normalization, as well as unit tests and documentation. RPKM and TPM are metrics for normalized RNA-sequencing data.

[Required] please make sure you checked that

The project builds without problems on your machine

[Optional]

Added unit tests regarding the added features

codecov · 2024-05-06T06:49:57Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

kMutagene · 2024-05-06T07:04:00Z

docs/rnaseq_normalization.ipynb

+    "\n",
+    "#### RPKM:\n",
+    "\n",
+    "RPKM (Reads per kilobase million) normalization at first determines a scaling factor, by calculating the sum of all reads in a sample and dividing that number by 1,000,000. That scaling factor is used to calculate RPM (Reads per million), by dividing the read counts for each sample with it, normalizing for sequencing depth. To get RPKM and normalize for gene length, RPM values are divided by genelength in kilobases. RPKM is applied by using the `RNASeq.rpkms` function.\n"


Here it would be beneficial to add the formula for RPKM.
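
A sketch of the formula in question, derived from the description in the quoted cell. The symbols $c_g$, $L_g$ and $N$ are notation introduced here for illustration and do not come from the notebook:

$$
\mathrm{RPKM}_g \;=\; \frac{c_g}{\dfrac{N}{10^{6}} \cdot \dfrac{L_g}{10^{3}}} \;=\; \frac{c_g \cdot 10^{9}}{N \cdot L_g},
\qquad N = \sum_{j} c_j
$$

Here $c_g$ is the read count of gene $g$ in one sample, $L_g$ is the length of gene $g$ in base pairs, and $N$ is the total number of reads in that sample. The factor $N / 10^{6}$ is the per-million scaling factor described in the text, and $L_g / 10^{3}$ is the gene length in kilobases.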

kMutagene · 2024-05-06T07:04:43Z

docs/rnaseq_normalization.ipynb

+    "\n",
+    "#### TPM:\n",
+    "\n",
+    "What differentiates TPM (Transcripts per kilobase million) from RPKM is the order of operations. To calculate TPM values, data gets normalized for gene length first. This is achieved by calculating RPK values (reads per kilobase), by dividing the read counts by genelength in kilobases. The sum of all RPK values is divided by 1,000,000, to get a scaling factor. Finally, TPM values are calculated by dividing the RPK values by the scaling factor, also normalizing for sequencing depth.\n",


Here it would be beneficial to add the formula for TPM.
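
A sketch of the formula in question, derived from the description in the quoted cell. The symbols $c_g$ and $L_g$ are notation introduced here for illustration and do not come from the notebook:

$$
\mathrm{RPK}_g = \frac{c_g}{L_g / 10^{3}},
\qquad
\mathrm{TPM}_g = \frac{\mathrm{RPK}_g}{\dfrac{1}{10^{6}} \sum_{j} \mathrm{RPK}_j} = \frac{\mathrm{RPK}_g}{\sum_{j} \mathrm{RPK}_j} \cdot 10^{6}
$$

Here $c_g$ is the read count of gene $g$ in one sample and $L_g$ is its length in base pairs. Because the scaling factor is computed from the length-normalized values, the TPM values of a sample always sum to $10^{6}$, which is not generally true for RPKM.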

kMutagene · 2024-05-06T07:08:00Z

docs/rnaseq_normalization.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The effects of both normalizations becomes apparent when comparing the relation of the samples "


Not really, at least not at first glance. I would suggest setting the y-axis of the RPKM and TPM plots to the same range, which would show more clearly that the TPM values are lower than the RPKM values. As the plot stands now, the values look identical until one looks at the axes.

Additionally, this plot would really benefit from a 4th chart showing the gene length of each gene.

kMutagene · 2024-05-06T07:13:26Z

docs/rnaseq_normalization.ipynb

+   "metadata": {},
+   "source": [
+    "## RPKM & TPM\n",
+    "RNA-Seq is a transcriptomics technique, that quantifies RNA molecules in a biological sample. When dealing with RNA-sequencing data, normalization is needed to correct technical biases. RPKM and TPM are two metrics that normalize for gene length and sequencing depth. RNA-Sequencing data needs to be normalized for gene length, because longer genes show greater read counts when expressed at the same level and for sequencing depth, as deeper sequencing depth produces more read counts per gene.\n",


I think this introduction understates the complexity of these datasets a little. I would suggest adding a few more sentences about the method, e.g. that it is high-throughput and can quantify the full transcriptome.

kMutagene · 2024-05-06T07:33:52Z

src/BioFSharp.Stats/RNASeq.fs

+open System.Collections.Generic
+
+
+module RNASeq = 


At least the public functions and types should have XML documentation, so that users get context about what they do without having to browse the documentation page.
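
A minimal sketch of what such XML documentation could look like. The record type `GeneReads`, the module name `RNASeqSketch`, and the exact signature are assumptions made for illustration; the types and signatures in the PR may differ.

```fsharp
/// Hypothetical input record for illustration; not the type used in the PR.
type GeneReads =
    { GeneID: string
      GeneLength: float // gene length in base pairs
      ReadCount: float } // raw read count in one sample

module RNASeqSketch =

    /// <summary>
    /// Normalizes raw read counts of one sample to RPKM
    /// (reads per kilobase per million mapped reads).
    /// </summary>
    /// <param name="genes">Genes of one sample with length in base pairs and raw read count.</param>
    /// <returns>One (gene ID, RPKM value) pair per input gene, in input order.</returns>
    let rpkms (genes: seq<GeneReads>) : seq<string * float> =
        // Cache so the input is not enumerated twice if it is a lazy sequence.
        let genes = Seq.cache genes
        // Per-million scaling factor: total reads in the sample / 1,000,000.
        let perMillion = (genes |> Seq.sumBy (fun g -> g.ReadCount)) / 1e6
        genes
        |> Seq.map (fun g ->
            let rpm = g.ReadCount / perMillion // normalize for sequencing depth
            g.GeneID, rpm / (g.GeneLength / 1000.)) // normalize for gene length in kb
```

With comments in this form, editor tooltips show the summary, parameter, and return descriptions at the call site.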

kMutagene · 2024-05-06T07:35:12Z

tests/BioFSharp.Tests/BioFSharp.Stats/RNASeqTests.fs

+        testCase "RPKM" (fun _ ->
+            Expect.equal 
+                (RNASeq.rpkms testInSeq
+                |> Array.ofSeq)


You can use `Expect.sequenceEqual` here instead of converting both sides to arrays.
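
A minimal sketch of the suggested pattern, assuming Expecto is loaded from NuGet in a script. The function `doubled` is a hypothetical stand-in for `RNASeq.rpkms`, used only so that the example is self-contained:

```fsharp
#r "nuget: Expecto"
open Expecto

// Hypothetical stand-in for a function that returns a seq, like the one under test.
let doubled (xs: seq<float>) : seq<float> = xs |> Seq.map (fun x -> x * 2.)

let tests =
    testList "sequenceEqual sketch" [
        testCase "compares sequences directly" (fun _ ->
            // Argument order is actual, expected, message; no Array.ofSeq needed.
            Expect.sequenceEqual
                (doubled (seq [ 1.; 2.; 3. ]))
                (seq [ 2.; 4.; 6. ])
                "sequences should match element by element")
    ]

// Run the test list with default settings; the return value is the exit code.
runTestsWithCLIArgs [] [||] tests |> ignore
```

`Expect.sequenceEqual` compares the two sequences element by element, so the failure message can point at the first position where they differ.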

kMutagene

Looks good so far. The documentation page could use some more information, see the individual comments. Here are some relevant sources:

ScheidTo · 2024-05-06T13:04:02Z

@kMutagene

kMutagene

Nice!

Small nitpick, and then we can merge this:

The y-axis titles on the plot should differ from each other, as not all of the charts show 'Read Counts'. For that, you can set the axis title on the individual charts before creating the grid. Also, a little more space would improve readability and keep the axes from overlapping (see `Chart.withSize`).

kMutagene · 2024-05-13T10:54:49Z

🥳

ScheidTo added 5 commits April 9, 2024 13:15

add empty test

e6e566e

rename file ending

a967c53

tpkm & rpkm + tests

7a3c992

with record type

2b906ea

input as record types

b40fa7a

kMutagene reviewed May 6, 2024

kMutagene requested changes May 6, 2024

Added XML tags

ce12904

kMutagene requested changes May 13, 2024

modified axis labelling

a5382ff

kMutagene approved these changes May 13, 2024

kMutagene changed the title ~~RNA Seq~~ Add RNA Seq normalization methods May 13, 2024

kMutagene merged commit 32d20c3 into CSBiology:developer May 14, 2024
4 checks passed