Question Regarding mask1_front and Barcode Demultiplexing in Direct RNA Seq #1060
Comments
You should specify your flanks in |
Hello @malton-ont, thank you for the advice! Although this is not directly related to the previous topic, I have another question. My current plan is to basecall, demultiplex using the custom barcodes, and then sort the raw signal (POD5) according to the demultiplexing results. Each basecalled read has a unique read_id, and I am thinking of using it to match reads back to the raw data. Would this be possible? If so, how can it be done, and is there an established method for it? I would appreciate your advice. Thank you!
Yes, this should be possible. Note that any reads that have been split will have new read-ids, so you'll need to look at the You'll probably want to take a look at https://pypi.org/project/pod5/, particularly the filter and subset commands, but that discussion may be better placed on the community forums as that isn't a dorado issue. |
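As a rough illustration of the read_id matching described above, the sketch below groups read IDs by barcode and writes one ID list per barcode. It assumes the demultiplexing results have already been exported to a tab-separated file with `read_id` and `barcode` columns (a hypothetical layout, not something dorado is known to write in this form), and the function and file names are made up for the example. Split reads carry new read IDs, so they would first need mapping back to their parent IDs, which this sketch does not attempt.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_read_ids(tsv_path):
    """Return {barcode: [read_id, ...]} from a TSV with read_id/barcode columns."""
    groups = defaultdict(list)
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            groups[row["barcode"]].append(row["read_id"])
    return dict(groups)


def write_id_lists(groups, out_dir):
    """Write one text file per barcode, one read ID per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for barcode, read_ids in groups.items():
        path = out_dir / f"{barcode}.txt"
        path.write_text("\n".join(read_ids) + "\n")
        paths[barcode] = path
    return paths
```

Each list could then be passed to the pod5 filter command mentioned above; the exact options are best checked with `pod5 filter --help` for the installed version.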
Just jumping into this thread with a Q: do the barcodes in RNA004 have to be RNA, or can they be DNA? It's unclear to me what the basecaller would do for read trimming as I'm guessing it removes a DNA-associated signal. Eg. If I have ADAPTER-BARCODE-AAAAA-RNA, does the barcode have to be RNA or can it be DNA? |
@malton-ont Thank you for the reply! I'll try it.
Hello @billytcl, the custom adapter is made from two DNA primers with partially complementary sequences, so an annealing step is needed when preparing the library. I don't know the details of the RTA, but it also seems to be composed of DNA strands; checking the library protocol should help (https://nanoporetech.com/document/direct-rna-sequencing-sequence-specific-sqk-rna004). There are also a few articles on demultiplexing direct RNA seq data. One reports that the raw signal differs markedly between the RNA region of a read and the adapter region, because of the difference between DNA and RNA (https://genome.cshlp.org/content/30/9/1345). In addition, the Dorado manual says it detects the DNA adapter sequence, so I assume it trims the DNA adapter automatically, although the --no-trim option disables adapter trimming. So I think making the barcode from DNA is okay. I had a similar question and searched for it, and this is what I found.
Dorado attempts to remove any DNA signal from RNA reads - in 0.8.0 this occurs regardless of the |
@malton-ont
original: RNA - [RTA] - RLA
mine: RNA - [target-specific sequence - barcode - rear region of RTA] - RLA
The barcode is annealed DNA. If I try to detect this barcode with the command below,
dorado basecaller sup --barcode-arrangement [barcode_arra.toml] --barcode-sequences [barcode_sequence.fastq] [POD5] > [output]
will it be unable to detect the barcode sequence? I have two thoughts on this situation. First, could it be that the barcode sequence is not trimmed because it differs from the existing RTA sequence? I suspect this is not the case, as the signal itself would likely still be classified as DNA. Second, if it is trimmed, would it be possible to read only the DNA barcode correctly by basecalling with an option that specifically basecalls DNA signal? I've heard that the POD5 file contains information about the library kit; is it possible to ignore this and still run the Dorado basecaller?
I would expect the barcode to also be removed if it looks like DNA - this is done on signal, not sequence. It may be plausible to basecall the reads first with a DNA model and then subset by barcode and rebasecall with the RNA model - I haven't tried this though, and you may get better answers about this kind of process on the Nanopore community forums. |
Issue Report
Please describe the issue:
Hello, I am currently performing target-specific custom barcode demultiplexing using Direct RNA seq data.
In the options, there is a setting for mask1_front, and the explanation states:
(Required) The leading flank for the front barcode (applies to single and double ended barcodes). Can be an empty string.
From my understanding, this option is for specifying the flank sequence for the front-attached barcode.
In my Direct RNA seq library, adapters are only attached to the rear end of the read, and I have inserted barcodes into this adapter sequence.
In this case, how should I adjust this option? I am thinking of setting it as follows:
mask1_front = ""
mask1_rear = ""
mask2_front = ""
mask2_rear = "GGCC"
What do you think of this approach?
For reference, although this is target-specific, there are multiple targets, so it’s difficult to define a single flank sequence for the front barcode. However, the rear part is clear since I can identify the custom-specific adapter sequence from the Direct RNA seq manual.
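One way to sanity-check the mask settings proposed above before running dorado is sketched below. It takes the arrangement as a plain Python dict (for instance, the table obtained by parsing the TOML file), confirms that the four mask keys are present, and reports which ones are empty strings. The helper name `check_masks` is made up for this example. It only encodes the rule quoted above for mask1_front (required, but may be empty); applying the same rule to the other three masks is an assumption, and the check says nothing about whether dorado will actually find the barcode with these flanks.

```python
MASK_KEYS = ("mask1_front", "mask1_rear", "mask2_front", "mask2_rear")


def check_masks(arrangement):
    """Return (missing, empty) lists of mask keys for an arrangement dict."""
    missing = [key for key in MASK_KEYS if key not in arrangement]
    empty = [key for key in MASK_KEYS if arrangement.get(key) == ""]
    return missing, empty


# The values proposed in this issue.
proposed = {
    "mask1_front": "",
    "mask1_rear": "",
    "mask2_front": "",
    "mask2_rear": "GGCC",
}

missing, empty = check_masks(proposed)
print("missing keys:", missing)
print("empty flanks:", empty)
```

With the proposed values, no key is missing and only mask2_rear carries a flank, which matches the stated intent of describing a barcode located in the rear adapter.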
Steps to reproduce the issue:
Please list any steps to reproduce the issue.
Run environment:
Dorado version: 0.8.0
Dorado command:
/home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/dorado-0.8.0-linux-x64/bin/dorado basecaller sup --no-trim --barcode-arrangement barcode_arra.toml --barcode-sequences barcode_sequence.fastq /home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/rawdata/data/240923histone/HepG2/20240923_1720_P2S-01504-A_PAW01284_59f98b81/pod5_total/ > DORADO_Barcode_basecall_3.bam
Operating system:
Hardware (CPUs, Memory, GPUs) : A100
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): POD5
Source data location (on device or networked drive - NFS, etc.):
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs