feat: porting ngsderive #35
This reverts commit 553fd68.
Some intermediate comments (some are old) from my last review. I want to get these in front of you so you can address them before squashing or whatever you plan to do with the commits here.
```diff
@@ -20,6 +25,25 @@ pub struct DeriveArgs {
 /// All possible subcommands for `ngs derive`.
 #[derive(Subcommand)]
 pub enum DeriveSubcommand {
+    /// Derives the quality score encoding used to produce the file.
+    Encoding(self::encoding::DeriveEncodingArgs),
```
FYI, we're going to change how this is named slightly in the future.
Co-authored-by: Clay McLeod <[email protected]>
Porting https://github.com/stjudecloud/ngsderive to this repo.
An attempt was made to keep the algorithms in this port and the original repo as similar/consistent as possible. However, some tweaks and improvements were made, which are detailed below. The biggest difference between the Python code and the Rust code is the output: `ngsderive` reported TSVs with pretty limited information, while this code reports results as JSON and outputs comprehensive metrics related to processing, to make debugging/assessment easier. I'm not going to detail all the new metrics reported here; check out the code for that.
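To illustrate the reporting style, here is a minimal sketch of emitting a derived result as JSON, assuming the `serde` and `serde_json` crates with the `derive` feature; the `EndednessResult` struct and its fields are hypothetical stand-ins, not the actual types in this PR.

```rust
// Hypothetical result type; the real structs in this PR differ.
use serde::Serialize;

#[derive(Serialize)]
struct EndednessResult {
    succeeded: bool,
    endedness: String,
    // Comprehensive processing metrics ride along with the final call,
    // which makes debugging/assessment easier than a bare TSV row.
    first_in_pair: u64,
    last_in_pair: u64,
    reads_per_template: f64,
}

fn main() {
    let result = EndednessResult {
        succeeded: true,
        endedness: String::from("Paired-End"),
        first_in_pair: 1_000_000,
        last_in_pair: 1_000_000,
        reads_per_template: 2.0,
    };

    // Pretty-print the derived result as JSON on stdout.
    println!("{}", serde_json::to_string_pretty(&result).unwrap());
}
```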
Notes on the individual subcommands:

- `encoding`
- `endedness`: the port keys off the `is_segmented` bitwise flag. This led to a rework of `Single-End` classification: when `is_segmented == false`, we DO NOT check `is_first` or `is_last`; if `is_segmented == false`, then we call it `Single-End` (we still require `RPT == 1.0`, i.e. reads per template, to classify as `Single-End`). `Paired-End` logic is essentially the same, but now we require that every read `is_segmented`. (A sketch of this decision logic appears after this list.)
- `instrument`
- `junction-annotation`: like `ngsderive`, we have a MAPQ filter of `30` as the default. On our STAR data, this meant "only look at unique mappers. (Discard multi-mappers.)" That doesn't work in Rust when we're using noodles (see the MAPQ sketch after this list). The Rust implementation also lends itself to the `junction-saturation` algorithm (unlike the Python code, which was complicated by a reliance on pysam), so we can implement that pretty easily in the future.
- `readlen`
- `strandedness`: like `junction-annotation`, plus a fix to `--min-reads-per-gene`: everything was being counted as evidence, so this filter didn't do anything. It now works (see the tallying sketch after this list). I didn't notice any swing in the results when I played with disabling it, so IDK if it really impacts results much at all, at least at a value of `10` vs. `0`. Maybe jacking it up would change things? IDK, haven't investigated.
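To make the reworked `endedness` rules concrete, here is a minimal sketch of the decision logic, assuming the flag counts have already been tallied over the whole file. The `Counts` struct, the `classify` function, and the `RPT == 2.0` check on the `Paired-End` arm are illustrative assumptions, not the PR's actual code.

```rust
/// Per-file flag tallies gathered while iterating records (hypothetical).
struct Counts {
    segmented: u64,   // reads with is_segmented == true
    unsegmented: u64, // reads with is_segmented == false
    first: u64,       // reads with is_first set
    last: u64,        // reads with is_last set
}

/// Classify endedness from flag counts and an optional reads-per-template
/// (RPT) measurement. Returns `None` when the evidence is inconsistent.
fn classify(counts: &Counts, rpt: Option<f64>) -> Option<&'static str> {
    if counts.segmented == 0 {
        // Every read has is_segmented == false: call it Single-End without
        // consulting is_first/is_last at all. If RPT was calculated, it
        // must also be exactly 1.0.
        return match rpt {
            Some(r) if r != 1.0 => None, // inconsistent evidence
            _ => Some("Single-End"),
        };
    }

    // Paired-End logic is essentially the same as before, except every
    // read must now be segmented (the RPT == 2.0 requirement here is
    // illustrative).
    if counts.unsegmented == 0
        && counts.first == counts.last
        && rpt.map_or(true, |r| r == 2.0)
    {
        return Some("Paired-End");
    }

    None
}

fn main() {
    let counts = Counts { segmented: 0, unsegmented: 500, first: 0, last: 0 };
    assert_eq!(classify(&counts, Some(1.0)), Some("Single-End"));
}
```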
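On the `junction-annotation` MAPQ point: my reading is that the wrinkle comes from noodles treating a raw MAPQ of 255 (the SAM sentinel for "unavailable", which STAR happens to assign to unique mappers) as a missing value, so a naive `>= 30` comparison never sees those reads. A hedged sketch under that assumption follows; note that the module path for `MappingQuality` varies across noodles versions.

```rust
// NOTE: this path matches older noodles-sam releases; newer versions
// expose the type as `noodles_sam::alignment::record::MappingQuality`.
use noodles_sam::record::MappingQuality;

const MIN_MAPQ: u8 = 30;

/// Decide whether a record passes the MAPQ filter. noodles parses a raw
/// MAPQ of 255 as `None` ("missing"), so the filter must make an explicit
/// choice for that case; this sketch keeps such reads.
fn passes_mapq_filter(mapq: Option<MappingQuality>) -> bool {
    match mapq {
        Some(q) => q.get() >= MIN_MAPQ,
        None => true, // raw 255: cannot be compared against the threshold
    }
}

fn main() {
    // Raw 255 is "missing", not a high score...
    assert!(MappingQuality::new(255).is_none());
    // ...so it reaches the `None` arm and is kept here.
    assert!(passes_mapq_filter(MappingQuality::new(255)));
    // A STAR multi-mapper (e.g. MAPQ 3) is discarded.
    assert!(!passes_mapq_filter(MappingQuality::new(3)));
}
```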
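And for the `--min-reads-per-gene` fix in `strandedness`, here is a minimal sketch of the intended tallying behavior: genes below the threshold contribute no evidence at all. The `GeneEvidence` shape and the `tally` function are hypothetical, not the PR's actual code.

```rust
use std::collections::HashMap;

/// Per-gene strand evidence tallied while walking reads (hypothetical shape).
#[derive(Default)]
struct GeneEvidence {
    forward: u64,
    reverse: u64,
}

/// Sum strand evidence across genes, but only let genes with at least
/// `min_reads_per_gene` overlapping reads vote. Previously every gene was
/// counted as evidence, so the filter did nothing.
fn tally(genes: &HashMap<String, GeneEvidence>, min_reads_per_gene: u64) -> (u64, u64) {
    genes
        .values()
        .filter(|g| g.forward + g.reverse >= min_reads_per_gene)
        .fold((0, 0), |(f, r), g| (f + g.forward, r + g.reverse))
}

fn main() {
    let mut genes: HashMap<String, GeneEvidence> = HashMap::new();
    genes.insert("GENE_A".into(), GeneEvidence { forward: 40, reverse: 2 });
    genes.insert("GENE_B".into(), GeneEvidence { forward: 3, reverse: 1 });

    // With min_reads_per_gene = 10, GENE_B contributes no evidence.
    assert_eq!(tally(&genes, 10), (40, 2));
}
```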
Before submitting this PR, please make sure:

- [ ] Your code passes all lints and tests (`cargo test` and `cargo clippy`).