feat: porting ngsderive #35
Open
a-frantz wants to merge 91 commits into main from a-frantz/ngsderive
Commits
0c01978
feat(derive): new readlen subcommand
a-frantz 6bb2c2a
fix(derive/command/readlen): proper error message
a-frantz 23d8a6c
[WIP] protoype for endedness and skeleton of encoding
a-frantz 5ed13e6
[WIP]: doesn't compile. Begin implementing RPT calculations
a-frantz c2fe1d8
revise: applies Clay's edits
claymcleod b4ce76b
[WIP]: Broken, not compiling. starting to calc RPT.
a-frantz 9aeebf4
[WIP]
a-frantz 137beea
feat: working derive endedness subcommand
a-frantz d1cac4d
chore: better comments/test name/log message
a-frantz d92368f
fix: apply some of Clay's performance suggestions
a-frantz 7e6ad22
fix(derive/endedness/compute.rs): test updates
a-frantz 34bf08d
chore(derive/readlen/compute): cleanup
a-frantz 43aaec3
fix(derive/command/endedness): try using HashMap instead of Trie
a-frantz 553fd68
feat(derive/command/endedness): lazy record reading
a-frantz 1b1117c
Revert "feat(derive/command/endedness): lazy record reading"
a-frantz b1e9e86
tests(derive/endedness/comput): reimplement tests
a-frantz ae133d6
[WIP] suggestions from @zaeleus. Compiler error
a-frantz e3b8e85
tests(derive/command/endedness): rewrite tests with latest changes
a-frantz 8b23de0
chore(endedness): handle error where RG tag can't be parsed as_str()
a-frantz 65ea501
fix(endedness): remove all Arcs and lazy_statics
a-frantz 76305f2
refactor(derive/readlen): move num_samples counting to outer func
a-frantz d63eb2a
perf(derive/readlen): don't iterate through all read_lengths
a-frantz cb1d2f5
feat(derive/endedness): add `validate_read_group_info()` call
a-frantz 9b45316
fix: typos
a-frantz 520ae46
revert
a-frantz f280c0f
revert
a-frantz 12f3f66
fix: corrections made after previous reverts
a-frantz 81511e2
tests(derive/endedness): reimplement tests
a-frantz 55ee258
chore(derive): disable index checking when not needed
a-frantz 35ccc3f
chore:(derive/readlen): return an anyhow::Ok instead of plain Ok
a-frantz e8c4816
docs(derive/readlen): correction in module name
a-frantz 7e1578b
docs(derive/endedness): fix docs referring to wrong subcommand
a-frantz eab7cf9
fix: cap QNAME warnings to 100 QNAMES
a-frantz 019d61d
feat(src/derive): use NumberOfRecords and RecordCounter structs
a-frantz dab1dca
style: make arg_in_range() nicer everywhere its used
a-frantz 489c2bb
[WIP]: junction annotation
a-frantz 14db813
feat(derive): junction-annotation subcommand
a-frantz bf31648
chore: remove radix_trie dep
a-frantz 390dc1f
feat(derive/junction-annotation): better results reporting
a-frantz 3a7051f
docs(derive/junction_annotation): be more clear in results docs
a-frantz 5e3cb89
feat(derive/junction_annotation): add short options to params
a-frantz 1ea4154
chore: return anyhow::Ok where appropriate
a-frantz ee77abd
tests: add a test for process() and summarize()
a-frantz 65c4935
chore: typos
a-frantz 6fcbd60
feat: remove fuzzy searching ability. Boosts performance as well.
a-frantz d872c82
feat: first pass implementation of `encoding`
a-frantz 89b60e0
[WIP] to share code. Partial strandedness implementation
a-frantz d5b5b5c
fix(derive/endedness): add logic for 0x1 bit
a-frantz fe6eabd
[WIP] pushing to share. Partial strandedness implementation
a-frantz aaaf1ac
style: rename ignored_flags to filtered_by_flags
a-frantz 9b4f6ea
feat: derive strandedness (prototype)
a-frantz 33f5afc
style: much prettier code. One broken test.
a-frantz 3c15c10
refactor: make read_groups util nicer
a-frantz 7f95d34
refactor(derive/strandedness): separate out results from compute
a-frantz d8d9413
fix(strandedness): break when successful
a-frantz b084993
docs: junction_annotation to junction-annotation
a-frantz 7767da4
[WIP]: switch min_mapq to a proper MappingQuality
a-frantz 64318f3
fix: rework MappingQuality argument to work
a-frantz 2280e15
style: f32 -> f64 and code cleanup
a-frantz 8f90308
feat: add a log_every param to RecordCounter
a-frantz 85d9f19
style: code clean up
a-frantz 54dc217
style: use custom Strand enum more
a-frantz 725b936
fix: junction-annotation works again
a-frantz 9668a6d
feat(derive/instrument): behave more like other derive commands
a-frantz f569b38
fix(derive): print!(output) -> println!(output)
a-frantz 6a23773
fix(derive/strandedness): move RG validation out of compute
a-frantz 9be3850
tests(derive/endedness): fix the broken tests
a-frantz 2eac8d7
style(derive/junction-annotation): group reported junctions by contig
a-frantz b67bac8
fix: lots of code cleanup
a-frantz e1473a6
tests(derive): all derive commands have tests
a-frantz 42d09dc
fix: consistently return Options for String results
a-frantz 55b3bc7
style: prettify imports
a-frantz ded9a89
style: wrap optional variable in Option
a-frantz 11dc18d
docs: fix intra links
a-frantz 4225ec0
tests: fix broken gene test
a-frantz 780d3bc
feat(derive/instrument): more debug statements
a-frantz 09a2be2
feat(derive/instrument): in output, report found unique names
a-frantz 05a53fc
feat: more info in JSON report. Not complete yet. Open TODOs
a-frantz eae95c7
style(derive/instrument): a bit of code clean up
a-frantz 00999c1
docs: filling in TODOs
a-frantz 2661300
chore: delete dead code
a-frantz bfac8a1
tests(derive/junction-annotation): rewrite tests more modular
a-frantz 7ab84f2
stlye: Michael M. feedback
a-frantz 68f999c
Apply suggestions from code review
a-frantz 451a2a4
fix(derive/endedness): complete rename from last commit
a-frantz aa0880d
feat: use NonZeroUsize for Number of Records
a-frantz 406a7b8
feat(utils/args): improved behavior for NumberOfRecords CL utility
a-frantz f83015f
feat(derive): report by read group where appropriate and feasible
a-frantz cf3a869
chore: removing dead code
a-frantz 7bbb7e5
tests(derive/instrument): assert that read groups succeed
a-frantz 6091af4
fix(instrument): properly init flowcell entries
a-frantz
@@ -1,4 +1,9 @@
//! Functionality related to the `ngs derive` subcommand.

pub mod command;
pub mod encoding;
pub mod endedness;
pub mod instrument;
pub mod junction_annotation;
pub mod readlen;
pub mod strandedness;
@@ -0,0 +1,78 @@
//! Functionality relating to the `ngs derive encoding` subcommand itself.

use anyhow::{Context, Ok};
use clap::Args;
use noodles::bam;
use num_format::{Locale, ToFormattedString};
use std::collections::HashSet;
use std::io::BufReader;
use std::path::PathBuf;
use tracing::info;

use crate::derive::encoding::compute;
use crate::utils::args::NumberOfRecords;
use crate::utils::display::RecordCounter;

/// Clap arguments for the `ngs derive encoding` subcommand.
#[derive(Args)]
pub struct DeriveEncodingArgs {
    /// Source BAM.
    #[arg(value_name = "BAM")]
    src: PathBuf,

    /// Examine the first `n` records in the file.
    #[arg(
        short,
        long,
        default_value_t,
        value_name = "'all' or a positive, non-zero integer"
    )]
    num_records: NumberOfRecords,
}

/// Main function for the `ngs derive encoding` subcommand.
pub fn derive(args: DeriveEncodingArgs) -> anyhow::Result<()> {
    info!("Starting derive encoding subcommand.");

    let file = std::fs::File::open(args.src);
    let reader = file
        .map(BufReader::new)
        .with_context(|| "opening BAM file")?;
    let mut reader = bam::Reader::new(reader);
    let _header: String = reader.read_header()?.parse()?;
    reader.read_reference_sequences()?;

    let mut score_set: HashSet<u8> = HashSet::new();

    // (1) Collect quality scores from reads within the
    // file. Support for sampling only a portion of the reads is provided.
    let mut counter = RecordCounter::default();
    for result in reader.lazy_records() {
        let record = result?;

        for i in 0..record.quality_scores().len() {
            let score = record.quality_scores().as_ref()[i];
            score_set.insert(score);
        }

        counter.inc();
        if counter.time_to_break(&args.num_records) {
            break;
        }
    }

    info!(
        "Processed {} records.",
        counter.get().to_formatted_string(&Locale::en)
    );

    // (2) Derive encoding from the observed quality scores
    let result = compute::predict(score_set)?;

    // (3) Print the output to stdout as JSON (more support for different output
    // types may be added in the future, but for now, only JSON).
    let output = serde_json::to_string_pretty(&result).unwrap();
    println!("{}", output);

    Ok(())
}
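Step (1) of the encoding subcommand samples the first `n` records and gathers every distinct quality score into a set. The sketch below restates that idea on plain byte slices so it can run without noodles or the crate's own helpers. `RecordLimit` and `collect_scores` are hypothetical names made up for this illustration; they are not the PR's `NumberOfRecords` or `RecordCounter` types, and the parsing rule is only inferred from the `'all' or a positive, non-zero integer` value name.

```rust
use std::collections::HashSet;
use std::num::NonZeroUsize;

/// Hypothetical stand-in for an "'all' or a positive, non-zero integer" argument.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RecordLimit {
    All,
    First(NonZeroUsize),
}

impl std::str::FromStr for RecordLimit {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.eq_ignore_ascii_case("all") {
            return Ok(RecordLimit::All);
        }
        // `NonZeroUsize` rejects "0" and anything that is not an unsigned integer.
        s.parse::<NonZeroUsize>()
            .map(RecordLimit::First)
            .map_err(|e| format!("expected 'all' or a positive, non-zero integer: {e}"))
    }
}

/// Collects the distinct quality scores seen in the first `limit` records.
/// Each record is represented here by its slice of raw quality scores.
fn collect_scores<'a, I>(records: I, limit: RecordLimit) -> HashSet<u8>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut score_set = HashSet::new();
    let mut seen = 0usize;
    for scores in records {
        // `extend` does the work of an index-based inner loop.
        score_set.extend(scores.iter().copied());
        seen += 1;
        if let RecordLimit::First(n) = limit {
            if seen >= n.get() {
                break;
            }
        }
    }
    score_set
}

fn main() {
    let records: Vec<Vec<u8>> = vec![vec![30, 30, 40], vec![2, 40], vec![12]];
    let limit: RecordLimit = "2".parse().unwrap();
    let scores = collect_scores(records.iter().map(Vec::as_slice), limit);
    // Only the first two records are examined: {2, 30, 40}.
    println!("{} distinct scores", scores.len()); // prints "3 distinct scores"
}
```

If the quality-score type in the diff exposes its data as a byte slice (as the `as_ref()` call suggests), the same `extend` call could probably replace the indexed loop there, but that is untested against the noodles version this PR pins.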
@@ -0,0 +1,164 @@
//! Functionality relating to the `ngs derive endedness` subcommand itself.

use anyhow::Context;
use clap::Args;
use num_format::{Locale, ToFormattedString};
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;
use std::sync::Arc;
use tracing::{info, trace};

use crate::derive::endedness::compute;
use crate::derive::endedness::compute::OrderingFlagsCounts;
use crate::utils::args::arg_in_range as deviance_in_range;
use crate::utils::args::NumberOfRecords;
use crate::utils::display::RecordCounter;
use crate::utils::formats::bam::ParsedBAMFile;
use crate::utils::formats::utils::IndexCheck;
use crate::utils::read_groups::{get_read_group, validate_read_group_info, ReadGroupPtr};

/// Clap arguments for the `ngs derive endedness` subcommand.
#[derive(Args)]
pub struct DeriveEndednessArgs {
    /// Source BAM.
    #[arg(value_name = "BAM")]
    src: PathBuf,

    /// Examine the first `n` records in the file.
    #[arg(
        short,
        long,
        default_value_t,
        value_name = "'all' or a positive, non-zero integer"
    )]
    num_records: NumberOfRecords,

    /// Distance from 0.5 split between number of f+l- reads and f-l+ reads
    /// allowed to be called 'Paired-End'. The default value of `0.0` is only appropriate
    /// if the whole file is being processed.
    #[arg(long, value_name = "F64", default_value = "0.0")]
    paired_deviance: f64,

    /// Calculate and output Reads-Per-Template. This will produce a more
    /// sophisticated estimate for endedness, but uses substantially more memory.
    #[arg(long, default_value = "false")]
    calculate_reads_per_template: bool,

    /// Round RPT to the nearest INT before comparing to expected values.
    /// Appropriate if using `-n` > 0. Unrounded value is reported in output.
    #[arg(long, default_value = "false")]
    round_reads_per_template: bool,
}

/// Main function for the `ngs derive endedness` subcommand.
pub fn derive(args: DeriveEndednessArgs) -> anyhow::Result<()> {
    // (0) Parse arguments needed for subcommand.
    let paired_deviance = deviance_in_range(args.paired_deviance, 0.0..=0.5)
        .with_context(|| "Paired deviance is not within acceptable range")?;

    info!("Starting derive endedness subcommand.");

    let mut found_rgs = HashSet::new();

    let mut ordering_flags: HashMap<ReadGroupPtr, OrderingFlagsCounts> = HashMap::new();

    // only used if args.calc_rpt is true
    let mut read_names: Option<HashMap<String, Vec<ReadGroupPtr>>> = None;

    let ParsedBAMFile {
        mut reader, header, ..
    } = crate::utils::formats::bam::open_and_parse(args.src, IndexCheck::None)?;

    // (1) Collect ordering flags (and QNAMEs) from reads within the
    // file. Support for sampling only a portion of the reads is provided.
    let mut counter = RecordCounter::default();
    for result in reader.records(&header.parsed) {
        let record = result?;

        // Only count primary alignments and unmapped reads.
        if (record.flags().is_secondary() || record.flags().is_supplementary())
            && !record.flags().is_unmapped()
        {
            continue;
        }

        let read_group = get_read_group(&record, Some(&mut found_rgs));

        if args.calculate_reads_per_template {
            let read_name_map = read_names.get_or_insert_with(HashMap::new);
            match record.read_name() {
                Some(rn) => {
                    let rn = rn.to_string();
                    let rg_vec = read_name_map.get_mut(&rn);

                    match rg_vec {
                        Some(rg_vec) => {
                            rg_vec.push(Arc::clone(&read_group));
                        }
                        None => {
                            read_name_map.insert(rn, vec![(Arc::clone(&read_group))]);
                        }
                    }
                }
                None => {
                    trace!("Could not parse a QNAME from a read in the file.");
                    trace!("Skipping this read and proceeding.");
                    continue;
                }
            }
        }

        match (
            record.flags().is_segmented(),
            record.flags().is_first_segment(),
            record.flags().is_last_segment(),
        ) {
            (false, _, _) => {
                ordering_flags.entry(read_group).or_default().unsegmented += 1;
            }
            (true, true, false) => {
                ordering_flags.entry(read_group).or_default().first += 1;
            }
            (true, false, true) => {
                ordering_flags.entry(read_group).or_default().last += 1;
            }
            (true, true, true) => {
                ordering_flags.entry(read_group).or_default().both += 1;
            }
            (true, false, false) => {
                ordering_flags.entry(read_group).or_default().neither += 1;
            }
        }

        counter.inc();
        if counter.time_to_break(&args.num_records) {
            break;
        }
    }

    info!(
        "Processed {} records.",
        counter.get().to_formatted_string(&Locale::en)
    );

    // (2) Validate the read group information.
    let rgs_in_header_not_records = validate_read_group_info(&found_rgs, &header.parsed);
    for rg_id in rgs_in_header_not_records {
        ordering_flags.insert(Arc::new(rg_id), OrderingFlagsCounts::new());
    }

    // (3) Derive the endedness based on the ordering flags gathered.
    let result = compute::predict(
        ordering_flags,
        read_names,
        paired_deviance,
        args.round_reads_per_template,
    );

    // (4) Print the output to stdout as JSON (more support for different output
    // types may be added in the future, but for now, only JSON).
    let output = serde_json::to_string_pretty(&result).unwrap();
    println!("{}", output);

    anyhow::Ok(())
}
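The endedness loop filters out mapped secondary and supplementary records and then sorts each remaining record into one of five buckets based on the 0x1, 0x40, and 0x80 flag bits. The sketch below reproduces that classification on raw `u16` flag values so it can be run in isolation. The flag constants come from the SAM specification; `Counts`, `is_counted`, `tally`, and `looks_paired` are hypothetical names for this illustration. In particular, `looks_paired` is only one assumed reading of the `--paired-deviance` doc comment, since `compute::predict` is not part of the diff shown here.

```rust
// SAM flag bits, as defined in the SAM specification.
const SEGMENTED: u16 = 0x1;
const UNMAPPED: u16 = 0x4;
const FIRST_SEGMENT: u16 = 0x40;
const LAST_SEGMENT: u16 = 0x80;
const SECONDARY: u16 = 0x100;
const SUPPLEMENTARY: u16 = 0x800;

/// Hypothetical mirror of the five counters incremented in the loop.
#[derive(Debug, Default, PartialEq, Eq)]
struct Counts {
    unsegmented: usize,
    first: usize,
    last: usize,
    both: usize,
    neither: usize,
}

/// Same filter as the loop: skip secondary/supplementary records unless unmapped.
fn is_counted(flags: u16) -> bool {
    let non_primary = (flags & (SECONDARY | SUPPLEMENTARY)) != 0;
    let unmapped = (flags & UNMAPPED) != 0;
    !(non_primary && !unmapped)
}

/// Sorts one record into a bucket using the (segmented, first, last) bits.
fn tally(counts: &mut Counts, flags: u16) {
    match (
        (flags & SEGMENTED) != 0,
        (flags & FIRST_SEGMENT) != 0,
        (flags & LAST_SEGMENT) != 0,
    ) {
        (false, _, _) => counts.unsegmented += 1,
        (true, true, false) => counts.first += 1,
        (true, false, true) => counts.last += 1,
        (true, true, true) => counts.both += 1,
        (true, false, false) => counts.neither += 1,
    }
}

/// ASSUMPTION: "paired" means only first/last reads were seen and the share of
/// first reads is within `paired_deviance` of 0.5. The PR's real rule may differ.
fn looks_paired(counts: &Counts, paired_deviance: f64) -> bool {
    let ends = counts.first + counts.last;
    if ends == 0 || counts.unsegmented + counts.both + counts.neither != 0 {
        return false;
    }
    let first_fraction = counts.first as f64 / ends as f64;
    (first_fraction - 0.5).abs() <= paired_deviance
}

fn main() {
    let mut counts = Counts::default();
    // Two first reads, one last read, and one mapped supplementary read (skipped).
    for flags in [0x41u16, 0x81, 0x41 | SUPPLEMENTARY, 0x41] {
        if is_counted(flags) {
            tally(&mut counts, flags);
        }
    }
    println!("{:?}", counts);
    println!("paired at 0.0: {}", looks_paired(&counts, 0.0));
    println!("paired at 0.2: {}", looks_paired(&counts, 0.2));
}
```

With a partial sample the first/last split is rarely exactly even, which is presumably why the doc comment says the default deviance of `0.0` is only appropriate when the whole file is processed.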
FYI, we're going to change how this is named slightly in the future.