Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster loss during rule reformat_filter_clusters #7

Open
kathleenabadie opened this issue Jan 1, 2021 · 1 comment
Open

Cluster loss during rule reformat_filter_clusters #7

kathleenabadie opened this issue Jan 1, 2021 · 1 comment

Comments

@kathleenabadie
Copy link

Hello, During the rule_reformat_filter_clusters step, only 36% of the clusters are being written, and I am not sure why. I understand this is happening in parse_clusters.py but have not been able to identify the reason for this large loss of clusters identified. Do you have any tips for troubleshooting this? My complete run output is copied below. See step 6, where only 36% of the initial 49660 clusters are written:

$ snakemake -j 30 reads --configfile config_te_pilot.yml
Targets: target_bcl11b_te
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 30
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 cluster
1 cluster_consensus
1 copy_bed
1 detect_umi_consensus_fasta
1 detect_umi_fasta
1 map_1d
2 map_consensus
1 polish_clusters
1 reads
1 reformat_consensus_clusters
1 reformat_filter_clusters
1 seqkit_bam_acc_tsv
1 split_reads
14
Select jobs to execute...

[Wed Dec 23 15:12:18 2020]
rule map_1d:
input: data/te_pilot/20201214_sequencing_run.fastq.tar.gz, data/te_pilot/bcl11b_te_reference.fasta
output: te_pilot/align/1d.bam, te_pilot/align/1d.bam.bai
jobid: 11
wildcards: name=te_pilot
threads: 30

[M::mm_idx_gen::0.0020.00] collected minimizers
[M::mm_idx_gen::0.007
3.29] sorted minimizers
[M::main::0.0073.27] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.008
3.18] mid_occ = 2
[M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.0083.11] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225
[M::worker_pipeline::34.157
9.68] mapped 213509 sequences
[M::worker_pipeline::54.73413.15] mapped 212395 sequences
[M::worker_pipeline::64.724
14.21] mapped 211398 sequences
[M::worker_pipeline::79.61315.89] mapped 214326 sequences
[M::worker_pipeline::115.077
15.69] mapped 214710 sequences
[M::worker_pipeline::119.20515.19] mapped 215066 sequences
[M::worker_pipeline::139.776
15.51] mapped 216572 sequences
[M::worker_pipeline::151.83815.92] mapped 218548 sequences
[M::worker_pipeline::183.751
16.53] mapped 217083 sequences
[M::worker_pipeline::187.47316.23] mapped 209564 sequences
[M::worker_pipeline::207.912
16.37] mapped 209419 sequences
[M::worker_pipeline::218.89416.57] mapped 209530 sequences
[M::worker_pipeline::250.810
16.79] mapped 208914 sequences
[M::worker_pipeline::255.00116.54] mapped 210411 sequences
[M::worker_pipeline::275.629
16.72] mapped 208463 sequences
[M::worker_pipeline::286.47516.80] mapped 209649 sequences
[M::worker_pipeline::320.776
16.77] mapped 209401 sequences
[M::worker_pipeline::324.87016.58] mapped 208663 sequences
[M::worker_pipeline::344.302
16.64] mapped 208906 sequences
[M::worker_pipeline::355.86816.81] mapped 209277 sequences
[M::worker_pipeline::385.738
16.95] mapped 209489 sequences
[M::worker_pipeline::389.32916.81] mapped 208428 sequences
[M::worker_pipeline::409.746
16.84] mapped 210045 sequences
[M::worker_pipeline::420.26416.90] mapped 210041 sequences
[M::worker_pipeline::436.151
17.33] mapped 210967 sequences
[M::worker_pipeline::460.69417.09] mapped 211264 sequences
[M::worker_pipeline::462.820
17.01] mapped 121987 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax map-ont -k 13 -t 30 data/te_pilot/bcl11b_te_reference.fasta -
[M::main] Real time: 462.893 sec; CPU: 7874.300 sec; Peak RSS: 2.822 GB
[bam_sort_core] merging from 30 files and 5 in-memory blocks...
[Wed Dec 23 15:24:01 2020]
Finished job 11.
1 of 14 steps (7%) done
Select jobs to execute...

[Wed Dec 23 15:24:01 2020]
rule split_reads:
input: te_pilot/align/1d.bam
output: te_pilot/fasta_filtered, te_pilot/stats/umi_filter_reads_stats.txt
jobid: 10
wildcards: name=te_pilot

[Wed Dec 23 15:24:01 2020]
rule copy_bed:
input: data/te_pilot/te_target_amplicon.bed
output: te_pilot/targets.bed
jobid: 1
wildcards: name=te_pilot

[Wed Dec 23 15:24:01 2020]
Finished job 1.
2 of 14 steps (14%) done
Reads found: 6763079
Reads unmapped: 488026 (0%)
target_bcl11b_te
Reads found: 3363560
On target: 5129999 (75%)
841189 concatamers - 16%
925250 short - 18%
[Wed Dec 23 15:36:15 2020]
Finished job 10.
3 of 14 steps (21%) done
Select jobs to execute...

[Wed Dec 23 15:36:15 2020]
rule detect_umi_fasta:
input: te_pilot/fasta_filtered
output: te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta
jobid: 9
wildcards: name=te_pilot, target=target_bcl11b_te

Counting reads in te_pilot/fasta_filtered/target_bcl11b_te.fastq
100%|███████████████████████████████████████████████████████████████| 3363560/3363560 [03:35<00:00, 15598.04it/s]
Found 1656884 fwd and 1706676 rev reads (ratio: 0.9708251595499087)
71.71184697166098% of reads contained both UMIs with max 3 mismatches
[Wed Dec 23 15:40:11 2020]
Finished job 9.
4 of 14 steps (29%) done
Select jobs to execute...

[Wed Dec 23 15:40:11 2020]
rule cluster:
input: te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta
output: te_pilot/clustering/target_bcl11b_te/clusters_centroid.fasta, te_pilot/clustering/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering/target_bcl11b_te/vsearch_clusters
jobid: 8
wildcards: name=te_pilot, target=target_bcl11b_te
threads: 10

vsearch v2.15.1_linux_x86_64, 125.8GB RAM, 64 cores
https://github.com/torognes/vsearch

Reading file te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta 100%
134176110 nt in 2410777 seqs, min 50, max 60, avg 56
maxseqlength 60: 1294 sequences discarded.
Sorting by length 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 49660 Size min 1, max 2789, avg 48.5
Singletons: 13422, 0.6% of seqs, 27.0% of clusters
Multiple alignments 100%
[Wed Dec 23 15:43:31 2020]
Finished job 8.
5 of 14 steps (36%) done
Select jobs to execute...

[Wed Dec 23 15:43:31 2020]
rule reformat_filter_clusters:
input: te_pilot/clustering/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering/target_bcl11b_te/vsearch_clusters
output: te_pilot/clustering/target_bcl11b_te/clusters_fa, te_pilot/stats/target_bcl11b_te_vsearch_cluster_stats.tsv, te_pilot/clustering/target_bcl11b_te/smolecule_clusters.fa
jobid: 7
wildcards: name=te_pilot, target=target_bcl11b_te

100%|█████████████████████████████████████████████████████████████████████| 49660/49660 [01:09<00:00, 713.84it/s]
Clusters: 36% written (18041)
Reads: 2410777 found
Reads: 0 removed (0.0%)
Reads: 25% written
Reads: 95% in written clusters
[Wed Dec 23 15:44:42 2020]
Finished job 7.
6 of 14 steps (43%) done
Select jobs to execute...

[Wed Dec 23 15:44:42 2020]
rule polish_clusters:
input: te_pilot/clustering/target_bcl11b_te/clusters_fa, te_pilot/clustering/target_bcl11b_te/smolecule_clusters.fa
output: te_pilot/fasta/target_bcl11b_te_consensus_tmp, te_pilot/fasta/target_bcl11b_te_consensus.bam, te_pilot/fasta/target_bcl11b_te_consensus.fasta
jobid: 6
wildcards: name=te_pilot, target=target_bcl11b_te
threads: 30

[Wed Dec 23 21:31:06 2020]
Finished job 6.
7 of 14 steps (50%) done
Select jobs to execute...

[Wed Dec 23 21:31:06 2020]
rule detect_umi_consensus_fasta:
input: te_pilot/fasta/target_bcl11b_te_consensus.fasta
output: te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta
jobid: 5
wildcards: name=te_pilot, target=target_bcl11b_te

[Wed Dec 23 21:31:06 2020]
rule map_consensus:
input: te_pilot/fasta/target_bcl11b_te_consensus.fasta, data/te_pilot/bcl11b_te_reference.fasta
output: te_pilot/align/target_bcl11b_te_consensus.bam, te_pilot/align/target_bcl11b_te_consensus.bam.bai
jobid: 13
wildcards: name=te_pilot, target=target_bcl11b_te, type=consensus
threads: 3

[M::mm_idx_gen::0.0030.00] collected minimizers
[M::mm_idx_gen::0.005
0.84] sorted minimizers
[M::main::0.0050.83] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.005
0.80] mid_occ = 2
[M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.005*0.78] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225
Counting reads in te_pilot/fasta/target_bcl11b_te_consensus.fasta
100%|███████████████████████████████████████████████████████████████████| 18043/18043 [00:01<00:00, 15345.54it/s]
Found 18015 fwd and 28 rev reads (ratio: 643.3928571428571)
99.29058360583052% of reads contained both UMIs with max 3 mismatches
[Wed Dec 23 21:31:08 2020]
Finished job 5.
8 of 14 steps (57%) done
Select jobs to execute...

[Wed Dec 23 21:31:08 2020]
rule cluster_consensus:
input: te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta
output: te_pilot/clustering_consensus/target_bcl11b_te/clusters_centroid.fasta, te_pilot/clustering_consensus/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering_consensus/target_bcl11b_te/vsearch_clusters
jobid: 4
wildcards: name=te_pilot, target=target_bcl11b_te
threads: 10

vsearch v2.15.1_linux_x86_64, 125.8GB RAM, 64 cores
https://github.com/torognes/vsearch

Reading file te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta 100%
1004613 nt in 17914 seqs, min 51, max 60, avg 56
maxseqlength 60: 1 sequence discarded.
Sorting by length 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 5796 Size min 1, max 19, avg 3.1
Singletons: 1756, 9.8% of seqs, 30.3% of clusters
Multiple alignments 100%
[Wed Dec 23 21:31:09 2020]
Finished job 4.
9 of 14 steps (64%) done
Select jobs to execute...

[Wed Dec 23 21:31:09 2020]
rule reformat_consensus_clusters:
input: te_pilot/clustering_consensus/target_bcl11b_te/clusters_consensus.fasta
output: te_pilot/fasta/target_bcl11b_te_final.fasta
jobid: 3
wildcards: name=te_pilot, target=target_bcl11b_te

[Wed Dec 23 21:31:09 2020]
Finished job 3.
10 of 14 steps (71%) done
Select jobs to execute...

[Wed Dec 23 21:31:09 2020]
rule map_consensus:
input: te_pilot/fasta/target_bcl11b_te_final.fasta, data/te_pilot/bcl11b_te_reference.fasta
output: te_pilot/align/target_bcl11b_te_final.bam, te_pilot/align/target_bcl11b_te_final.bam.bai
jobid: 2
wildcards: name=te_pilot, target=target_bcl11b_te, type=final
threads: 3

[M::mm_idx_gen::0.0010.00] collected minimizers
[M::mm_idx_gen::0.002
0.00] sorted minimizers
[M::main::0.0020.00] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.002
0.00] mid_occ = 2
[M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.0021.73] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225
[M::worker_pipeline::2.113
2.86] mapped 5796 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax map-ont -k 13 -t 3 data/te_pilot/bcl11b_te_reference.fasta te_pilot/fasta/target_bcl11b_te_final.fasta
[M::main] Real time: 2.114 sec; CPU: 6.044 sec; Peak RSS: 0.033 GB
[bam_sort_core] merging from 0 files and 5 in-memory blocks...
[Wed Dec 23 21:31:11 2020]
Finished job 2.
11 of 14 steps (79%) done
[M::worker_pipeline::6.481*2.88] mapped 18043 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax map-ont -k 13 -t 3 data/te_pilot/bcl11b_te_reference.fasta te_pilot/fasta/target_bcl11b_te_consensus.fasta
[M::main] Real time: 6.485 sec; CPU: 18.676 sec; Peak RSS: 0.069 GB
[bam_sort_core] merging from 0 files and 5 in-memory blocks...
[Wed Dec 23 21:31:13 2020]
Finished job 13.
12 of 14 steps (86%) done
Select jobs to execute...

[Wed Dec 23 21:31:13 2020]
rule seqkit_bam_acc_tsv:
input: te_pilot/align/target_bcl11b_te_consensus.bam
output: te_pilot/stats/target_bcl11b_te_consensus_size_vs_acc.tsv
jobid: 12
wildcards: name=te_pilot, target=target_bcl11b_te, stage=consensus

[Wed Dec 23 21:31:14 2020]
Finished job 12.
13 of 14 steps (93%) done
Select jobs to execute...

[Wed Dec 23 21:31:14 2020]
localrule reads:
input: te_pilot/targets.bed, te_pilot/align/target_bcl11b_te_final.bam.bai, te_pilot/stats/target_bcl11b_te_vsearch_cluster_stats.tsv, te_pilot/stats/target_bcl11b_te_consensus_size_vs_acc.tsv
jobid: 0

[Wed Dec 23 21:31:14 2020]
Finished job 0.
14 of 14 steps (100%) done

@SemiQuant
Copy link
Contributor

I haven't really familiarized myself with the code yet, but I think some is due to singleton clusters and the rest may be due to the balance_strands flag, which I'm not really sure what it's doing or why but it needs the same forward and reverse reads for each cluster? (or something like that.. which would make sense if its the same read and the UMI on each end, but I'm not sure). Anyway, try adding 'balance_strands: False' to your config file and see if that does anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants