Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion for [--rc_identity_threshold] #2

Open
edgardomortiz opened this issue Jun 16, 2020 · 3 comments
Open

Suggestion for [--rc_identity_threshold] #2

edgardomortiz opened this issue Jun 16, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@edgardomortiz
Copy link

Hello Kristoffer,

I ran NGSpeciesID with the following command:

for file in *.fastq; do
	NGSpeciesID --t 4 --ont --fastq $file \
	--consensus --medaka \
	--abundance_ratio 0.005 \
	--rc_identity_threshold 0.85 \
	--outfolder ../04_ngspeciesid/${file//.fastq/_clusters_medaka} \
	"&>"../04_ngspeciesid/${file//.fastq/.ngspeciesid.medaka.log.txt}
done

And everything worked nicely but in some cases it creates two separate consensus sequences in opposite directions that should have been joined by --rc_identity_threshold 0.85. For example in the picture, the overlapping region of both consensuses has a similarity of 98.7%

2020-06-16 01 49 55 pm

Would it be too hard to implement reverse complement during the initial clustering steps to avoid this kind of result? This would be very useful for the case of fragmented amplicons (and it would save time in general, even with full length amplicons, by avoiding the separate spoa centering of the forward and reverse clusters just to be merged afterwards by --rc_identity_threshold).

@ksahlin
Copy link
Owner

ksahlin commented Jun 17, 2020

Implementing this during initial clustering is quite an undertaking. However, would it be sufficient for your purpose to do semi-global alignment in the reverse complement step and check the identity based on this semi-global alignment? This should have the same outcome any is much easier to implement.

@edgardomortiz
Copy link
Author

Yes, that would be awesome!

@ksahlin ksahlin added the enhancement New feature or request label Mar 8, 2021
@2tony2
Copy link

2tony2 commented Jun 3, 2021

To comment further on this, it is actually nice to have an option to fuse clusters which are highly identical when comparing seq1 sense vs seq2 sense rather then only comparing seq1 and reverse complement seq2 in the detect_reverse_complement script in the consensus.py module.

I was getting separate clusters because the primer regions might be ambiguous due to low coverage and low sequencing quality/higher degree of mutations. Thus, I would get multiple clusters which were highly similar when comparing sense vs sense (95%+ identity, only differing in primer regions). This is something which is easy to add (I did it a bit ugly here but it works):

def detect_reverse_complements(centers, rc_identity_threshold):
    filtered_centers = []
    already_removed = set()
    for i, (nr_reads_in_cl, c_id, seq, reads_path) in enumerate(centers):
        if type(reads_path) != list:
            all_reads = [reads_path]
        else:
            all_reads = reads_path

        merged_cluster_id = c_id
        merged_nr_reads = nr_reads_in_cl
        if c_id in already_removed:
            print("has already been merged, skipping")
            continue

        elif i == len(centers) - 1: # last sequence and it is not in already removed
            filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads ] )

        else:
            for j, (nr_reads_in_cl2, c_id2, seq2, reads_path) in enumerate(centers[i+1:]):
                #check for reverse complement matches
                seq2_rc = reverse_complement(seq2)
                seq_aln_rc, seq2_rc_aln, cigar_string_rc, cigar_tuples_rc, alignment_score_rc = parasail_alignment(seq, seq2_rc)
                seq_aln, seq2_aln, cigar_string, cigar_tuples, alignment_score = parasail_alignment(seq, seq2)

                #Determine alignment identity between complement seq1 and reverse complement seq2
                nr_mismatching_pos_rc = len([1 for n1, n2 in zip(seq_aln_rc, seq2_rc_aln) if n1 != n2])
                total_pos_rc = len(seq_aln_rc)
                aln_identity_rc =  (total_pos_rc - nr_mismatching_pos_rc) / float(total_pos_rc)
                
                #Determine alignment identity between complement seq1 and complement seq2
                nr_mismatching_pos = len([1 for n1, n2 in zip(seq_aln, seq2_aln) if n1 != n2])
                total_pos = len(seq_aln)
                aln_identity =  (total_pos - nr_mismatching_pos) / float(total_pos)

                print("Complement and Reverse Complement Identity seq 1 and 2: " + str(aln_identity_rc))
                print("Complement and Complement Identity seq 1 and 2: " + str(aln_identity))

                    if aln_identity_rc >= rc_identity_threshold:
                    print("Detected alignment identidy above threshold for complement vs reverse complement. Keeping center with the most read support and adding rc reads to supporting reads.")
                    merged_nr_reads += nr_reads_in_cl2
                    already_removed.add(c_id2)

                    if type(reads_path) != list:
                        all_reads.append(reads_path)
                    else:
                        for rp in reads_path:
                            all_reads.append(rp)
                
                if aln_identity >= rc_identity_threshold:
                    print("Detected alignment identidy above threshold for complement vs complement. Keeping center with the most read support and adding rc reads to supporting reads.")
                    merged_nr_reads += nr_reads_in_cl2
                    already_removed.add(c_id2)

                    if type(reads_path) != list:
                        all_reads.append(reads_path)
                    else:
                        for rp in reads_path:
                            all_reads.append(rp)

            filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads] )

    print(len(filtered_centers), "consensus formed.")
    return filtered_centers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants