Skip to content

Commit

Permalink
[FEATURE] Get Tandem Duplications from CIGAR
Browse files Browse the repository at this point in the history
Signed-off-by: Lydia Buntrock <[email protected]>
  • Loading branch information
Irallia committed Oct 25, 2021
1 parent 272ea6a commit ad5ef63
Show file tree
Hide file tree
Showing 8 changed files with 292 additions and 32 deletions.
1 change: 1 addition & 0 deletions include/iGenVar.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ struct cmd_arguments
/* -c */ clustering_methods clustering_method{hierarchical_clustering}; // default: hierarchical clustering
/* -r */ refinement_methods refinement_method{no_refinement}; // default: no refinement
// SV specifications:
/* -e TODO (irallia 19.08.21): duplication_errors */
/* -k */ uint64_t min_var_length = 30;
/* -l */ uint64_t max_var_length = 100000;
/* -m */ uint64_t max_tol_inserted_length = 50;
Expand Down
90 changes: 90 additions & 0 deletions include/modules/sv_detection_methods/analyze_cigar_method.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,96 @@

#include "structures/junction.hpp" // for class Junction

/*! \brief This function checks if the inserted bases are tandem duplicated.
*
* \param[in] config - configuration for a semi-gobal alignment
* \param[in] min_length - minimum length of variants to detect (default 30 bp,
* \param[in] sequence - suffix or prefix sequence
* \param[in] inserted_bases - the inserted bases of the possible duplication
* \param[in] is_suffix - true: suffix case, false: prefix case
* param[out] match_score - if we found a matching duplication, this value represents the maximal
* amount of matches with the reference and thus the length of the existing
* duplicated part on the reference. If there is no duplication, this value
* is 0.
* param[out] length_of_single_dupl_sequence - greatest common divisor of length of inserted sequence and length of
* maching part -> length of a single duplicated sequence
* \returns std::tie(match_score, length_of_single_dupl_sequence) - a tuple of the resulting values
*
* \details In this function the inserted bases are recursively aligned segment by segment until it has been proven that
* it is a real duplication.
* For simplicity, we only consider the suffix case here, since the prefix case works the same way:
*
* suffix_sequence AAAACCGCGTAGCGGGGCGGG
* ||||||||||
* inserted_bases GCGGGGCGGGGCGGG -> unmatched_inserted_bases: GCGGG
* -> match_score = 10
* -> length_of_single_dupl_sequence = gcd(15, 10) = 5
*
* suffix_sequence AAAACCGCGTAGCGGGGCGGG
* |||||
* unmatched_inserted_bases GCGGG
*
* -> match_score = 5, length_of_single_dupl_sequence = gcd(15, 5) = 5
*/
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config,
int32_t const min_length,
std::span<const seqan3::dna5> & sequence,
std::span<const seqan3::dna5> & inserted_bases,
bool is_suffix);

/*! \brief This function checks if the inserted bases are tandem duplicated.
*
* \param[in] query_sequence - SEQ field of the SAM/BAM file
* \param[in] length - length of inserted part, given by the CIGAR
* \param[in] pos_ref - position of the inserted part in the ref (current position)
* \param[in] pos_read - position of the inserted part in the read (current position)
* \param[in] inserted_bases - the inserted bases of the possible duplication
* \param[in] min_length - minimum length of variants to detect (default 30 bp,
* expected to be non-negative)
* \param[in, out] pos_start_dup_seq - start position of the duplicated seq (excluding itself)
* \param[in, out] pos_end_dup_seq - end position of the duplicated seq (including itself)
* \param[in, out] tandem_dup_count - the number of tandem copies of the inserted sequence
* \returns duplicated_bases - the duplicated bases of the duplication
*
* \details If the inserted bases include one or more copies of a duplicated sequence, which is suffix or prefix of
* another copy, we have a Tandem Duplication. To check this, we have to do a semi-global alignment.
*
* Case 1: The duplication (insertion) comes after the matched sequence. Thus we need to check if the inserted
* sequence matches (partly) the suffix sequence. The length of the matching part yields to the amount
* of copies, thus we can calculate the tandem_dup_count and the inserted_bases.
*
* ref AAAACCGCGTAGCGGG----------TACGTAACGGTACG
* |||||||||||||| |||||||| -> inserted sequence: GCGGGGCGGG
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* suffix_sequence AAAACCGCGTAGCGGG -> free_end_gaps_sequence1_leading{true},
* ||||| free_end_gaps_sequence1_trailing{false}
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{false},
* free_end_gaps_sequence2_trailing{true}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
*
* Case 2: The duplication (insertion) comes before the matched sequence.
* ref AAAACCGCGTA----------GCGGGTACGTAACGGTACG
* ||||||||| ||||||||||||| -> inserted sequence: GCGGGGCGGG
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* prefix_sequence GCGGGTACGTAACGGTACG -> free_end_gaps_sequence1_leading{false},
* ||||| free_end_gaps_sequence1_trailing{true}
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{true},
* free_end_gaps_sequence2_trailing{false}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
* \see For some complex examples see detection_test.cpp.
*/
std::span<seqan3::dna5 const> detect_tandem_duplication(seqan3::dna5_vector const & query_sequence,
int32_t length,
int32_t pos_ref,
int32_t pos_read,
std::span<seqan3::dna5 const> & inserted_bases,
int32_t const min_length,
int32_t & pos_start_dup_seq,
int32_t & pos_end_dup_seq,
size_t & tandem_dup_count);

/*! \brief This function steps through the CIGAR string and stores junctions with their position in reference and read.
*
* \param[in] read_name - QNAME field of the SAM/BAM file
Expand Down
1 change: 0 additions & 1 deletion src/modules/clustering/hierarchical_clustering_method.cpp
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
#include "iGenVar.hpp" // for global variable gVerbose
#include "modules/clustering/hierarchical_clustering_method.hpp"

#include <limits> // for infinity
#include <random> // for random_device

#include <seqan3/core/debug_stream.hpp>
Expand Down
Loading

0 comments on commit ad5ef63

Please sign in to comment.