BasicBWT API
Matt Holt edited this page Jul 29, 2016
This page contains part of the pydoc output generated for the BasicBWT class.
CLASSES
__builtin__.object
BasicBWT
class BasicBWT(__builtin__.object)
| This class is the root class for ANY msbwt created by this code, regardless of whether it is compressed or not.
| Shared Functions:
| __init__
| constructIndexing
| countOccurrencesOfSeq
| findIndicesOfStr
| getSequenceDollarID
| recoverString
|
| Override functions:
| loadMsbwt
| constructTotalCounts
| constructFMIndex
| getCharAtIndex
| getOccurrenceOfCharAtIndex
| getBWTRange
| getFullFMAtIndex
| iterInit
| iterNext
| iterNext_cython
|
| Methods defined here:
|
| __init__(...)
| Constructor
| Nothing special, use this for all at the start
|
| countOccurrencesOfSeq(...)
| This function counts the number of occurrences of the given sequence
| @param seq - the sequence to search for
| @param givenRange - the range to start from (if a partial search has already been run), default=whole range
| @return - an integer count of the number of times seq occurred in this BWT
|
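The counting described for countOccurrencesOfSeq can be pictured as an FM-index backward search. The sketch below is not the msbwt implementation: `build_msbwt`, `find_range`, and `count_occurrences` are hypothetical pure-Python helpers that build a toy multi-string BWT by naive suffix sorting and then narrow a row range one symbol at a time. It assumes the range is a half-open (start, end) pair, which the pydoc above does not state.

```python
def build_msbwt(reads):
    """Toy multi-string BWT built by naive suffix sorting (illustration only)."""
    entries = []
    for read_id, read in enumerate(sorted(reads)):
        text = read + "$"
        for i in range(len(text)):
            # Sort key: (suffix, read_id). Payload: the symbol preceding the
            # suffix, wrapping around to '$' for the suffix that starts the read.
            entries.append((text[i:], read_id, text[i - 1]))
    entries.sort()
    return "".join(prev for _, _, prev in entries)


def find_range(bwt, seq, given_range=None):
    """Return the half-open row range (lo, hi) of suffixes starting with seq."""
    lo, hi = given_range if given_range is not None else (0, len(bwt))
    for c in reversed(seq):
        # Number of symbols in the BWT that sort strictly before c.
        smaller = sum(1 for s in bwt if s < c)
        lo = smaller + bwt.count(c, 0, lo)
        hi = smaller + bwt.count(c, 0, hi)
        if lo >= hi:
            return (lo, lo)  # empty range: seq does not occur
    return (lo, hi)


def count_occurrences(bwt, seq, given_range=None):
    """Count occurrences of seq as the width of its row range."""
    lo, hi = find_range(bwt, seq, given_range)
    return hi - lo


bwt = build_msbwt(["ACGT", "ACGA", "CGTT"])
partial = find_range(bwt, "CG")               # search the tail "CG" first
print(count_occurrences(bwt, "A", partial))   # then extend it to "ACG"
```

Passing the range of an already-matched suffix as `given_range` avoids repeating the search for that suffix, which is the use case the `givenRange` parameter describes.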
| countPileup(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array. Automatically includes the reverse complement.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts
|
| countSeqMatches(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts
|
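The shape of the arrays returned by countSeqMatches and countPileup can be illustrated without a BWT. The following is a minimal sketch that counts k-mers by brute force over a list of reads; `count_kmer`, `count_seq_matches`, and `count_pileup` are hypothetical helpers, plain lists stand in for the numpy arrays, and adding the forward and reverse-complement counts together is an assumption about how the reverse complement is "included".

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; 'N' maps to itself."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def count_kmer(reads, kmer):
    """Brute-force count of (possibly overlapping) occurrences of kmer in reads."""
    return sum(
        1
        for read in reads
        for i in range(len(read) - len(kmer) + 1)
        if read.startswith(kmer, i)
    )


def count_seq_matches(reads, seq, kmer_size):
    """One forward-strand count per k-mer window: len(seq) - kmer_size + 1 entries."""
    return [
        count_kmer(reads, seq[i:i + kmer_size])
        for i in range(len(seq) - kmer_size + 1)
    ]


def count_pileup(reads, seq, kmer_size):
    """Like count_seq_matches, but adds the reverse-complement count (assumed)."""
    counts = []
    for i in range(len(seq) - kmer_size + 1):
        kmer = seq[i:i + kmer_size]
        counts.append(count_kmer(reads, kmer) + count_kmer(reads, reverse_complement(kmer)))
    return counts


reads = ["ACGT", "ACGA", "CGTT"]
print(count_seq_matches(reads, "ACGT", 2))
print(count_pileup(reads, "ACGT", 2))
```

In this sketch a k-mer that equals its own reverse complement (such as "CG") is counted twice by `count_pileup`; whether the library does the same is not stated in the pydoc.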
| countStrandedSeqMatches(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts, and the other choice also
|
| countStrandedSeqMatchesNoOther(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts
|
| findIndicesOfRegex(...)
| This function will search for a string and find the location of that string OR the last index less than it. It also
| will start its search within a given range instead of the whole structure. Note that having a short tail string can
| lead to fast exponential blowup of the solution space.
| @param seq - the sequence to search for with valid symbols [$, A, C, G, N, T, *, ?]
| $, A, C, G, N, T - exact match of specific symbol
| * - matches 0 or more of any non-$ symbols (may be different symbols)
| ? - matches exactly one of any non-$ symbol
| @param givenRange - the range to search for, whole range by default
| @return - a python list of ranges representing the start and end of the sequence in the bwt
|
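The wildcard rules listed for findIndicesOfRegex can be restated as an ordinary regular expression. This sketch only illustrates which strings a pattern accepts; it does not search a BWT or return ranges, and `msbwt_pattern_to_regex` is a hypothetical helper rather than part of the library.

```python
import re

NON_DOLLAR = "[ACGNT]"  # any single non-'$' symbol


def msbwt_pattern_to_regex(pattern):
    """Translate a [$, A, C, G, N, T, *, ?] pattern into a compiled Python regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(NON_DOLLAR + "*")   # zero or more non-'$' symbols
        elif ch == "?":
            parts.append(NON_DOLLAR)         # exactly one non-'$' symbol
        elif ch in "$ACGNT":
            parts.append(re.escape(ch))      # exact match of that symbol
        else:
            raise ValueError("invalid symbol in pattern: %r" % ch)
    return re.compile("".join(parts))


regex = msbwt_pattern_to_regex("AC?T*$")
print(bool(regex.fullmatch("ACGTTT$")))  # '?' consumes G, '*' consumes TT
print(bool(regex.fullmatch("AC$T$")))    # '?' may not consume '$'
```

Each `*` can stand for strings of any length, which is why a pattern with little fixed text after the wildcard leaves many candidate matches to explore.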
| findIndicesOfStr(...)
| This function will search for a string and find the location of that string OR the last index less than it. It also
| will start its search within a given range instead of the whole structure
| @param seq - the sequence to search for
| @param givenRange - the range to search for, whole range by default
| @return - a python range representing the start and end of the sequence in the bwt
|
| findKTOtherStranded(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @param isStranded - if True, it ONLY counts the forward strand (aka, exactly matches "seq")
| if False, it counts forward strand and reverse-complement strand and adds them together
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts
|
| findKmerThreshold(...)
| This function takes an input sequence "seq", counts the number of occurrences of every k-mer of size
| "kmerSize" in that sequence, and returns the counts in an array.
| @param seq - the seq to scan
| @param kmerSize - the size of the k-mer to count
| @param isStranded - if True, it ONLY counts the forward strand (aka, exactly matches "seq")
| if False, it counts forward strand and reverse-complement strand and adds them together
| @return - a numpy array of size (len(seq)-kmerSize+1) containing the counts
|
| findKmerWithError(...)
| This function takes a k-mer input and finds all k-mers with an edit distance of 1 that occur at least
| "minThresh" times in the dataset. Indels at the beginning/end of the 'seq' are not considered.
| @param seq - the k-mer sequence we want to match
| @param minThresh - the minimum number of times any inexact matching k-mers must occur to be returned
| @return - a list of ranges AND the change made to the k-mer to get that range stored in tuples
| tuple: (start range, end range, k-mer)
| start range - the start index in the bwt
| end range - the end index in the bwt
| k-mer - the k-mer associated with this range
|
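To make "edit distance of 1" with no indels at the beginning or end concrete, the sketch below enumerates the candidate k-mers and keeps those seen at least `min_thresh` times. It is an illustration under several assumptions: `edit_distance_one_kmers` and `find_kmers_with_error` are hypothetical names, candidates are drawn from A, C, G, T only, counts come from a brute-force scan of a read list, and the result holds (k-mer, count) pairs instead of the BWT ranges the real function returns.

```python
def edit_distance_one_kmers(seq, alphabet="ACGT"):
    """All strings one edit away from seq, skipping indels at either end."""
    variants = set()
    n = len(seq)
    # Substitutions are allowed at every position.
    for i in range(n):
        for c in alphabet:
            if c != seq[i]:
                variants.add(seq[:i] + c + seq[i + 1:])
    # Deletions of interior symbols only (first and last symbol are kept).
    for i in range(1, n - 1):
        variants.add(seq[:i] + seq[i + 1:])
    # Insertions strictly between two existing symbols.
    for i in range(1, n):
        for c in alphabet:
            variants.add(seq[:i] + c + seq[i:])
    variants.discard(seq)
    return variants


def count_kmer(reads, kmer):
    """Brute-force count of (possibly overlapping) occurrences of kmer in reads."""
    return sum(
        1
        for read in reads
        for i in range(len(read) - len(kmer) + 1)
        if read.startswith(kmer, i)
    )


def find_kmers_with_error(reads, seq, min_thresh):
    """Sorted (k-mer, count) pairs for inexact matches seen at least min_thresh times."""
    hits = []
    for kmer in sorted(edit_distance_one_kmers(seq)):
        count = count_kmer(reads, kmer)
        if count >= min_thresh:
            hits.append((kmer, count))
    return hits


print(find_kmers_with_error(["ACGT", "ACGA", "CGTT"], "ACC", 2))
```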
| findKmerWithErrors(...)
| This function takes a k-mer input and finds all k-mers within an edit distance of "editDistance" that occur at least
| "minThresh" times in the dataset. Indels at the beginning/end of the 'seq' are not considered.
| @param seq - the k-mer sequence we want to match
| @param editDistance - the maximum edit distance to match
| @param minThresh - the minimum number of times any inexact matching k-mers must occur to be returned
| @return - a list of ranges AND the change made to the k-mer to get that range stored in tuples
| tuple: (start range, end range, k-mer)
| start range - the start index in the bwt
| end range - the end index in the bwt
| k-mer - the k-mer associated with this range
|
| findPatternWithError(...)
| This function will search the BWT for strings which match the given sequence allowing for one error.
| In this function, "seq" must be close to the length of the read or else the ends of the reads will be counted
| as long insertions leading to no matches in the data.
| @param seq - the sequence to search for with valid symbols [A, C, G, N, T]
| @param bonusStr - in the case of a deletion in the search, this is an extra character that must match at the front
| of seq, aka it must match (bonusStr+seq) with one symbol deleted
| @return - a python list of ranges representing the start and end of the sequence in the bwt, these ranges will be
| in the '$' indices, so they will correspond to a specific read
| NOTE: these results may overlap, user expected to check for overlaps if important
|
| findReadsMatchingSeq(...)
| REQUIRES LCP
| This function takes a sequence and finds all strings of length "strLen" which exactly match the sequence
| @param seq - the sequence we want to match, assumed to be buffered on both ends with 'N' symbols
| @param strLen - the length of the strings we are trying to extract
| @return - a list of dollar IDs corresponding to strings that exactly match the seq somewhere
|
| findStrWithError(...)
| This function will search the BWT for strings which match the given sequence allowing for one error.
| In this function, "seq" must be close to the length of the read or else the ends of the reads will be counted
| as long insertions leading to no matches in the data.
| @param seq - the sequence to search for with valid symbols [A, C, G, N, T], NOTE: we assume the string is implicitly
| flanked by '$' so do NOT pass the '$' in the string or no result will return
| @param bonusStr - in the case of a deletion in the search, this is an extra character that must match at the front
| of seq, aka it must match (bonusStr+seq) with one symbol deleted
| @return - a python list of ranges representing the start and end of the sequence in the bwt, these ranges will be
| in the '$' indices, so they will correspond to a specific read
| NOTE: these results may overlap, user expected to check for overlaps if important
|
| getBinBits(...)
| @return - the number of bits in a bin
|
| getCharAtIndex(...)
| dummy function, shouldn't be called
|
| getOccurrenceOfCharAtIndex(...)
| dummy function, shouldn't be called
|
| getSequenceDollarID(...)
| This will take a given index and work backwards until it encounters a '$' indicating which dollar ID is
| associated with this read
| @param strIndex - the index of the character to start with
| @return - an integer indicating the dollar ID of the string the given character belongs to
|
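The backward walk described for getSequenceDollarID can be sketched with the LF-mapping of a toy BWT. Everything here is illustrative: `build_msbwt`, `lf_step`, `get_sequence_dollar_id`, and `recover_string` are hypothetical helpers, the BWT is built by naive suffix sorting over the sorted reads, and the sketch assumes a dollar ID equals the rank of the read in that sorted order.

```python
def build_msbwt(reads):
    """Toy multi-string BWT built by naive suffix sorting (illustration only)."""
    entries = []
    for read_id, read in enumerate(sorted(reads)):
        text = read + "$"
        for i in range(len(text)):
            # Sort by (suffix, read_id); keep the symbol preceding the suffix.
            entries.append((text[i:], read_id, text[i - 1]))
    entries.sort()
    return "".join(prev for _, _, prev in entries)


def lf_step(bwt, index):
    """Move from a row to the row of the suffix that is one symbol longer."""
    c = bwt[index]
    smaller = sum(1 for s in bwt if s < c)
    return smaller + bwt.count(c, 0, index)


def get_sequence_dollar_id(bwt, index):
    """Walk backwards until the preceding symbol is '$', then rank that '$'."""
    while bwt[index] != "$":
        index = lf_step(bwt, index)
    return bwt.count("$", 0, index)


def recover_string(bwt, dollar_id):
    """Rebuild a read by walking backwards from its '$' row (no '$' in the result)."""
    index, symbols = dollar_id, []
    while bwt[index] != "$":
        symbols.append(bwt[index])
        index = lf_step(bwt, index)
    return "".join(reversed(symbols))


bwt = build_msbwt(["CGTT", "ACGT", "ACGA"])
print([recover_string(bwt, d) for d in range(3)])
print(get_sequence_dollar_id(bwt, 10))
```

In this toy layout the first rows of the BWT are the '$' suffixes, one per read, so row `d` is the natural starting point for recovering read `d`.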
| getSymbolCount(...)
| @param symbol - this is an integer from [0, 6)
| @return - the total count for the passed in symbol
|
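getSymbolCount takes an integer in [0, 6) rather than a character. A plausible reading, given the six valid symbols [$, A, C, G, N, T] listed elsewhere on this page, is that the integers index the symbols in that order; this ordering is an assumption, not something the pydoc states. The sketch below uses hypothetical helpers on a plain BWT string to show the idea.

```python
SYMBOLS = "$ACGNT"  # assumed mapping: 0='$', 1='A', 2='C', 3='G', 4='N', 5='T'


def symbol_count(bwt, symbol):
    """Total count of the symbol with the given integer code in [0, 6)."""
    if not 0 <= symbol < len(SYMBOLS):
        raise ValueError("symbol must be an integer in [0, 6)")
    return bwt.count(SYMBOLS[symbol])


def total_size(bwt):
    """Total number of symbols, which equals the sum of the per-symbol counts."""
    return sum(symbol_count(bwt, s) for s in range(len(SYMBOLS)))


bwt = "ATTG$$AA$CCCGTG"
print([symbol_count(bwt, s) for s in range(6)], total_size(bwt))
```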
| getTotalSize(...)
| @return - the total number of symbols in the BWT
|
| iterInit(...)
| this function must be called to reset the iterator to the beginning, used for both normal and
| compressed data structures since it's so simple
|
| iterNext(...)
|
| recoverString(...)
| This will return the string that starts at the given index
| @param strIndex - the index of the string we want to recover
| @return - string that we found starting at the specified '$' index
|
Holt, J., & McMillan, L. (2014). Merging of multi-string BWTs with applications. Bioinformatics, btu584.
Holt, J., & McMillan, L. (2014, September). Constructing burrows-wheeler transforms of large string collections via merging. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 464-471). ACM.
James Holt - [email protected]