Usage: SeqFilter <fa/fq> SeqFilter [filter] <fa/fq> --out <fa/fq> cat <fa/fq> | SeqFilter [filter] --out - | ... Options: -o|--out <FILE/-> [OFF] -c|--stdout Output to file. -c for STDOUT. --stats <FILE> [-] Print stats to file. Default STDOUT or STDERR if STDOUT is in use. --ids <FILE/-/IDLIST> File of sequence IDs or literal list of IDs to be reported. Reads comma-, whitespace and newline separated lists. Leading '>' or '@' are ignored. SeqFilter fasta.fa --ids "seq35,seq49" # list SeqFilter fasta.fa --ids ids.list # file --ids-pattern <FILE/-/PATTERNLIST> Match a perl PATTERN or a link to a file containing multiple PATTERNs, one per line, against sequence ids. Keep matching sequences. SeqFilter seqs.fq --ids-patt 'comp2_c13_seq.*|comp2_c88_seq.*' Extended usage: add capture groups to perl P(A)TT(ER)N using "()". Splits matched sequences to different output files based on capture. Use sprintf conversions (%s, %02d, ...) in --out to define a template for the output file names. # split seqs by library (LIB1, LIB2, LIB14) SeqFilter multilib.fq --ids-pattern '\w+(\d+)' --out multilib_%02d.fq # creates multilib_01.fq, multilib_02.fq, multilib_14.fq NOTE: Perl needs to open a filehandle to every split file, this can slow things down considerably if you want to split into more than 1000 different files with occurances of patterns randomly mixed in source file. --ids-exclude Reverse behaviour of --ids and --ids-pattern to excluding matched sequences, and keeping unmatched ones. --ids-rename <PATTERN> Provide a perl substitution pattern as string. The pattern is applied to every id. Use global "$COUNT" to access the output sequence counter, $PICARD to access the picard number. SeqFilter library_1.fq --ids-rename='s/.*\//sprintf("MYLIB_%05d\/%d", $COUNT, $PICARD)/e' # creates ids: # @MYLIB_00001/1 # @MYLIB_00002/1 ... --desc-replace --desc-append Remove (no arg)/replace/append description of sequences. -l|--min-length <INT> Minimum sequence length. -L|--max-length <INT> Maximum length for sequences to be retrieved. Default off. -A|--fasta -Q|--fastq <CHAR> Convert FASTQ/FASTA to FASTA/FASTQ, respectively. -w|--line-width [80 for FASTA] FASTA only. Set 0 for single line. Ignored for FASTQ. --rc|--rev-comp <FILE/-/LIST> File of sequence IDs or list of IDs to be transformed, no argument to transform all sequences. Formatting follows the same rules as --ids files. --careful [OFF] FASTQ only. Check every FASTQ record to have a valid format. '@/+' at the start of id lines, sequence and quality of identical length, phred within boundaries. Slows down the parsing. --lower-case --upper-case Convert output sequence to lower/upper case. --iupac-to-N Convert non [ATGCNatgcn] characters to N. --phred-offset [auto] FASTQ only. Specify Phred offset for quality scores. Default auto-detect. --phred-transform FASTQ only. Transform phreds from input offset to specified "--phred-transform" offset, usually 33 to 64 or wise versa. --phred-mask FASTQ only. At least two values separated by ",", e.g "0,10" to mask all Nucleotides with phred below 10 with an "N". Optional additional values control: minimum length of unmasked regions minimum length of masked regions bps to ignore at the ends of masked regions (shorten masked regions to their core) a ratio that determines whether to mask/unmasked terminal regions that are smaller than are minimum unmasked region NOTE: Requires "--phred-offset". --trim-window <INT1>,[<INT2>],[<INT3>] FASTQ only. Trim sequences to quality >= SOFT,HARD,SIZE in a sliding window, default 10. The sliding window allows to have positions below the SOFT cutoff provided the window mean is higher than SOFT. Qualities below HARD, default 0, will always terminate a stretch. It is made sure that a) positions with quality below cutoff only occur within the remaining sequence, not at its start/end and b) windows never overlap eachother. --trim-lcs <INT,INT,INT> FASTQ only. Three values separated by ",", e.g. "30,40,50" to grep all stretches of quality >= 30 and minimum length 50 from the sequences. Faster than "--trim-window" yet breaks sequences even on a single low quality position. NOTE: "--trim-lcs" and "--trim-window" can be combined, e.g. --trim-lcs 5,40,100 --trim-window 10 will generate sequences with qualities of at least 5 at every position and a window mean of 10. --substr <FILE/-/LIST> Pathname to a FILE containing information for subseq extraction/modification. The format is a tsv, by default lines of the format ID FROM TO are expected. Lines prepened by '#' are treated as comments and therefore ignored. If --substr-perl-style is set, the lines must start with the ID of the read, followed by the substr values OFFSET,LENGTH,REPLACESEQ,REPLACEQUAL. The parameter usage is than the same as for perl builtin "substr" function, meaning an OFFSET alone is sufficient, a positive value is set from the start of the sequence, a negative offset from the end, without LENGTH, the sequence is returned from OFFSET to its end. REPLACEMENTS are introduced at the OFFSET position, if LENGTH is 0, it is a simple insertion, else a part is deleted first and the REPLACEMENT is then inserted. Substring extraction is of course performed prior to any other trimming. To trim all reads use '*' instead of the read id. This command will be performed prior to any indiviual substr command. FROM TO: # extract sequence from pos 10 to pos 50 read1 10 50 OFFSET [LENGTH [REPLACEMENT]] # trim read1 head and tail by 10 read1 10 # extract from read2 250 nts starting at pos 15 read2 15 250 # replace 3 nt by an "N"" with qual "!" (for FASTQ) read3 3 1 N ! # trim from all reads 5nts at the beginning and the end. * 5 * -5 --substr-perl-style By default, substr information are read according to the format FROM TO. Set this flag to switch the behaviour to perl substr() like style of "OFFSET [LENGTH [REPLACEMENT]]" -N|--Nx <INT,INT...> Report Nx value (N50, N90...). Default "50,90". -C|--base-composition <BASE(S),BASE(S),BASE(S),...> Report relative amount of given bases. Takes a "," separated list, each element of the list can be one or more bases (cummulative). --base-composition=GC,N # combined GC and N content -H|--histogram Plot distribution of bases by length as ASCII plot. Uses linear scale for data sets with difference in order of magnitude < 2, log scale otherwise. --[no]-smart-labels Toggle shortening filepaths to shortest unique labels. -p|--progress Display progress bars (eq. '--verbose 2') -q|--quiet Omit all verbose messages. The same as --verbose=0, Superceeds --verbose settings. --verbose <INT> Toggle verbose level, default 2, which outputs statistics and progress. Set 1 for statistics only or 0 for no verbose output. -h|--help Display this help -V|--version Display current version