Use sequence to query meryl db #47

Open
Adamtaranto opened this issue Jun 4, 2024 · 1 comment

Comments

@Adamtaranto

I want to build a database of all k-mers and their counts for a reference genome using meryl count. Then, for thousands of small (~1-5 kbp) sequences, I want to extract all k-mers and look up their counts in the genome k-mer database.

Is there a way to pass a short sequence as an argument to meryl and query its k-mers against an existing database?

Running meryl count on every short sequence, and cleaning up the .meryl files between queries, seems inefficient.

@Adamtaranto changed the title from "Get kmer counts from existing meryl db for kmers from a small query sequence" to "Use sequence to query meryl db" on Jun 4, 2024
@brianwalenz
Member

That sounds like a job for meryl-lookup:

usage: meryl-lookup <report-type> \
         -sequence <input1.fasta> [<input2.fasta>] \
         -output   <output1>      [<output2>] \
         -mers     <input1.meryl> [<input2.meryl>] [...] [-estimate] \
         -labels   <input1name>   [<input2name>]   [...]

  Compare kmers in input sequences against kmers in input meryl databases.

  Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
  compressed with gzip, xz, or bzip2.

  To compute and report only estimated memory usage, add option '-estimate'.

  Report types:
    Run `meryl-lookup <report-type> -help` for details on each method.


  -bed:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.
     Each kmer is reported in a separate bed record.

  -bed-runs:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.
     Overlapping kmers are combined into a single bed record.

  -wig-count:
     Generate a WIGGLE format file showing the multiplicity of the
     kmer starting at each position in the sequence, if it exists in
     an input kmer database.

  -wig-depth:
     Generate a WIGGLE format file showing the number of kmers in
     any input database that cover each position in the sequence.

  -existence:
     Generate a tab-delimited line for each input sequence with the
     number of kmers in the sequence, in the database and common to both.

  -include:
  -exclude:
     Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
     corresponding output file if the sequence has at least one kmer
     present (include) or no kmers present (exclude) in 'input1.meryl'.
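
For the use case in the question, the usage text suggests that all of the short sequences could go into a single multi-record FASTA file and be looked up in one run, since the reports are produced per input sequence. Below is a minimal Python sketch that assembles such a command line from the flags shown in the usage. The file names (`queries.fasta`, `queries.wig`, `genome.meryl`) and the helper names `build_lookup_command` and `run_lookup` are hypothetical, and the choice of `-wig-count` as the report that returns per-position counts is an assumption based only on the description above.

```python
import shlex
import subprocess

# Report types listed in the meryl-lookup usage text quoted above.
REPORT_TYPES = {
    "-bed", "-bed-runs", "-wig-count", "-wig-depth",
    "-existence", "-include", "-exclude",
}


def build_lookup_command(report_type, sequence, output, mers):
    """Assemble a meryl-lookup argument list following the usage text.

    All paths are placeholders supplied by the caller; nothing is run here.
    """
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type}")
    return [
        "meryl-lookup", report_type,
        "-sequence", str(sequence),  # multi-FASTA holding all short queries
        "-output", str(output),      # where the report is written
        "-mers", str(mers),          # existing database built with meryl count
    ]


def run_lookup(cmd):
    """Run the assembled command; requires meryl-lookup on PATH."""
    subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
cmd = build_lookup_command(
    "-wig-count", "queries.fasta", "queries.wig", "genome.meryl"
)
print(shlex.join(cmd))
```

Calling `run_lookup(cmd)` would then execute the lookup once for the whole query file, with no intermediate .meryl databases to clean up. Running `meryl-lookup -wig-count -help` first, as the usage text recommends, is the way to confirm the exact output layout.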
