Custom DB and custom taxonomy (GTDB or similar) #884

Open
osvatic opened this issue Oct 18, 2024 · 14 comments

Comments

@osvatic

osvatic commented Oct 18, 2024

Is there a way to create a completely new database and non-NCBI taxonomy for that database? I am primarily interested in using something like GTDB, which has a genome download and taxonomy tsv file available.

I understand that I can add the contigs/genomes to a custom database using "kraken2-build --add-to-library", but creating a new taxonomy doesn't seem straightforward.

@ch4rr0
Collaborator

ch4rr0 commented Oct 31, 2024

Hello,

We are in the process of adding GTDB support to the k2 wrapper (a Python rewrite of the kraken2 scripts). Stay tuned.

@lingrongjin

I'm also interested in building a custom database with GTDB taxonomy! When can we expect to have this function available?

@ch4rr0
Collaborator

ch4rr0 commented Nov 5, 2024

I pushed a change to the k2 wrapper today.

Here's how you can build a GTDB database with the changes:

k2 build --special gtdb --gtdb-files gtdb_genomes_reps.tar.gz --threads 6 --db gtdb_reps

The --gtdb-files parameter takes a list of filenames or a pattern that matches the files needed
to build the GTDB database.

You can find the list of file names here https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/
or https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/

The command line will also offer a list of files if you specify the wrong file name:

k2 build --special --gtdb-files foo.tar.gz --db gtdb_reps
[ERROR - 2024-11-05 17:57:48,567]: Could not find any files matching foo.tar.gz
[ERROR - 2024-11-05 17:57:48,567]: Here are a list of candidates:
bac120_msa_marker_genes_all.tar.gz
ar53_msa_marker_genes_all.tar.gz
ssu_all.fna.gz
bac120_marker_genes_all.tar.gz
ar53_marker_genes_all.tar.gz
bac120_ssu_reps.fna.gz
ar53_msa_marker_genes_reps.tar.gz
gtdb_proteins_aa_reps.tar.gz
bac120_msa_marker_genes_reps.tar.gz
bac120_msa_reps.faa.gz
bac120_marker_genes_reps.tar.gz
ar53_msa_reps.faa.gz
ar53_marker_genes_reps.tar.gz
gtdb_proteins_nt_reps.tar.gz
gtdb_genomes_reps.tar.gz
ar53_ssu_reps.fna.gz

We welcome your help in testing that this feature works as expected.
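
As a rough illustration of how a pattern could select files from the candidate list above, here is a minimal Python sketch. It uses shell-style matching from the standard library as a stand-in; the thread does not say how k2 actually matches `--gtdb-files` patterns, and `select_files` is a hypothetical helper, not part of k2.

```python
import fnmatch

# Candidate file names copied from the k2 error listing above.
CANDIDATES = [
    "bac120_msa_marker_genes_all.tar.gz",
    "ar53_msa_marker_genes_all.tar.gz",
    "ssu_all.fna.gz",
    "bac120_marker_genes_all.tar.gz",
    "ar53_marker_genes_all.tar.gz",
    "bac120_ssu_reps.fna.gz",
    "ar53_msa_marker_genes_reps.tar.gz",
    "gtdb_proteins_aa_reps.tar.gz",
    "bac120_msa_marker_genes_reps.tar.gz",
    "bac120_msa_reps.faa.gz",
    "bac120_marker_genes_reps.tar.gz",
    "ar53_msa_reps.faa.gz",
    "ar53_marker_genes_reps.tar.gz",
    "gtdb_proteins_nt_reps.tar.gz",
    "gtdb_genomes_reps.tar.gz",
    "ar53_ssu_reps.fna.gz",
]


def select_files(pattern, names=CANDIDATES):
    """Return names matching a shell-style pattern (assumed semantics, not k2's own code)."""
    return fnmatch.filter(names, pattern)


# An exact file name matches only itself; a wildcard can match several files.
genomes = select_files("gtdb_genomes_reps.tar.gz")
ssu_reps = select_files("*_ssu_reps.fna.gz")
print(genomes)
print(ssu_reps)
```

Checking a pattern against the published file list this way before starting a long build may save a failed run, but the behaviour of k2 itself should be confirmed from its error output.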

@YiAngBeao

YiAngBeao commented Nov 13, 2024

Traceback (most recent call last):
  File "10software/mambaforge/envs/kraken2/bin/k2.py", line 3582, in <module>
    k2_main()
    ~~~~~~~^^
  File "10software/mambaforge/envs/kraken2/bin/k2.py", line 3562, in k2_main
    build_gtdb_database(args)
    ~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/10software/mambaforge/envs/kraken2/bin/k2.py", line 1768, in build_gtdb_database
    for accession, filepath in result[0]:
        ^^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)

Hello, I tried to build a GTDB database with the k2 you provided, but it resulted in the error above.
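
For readers unfamiliar with this error class, the following minimal Python sketch reproduces the same kind of `ValueError`. It is not the k2 code: the thread does not show what `result[0]` contained, so the three-element row and the `unpack_pairs` helper below are hypothetical.

```python
# Hypothetical data: each row has three fields, but the loop expects two.
rows = [("ACC_1", "path/to/genome_1.fna", "unexpected_extra_field")]


def unpack_pairs(items):
    """Unpack every item into exactly two names, like the loop in the traceback."""
    return [(accession, filepath) for accession, filepath in items]


try:
    unpack_pairs(rows)
    error_message = None
except ValueError as exc:
    # The message begins with "too many values to unpack".
    error_message = str(exc)

print(error_message)
```

The error means that at least one element being iterated over held more than two values, so the fix has to come from the code that produces those elements.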

@ch4rr0
Collaborator

ch4rr0 commented Nov 15, 2024

I pushed a fix for this issue a few days ago and have also pushed a fix for potential crashes while masking the genomes. Can you try pulling these changes and trying again?

@phiweger

Hi, is there any update on this issue? Thanks :)

@osvatic
Author

osvatic commented Dec 13, 2024

I was able to run the command with no errors after the push.

@ch4rr0
Collaborator

ch4rr0 commented Dec 13, 2024

We will soon be publishing an index for the GTDB representative genomes. Stay tuned.

@replikation

That would be really great, thank you.

@ch4rr0
Collaborator

ch4rr0 commented Dec 16, 2024

The database is now available, see: https://benlangmead.github.io/aws-indexes/k2

@replikation

thank you

@lingrongjin

Has anyone tried k2 classify with the pre-built GTDB database? I tried to run it on one sample, but it has been stuck at the "loading database information" step for about 16 hours now. I don't know what might have gone wrong because no error is reported. I was able to run kraken2 on the same sample with the standard database.
Here is the command I used:
./k2 classify --db k2_gtdb_v220 \
  --threads 16 \
  --use-names \
  --report test.report \
  --use-mpa-style \
  --output test.standard \
  --paired test_R1_.fastq.gz test_R2_.fastq.gz

@ch4rr0
Collaborator

ch4rr0 commented Dec 26, 2024

I realized that too. Try adding the --memory-mapping option. That did the trick for me.
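
A sketch of what the adjusted invocation might look like, written as a small Python helper that only assembles and prints the command (it does not run k2). The file names and options are taken from the command posted above; `classify_command` is a hypothetical helper, and placing `--memory-mapping` before `--paired` is an assumption about argument order rather than something the thread states.

```python
import shlex


def classify_command(db, read1, read2, threads=16, memory_mapping=True):
    """Assemble the k2 classify argument list from the thread, optionally with --memory-mapping."""
    cmd = [
        "./k2", "classify",
        "--db", db,
        "--threads", str(threads),
        "--use-names",
        "--report", "test.report",
        "--use-mpa-style",
        "--output", "test.standard",
    ]
    if memory_mapping:
        # Option suggested in the reply above.
        cmd.append("--memory-mapping")
    cmd += ["--paired", read1, read2]
    return cmd


cmd = classify_command("k2_gtdb_v220", "test_R1_.fastq.gz", "test_R2_.fastq.gz")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the paths have been checked.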

@dgl123dgl123

I get the following error when building the GTDB database. Is there a solution?

[INFO - 2024-12-30 10:21:37,302]: Creating sequence ID to taxonomy ID map
[INFO - 2024-12-30 10:21:54,761]: Created sequence ID to taxonomy ID map
[INFO - 2024-12-30 10:21:54,779]: Running: /data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/estimate_capacity -S 1111111111111111101010101010101 -k 35 -l 31 -p 32 -B 16384
[INFO - 2024-12-30 13:22:54,397]: Estimated hash table requirement: 496.88GB
[INFO - 2024-12-30 13:22:54,398]: Starting database build
[INFO - 2024-12-30 13:22:54,441]: Running: /data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/build_db -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy -m seqid2taxid.map -c 133379191954.28572 -S 1111111111111111101010101010101 -k 35 -l 31 -p 32 -B 16384 -b 512 -r 0
Traceback (most recent call last):
  File "/data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/k2", line 3677, in <module>
    k2_main()
  File "/data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/k2", line 3657, in k2_main
    build_gtdb_database(args)
  File "/data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/k2", line 1838, in build_gtdb_database
    build_kraken2_db(args)
  File "/data/penglab3-20T/dongguolin/GTDB-GENOMIC/kraken2/k2", line 2339, in build_kraken2_db
    proc.stdin.write(data)
BrokenPipeError: [Errno 32] Broken pipe
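
A `BrokenPipeError` on `proc.stdin.write` generally means the child process (here `build_db`) stopped reading, so the underlying cause would be in that process's exit status or error output, which the thread does not show. One possibility worth ruling out is memory: the log reports a hash table of about 497 GB. The short Python sketch below shows how that figure relates to the `-c` capacity in the log, assuming 4 bytes per hash table cell and binary gigabytes; both are assumptions that happen to reproduce the logged number, not details confirmed in this thread.

```python
# Capacity passed to build_db via -c in the log above.
CAPACITY_CELLS = 133379191954.28572
# Assumption: each hash table cell occupies 4 bytes.
BYTES_PER_CELL = 4


def hash_table_gib(cells, bytes_per_cell=BYTES_PER_CELL):
    """Convert a cell count into binary gigabytes (2**30 bytes)."""
    return cells * bytes_per_cell / 2**30


required_gib = hash_table_gib(CAPACITY_CELLS)
print(f"{required_gib:.2f}")  # prints 496.88, matching the logged estimate
```

If the build machine has less free memory than this, the build step could be killed partway through, which would be consistent with the broken pipe, though the log alone does not confirm it.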
