Problem with ensDbFromGff using recent ensembl gff files #97
Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the |
Hi
|
Thanks for reporting @weir12 . Note that |
@jorainer |
I've created the To use it simply do:
> library(ensembldb)
> edb <- EnsDb("EnsDb.Spombe.v96.sqlite")
> genes(edb)
GRanges object with 7268 ranges and 8 metadata columns:
              seqnames      ranges strand |     gene_id   gene_name
                 <Rle>   <IRanges>  <Rle> | <character> <character>
  SPBC460.01c AB325691   1479-3197      - | SPBC460.01c SPBC460.01c
  SPBC460.02c AB325691   8856-9803      - | SPBC460.02c SPBC460.02c
          ...      ...         ...    ... .         ...         ...
     SPMTR.03      MTR 15593-15721      - |    SPMTR.03     mat3-Mm
     SPMTR.04      MTR 15856-16401      + |    SPMTR.04     mat3-Mc
                gene_biotype seq_coord_system
                 <character>      <character>
  SPBC460.01c protein_coding       chromosome
  SPBC460.02c protein_coding       chromosome
          ...            ...              ...
     SPMTR.03 protein_coding       chromosome
     SPMTR.04 protein_coding       chromosome
                                                                                                                   description
                                                                                                                   <character>
  SPBC460.01c                                                    amino-acid permease, unknown [Source:PomBase;Acc:SPBC460.01c]
  SPBC460.02c eukaryotic translation elongation factor, glutathione S-transferase (predicted) [Source:PomBase;Acc:SPBC460.02c]
          ...                                                                                                              ...
     SPMTR.03                                     silenced mating-type m-specific polypeptide Mi [Source:PomBase;Acc:SPMTR.03]
     SPMTR.04                                     silenced mating-type m-specific polypeptide Mc [Source:PomBase;Acc:SPMTR.04]
              gene_id_version      symbol            entrezid
                  <character> <character>              <list>
  SPBC460.01c     SPBC460.01c SPBC460.01c                  NA
  SPBC460.02c     SPBC460.02c SPBC460.02c                  NA
          ...             ...         ...                 ...
     SPMTR.03        SPMTR.03     mat3-Mm c(2539637, 3361261)
     SPMTR.04        SPMTR.04     mat3-Mc             2540048
  -------
  seqinfo: 6 sequences from ASM294v2 genome
You can then also use the coordinate mapping tools, and since the database contains protein data you can also map to protein sequence-relative coordinates. |
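As a rough illustration of the coordinate mapping mentioned above, the sketch below maps one genomic position to transcript- and protein-relative coordinates with `genomeToTranscript()` and `genomeToProtein()` from ensembldb. The position (1500 on AB325691, inside the first gene in the listing) is picked arbitrarily for illustration, and the code assumes the SQLite file sits in the working directory; it simply skips the mapping if the package or the file is not available.

```r
## Assumption: the SQLite file from this thread is in the working directory.
db_file <- "EnsDb.Spombe.v96.sqlite"

if (requireNamespace("ensembldb", quietly = TRUE) && file.exists(db_file)) {
    library(ensembldb)  # also attaches GenomicRanges and IRanges
    edb <- EnsDb(db_file)

    ## Arbitrary example position inside SPBC460.01c (AB325691:1479-3197).
    pos <- GRanges("AB325691", IRanges(start = 1500, width = 1))

    ## Genomic -> transcript-relative coordinates.
    print(genomeToTranscript(pos, edb))

    ## Genomic -> protein-relative coordinates; positions outside a coding
    ## region cannot be mapped to a protein position.
    print(genomeToProtein(pos, edb))
} else {
    message("ensembldb or ", db_file, " not available; skipping the example.")
}
```

The reverse direction is available through `proteinToGenome()` and `transcriptToGenome()`, which take an `IRanges` whose names are protein or transcript IDs.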
@jorainer |
Hi Johannes, http://metazoa.ensembl.org/Lottia_gigantea/Info/Index
|
Hi @guidohooiveld , I'll create the |
The |
Wow, within 1 hr a working database... Thanks very much Johannes, much appreciated! |
Let me know if you need an updated version or others too. |
I am struggling with a similar issue, but 'AnnotationHub' doesn't contain information for Pseudomonas aeruginosa. |
@Olakmephd, bacteria are a little more difficult. Ensembl does not provide a single database for each bacterium, but has databases for collections of bacteria. I have not yet found a good way to extract the data for a single bacterium from such a database using You could however download the GFF file for this bacterium e.g. from:
> library(ensembldb)
> db <- ensDbFromGff("Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
-------------
Proceeding to create the database.
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o transcript_biotype ... OK
o transcript_name ... Nope
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
-------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
invalid factor level, NA generated
2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty!
4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, :
Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths.
5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
Unable to retrieve sequence lengths from Ensembl.
You can ignore the warning messages. Note that the function fails to retrieve the chromosome lengths from Ensembl, so that information will be missing from the database. You can then load this database with:
> edb <- EnsDb(db)
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Mon May 16 06:06:51 2022
|ensembl_version: 49
|ensembl_host: unknown
|Organism: Pseudomonas_aeruginosa_gca_003325605
|genome_build: ASM332560v1
|DBSCHEMAVERSION: 1.0
|source_file: Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz
| No. of genes: 8770.
| No. of transcripts: 8770.
> organism(edb)
[1] "Pseudomonas aeruginosa gca 003325605" |
@jorainer Thanks
|
@IrshadUlHaq1 , the warnings just tell you that some of the content is not available in the |
@jorainer, thank you for your input. The generated 'edb' is usable; however, my end goal is to extract the 'entrez' IDs for a subset of the 'gene_id' values, because I need them to perform pathway analyses. I am aware that the 'gff' file does not contain 'entrez' info, so I wonder if you have any other suggestions for me? I tried to generate the 'edb' using the Ensembl Perl API but I ran into problems on my Arch Linux system. On another note, could the 'gene_id' values be converted to 'entrez' IDs by a different package such as 'clusterProfiler'? Also, the 'gene_id' values in the 'edb' are 'stable IDs' and I am having a hard time understanding the differences here. I appreciate your help and time. |
Unfortunately the scripts in I guess it should be possible to convert the IDs in the The |
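One possible way to attach Entrez IDs to a GFF-derived database is to join its gene table against a mapping table obtained elsewhere. The sketch below shows only that join, in base R; the helper `add_entrez()`, the table `id_map`, and all IDs in the toy data are hypothetical and invented for illustration. With a real `EnsDb`, the gene table could be pulled with `genes(edb, columns = c("gene_id", "gene_name"), return.type = "data.frame")`.

```r
## Hypothetical helper: left-join Entrez IDs from an external mapping table.
## Genes without a match keep an NA in the 'entrezid' column.
add_entrez <- function(gene_tbl, id_map) {
    merge(gene_tbl, id_map, by = "gene_id", all.x = TRUE, sort = FALSE)
}

## Toy gene table standing in for the output of genes(edb, ...).
gene_tbl <- data.frame(gene_id = c("geneA", "geneB", "geneC"),
                       gene_name = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

## Toy mapping table; in practice this would come from an external resource.
id_map <- data.frame(gene_id = c("geneA", "geneC"),
                     entrezid = c("111", "333"),
                     stringsAsFactors = FALSE)

res <- add_entrez(gene_tbl, id_map)
print(res)
```

Whether such a mapping table exists for a given bacterial strain depends on the external resource, so the unmatched (`NA`) rows are worth checking before running a pathway analysis.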
I'm having trouble with ensDbFromGff for some gff3 files downloaded from the Ensembl FTP site. For example, Danio rerio in Ensembl versions 94, 95, and 96 (but not Ensembl versions 93, 92, 91, 90, or 85), using the following code:
I get the corresponding warning message:
And zero transcripts show up using transcripts(edb)
This also happens for Ailuropoda_melanoleuca (works for Ensembl version 85 but not 91), so I suspect this is a widespread problem across organisms.
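When transcripts go missing like this, one way to start diagnosing it is to check which feature types (column 3) a GFF3 file actually contains and compare a working release against a failing one. The sketch below does this in base R only; the helper name `gff_feature_types()` and the toy file content are made up for illustration and are not taken from an Ensembl file. `read.delim()` also reads gzip-compressed files directly, so a downloaded `.gff3.gz` path can be passed as is.

```r
## Tabulate the feature types (column 3) of a GFF3 file using base R only.
gff_feature_types <- function(path) {
    gff <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                      stringsAsFactors = FALSE,
                      col.names = c("seqid", "source", "type", "start", "end",
                                    "score", "strand", "phase", "attributes"))
    table(gff$type)
}

## Toy GFF3 content, invented for illustration.
toy <- c("##gff-version 3",
         "1\ttoy\tgene\t100\t900\t.\t+\t.\tID=gene:G1",
         "1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1",
         "1\ttoy\texon\t100\t400\t.\t+\t.\tID=exon:E1;Parent=transcript:T1",
         "1\ttoy\texon\t600\t900\t.\t+\t.\tID=exon:E2;Parent=transcript:T1")
tmp <- tempfile(fileext = ".gff3")
writeLines(toy, tmp)

types <- gff_feature_types(tmp)
print(types)
```

Running the helper on the GFF3 files of two releases and comparing the resulting tables shows whether the set of feature types changed between them.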