Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added mapping to ensembl chrom names in fetchExtendedChromInfoFromUCSC #5

Merged
merged 3 commits into from
Dec 20, 2019

Conversation

ivanek
Copy link
Contributor

@ivanek ivanek commented Dec 21, 2018

The function fetchExtendedChromInfoFromUCSC can now fetch the mapping to ensembl chromosomes from UCSC tables (chromAlias and/or ucscToEnsembl) for selected genomes (stored in SUPPORTED_UCSC_GENOMES

following on issue #4

@hpages
Copy link
Contributor

hpages commented Jan 7, 2019

Thanks @ivanek

The proposed improvement to fetchExtendedChromInfoFromUCSC seems useful but we cannot accept the PR in its current form. See below.

Best,
H.

  1. The current implementation breaks the examples for fetchExtendedChromInfoFromUCSC and Seqinfo:

    example(fetchExtendedChromInfoFromUCSC)
    # Error in data.frame(..., check.names = FALSE) : 
    #   arguments imply differing number of rows: 17, 18
    
    example(Seqinfo)
    # Error in .getDataInFile(species) : 
    #   Organism Bos taurus is not supported by GenomeInfoDb
    

    Note that running R CMD check (which you should absolutely do before sending a PR) should report these problems.

    These errors are easy to reproduce and affect many genomes e.g.:

    fetchExtendedChromInfoFromUCSC("galGal4")
    # Error in .getDataInFile(species) : 
    #   Organism Gallus gallus is not supported by GenomeInfoDb
    
    sacCer3_ci <- fetchExtendedChromInfoFromUCSC("sacCer3")
    # Error in data.frame(..., check.names = FALSE) : 
    #   arguments imply differing number of rows: 17, 18
    

    Also for some genomes (e.g. hg18) the returned Ensembl seqlevels are not useful so it would probably be better to not return them in that case:

    suppressWarnings(fetchExtendedChromInfoFromUCSC("hg18")$Ensembl_seqlevels)
    #  [1] <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
    # [11] <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
    # [21] <NA>   <NA>   <NA>   <NA>   <NA>   c6_COX c22_H2 c5_H2  c6_QBL <NA>  
    # [31] <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
    # [41] <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
    # Levels: c22_H2 c5_H2 c6_COX c6_QBL
    

    Please make sure that the proposed change works properly on all the genomes supported by fetchExtendedChromInfoFromUCSC().

  2. Use only 1 new field (e.g. get_Ensembl_seqlevels_from) instead of 2 (chromAlias and ucscToEnsembl) to support this new feature. The value of the field will then be c("chromAlias", "ucscToEnsembl") or a subset of it. When the UCSC seqlevels cannot be mapped to the Ensembl seqlevels (or when the mapping is incomplete and/or not useful like in the case of hg38), the field should be missing instead of being set to NULL.

  3. The new column returned by fetchExtendedChromInfoFromUCSC() should be named Ensembl_seqlevel (singular) instead of Ensembl_seqlevels (plural) for consistency with columns UCSC_seqlevel and NCBI_seqlevel.

  4. The new column should be a character vector, not a factor.

  5. Use suppressWarnings() to silence warnings possibly returned by mapGenomeBuilds() when the supplied genome is not found.

  6. This line will do the right thing only if chroms$UCSC is identical to names(ensembl_seqlevels):

    ensembl_seqlevels[chroms$UCSC] <- chroms$Ensembl
    

    but it doesn't seem safe to assume that this will always be the case.

  7. Variable name ucsc_wo_e is too cryptic. Please choose a name that can be understood by everybody.

  8. Finally the man page for fetchExtendedChromInfoFromUCSC would need to be updated to reflect the new behavior.

@ivanek
Copy link
Contributor Author

ivanek commented Jan 8, 2019

Dear @hpages, Thank you for very useful comments. I tried to implement them in the PR.

@ivanek
Copy link
Contributor Author

ivanek commented Feb 20, 2019

Dear @hpages, is there a chance to accept it before next release?

@hpages
Copy link
Contributor

hpages commented Mar 6, 2019

Hi @ivanek , I'll take a look ASAP. Thanks for the reminder!

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Mar 12, 2019

Hi Hervé, @hpages
Note. I ran R CMD check on the PR and there are no issues.

@hpages
Copy link
Contributor

hpages commented Dec 20, 2019

@ivanek Patch looks good. Sorry that it took so long and thanks again for implementing this useful feature.

@hpages hpages merged commit 8a8f4d8 into Bioconductor:master Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants