Skip to content
This repository has been archived by the owner on Oct 5, 2023. It is now read-only.
/ gene.model.db Public archive

Import refseq gene model (gff3) to sqlite3 db for gene.iobio and update with xref and canonical infor for transcripts

Notifications You must be signed in to change notification settings

iobio/gene.model.db

Repository files navigation

IMPORTANT: This is a deprecated repository. The new repo is geneinfo.db https://github.com/iobio/geneinfo.db

These are the steps involved in creating the gene.iobio.db

  1. Install sqlite3.

  2. Create the schema for gene.iobio.db by running sql/schema.ddl in sqlite3.

       sqlite3 gene.iobio.db < sql/schema.ddl
    
  3. Download and uncompress the gff3 files for Gencode and RefSeq (Ensembl) for GRCh37 and GRCh38. These files go in the data/gencode and data/refseq directories respectively.

  4. Populate the gene.iobio.db from the gff3 files for each genome build

    • Compile the java app.
    mvn install
    
    • Execute scripts/run_gff_importer.sh, which will import the gffs for each build an source. (This should take approximately 5 mins for each source/build to run)
    sh scripts/run_gff_importer.sh
    

    After the import, run the script to show the gene and transcript counts. Here is the output from the most recent run.

    sh scripts/verify_transcript_counts.sh
    
    GENE COUNTS
    gencode|GRCh37|57783
    gencode|GRCh38|61815
    refseq|GRCh37|31508
    refseq|GRCh38|47757
    
    TRANSCRIPT COUNTS
    gencode|GRCh37|196483
    gencode|GRCh38|251199
    refseq|GRCh37|81768
    refseq|GRCh38|155012
    
  5. Update the gene.iobio.db transcripts table to specify the UTR features, which are not included in the GFF3, but can be determined by reading the EXON and CDS features.

    • This is a long running update statement, so it works better to update one reference (for a build) at a time.

    • Run the bash shell script to update the features to include the UTRs.

      sh scripts/determineUTR.sh
      

      When this script finished, both build GRCh37 and GRCh38 will be updated. Update the .sh script if you want to run the updates on other builds.

    • Verify that the UTRs are now present in the features of the transcript

      sh scripts/verifyUTR.sh
      

      You should see 2 lines for the UTR features of the canonical transcript for gene RAI1. Compare the transcript counts of those with UTRs to the total number of transcripts to make sure that most of the transcripts have UTRs.

      "feature_type":"UTR"
      "feature_type":"UTR"
      
      Number of transcripts 
      gencode|GRCh37|196483
      gencode|GRCh38|251199
      refseq|GRCh37|81768
      refseq|GRCh38|155012
      
      Number of transcripts with UTR features
      gencode|GRCh37|192230
      gencode|GRCh38|247158
      refseq|GRCh37|81458
      refseq|GRCh38|154587
      
  6. Update the canonical information on the gene.iobio.db transcripts. For build GRCh38, there is a is_canonical flag that can be used. This flag can be found using the UCSC table browser.
    How to get canonical and ccsd flags on transcripts:

    • For GRCh38, you can get the canonical designation for a transcript using the UCSC table browser, use instructions outlined here

      
      In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
      
      1. Select hg38, and set "group:" "Genes and Gene Predictions" to track "GENCODE v22" (this should be the default selection).
      2. Change table from "knownGene" to "knownCanonical". 
      This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.
      
      
      3. Change "output format" to "selected fields from primary and related tables".
      
      This step allows you add information from other tables beyond the knownCanonical table.
      4. Click "get output".
      This screen is where we can add requests to get information from other tables, we are going to request information from the hg38 refGene table and the hg38 knownToEnsembl.
      5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
      6. Scroll to the very bottom and click the "allow selection from checked tables". 
      Now we can select the fields we want from each of these three tables.
      7. Under the "Select Fields from hg38.knownCanonical" click the box next to transcript.
      This will be the transcript location driving all the related table output.
      8. Under hg38.knownToEnsembl fields click "check all". 
      9. Under hg38.refGene fields click "check all". 
      10. Click "get output".
      
      The results will be rows like the following: 
      
      #hg38.knownCanonical.transcript    hg38.knownToEnsembl.name    hg38.knownToEnsembl.value    hg38.refGene.bin    hg38.refGene.name    hg38.refGene.chrom    hg38.refGene.strand    hg38.refGene.txStart    hg38.refGene.txEnd    hg38.refGene.cdsStart    hg38.refGene.cdsEnd    hg38.refGene.exonCount    hg38.refGene.exonStarts    hg38.refGene.exonEnds    hg38.refGene.score    hg38.refGene.name2    hg38.refGene.cdsStartStat    hg38.refGene.cdsEndStat    hg38.refGene.exonFrames
      
      uc001ggs.5    uc001ggs.5    ENST00000367772.7    29    NM_181093    chr1    -    169853075    169893959    169853712    169888840    14    169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787,    169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959,    0    SCYL3    cmpl    cmpl    0,1,1,0,1,2,2,1,0,0,0,0,0,-1,
      The first field (hg38.knownCanonical.transcript) is the id used from knownCanonical that is driving the selection of all the other data. The second two fields are the entire knownToEnsembl table that exists to provide the related ENST id (ENST00000367772.7), the remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.
      
      Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical
      
      
      In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
      
      1. Select hg38, and set "group:" "Genes and Gene Predictions" to track "GENCODE v22" (this should be the default selection).
      2. Change table from "knownGene" to "knownCanonical". 
      This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.
      
      
      3. Change "output format" to "selected fields from primary and related tables".
      
      This step allows you add information from other tables beyond the knownCanonical table.
      4. Click "get output".
      This screen is where we can add requests to get information from other tables, we are going to request information from the hg38 refGene table and the hg38 knownToEnsembl.
      5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
      6. Scroll to the very bottom and click the "allow selection from checked tables". 
      Now we can select the fields we want from each of these three tables.
      7. Under the "Select Fields from hg38.knownCanonical" click the box next to transcript.
      This will be the transcript location driving all the related table output.
      8. Under hg38.knownToEnsembl fields click "check all". 
      9. Under hg38.refGene fields click "check all". 
      10. Click "get output".
      
      The results will be rows like the following: 
      
      #hg38.knownCanonical.transcript    hg38.knownToEnsembl.name    hg38.knownToEnsembl.value    hg38.refGene.bin    hg38.refGene.name    hg38.refGene.chrom    hg38.refGene.strand    hg38.refGene.txStart    hg38.refGene.txEnd    hg38.refGene.cdsStart    hg38.refGene.cdsEnd    hg38.refGene.exonCount    hg38.refGene.exonStarts    hg38.refGene.exonEnds    hg38.refGene.score    hg38.refGene.name2    hg38.refGene.cdsStartStat    hg38.refGene.cdsEndStat    hg38.refGene.exonFrames
      
      uc001ggs.5    uc001ggs.5    ENST00000367772.7    29    NM_181093    chr1    -    169853075    169893959    169853712    169888840    14    169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787,    169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959,    0    SCYL3    cmpl    cmpl    0,1,1,0,1,2,2,1,0,0,0,0,0,-1,
      The first field (hg38.knownCanonical.transcript) is the id used from knownCanonical that is driving the selection of all the other data. The second two fields are the entire knownToEnsembl table that exists to provide the related ENST id (ENST00000367772.7), the remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.
      
      Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical
      
      • Remember to join to tables to get the gencode transcript id (align id)
      • Use excel to convert tsv to csv
    • For GRCh37, you can get the ccds_id from the UCSC table browser, using a similar technique described above.

    • Use excel to convert tsv to csv

    • Run sql script sql/update_transcripts_for_canonical.sql which will update both the is_canonical for GRCh38 transcripts and the ccds_id for GRCh37 transcripts.

      • First, update the .sql script .import lines to reference the files that you downloaded in the ./data/canonical directory
      • Make sure you are in the root directory where the gene.iobio.db file resides
      • Run the sql script
      sqlite3 gene.iobio.db 
      
      > (copy and paste contents of sql/update_transcripts_for_canonical.sql here)
      > .exit
      
    • Verify the counts

      sh scripts/verify_canonical.sh
      

      Here is the output from the latest run, showing ~62,000 transcripts for build GRCh38 flagged as canonical and ~200 transcripts for build GRCh37 with a CCDS id.:

      IS CANONICAL
      gencode|GRCh38||189384
      gencode|GRCh38|true|61815
      CCDS_ID
      gencode|GRCh37|36379
      
  7. Update the xref tables to map gencode transcripts to refseq transcripts

    • For GRCh38, use UCSC table browser.

    • For GRCh37, use Ensembl Biomart.

    • Use excel to convert tsv to csv

    • Now run sql script insert_transcripts_xref.sql in gene.iobio.db

      sqlite3 gene.iobio.db 
      
      > (copy and paste contents of ./sql/insert_transcripts_xref.sql here)
      > .exit
      
    • Verify the counts

      sh scripts/verify_transcripts_xref.sh
      

      Here is the output from the latest run:

       XREF GENCODE -> REFSEQ
       gencode|GRCh37|46377
       gencode|GRCh38|23495
       XREF REFSEQ -> GENCODE
       refseq|GRCh37|39496
       refseq|GRCh38|68295
      
  8. Record MANE transcripts

    • Download the MANE gff here

    • Extract the transcript lines from the 'gff', passing in the path to the MANE gff

      sh scripts/extract_mane_transcripts.sh data/mane/MANE.GRCh38.v1.0.ensembl_genomic.gff 
      > data/mane/transcripts.csv
      
    • Run sql script sql/update_mane_transcripts.sql which will update is_mane_select for GRCh38 transcripts.

      sqlite3 gene.iobio.db 
      
      > (copy and paste contents of ./sql/update_mane_transcripts.sql here)
      > .exit
      
    • Verify the counts

      sh scripts/verify_mane_select.sh
      

      Here is the output from the latest run:

      IS MANE SELECT
      gencode|GRCh38||232120
      gencode|GRCh38|true|19079
      refseq|GRCh38||133099
      refseq|GRCh38|true|21913
      

About

Import refseq gene model (gff3) to sqlite3 db for gene.iobio and update with xref and canonical infor for transcripts

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published