GitHub - iobio/gene.model.db: Import refseq gene model (gff3) to sqlite3 db for gene.iobio and update with xref and canonical infor for transcripts

IMPORTANT: This is a deprecated repository. The new repo is geneinfo.db https://github.com/iobio/geneinfo.db

These are the steps involved in creating the gene.iobio.db

Install sqlite3.
Create the schema for gene.iobio.db by running sql/schema.ddl in sqlite3.
```
   sqlite3 gene.iobio.db < sql/schema.ddl
```
Download and uncompress the gff3 files for Gencode and RefSeq (Ensembl) for GRCh37 and GRCh38. These files go in the data/gencode and data/refseq directories respectively.
- Gencode
  - GRCh37
  - GRCh38
- RefSeq
  - GRCh37
  - GRCh38
Populate the gene.iobio.db from the gff3 files for each genome build
- Compile the java app.
  - Install Java JDK
  - Install Apache Maven
  - From the command line, build the project
```
mvn install
```
- Execute scripts/run_gff_importer.sh, which will import the gffs for each build an source. (This should take approximately 5 mins for each source/build to run)
```
sh scripts/run_gff_importer.sh
```
After the import, run the script to show the gene and transcript counts. Here is the output from the most recent run.
```
sh scripts/verify_transcript_counts.sh
```
```
GENE COUNTS
gencode|GRCh37|57783
gencode|GRCh38|61815
refseq|GRCh37|31508
refseq|GRCh38|47757

TRANSCRIPT COUNTS
gencode|GRCh37|196483
gencode|GRCh38|251199
refseq|GRCh37|81768
refseq|GRCh38|155012
```
Update the gene.iobio.db transcripts table to specify the UTR features, which are not included in the GFF3, but can be determined by reading the EXON and CDS features.
- This is a long running update statement, so it works better to update one reference (for a build) at a time.
- Run the bash shell script to update the features to include the UTRs.
```
sh scripts/determineUTR.sh
```
  When this script finished, both build GRCh37 and GRCh38 will be updated. Update the .sh script if you want to run the updates on other builds.
- Verify that the UTRs are now present in the features of the transcript
```
sh scripts/verifyUTR.sh
```
  You should see 2 lines for the UTR features of the canonical transcript for gene RAI1. Compare the transcript counts of those with UTRs to the total number of transcripts to make sure that most of the transcripts have UTRs.
```
"feature_type":"UTR"
"feature_type":"UTR"

Number of transcripts 
gencode|GRCh37|196483
gencode|GRCh38|251199
refseq|GRCh37|81768
refseq|GRCh38|155012

Number of transcripts with UTR features
gencode|GRCh37|192230
gencode|GRCh38|247158
refseq|GRCh37|81458
refseq|GRCh38|154587
```

Update the canonical information on the gene.iobio.db transcripts. For build GRCh38, there is a is_canonical flag that can be used. This flag can be found using the UCSC table browser.
How to get canonical and ccsd flags on transcripts:

For GRCh38, you can get the canonical designation for a transcript using the UCSC table browser, use instructions outlined here


In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

1. Select hg38, and set "group:" "Genes and Gene Predictions" to track "GENCODE v22" (this should be the default selection).
2. Change table from "knownGene" to "knownCanonical". 
This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.


3. Change "output format" to "selected fields from primary and related tables".

This step allows you add information from other tables beyond the knownCanonical table.
4. Click "get output".
This screen is where we can add requests to get information from other tables, we are going to request information from the hg38 refGene table and the hg38 knownToEnsembl.
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
6. Scroll to the very bottom and click the "allow selection from checked tables". 
Now we can select the fields we want from each of these three tables.
7. Under the "Select Fields from hg38.knownCanonical" click the box next to transcript.
This will be the transcript location driving all the related table output.
8. Under hg38.knownToEnsembl fields click "check all". 
9. Under hg38.refGene fields click "check all". 
10. Click "get output".

The results will be rows like the following: 

#hg38.knownCanonical.transcript    hg38.knownToEnsembl.name    hg38.knownToEnsembl.value    hg38.refGene.bin    hg38.refGene.name    hg38.refGene.chrom    hg38.refGene.strand    hg38.refGene.txStart    hg38.refGene.txEnd    hg38.refGene.cdsStart    hg38.refGene.cdsEnd    hg38.refGene.exonCount    hg38.refGene.exonStarts    hg38.refGene.exonEnds    hg38.refGene.score    hg38.refGene.name2    hg38.refGene.cdsStartStat    hg38.refGene.cdsEndStat    hg38.refGene.exonFrames

uc001ggs.5    uc001ggs.5    ENST00000367772.7    29    NM_181093    chr1    -    169853075    169893959    169853712    169888840    14    169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787,    169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959,    0    SCYL3    cmpl    cmpl    0,1,1,0,1,2,2,1,0,0,0,0,0,-1,
The first field (hg38.knownCanonical.transcript) is the id used from knownCanonical that is driving the selection of all the other data. The second two fields are the entire knownToEnsembl table that exists to provide the related ENST id (ENST00000367772.7), the remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.

Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical


In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

1. Select hg38, and set "group:" "Genes and Gene Predictions" to track "GENCODE v22" (this should be the default selection).
2. Change table from "knownGene" to "knownCanonical". 
This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.


3. Change "output format" to "selected fields from primary and related tables".

This step allows you add information from other tables beyond the knownCanonical table.
4. Click "get output".
This screen is where we can add requests to get information from other tables, we are going to request information from the hg38 refGene table and the hg38 knownToEnsembl.
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
6. Scroll to the very bottom and click the "allow selection from checked tables". 
Now we can select the fields we want from each of these three tables.
7. Under the "Select Fields from hg38.knownCanonical" click the box next to transcript.
This will be the transcript location driving all the related table output.
8. Under hg38.knownToEnsembl fields click "check all". 
9. Under hg38.refGene fields click "check all". 
10. Click "get output".

The results will be rows like the following: 

#hg38.knownCanonical.transcript    hg38.knownToEnsembl.name    hg38.knownToEnsembl.value    hg38.refGene.bin    hg38.refGene.name    hg38.refGene.chrom    hg38.refGene.strand    hg38.refGene.txStart    hg38.refGene.txEnd    hg38.refGene.cdsStart    hg38.refGene.cdsEnd    hg38.refGene.exonCount    hg38.refGene.exonStarts    hg38.refGene.exonEnds    hg38.refGene.score    hg38.refGene.name2    hg38.refGene.cdsStartStat    hg38.refGene.cdsEndStat    hg38.refGene.exonFrames

uc001ggs.5    uc001ggs.5    ENST00000367772.7    29    NM_181093    chr1    -    169853075    169893959    169853712    169888840    14    169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787,    169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959,    0    SCYL3    cmpl    cmpl    0,1,1,0,1,2,2,1,0,0,0,0,0,-1,
The first field (hg38.knownCanonical.transcript) is the id used from knownCanonical that is driving the selection of all the other data. The second two fields are the entire knownToEnsembl table that exists to provide the related ENST id (ENST00000367772.7), the remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.

Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical

Remember to join to tables to get the gencode transcript id (align id)
Use excel to convert tsv to csv

For GRCh37, you can get the ccds_id from the UCSC table browser, using a similar technique described above.

Use excel to convert tsv to csv
Run sql script sql/update_transcripts_for_canonical.sql which will update both the is_canonical for GRCh38 transcripts and the ccds_id for GRCh37 transcripts.
- First, update the .sql script .import lines to reference the files that you downloaded in the ./data/canonical directory
- Make sure you are in the root directory where the gene.iobio.db file resides
- Run the sql script
```
sqlite3 gene.iobio.db 
```
```
> (copy and paste contents of sql/update_transcripts_for_canonical.sql here)
> .exit
```
Verify the counts
```
sh scripts/verify_canonical.sh
```
Here is the output from the latest run, showing ~62,000 transcripts for build GRCh38 flagged as canonical and ~200 transcripts for build GRCh37 with a CCDS id.:
```
IS CANONICAL
gencode|GRCh38||189384
gencode|GRCh38|true|61815
CCDS_ID
gencode|GRCh37|36379
```

Update the xref tables to map gencode transcripts to refseq transcripts
- For GRCh38, use UCSC table browser.
- For GRCh37, use Ensembl Biomart.
- Use excel to convert tsv to csv
- Now run sql script insert_transcripts_xref.sql in gene.iobio.db
```
sqlite3 gene.iobio.db 
```
```
> (copy and paste contents of ./sql/insert_transcripts_xref.sql here)
> .exit
```
- Verify the counts
```
sh scripts/verify_transcripts_xref.sh
```
  Here is the output from the latest run:
```
 XREF GENCODE -> REFSEQ
 gencode|GRCh37|46377
 gencode|GRCh38|23495
 XREF REFSEQ -> GENCODE
 refseq|GRCh37|39496
 refseq|GRCh38|68295
```

Record MANE transcripts

Download the MANE gff here

Extract the transcript lines from the 'gff', passing in the path to the MANE gff

sh scripts/extract_mane_transcripts.sh data/mane/MANE.GRCh38.v1.0.ensembl_genomic.gff 
> data/mane/transcripts.csv

Run sql script sql/update_mane_transcripts.sql which will update is_mane_select for GRCh38 transcripts.

sqlite3 gene.iobio.db

> (copy and paste contents of ./sql/update_mane_transcripts.sql here)
> .exit

Verify the counts

sh scripts/verify_mane_select.sh

Here is the output from the latest run:

IS MANE SELECT
gencode|GRCh38||232120
gencode|GRCh38|true|19079
refseq|GRCh38||133099
refseq|GRCh38|true|21913

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.settings		.settings
data		data
doc		doc
scripts		scripts
sql		sql
src		src
target/classes/geneModelGffImporter		target/classes/geneModelGffImporter
.gitignore		.gitignore
.project		.project
pom.xml		pom.xml
readme.md		readme.md
sqlite-jdbc-3.8.11.2.jar		sqlite-jdbc-3.8.11.2.jar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

iobio/gene.model.db

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages