forked from UCDenver-ccp/datasource
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
updated README with maven pom info and RDF generation instructions
- Loading branch information
1 parent
c1c194c
commit 95e3855
Showing
1 changed file
with
141 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,143 @@ | ||
# datasource | ||
A library of code for parsing (mostly biomedical) data source files and converting their contents to RDF | ||
|
||
This library contains file parsers for files from many different biomedical databases. It also contains | ||
code that uses a file parser as input and outputs RDF. The structure of the RDF is described in: | ||
``` | ||
KaBOB: Ontology-Based Semantic Integration of Biomedical Databases | ||
Kevin M Livingston, Michael Bada, William A Baumgartner, Lawrence E Hunter | ||
BMC Bioinformatics (accepted) | ||
``` | ||
|
||
## Use with Java 1.8 | ||
``` | ||
Note: This code was developed using Java 1.7. There have been reports that the | ||
project won't build using Java 1.8. | ||
``` | ||
|
||
|
||
## Maven signature if only using the file parser API | ||
```xml | ||
<dependency> | ||
<groupId>edu.ucdenver.ccp</groupId> | ||
<artifactId>datasource-fileparsers</artifactId> | ||
<version>0.5</version> | ||
</dependency> | ||
|
||
<repository> | ||
<id>bionlp-sourceforge</id> | ||
<url>http://svn.code.sf.net/p/bionlp/code/repo/</url> | ||
</repository> | ||
``` | ||
|
||
## Maven signature if interested in generating RDF of parsed file content | ||
```xml | ||
<dependency> | ||
<groupId>edu.ucdenver.ccp</groupId> | ||
<artifactId>datasource-rdfizer</artifactId> | ||
<version>0.5</version> | ||
</dependency> | ||
|
||
<repository> | ||
<id>bionlp-sourceforge</id> | ||
<url>http://svn.code.sf.net/p/bionlp/code/repo/</url> | ||
</repository> | ||
``` | ||
|
||
## Bulk RDF Generation | ||
This library has been built to work easily with distributed resource management | ||
systems such as Oracle Grid Engine or Torque. All this really means is that there | ||
is a script available that will kick of RDF generation for a specific file parser | ||
based on an integer argument. | ||
|
||
#### Integer -to- File mappings | ||
To see the integer-to-file mappings, | ||
run edu.ucdenver.ccp.datasource.rdfizer.rdf.ice.FileDataSource.main() | ||
Note that due to licensing issues, some files are not available for download directly. | ||
The resources denoted in italics below must be manually obtained in order to be used. | ||
Those resources not listed in italics are capable of being automatically downloaded at | ||
RDF generation time. | ||
``` | ||
*1 ==> DIP* | ||
*2 ==> HPRD_ID_MAPPINGS* | ||
*3 ==> TRANSFAC_GENE* | ||
*4 ==> TRANSFAC_MATRIX* | ||
*5 ==> GAD* | ||
6 ==> PHARMGKB_DISEASE | ||
7 ==> PHARMGKB_GENE | ||
*8 ==> PHARMGKB_RELATION* | ||
9 ==> PHARMGKB_DRUG | ||
10 ==> DRUGBANK | ||
11 ==> HGNC | ||
12 ==> HOMOLOGENE | ||
13 ==> IREFWEB | ||
14 ==> MGI_ENTREZGENE | ||
15 ==> MGI_MGIPHENOGENO | ||
16 ==> MGI_MRKLIST | ||
17 ==> MGI_MRKREFERENCE | ||
18 ==> MGI_MRKSEQUENCE | ||
19 ==> MGI_MRKSWISSPROT | ||
20 ==> MIRBASE | ||
*21 ==> OMIM* | ||
22 ==> RGD_GENES | ||
23 ==> RGD_GENE_MP | ||
24 ==> RGD_GENE_RDO | ||
25 ==> RGD_GENE_NBO | ||
26 ==> RGD_GENE_PW | ||
27 ==> PREMOD_HUMAN | ||
28 ==> PREMOD_MOUSE | ||
29 ==> PR_MAPPINGFILE | ||
30 ==> REACTOME_UNIPROT2PATHWAYSTID | ||
31 ==> REFSEQ_RELEASECATALOG | ||
32 ==> NCBIGENE_GENE2REFSEQ | ||
33 ==> NCBIGENE_GENEINFO | ||
34 ==> NCBIGENE_MIM2GENE | ||
35 ==> NCBIGENE_REFSEQUNIPROTCOLLAB | ||
36 ==> GOA | ||
37 ==> UNIPROT_SWISSPROT | ||
38 ==> UNIPROT_IDMAPPING | ||
39 ==> UNIPROT_TREMBL_SPARSE | ||
40 ==> INTERPRO_NAMESDAT | ||
41 ==> INTERPRO_INTERPRO2GO | ||
42 ==> INTERPRO_PROTEIN2IPR | ||
``` | ||
|
||
While this is very convenient when dealing with some job schedulers, | ||
it also allows for easy execution of single RDF generation jobs. For | ||
example, to generate RDF for the MirBase database file (index = 20): | ||
|
||
``` | ||
export DATA_DIR=[BASE_DIRECTORY_WHERE_DATA_FILES_TO_PARSE_LIVE] | ||
export RDF_DIR=[BASE_DIRECTORY_WHERE_RDF_WILL_BE_WRITTEN] | ||
mkdir $DATA_DIR | ||
mkdir $RDF_DIR | ||
export DATE=[TODAYS_DATE_TO_TIMESTAMP_THE_DATA e.g. 2015-04-16] | ||
cd datasource | ||
mvn clean install | ||
mvn -f datasource-rdfizer/scripts/pom-rdf-gen.xml exec:exec -DstartStage=20 \ | ||
-DnumStages=1 -DbaseSourceDir=$DATA_DIR -DbaseRdfDir=$RDF_DIR -DcompressRdf=true \ | ||
-DoutputRecordLimit=-1 -Ddate=$DATE > rdfgen.log 2>&1 | ||
``` | ||
|
||
Note: you may need to adjust the Java Heap size in pom-rdf-gen.xml depending on the | ||
memory limitations of your hardware. | ||
|
||
#### Species-specific subsets | ||
It can sometime be beneficial to limit RDF output to a specific species or group of species. | ||
Doing so can improve RDF generation time as well as limit the number of triples produced when | ||
parsing a file. Some of the file parsers are *species-aware* and there are two pre-built scripts | ||
that allow for RDF generation using species-specific subsets. | ||
|
||
``` | ||
For human, use: datasource-rdfizer/scripts/pom-rdf-gen-9606.xml | ||
For human plus seven model organisms, use: datasource-rdfizer/scripts/pom-rdf-gen-modelorgs.xml | ||
The seven model organisms are: fly, rat, mouse, yeast, worm, arabidopsis, zebrafish | ||
``` | ||
|
||
As mentioned above, note that when a taxon-aware file parser is used, some extra data is downloaded that ensures | ||
mappings from biological concepts to taxon identifiers are present. This download can be time | ||
consuming due to one of the files being very large, but it is a one-time cost. | ||
|
||
|
||
|