Each of the folders cordis, sdss and oncomx holds the relevant files (i.e. seed data, synth data and dev data) for the corresponding dataset. Additionally, each folder contains a tables.json file, which holds a JSON representation of the database schema, including table names, column names, column data types and primary/foreign key relationships.
The following is an example of the file structure:
- dev.json --> the manually generated development dataset
- seed.json --> the manually generated seed dataset
- synth.json --> the synthetically generated dataset using the seed query templates
- tables.json --> a json representation of the schema containing:
  - the database name ("db_id")
  - free text table names for NLP pipelines ("table_names"), e.g. "Stellar spectral line indices" vs "spplines"
  - original table names ("table_names_original"), i.e. the table names as they are in the database
  - free text column names for NLP pipelines ("column_names")
  - original column names ("column_names_original"), i.e. the column names as they are in the database
  - column data types ("column_types"): time, text or number
  - foreign key relationships ("foreign_keys")
  - primary keys ("primary_keys")
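As a rough illustration of how the fields above fit together, the following sketch groups the original column names and their types by table. It rests on assumptions not stated in this README: that tables.json is a list of schema objects (one per database), that each entry of "column_names_original" is a [table_index, column_name] pair with a leading [-1, "*"] placeholder, and that "column_types" is aligned with it by position. The function name load_schema and the sample column names are made up for the example.

```python
import json
from collections import defaultdict

# Hypothetical sample in the assumed layout; only "spplines" and its free
# text name come from this README, the other names are illustrative.
SAMPLE = json.dumps([{
    "db_id": "sdss",
    "table_names": ["Stellar spectral line indices", "spectroscopic object"],
    "table_names_original": ["spplines", "specobj"],
    "column_names": [[-1, "*"], [0, "spectroscopic object id"],
                     [1, "spectroscopic object id"], [1, "class"]],
    "column_names_original": [[-1, "*"], [0, "specobjid"],
                              [1, "specobjid"], [1, "class"]],
    "column_types": ["text", "number", "number", "text"],
    "foreign_keys": [[1, 2]],
    "primary_keys": [1, 2],
}])


def load_schema(tables_json_text):
    """Return {db_id: {table: [(column, type), ...]}} from tables.json text."""
    entries = json.loads(tables_json_text)
    # Assumption: the file is a list of schema objects; accept a single one too.
    if isinstance(entries, dict):
        entries = [entries]
    schemas = {}
    for entry in entries:
        tables = entry["table_names_original"]
        columns = defaultdict(list)
        # Assumption: column_types lines up with column_names_original by index.
        pairs = zip(entry["column_names_original"], entry["column_types"])
        for (table_idx, col_name), col_type in pairs:
            if table_idx < 0:  # skip the "*" placeholder column
                continue
            columns[tables[table_idx]].append((col_name, col_type))
        schemas[entry["db_id"]] = {t: columns[t] for t in tables}
    return schemas


if __name__ == "__main__":
    for table, cols in load_schema(SAMPLE)["sdss"].items():
        print(table, cols)
```

To use it on a real file, read the file contents first, e.g. load_schema(open("sdss/tables.json").read()), and adjust the assumptions above if the layout differs.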
The PostgreSQL databases for the three datasets used in this benchmark can be found at the following links: CORDIS, SDSS, OncoMX.
PostgreSQL specification:
- DBMS: PostgreSQL (ver. 9.5.20)
- Case sensitivity: plain=lower, delimited=exact
- Driver: PostgreSQL JDBC Driver (ver. 42.5.0, JDBC4.2)
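The case sensitivity setting means that plain (unquoted) identifiers are folded to lower case, while delimited (double-quoted) identifiers are matched exactly. A minimal sketch of what that implies when building SQL strings from the original table and column names is given below. The helper names quote_ident and needs_quoting are hypothetical, and the check only covers case folding, not reserved words or special characters.

```python
def quote_ident(name: str) -> str:
    """Return name as a delimited (double-quoted) PostgreSQL identifier.

    Embedded double quotes are escaped by doubling them.
    """
    return '"' + name.replace('"', '""') + '"'


def needs_quoting(name: str) -> bool:
    """True if a plain reference would be folded to a different name.

    Only case folding is considered here (an intentional simplification).
    """
    return name != name.lower()


def render_ident(name: str) -> str:
    """Quote the identifier only when lower-case folding would change it."""
    return quote_ident(name) if needs_quoting(name) else name


if __name__ == "__main__":
    # "spplines" is taken from the README; "SpecObj" is a made-up
    # mixed-case name used purely to show the quoting path.
    print(render_ident("spplines"))
    print(render_ident("SpecObj"))
```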