GitSchemas (Permissive Licenses)
This repository contains scripts to crawl SQL files from GitHub, parse them, and extract structured database schema information from them. We do this because we want to learn about the semantics of database tables in the wild (table names, column names, foreign key relations, etc.).
Disclaimer: The dataset was created with research problems in the area of large-scale data integration in mind, such as foreign key detection and CSV header detection. For other use cases, we strongly encourage users to revisit the data collection, parsing, and extraction methods and align them with their concrete research objective.
Samples containing only data from repositories with a permissive license, such as MIT or Apache-2.0, are available for direct download below. We can provide other datasets for reproduction purposes upon request.
- GitSchemas Raw SQL File URLs (CSV)
- GitHub SQL Files Dataset (ZIP)
- GitSchemas (JSON; Full Dataset)
- Foreign Key Detection Training Dataset (CSV, Parquet)
The GitHub SQL Crawler was built using the GitHub Search API. We search GitHub for SQL code that contains a CREATE TABLE and a FOREIGN KEY statement to make sure we retrieve SQL scripts which actually define a schema.
The GitHub Search index contains ~7.8M SQL files, out of which ~0.7M contain the "CREATE TABLE FOREIGN KEY" keywords (see the table below).
Query | # Search Results (Nov. 2021) |
---|---|
language: SQL | ~7.8M |
keywords: "CREATE TABLE FOREIGN KEY", language: SQL | ~0.8M |
keywords: "CREATE TABLE FOREIGN KEY", language: SQL (successfully downloaded) | ~0.7M |
keywords: "CREATE TABLE FOREIGN KEY", language: SQL (deduplicated) | ~0.37M |
The file sizes of all SQL scripts containing the desired keywords are distributed as follows:
It is interesting to note that there are no files larger than 400 KB. Perhaps GitHub does not perform language detection on files above 400 KB.
The crawler performs the following three steps:
- (1) crawl a list of URLs
- (2) download files based on the list of URLs
- (3) deduplicate the downloaded files based on their sha256 hash
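Steps (2) and (3) amount to fetching each URL and keeping one file per content hash. A simplified sketch (paths and function names are illustrative, not the actual crawler code):

```python
import hashlib
import pathlib

import requests

def download_files(urls, out_dir="sql_files"):
    """Step (2): download every file from the crawled URL list."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            (out / f"{i:06d}.sql").write_bytes(resp.content)
        except requests.RequestException:
            pass  # some files are gone or private by now; skip them

def deduplicate(in_dir="sql_files"):
    """Step (3): keep only one file per sha256 content hash."""
    seen = set()
    for path in sorted(pathlib.Path(in_dir).glob("*.sql")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()  # identical content was already kept under another name
        else:
            seen.add(digest)
```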
To facilitate downstream use cases, we want to extract structured schema information from the crawled SQL files. This includes table names, column names, primary keys, and foreign keys, including their referenced table and column names. However, the extraction is no simple feat, because the crawled SQL files may use different SQL dialects, contain comments, be incomplete, and/or contain syntactical errors.
We tried different parsing options available in Python, including the libraries sqlparse (non-validating), mysqlparse (MySQL), pglast (Postgres), and queryparser (MySQL/Postgres). Ultimately, we found pglast to provide the best tradeoff between convenience and parsing success (~16%), leaving us with a total of 61,038 schemas (see table below).
Description | # of files |
---|---|
All SQL files | 373,156 |
Parsable with pglast | 61,038 (16.36%) |
pglast extracts an abstract syntax tree (AST) from the SQL script, which we ultimately translate to JSON to facilitate further analysis.
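As a rough illustration of this step, here is a simplified sketch assuming pglast 3+ and its AST node classes; it is not the actual extraction code, and key constraints are omitted for brevity:

```python
from pglast import ast, parse_sql

SQL = """
CREATE TABLE state (
  id serial PRIMARY KEY,
  state_name varchar(30)
);
"""

schema = {"TABLES": {}}
for raw in parse_sql(SQL):           # one RawStmt per statement in the script
    stmt = raw.stmt
    if not isinstance(stmt, ast.CreateStmt):
        continue                      # we only care about CREATE TABLE statements
    table = {"COLUMNS": [], "PRIMARY_KEYS": [], "FOREIGN_KEYS": []}
    for elt in stmt.tableElts or ():
        if isinstance(elt, ast.ColumnDef):
            table["COLUMNS"].append(elt.colname)
        # primary/foreign keys appear as ast.Constraint nodes, either attached
        # to a column or as table-level elements, and are handled analogously
    schema["TABLES"][stmt.relation.relname] = table

print(schema)  # {'TABLES': {'state': {'COLUMNS': ['id', 'state_name'], ...}}}
```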
Note: There should be quite some room for improvement in terms of the parsing success rate. For example, backtick quotes (`` ` ``, MySQL-style) are incompatible with the Postgres parser and currently lead to an immediate error. While something like this might be easy to solve with a search-and-replace, other issues are more intricate. There are many different SQL dialects, and an open-source parser is not available for every dialect. It therefore does not seem feasible to simply trial-and-error all possible parsers/dialects. Creating a robust SQL parser that can extract schema information from SQL files without knowing which particular database system the query was written for could be an interesting project in and of itself. Any useful pointers in this regard are highly appreciated.
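For example, the backtick issue mentioned above can be approximated by rewriting MySQL-style identifier quotes into the double quotes Postgres expects before handing the script to the parser (a naive global replace that will, of course, also touch backticks inside string literals or comments):

```python
mysql_sql = 'CREATE TABLE `users` (`id` INT PRIMARY KEY);'
postgres_sql = mysql_sql.replace('`', '"')
# 'CREATE TABLE "users" ("id" INT PRIMARY KEY);' -- now acceptable to the Postgres parser
```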
The parsing step results in a JSON file which looks as follows. A sample dataset can be downloaded (see Download Links section):
{
  "schema_000001": {
    "INFO": {
      "user": "user_xxx",
      "url": "https://github.com/2212khushboo/Project/blob/5a6746d../db/sql/tables.sql?raw=true\n",
      "filename": "tables.sql",
      "project": "Project",
      "filesize": 2885},
    "TABLES": {
      "state": {
        "COLUMNS": [
          ["id", "serial"],
          ["uuid", "varchar"],
          ["state_name", "varchar"]],
        "PRIMARY_KEYS": ["id"],
        "FOREIGN_KEYS": []},
      "address": {
        "COLUMNS": [
          ["id", "serial"],
          ["uuid", "varchar"],
          ["flat_buil_number", "varchar"],
          ["locality", "varchar"],
          ["city", "varchar"],
          ["pincode", "varchar"],
          ["state_id", "int4"]],
        "PRIMARY_KEYS": ["id"],
        "FOREIGN_KEYS": [{
          "FOREIGN_KEY": ["state_id"],
          "REFERENCE_TABLE": "state",
          "REFERENCE_COLUMN": ["id"]}]}}},
  "schema_000002": {....}
}
The following section gives an overview of some of the properties of the dataset.
Entity | # |
---|---|
schemas | 61,038 |
tables | 393,653 |
columns | 2,544,164 |
unique column names | 303,443 |
primary keys | 322,627 |
unique primary key names | 31,599 |
foreign keys | 175,589 |
unique foreign key names | 31,041 |
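For reference, counts along these lines can be computed straight from the JSON file; a minimal sketch (the filename is a placeholder, key names follow the sample above):

```python
import json

with open("gitschemas_sample.json") as f:  # placeholder filename
    data = json.load(f)

n_tables = n_columns = n_pks = n_fks = 0
for schema in data.values():
    for table in schema["TABLES"].values():
        n_tables += 1
        n_columns += len(table["COLUMNS"])
        n_pks += len(table["PRIMARY_KEYS"])
        n_fks += len(table["FOREIGN_KEYS"])

print(len(data), n_tables, n_columns, n_pks, n_fks)
```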
Apart from the raw JSON data, we derived a tabular dataset specifically for the foreign key detection problem. The dataset contains only those tables that have a non-composite foreign key relation to another table. For tables that have multiple foreign key relations, we added one row per relation. It has the following schema, and a sample can be downloaded as a CSV or Parquet file (see Download Links section):
schema | table_name_a | table_name_b | columns_a | columns_b | primary_keys_a | primary_keys_b | key_a | key_b |
---|---|---|---|---|---|---|---|---|
000361_xxxx.sql | event | resourcetbl | planno, lineno, locno, resno, timestart, time... | id | id | resno | resno | |
000376_yyyy.derby.sql | category | db_category | id, name, parent_id | id, name, parent_id | id | id | parent_id | id |
Total rows: 124,605
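The sample can be loaded directly with pandas, for example (the filename is a placeholder for the downloaded file):

```python
import pandas as pd

# the Parquet and CSV variants carry the same columns; reading Parquet needs pyarrow or fastparquet
fk_pairs = pd.read_parquet("gitschemas_fk_detection_sample.parquet")
# fk_pairs = pd.read_csv("gitschemas_fk_detection_sample.csv")

print(len(fk_pairs))  # one row per (non-composite) foreign key relation
print(fk_pairs[["table_name_a", "table_name_b", "key_a", "key_b"]].head())
```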
📄 GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
@inproceedings{gitschemas2022,
author = {D{\"o}hmen, Till and Hulsebos, Madelon and Beecks, Christian and Schelter, Sebastian},
title = {GitSchemas: A Dataset for Automating Relational Data Preparation Tasks},
year = {2022},
maintitle = {ICDE 2022 – 38th IEEE International Conference on Data Engineering},
booktitle = {Workshop on Databases and Machine Learning (DBML)},
organization={IEEE}
}