The software-mentions dataset exists in three different formats.
The first, the raw extractions, is the direct output of running the software-mentions application on the paper dataset.
This is a large hierarchical file structure with several directories and files for each of the ~20 million parsed papers.
The second, the JSONL files, is a collated form of the raw output.
These files are sequences of JSON objects, each representing paper metadata or extracted software mentions in the paper.
Part 1 of these instructions details how to convert the raw output to this format.
The last, the Parquet tables, is a more user-friendly form of the data in a columnar format.
Part 2 of these instructions details how to convert the JSONL files to the Parquet tables, or to create new Parquet table definitions.
These instructions are for those who wish to extract their own tables from the software-mentions dataset, or who otherwise want to handle the dataset in its original form. If you just want the already-available Parquet tables, see the Readme. These instructions assume commands are run on a Linux system.
If you already have the .jsonl files, you can skip to part 2.
Before starting, have an empty secondary hard drive with at least 2TB of storage and at least 120 million inodes available. Most 1TB disks do not have sufficient inodes to successfully extract the software-mentions dataset. This process may take over a week in total. Performing other work on the target disk will be slow until you complete these steps.
Please read this entire paragraph before doing anything, or your system may become temporarily unusable. The raw software-mentions dataset is ~150 GB compressed and ~800 GB uncompressed.
The dataset is a hierarchical collection of about 100 million folders and JSON files.
If your disk does not have at least this many inodes available, the extraction process will fail when the filesystem runs out of inodes, even if there is still disk space available.
Running out of inodes can make many common operations slow or unstable.
To check the free inodes available on your disk, run `df -i` and check the `IFree` column. Ensure this number is comfortably over 100 million (preferably at least 120 million) for the disk you will extract the dataset to.
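For example, on a suitably provisioned 2TB disk the output might look like the following (the device name and counts here are illustrative only):

```
$ df -i /mnt/data
Filesystem        Inodes IUsed     IFree IUse% Mounted on
/dev/sdb1      122093568 11241 122082327    1% /mnt/data
```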
Even if your disk has sufficient inodes, we do not recommend running this process on the same disk as your operating system. Many filesystem operations are fast only because recently accessed parts of the filesystem are cached, and this task will disrupt many of the caching logic's assumptions. After you have extracted software-mentions, you can expect many operations on the target disk, such as listing the contents of a directory, to take over a second.
The steps below have not been rigorously tested. Please contact [email protected] if you encounter problems or wish to help improve this documentation. As-is, the process takes about a week, so making these steps more robust is not a priority.
- Use `unzip` to decompress the archive to the target disk. This operation may take 48-72 hours. If you choose to use an alternative such as 7zip, you may experience performance issues and the process may hang indefinitely. The result is a nested directory structure containing multiple files for each analyzed paper. Directories are derived from the software-mentions UUID of the analyzed paper, with the first 8 characters generating four levels of nesting.
The file `AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP.json`, representing the paper's metadata, will be located in the directory:

```
AA/BB/CC/DD/AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP/...
```

Each directory will contain at least two files, and occasionally six or more. Different filename patterns indicate different parses of the paper, each of which may contain a different set of detected software mentions.
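As a minimal illustration of this layout (the root path and helper name are hypothetical; this code is not part of the repository):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// dirForUUID returns the directory holding files for the given paper UUID:
// the first 8 characters of the UUID, in pairs, form four nested directories.
func dirForUUID(root, uuid string) string {
	return filepath.Join(root, uuid[0:2], uuid[2:4], uuid[4:6], uuid[6:8], uuid)
}

func main() {
	fmt.Println(dirForUUID("/mnt/data", "AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP"))
	// Output: /mnt/data/AA/BB/CC/DD/AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP
}
```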
- Use the `merge` command in `cmd/merge` to merge the small JSON files into JSONL files.

```
go run cmd/merge/merge.go IN OUT
```
The `IN` path should be the top-level directory containing the 256 subdirectories named for the first two characters of the UUID. The `OUT` path should ideally be an empty directory, as about 1,500 files will be written to it. Existing files with conflicting names are overwritten without warning. The merge may take 48-72 hours, and will run faster if `IN` and `OUT` point to different disks. (A conceptual sketch of what the merge does appears after these steps.)
The `merge` command may be run on a subdirectory to merge only a portion of the dataset. Before merging the entire dataset, we recommend merging a small portion of the files to ensure the process is running smoothly, for example:

```
go run cmd/merge/merge.go IN/AA OUT
```

This will merge approximately 1/256th of the data.
If you are not interested in the remaining files that the merge process does not cover (see the optional step below), you may now delete the directory extracted by `unzip` and proceed to Part 2.
- [Optional] Use `rm-processed` to recursively remove the files and directories that were merged into JSONL above.

```
go run cmd/rm-processed/rm-processed.go IN
```
There will be a small number of files and directories remaining (~2,000). These files are not handled by the logic above, and their data is not present in the resulting JSONL files. Per the author of the dataset, these files are errors and may be safely ignored and deleted.
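If you want to understand what the merge step does before committing a multi-day run to it, the following is a rough conceptual sketch, not the actual implementation in `cmd/merge`: walk the extracted tree and append each JSON file as a single compacted line to a per-prefix JSONL file. The real command additionally separates metadata from software mentions, writes gzip-compressed output, and handles the leftover error files described above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// mergeTree walks the extracted tree under in, compacts each .json file onto a
// single line, and appends it to a per-prefix .jsonl file under out.
func mergeTree(in, out string) error {
	outFiles := map[string]*os.File{} // two-character UUID prefix -> open JSONL file
	defer func() {
		for _, f := range outFiles {
			f.Close()
		}
	}()
	return filepath.WalkDir(in, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || !strings.HasSuffix(path, ".json") {
			return err
		}
		raw, readErr := os.ReadFile(path)
		if readErr != nil {
			return readErr
		}
		// Compact the (possibly multi-line) JSON document onto a single line.
		var line bytes.Buffer
		if err := json.Compact(&line, raw); err != nil {
			return fmt.Errorf("%s: %w", path, err)
		}
		prefix := filepath.Base(path)[:2]
		f, ok := outFiles[prefix]
		if !ok {
			f, err = os.Create(filepath.Join(out, prefix+".jsonl"))
			if err != nil {
				return err
			}
			outFiles[prefix] = f
		}
		line.WriteByte('\n')
		_, err = f.Write(line.Bytes())
		return err
	})
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: merge-sketch IN OUT")
		os.Exit(1)
	}
	if err := mergeTree(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```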
To begin this step, you need the dataset as .jsonl files. There should be about 1,500 of these in a flat directory. Each file contains either paper metadata or detected software mentions, grouped by the first two characters of the software-mentions UUID. Each of the 256 two-character prefixes has up to six files.
- Every prefix includes a `.papers.jsonl.gz` file, containing the paper metadata.
- Every prefix includes a `.software.jsonl.gz` file, containing the software mentions extracted from the default paper parse.
- The four other file types contain alternative parses (JATS, Pub2Tei, LaTeX, and GROBID) of each paper. Many papers do not have alternative parses, and some UUID prefix groups do not include any files for a particular parse.
The .jsonl format is a sequence of JSON objects delimited by newlines. Many JSON decoders, such as `json.Decoder` from Go's `encoding/json` package, handle this automatically and can treat the contents of these files as a stream of JSON objects.
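For example, a minimal Go sketch of streaming one of these files (the file name here is hypothetical; substitute one of your merged files):

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("AA.papers.jsonl.gz") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// json.Decoder reads one JSON object per Decode call, so newline-delimited
	// objects can be consumed as a stream without loading the whole file.
	dec := json.NewDecoder(gz)
	count := 0
	for {
		var obj map[string]any // define a concrete struct for real use
		if err := dec.Decode(&obj); err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		count++
	}
	fmt.Println("objects read:", count)
}
```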
Tables are defined in Go code. These definitions are currently a work in progress.
To extract tables, run `extract-columns` on the directory containing the JSONL files:

```
go run cmd/extract-columns/extract-columns.go [papers|software] IN_DIR OUT_DIR
```
For now there are only two tables, `papers` and `software`.
You may define new Parquet table definitions that extract information from the JSONL files, but you must register each new definition in extract-columns.go before it can be used.
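The repository's table-definition API is not shown here, so as a loose illustration of the general approach only: a table can be modeled as a Go struct with parquet tags and written with a third-party library such as github.com/parquet-go/parquet-go. All names in this sketch are hypothetical, and it is not wired into extract-columns.go.

```go
package main

import (
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

// mentionRow is a hypothetical row type; real table definitions live in this
// repository's Go code and may look different.
type mentionRow struct {
	PaperID  string `parquet:"paper_id"`
	Software string `parquet:"software"`
}

func main() {
	f, err := os.Create("mentions.parquet")
	if err != nil {
		log.Fatal(err)
	}
	// Write rows of the struct type into a Parquet file.
	w := parquet.NewGenericWriter[mentionRow](f)
	if _, err := w.Write([]mentionRow{{PaperID: "example-uuid", Software: "NumPy"}}); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
}
```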