CSV-LD: A CSV-based Serialization for Linked Data

Comma-separated-values (CSV) is a useful data serialization and sharing format. This document introduces the idea of CSV-LD as a CSV-based format to serialize Linked Data, mirroring the way that JSON-LD is a JSON-based format to serialize Linked Data. "CSV" here includes any dialect that uses a different delimiter, such as tab-separated-values (TSV).

The syntax of CSV-LD is designed to easily integrate into deployed systems that already use CSV, and provides a smooth upgrade path from CSV to CSV-LD. It is primarily intended to be a way to use Linked Data as part of spreadsheet-based data entry; to facilitate data validation, display, and conversion of CSV into other formats via use of CSV on the Web (CSVW) metadata; and to build FAIR data services.

The term "CSV-LD" was previously used to describe a now-obsolete precursor to the CSVW specifications; both of those approaches require a second file, a JSON-LD template document, to be shared along with a CSV file. The approach described herein, in contrast, requires only a CSV file from the data producer, one that includes links to CSVW-powered metadata.

Data Producers

Data producers need to add two header rows to their CSV to make it CSV-LD: the "key-spec" row, and the "column-spec" row. This mirrors the way that JSON can minimally be made JSON-LD by adding a "@context" key-value pair. CSV producers commonly include a header row to label columns with names, and it is not uncommon to see CSV files with additional header rows (headerRowCount is part of the CSVW vocabulary).

The Key-Spec Row

The most important job of the top-most row, the key-spec row, is not at all about specifying keys. Rather, it is to communicate to a data consumer that this file is special in some way -- the left-most cell must have a URL in it, and hopefully, someone who has never seen CSV-LD before will click on the link.

The link is http://example.org/csv-ld/2021/01/inKey (not really, but I'll change this soon -- I registered csv-ld.org and currently point it to the CSVW vocabulary page), and it will provide a friendly introduction to CSV-LD. The file could indicate that it uses a different version of CSV-LD by using a different prefix before inKey, which would link to that version's page.

The page will also explain that inKey marks a CSV column as being part of how to identify a row uniquely. A CSV table could have a single key, such as an ID column, or a compound key, such as year and semester (e.g. "Fall", "Spring") uniquely identifying each row as an academic term. All key columns must be contiguous and start on the left side -- this (1) makes the links easy to spot for someone unfamiliar with CSV-LD, and (2) makes it easier for a data consumer/steward to implement a CSV-LD processor.

Cell values in the key-spec row after (to the right of) the key columns (the columns containing the inKey link) have only one requirement: they cannot have the same inKey-link value as the key-column cells. They can be blank, comments, whatever.

So, to summarize, the key-spec row (1) communicates that the file is a CSV-LD file, and (2) communicates the (possibly compound) key that uniquely identifies a row.
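The rules above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function name `key_column_count` is hypothetical, the inKey URL is the placeholder used in this README, and the row is assumed to carry the "# " prefix described under Data Consumers.

```python
import csv
import io

# Placeholder URL from this README; the real inKey URL is not yet settled.
IN_KEY = "http://example.org/csv-ld/2021/01/inKey"


def key_column_count(key_spec_line: str) -> int:
    """Count the contiguous, left-most cells that hold the inKey link."""
    # Strip the "# " comment prefix that marks the extra header rows.
    if key_spec_line.startswith("# "):
        key_spec_line = key_spec_line[2:]
    cells = next(csv.reader(io.StringIO(key_spec_line)), [])
    count = 0
    for cell in cells:
        if cell.strip() != IN_KEY:
            break
        count += 1
    if count == 0:
        raise ValueError("left-most cell must contain the inKey link")
    # Cells to the right of the key columns must not repeat the inKey link.
    if IN_KEY in (c.strip() for c in cells[count:]):
        raise ValueError("key columns must be contiguous and left-most")
    return count


line = f"# {IN_KEY},{IN_KEY},,a comment"
print(key_column_count(line))  # 2
```

Here the first two cells form a compound key, the third is blank, and the fourth is a free-form comment.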

The Column-Spec Row

The job of the column-spec row is to be an unambiguous labeling of each column. It is fine for there to be a header row below the column-spec row that exhibits the common header-row practice of using short names ("x", "y", etc.) as column labels.

For a data producer, this task should be a simple matter of using a template or reference guide authored by a data steward that provides URLs for each column of interest. For example, one might be given the following table of terms to record environmental metadata for collected biosamples:

| Term URL | Comment |
| --- | --- |
| http://example.org/nmdc/id | Sample ID |
| http://example.org/nmdc/lat_lon | Latitude and longitude |
| http://example.org/nmdc/ecosystem | Type of ecosystem |
| ... | ... |

These aren't real URLs (but I'll eventually update this example to be real). It's also possible that a data steward may provide namespaces for data producers, e.g. http://example.org/nmdc/team42/, and producers can use the namespaces to prefix invented terms that will later resolve to working URLs through work done by the data steward.

The term URLs should resolve to pages that explain how values should be formatted. For example, http://example.org/nmdc/lat_lon could explain that the value should be latitude in degrees, a space, and longitude in degrees. This explanation could be automatically generated from CSVW metadata (authored by the data steward) that will also be used by a CSV-LD processor to validate the data. For example, the CSVW metadata for this field could look like:

{
  "@context": {"@vocab": "http://www.w3.org/ns/csvw#"},
  "name": "lat_lon",
  "separator": " ",
  "ordered": true,
  "datatype": {
    "base": "number",
    "minimum": "-180",
    "maximum": "180"
  }
}
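As a rough sketch of how a processor might apply this metadata to one cell, the Python below splits the value on the separator and range-checks each number. The function name `validate_cell` is hypothetical, and the dictionary simply repeats the separator and datatype fields from the example; a real processor would follow the CSVW specifications more fully.

```python
# Separator and datatype as in the metadata example, already parsed from JSON.
column = {
    "separator": " ",
    "ordered": True,
    "datatype": {"base": "number", "minimum": "-180", "maximum": "180"},
}


def validate_cell(value: str, column: dict) -> list:
    """Split a cell on the column's separator and range-check each number."""
    datatype = column["datatype"]
    low, high = float(datatype["minimum"]), float(datatype["maximum"])
    numbers = []
    for part in value.split(column["separator"]):
        number = float(part)  # raises ValueError if the part is not numeric
        if not low <= number <= high:
            raise ValueError(f"{number} is outside [{low}, {high}]")
        numbers.append(number)
    return numbers


print(validate_cell("37.87 -122.27", column))  # [37.87, -122.27]
```

Note that this checks only what the metadata expresses: it does not require exactly two values, and it applies the same -180 to 180 range to latitude and longitude alike.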

Data Consumers

A CSV-LD file is still just a CSV file, so a data consumer can simply ignore the "extra" header rows. Each of the "extra" header rows is prefixed by a "#" and a space, so that popular parsers can recognize these rows as "comment" lines and skip to the "real" header row, e.g. pandas.read_csv(...,comment="#") for the popular Python pandas data-processing library. A data consumer could also click any link in the column-spec header to learn more about how to interpret the data in that column. If they have access to a CSV-LD processor, they can use it to validate the data and/or convert it to another format such as JSON (specifically, JSON-LD).
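For consumers without pandas, a minimal sketch using only the Python standard library follows. The sample file contents are invented for illustration, and `read_plain_rows` is a hypothetical helper name.

```python
import csv
import io

# Invented sample file: two "# "-prefixed CSV-LD rows, a plain header row, data.
SAMPLE = """\
# http://example.org/csv-ld/2021/01/inKey,,
# http://example.org/nmdc/id,http://example.org/nmdc/lat_lon,http://example.org/nmdc/ecosystem
id,lat_lon,ecosystem
s1,37.87 -122.27,soil
s2,-33.86 151.21,marine
"""


def read_plain_rows(text: str) -> list:
    """Read the file as ordinary CSV, ignoring the extra CSV-LD header rows."""
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    return list(csv.DictReader(lines))


rows = read_plain_rows(SAMPLE)
print(rows[0]["lat_lon"])  # 37.87 -122.27
```

Unlike the pandas `comment="#"` option, which also discards the remainder of any line after a "#", this filter drops only lines that begin with "#", so a "#" inside a data cell is kept.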

Data Stewards

Data stewards are concerned with managing data integrity. They can author JSON-LD metadata for csvw:Column entities, as shown in the above example for http://example.org/nmdc/lat_lon, and make that metadata downloadable from the URL used for the column.

A CSV-LD processor will request http://example.org/nmdc/lat_lon using an HTTP Accept header that expresses a preference for a JSON-LD response, whereas a human loading http://example.org/nmdc/lat_lon in their browser will get a web page (HTML) response that the data steward has produced (perhaps auto-generated from the metadata JSON-LD).
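The exchange described here is ordinary HTTP content negotiation. The sketch below shows both sides in simplified form: `metadata_request` and `pick_representation` are hypothetical names, the request is built but not sent, and the server-side check ignores q-values, which a real server should honor. `application/ld+json` is the registered media type for JSON-LD.

```python
import urllib.request

JSON_LD = "application/ld+json"


def metadata_request(term_url: str) -> urllib.request.Request:
    """Build a GET request that prefers JSON-LD, with plain JSON as a fallback."""
    accept = f"{JSON_LD}, application/json;q=0.9"
    return urllib.request.Request(term_url, headers={"Accept": accept})


def pick_representation(accept_header: str) -> str:
    """Server side, simplified: serve JSON-LD only when it is asked for."""
    return "json-ld" if JSON_LD in accept_header else "html"


request = metadata_request("http://example.org/nmdc/lat_lon")
print(pick_representation(request.get_header("Accept")))  # json-ld
print(pick_representation("text/html,application/xhtml+xml"))  # html
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch once the term URLs resolve.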

Thus, a data steward needs to know how to serve web content, or needs to collaborate with someone who can. I hope to provide in this repository a reference server implementation, written in Python.

Implementers of CSV-LD Processors

The final stakeholder in the CSV-LD world is the implementer of a CSV-LD processor. A more detailed specification is to come, but I will try to adhere to a "worse is better" approach that prioritizes simplicity of implementation. Furthermore, this repository will host a reference CSV-LD processor implementation, written in Python. Perhaps the reference implementation will be good enough for most.

One thing to note here is that the job of the data producer is to aggregate a set of records (rows) with well-defined fields (columns). What a record is or should be, i.e. its type or class, and thus e.g. which columns are required, is left open to data consumers depending on the application.

Validating a record as a whole is important, and the CSVW metadata spec can help with this task. I expect to elaborate later on how exactly a CSV-LD processor may be invoked e.g. as

csvld --entity "http://example.org/nmdc/BioSample" --out data.json data.csv

to validate each row of the CSV-LD, in addition to independently validating each column value.
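Since the processor is not yet specified, the following is only a guess at how such a command line could be parsed in Python with `argparse`. The flag names come from the invocation above; the help texts, the `csv_path` argument name, and the `build_parser` function are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the hypothetical `csvld` command shown above."""
    parser = argparse.ArgumentParser(prog="csvld")
    parser.add_argument("--entity", help="URL of the record type to validate each row against")
    parser.add_argument("--out", help="path for the converted output, e.g. JSON-LD")
    parser.add_argument("csv_path", help="the CSV-LD file to process")
    return parser


args = build_parser().parse_args(
    ["--entity", "http://example.org/nmdc/BioSample", "--out", "data.json", "data.csv"]
)
print(args.entity, args.out, args.csv_path)
```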