Skip to content

2. Enriching the Data

Sytze Van Herck edited this page Jan 17, 2023 · 6 revisions

In his Ted talk Sir Tim Berners-Lee underlines the importance of connecting data. Adapting the Metadata focussed on changing the metadata.json. In Enriching the Data you learn to expand the metadata.

Contents

Virtual Columns

Adapting the Metadata added information regarding the rows of the CSV file. The example described the relationship between the neighbourhoods (rows) and the average number of maids (columns). So far the neighbourhoods or rows held the subject (?s) position. Virtual columns instead refer to columns in the subject position.

A virtual column resembles existing columns in the JSON schema file. In the metadata.json file search for the list of "columns", then start a new set of parenthesis to create a virtual column.

The following virtual column relates the average number of maids ("Dienstboden") column to an occupational category in the PST scheme, namely that of 'housemaid' with code '5,25,1,3'.

      {
        "virtual": true,
        "datatype": "string",
        "aboutUrl": "https://iisg.amsterdam/buurt.csv/column/Dienstboden",
        "propertyUrl": "https://iisg.amsterdam/vocab/PST/recon",
        "csvw:value": "5,25,1,3"
      },

The results of the JSON schema are triples like:

<https://iisg.amsterdam/buurt.csv/column/Dienstboden>  <https://iisg.amsterdam/vocab/PST/recon> "5,25,1,3"^^<http://www.w3.org/2001/XMLSchema#string>
?s                                                    ?p                                       ?o

Another notation for triples is ?s ?p ?o. For clarity, we've added this notation to the example. The notation will not show in your example files.

For advanced features of virtual columns, use Jinja. All values within a column can become virtual columns. Change the "aboutUrl": of a virtual column according to the {_row} example here.

Exercise

Try creating a virtual column yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csvfile. Next, follow these steps:

  1. Upload the example csv file:
cow_tool build buurt.csv

The tool generates a -metadata.json file in the folder where you saved the example file.

  1. Open and edit the metadata file. In the list of "columns":[ add the following virtual column:
      },
      {
        "virtual": true,
        "datatype": "string",
        "aboutUrl": "https://iisg.amsterdam/buurt.csv/column/Dienstboden",
        "propertyUrl": "https://iisg.amsterdam/vocab/PST/recon",
        "csvw:value": "5,25,1,3"
      }
  ]
  1. Create the Linked Data file with the following command:
cow_tool convert buurt.csv

The end result is an .nq file. In this file you should find a triple with subject <https://iisg.amsterdam/buurt.csv/column/Dienstboden>.

^

Adding Provenance

Provenance information is of key importance for historians and other researchers by extension. Solid source criticism requires information about

Publisher

By default CoW only includes the publisher of the dataset in the JSON schema.

 "dc:publisher": {
  "schema:name": "CLARIAH Structured Data Hub - Datalegend",
  "schema:url": {
   "@id": "http://datalegend.net"
  }
 },

Any information about the publisher is found under "dc:publisher". The commonly used vocabulary to denote the publisher in Linked Data is DCMI Metadata terms of the Dublin Core Metadata Initiative. The resulting triples contain the following predicates ?p and objects ?o:

<http://schema.org/name> "CLARIAH Structured Data Hub - Datalegend"
<http://schema.org/url> <http://datalegend.net/>
?p                       ?o

Creator

While the publisher makes the dataset available, the creator is the person, organization, or service who created the dataset.

"dc:creator": [{
    "@id": "sdp:[email protected]",
    "rdf:type": [{
        "@id": "foaf:Person"
    }, {
        "@id": "prov:Person"
    }],
    "owl:sameAs": [{
        "@id": "http://orcid.org/xxxx-xxxx-xxxx-xxxx"
    }, {
        "@id": "info:eu-repo/dai/cc/1234"
    }, {
        "@id": "http://isni.org/isni/00000001234"
    }],
    "foaf:name": "Some Chap",
    "foaf:givenName": "Some",
    "foaf:familyName": "Chap",
    "foaf:mbox": {
        "@id": "mailto:[email protected]"
    }
}],

A single person can be identified in many ways. The creator is first found in the object (?o) position. The "@id" of the creator is an e-mail address that will usually show as ?s or subject. Two "rdf:type" definitions determine that the creator is a "Person" according to the foaf and prov vocabularies. The "owl:sameAs" predicate ?p indicates that the creator is also identified by another "@id". These "@id"'s are usually links to profile pages on other websites. Finally, the "foaf:" vocabulary is used to state the full name, first name as givenName, and familyName, together with their e-mail address as mbox.

The results of the JSON schema are the following triples:

<https://example.com/id/buurt.csv> <http://purl.org/dc/terms/creator> <sdp:[email protected]>
?s                                 ?p                                 ?o

<sdp:[email protected]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person>
<sdp:[email protected]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Person> 
?s                          ?p                                                ?o

<sdp:[email protected]> <http://www.w3.org/2002/07/owl#sameAs> <http://isni.org/isni/00000001234>
<sdp:[email protected]> <http://www.w3.org/2002/07/owl#sameAs> <http://orcid.org/xxxx-xxxx-xxxx-xxxx>
<sdp:[email protected]> <http://www.w3.org/2002/07/owl#sameAs> <info:eu-repo/dai/cc/1234>
?s                          ?p                                     ?o

<sdp:[email protected]> <http://xmlns.com/foaf/0.1/name> "Some Chap"
<sdp:[email protected]> <http://xmlns.com/foaf/0.1/familyName> "Chap"
<sdp:[email protected]> <http://xmlns.com/foaf/0.1/givenName> "Some"
<sdp:[email protected]> <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]>
?s                          ?p                               ?o

Note: Another notation for triples is ?s ?p ?o. For clarity, we've added this notation to the example. The notation will not show in your example files. In addition, the order of the resulting triples in this example differs from the order in the .nq-file.

References

Aside from the publisher and creator, datasets are often derived from other sources. The dataset can be the result of an update, or it can be a new dataset based on a book or other source.

"prov:wasDerivedFrom": [{
    "@id": "urn:isbn:9789058671967",
    "rdf:type": {
        "@id": "bibo:Book"
    },
    "dc:title": {
        "@value": "Historical International Standard Classification of Occupations",
        "@lang": "en"
    },
    "schema:author": ["Marco van Leeuwen", "Ineke Maas", "Andrew Miles"],
    "dc:publisher": "Leuven University Press",
    "dc:date": {"@value":"2002","@type":"xsd:Year"}
}, {
    "@id": "http://historyofwork.iisg.nl"
}],

In the above example the dataset is derived from a website, and based on a book. The "@id" is set to a "urn:isbn", a common uniform resource name for books with an international standard book number. The "rdf:type" uses the Bibliographic Ontology bibo to set the data type to book. The "dc:title" predicate ?p lists two objects ?o: the "@value" and the "@lang". The value of the title refers to the book title, and the language of the title is set to "en" or English. The "schema:author" lists all authors, and "dc:publisher" provides the name of the book publisher. The "dc:date" specifies the "@value" and the "@type". This means that the object ?o of the date is a value "2002" with a datatype "xsd:Year" attached:

<https://example.com/id/buurt.csv> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://historyofwork.iisg.nl/>
<https://example.com/id/buurt.csv> <http://www.w3.org/ns/prov#wasDerivedFrom> <urn:isbn:9789058671967>
?s                                 ?p                                         ?o

<urn:isbn:9789058671967> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/bibo/Book>
?s                       ?p                                                ?o

<urn:isbn:9789058671967> <http://purl.org/dc/terms/title> "Historical International Standard Classification of Occupations"
?s                       ?p                               ?o

<urn:isbn:9789058671967> <http://schema.org/author> "Ineke Maas"
<urn:isbn:9789058671967> <http://schema.org/author> "Marco van Leeuwen"
<urn:isbn:9789058671967> <http://schema.org/author> "Andrew Miles"
?s                       ?p                         ?o

<urn:isbn:9789058671967> <http://purl.org/dc/terms/publisher> "Leuven University Press"
?s                       ?p                                   ?o

<urn:isbn:9789058671967> <http://purl.org/dc/terms/date> "2002"^^<http://www.w3.org/2001/XMLSchema#Year>
?s                       ?p                              ?o

Note: Another notation for triples is ?s ?p ?o. For clarity, we've added this notation to the example. The notation will not show in your example files. In addition, the order of the resulting triples in this example differs from the order in the .nq-file.

Exercise

Try adding provenance yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the buurt.csvfile. Next, follow these steps:

  1. Upload the example csv file:
cow_tool build buurt.csv

The tool generates a -metadata.json file in the folder where you saved the example file.

  1. Open and edit the metadata file. Adapt the following code by adding yourself as creator and a colleague as co-creator. Keep the same reference.
 "dc:publisher": {
  "schema:name": "CLARIAH Structured Data Hub - Datalegend",
  "schema:url": {
   "@id": "http://datalegend.net"
  }
 },
 "dc:creator": [{
    "@id": "sdp:[email protected]",
    "rdf:type": [{
        "@id": "foaf:Person"
    }, {
        "@id": "prov:Person"
    }],
    "owl:sameAs": [{
        "@id": "http://orcid.org/xxxx-xxxx-xxxx-xxxx"
    }, {
        "@id": "info:eu-repo/dai/cc/1234"
    }, {
        "@id": "http://isni.org/isni/00000001234"
    }],
    "foaf:name": "Some Chap",
    "foaf:givenName": "Some",
    "foaf:familyName": "Chap",
    "foaf:mbox": {
        "@id": "mailto:[email protected]"
    }
}, {
    "@id": "sdp:[email protected]",
    "rdf:type": [{
        "@id": "foaf:Person"
    }, {
        "@id": "prov:Person"
    }],
    "owl:sameAs": {
        "@id": "http://orcid.org/xyxy-xyxy-xyxy-xyxy"
    },
    "foaf:name": "Some Gal",
    "foaf:familyName": "Gal",
    "foaf:givenName": "Some",
    "foaf:mbox": {
        "@id": "mailto:[email protected]"
    }
}],
"prov:wasDerivedFrom": [{
    "@id": "urn:isbn:9789058671967",
    "rdf:type": {
        "@id": "bibo:Book"
    },
    "dc:title": {
        "@value": "Historical International Standard Classification of Occupations",
        "@lang": "en"
    },
    "schema:author": ["Marco van Leeuwen", "Ineke Maas", "Andrew Miles"],
    "dc:publisher": "Leuven University Press",
    "dc:date": {"@value":"2002","@type":"xsd:Year"}
}, {
    "@id": "http://historyofwork.iisg.nl"
}],
  1. Create the Linked Data file with the following command:
cow_tool convert buurt.csv

The end result is an .nq file. Try to locate each of the triples determined by the JSON schema using the search function for the .nq file.

^

Dates

Dates are often recorded by historians and researchers. The datatype xsd:date formats complete dates (YYYY-MM-DD).

The example below adds a virtual column with death dates based on three different columns (Year_Death, Month_Death, Day_Death).

{
    "virtual": "true",
    "datatype": "xsd:date",
    "propertyUrl": "schema:deathDate",
    "dc:description": "Date of death",
    "csvw:value": "{{['%04d'| format(Year_Death|int),'-','%02d' | format(Month_Death|int),'-','%02d'|format(Day_Death|int)]|join}}"
      }

The object csvw:value builds on Jinja. The code joins several elements between [] into a single date string (YYYY-MM-DD). One element such as '%04d'| format(Year_Death|int) specifies the number of digits ('%04d') to be included from the Year_Death column. The numbers in the CSV column are formatted as integers (int). Each element of the date is separated by '-'.

^

Geographical Information

Buildings, places, regions, countries, and other spaces are recorded in two ways. The geographical information to mark the position of a building is described as a point, with longitude and latitude. Whereas a polygon or a series of points describes the shape of a country.

Geographical information in Linked Data can be plotted on a map. Plotting Linked Data is done using the SPARQL Protocol And RDF Query Language (SPARQL). Only certain SPARQL editors such as Yasgui can handle geographical information.

The following data example describes the location of two cinemas.

cinema;latitude;longitude
Flora;52.366097;4.899298
Paleis voor Volksvlijt;52.360018;4.89978

The simplest notation for a location is as a single triple, where the subject (?s) has a geometry (?p) of a point or shape (?o).

<https://iisg.amsterdam/atm/cinemas/0> <http://www.opengis.net/ont/geosparql#asWKT> "POINT(4.893937 52.353038)"^^<http://www.opengis.net/ont/geosparql#wktLiteral> 
?s                                     ?p                                           ?o

Another notation for triples is ?s ?p ?o. For clarity, we've added this notation to the example. The notation will not show in your example files.

Cinema 'Flora' in the first row (row 0) has a geographical representation of data type 'wktLiteral' (a point) with value (4.893937 52.353038).

Objects are rarely captured with a single geographical representation. The point for cinema Flora could be based on the location of the entrance, or the building's shape centroid. Even with polygons based on blue prints, the location could be represented differently over time when the building expanded. When paper maps are digitized, there are often slight errors in georeferencing and shape file creation. The result are multiple improved versions of the same blueprint.

Instead of linking multiple shapes directly to a single object, create a new object linked to both the building and the geographical representation (point or polygon). This approach allows you to filter only the geographical representation that is useful to you.

<https://iisg.amsterdam/atm/cinemas/0> <geo:hasGeometry> <https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>
<https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000><http://www.opengis.net/ont/geosparql#asWKT> "POINT(4.893937 52.353038)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>

First create a new id '83000' and then attach the point-representation of cinema Flora to it. The 'geo:hasGeometry' property allows you to query all the geometry properties for this cinema. The provenance for the point representation can also be attached to id '83000'. Remember to create unique id's, otherwise geographic representations of other cinema's or even countries might be linked incorrectly.

Exercise

Try adding geographical information yourself. Copy this example and save it as a csv file.

cinema;latitude;longitude
Flora;52.366097;4.899298
Paleis voor Volksvlijt;52.360018;4.89978

Copy the file path, and switch to your terminal. Move to the folder where you saved the cinema.csv file. Next, follow these steps:

  1. Upload the example csv file:
cow_tool build cinema.csv

The tool generates a -metadata.json file in the folder where you saved the example file.

  1. Open and edit the metadata file. By default, the row number serves as the identifier of the cinemas.
"tableSchema": {
  "aboutUrl": "atm/cinemas/{_row}",

First, create virtual columns to represent latitude and longitude.

    {
    "virtual": true,
    "propertyUrl": "schema:latitude",
    "csvw:value": "{latitude}"
    },
    {
    "virtual": true,
    "propertyUrl": "schema:longitude",
    "csvw:value": "{longitude}"
    },

Second, create the geographic representation in Linked Data as follows:

    {
    "virtual": true,
    "aboutUrl": "atm/cinemas/{_row}",
    "propertyUrl": "geo:hasGeometry",
    "valueUrl": "atm/cinemas/.well-known/genid/{{(83000 + _row)}}"
    },
    {
     "virtual": true,
     "datatype": "geosparql:wktLiteral",
     "aboutUrl": "atm/cinemas/.well-known/genid/{{(83000 + _row)}}",
     "propertyUrl": "geosparql:asWKT",
     "csvw:value": "{{'POINT(' + longitude + ' ' + latitude + ')'}}"
    },

Note: To ensure unique numbers, the arbitrary number '83000' is added. The number is derived from the number of rows in the original CSV file * 1000. It cannot be a random number, because the number needs to match the second virtual column. With large numbers the line number from the original file is still visible. A cinema on line 90, would then have genid 83090.

  1. Create the Linked Data file with the following command:
cow_tool convert cinema.csv

The end result is an .nq file.

  1. Query the data in a SPARQL editor such as Yasgui. To retrieve all available points and polygons for ?s try:
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
 
?s geo:hasGeometry/asWKT ?point

While a query for a specific geometry could be:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
 
?s geo:hasGeometry <https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>
<https://iisg.amsterdam/atm/cinemas/.well-known/genid/83000>asWKT ?point

^

If Statement

To modify certain values of a CSV file during the conversion, use a conditional statement. When the if condition is true, the data will change according to the first instruction. When the condition is false, an else statement specifies a different change.

The example below converts the values of the "male" column to URIs instead of 0 (female) or 1 (male).

   {
    "name": "male",
    "datatype": "string",
    "@id": "https://iisg.amsterdam/cow_person_example.csv/column/male",
    "dc:description": "The state of being male or female",
    "titles": ["male"],
    "propertyUrl": "sdmx-code:sex",
    "valueUrl": "sdmx-code:{% if male == '0' %}sex-F{% else %}sex-M{% endif %}"
   },

The conditional statement for the object or valueUrl is written in Jinja. First, if the male column contains "0", then the "sdmx-code" URI ends in "sex-F". Otherwise, "sex-M" is added to the "sdmx-code" URI.

The result of the JSON schema are triples containing the following predicate ?p and objects ?o:

<http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-F>
<http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-M>
?p                                               ?o

A conditional statement in Jinja is placed between {% %} because the conditions themselves should not be printed, only the outcome. The if statement checks whether the value of a variable (the cell of the male column) equals == the string '0'. Only then will the text following the statement be added. When the if statement is false, the text after the {% else %} statement is used. A conditional statement ends with {% endif %}.

Exercise

Try adding an if statement yourself. Download the example csv file. Copy the file path, and switch to your terminal. Move to the folder where you saved the cow_person_example.csv file. Next, follow these steps:

  1. Upload the example csv file:
cow_tool build cow_person_example.csv

The tool generates a -metadata.json file in the folder where you saved the example file.

  1. Open and edit the metadata file. Remember to change the base URI.
"@base": "https://iisg.amsterdam/resource/cow_person_example/"

Instead of the row number, the subject should contain the personID.

"aboutUrl": "sex/{personID}",

Finally, change the "male" column to the sdmx-code:sex URIs using an if statement.

   {
    "name": "male",
    "datatype": "string",
    "@id": "https://iisg.amsterdam/cow_person_example.csv/column/male",
    "dc:description": "The state of being male or female",
    "titles": ["male"],
    "propertyUrl": "sdmx-code:sex",
    "valueUrl": "sdmx-code:{% if male == '0' %}sex-F{% else %}sex-M{% endif %}"
   },
  1. Create the Linked Data file with the following command:
cow_tool convert cow_person_example.csv

The end result is an .nq file. In this file you should find the following triple for the person with personID 12:

<https://iisg.amsterdam/resource/cow_person_example/cowgirls/012> <http://purl.org/linked-data/sdmx/2009/code#sex> <http://purl.org/linked-data/sdmx/2009/code#sex-F>

^

Next: 3. Additional Features