ReshapeRDF - Guide

A toolset to work with N-Triples

Author: Felix Bensmann
Date: 07. Dec. 2015
Last change: 16. Jan. 2016
Please note: This document is intended to provide help to get started with ReshapeRDF, nothing more.
Its content is subject to change.

TOC

  • Introduction
  • Sorted N-Triples
  • Terms
  • Setup
  • Commands
    • Commands for everyday use
      • block
      • checksorting
      • extractresources
      • filter
      • getenrichment
      • help
      • merge
      • mergedir
      • ntriplify
      • pick
      • removeduplicates
      • renameproperty
      • restorebn
      • securelooseends
      • sort
      • split
      • version
    • Special commands
      • analyzetype
      • correct
      • extractduplicatelinks
      • extractreferenced
      • outline
      • pigeonhole
      • pumpup
      • subtract
  • Getting Started

Introduction

Processing RDF mass data can be an error-prone job. Common triplestores offer certain functionality for querying and manipulating RDF data, but only few can handle mass data (say, more than 200 million statements) at the same time. Typical operations like data import and SPARQL queries tend to be time-consuming and inconvenient to use in comprehensive reshaping operations.

So, when working with simply structured graph data, a solution can be to refrain from using a triplestore and to work with dump files instead. Recurring tasks are extracting entities of a certain class from a large dataset, subdividing a dataset into blocks according to a certain property (blocking), filtering the data, removing resources and single statements, renaming properties, and similar reshaping operations.

Unfortunately, organizing one's RDF mass data in the desired manner cannot be done easily with available out-of-the-box tools.

The tool at hand was developed to enable users of large RDF datasets to efficiently organize and reshape their data without the need for a triplestore.

Sorted N-Triples

When there is an RDF dump file to process, users cannot take for granted that stored resources are held together. This is especially true for the N-Triples file format, but it also applies to the RDF/XML file format, which even provides a syntactic way to cluster statements. At the same time, resources within such files cannot be found efficiently: the whole file has to be read and the stream examined from start to end to find all occurrences. Complex searches cannot be handled at all.

To overcome these limitations, this tool applies an intermediate file format to be used by a given set of operations to organize data in a more flexible way. This format is "Sorted N-Triples" (SNT): as the name indicates, alphabetically sorted N-Triples.

The following example depicts how SNTs can be used for an interlinking and enrichment process.

  1. Convert a non-SNT file to N-Triples
  2. Sort it
  3. Extract relevant resources (one iteration)
  4. Split the extracted resources into smaller datasets (one iteration)
  5. Interlink - however
  6. If necessary convert the links to SNT
  7. Merge the links into the data (one iteration)

The flexible nature of this tool is especially helpful with heterogeneous datasets.
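Under the assumption that standard GNU coreutils are available, steps 2, 3, and 7 of the workflow above can be sketched on a toy dataset to illustrate what sorted data buys you. All file names and URIs below are made up for illustration; in practice the ReshapeRDF commands described later do this work resource-wise.

```shell
# Toy dataset (step 1's output is assumed to exist already).
cat > mydata.nt <<'EOF'
<http://example.org/b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
EOF

# Step 2: sort by codepoint, yielding Sorted N-Triples (SNT).
LC_ALL=C sort mydata.nt > mydata_sorted.nt

# Step 3: extract all statements of foaf:Person resources (a crude
# grep stand-in for the extractresources command).
grep '<http://xmlns.com/foaf/0.1/Person>' mydata_sorted.nt \
  | cut -d' ' -f1 > person_subjects.txt
grep -F -f person_subjects.txt mydata_sorted.nt > persons.nt

# Step 7: merge already-sorted files back together in one pass.
LC_ALL=C sort -m mydata_sorted.nt persons.nt > merged.nt
```

Because both inputs of the last step are sorted, the merge is a single sequential pass; this is the key property that SNT gives every operation in this guide.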

Setup

Copy the JAR archive reshaperdf-1.0-SNAPSHOT.jar and the lib folder to a directory of your choice. The software requires at least JRE 1.7.

It is helpful to provide a script "reshaperdf" in /bin that facilitates the calls to the program.

#!/bin/bash
# Author:  John Smith
# Purpose: Facilitates calls to ReshapeRDF.

java -jar reshaperdf-1.0-SNAPSHOT.jar $@

Terms

  • triple and statement: In this application, a triple and a statement as known from the RDF context are the same thing. They always fit on one line.
  • line based: An operation is called line based if it treats each triple as a plain text line.
  • statement based: An operation is called statement based if it treats triples as parsed triples/statements.
  • resource based: An operation is called resource based if it sees the data as a list of individual resources.

Commands

This chapter outlines the operations and their usage. A command can be called using the following syntax:

java -jar reshaperdf-1.0-SNAPSHOT.jar <command> [<command parameter> ...]

The chapter is subdivided into a section about commands intended for everyday use and a section about special commands that have no purpose in everyday use but come in handy in exotic use cases. The special commands are available in their own branch.

At no point will any of the commands overwrite an input file; rather, they produce a new file with the desired changes. However, existing files will be overwritten by output files without notification.

Comments are usually not processed by the commands. Most commands require the long forms of URIs.

Commands for everyday use

block

Name block
Usage block <input file> <output dir> <predicate> <char offset> <char length>
Type Resource based
Description Assigns the resources of the input file to blocks according to a given character sequence of a given property's value. One block is one file. Files that exceed a statement count of 100 000 are further split into files of 100 000 statements each.
Argument: input file The input file, requires SNT.
Argument: output dir The directory to store the output in.
Argument: predicate The property to block by. Requires long namespace version.
Argument: char offset The offset of the character sequence in the property's value. Use 0 for no offset. If the offset is higher than the value's length, then the whole property value will be evaluated.
Argument: char length The length of the character sequence in the property's value. If the length is higher than the value's length, then the whole property value will be evaluated.
Output A set of SNT files in the given output directory.
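The blocking idea can be sketched with a short awk script that bins statements of one property by the first character of its literal value. This is only a line-based approximation of the resource-based block command; the predicate, offset 0, and length 1 used here mirror a hypothetical call `block names.nt blocks http://xmlns.com/foaf/0.1/name 0 1`.

```shell
# Sample SNT input; subjects and values are illustrative.
cat > names.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/c> <http://xmlns.com/foaf/0.1/name> "Carol" .
EOF

mkdir -p blocks
# Split on the double quotes so $2 is the literal value, then bin
# each statement by the uppercased first character of that value.
awk -F'"' '/foaf\/0\.1\/name/ {
    key = toupper(substr($2, 1, 1))
    print $0 >> ("blocks/" key ".nt")
}' names.nt
```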

checksorting

Name checksorting
Usage checksorting <input file>
Type Statement based
Description Checks the input file for proper sorting. This sorting differs from line sorting in that it ignores control characters.
Argument: input file The input file, requires N-Triples.
Output Prints "Sorted" to stdout if sorted correctly, "Not sorted" otherwise.
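A rough stand-in for this check, assuming GNU coreutils, is `sort -c`, which exits with status 0 if its input is in codepoint order. Note that checksorting additionally ignores control characters, which `sort -c` does not, so this is only an approximation; the file name is illustrative.

```shell
printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> "1" .' \
  '<http://example.org/b> <http://example.org/p> "2" .' > sorted.nt

# Exit status 0 means codepoint order, mimicking the command's
# "Sorted"/"Not sorted" output.
if LC_ALL=C sort -c sorted.nt 2>/dev/null; then
    echo "Sorted"
else
    echo "Not sorted"
fi
```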

extractresources

Name extractresources
Usage extractresources <input file> <output file> <predicate> <object> <offset> <length>
Type Resource based
Description Extracts resources with a given predicate-object combination.
Argument: input file The input file, requires N-Triples.
Argument: output file Name of the output file, the file with the extracted resources.
Argument: predicate The predicate to look for, namespace has to be in long form. Use a "?" to indicate a wildcard.
Argument: object The object to look for. Can be a literal or a URL. Use a "?" to indicate a wildcard.
Argument: offset Number of the matching resource to start from.
Argument: length Number of resources to extract. -1 indicates to use all available resources.
Output An SNT file with the extracted resources.
See also [pick](#cmd:pick).

filter

Name filter
Usage filter <whitelist|blacklist> <input file> <filter file> <output file>
Type Resource based
Description Removes statements from an N-Triples file according to a whitelist or blacklist.
Argument: whitelist|blacklist Either "whitelist" or "blacklist" to indicate what kind of filter is to be used.
Argument: input file File to filter
Argument: filter file A simple line-based text file containing the properties that are subject to the filter.
Argument: output file Name of the file to store the output in.
Output An SNT file with the remaining resources.
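The whitelist/blacklist behavior can be approximated with `grep -f` on a toy file; the data and the property list below are made up for illustration, roughly mirroring `filter whitelist data.nt allow.txt filtered.nt`.

```shell
cat > data.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://example.org/internal/debug> "x" .
EOF

# Filter file: one property URI per line.
echo 'http://xmlns.com/foaf/0.1/name' > allow.txt

# Whitelist: keep only statements whose predicate is listed.
grep -F -f allow.txt data.nt > filtered.nt
# A blacklist would invert the match: grep -v -F -f allow.txt data.nt
```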

getenrichment

Name getenrichment
Usage getenrichment <linkfile> <resource file> <output file>
Type Statement based/resource based
Description Extracts resources from an SNT file that are addressed by the object of an SNT link file. Missing resources in the resource file are ignored. The subjects of the extracted statements are changed to the subject of the link.
Argument: linkfile The link file, requires SNT.
Argument: resource file An SNT file containing the resources to be extracted.
Argument: output file Name of the output file. The file containing the extracted resources.
Output An SNT file with the extracted resources.

See also extractreferenced.

help

Name help
Usage help <cmd>
Type -
Description Displays the help text, for the specified command.
Argument: cmd Name of the command.
Output Help text for the specified command.

merge

Name merge
Usage merge <output file> <input file1> <input file2> [<input file3>...]
Type statement based
Description Merges two or more sorted N-Triples files.
Argument: output file The name of the output file.
Argument: input file1 An SNT file containing statements to be merged.
Argument: input file2 Another SNT file containing statements to be merged.
Argument: input fileN Further optional SNT files containing statements to be merged.
Output An SNT file with the merged results.
For a simple concatenation you may also try "$ cat a.nt b.nt c.nt > mergefile.nt" in a Linux environment.
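If the inputs are already sorted and you want the result to stay sorted without re-sorting, `sort -m` (merge mode) is a closer coreutils analogue than plain `cat`; the file names are illustrative, roughly mirroring `merge merged_snt.nt a.nt b.nt`.

```shell
printf '<http://example.org/a> <http://example.org/p> "1" .\n' > a.nt
printf '<http://example.org/b> <http://example.org/p> "2" .\n' > b.nt

# -m merges already-sorted inputs in one sequential pass, so the
# output is sorted without the cost of a full sort.
LC_ALL=C sort -m a.nt b.nt > merged_snt.nt
```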

mergedir

Name mergedir
Usage mergedir <input dir> <output file>
Type statement based
Description Merges SNT files that are in the same directory. Extends namespaces to their long forms.
Argument: input dir The name of the directory containing the SNT files to be merged. Subdirectories are also searched.
Argument: output file An SNT file containing the merged statements.
Output An SNT file with the merged results.
For a simple concatenation you may also try "$ cat *.nt > mergefile.nt" in a Linux environment.

ntriplify

Name ntriplify
Usage ntriplify <input dir> <output file> [<JSON-LD context URI> <JSON-LD context file>][...]
Type statement based
Description Converts all RDF files from a directory into N-Triples and merges them into a single file.
Argument: input dir The name of the directory containing the RDF files. Subdirectories are also searched.
Argument: output file The name of the output file.
Argument: JSON-LD context URIs and files Optional. A mapping of JSON-LD context URIs to local JSON-LD context files can be stated. The context URIs and file paths have to be given in pairs, separated by spaces. The command will use the local contexts whenever a remote context is not available.
Output An N-Triples file containing the converted statements.

pick

Name pick
Usage pick <input file> <output file> <s|p|o|stmt|res> <s|list|?> <p|list|?> <o|list|?>
Type Dependent on search pattern
Description Takes an input file and extracts all subjects, predicates, objects, statements or resources according to the specified pattern and outputs them into a file. A "?" character can be used to indicate a wildcard. Example: infile.nt outfile.nt o subjectlist.txt predicatelist.txt ? This returns all objects whose statements match any combination of subjectlist and predicatelist.
Argument: input file The name of the input file. Sorted N-Triples are required.
Argument: output file The name of the output file.
Argument: return type The kind of information to be returned; one of subject, predicate, object, statement or resource.
Argument: subject expression The expression for matching the subject: A single URL, a file containing a list of URLs or a wildcard.
Argument: predicate expression The expression for matching the predicate: A single URL, a file containing a list of URLs or a wildcard.
Argument: object expression The expression for matching the object: a single URL/literal, a file containing a list of URLs or literals, or a wildcard. Datatypes and language tags cannot be processed.
Output An N-Triples file containing the output.

removeduplicates

Name removeduplicates
Usage removeduplicates <input file> <output file>
Type Line based
Description Removes duplicate statements from an SNT file. Keeps one line of each kind.
Argument: input file The name of the input file, requires SNT.
Argument: output file The name of the output file.
Output An SNT file containing the remaining statements.
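Because the input is sorted, duplicate statements sit on adjacent lines, so a single `uniq` pass is a reasonable coreutils approximation; the file names are illustrative, roughly mirroring `removeduplicates dup.nt dedup.nt`.

```shell
cat > dup.nt <<'EOF'
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
EOF

# On sorted input, uniq keeps exactly one line of each kind.
uniq dup.nt > dedup.nt
```

This adjacency argument is also why the command can be line based: no statement parsing is needed at all.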

renameproperty

Name renameproperty
Usage renameproperty <input file> <output file> <property> <substitute> [<property> <substitute>...]
Type Statement based
Description Renames a property. Requires long namespaces.
Argument: input file The name of the input file, requires SNT with long namespaces.
Argument: output file The name of the output file.
Argument: property The property to be replaced. Long namespace required.
Argument: substitute The substitute property. Long namespace required.
Output An SNT copy of the input file with replaced properties.
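The effect can be approximated with a `sed` substitution over the predicate URI; both URIs and file names below are illustrative, roughly mirroring `renameproperty props.nt renamed.nt http://purl.org/dc/elements/1.1/title http://purl.org/dc/terms/title`.

```shell
cat > props.nt <<'EOF'
<http://example.org/a> <http://purl.org/dc/elements/1.1/title> "T" .
EOF

# Replace every occurrence of the old predicate URI (including the
# angle brackets, to avoid partial-URI matches).
sed 's|<http://purl.org/dc/elements/1.1/title>|<http://purl.org/dc/terms/title>|g' \
    props.nt > renamed.nt
```

Note that after renaming, the file may no longer be in sorted order, so re-sorting can be necessary before further SNT operations.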

restorebn

Name restorebn
Usage restorebn <input file> <output file>
Type Statement based
Description Restores blank nodes within an N-Triples file that were transcribed, e.g. by the ntriplify command.
Argument: input file The name of the input file, requires N-Triples with long namespaces.
Argument: output file The name of the output file.
Output A copy of input file with restored blank nodes.
See also [ntriplify](#cmd:ntriplify).

securelooseends

Name securelooseends
Usage securelooseends <file A> <file B> <output file> <predicate1> <substitute1> [<predicate2> <substitute2> ...]
Type Resource based
Description Extracts resources from file B that are referenced in file A. Then reduces this resource to a meaningful string and adds it to the original resource.
Argument: file A An SNT input file containing the references.
Argument: file B An SNT input file containing the resources that are referenced in file A.
Argument: output file The name of the output file.
Argument: predicate1 A property from file A whose reference is to be looked up in file B.
Argument: substitute1 A property to map the meaningful string to.
Output An SNT file containing the resulting statements. e.g. <s> <substitute1> "meaningful string"

sort

Name sort
Usage sort <input file> <output file>
Type statement based
Description Sorts an N-Triples file in ascending order of codepoints.
Argument: input file The name of the input file, requires N-Triples with long namespace forms.
Argument: output file The name of the output file.
Output An SNT file containing all the statements from the input file.
See also [checksorting](#cmd:checksorting).
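Codepoint-order sorting corresponds to byte-wise collation, which coreutils `sort` provides when locale collation is disabled via `LC_ALL=C`; the file names are illustrative, roughly mirroring `sort unsorted.nt sorted_out.nt`.

```shell
cat > unsorted.nt <<'EOF'
<http://example.org/b> <http://example.org/p> "2" .
<http://example.org/a> <http://example.org/p> "1" .
EOF

# LC_ALL=C forces plain codepoint comparison; locale-aware
# collation could order lines differently and break SNT tools.
LC_ALL=C sort unsorted.nt > sorted_out.nt
```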

split

Name split
Usage split <input file> <output file prefix> <resources per file>
Type Resource based
Description Splits an SNT file into several smaller files, with a given number of resources.
Argument: input file The name of the input file, requires SNT.
Argument: output file prefix Prefix for the output files, e.g. /home/data/part_
Argument: resources per file Number of resources per file.
Output Multiple SNT files, e.g. /home/data/part_1.nt etc.
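The resource-wise splitting (as opposed to a plain line count split, which could cut a resource in half) can be sketched in awk: start a new part file after every n distinct subjects. File names and the n=1 choice are illustrative, roughly mirroring `split people.nt part_ 1`.

```shell
cat > people.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
EOF

# Count distinct subjects as they appear (input is sorted, so all
# statements of one resource are adjacent) and rotate the output
# file after every n of them.
awk -v n=1 -v prefix="part_" '{
    subj = $1
    if (subj != last) { count++; last = subj }
    file = prefix int((count - 1) / n + 1) ".nt"
    print > file
}' people.nt
```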

version

Name version
Usage version
Type -
Description Prints the version to the screen, e.g. v0.1.

Special commands

analyzetype

Name analyzetype
Usage analyzetype <input file> <type> <predicate1> [<predicate2> ...]
Type Resource based
Description Counts the occurrences of literal objects for one or more properties for a given rdf:type. When more than one property is used, the combinations of properties are counted as well. Output is written to a CSV file. The entries are ranked by their occurrences. Use case example: a ranking of the most common first name and last name combinations for persons could be created. See also: pigeonhole
Argument: input file The input file, requires SNT.
Argument: type The type of resource to be analyzed, e.g. foaf:Person
Argument: predicate1 The property to examine. Requires long namespace version.
Argument: further predicates Further predicates, requires long namespace version.
Output One CSV file for every property and combination of properties, names are chosen automatically.

correct

Name correct
Usage correct <input file> <output file>
Type Line based
Description Removes invalid triples from a given file and replaces invalid characters with the "?" character.
Argument: input file The input file, requires N-Triples.
Argument: output file Name of the output file.
Output An N-Triples file without the problematic triples.

extractduplicatelinks

Name extractduplicatelinks
Usage extractduplicatelinks <input file>
Type Statement based
Description Extracts statements that do not address their subject or object exclusively. Use case example: find owl#sameAs links in a link set that connect commodity resources, i.e. identify such resources. Useful in combination with the subtract command.
Argument: input file The input file, requires SNT.
Output Two N-Triples files: subjects.nt contains all statements that do not address their subject exclusively; objects.nt contains all statements that do not address their objects exclusively.

extractreferenced

Name extractreferenced
Usage extractreferenced <file A> <file B> <output file> <predicate1> [<predicate2> ...]
Type Resource based
Description Extracts resources from file B that are referenced in file A. Missing resources in file B are ignored.
Argument: file A The input file containing the references. SNT required.
Argument: file B A second input file containing the referenced resources. SNT required.
Argument: output file The name of the output file. This file will contain the extracted resources.
Output An SNT file containing the extracted resources.

See also getenrichment.

outline

Name outline
Usage outline <input file> <output file> <target property>
Type Resource based
Description Creates a literal representation for each resource in a file. The representation is mapped to a given property.
Argument: input file The input file with the resources to be outlined. SNT required.
Argument: output file The name of the file to store the output in.
Argument: target property The property to assign the outline to.
Output An SNT file with one statement for each resource: <original subject> <target property> "literal representation"

See also: securelooseends.

pigeonhole

Name pigeonhole
Usage pigeonhole <input file> <output file A> <output file B> <output file C> <CSV> <total threshold>
Type Resource based
Description Extracts the resources from an SNT file according to the frequency of their attribute values. A CSV file, such as produced by analyzetype, provides the necessary information. The CSV file contains combinations of values (a single property is also considered a combination) of the covered properties together with a number, "total", that indicates the number of occurrences of the combination in the input file. The entries in this CSV file are sorted by this number.
The command reads the CSV entries until the threshold on the "total" field is undershot; then it stops.
The command then reads the input file resource-wise and handles the resources:
Their property-combinations are looked up in the CSV table.
If a property combination has an entry in the table, then the resource is written to file A.
If a certain combination is not present in the table, then the resource is written to file B.
If the resource does not contain all of the properties stated in the CSV file, then it is written to file C.

Thus the command extracts the resources of the top X most frequent property combinations.
Argument: input file The input file with the resources to be pigeonholed. SNT required.
Argument: output file A The name of the file to store the output in.
Argument: output file B The name of the file to store the output in.
Argument: output file C The name of the file to store the output in.
Argument: CSV A CSV file containing the frequencies of the property values, in the same format as the output of the analyzetype command.
Argument: total threshold A positive integer to be used as lower threshold on the "total" frequency column in the CSV. Can be used to exclude uncommon values.
Output Three SNT files with the resources of their category.
File A: Contains all resources with a property combination that is in the top x (limited by the threshold parameter) most frequent combinations.
File B: Contains all resources that have values for the requested properties but do not fall in the top x combinations.
File C: Contains the remaining resources.

Use together with analyzetype.

pumpup

Name pumpup
Usage pumpup <input file> <output file>
Type statement based
Description Extends the namespaces in an N-Triples file to their long forms. Uses the namespaces as stated below. The file "namespaces.txt" specifying these namespaces comes with the binaries and can be adapted to custom needs. Many commands already include this functionality.
Argument: input file The name of the input file, requires N-Triples.
Argument: output file The name of the output file.
Output An N-Triples file containing the expanded statements.

List of namespaces with their respective short forms.

subtract

Name subtract
Usage subtract <file A> <file B> <output file>
Type Line based
Description Removes all statements from file A that are also in file B.
Argument: file A The name of the first file, requires SNT.
Argument: file B The name of the second file, requires SNT.
Argument: output file Name of the output file.
Output The resulting file containing SNT.
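Since both inputs are sorted, the set difference can be approximated with `comm -23`, which also works line by line on sorted files; the file names are illustrative, roughly mirroring `subtract fileA.nt fileB.nt diff.nt`.

```shell
printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> "1" .' \
  '<http://example.org/b> <http://example.org/p> "2" .' > fileA.nt
printf '%s\n' \
  '<http://example.org/b> <http://example.org/p> "2" .' > fileB.nt

# -23 suppresses lines unique to B and lines common to both,
# leaving only the lines that are in A but not in B.
LC_ALL=C comm -23 fileA.nt fileB.nt > diff.nt
```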

Getting Started

Some steps to get started:

  1. Prepare your data in a single directory, in one of these formats: .nt, .rdf, .xml, .jsonld.
  2. Convert your data to N-Triples if it is not in that format already.
    java -jar reshaperdf-1.0-SNAPSHOT.jar ntriplify ./myrdf ./nt/mydata.nt
  3. Sort your data.
    java -jar reshaperdf-1.0-SNAPSHOT.jar sort ./nt/mydata.nt ./nt/mydata_sorted.nt
  4. Extract all persons (foaf:Person) from the file into another file.
    java -jar reshaperdf-1.0-SNAPSHOT.jar extractresources ./nt/mydata_sorted.nt ./nt/mypersons.nt http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person 0 -1