ReshapeRDF - Guide

A toolset to work with N-Triples

Author: Felix Bensmann
Date: 07. Dec. 2015
Last change: 16. Jan. 2016
Please note: This document is intended to provide help to get started with ReshapeRDF, nothing more.
Its content is subject to change.

TOC

  • Introduction
  • Sorted N-Triples
  • Terms
  • Setup
  • Commands
    • Commands for everyday use
      • block
      • checksorting
      • extractresources
      • filter
      • getenrichment
      • help
      • merge
      • mergedir
      • ntriplify
      • pick
      • removeduplicates
      • renameproperty
      • restorebn
      • securelooseends
      • sort
      • split
      • version
    • Special commands
      • analyzetype
      • correct
      • extractduplicatelinks
      • extractreferenced
      • outline
      • pigeonhole
      • pumpup
      • subtract
  • Getting Started

Introduction

Processing RDF mass data can be an error-prone job. Common triplestores offer certain functionality for querying and manipulating RDF data, but only few can handle mass data (say, more than 200 million statements) at the same time. Typical operations like data import and SPARQL queries tend to be time-consuming and inconvenient to use in comprehensive reshaping operations.

So, when working with simply structured graph data, a solution can be to refrain from using a triplestore and to work with dump files instead. Recurring tasks are extracting entities of a certain class from a large dataset, subdividing a dataset into blocks according to a certain property (blocking), filtering the data, removing resources and single statements, renaming properties, and similar reshaping operations.

Unfortunately, organizing one's RDF mass data in the desired manner cannot be done easily with available out-of-the-box tools.

The tool at hand was developed to enable users of large RDF datasets to efficiently organize and reshape their data without the need for a triplestore.

Sorted N-Triples

When there is an RDF dump file to process, users cannot take for granted that stored resources are held together. This is especially true for the N-Triples file format, but it also applies to the RDF/XML file format, which even provides a syntactic way to cluster statements. At the same time, resources within such files cannot be found efficiently: the whole file has to be read and the stream examined from start to end to find all occurrences. Complex searches cannot be handled at all.

To overcome these limitations, this tool applies an intermediate file format to be used by a given set of operations to organize data in a more flexible way. This format is "Sorted N-Triples" (SNT): as the name indicates, alphabetically sorted N-Triples.

The following example depicts how SNTs can be used for an interlinking and enrichment process.

  1. Convert a non-SNT file to N-Triples
  2. Sort it
  3. Extract relevant resources (one iteration)
  4. Split the extracted resources into smaller datasets (one iteration)
  5. Interlink - however
  6. If necessary convert the links to SNT
  7. Merge the links into the data (one iteration)

The flexible nature of this tool is especially helpful with heterogeneous datasets.
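Under the assumption that standard GNU coreutils are available, steps 2, 3, and 7 of the workflow above can be sketched on a toy dataset to illustrate what sorted data buys you. All file names and URIs below are made up for illustration; in practice the ReshapeRDF commands described later do this work resource-wise.

```shell
# Toy dataset (step 1's output is assumed to exist already).
cat > mydata.nt <<'EOF'
<http://example.org/b> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
EOF

# Step 2: sort by codepoint, yielding Sorted N-Triples (SNT).
LC_ALL=C sort mydata.nt > mydata_sorted.nt

# Step 3: extract all statements of foaf:Person resources (a crude
# grep stand-in for the extractresources command).
grep '<http://xmlns.com/foaf/0.1/Person>' mydata_sorted.nt \
  | cut -d' ' -f1 > person_subjects.txt
grep -F -f person_subjects.txt mydata_sorted.nt > persons.nt

# Step 7: merge already-sorted files back together in one pass.
LC_ALL=C sort -m mydata_sorted.nt persons.nt > merged.nt
```

Because both inputs of the last step are sorted, the merge is a single sequential pass; this is the key property that SNT gives every operation in this guide.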

Setup

Copy the JAR archive reshaperdf-1.0-SNAPSHOT.jar and the lib folder to a directory of your choice. The software requires at least JRE 1.7.

It is helpful to provide a script "reshaperdf" in /bin that facilitates the calls to the program.

#!/bin/bash
# Author:  John Smith
# Purpose: Facilitates calls to ReshapeRDF.

java -jar reshaperdf-1.0-SNAPSHOT.jar $@

Terms

  • triple and statement: In this application, a triple and a statement as known from the RDF context are the same thing. They always fit on one line.
  • line based: An operation is called line based if it treats each triple as a plain text line.
  • statement based: An operation is called statement based if it treats triples as parsed triples/statements.
  • resource based: An operation is called resource based if it sees the data as a list of individual resources.

Commands

This chapter outlines the operations and their usage. A command can be called using the following syntax:

java -jar reshaperdf-1.0-SNAPSHOT.jar <command> [<command parameter> ...]

The chapter is subdivided into a section about commands intended for everyday use and a section about special commands that have no purpose in everyday use but come in handy in exotic use cases. The special commands are available in their own branch.

At no point will any of the commands overwrite an input file; rather, they produce a new file with the desired changes. However, existing files will be overwritten by output files without notification.

Comments are usually not processed by the commands. Most commands require the long forms of URIs.

Commands for everyday use

block

Name block
Usage block <input file> <output dir> <predicate> <char offset> <char length>
Type Resource based
Description Assigns the resources of the input file to blocks according to a given character sequence of a given property's value. One block is one file. Files that exceed a statement count of 100 000 are further split into files of 100 000 statements each.
Argument: input file The input file, requires SNT.
Argument: output dir The directory to store the output in.
Argument: predicate The property to block by. Requires long namespace version.
Argument: char offset The offset of the character sequence in the property's value. Use 0 for no offset. If the offset is higher than the value's length, then the whole property value will be evaluated.
Argument: char length The length of the character sequence in the property's value. If the length is higher than the value's length, then the whole property value will be evaluated.
Output A set of SNT files in the given output directory.
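The blocking idea can be sketched with a short awk script that bins statements of one property by the first character of its literal value. This is only a line-based approximation of the resource-based block command; the predicate, offset 0, and length 1 used here mirror a hypothetical call `block names.nt blocks http://xmlns.com/foaf/0.1/name 0 1`.

```shell
# Sample SNT input; subjects and values are illustrative.
cat > names.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/c> <http://xmlns.com/foaf/0.1/name> "Carol" .
EOF

mkdir -p blocks
# Split on the double quotes so $2 is the literal value, then bin
# each statement by the uppercased first character of that value.
awk -F'"' '/foaf\/0\.1\/name/ {
    key = toupper(substr($2, 1, 1))
    print $0 >> ("blocks/" key ".nt")
}' names.nt
```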

checksorting

Name checksorting
Usage checksorting <input file>
Type Statement based
Description Checks the input file for proper sorting. This sorting differs from line sorting in that it ignores control characters.
Argument: input file The input file, requires N-Triples.
Output Prints "Sorted" to stdout if sorted correctly, "Not sorted" otherwise.
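A rough stand-in for this check, assuming GNU coreutils, is `sort -c`, which exits with status 0 if its input is in codepoint order. Note that checksorting additionally ignores control characters, which `sort -c` does not, so this is only an approximation; the file name is illustrative.

```shell
printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> "1" .' \
  '<http://example.org/b> <http://example.org/p> "2" .' > sorted.nt

# Exit status 0 means codepoint order, mimicking the command's
# "Sorted"/"Not sorted" output.
if LC_ALL=C sort -c sorted.nt 2>/dev/null; then
    echo "Sorted"
else
    echo "Not sorted"
fi
```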

extractresources

Name extractresources
Usage extractresources <input file> <output file> <predicate> <object> <offset> <length>
Type Resource based
Description Extracts resources with a given predicate-object combination.
Argument: input file The input file, requires N-Triples.
Argument: output file Name of the output file, the file with the extracted resources.
Argument: predicate The predicate to look for, namespace has to be in long form. Use a "?" to indicate a wildcard.
Argument: object The object to look for. Can be a literal or a URL. Use a "?" to indicate a wildcard.
Argument: offset Number of the matching resource to start from.
Argument: length Number of resources to extract. -1 indicates to use all available resources.
Output An SNT file with the extracted resources.
See also [pick](#cmd:pick).

filter

Name filter
Usage filter <whitelist|blacklist> <input file> <filter file> <output file>
Type Resource based
Description Removes statements from an N-Triples file according to a whitelist or blacklist.
Argument: whitelist|blacklist Either "whitelist" or "blacklist" to indicate what kind of filter is to be used.
Argument: input file File to filter
Argument: filter file A simple line-based text file containing the properties that are subject to the filter.
Argument: output file Name of the file to store the output in.
Output An SNT file with the remaining resources.
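The whitelist/blacklist behavior can be approximated with `grep -f` on a toy file; the data and the property list below are made up for illustration, roughly mirroring `filter whitelist data.nt allow.txt filtered.nt`.

```shell
cat > data.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://example.org/internal/debug> "x" .
EOF

# Filter file: one property URI per line.
echo 'http://xmlns.com/foaf/0.1/name' > allow.txt

# Whitelist: keep only statements whose predicate is listed.
grep -F -f allow.txt data.nt > filtered.nt
# A blacklist would invert the match: grep -v -F -f allow.txt data.nt
```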

getenrichment

Name getenrichment
Usage getenrichment <linkfile> <resource file> <output file>
Type Statement based/resource based
Description Extracts resources from an SNT file that are addressed by the object of an SNT link file. Missing resources in the resource file are ignored. The subjects of the extracted statements are changed to the subject of the link.
Argument: linkfile The link file, requires SNT.
Argument: resource file An SNT file containing the resources to be extracted.
Argument: output file Name of the output file. The file containing the extracted resources.
Output An SNT file with the extracted resources.

See also extractreferenced.

help

Name help
Usage help <cmd>
Type -
Description Displays the help text, for the specified command.
Argument: cmd Name of the command.
Output Help text for the specified command.

merge

Name merge
Usage merge <output file> <input file1> <input file2> [<input file3>...]
Type statement based
Description Merges two or more sorted N-Triples files.
Argument: output file The name of the output file.
Argument: input file1 An SNT file containing statements to be merged.
Argument: input file2 Another SNT file containing statements to be merged.
Argument: input fileN Further optional SNT files containing statements to be merged.
Output An SNT file with the merged results.
For a simple concatenation you may also try "$ cat a.nt b.nt c.nt > mergefile.nt" in a Linux environment.
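If the inputs are already sorted and you want the result to stay sorted without re-sorting, `sort -m` (merge mode) is a closer coreutils analogue than plain `cat`; the file names are illustrative, roughly mirroring `merge merged_snt.nt a.nt b.nt`.

```shell
printf '<http://example.org/a> <http://example.org/p> "1" .\n' > a.nt
printf '<http://example.org/b> <http://example.org/p> "2" .\n' > b.nt

# -m merges already-sorted inputs in one sequential pass, so the
# output is sorted without the cost of a full sort.
LC_ALL=C sort -m a.nt b.nt > merged_snt.nt
```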

mergedir

Name mergedir
Usage mergedir <input dir> <output file>
Type statement based
Description Merges SNT files that are in the same directory. Extends namespaces to their long forms.
Argument: input dir The name of the directory containing the SNT files to be merged. Subdirectories are also searched.
Argument: output file An SNT file containing the merged statements.
Output An SNT file with the merged results.
For a simple concatenation you may also try "$ cat *.nt > mergefile.nt" in a Linux environment.

ntriplify

Name ntriplify
Usage ntriplify <input dir> <output file> [<JSON-LD context URI> <JSON-LD context file>][...]
Type statement based
Description Converts all RDF files from a directory into N-Triples and merges them into a single file.
Argument: input dir The name of the directory containing the RDF files. Subdirectories are also searched.
Argument: output file The name of the output file.
Argument: JSON-LD context URIs and files Optional. A mapping of JSON-LD context URIs to local JSON-LD context files can be stated. The context URIs and file paths have to be given in pairs, separated by spaces. The command will use the local contexts whenever a remote context is not available.
Output An N-Triples file containing the converted statements.

pick

Name pick
Usage pick <input file> <output file> <s|p|o|stmt|res> <s|list|?> <p|list|?> <o|list|?>
Type Dependent on search pattern
Description Takes an input file and extracts all subjects, predicates, objects, statements or resources according to the specified pattern and outputs them into a file. A "?" character can be used to indicate a wildcard. Example: infile.nt outfile.nt o subjectlist.txt predicatelist.txt ? This returns all objects whose statements match any combination of subjectlist and predicatelist.
Argument: input file The name of the input file. Sorted N-Triples are required.
Argument: output file The name of the output file.
Argument: return type The kind of information to be returned; one of subject, predicate, object, statement or resource.
Argument: subject expression The expression for matching the subject: A single URL, a file containing a list of URLs or a wildcard.
Argument: predicate expression The expression for matching the predicate: A single URL, a file containing a list of URLs or a wildcard.
Argument: object expression The expression for matching the object: a single URL/literal, a file containing a list of URLs or literals, or a wildcard. Datatypes and language tags cannot be processed.
Output An N-Triples file containing the output.

removeduplicates

Name removeduplicates
Usage removeduplicates <input file> <output file>
Type Line based
Description Removes duplicate statements from an SNT file. Keeps one line of each kind.
Argument: input file The name of the input file, requires SNT.
Argument: output file The name of the output file.
Output An SNT file containing the remaining statements.
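Because the input is sorted, duplicate statements sit on adjacent lines, so a single `uniq` pass is a reasonable coreutils approximation; the file names are illustrative, roughly mirroring `removeduplicates dup.nt dedup.nt`.

```shell
cat > dup.nt <<'EOF'
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
EOF

# On sorted input, uniq keeps exactly one line of each kind.
uniq dup.nt > dedup.nt
```

This adjacency argument is also why the command can be line based: no statement parsing is needed at all.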

renameproperty

Name renameproperty
Usage renameproperty <input file> <output file> <property> <substitute> [<property> <substitute>...]
Type Statement based
Description Renames a property. Requires long namespaces.
Argument: input file The name of the input file, requires SNT with long namespaces.
Argument: output file The name of the output file.
Argument: property The property to be replaced. Long namespace required.
Argument: substitute The substitute property. Long namespace required.
Output An SNT copy of the input file with replaced properties.
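The effect can be approximated with a `sed` substitution over the predicate URI; both URIs and file names below are illustrative, roughly mirroring `renameproperty props.nt renamed.nt http://purl.org/dc/elements/1.1/title http://purl.org/dc/terms/title`.

```shell
cat > props.nt <<'EOF'
<http://example.org/a> <http://purl.org/dc/elements/1.1/title> "T" .
EOF

# Replace every occurrence of the old predicate URI (including the
# angle brackets, to avoid partial-URI matches).
sed 's|<http://purl.org/dc/elements/1.1/title>|<http://purl.org/dc/terms/title>|g' \
    props.nt > renamed.nt
```

Note that after renaming, the file may no longer be in sorted order, so re-sorting can be necessary before further SNT operations.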

restorebn

Name restorebn
Usage restorebn <input file> <output file>
Type Statement based
Description Restores blank nodes within an N-Triples file that were transcribed, e.g. by the ntriplify command.
Argument: input file The name of the input file, requires N-Triples with long namespaces.
Argument: output file The name of the output file.
Output A copy of input file with restored blank nodes.
See also [ntriplify](#cmd:ntriplify).

securelooseends

Name securelooseends
Usage securelooseends <file A> <file B> <output file> <predicate1> <substitute1> [<predicate2> <substitute2> ...]
Type Resource based
Description Extracts resources from file B that are referenced in file A. Then reduces this resource to a meaningful string and adds it to the original resource.
Argument: file A An SNT input file containing the references.
Argument: file B An SNT input file containing the resources that are referenced in file A.
Argument: output file The name of the output file.
Argument: predicate1 A property from file A whose reference is to be looked up in file B.
Argument: substitute1 A property to map the meaningful string to.
Output An SNT file containing the resulting statements. e.g. <s> <substitute1> "meaningful string"

sort

Name sort
Usage sort <input file> <output file>
Type statement based
Description Sorts an N-Triples file in ascending order of codepoints.
Argument: input file The name of the input file, requires N-Triples with long namespace forms.
Argument: output file The name of the output file.
Output An SNT file containing all the statements from the input file.
See also [checksorting](#cmd:checksorting).
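Codepoint-order sorting corresponds to byte-wise collation, which coreutils `sort` provides when locale collation is disabled via `LC_ALL=C`; the file names are illustrative, roughly mirroring `sort unsorted.nt sorted_out.nt`.

```shell
cat > unsorted.nt <<'EOF'
<http://example.org/b> <http://example.org/p> "2" .
<http://example.org/a> <http://example.org/p> "1" .
EOF

# LC_ALL=C forces plain codepoint comparison; locale-aware
# collation could order lines differently and break SNT tools.
LC_ALL=C sort unsorted.nt > sorted_out.nt
```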

split

Name split
Usage split <input file> <output file prefix> <resources per file>
Type Resource based
Description Splits an SNT file into several smaller files, with a given number of resources.
Argument: input file The name of the input file, requires SNT.
Argument: output file prefix Prefix for the output files, e.g. /home/data/part_
Argument: resources per file Number of resources per file.
Output Multiple SNT files, e.g. /home/data/part_1.nt etc.
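The resource-wise splitting (as opposed to a plain line count split, which could cut a resource in half) can be sketched in awk: start a new part file after every n distinct subjects. File names and the n=1 choice are illustrative, roughly mirroring `split people.nt part_ 1`.

```shell
cat > people.nt <<'EOF'
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
EOF

# Count distinct subjects as they appear (input is sorted, so all
# statements of one resource are adjacent) and rotate the output
# file after every n of them.
awk -v n=1 -v prefix="part_" '{
    subj = $1
    if (subj != last) { count++; last = subj }
    file = prefix int((count - 1) / n + 1) ".nt"
    print > file
}' people.nt
```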

version

Name version
Usage version
Type -
Description Prints the version to the screen, e.g. v0.1.

Special commands

analyzetype

Name analyzetype
Usage analyzetype <input file> <type> <predicate1> [<predicate2> ...]
Type Resource based
Description Counts the occurrences of literal objects for one or more properties for a given rdf:type. When more than one property is used, the combinations of properties are counted as well. Output is written to a CSV file. The entries are ranked by their occurrences. Use case example: a ranking of the most common first name and last name combinations for persons could be created. See also: pigeonhole
Argument: input file The input file, requires SNT.
Argument: type The type of resource to be analyzed, e.g. foaf:Person
Argument: predicate1 The property to examine. Requires long namespace version.
Argument: further predicates Further predicates, requires long namespace version.
Output One CSV file for every property and combination of properties, names are chosen automatically.

correct

Name correct
Usage correct <input file> <output file>
Type Line based
Description Removes invalid triples from a given file and replaces invalid characters with the "?" character.
Argument: input file The input file, requires N-Triples.
Argument: output file Name of the output file.
Output An N-Triples file without the problematic triples.

extractduplicatelinks

Name extractduplicatelinks
Usage extractduplicatelinks <input file>
Type Statement based
Description Extracts statements that do not address their subject or object exclusively. Use case example: find owl#sameAs links in a link set that connect commodity resources, i.e. identify such resources. Useful in combination with the subtract command.
Argument: input file The input file, requires SNT.
Output Two N-Triples files: subjects.nt contains all statements that do not address their subject exclusively; objects.nt contains all statements that do not address their objects exclusively.

extractreferenced

Name extractreferenced
Usage extractreferenced <file A> <file B> <output file> <predicate1> [<predicate2> ...]
Type Resource based
Description Extracts resources from file B that are referenced in file A. Missing resources in file B are ignored.
Argument: file A The input file containing the references. SNT required.
Argument: file B A second input file containing the referenced resources. SNT required.
Argument: output file The name of the output file. This file will contain the extracted resources.
Output An SNT file containing the extracted resources.

See also getenrichment.

outline

Name outline
Usage outline <input file> <output file> <target property>
Type Resource based
Description Creates a literal representation for each resource in a file. The representation is mapped to a given property.
Argument: input file The input file with the resources to be outlined. SNT required.
Argument: output file The name of the file to store the output in.
Argument: target property The property to assign the outline to.
Output An SNT file with one statement for each resource: <original subject> <target property> "literal representation"

See also: securelooseends.

pigeonhole

Name pigeonhole
Usage pigeonhole <input file> <output file A> <output file B> <output file C> <CSV> <total threshold>
Type Resource based
Description Extracts the resources from an SNT file according to the frequency of their attribute values. A CSV file, such as produced by analyzetype, provides the necessary information. The CSV file contains combinations of values (a single property is also considered a combination) of the covered properties together with a number, "total", that indicates the number of occurrences of the combination in the input file. The entries in this CSV file are sorted by this number.
The command reads the CSV entries until the threshold on the "total" field is undershot; then it stops.
The command then reads the input file resource-wise and handles the resources:
Their property-combinations are looked up in the CSV table.
If a property combination has an entry in the table, then the resource is written to file A.
If a certain combination is not present in the table, then the resource is written to file B.
If the resource does not contain all of the properties stated in the CSV file, then it is written to file C.

Thus the command extracts the resources of the top X most frequent property combinations.
Argument: input file The input file with the resources to be pigeonholed. SNT required.
Argument: output file A The name of the file to store the output in.
Argument: output file B The name of the file to store the output in.
Argument: output file C The name of the file to store the output in.
Argument: CSV A CSV file containing the frequencies of the property values, in the same format as the output of the analyzetype command.
Argument: total threshold A positive integer to be used as lower threshold on the "total" frequency column in the CSV. Can be used to exclude uncommon values.
Output Three SNT files with the resources of their category.
File A: Contains all resources with a property combination that is in the top x (limited by the threshold parameter) most frequent combinations.
File B: Contains all resources that have values for the requested properties but do not fall in the top x combinations.
File C: Contains the remaining resources.

Use together with analyzetype.

pumpup

Name pumpup
Usage pumpup <input file> <output file>
Type statement based
Description Extends the namespaces in an N-Triples file to their long forms. Uses the namespaces as stated below. The file "namespaces.txt" specifying these namespaces comes with the binaries and can be adapted to custom needs. Many commands already include this functionality.
Argument: input file The name of the input file, requires N-Triples.
Argument: output file The name of the output file.
Output An N-Triples file containing the expanded statements.

List of namespaces with their respective short forms.

subtract

Name subtract
Usage subtract <file A> <file B> <output file>
Type Line based
Description Removes all statements from file A that are also in file B.
Argument: file A The name of the first file, requires SNT.
Argument: file B The name of the second file, requires SNT.
Argument: output file Name of the output file.
Output The resulting file containing SNT.
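Since both inputs are sorted, the set difference can be approximated with `comm -23`, which also works line by line on sorted files; the file names are illustrative, roughly mirroring `subtract fileA.nt fileB.nt diff.nt`.

```shell
printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> "1" .' \
  '<http://example.org/b> <http://example.org/p> "2" .' > fileA.nt
printf '%s\n' \
  '<http://example.org/b> <http://example.org/p> "2" .' > fileB.nt

# -23 suppresses lines unique to B and lines common to both,
# leaving only the lines that are in A but not in B.
LC_ALL=C comm -23 fileA.nt fileB.nt > diff.nt
```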

Getting Started

Some steps to get started:

  1. Prepare your data in a single directory, in one of these formats: .nt, .rdf, .xml, .jsonld.
  2. Convert your data to N-Triples if it is not in that format already.
    java -jar reshaperdf-1.0-SNAPSHOT.jar ntriplify ./myrdf ./nt/mydata.nt
  3. Sort your data.
    java -jar reshaperdf-1.0-SNAPSHOT.jar sort ./nt/mydata.nt ./nt/mydata_sorted.nt
  4. Extract all persons (foaf:Person) from the file into another file.
    java -jar reshaperdf-1.0-SNAPSHOT.jar extractresources ./nt/mydata_sorted.nt ./nt/mypersons.nt http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person 0 -1