Wikibase to Solr #12: adding constants and methods
In this completed loop script, every single value that we want to process or extract from the Wikibase JSON export is expressed/available via the variables below.

- `item_ID`
- `item_CLAIMS` contains `propertyID` elements
- `item_PROPERTY_ARRAY` is an array of data for a particular `propertyID`, which contains one or many `propertyInstance` elements
- `item_PROPERTY_VALUE` is the value of the `propertyInstance`, which may be a string or an array/hash depending on the property
- `item_PROPERTY_QUALIFIERS` is an array of sub-properties for a particular `propertyID`, containing one or many `qualifier` => `qualifierArray` elements
- `item_PROPERTY_QUALIFIER_VALUE` is the value of the `qualifierInstance`, which may be a string or an array/hash depending on the property
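The nesting of these variables can be sketched as a Ruby literal. The sketch below is a hand-reduced example, not the full export record; the `Q942`, P5, and `Q821` values are taken from the sample output at the end of this post:

```ruby
# One item from the export, reduced to just the parts the loop variables name.
# The Q942 / P5 / Q821 values come from the sample output shown later.
item = {
  'id' => 'Q942',                                # item_ID
  'claims' => {                                  # item_CLAIMS
    'P5' => [                                    # propertyID => item_PROPERTY_ARRAY
      {                                          # one propertyInstance
        'mainsnak' => {
          'datavalue' => { 'value' => 'Conception Abbey and Seminary' } # item_PROPERTY_VALUE
        },
        'qualifiers' => {                        # item_PROPERTY_QUALIFIERS
          'P4' => [                              # qualifier => qualifierArray
            { 'datavalue' => { 'value' => { 'id' => 'Q821' } } } # qualifierInstance
          ]
        }
      }
    ]
  }
}

puts item.dig('claims', 'P5', 0, 'mainsnak', 'datavalue', 'value')
# => Conception Abbey and Seminary
puts item.dig('claims', 'P5', 0, 'qualifiers', 'P4', 0, 'datavalue', 'value', 'id')
# => Q821
```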
Below, the code is broken into sections using "headers" that describe the role of each block of code. These headers are actually Ruby comments using a double hash (`##`), which render as headers in Markdown.
require 'json'
require 'csv'
require 'date'
require 'time'
require 'optparse'
ID_DS_ID = "P1"
ID_MANUSCRIPT_HOLDING = "P2"
ID_DESCRIBED_MANUSCRIPT = "P3"
ID_HOLDING_INSTITUTION_IN_AUTHORITY_FILE = "P4"
ID_HOLDING_INSTITUTION_AS_RECORDED = "P5"
ID_HOLDING_STATUS = "P6"
ID_INSTITUTIONAL_ID = "P7"
ID_SHELFMARK = "P8"
ID_LINK_TO_INSTITUTIONAL_RECORD = "P9"
ID_TITLE_AS_RECORDED = "P10"
ID_STANDARD_TITLE = "P11"
ID_UNIFORM_TITLE_AS_RECORDED = "P12"
ID_IN_ORIGINAL_SCRIPT = "P13"
ID_ASSOCIATED_NAME_AS_RECORDED = "P14"
ID_ROLE_IN_AUTHORITY_FILE = "P15"
ID_INSTANCE_OF = "P16"
ID_NAME_IN_AUTHORITY_FILE = "P17"
ID_GENRE_AS_RECORDED = "P18"
ID_SUBJECT_AS_RECORDED = "P19"
ID_TERM_IN_AUTHORITY_FILE = "P20"
ID_LANGUAGE_AS_RECORDED = "P21"
ID_LANGUAGE_IN_AUTHORITY_FILE = "P22"
ID_PRODUCTION_DATE_AS_RECORDED = "P23"
ID_PRODUCTION_CENTURY_IN_AUTHORITY_FILE = "P24"
ID_CENTURY = "P25"
ID_DATED = "P26"
ID_PRODUCTION_PLACE_AS_RECORDED = "P27"
ID_PLACE_IN_AUTHORITY_FILE = "P28"
ID_PHYSICAL_DESCRIPTION = "P29"
ID_MATERIAL_AS_RECORDED = "P30"
ID_MATERIAL_IN_AUTHORITY_FILE = "P31"
ID_NOTE = "P32"
ID_ACKNOWLEDGEMENTS = "P33"
ID_DATE_ADDED = "P34"
ID_DATE_LAST_UPDATED = "P35"
ID_LATEST_DATE = "P36"
ID_EARLIEST_DATE = "P37"
ID_START_TIME = "P38"
ID_END_TIME = "P39"
ID_EXTERNAL_IDENTIFIER = "P40"
ID_IIIF_MANIFEST = "P41"
ID_WIKIDATA_QID = "P42"
ID_VIAF_ID = "P43"
ID_EXTERNAL_URI = "P44"
ID_EQUIVALENT_PROPERTY = "P45"
ID_FORMATTER_URL = "P46"
ID_SUBCLASS_OF = "P47"
##
# For either an Item, a claim property, or a qualifier property, return the value
# specified by 'type'. This method works on nested hashes with the structure:
#
#     data['mainsnak']['datavalue']['value']
#
# or
#
#     data['datavalue']['value']
#
# If 'type' == 'value', return the result of the ['datavalue']['value'] chain;
# otherwise, return the entry for 'type' within that result.
#
# Any string will work for 'type'. The only special 'type' is 'value', which returns
# whatever the 'value' key contains. The property value types in the DS Wikibase
# JSON are:
#
# 'entity-type'
# 'numeric-id'
# 'id'
# 'time'
# 'timezone'
# 'before'
# 'after'
# 'precision'
# 'calendarmodel'
#
# @param [Hash] data item or claim property or qualifier property
# @param [String] type the value type to be returned
# @return [Hash,String] the result of extracting the nested data specified by type
def get_value_by_type(data, type)
  return unless data.instance_of?(Hash)
  # if `data` has a 'mainsnak', then we need to get the nested hash with a
  # 'datavalue', 'value' chain; otherwise, we assume 'data' is a hash
  # with a 'datavalue', 'value' chain
  datavalue_hash = data['mainsnak'] || data
  # Be safe anyway: make sure 'datavalue_hash' isn't nil
  return unless datavalue_hash
  # {"snaktype"=>"value", "property"=>"P16", "datavalue"=>{"value"=>{"entity-type"=>"item", "numeric-id"=>3, "id"=>"Q3"}, "type"=>"wikibase-entityid"}, "datatype"=>"wikibase-item"}
  # if I'm right that everything at this point is a hash with a 'datavalue', 'value'
  # chain, then the following will **always** return a hash or a string; but, to be
  # safe, make sure value is a hash if `#dig(...)` returns `nil`
  value = datavalue_hash.dig('datavalue', 'value') || {}
  return value if type == 'value'
  value[type]
end
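As a quick sanity check, the helper can be exercised against a hand-built claim hash; the P16/Q3 data below is copied from the comment inside the method (the definition is repeated so this snippet runs on its own):

```ruby
# Definition repeated from above so the snippet is self-contained.
def get_value_by_type(data, type)
  return unless data.instance_of?(Hash)
  datavalue_hash = data['mainsnak'] || data
  return unless datavalue_hash
  value = datavalue_hash.dig('datavalue', 'value') || {}
  return value if type == 'value'
  value[type]
end

# A claim snak copied from the P16 example in the method comment
claim = {
  'snaktype' => 'value',
  'property' => 'P16',
  'datavalue' => {
    'value' => { 'entity-type' => 'item', 'numeric-id' => 3, 'id' => 'Q3' },
    'type' => 'wikibase-entityid'
  }
}

puts get_value_by_type(claim, 'id')                 # => Q3
puts get_value_by_type(claim, 'numeric-id')         # => 3
puts get_value_by_type('not a hash', 'id').inspect  # => nil
```

The same call works whether or not the snak is wrapped in a `'mainsnak'` key, because of the `data['mainsnak'] || data` fallback.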
##
# Get the Wikibase 'instance of' numeric ID from the 'P16' 'instance_of' claim, if present.
# Otherwise, return 'nil'.
#
# Example:
#
# JSON structure:
#
# "claims":
# {
# "P16":
# [
# {
# "mainsnak":
# {
# "snaktype": "value",
# "property": "P16",
# "datavalue":
# {
# "value":
# {
# "entity-type": "item",
# "numeric-id": 17,
# "id": "Q17"
# },
# "type": "wikibase-entityid"
# },
# "datatype": "wikibase-item"
# },
# "type": "statement",
# "id": "Q18$37029DB4-8D1C-4F47-BCBB-26F0C41F1046",
# "rank": "normal"
# }
# ],
# // ... etc. ...
# },
#
#     instance_of = get_first_instance_of(claims) # => 17
def get_first_instance_of(claims)
  return unless claims.instance_of?(Hash)
  return unless claims[ID_INSTANCE_OF]
  return if claims[ID_INSTANCE_OF].empty?
  # each claim property is an array, get the first one
  claim = claims[ID_INSTANCE_OF].first
  # claim.dig('mainsnak', 'datavalue', 'value', 'numeric-id')
  get_value_by_type(claim, 'numeric-id')
end
def has_wikidata_id(claims)
  return unless claims.instance_of?(Hash)
  return unless claims[ID_WIKIDATA_QID]
  return if claims[ID_WIKIDATA_QID].empty?
  return true
end
def get_first_wikidata_id(claims)
  return unless has_wikidata_id(claims)
  # each claim property is an array, get the first one
  claim = claims[ID_WIKIDATA_QID].first
  # claim.dig('mainsnak', 'datavalue', 'value')
  get_value_by_type(claim, 'value')
end
This function helps avoid Ruby run-time crashes when the export contains empty, nil, or wrongly typed values.

def has_external_uri(claims)
  return unless claims.instance_of?(Hash)
  return unless claims[ID_EXTERNAL_URI]
  return if claims[ID_EXTERNAL_URI].empty?
  return true
end
def get_first_external_uri(claims)
  return unless has_external_uri(claims)
  # each claim property is an array, get the first one
  claim = claims[ID_EXTERNAL_URI].first
  # claim.dig('mainsnak', 'datavalue', 'value')
  get_value_by_type(claim, 'value')
end
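The `has_*`/`get_first_*` pairs all follow one pattern: verify the claim array exists and is non-empty, then read only its first entry. A minimal sketch of that pattern, using a hand-built claims hash (the Getty URI is copied from the sample output; the single-entry `P44` array is invented for illustration):

```ruby
ID_EXTERNAL_URI = "P44"

# Helper repeated from above so the snippet is self-contained.
def get_value_by_type(data, type)
  return unless data.instance_of?(Hash)
  datavalue_hash = data['mainsnak'] || data
  return unless datavalue_hash
  value = datavalue_hash.dig('datavalue', 'value') || {}
  return value if type == 'value'
  value[type]
end

# Guard: true only when the claims hash has a non-empty P44 array
def has_external_uri(claims)
  return unless claims.instance_of?(Hash)
  return unless claims[ID_EXTERNAL_URI]
  return if claims[ID_EXTERNAL_URI].empty?
  return true
end

# Getter: relies on the guard, then reads the first claim's value
def get_first_external_uri(claims)
  return unless has_external_uri(claims)
  claim = claims[ID_EXTERNAL_URI].first
  get_value_by_type(claim, 'value')
end

claims = {
  'P44' => [
    { 'mainsnak' => { 'datavalue' => { 'value' => 'http://vocab.getty.edu/aat/300203630' } } }
  ]
}

puts get_first_external_uri(claims)        # => the Getty URI
puts get_first_external_uri({}).inspect    # => nil (guard short-circuits)
puts get_first_external_uri('bad').inspect # => nil (not a Hash)
```

Because every early `return` yields `nil`, callers can chain these safely without rescuing exceptions.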
dir = File.dirname __FILE__
importJSONfile = File.expand_path 'export-dev-0302.json', dir
sampleOutput = true
data = JSON.load_file importJSONfile
item_LABELS = {}
item_URIS = {}
## Loop through every item from the Wikibase JSON export to populate item_LABELS and item_URIS
data.each do |item|
  ## item.keys = ["type", "id", "labels", "descriptions", "aliases", "claims", "sitelinks", "lastrevid"]
  ## Retrieve the item ID (value)
  item_ID = item["id"]
  ## Retrieve the item claims (deep array)
  item_CLAIMS = item["claims"]
  ## Retrieve the ID_INSTANCE_OF (deep dig into claims via get_first_instance_of method)
  item_INSTANCE_OF = get_first_instance_of item_CLAIMS
  ## Unlikely, but if there are no claims, skip to the next item
  next if item_CLAIMS.empty?
  next if item_INSTANCE_OF.nil?
  # Wikibase items with an ID_INSTANCE_OF = Q4-Q17 contain "lookup values" that we want to use when constructing the Solr item
  if item_INSTANCE_OF.between?(4, 17)
    # Construct reference arrays for filling in Q-entity values in the main loop
    # Labels are the text string values associated with every item, which we often use in the Solr item values
    item_LABELS[item_ID] = item["labels"]["en"]["value"]
    # URIs are the Linked Data entity URLs, generally terms from Linked Data authorities such as VIAF
    item_URIS[item_ID] = get_first_external_uri(item_CLAIMS) if has_external_uri(item_CLAIMS)
    # Wikidata ID properties are not stored with the full URL, so we have to prepend a base URL to the stored value
    item_URIS[item_ID] = "https://www.wikidata.org/wiki/" + get_first_wikidata_id(item_CLAIMS) if has_wikidata_id(item_CLAIMS)
  end
end
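After this first pass, the two lookup hashes map internal Q-entity IDs to display labels and authority URIs. A sketch of their shape, with the `Q38` and `Q118` entries copied from the sample output shown later in this post:

```ruby
# Shape of the lookup hashes after the first pass; entries are copied from
# the sample output further below.
item_LABELS = { 'Q38' => 'Former owner', 'Q118' => 'Latin' }
item_URIS   = {
  'Q38'  => 'http://vocab.getty.edu/aat/300203630',
  'Q118' => 'https://www.wikidata.org/wiki/Q397'
}

# In the main loop, a Q-entity reference hash resolves to a label and URI
# by its 'id' key (the 118 numeric-id below is illustrative):
qualifier_value = { 'entity-type' => 'item', 'numeric-id' => 118, 'id' => 'Q118' }
puts item_LABELS[qualifier_value['id']] # => Latin
puts item_URIS[qualifier_value['id']]   # => https://www.wikidata.org/wiki/Q397
```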
## Loop through every item from the Wikibase JSON export to generate Solr items
data.each do |item|
  ## item.keys = ["type", "id", "labels", "descriptions", "aliases", "claims", "sitelinks", "lastrevid"]
  ## Retrieve the item ID (value)
  item_ID = item["id"]
  ## Retrieve the item claims (deep array)
  item_CLAIMS = item["claims"]
  ## Retrieve the ID_INSTANCE_OF (deep dig into claims via get_first_instance_of method)
  item_INSTANCE_OF = get_first_instance_of item_CLAIMS
  if sampleOutput
    puts "---"
    puts "Wikibase item ID: #{item_ID}"
    puts "Item instance: #{item_INSTANCE_OF}"
  end
  ## Unlikely, but if there are no claims, skip to the next item
  next if item_CLAIMS.empty?
  next if item_INSTANCE_OF.nil?
  # Wikibase items with an ID_INSTANCE_OF = Q1-Q3 contain the manuscript data that we want to use when constructing the Solr item
  next unless item_INSTANCE_OF.between?(1, 3)
  # The Wikibase item claims array contains an arbitrary list of property IDs (P1-P47)
  item_CLAIMS.each_key do |propertyID|
    # Each property ID has an array with zero, one, or many values
    item_PROPERTY_ARRAY = item_CLAIMS.dig propertyID
    # Skip ahead if the array has zero elements/data in it
    next if item_PROPERTY_ARRAY.nil?
    # Loop through each instance of a property
    item_PROPERTY_ARRAY.each do |propertyInstance|
      # Retrieve the actual text string value (or Q-entity reference) that we use for the Solr item
      item_PROPERTY_VALUE = propertyInstance&.dig "mainsnak", "datavalue", "value"
      # When the retrieved value is a hash/array, that means we have to dig one level further to retrieve the value we want
      if item_PROPERTY_VALUE.is_a?(Hash)
        # INSERT BUSINESS LOGIC FOR SPECIAL CASES WHERE THE PROPERTY ID VALUE IS NOT A TEXT STRING
      end
      puts "#{propertyID} = #{item_PROPERTY_VALUE}" if sampleOutput
      # Each property ID may be further described by an array of qualifiers, which are properties
      item_PROPERTY_QUALIFIERS = propertyInstance.dig "qualifiers"
      # Skip ahead if there are no qualifiers
      next if item_PROPERTY_QUALIFIERS.nil?
      # Loop through each qualifier ID in the qualifier array
      item_PROPERTY_QUALIFIERS.each do |qualifier, qualifierArray| # qualifier => qualifierArray
        # Each qualifier may have multiple instances of data within it, e.g. multiple authors
        qualifierArray.each do |qualifierInstance|
          # Retrieve the actual text string value (or Q-entity reference) that we use for the Solr item
          item_PROPERTY_QUALIFIER_VALUE = qualifierInstance&.dig "datavalue", "value"
          # only hash values carry a Q-entity reference; a bare string has no "id"
          item_PROPERTY_QUALIFIER_VALUE_ID = item_PROPERTY_QUALIFIER_VALUE.is_a?(Hash) ? item_PROPERTY_QUALIFIER_VALUE["id"] : nil
          # check if the value_id is nil, otherwise it will cause an error
          unless item_PROPERTY_QUALIFIER_VALUE_ID.nil?
            item_PROPERTY_QUALIFIER_VALUE_LABEL = item_LABELS[item_PROPERTY_QUALIFIER_VALUE_ID]
            item_PROPERTY_QUALIFIER_VALUE_URI = item_URIS[item_PROPERTY_QUALIFIER_VALUE_ID]
            puts "^-- #{qualifier} = #{item_PROPERTY_QUALIFIER_VALUE_ID} > #{item_PROPERTY_QUALIFIER_VALUE_LABEL} < #{item_PROPERTY_QUALIFIER_VALUE_URI}" if sampleOutput
          end
          # INSERT BUSINESS LOGIC
          # USE PROPERTY CONSTANTS INSTEAD OF P-VALUES
          # STANDARD CASE AND THEN EXCEPTIONS
        end
      end
      # INSERT BUSINESS LOGIC WHEN NO QUALIFIERS
      # USE PROPERTY CONSTANTS INSTEAD OF P-VALUES
      # STANDARD CASE AND THEN EXCEPTIONS
    end
  end
end
The sample output shows the internal Wikibase item ID, the ID_INSTANCE_OF, all instances of properties in the claims, all values (without any additional business logic to translate hashes), and all sub-properties (qualifiers) of a property, including the internally referenced Wikibase item ID and its associated URL.
If every property or qualifier value were a string, rather than either a string or a hash, and if the DS2.0 requirements did not include combination, translation, modification, and/or generation of new fields based on the provided data, then the script could stop here and would be relatively straightforward.
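That string-versus-hash split is visible in the sample output: P8 yields a plain string, while P6 yields a Q-entity reference hash that still needs a lookup. A minimal sketch of the branch, with both values copied from the sample output below:

```ruby
# Two property values copied from the sample output: P8 (shelfmark) is a
# plain string; P6 (holding status) is a Q-entity reference hash.
values = {
  'P8' => 'CA 37',
  'P6' => { 'entity-type' => 'item', 'numeric-id' => 4, 'id' => 'Q4' }
}

values.each do |property_id, value|
  if value.is_a?(Hash)
    # hash values need further business logic, e.g. a label lookup on value['id']
    puts "#{property_id}: look up #{value['id']}"
  else
    # string values can go straight into the Solr item
    puts "#{property_id}: use #{value}"
  end
end
# prints:
#   P8: use CA 37
#   P6: look up Q4
```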
Wikibase item ID: Q942
Item instance: 2
P16 = {"entity-type"=>"item", "numeric-id"=>2, "id"=>"Q2"}
P38 = {"time"=>"+2023-03-02T00:00:00Z", "timezone"=>0, "before"=>0, "after"=>0, "precision"=>11, "calendarmodel"=>"http://www.wikidata.org/entity/Q1985727"}
P5 = Conception Abbey and Seminary
^-- P4 = Q821 > Conception Abbey and Seminary < https://www.wikidata.org/wiki/Q30257935
P6 = {"entity-type"=>"item", "numeric-id"=>4, "id"=>"Q4"}
P8 = CA 37
---
Wikibase item ID: Q943
Item instance: 1
P1 = DS135
P16 = {"entity-type"=>"item", "numeric-id"=>1, "id"=>"Q1"}
P2 = {"entity-type"=>"item", "numeric-id"=>942, "id"=>"Q942"}
---
Wikibase item ID: Q944
Item instance: 3
P10 = Bible, O.T.
^-- P11 = Q809 > Bible <
^-- P11 = Q810 > Old Testament <
P14 = Brought from the library of Engelberg Abbey in Switzerland to Conception Abbey sometime before 1900.
^-- P15 = Q38 > Former owner < http://vocab.getty.edu/aat/300203630
^-- P17 = Q822 > Engelberg Abbey < https://www.wikidata.org/wiki/Q667880
P16 = {"entity-type"=>"item", "numeric-id"=>3, "id"=>"Q3"}
P21 = Latin
^-- P22 = Q118 > Latin < https://www.wikidata.org/wiki/Q397
P23 = s. XII, 1100-1199
^-- P24 = Q94 > twelfth century (dates CE) < http://vocab.getty.edu/aat/300404504
P26 = {"entity-type"=>"item", "numeric-id"=>15, "id"=>"Q15"}
P27 = Germany
^-- P28 = Q111 > Germany < http://vocab.getty.edu/tgn/7000084
P29 = Script, One fragment, recto & verso: Gothic book hand.
P3 = {"entity-type"=>"item", "numeric-id"=>943, "id"=>"Q943"}
P32 = One fragment, recto & verso: Latin.
P32 = One fragment, recto & verso: Recto, I Kings 3:17; verso, 1 Kings 17:5-6.
P33 = We thank Michael W. Heil for his work in making this description available.
P34 = {"time"=>"+2023-03-02T00:00:00Z", "timezone"=>0, "before"=>0, "after"=>0, "precision"=>11, "calendarmodel"=>"http://www.wikidata.org/entity/Q1985727"}
P35 = {"time"=>"+2023-03-02T00:00:00Z", "timezone"=>0, "before"=>0, "after"=>0, "precision"=>11, "calendarmodel"=>"http://www.wikidata.org/entity/Q1985727"}
P41 = https://iiif.archivelab.org/iiif/images_CA37_15/manifest.json