Update README to reflect Transformer updates
mdholloway committed Jan 23, 2025
1 parent 694094d commit 62b0878
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions doc/overview.md
@@ -2,12 +2,12 @@

For a general description of the Wikibase data model, see [Wikibase/DataModel](https://www.mediawiki.org/wiki/Wikibase/DataModel) on mediawiki.org.

The Digital Scriptorium Wikibase data export is a JSON-formatted array of Wikibase entities. The bulk of the entities in the export consist of triplets that together form a meta-record, with one each of the DS Catalog core model types: manuscripts, holdings, and records. The export also contains entities representing property definitions and authoritative references to common topics.
The Digital Scriptorium Wikibase data export is a JSON-formatted array of Wikibase entities. The bulk of the entities in the export consist of the DS Catalog core model types: manuscripts, holdings, and records. The export also contains entities representing property definitions and authoritative references to common topics.

The [ExportRepresenter](../lib/digital_scriptorium/export_representer.rb) class can be used to deserialize an export in its entirety. The resulting [Export](../lib/digital_scriptorium/export.rb) object is essentially an array of Item and Property objects. Entities in the export are modeled using domain-specific classes provided by the [wikibase_representable](https://rubygems.org/gems/wikibase_representable) gem, such as Items, Properties, Statements (also known as Claims), and Snaks, which represent both the main value of a statement and any qualifier values. Convenience methods are also provided to facilitate extracting data values.

The conversion script [wikibase_to_solr_new.rb](../wikibase_to_solr_new.rb) proceeds by deserializing the export and converting the resulting array of Wikibase objects to a hash keyed by entity ID. It then iterates over the elements of the hash. When it finds a record item, based on the value of its instance-of (P16) claim, it retrieves the linked manuscript item, as well as the holding item linked in turn to the manuscript item, from the export hash by entity ID. It then iterates over the claims attached to the manuscript, holding, and record in turn, extracting the requested Solr fields based on each claim's property ID and adding them to the Solr record to be produced for the meta-record. Claims for most properties are transformed to Solr fields using a generic algorithm implemented in [ClaimTransformer](../lib/digital_scriptorium/claim_transformer.rb). Name and date claims require special handling and are processed by dedicated claim transformer classes ([NameClaimTransformer](../lib/digital_scriptorium/name_claim_transformer.rb) and [DateClaimTransformer](../lib/digital_scriptorium/date_claim_transformer.rb), respectively). After all claims from the manuscript, holding, and record have been processed, the resulting Solr record is written to the output file.
The conversion script [wikibase_to_solr.rb](https://github.com/mdholloway/hxs-blacklight/blob/main/lib/wikibase_to_solr.rb) proceeds by deserializing the export and converting the resulting array of Wikibase objects to a hash keyed by entity ID. It then iterates over the elements of the hash. When it finds a record item, based on the value of its instance-of (P16) claim, it retrieves the linked manuscript item from the export hash by entity ID. From the manuscript it retrieves, in turn, the ID of the item containing current holding information, and fetches that from the export hash as well. With the manuscript, current holding, and record items obtained, it iterates over the claims of each, extracting the requested Solr fields based on each claim's property ID and adding them to the Solr record to be produced. After all claims from the manuscript, holding, and record have been processed, the resulting Solr record is written to the output file. The script is written so as not to rely on the structure of the export file beyond its being a JSON array of all entities in the DS 2.0 Wikibase, with record items linked to manuscript items by P3 (described manuscript) claims and manuscript items linked to holding items by P2 (holding) claims.
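The hash-building and link-following steps above can be sketched as follows. This is a minimal illustration, not the script itself: the miniature export and the string values of the P16, P3, and P2 claims are invented stand-ins for the real Wikibase claim structures.

```ruby
require 'json'

# Hypothetical miniature export: a record item links to a manuscript via
# P3 (described manuscript), and the manuscript links to its holding via
# P2 (holding). Real claims are nested Wikibase structures; plain strings
# are used here for brevity.
export_json = <<~JSON
  [
    {"id": "Q101", "claims": {"P16": "record", "P3": "Q102"}},
    {"id": "Q102", "claims": {"P16": "manuscript", "P2": "Q103"}},
    {"id": "Q103", "claims": {"P16": "holding"}}
  ]
JSON

# Convert the array of entities into a hash keyed by entity ID.
entities = JSON.parse(export_json).to_h { |entity| [entity['id'], entity] }

# Find each record item by its instance-of (P16) claim, then follow the
# P3 and P2 links to assemble the record/manuscript/holding triplet.
meta_records = entities.each_value.filter_map do |entity|
  next unless entity.dig('claims', 'P16') == 'record'

  manuscript = entities[entity.dig('claims', 'P3')]
  holding    = entities[manuscript.dig('claims', 'P2')]
  { record: entity, manuscript: manuscript, holding: holding }
end

puts meta_records.first[:holding]['id'] # prints "Q103"
```

Keying the export by entity ID first makes each link lookup a constant-time hash access instead of a scan over the whole array.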

The specific Solr fields produced for each claim are controlled by the configuration file [property_config.yml](../property_config.yml). This file also defines the prefix (representing the property name) to be attached to each field for a given property, and whether a claim based on the property might have a related authority qualifier.
Solr field extraction logic is encapsulated in the Transformer classes. The [BaseClaimTransformer](../lib/digital_scriptorium/transformers/base_claim_transformer.rb) class sets out the basic contract, which consists of three methods: `display_values`, `search_values`, and `facet_values`. These return the collections of values to be included in the `_display`, `_search`, and `_facet` fields for the claim in the Solr object. For a title (P10) claim, for example, they would return the values to be used in the `title_display`, `title_search`, and `title_facet` fields. The remaining Transformer classes build on BaseClaimTransformer in various ways. For some claim types, the Transformer simply extracts the recorded value and returns it from one or more of the `_values` methods. For other claim types, a claim is expected to be qualified with a representation of the recorded value in its original script, or with references to a standard title or a value from an authority file; this logic is contained in the [QualifiedClaimTransformer](../lib/digital_scriptorium/transformers/qualified_claim_transformer.rb) class. For these claim types, the standard title or authority-file value is returned in the `facet_values` collection. For claim types where the recorded value should be provided as a facet value in the absence of a qualifier, the [QualifiedClaimTransformerWithFacetFallback](../lib/digital_scriptorium/transformers/qualified_claim_transformer_with_facet_fallback.rb) class is provided. Finally, the [LinkClaimTransformer](../lib/digital_scriptorium/transformers/link_claim_transformer.rb) class handles a couple of claim types for which the value to be extracted is a URL.
The [Transformers](../lib/digital_scriptorium/transformers.rb) class maps claim property IDs to Transformer classes, defines the property-name prefixes used in the Solr field names, and provides the factory methods used by the conversion script to obtain Transformers as it iterates over claims.
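The three-method contract and the facet-fallback behavior described above can be sketched in plain Ruby. The method names (`display_values`, `search_values`, `facet_values`) come from the text; the claim shape and the class bodies are simplified illustrations, not the library's actual implementation.

```ruby
# Minimal sketch of the BaseClaimTransformer contract. A real claim is a
# Wikibase Statement; here it is reduced to a hash for illustration.
class BaseClaimTransformer
  def initialize(claim)
    @claim = claim
  end

  # Each method returns the values destined for the corresponding
  # _display, _search, or _facet Solr field.
  def display_values
    [@claim[:value]]
  end

  def search_values
    [@claim[:value]]
  end

  def facet_values
    [@claim[:value]]
  end

  # Assemble the Solr fields for a property-name prefix such as "title".
  def solr_fields(prefix)
    {
      "#{prefix}_display" => display_values,
      "#{prefix}_search"  => search_values,
      "#{prefix}_facet"   => facet_values
    }
  end
end

# In the spirit of QualifiedClaimTransformerWithFacetFallback: prefer an
# authority-file label for faceting, falling back to the recorded value
# when no qualifier is present.
class FallbackFacetTransformer < BaseClaimTransformer
  def facet_values
    [@claim[:authority_label] || @claim[:value]]
  end
end

fields = FallbackFacetTransformer
         .new(value: 'Book of Hours', authority_label: 'Books of hours')
         .solr_fields('title')
puts fields['title_facet'].first # prints "Books of hours"
```

Because subclasses override only the `_values` methods they need, `solr_fields` can stay in the base class and every transformer produces the same trio of fields.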

The script was written so as not to rely on the structure of the export file beyond that it will be a JSON array consisting of all entities in the DS 2.0 Wikibase, with record items linked to manuscript items and manuscript items linked to holding items by P3 (described manuscript) and P2 (holding) claims respectively.
The `_search` and `_facet` fields contain values pulled directly from the source Wikibase data. The `_display` field values are serialized JSON objects used to support linked data bars beneath recorded values when viewing item details in the Catalog. Each object contains a `recorded_value` property with the recorded value and, where present, an `original_script` property with the value in its original script. Additionally, where the recorded value is qualified by one or more qualifiers, the object contains a `linked_terms` array supporting one or more linked data bars beneath the recorded value. Each object in this array contains a `label` property that can be passed to a faceted search, as well as an optional `source_url` property linking to an external vocabulary item.
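A `_display` object matching the structure described above might be assembled like this. The property names (`recorded_value`, `original_script`, `linked_terms`, `label`, `source_url`) come from the text; the place name and vocabulary URL are invented examples.

```ruby
require 'json'

# Hypothetical _display value for a place claim. `compact` drops the nil
# original_script entry, mirroring the "where present" optionality.
display_object = {
  'recorded_value'  => 'Paris',
  'original_script' => nil, # present only when an original-script qualifier exists
  'linked_terms'    => [
    { 'label'      => 'Paris (France)',            # passed to a faceted search
      'source_url' => 'https://example.org/vocab/paris' } # external vocabulary link
  ]
}.compact

place_display = JSON.generate(display_object)
parsed = JSON.parse(place_display)
puts parsed['linked_terms'].first['label'] # prints "Paris (France)"
```

Serializing the object into the Solr field keeps the display structure opaque to Solr while letting the Catalog front end parse it back out to render the linked data bars.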
