diff --git a/doc/overview.md b/doc/overview.md
index eb83ab9..5df8919 100644
--- a/doc/overview.md
+++ b/doc/overview.md
@@ -6,8 +6,8 @@ The Digital Scriptorium Wikibase data export is a JSON-formatted array of Wikiba
 The [ExportRepresenter](../lib/digital_scriptorium/export_representer.rb) class can be used to deserialize an export in its entirety. The resulting [Export](../lib/digital_scriptorium/export.rb) object is essentially an array of Item and Property objects. Entities in the export are modeled using domain-specific classes provided by the [wikibase_representable](https://rubygems.org/gems/wikibase_representable) gem, such as Items, Properties, Statements (also known as Claims), and Snaks, which represent the primary claim of any statement as well as any qualifiers. Convenience methods are also provided to facilitate extracting data values.
 
-The conversion script [wikibase_to_solr_new.rb](../wikibase_to_solr_new.rb) proceeds by deserializing the export and converting the resulting array of Wikibase objects to a hash keyed by entity ID. It then iterates over elements of the hash. When it finds a record item (as identified by its "instance of" (P16) claim) it retrieves the linked manuscript item, as well as the holding item linked in turn to the manuscript item, from the export hash by entity ID. It then iterates over the claims attached to each of the three, extracting the Solr fields requested based on the property ID that is the subject of the claim. Claims for most properties are transformed to Solr fields using a generic algorithm implemented in [ClaimTransformer](../lib/digital_scriptorium/claim_transformer.rb). Name and date claims require some special handling, and are handled in dedicated claim transformer classes ([NameClaimTransformer](../lib/digital_scriptorium/name_claim_transformer.rb) and [DateClaimTransformer](../lib/digital_scriptorium/date_claim_transformer.rb) respectively). After all claims from the manuscript, holding, and record have been processed, the Solr record is written to the output file.
+The conversion script [wikibase_to_solr_new.rb](../wikibase_to_solr_new.rb) proceeds by deserializing the export and converting the resulting array of Wikibase objects to a hash keyed by entity ID. It then iterates over elements of the hash. When it finds a record item (as identified by its instance-of (P16) claim), it retrieves the linked manuscript item, as well as the holding item linked in turn to the manuscript item, from the export hash by entity ID. It then iterates over the claims attached to each of the three, extracting the requested Solr fields based on each claim's property ID. Claims for most properties are transformed to Solr fields using a generic algorithm implemented in [ClaimTransformer](../lib/digital_scriptorium/claim_transformer.rb). Name and date claims require special handling and are processed by dedicated claim transformer classes ([NameClaimTransformer](../lib/digital_scriptorium/name_claim_transformer.rb) and [DateClaimTransformer](../lib/digital_scriptorium/date_claim_transformer.rb), respectively). After all claims from the manuscript, holding, and record have been processed, the Solr record is written to the output file.
 
 The specific Solr fields produced for each claim are controlled by the configuration file [property_config.yml](../property_config.yml).
 This file also defines the prefix (representing the property name) to be attached to each field for a given property, and whether a claim based on the property might have a related authority qualifier.
 
-The script was written so as not to rely on the structure of the Wikibase export file beyond that it will be a JSON array of Wikibase entities, with manuscripts, holdings, and records linked by P16 claims. That being said, with certain guarantees about the structure of the export file, we could avoid loading it into memory and deserializing it all at once, which would greatly improve efficiency.
+The script was written so as not to rely on the structure of the Wikibase export file beyond the fact that it will be a JSON array of Wikibase entities, with manuscripts, holdings, and records linked by P2 (holding) and P3 (described manuscript) claims. That being said, with certain guarantees about the structure of the export file, we could avoid loading it into memory and deserializing it all at once, which would greatly improve efficiency.
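As a rough illustration of the conversion loop described in the updated paragraph, the sketch below follows the same flow: deserialize the export, index entities by ID, find record items via their P16 claim, follow the P3 and P2 links to the manuscript and holding, and transform each claim according to property_config.yml. It is a sketch only: the `RECORD_QID` constant, the `linked_id` helper, the snak accessor chain, the shape of `claims` as a hash keyed by property ID, and the `ClaimTransformer.transform` call are hypothetical stand-ins, not the actual APIs of wikibase_to_solr_new.rb or the wikibase_representable gem.

```ruby
# Rough sketch of the conversion loop; illustrative only, not the real implementation.
# RECORD_QID, linked_id, the snak accessor chain, and ClaimTransformer.transform
# are assumptions standing in for the actual method names.
require 'json'
require 'yaml'
require 'digital_scriptorium' # assumed entry point for the gem's classes

include DigitalScriptorium

RECORD_QID      = 'Q_record' # placeholder for the QID that record items' P16 claims point to
PROPERTY_CONFIG = YAML.load_file('property_config.yml')

# Hypothetical helper: the entity ID referenced by the first claim on a property.
def linked_id(entity, property_id)
  claim = entity.claims[property_id]&.first
  claim&.main_snak&.data_value&.value&.id # assumed accessor names
end

export = ExportRepresenter.new(Export.new).from_json(File.read('export.json'))

# Index every entity by its ID ("Q123", "P16", ...). Export is treated here as an
# Enumerable of items and properties, per the description in the overview.
entities = export.to_h { |entity| [entity.id, entity] }

solr_docs = entities.each_value.filter_map do |item|
  next unless linked_id(item, 'P16') == RECORD_QID # only process record items

  manuscript = entities[linked_id(item, 'P3')]       # P3: described manuscript (record -> manuscript)
  holding    = entities[linked_id(manuscript, 'P2')] # P2: holding (manuscript -> holding)

  doc = {}
  [holding, manuscript, item].each do |entity|
    entity.claims.each do |property_id, claims|
      next unless (config = PROPERTY_CONFIG[property_id]) # skip properties absent from property_config.yml

      claims.each do |claim|
        # Generic transformation; name and date claims would instead be routed to
        # NameClaimTransformer / DateClaimTransformer.
        doc.merge!(ClaimTransformer.transform(claim, config)) { |_field, a, b| a | b } # assumes multi-valued array fields
      end
    end
  end
  doc
end

File.write('solr.json', JSON.pretty_generate(solr_docs))
```

A version of this loop that parses the JSON array incrementally, rather than deserializing the whole export up front, is the kind of efficiency improvement the final paragraph of the diff alludes to.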