diff --git a/.ruby-version b/.ruby-version index 860487ca1..ef538c281 100644 --- a/.ruby-version +++ b/.ruby-version @@ -1 +1 @@ -2.7.1 +3.1.2 diff --git a/CHANGELOG.md b/CHANGELOG.md index ec95a464f..1087dfee2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -49,15 +49,49 @@ Versioning](https://semver.org/spec/v2.0.0.html). - minor test for Datura::Helpers.date_standardize - documentation for web scraping - documentation for CsvToEs (transforming CSV files and posting to elasticsearch) +- documentation for adding new ingest formats to Datura +- byebug gem for debugging - instructions for installing Javascript Runtime files for Saxon +- API schema can either be the original 1.0 or the newly updated 2.0 (which adds new fields, including nested fields); 1.0 is used by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo: +``` +api_version: '2.0' +``` +See the new schema (2.0) documentation [here](https://github.com/CDRH/datura/blob/main/docs/schema_v2.md) +- schema validation with API version 2.0: invalidly constructed documents will not post +- authentication with Elasticsearch 8.5 +- field overrides for new fields in the new API schema +- functionality to transform EAD files and post them to elasticsearch +- functionality to transform PDF files (including text and metadata) and post them to elasticsearch +- limiting the `text` field to a specified length: `text_limit` in `public.yml` or `private.yml` +- configuration options related to Elasticsearch, including `es_schema_override` and `es_schema_path` to change the location of the Elasticsearch schema +- more detailed errors, including a stack trace ### Changed +- update Ruby to 3.1.2 - date_standardize now relies on strftime instead of manual zero padding for month, day - minor corrections to documentation - XPath: "text" is now ingested as an array and will be displayed delimited by spaces +- "text" field now includes "notes" XPath +- refactored posting script (`Datura.run`) +- refactored command line methods into the elasticsearch library +- refactored and moved date_standardize and date_display helper methods +- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays if there are no matches. Fields have been changed to check for these nil values ### Migration - check to make sure "text" xpath is doing desired behavior +- use Elasticsearch 8.5 or higher and add authentication if security is enabled. Add the following to `public.yml` or `private.yml` in the data repo: +``` + es_user: username + es_password: ******** +``` +- upgrade data repos to Ruby 3.1.2 +- add api version to config as described above +- make sure fields are consistent with the API schema; many have been renamed or changed in format +- add nil checks with get_text and get_list methods as needed +- add EadToEs overrides if ingesting EAD files +- add `byebug` and `pdf-reader` to Gemfile in repos based on Datura +- if overriding the `read_csv` method in `lib/datura/file_type.rb`, the hash must be prefixed with ** (`**{}`); a short example sketch follows the README section below. ## [v0.2.0-beta](https://github.com/CDRH/datura/compare/v0.1.6...v0.2.0-beta) - 2020-08-17 - Altering field and xpath behavior, adds get_elements @@ -68,6 +102,8 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- Tests and fixtures for all supported formats except CustomToEs - `get_elements` returns nodeset given xpath arguments - `spatial` nested fields `spatial.type` and `spatial.title` +- Versioning system to support multiple elasticsearch schemas +- Validator to check against the elasticsearch copy ### Changed - Arguments for `get_text`, `get_list`, and `get_xpaths` @@ -76,12 +112,14 @@ Versioning](https://semver.org/spec/v2.0.0.html). - Documentation updated - Changed Install instructions to include RVM and gemset naming conventions - API field `coverage_spatial` is now just `spatial` +- refactored executables into modules and classes ### Migration - Change `coverage_spatial` nested field to `spatial` - `get_text`, `get_list`, and `get_xpaths` require changing arguments to keyword (like `xml` and `keep_tags`) - Recommend checking xpaths and behavior of fields after updating to this version, as some defaults have changed - Possible to refactor previous FileCsv overrides to use new CsvToEs abilities, but not necessary +- Config files should specify `api_version` as 1.0 or 2.0 ## [v0.1.6](https://github.com/CDRH/datura/compare/v0.1.5...v0.1.6) - 2020-04-24 - Improvements to CSV, WEBS transformers and adds Custom transformer diff --git a/Gemfile.lock b/Gemfile.lock index 77684b7e4..887e9f17b 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -1,40 +1,57 @@ PATH remote: . specs: - datura (0.2.0.pre.beta) + datura (0.2.0) + byebug (~> 11.0) colorize (~> 0.8.1) - nokogiri (~> 1.8) - rest-client (~> 2.0.2) + nokogiri (~> 1.10) + pdf-reader (~> 2.12) + rest-client (~> 2.1) GEM remote: https://rubygems.org/ specs: + Ascii85 (1.1.1) + afm (0.2.2) + bigdecimal (3.1.8) + byebug (11.1.3) colorize (0.8.1) - domain_name (0.5.20190701) - unf (>= 0.0.5, < 1.0.0) - http-cookie (1.0.5) + domain_name (0.6.20240107) + hashery (2.1.2) + http-accept (1.7.0) + http-cookie (1.0.7) domain_name (~> 0.5) - mime-types (3.4.1) + mime-types (3.5.2) mime-types-data (~> 3.2015) - mime-types-data (3.2022.0105) - mini_portile2 (2.8.0) - minitest (5.15.0) + mime-types-data (3.2024.0903) + mini_portile2 (2.8.7) + minitest (5.16.3) netrc (0.11.0) - nokogiri (1.13.6) - mini_portile2 (~> 2.8.0) + nokogiri (1.16.7) + mini_portile2 (~> 2.8.2) racc (~> 1.4) - racc (1.6.0) + nokogiri (1.16.7-x86_64-darwin) + racc (~> 1.4) + pdf-reader (2.12.0) + Ascii85 (~> 1.0) + afm (~> 0.2.1) + hashery (~> 2.0) + ruby-rc4 + ttfunk + racc (1.8.1) rake (13.0.6) - rest-client (2.0.2) + rest-client (2.1.0) + http-accept (>= 1.7.0, < 2.0) http-cookie (>= 1.0.2, < 2.0) mime-types (>= 1.16, < 4.0) netrc (~> 0.8) - unf (0.1.4) - unf_ext - unf_ext (0.0.8.1) + ruby-rc4 (0.1.5) + ttfunk (1.8.0) + bigdecimal (~> 3.1) PLATFORMS ruby + x86_64-darwin-20 DEPENDENCIES bundler (>= 1.16.0, < 3.0) @@ -43,4 +60,4 @@ DEPENDENCIES rake (~> 13.0) BUNDLED WITH - 2.1.4 + 2.2.33 diff --git a/README.md b/README.md index 5997622ee..55baa376f 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ Looking for information about how to post documents? Check out the ## Install / Set Up Data Repo -Check that Ruby is installed, preferably 2.7.x or up. If you are using RVM, see the RVM section below. +Check that Ruby is installed, preferably 3.1.2 or up. If you are using RVM, see the RVM section below. If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension). 
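A note on the `read_csv` migration item in the CHANGELOG above: Ruby 3 separates keyword arguments from positional hashes, so an options hash handed to `CSV.read` must be double-splatted rather than passed as a trailing hash. A minimal sketch of what an override in a data repo might look like (the method name comes from the changelog; the specific CSV options shown are illustrative assumptions, not Datura defaults):

```ruby
require "csv"

# Hypothetical read_csv override in a repo based on Datura. Under Ruby 3.1,
# the hash must be double-splatted (**) so CSV.read receives keyword
# arguments instead of a positional hash.
def read_csv(file_location, extra_options = {})
  default_options = {
    encoding: "utf-8",
    headers: true,
    return_headers: true
  }
  CSV.read(file_location, **default_options.merge(extra_options))
end
```

Without the `**`, Ruby 3 raises an `ArgumentError`, because `CSV.read` expects keyword arguments rather than a second positional argument.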
diff --git a/bin/admin_es_create_index b/bin/admin_es_create_index index e27997e18..94bee8ea9 100755 --- a/bin/admin_es_create_index +++ b/bin/admin_es_create_index @@ -2,18 +2,10 @@ require "datura" -params = Datura::Parser.es_create_delete_index -options = Datura::Options.new(params).all - -put_url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true") -get_url = File.join(options["es_path"], "_cat", "indices?v&pretty=true") - begin - # TODO if we want to add any default settings to the new index, - # we can do that with the payload and then use rest-client again instead of exec - # however, rest-client appears to require a payload and won't allow simple "PUT" with none - puts "Creating new ES index: #{put_url}" - exec("curl -XPUT #{put_url}") + es = Datura::Elasticsearch::Index.new + es.create + es.set_schema rescue => e - puts "Error: #{e.inspect}" + puts e end diff --git a/bin/admin_es_delete_index b/bin/admin_es_delete_index index 76299afd4..8de5fbb06 100755 --- a/bin/admin_es_delete_index +++ b/bin/admin_es_delete_index @@ -1,15 +1,10 @@ #!/usr/bin/env ruby require "datura" -require "rest-client" - -params = Datura::Parser.es_create_delete_index -options = Datura::Options.new(params).all - -url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true") begin - puts JSON.parse(RestClient.delete(url)) + es = Datura::Elasticsearch::Index.new + es.delete rescue => e - puts "Error with request, check that index exists before deleting: #{e}" + puts e end diff --git a/bin/es_alias_add b/bin/es_alias_add index e9c3f74d3..6be2c564a 100755 --- a/bin/es_alias_add +++ b/bin/es_alias_add @@ -2,29 +2,8 @@ require "datura" -require "json" -require "rest-client" - -params = Datura::Parser.es_alias_add -options = Datura::Options.new(params).all - -ali = options["alias"] -idx = options["index"] -url = File.join(options["es_path"], "_aliases") - -data = { - actions: [ - { remove: { alias: ali, index: "_all" } }, - { add: { alias: ali, index: idx } } - ] -} - begin - res = RestClient.post(url, data.to_json, { content_type: :json }) - puts "Results of setting alias #{ali} to index #{idx}" - puts res - list = JSON.parse(RestClient.get(url)) - puts "\nAll aliases: #{JSON.pretty_generate(list)}" + Datura::Elasticsearch::Alias.add rescue => e - puts "Error: #{e.response}" + puts e end diff --git a/bin/es_alias_delete b/bin/es_alias_delete index d12a574d2..6c6f2ade4 100755 --- a/bin/es_alias_delete +++ b/bin/es_alias_delete @@ -2,12 +2,8 @@ require "datura" -require "json" -require "rest-client" - -params = Datura::Parser.es_alias_delete -options = Datura::Options.new(params).all -url = File.join(options["es_path"], options["index"], "_alias", options["alias"]) - -res = JSON.parse(RestClient.delete(url)) -puts JSON.pretty_generate(res) +begin + Datura::Elasticsearch::Alias.delete +rescue => e + puts e +end diff --git a/bin/es_alias_list b/bin/es_alias_list index d37691626..23ad183e8 100755 --- a/bin/es_alias_list +++ b/bin/es_alias_list @@ -2,11 +2,4 @@ require "datura" -require "json" -require "rest-client" - -options = Datura::Options.new({}).all -url = File.join(options["es_path"], "_aliases") - -res = JSON.parse(RestClient.get(url)) -puts JSON.pretty_generate(res) +Datura::Elasticsearch::Alias.list diff --git a/bin/es_clear_index b/bin/es_clear_index index 2890f6230..c8534eba5 100755 --- a/bin/es_clear_index +++ b/bin/es_clear_index @@ -2,89 +2,8 @@ require "datura" -require "json" -require "rest-client" - -def confirm_basic(options, url) - # verify that the user is really sure 
about the index they're about to wipe - puts "Are you sure that you want to remove entries from" - puts " #{options["collection"]}'s #{options['environment']} environment?" - puts "url: #{url}" - puts "y/N" - answer = STDIN.gets.chomp - # boolean - return !!(answer =~ /[yY]/) -end - -def main - - # run the parameters through the option parser - params = Datura::Parser.clear_index_params - options = Datura::Options.new(params).all - if options["collection"] == "all" - clear_all(options) - else - clear_index(options) - end -end - -def build_data(options) - if options["regex"] - field = options["field"] || "identifier" - return { - "query" => { - "bool" => { - "must" => [ - { "regexp" => { field => options["regex"] } }, - { "term" => { "collection" => options["collection"] } } - ] - } - } - } - else - return { - "query" => { "term" => { "collection" => options["collection"] } } - } - end -end - -def clear_all(options) - puts "Please verify that you want to clear EVERY ENTRY from the ENTIRE INDEX\n\n" - puts "== FIELD / REGEX FILTERS NOT AVAILABLE FOR THIS OPTION, YOU'LL WIPE EVERYTHING ==\n\n" - puts "Seriously, you probably do not want to do this" - puts "Are you running this on something other than your local machine? RETHINK IT." - puts "Type: 'Yes I'm sure'" - confirm = STDIN.gets.chomp - if confirm == "Yes I'm sure" - url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true" - post url, { "query" => { "match_all" => {} } } - else - puts "You typed '#{confirm}'. This is incorrect, exiting program" - exit - end -end - -def clear_index(options) - url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true" - confirmation = confirm_basic(options, url) - - if confirmation - data = build_data(options) - post(url, data) - else - puts "come back anytime!" 
- exit - end +begin + Datura::Elasticsearch::Index.clear +rescue => e + puts e end - -def post(url, data={}) - begin - puts "clearing from #{url}: #{data.to_json}" - res = RestClient.post(url, data.to_json, {:content_type => :json}) - puts res.body - rescue => e - puts "error posting to ES: #{e.response}" - end -end - -main diff --git a/bin/es_get_schema b/bin/es_get_schema index 14e41b847..24f173c46 100755 --- a/bin/es_get_schema +++ b/bin/es_get_schema @@ -2,19 +2,9 @@ require "datura" -require "json" -require "rest-client" -require "yaml" - -params = Datura::Parser.es_set_schema_params -options = Datura::Options.new(params).all - begin - url = File.join(options["es_path"], options["es_index"], "_mapping", "_doc?pretty=true") - res = RestClient.get(url) - puts res.body - puts "environment: #{options["environment"]}" - puts "url: #{url}" + es = Datura::Elasticsearch::Index.new + puts JSON.pretty_generate(es.get_schema) rescue => e - puts "Error: #{e.response}" + puts e end diff --git a/bin/es_set_schema b/bin/es_set_schema index f40050016..6c461478d 100755 --- a/bin/es_set_schema +++ b/bin/es_set_schema @@ -2,22 +2,9 @@ require "datura" -require "json" -require "rest-client" -require "yaml" - -params = Datura::Parser.es_set_schema_params -options = Datura::Options.new(params).all -path = File.join(options["datura_dir"], options["es_schema_path"]) -schema = YAML.load_file(path) - begin - idx = options["es_index"] - - url = File.join(options["es_path"], options["es_index"], "_mapping", "_doc?pretty=true") - puts "environment: #{options["environment"]}" - puts "Setting schema: #{url}" - RestClient.put(url, schema.to_json, { :content_type => :json }) + es = Datura::Elasticsearch::Index.new + es.set_schema rescue => e - puts "Error: #{e.response}" + puts e end diff --git a/bin/setup b/bin/setup index 1a25e109c..45258adf1 100755 --- a/bin/setup +++ b/bin/setup @@ -1,6 +1,7 @@ #!/usr/bin/env ruby require "colorize" +require 'fileutils' coll = Dir.getwd diff --git a/bin/solr_clear_index b/bin/solr_clear_index index 248a709b1..4b0cc078b 100755 --- a/bin/solr_clear_index +++ b/bin/solr_clear_index @@ -2,7 +2,7 @@ require "datura" -params = Datura::Parser.clear_index_params +params = Datura::Parser.clear_index options = Datura::Options.new(params).all url = File.join(options["solr_path"], options["solr_core"], "update") diff --git a/datura.gemspec b/datura.gemspec index 763c20063..cfff7a1bd 100644 --- a/datura.gemspec +++ b/datura.gemspec @@ -54,10 +54,12 @@ Gem::Specification.new do |spec| ] spec.require_paths = ["lib"] - spec.required_ruby_version = "~> 2.5" + spec.required_ruby_version = "~> 3.1" spec.add_runtime_dependency "colorize", "~> 0.8.1" - spec.add_runtime_dependency "nokogiri", "~> 1.8" - spec.add_runtime_dependency "rest-client", "~> 2.0.2" + spec.add_runtime_dependency "nokogiri", "~> 1.10" + spec.add_runtime_dependency "rest-client", "~> 2.1" + spec.add_runtime_dependency "pdf-reader", "~> 2.12" + spec.add_runtime_dependency "byebug", "~> 11.0" spec.add_development_dependency "bundler", ">= 1.16.0", "< 3.0" spec.add_development_dependency "minitest", "~> 5.0" spec.add_development_dependency "rake", "~> 13.0" diff --git a/docs/1_setup/config.md b/docs/1_setup/config.md index fe4e7b29f..e58ccd5b5 100644 --- a/docs/1_setup/config.md +++ b/docs/1_setup/config.md @@ -9,7 +9,10 @@ default: collection: es_index es_path + es_user + es_password ``` +(The options es_user and es_password are needed if you are using a secured Elasticsearch index.) 
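To make the new authentication options concrete, here is a minimal sketch of how they nest in a `config/private.yml` (values are placeholders; `es_user` and `es_password` only matter when Elasticsearch security is enabled):

```
default:
  es_path: http://localhost:9200
  es_index: some_index
  es_user: elastic
  es_password: your_password_here
```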
If there are any settings which must be different based on the local environment (your computer vs the server), place these in `config/private.yml`. @@ -118,6 +121,8 @@ Some stuff commonly in `private.yml`: - `threads: 5` (5 recommended for PC, 50 for powerful servers) - `es_path: http://localhost:9200` - `es_index: some_index` +- `es_user: elastic` (if you want to use security on your local elasticsearch instance) +- `es_password: ******` - `solr_path: http://localhost:8983/solr` - `solr_core: collection_name` diff --git a/docs/1_setup/prepare_index.md b/docs/1_setup/prepare_index.md index 944f9a719..fa79e7013 100644 --- a/docs/1_setup/prepare_index.md +++ b/docs/1_setup/prepare_index.md @@ -13,7 +13,7 @@ You will need to make sure that somewhere, the following are being set in your p ### Step 2: Prepare Elasticsearch Index -Make sure elasticsearch is installed and running in the location you wish to push to. If there is already an index you will be using, take note of its name and skip this step. If you want to add an index, run this command with a specified environment: +Make sure elasticsearch is installed and running in the location you wish to push to. If there is already an index you will be using, take note of its name and skip this step. (Note that each index must be dedicated to data on one version of the API schema.) If you want to add an index, run this command with a specified environment: ``` admin_es_create_index -e development diff --git a/docs/4_developers/installation.md b/docs/4_developers/installation.md index 37eb521c2..1c486b6db 100644 --- a/docs/4_developers/installation.md +++ b/docs/4_developers/installation.md @@ -6,11 +6,12 @@ TODO ### Elasticsearch -TODO +Download Elasticsearch 8 [here](https://www.elastic.co/downloads/elasticsearch). ### Apache Permissions -Assuming that you place this collection in your web tree, you will need to take care to protect any sensitive information you place it in that you do not want to be accessible through the browser (copywritten material, drafts, passwords, etc). To protect the configuration files that contain information about your solr instance, you should put these lines in your main apache configuration file. If you have an older version of Apache, you may need to specify `Order deny,allow` and `Deny from all` instead of using `Require all denied`. +Assuming that you place this collection in your web tree, you will need to take care to protect any sensitive information you place in it that you do not want to be accessible through the browser (copyrighted material, drafts, passwords, etc.). To protect the configuration files that contain information about your solr instance, you should put these lines in your main apache configuration file. If you have an older version of Apache, you may need to specify `Order deny,allow` and `Deny from all` instead of using `Require all denied`. + ``` # # Prevent private.yml files that might be in the webtree from being viewable @@ -19,4 +20,5 @@ Assuming that you place this collection in your web tree, you will need to take Require all denied ``` + Otherwise, you should not need to do anything with apache assuming that you already had it set up with a webtree.
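After installing Elasticsearch 8 and enabling security, it is worth verifying connectivity before running any Datura commands. A quick check with curl, assuming a local instance and the `es_user`/`es_password` values from your config (placeholders here):

```
curl -u elastic:your_password_here http://localhost:9200
```

A JSON response describing the cluster means the credentials work; a 401 response means authentication failed.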
diff --git a/docs/4_developers/new_formats.md b/docs/4_developers/new_formats.md new file mode 100644 index 000000000..9ebdb6640 --- /dev/null +++ b/docs/4_developers/new_formats.md @@ -0,0 +1,94 @@ +# Adding new formats to Datura + +## Configuring Datura + +In `lib/datura/data_manager.rb`, `self.format_to_class` contains a hash with the format as the key and the class with format-specific methods as the value. Add your desired format to this hash, with the corresponding `FileFormat` class (e.g. `FileEad`) as the value. +``` + def self.format_to_class + { + "csv" => FileCsv, + "ead" => FileEad, + "html" => FileHtml, + "tei" => FileTei, + "vra" => FileVra, + } +end +``` +Modify `lib/datura/parser_options/post.rb` to accept parameters for the new format: +``` +options["format"] = nil + opts.on( '-f', '--format [input]', 'Restrict to one format (csv, ead, html, tei, vra, webs)') do |input| + if %w[csv ead html tei vra webs].include?(input) + options["format"] = input + else + puts "Format #{input} is not recognized.".red + puts "Allowed formats are csv, ead, html, tei, vra, and webs (web-scraped html)" + exit + end + end +``` +In the `config/public.yml` file you need to add a link to the xsl scripts for the specific format (you do not necessarily need a working script until you need to transform files), and create the corresponding file in the scripts folder: +``` + html_html_xsl: scripts/.xslt-datura/html_to_html/html_to_html.xsl + tei_html_xsl: scripts/.xslt-datura/tei_to_html/tei_to_html.xsl + vra_html_xsl: scripts/.xslt-datura/vra_to_html/vra_to_html.xsl + ead_html_xsl: scripts/.xslt-datura/ead_to_html/ead_to_html.xsl +``` + +## Datura overrides and new files +You will need to create a `file_format.rb` file (e.g. `file_ead.rb`) in `lib/datura/file_types`. Copy from a similar file type (`file_tei.rb` is a good model for XML-based formats) and make any necessary changes for the file format. Make sure to change the class name to reflect the new file format. In particular, in the case of an XML-based format, the `subdoc_xpaths` method should be modified to get the correct XPath for the files you want to transform: +``` +def subdoc_xpaths + # match subdocs against classes + return { + "/ead" => EadToEs, + } + end +``` + +In the `/lib/datura/to_es` folder you also need to create a `format_to_es.rb` file (e.g. `ead_to_es.rb`), along with a folder containing `fields.rb`, `request.rb`, and (for XML-based formats) `xpaths.rb` overrides. Be sure to require all the necessary files at the top (and create them in the proper folder). +``` +require_relative "xml_to_es.rb" +require_relative "ead_to_es/fields.rb" +require_relative "ead_to_es/request.rb" +require_relative "ead_to_es/xpaths.rb" + +class EadToEs < XmlToEs +end +``` +The new files you have added must be required in `lib/datura/requirer.rb`. This should happen automatically, but if not, add the following to make sure they get picked up: +``` +require_relative "to_es/ead_to_es.rb" +``` +All code in these files should be within the same class. If the format is based on XML, it should inherit from XmlToEs. +``` +class EadToEs < XmlToEs +end +``` + +## XPaths +Add the XPaths for your desired fields to the hash in `xpaths.rb`. It may be helpful to use an existing template like `tei_to_es/xpaths.rb`. There is no need to add all of them; you can comment out the fields you do not need. + +## Overrides with specific fields +All fields must be defined within `fields.rb`.
Even if you do not intend to index them, Datura requires at least an empty method defining each field. (An empty field will be nil and will not be displayed in Orchid.) +``` +def category +end +``` +Make appropriate changes to your fields as desired. + +## Dealing with subsections of XML files +If you want to index subsections, the best way to do this is to define an XPath selector for the desired sections in the `subdoc_xpaths` method as described above. +``` +def subdoc_xpaths + # match subdocs against classes + return { + "/ead" => EadToEs, + "//*[@level='item']" => EadToEsItems, + } + end +``` +Then add all the necessary overrides in the `to_es` folder as above. Depending on what you need to override, you may combine them into one file or have separate files. In any case, they should inherit from the main class, e.g. `class TeiToEsPersonography < TeiToEs`. diff --git a/docs/README.md b/docs/README.md index 7e8487915..75618aa9d 100644 --- a/docs/README.md +++ b/docs/README.md @@ -39,6 +39,7 @@ The files are parsed and formatted into documents appropriate for Solr, IIIF, El - [Ruby / Gems](4_developers/ruby_gems.md) - Class organization - [Tests](4_developers/test.md) + - [Add new formats to Datura](4_developers/new_formats.md) - More - [Troubleshooting](troubleshooting.md) - [XSLT to Ruby reference](xslt_to_ruby_reference.md) diff --git a/docs/schema_v2.md b/docs/schema_v2.md new file mode 100644 index 000000000..e5985a32a --- /dev/null +++ b/docs/schema_v2.md @@ -0,0 +1,152 @@ +## CDRH Schema, version 2 + +| NEW FIELD NAME | likely facet field? | Metadata Equivalent | ORIGINAL FIELD NAME | DESCRIPTION | FIELD TYPE | | EXAMPLE | +| ----------------------------------------------------------------------------------------- | ------------------- | --------------------------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Resource identification, website display | +| identifier | | | identifier | Unique identifier of the resource. | keyword | | oscys.case.0001.001 | +| collection | y | | collection | User friendly and URL valid name of project. Typically consists of directory under specified web domain. | keyword | | oscys, quillsandfeathers | +| collection\_desc | | | collection\_desc | Full CDRH name of the project. (e.g. “The William F. Cody Archive”) | keyword | | O Say Can You See: Early Washington, D.C., Law & Family | +| uri | | | uri | Full URI of resource.
(Actual site not API site) | keyword | | [http://earlywashingtondc.org/doc/oscys.case.0001.001](http://earlywashingtondc.org/doc/oscys.case.0001.001) | +| uri\_data | | | uri\_data | Full URL to XML of data, when available | keyword | | [http://earlywashingtondc.org/files/oscys/tei/oscys.case.0001.001.xml](http://earlywashingtondc.org/files/oscys/tei/oscys.case.0001.001.xml) | +| uri\_html | | | uri\_html | Full URL to HTML snippet of data | keyword | | [http://earlywashingtondc.org/files/oscys/html-generated/oscys.case.0001.001.txt](http://earlywashingtondc.org/files/oscys/html-generated/oscys.case.0001.001.txt) | +| data\_type | | | data\_type | Format the data was originally stored in at CDRH. | keyword | | tei
| +| fig\_location | | | fig\_location | URI to location of figure. | keyword | | [http://earlywashingtondc.org/figures/](http://earlywashingtondc.org/figures/) | +| cover\_image | | | image\_id | Unambiguous reference to the image when the image id does not match the file id. | keyword | | oscys.case.0001.001.001.jpg | +| title | | dcterms:title | title | Name given to the resource. | keyword, copied into text | text? | The Once and Future King | +| title\_sort | | | title\_sort | Name given to the resource lowercased with articles removed | keyword | | once and future king | +| alternative | | dcterms:alternative | alternative | Alternative name for the resource. | keyword, copied into text | text? | Petition for Habeas Corpus | +| date\_updated | m | | NEW | | date | | | +| category | y | | category | Category on web page where resource occurs. Category fields are meant to be hierarchical and exclusive, for other types of organization look to subjects, keywords, etc

Each site will have a controlled vocabulary of its own | keyword | | works | +| category2 | y | | subcategory | | keyword | | works | novels | +| category3 | y | | NEW | 3rd level category | keyword | | works | novels | historical fiction | +| category4 | y | | NEW | 4th level category | keyword | | works | novels | historical fiction | civil war | +| category5 | y | | NEW | 5th level category | keyword | | etc | +| notes | | | NEW | | keyword | | | +| Metadata: Digital Item | +| contributor | | dcterms:contributor | contributor | CONTAINER FIELD
"Entity responsible for making contributions
to the resource." | | | | +| contributor.name | | | [contributor.name](http://contributor.name) | Entity responsible for making contributions
to the resource. | keyword | | \[Allison, Dee Ann\]
\[Walter, Katherine\] | +| [contributor.id](http://contributor.id/) | | | [contributor.id](http://contributor.id) | ID of the contributor | keyword | | \[https://orcid.org/0000-0002-4671-061X\]
(leave blank for no id) | +| contributor.role | | | contributor.role | | keyword | | \[researcher\]
\[Principal Investigator\]
\[encoder\] | +| Metadata: Original Item | +| creator | | dcterms:creator | creator | CONTAINER FIELD
An entity primarily responsible for making the resource.
Examples of a Creator include a person, an organization, or a service. | | | Use person field with role instead | +| creator.name | y | | [creator.name](http://creator.name) | Creator field name | keyword | copied into text | Use person field with role instead | +| creator.id | y | | [creator.id](http://creator.id) | Creator field ID (if available) | keyword | | Use person field with role instead | +| citation | | | | | | | | +| citation.role | | | NEW | | keyword | | | +| [citation.id](http://citation.id/) | | bibo:identifier | NEW | an identifier of the original item | keyword | | | +| citation.title | | dcterms:title | NEW | Used to describe the title of a bibliographic resource | keyword | text? | | +| citation.publisher | y | bibo:producer | publisher | Entity responsible for making the resource available. | keyword | | University of Nebraska Press, Lincoln & London, 1992 | +| citation.date | | dcterms:date | NEW | Date the resource was originally created. | date | | 1900-01-01 | +| citation.issue | | bibo:issue | NEW | An issue number | keyword | | | +| citation.page\_start | | bibo:pageStart | NEW | Starting page number within a continuous page range. | keyword (some pages are Roman numerals) | | 4 | +| citation.page\_end | | bibo:pageEnd | NEW | Ending page number within a continuous page range. | keyword | | 5 (if applicable) | +| citation.section | | bibo:section | NEW | A section number | keyword | | | +| citation.volume | | bibo:volume | NEW | A volume number | keyword | | | +| citation.place | | juso:name | NEW | This property indicates the name of the spatial thing. | keyword | | | +| citation.title\_a | | tei title level a | NEW | typically an article | keyword | text? | | +| citation.title\_m | | tei title level m | NEW | typically a monograph | keyword | text? | | +| citation.title\_j | y | tei title level j | NEW | typically a journal name | keyword | text? | | +| date | y | dcterms:date | | the date that will be used to sort and run date queries on item | date | | | +| date\_display | | | date\_display | Date in whatever display format is used on the site | keyword | text? | January, 1900 | +| date\_not\_before | | | date\_not\_before | Inclusive beginning date of resource. | date | | 1900-01-01 | +| date\_not\_after | | | date\_not\_after | Inclusive ending date of resource. | date | | 1900-01-31 | +| format | y | dcterms:format | format | File format, physical medium, or dimensions of the resource. | keyword | copied into text? | Film: 16mm Safety Film | +| medium | y | dcterms:medium | medium | Material or physical carrier of the resource. | keyword | copied into text? | Film | +| extent | | dcterms:extent | extent | Size or duration of the resource. | keyword | | 4:03 | +| language | y | dcterms:language | language | Primary / original language of the resource | keyword | | en | +| rights\_holder | y | dcterms:rightsHolder | rights\_holder | A person or organization owning or managing rights over the resource. | keyword | copied into text? | Huntington Library | +| rights | | dcterms:rights | rights | Information about the rights held in and over the resource. | keyword | copied into text? | All Rights Reserved. Contact Rights Holder for Permissions Information.
or
Covered by a CC-By License https://creativecommons.org/licenses/by/2.0/ | +| rights\_uri | | | rights\_uri | URI to rights holder information. | keyword | | [http://www.huntington.org/](http://www.huntington.org/) | +| container\_box | | ead container type = box | NEW | box an item is kept in, as in an archive | keyword | | | +| container\_folder | | ead container type = folder | NEW | folder an item is kept in, as in an archive | keyword | | | +| Metadata: Interpretive | +| subjects | y | dcterms:subject | subjects | Topic of the content of the resource. | keyword | copied into text? | \[Horror in art\]
\[Poisonous spiders--Venom\] | +| abstract | | dcterms:abstract | abstract | Abstract of the resource. | keyword? (for display or searching?)
| text? | The poem is not one of DGR's great sonnets, and it pales before the majestic painting it was written to accompany. Nevertheless, it is quite an interesting and important text. | +| description | | dcterms:description | description | Short description of the resource. | text | text? | A Poem by Dante Gabriel Rossetti | +| type | y | dcterms:type | type | Nature or genre of the resource. | keyword | copied into text? | Video | +| topics | y | | topics | Topics of content of resource. | keyword | copied into text? | | +| keywords | y | | keywords | Keywords used for resource. | keyword | copied into text? | | +| keywords2 | y | | NEW | Another set of keywords, used in sites to create another way to browse | keyword | copied into text? | decade | +| keywords3 | y | | NEW | Another set of keywords, used in sites to create another way to browse | keyword | copied into text? | | +| keywords4 | y | | NEW | Another set of keywords, used in sites to create another way to browse | keyword | copied into text? | | +| Relation to other items | +| relation | | dcterms:relation | relation | A related resource that is substantially the same as the described resource, but in another format. | keyword | | oscys.case.0001.001-B | +| source | | dcterms:source | source | A related resource from which the described resource is derived | keyword | | oscys.case.0001.001-A | +| has\_part | | dcterms:hasPart | NEW | parts of the resource, for example items pasted into a scrapbook | | | | +| has\_part.role | | | | | | | | +| has\_part.id | | | | | keyword | | cdrh.0001 | +| has\_part.title | | | | | keyword | | Resource title | +| has\_part.order | | | | | whole number | | 1 | +| is\_part\_of | | dcterms:isPartOf | NEW | the containing resource, for example the scrapbook the individual items are in | | | | +| is\_part\_of.role | | | | | | | | +| is\_part\_of.id | | | | | keyword | | cdrh.0001 | +| is\_part\_of.title | | | | | keyword | | Resource title | +| is\_part\_of.order | | | | | whole number | | 1 | +| previous\_item | | | NEW | previous item in a series. role can be used to create multiple previous links - for instance, previous letter in a mailing sequence, previous letter by date | | | | +| previous\_item.role | | | | | | | | +| [previous\_item.id](http://previous_item.id/) | | | | | keyword | | cdrh.0001 | +| previous\_item.title | | | | | keyword | | Resource title | +| previous\_item.order | | | | | whole number | | 1 | +| next\_item | | | NEW | next item in a series. role can be used to create multiple nexts - for instance, next letter in a mailing sequence, next letter by date | | | | +| next\_item.role | | | | | | | | +| [next\_item.id](http://next_item.id/) | | | | | keyword | | cdrh.0001 | +| next\_item.title | | | | | keyword | | Resource title | +| next\_item.order | | | | | whole number | | 1 | +| Additional Data types | +| spatial | | | spatial | CONTAINER FIELD | | | | +| spatial.role | | | | | keyword | | | +| spatial.name | y | juso:name | spatial.title | Title / display name of location | keyword | copied into text? | Display name for this location, typically built from other fields, but potentially not. | +| spatial.description | | | spatial.description | Description | text | text? | | +| spatial.type | y | | spatial.type | | keyword | copied into text?
| "origin" or "destination" used to distinguish multiple spatial records for one item (for example, for an item of correspondence) | +| spatial.short\_name | y | juso:short\_name | spatial.place\_name | Specific name of location in question, such as the army camp name, business, event title, etc | keyword, copied into text | | Camp Hollowell, Kimball Recital Hall, The Coffeehouse, Lancaster County Fairgrounds | +| spatial.coordinates | y | juso:geometry | spatial.coordinates | | geopoint | | \[-96.6669600, 40.8000000\] | +| spatial.id | | | [spatial.id](http://coverage.spatial.id/) | | keyword | | ????
| +| spatial.city | y | juso:city | spatial.city | | keyword | copied into text? | | +| spatial.township | | juso:Township | NEW | | | copied into text? | | +| spatial.county | | juso:county | spatial.county | | keyword | copied into text? | | +| spatial.country | y | juso:country | spatial.country | | keyword | copied into text? | | +| spatial.region | y | juso:within | NEW? | | keyword | copied into text? | | +| spatial.state | | juso:state | spatial.state | | keyword | copied into text? | | +| spatial.street | | juso:street | spatial.street | | keyword | | | +| spatial.postal\_code | | juso:postal\_code | spatial.postal\_code | | keyword | | | +| spatial.note | | | | | | | | +| deprecate and replace with place with role of "placename" and only place\_name filled out | | | places | Place names mentioned in the resource. | keyword | | | +| person | | foaf:Person | person | any people other than contributors associated with resource | | | | +| person.name | y | foaf:name | [person.name](http://person.name) | Name as we wish it to appear | keyword | copied into text? | \[Cody, William F.\] | +| [person.id](http://person.id/) | y | | [person.id](http://person.id) | Optional, if exists, may be from VIAF or similar. | keyword | | \[http://viaf.org/viaf/100252467\] | +| person.role | y | | person.role | Role of person. Common examples are recipient and sender, less common examples are attorney and defendant | keyword | copied into text? | \[sender\]
\[recipient\]
\[creator\]
\[editor\] | +| person.note | | | NEW | | keyword | | | +| person.order | | | NEW | | keyword | | | +| person.birth\_date | | foaf:birthday | NEW | | date | | \[1899-03-04\] | +| person.death\_date | | | NEW | | date | | | +| person.age\_category | y | | NEW | used when resources are categorizing the age of the participant at the time of the event. For instance, a minor in a court case | keyword | | \[minor\]
\[adult\] | +| person.name\_last | | foaf:lastName | NEW | | keyword | | | +| person.name\_given | | foaf:givenName | NEW | | keyword | | | +| person.name\_alternate | | | NEW | | keyword | | | +| person.name\_previous | | | | | | | | +| person.race | y | | NEW | | keyword | | | +| person.sex | y | | NEW | | keyword | | | +| person.gender | y | foaf:gender | NEW | | keyword | | | +| person.nationality | y | | NEW | | keyword | | | +| person.trait1 | y | | NEW | | keyword | | | +| person.trait2 | y | | NEW | | keyword | | | +| event | | | | | | | | +| event.type | y | | NEW | | keyword | | | +| event.agent | y | event:agent | NEW | Relates an event to an active agent (a person, a computer, ... :-) ) | keyword | | | +| event.factor | | event:factor | NEW | Relates an event to a passive factor (a tool, an instrument, an abstract cause...) | keyword | | points of law cited in a case | +| event.product | y | event:product | NEW | | keyword | | case outcome | +| event.date\_begin | | event\_date\_begin | NEW | | date | | | +| event.date\_end | | event\_date\_end | NEW | | date | | | +| event.trait1 | | | NEW | | keyword | | can be used for case keywords, i.e. civil, criminal | +| event.trait2 | | | NEW | | keyword | | | +| event.notes | | | NEW | | keyword | | | +| RDF | | | | The RDF field can be used to record any other data that needs to be associated with the record, for instance relationships | | | | +| rdf.type | | | NEW | | keyword | | \[relationship\] | +| rdf.subject | y | rdf:subject | NEW | | keyword | | \[Smith, John\] | +| rdf.predicate | y | rdf:predicate | NEW | | keyword | | \[is married to\] | +| rdf.object | y | rdf:object | NEW | | keyword | | \[Smith, Mary\] | +| rdf.source | | | NEW | | keyword | | item.0001 | +| rdf.note | | | NEW | | keyword | | | +| Text search | +| annotations\_text | | | annotations\_text | Place for annotations text, so we can search annotations separately from the main text | text | | | +| text | | | text | Combined text of all the above fields for keyword searching. | text
| | | diff --git a/lib/config/api_schema.yml b/lib/config/api_schema.yml deleted file mode 100644 index 26f1e6ccb..000000000 --- a/lib/config/api_schema.yml +++ /dev/null @@ -1,194 +0,0 @@ -properties: - identifier: - type: keyword - identifier: - type: keyword - collection: - type: keyword - collection_desc: - type: keyword - uri: - type: keyword - uri_data: - type: keyword - uri_html: - type: keyword - data_type: - type: keyword - image_location: - type: keyword - image_id: - type: keyword - # TODO copy to text? - title: - type: keyword - title_sort: - type: keyword - # TODO copy to text? - alternative: - type: keyword - creator_sort: - type: keyword - creator: - type: nested - properties: - name: - # TODO copy into text? - type: keyword - id: - type: keyword - subjects: - type: keyword - # TODO not sure yet if for display or search - abstract: - type: keyword - # TODO copy to text? - description: - type: keyword - publisher: - type: keyword - contributor: - type: nested - properties: - name: - type: keyword - id: - type: keyword - role: - type: keyword - date: - type: date - format: "yyyy-MM-dd||epoch_millis" - # ignore_malformed: true - date_display: - type: keyword - date_not_before: - type: date - format: "yyyy-MM-dd||epoch_millis" - # ignore_malformed: true - date_not_after: - type: date - format: "yyyy-MM-dd||epoch_millis" - # ignore_malformed: true - type: - type: keyword - format: - type: keyword - medium: - type: keyword - extent: - type: keyword - language: - type: keyword - languages: - type: keyword - relation: - type: keyword - source: - type: keyword - recipient: - type: nested - properties: - name: - type: keyword - id: - type: keyword - role: - type: keyword - rights_holder: - type: keyword - rights: - type: keyword - rights_uri: - type: keyword - spatial: - type: nested - properties: - id: - type: keyword - # display title for entire location - title: - type: keyword - type: - type: keyword - # specific name of building, park, mountain, etc - place_name: - type: keyword - coordinates: - type: geo_point - city: - type: keyword - county: - type: keyword - country: - type: keyword - region: - type: keyword - state: - type: keyword - street: - type: keyword - postal_code: - type: keyword - person: - type: nested - properties: - name: - # TODO copy into text? 
- type: keyword - id: - type: keyword - role: - type: keyword - annotations_text: - type: text - analyzer: english - category: - type: keyword - subcategory: - type: keyword - topics: - type: keyword - keywords: - type: keyword - people: - type: keyword - places: - type: keyword - works: - type: keyword - text: - type: text - analyzer: english -dynamic_templates: - - date_fields: - match: "*_d" - mapping: - type: date - format: "yyyy-MM-dd||epoch_millis" - - integer_fields: - match: "*_i" - mapping: - type: integer - - keyword_fields: - match: "*_k" - mapping: - type: keyword - - text_fields: - match: "*_t" - mapping: - type: text - analyzer: english - # language fields are always text fields - # but specifying _t_ for clarity - # _t_en functionally the same as _t - - text_english: - match: "*_t_en" - mapping: - type: text - analyzer: english - - text_spanish: - match: "*_t_es" - mapping: - type: text - analyzer: spanish diff --git a/lib/config/es_api_schemas/1.0.yml b/lib/config/es_api_schemas/1.0.yml new file mode 100644 index 000000000..da276c956 --- /dev/null +++ b/lib/config/es_api_schemas/1.0.yml @@ -0,0 +1,209 @@ +# compatible with Apium v1.0 +mappings: + properties: + identifier: + type: keyword + identifier: + type: keyword + collection: + type: keyword + collection_desc: + type: keyword + uri: + type: keyword + uri_data: + type: keyword + uri_html: + type: keyword + data_type: + type: keyword + image_location: + type: keyword + image_id: + type: keyword + # TODO copy to text? + title: + type: keyword + title_sort: + type: keyword + # TODO copy to text? + alternative: + type: keyword + creator_sort: + type: keyword + creator: + type: nested + properties: + name: + # TODO copy into text? + type: keyword + id: + type: keyword + subjects: + type: keyword + # TODO not sure yet if for display or search + abstract: + type: keyword + # TODO copy to text? + description: + type: keyword + publisher: + type: keyword + contributor: + type: nested + properties: + name: + type: keyword + id: + type: keyword + role: + type: keyword + date: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + date_display: + type: keyword + date_not_before: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + date_not_after: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + type: + type: keyword + format: + type: keyword + medium: + type: keyword + extent: + type: keyword + language: + type: keyword + languages: + type: keyword + relation: + type: keyword + source: + type: keyword + rdf: + type: nested + properties: + type: + type: keyword + subject: + type: keyword + predicate: + type: keyword + object: + type: keyword + source: + type: keyword + note: + type: keyword + recipient: + type: nested + properties: + name: + type: keyword + id: + type: keyword + role: + type: keyword + rights_holder: + type: keyword + rights: + type: keyword + rights_uri: + type: keyword + spatial: + type: nested + properties: + # display title for entire location + title: + type: keyword + place_name: + # TODO copy into text? + type: keyword + coordinates: + type: geo_point + id: + type: keyword + city: + type: keyword + county: + type: keyword + country: + type: keyword + region: + type: keyword + state: + type: keyword + street: + type: keyword + postal_code: + type: keyword + person: + type: nested + properties: + name: + # TODO copy into text? 
+ type: keyword + id: + type: keyword + role: + type: keyword + annotations_text: + type: text + analyzer: english + category: + type: keyword + subcategory: + type: keyword + topics: + type: keyword + keywords: + type: keyword + people: + type: keyword + places: + type: keyword + works: + type: keyword + text: + type: text + analyzer: english + dynamic_templates: + - date_fields: + match: "*_d" + mapping: + type: date + format: "yyyy-MM-dd||epoch_millis" + - integer_fields: + match: "*_i" + mapping: + type: integer + - keyword_fields: + match: "*_k" + mapping: + type: keyword + - text_fields: + match: "*_t" + mapping: + type: text + analyzer: english + # language fields are always text fields + # but specifying _t_ for clarity + # _t_en functionally the same as _t + - text_english: + match: "*_t_en" + mapping: + type: text + analyzer: english + - text_spanish: + match: "*_t_es" + mapping: + type: text + analyzer: spanish diff --git a/lib/config/es_api_schemas/2.0.yml b/lib/config/es_api_schemas/2.0.yml new file mode 100644 index 000000000..303b22ec6 --- /dev/null +++ b/lib/config/es_api_schemas/2.0.yml @@ -0,0 +1,573 @@ +# compatible with Apium v2.0 +settings: + settings: + analysis: + char_filter: + escapes: + type: mapping + mappings: + - " => " + - " => " + - " => " + - " => " + - " => " + - " => " + - "- => " + - "& => " + - ": => " + - "; => " + - ", => " + - ". => " + - "$ => " + - "@ => " + - "~ => " + - "\" => " + - "' => " + - "[ => " + - "] => " + normalizer: + keyword_normalized: + type: custom + char_filter: + - escapes + filter: + - asciifolding + - lowercase +mappings: + properties: + identifier: + type: keyword + normalizer: keyword_normalized + collection: + type: keyword + normalizer: keyword_normalized + collection_desc: + type: keyword + normalizer: keyword_normalized + uri: + type: keyword + normalizer: keyword_normalized + uri_data: + type: keyword + normalizer: keyword_normalized + uri_html: + type: keyword + normalizer: keyword_normalized + data_type: + type: keyword + normalizer: keyword_normalized + fig_location: + type: keyword + normalizer: keyword_normalized + cover_image: + type: keyword + normalizer: keyword_normalized + title: + # TODO copy to text + type: keyword + normalizer: keyword_normalized + title_sort: + type: keyword + normalizer: keyword_normalized + alternative: + # TODO copy to text + type: keyword + normalizer: keyword_normalized + date_updated: + type: date + format: "yyyy-MM-dd||epoch_millis" + category: + type: keyword + normalizer: keyword_normalized + category2: + type: keyword + normalizer: keyword_normalized + category3: + type: keyword + normalizer: keyword_normalized + category4: + type: keyword + normalizer: keyword_normalized + category5: + type: keyword + normalizer: keyword_normalized + notes: + type: keyword + normalizer: keyword_normalized + contributor: + type: nested + properties: + name: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + role: + type: keyword + normalizer: keyword_normalized + creator: + type: nested + properties: + name: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + citation: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + date: + type: date + format: "yyyy-MM-dd||epoch_millis" + title: + type: keyword + normalizer: keyword_normalized + publisher: + type: keyword + normalizer: keyword_normalized + issue: 
+ type: keyword + normalizer: keyword_normalized + page_start: + type: keyword + normalizer: keyword_normalized + page_end: + type: keyword + normalizer: keyword_normalized + section: + type: keyword + normalizer: keyword_normalized + volume: + type: keyword + normalizer: keyword_normalized + place: + type: keyword + normalizer: keyword_normalized + title_a: + type: keyword + normalizer: keyword_normalized + title_m: + type: keyword + normalizer: keyword_normalized + title_j: + type: keyword + normalizer: keyword_normalized + date: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + date_display: + type: keyword + normalizer: keyword_normalized + date_not_before: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + date_not_after: + type: date + format: "yyyy-MM-dd||epoch_millis" + # ignore_malformed: true + format: + type: keyword + normalizer: keyword_normalized + medium: + type: keyword + normalizer: keyword_normalized + extent: + type: keyword + normalizer: keyword_normalized + language: + type: keyword + normalizer: keyword_normalized + rights_holder: + type: keyword + normalizer: keyword_normalized + rights: + type: keyword + normalizer: keyword_normalized + rights_uri: + type: keyword + normalizer: keyword_normalized + container_box: + type: keyword + normalizer: keyword_normalized + container_folder: + type: keyword + normalizer: keyword_normalized + subjects: + type: keyword + normalizer: keyword_normalized + # TODO not sure yet if for display or search + abstract: + type: keyword + normalizer: keyword_normalized + # TODO copy to text? + description: + type: text + analyzer: english + type: + type: keyword + normalizer: keyword_normalized + topics: + type: keyword + normalizer: keyword_normalized + keywords: + type: keyword + normalizer: keyword_normalized + keywords2: + type: keyword + normalizer: keyword_normalized + keywords3: + type: keyword + normalizer: keyword_normalized + keywords4: + type: keyword + normalizer: keyword_normalized + keywords5: + type: keyword + normalizer: keyword_normalized + relation: + type: keyword + normalizer: keyword_normalized + has_source: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: keyword_normalized + order: + type: integer + has_relation: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: keyword_normalized + order: + type: integer + has_part: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: keyword_normalized + order: + type: integer + is_part_of: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: keyword_normalized + order: + type: integer + previous_item: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: keyword_normalized + order: + type: integer + next_item: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + title: + type: keyword + normalizer: 
keyword_normalized + order: + type: integer + # recipient: #DELETED + # type: nested + # properties: + # name: + # type: keyword + # normalizer: keyword_normalized + # id: + # type: keyword + # normalizer: keyword_normalized + # role: + # type: keyword + # normalizer: keyword_normalized + spatial: + type: nested + properties: + role: + type: keyword + normalizer: keyword_normalized + name: + # display title for entire location + type: keyword + normalizer: keyword_normalized + description: + type: text + analyzer: english + type: + type: keyword + normalizer: keyword_normalized + short_name: + # TODO copy into text? + type: keyword + normalizer: keyword_normalized + coordinates: + type: geo_point + id: + type: keyword + normalizer: keyword_normalized + city: + type: keyword + normalizer: keyword_normalized + township: + type: keyword + normalizer: keyword_normalized + county: + type: keyword + normalizer: keyword_normalized + country: + type: keyword + normalizer: keyword_normalized + region: + type: keyword + normalizer: keyword_normalized + state: + type: keyword + normalizer: keyword_normalized + street: + type: keyword + normalizer: keyword_normalized + postal_code: + type: keyword + normalizer: keyword_normalized + note: + type: keyword + normalizer: keyword_normalized + trait1: + type: keyword + normalizer: keyword_normalized + trait2: + type: keyword + normalizer: keyword_normalized + trait3: + type: keyword + normalizer: keyword_normalized + trait4: + type: keyword + normalizer: keyword_normalized + trait5: + type: keyword + normalizer: keyword_normalized + places: #DEPRECATED + type: keyword + normalizer: keyword_normalized + person: + type: nested + properties: + name: + type: keyword + normalizer: keyword_normalized + id: + type: keyword + normalizer: keyword_normalized + role: + type: keyword + normalizer: keyword_normalized + note: + type: keyword + normalizer: keyword_normalized + order: + type: integer + birth_date: + type: date + format: "yyyy-MM-dd||epoch_millis" + death_date: + type: date + format: "yyyy-MM-dd||epoch_millis" + age_category: + type: keyword + normalizer: keyword_normalized + name_last: + type: keyword + normalizer: keyword_normalized + name_given: + type: keyword + normalizer: keyword_normalized + name_alternate: + type: keyword + normalizer: keyword_normalized + name_previous: + type: keyword + normalizer: keyword_normalized + race: + type: keyword + normalizer: keyword_normalized + sex: + type: keyword + normalizer: keyword_normalized + gender: + type: keyword + normalizer: keyword_normalized + nationality: + type: keyword + normalizer: keyword_normalized + trait1: + type: keyword + normalizer: keyword_normalized + trait2: + type: keyword + normalizer: keyword_normalized + trait3: + type: keyword + normalizer: keyword_normalized + trait4: + type: keyword + normalizer: keyword_normalized + trait5: + type: keyword + normalizer: keyword_normalized + event: + type: nested + properties: + type: + type: keyword + normalizer: keyword_normalized + agent: + type: keyword + normalizer: keyword_normalized + factor: + type: keyword + normalizer: keyword_normalized + product: + type: keyword + normalizer: keyword_normalized + date_begin: + type: date + format: "yyyy-MM-dd||epoch_millis" + date_end: + type: date + format: "yyyy-MM-dd||epoch_millis" + trait1: + type: keyword + normalizer: keyword_normalized + trait2: + type: keyword + normalizer: keyword_normalized + trait3: + type: keyword + normalizer: keyword_normalized + trait4: + type: keyword + normalizer: 
keyword_normalized + trait5: + type: keyword + normalizer: keyword_normalized + notes: + type: keyword + normalizer: keyword_normalized + rdf: + type: nested + properties: + type: + type: keyword + normalizer: keyword_normalized + subject: + type: keyword + normalizer: keyword_normalized + predicate: + type: keyword + normalizer: keyword_normalized + object: + type: keyword + normalizer: keyword_normalized + source: + type: keyword + normalizer: keyword_normalized + note: + type: keyword + normalizer: keyword_normalized + annotations_text: + type: text + analyzer: english + text: + type: text + analyzer: english + dynamic_templates: + - date_fields: + match: "*_d" + mapping: + type: date + format: "yyyy-MM-dd||epoch_millis" + - integer_fields: + match: "*_i" + mapping: + type: integer + - keyword_fields: + match: "*_k" + mapping: + type: keyword + normalizer: keyword_normalized + - nested_fields: + match: "*_n" + mapping: + type: nested + - text_fields: + match: "*_t" + mapping: + type: text + analyzer: english + # language fields are always text fields + # but specifying _t_ for clarity + # _t_en functionally the same as _t + - text_english: + match: "*_t_en" + mapping: + type: text + analyzer: english + - text_spanish: + match: "*_t_es" + mapping: + type: text + analyzer: spanish diff --git a/lib/config/f17.dtd b/lib/config/f17.dtd deleted file mode 100644 index e69de29bb..000000000 diff --git a/lib/config/public.yml b/lib/config/public.yml index 3fff24731..7fefc6c85 100644 --- a/lib/config/public.yml +++ b/lib/config/public.yml @@ -9,37 +9,46 @@ # the collection specific configuration files: # (config/public.yml and config/private.yml) - ################### # Defaults # ################### default: - # SCRIPT POWER # recommend this be increased in private.yml # on more powerful systems to improve runtime threads: 5 # LOGGING - log_old_number: 1 # number of log files before beginning to erase - log_size: 32768000 # size of log file in bytes - log_level: Logger::INFO # available levels: UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG + log_old_number: 1 # number of log files before beginning to erase + log_size: 32768000 # size of log file in bytes + log_level: Logger::INFO # available levels: UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG - # SCHEMA LOCATION - # misleadingly, this is not currently overrideable per collection - # TODO make overrideable in es_set_schema and post - # or perhaps remove it from this config since it is not collection-specific - # in any sense of the word, except if working with an entirely separate ES index - es_schema_path: lib/config/api_schema.yml + # ELASTICSEARCH SCHEMA CONFIGURATION + # if es_schema_override is false, datura is base directory + # if es_schema_override is true, then host data repo is the base directory + # it is NOT recommended to set es_schema_override to true! 
+ # if you need something outside of your data repo's directory, consider + # overriding the es_schema_path method in options.rb + es_schema_override: false + # path from base directory to schemas + es_schema_path: lib/config/es_api_schemas + # current version of the API (powered by Elasticsearch) + # this setting determines which of the schemas will be used + api_version: "1.0" + # NOTE: es_schema option is set later as combination of above + # es_schema_override, es_schema_path, and api_version + # ES currently limits keyword fields to 1000000 characters + # exceeding this limit (generally in the text field) will cause errors when searching + text_limit: 900000 # RESOURCE LOCATIONS - data_base: https://cdrhmedia.unl.edu # xml, csv, html snippets, etc - media_base: https://cdrhmedia.unl.edu # images, audio, video - es_index: override_to_set_index # elasticsearch index name - es_path: http://localhost:9200 # elasticsearch path (recommend override) - solr_core: override_to_set_core # solr core name - solr_path: http://localhost:8983/solr # solr path (recommend override) + data_base: https://cdrhmedia.unl.edu # xml, csv, html snippets, etc + media_base: https://cdrhmedia.unl.edu # images, audio, video + es_index: override_to_set_index # elasticsearch index name + es_path: http://localhost:9200 # elasticsearch path (recommend override) + solr_core: override_to_set_core # solr core name + solr_path: http://localhost:8983/solr # solr path (recommend override) # OUTPUT LOCATION # default is [environment]/output/[file_type] @@ -59,6 +68,7 @@ default: html_html_xsl: scripts/.xslt-datura/html_to_html/html_to_html.xsl tei_html_xsl: scripts/.xslt-datura/tei_to_html/tei_to_html.xsl vra_html_xsl: scripts/.xslt-datura/vra_to_html/vra_to_html.xsl + ead_html_xsl: scripts/.xslt-datura/ead_to_html/ead_to_html.xsl # XSLT PARAMETERS # NOTE!
If you are altering ANY of the variables you must @@ -83,7 +93,6 @@ default: development: data_base: https://cdrhdev1.unl.edu/media - ################## # Production # ################## diff --git a/lib/datura.rb b/lib/datura.rb index 5f46a4d55..3ff9278b2 100644 --- a/lib/datura.rb +++ b/lib/datura.rb @@ -1,5 +1,6 @@ require "datura/version" require "datura/data_manager" +require "datura/elasticsearch" module Datura diff --git a/lib/datura/data_manager.rb b/lib/datura/data_manager.rb index 9ae304a43..24f4898e9 100644 --- a/lib/datura/data_manager.rb +++ b/lib/datura/data_manager.rb @@ -1,7 +1,7 @@ require "colorize" require "logger" require "yaml" - +require "byebug" require_relative "./requirer.rb" class Datura::DataManager @@ -20,10 +20,12 @@ class Datura::DataManager def self.format_to_class classes = { "csv" => FileCsv, + "ead" => FileEad, "html" => FileHtml, "tei" => FileTei, "vra" => FileVra, - "webs" => FileWebs + "webs" => FileWebs, + "pdf" => FilePdf } classes.default = FileCustom classes @@ -45,7 +47,6 @@ def initialize prepare_xslt load_collection_classes set_up_logger - # set up posting URLs @es_url = File.join(options["es_path"], options["es_index"]) @solr_url = File.join(options["solr_path"], options["solr_core"], "update") @@ -58,6 +59,7 @@ def load_collection_classes # any of the default ones (for example: TeiToEs) path = File.join(@options["collection_dir"], "scripts", "overrides", "*.rb") Dir[path].each do |f| + puts "requiring #{f}" require f end end @@ -71,15 +73,14 @@ def print_options def run @time = [Time.now] # log starting information for user + check_options + set_up_services + msg = options_msg @log.info(msg) puts msg - - check_options - set_schema pre_file_preparation @files = prepare_files - pre_batch_processing batch_process_files post_batch_processing @@ -110,7 +111,7 @@ def allowed_files(all_files) # TODO should this move to Options class? def assert_option(opt) - if !@options.has_key?(opt) + if !@options.key?(opt) puts "Option #{opt} was not found!
Check config files and add #{opt} to continue".red raise "Missing configuration options" end @@ -132,13 +133,18 @@ def batch_process_files def check_options # verify that everything's all good before moving to per-file level processing - if should_transform?("es") + if should_post?("es") assert_option("es_path") assert_option("es_index") + # options used to obtain the mappings + assert_option("es_schema_override") + assert_option("es_schema_path") + assert_option("api_version") + assert_option("collection") end - if should_transform?("solr") + if should_post?("solr") assert_option("solr_core") assert_option("solr_path") end @@ -189,8 +195,8 @@ def options_msg msg << "Running script with following options:\n" msg << "collection: #{@options['collection']}\n" msg << "Environment: #{@options['environment']}\n" - msg << "Posting to: #{@es_url}\n\n" if should_transform?("es") - msg << "Posting to: #{@solr_url}\n\n" if should_transform?("solr") + msg << "Posting to: #{@es.index_url}\n\n" if should_post?("es") + msg << "Posting to: #{@solr_url}\n\n" if should_post?("solr") msg << "Format: #{@options['format']}\n" if @options["format"] msg << "Regex: #{@options['regex']}\n" if @options["regex"] msg << "Allowed Files: #{@options['allowed_files']}\n" if @options["allowed_files"] @@ -262,24 +268,6 @@ def prepare_xslt end end - def set_schema - # if ES is requested and not transform only, then set the schema - # to make sure that any new fields are stored with the correct fieldtype - if should_transform?("es") && !@options["transform_only"] - schema = YAML.load_file(File.join(@options["datura_dir"], @options["es_schema_path"])) - path, idx = ["es_path", "es_index"].map { |i| @options[i] } - url = "#{path}/#{idx}/_mapping/_doc?pretty=true" - begin - RestClient.put(url, schema.to_json, { content_type: :json }) - msg = "Successfully set elasticsearch schema for index #{idx} _doc" - @log.info(msg) - puts msg.green - rescue => e - raise("Something went wrong setting the elasticsearch schema for index #{idx} _doc:\n#{e.to_s}".red) - end - end - end - def set_up_logger # make directory if one does not already exist log_dir = File.join(@options["collection_dir"], "logs") @@ -293,13 +281,28 @@ def set_up_logger ) end + def set_up_services + if should_post?("es") + # set up elasticsearch instance + @es = Datura::Elasticsearch::Index.new(@options, schema_mapping: true) + end + + if should_post?("solr") + # set up posting URLs + @solr_url = File.join(options["solr_path"], options["solr_core"], "update") + end + end + + def should_post?(type) + should_transform?(type) && !@options["transform_only"] + end + def should_transform?(type) # adjust default transformation type in params parser @options["transform_types"].include?(type) end def transform_and_post(file) - # elasticsearch if should_transform?("es") if @options["transform_only"] @@ -311,7 +314,7 @@ def transform_and_post(file) error_with_transform_and_post("#{e}", @error_es) end else - res_es = file.post_es(@es_url) + res_es = file.post_es(@es) if res_es && res_es.has_key?("error") error_with_transform_and_post(res_es["error"], @error_es) end diff --git a/lib/datura/elasticsearch.rb b/lib/datura/elasticsearch.rb new file mode 100644 index 000000000..ffc4c710e --- /dev/null +++ b/lib/datura/elasticsearch.rb @@ -0,0 +1,18 @@ +require_relative './helpers.rb' +require_relative './options.rb' + +module Datura::Elasticsearch + + # clear data from the index (leaves index schema intact) + module Data + end + + # manage the aliases used to refer to specific indexes + 
module Alias + end + + # manage the creation / deletion / schema configuration of indexes + class Index + end + +end diff --git a/lib/datura/elasticsearch/alias.rb b/lib/datura/elasticsearch/alias.rb new file mode 100644 index 000000000..4d6a3a118 --- /dev/null +++ b/lib/datura/elasticsearch/alias.rb @@ -0,0 +1,57 @@ +require "json" +require "rest-client" + +require_relative "./../elasticsearch.rb" + +module Datura::Elasticsearch::Alias + + def self.add + params = Datura::Parser.es_alias + options = Datura::Options.new(params).all + + ali = options["alias"] + idx = options["index"] + + base_url = File.join(options["es_path"], "_aliases") + auth_header = Datura::Helpers.construct_auth_header(options) + + data = { + actions: [ + { remove: { alias: ali, index: "_all" } }, + { add: { alias: ali, index: idx } } + ] + } + RestClient.post(base_url, data.to_json, auth_header.merge({ content_type: :json })) { |res, req, result| + if result.code == "200" + puts res + puts "Successfully added alias #{ali}. Current alias list:" + puts list + else + raise "#{result.code} error managing aliases: #{res}" + end + } + end + + def self.delete + params = Datura::Parser.es_alias + options = Datura::Options.new(params).all + + ali = options["alias"] + idx = options["index"] + + url = File.join(options["es_path"], idx, "_alias", ali) + auth_header = Datura::Helpers.construct_auth_header(options) + + res = JSON.parse(RestClient.delete(url, auth_header)) + puts JSON.pretty_generate(res) + list + end + + def self.list + options = Datura::Options.new({}).all + + res = RestClient.get(File.join(options["es_path"], "_aliases"), Datura::Helpers.construct_auth_header(options)) + JSON.pretty_generate(JSON.parse(res)) + end + +end diff --git a/lib/datura/elasticsearch/index.rb b/lib/datura/elasticsearch/index.rb new file mode 100644 index 000000000..09828d82f --- /dev/null +++ b/lib/datura/elasticsearch/index.rb @@ -0,0 +1,260 @@ +require "json" +require "rest-client" +require "yaml" +require "base64" + +require_relative "./../elasticsearch.rb" + +class Datura::Elasticsearch::Index + + attr_reader :schema_mapping + attr_reader :index_url + + # if options are passed in, then commandline arguments + # do not need to be parsed + def initialize(options = nil, schema_mapping: false) + if !options + params = Datura::Parser.es_index + @options = Datura::Options.new(params).all + else + @options = options + end + + @index_url = File.join(@options["es_path"], @options["es_index"]) + @pretty_url = "#{@index_url}?pretty=true" + @mapping_url = File.join(@index_url, "_mapping?pretty=true") + + # yaml settings (if exist) and mappings + @requested_schema = YAML.load_file(@options["es_schema"]) + @auth_header = Datura::Helpers.construct_auth_header(@options) + # if requested, grab the mapping currently associated with this index + # otherwise wait until after the requested schema is loaded + get_schema_mapping if schema_mapping + end + + def create + json = @requested_schema["settings"].to_json + puts "Creating ES index for API version #{@options["api_version"]}: #{@pretty_url}" + if json && json != "null" + RestClient.put(@pretty_url, json, @auth_header.merge({ content_type: :json })) { |res, req, result| + if result.code == "200" + puts res + else + raise "#{result.code} error creating Elasticsearch index: #{res}" + end + } + else + RestClient.put(@pretty_url, nil, @auth_header) { |res, req, result| + if result.code == "200" + puts res + else + raise "#{result.code} error creating Elasticsearch index: #{res}" + end + } + end + end + + def delete + puts "Deleting #{@options["es_index"]} via url #{@pretty_url}" + + RestClient.delete(@pretty_url, @auth_header) { |res, req, result| + if result.code
!= "200" + raise "#{result.code} error deleting Elasticsearch index: #{res}" + end + } + end + + def get_schema + RestClient.get(@mapping_url, @auth_header) { |res, req, result| + if result.code == "200" + JSON.parse(res) + else + raise "#{result.code} error getting Elasticsearch schema: #{res}" + end + } + end + + def get_schema_mapping + # if mapping has not already been set, get the schema and manipulate + if !defined?(@schema_mapping) + @schema_mapping = { + "dynamic" => nil, # /regex|regex/ + "fields" => [], # [ fields ] + "nested" => {} # { field: [ nested_fields ] } + } + + schema = get_schema[@options["es_index"]] + doc = schema["mappings"] + doc["properties"].each do |field, value| + @schema_mapping["fields"] << field + if value["type"] == "nested" + @schema_mapping["nested"][field] = value["properties"].keys + end + end + + regex_pieces = [] + if doc["dynamic_templates"] + doc["dynamic_templates"].each do |template| + mapping = template.map { |k,v| v["match"] }.first + # dynamic fields are listed like *_k and will need + # to be converted to ^.*_k$, then combined into a mega-regex + es_match = mapping.sub("*", ".*") + regex_pieces << es_match + end + end + if !regex_pieces.empty? + regex_joined = regex_pieces.join("|") + @schema_mapping["dynamic"] = /^(?:#{regex_joined})$/ + end + end + @schema_mapping + end + + def set_schema + json = @requested_schema["mappings"].to_json + + puts "Setting schema: #{@mapping_url}" + RestClient.put(@mapping_url, json, @auth_header.merge({ content_type: :json })) { |res, req, result| + if result.code == "200" + puts res + else + raise "#{result.code} error setting Elasticsearch schema: #{res}" + end + } + end + + # doc: ruby hash corresponding with Elasticsearch document JSON + def valid_document?(doc) + get_schema_mapping + # NOTE: validation only checking the names of fields + # against the schema, NOT the contents of fields + # Elasticsearch itself checks that you are sending date + # formats to date fields, etc + + doc.all? do |field, value| + if valid_field?(field) + # great, the field is valid, now check if it is a parent + Array(value).each do |nested| + if nested.class == Hash + if nested.keys.all? 
{ |k| valid_field?(k, field) } + next + else + # if one of the nested hashes fails, it is invalid + puts "Nested field '#{field}' is invalid" + return false + end + end + end + # all nested fields passed, so it is valid + true + else + puts "Field '#{field}' is invalid" + false + end + end + end + + # if a field, including those inside nested fields, + # matches a top level field mapping or a dynamic field, + # they are good to go + # further, if this is a nested field, they may check + # to see if the specific nesting mapping validates them + def valid_field?(field, parent=nil) + @schema_mapping["fields"].include?(field) || + field.match(@schema_mapping["dynamic"]) || + valid_nested_field?(field, parent) + end + + def valid_nested_field?(field, parent) + parent_mapping = @schema_mapping["nested"][parent] + parent_mapping.include?(field) if parent_mapping + end + + def self.clear + # run the parameters through the option parser + params = Datura::Parser.clear_index + options = Datura::Options.new(params).all + if options["collection"] == "all" + self.clear_all(options) + else + self.clear_index(options) + end + end + + private + + def self.build_clear_data(options) + if options["regex"] + field = options["field"] || "identifier" + { + "query" => { + "bool" => { + "must" => [ + { "regexp" => { field => options["regex"] } }, + { "term" => { "collection" => options["collection"] } } + ] + } + } + } + else + { + "query" => { "term" => { "collection" => options["collection"] } } + } + end + end + + def self.clear_all(options) + puts "Please verify that you want to clear EVERY ENTRY from the ENTIRE INDEX\n\n" + puts "== FIELD / REGEX FILTERS NOT AVAILABLE FOR THIS OPTION, YOU'LL WIPE EVERYTHING ==\n\n" + puts "Running this on something other than your computer's localhost? DON'T." + puts "Type: 'Yes I'm sure'" + confirm = STDIN.gets.chomp + if confirm == "Yes I'm sure" + url = File.join(options["es_path"], options["es_index"], "_delete_by_query?pretty=true") + auth_header = Datura::Helpers.construct_auth_header(options) + json = { "query" => { "match_all" => {} } } + RestClient.post(url, json.to_json, auth_header.merge({ content_type: :json })) { |res, req, result| + if result.code == "200" + puts res + else + raise "#{result.code} error when clearing entire index: #{res}" + end + } + else + puts "You typed '#{confirm}'. This is incorrect, exiting program" + exit + end + end + + def self.clear_index(options) + url = File.join(options["es_path"], options["es_index"], "_delete_by_query?pretty=true") + confirmation = self.confirm_clear(options, url) + + if confirmation + data = self.build_clear_data(options) + auth_header = Datura::Helpers.construct_auth_header(options) + RestClient.post(url, data.to_json, auth_header.merge({content_type: :json })) { |res, req, result| + if result.code == "200" || result.code == "201" + puts res + else + raise "#{result.code} error when clearing index: #{res}" + end + } + else + puts "come back anytime!" + exit + end + end + + def self.confirm_clear(options, url) + # verify that the user is really sure about the index they're about to wipe + puts "Are you sure that you want to remove entries from" + puts " #{options["collection"]}'s #{options['environment']} environment?" 
+ puts "url: #{url}" + puts "y/N" + answer = STDIN.gets.chomp + # boolean + !!(answer =~ /[yY]/) + end + +end diff --git a/lib/datura/file_type.rb b/lib/datura/file_type.rb index 236369a30..e17114837 100644 --- a/lib/datura/file_type.rb +++ b/lib/datura/file_type.rb @@ -23,13 +23,14 @@ def initialize(location, options) @file_location = location @options = options add_xsl_params_options - # set output directories output = File.join(@options["collection_dir"], "output", @options["environment"]) + @out_es = File.join(output, "es") @out_html = File.join(output, "html") @out_iiif = File.join(output, "iiif") @out_solr = File.join(output, "solr") + @auth_header = Datura::Helpers.construct_auth_header(options) Datura::Helpers.make_dirs(@out_es, @out_html, @out_iiif, @out_solr) # script locations set in child classes end @@ -49,30 +50,37 @@ def parse_markup_lang_file CommonXml.create_xml_object(self.file_location) end - def post_es(url=nil) - url = url || "#{@options["es_path"]}/#{@options["es_index"]}" + # expecting an instance of Datura::Elasticsearch::Index + def post_es(es) + error = nil begin transformed = transform_es rescue => e - return { "error" => "Error transforming ES for #{self.filename(false)}: #{e}" } + return { "error" => "Error transforming ES for #{self.filename(false)}: #{e.full_message}" } end if transformed && transformed.length > 0 transformed.each do |doc| id = doc["identifier"] - puts "posting #{id}" - puts "PATH: #{url}/_doc/#{id}" if options["verbose"] - # NOTE: If you need to do partial updates rather than replacement of doc - # you will need to add _update at the end of this URL - begin - RestClient.put("#{url}/_doc/#{id}", doc.to_json, {:content_type => :json } ) - rescue => e - return { "error" => "Error transforming or posting to ES for #{self.filename(false)}: #{e.response}" } + # before a document is posted, we need to make sure that the fields validate against the schema + if es.valid_document?(doc) + + puts "posting #{id}" + puts "PATH: #{es.index_url}/_doc/#{id}" if options["verbose"] + # NOTE: If you need to do partial updates rather than replacement of doc + # you will need to add _update at the end of this URL + begin + RestClient.put("#{es.index_url}/_doc/#{id}", doc.to_json, @auth_header.merge({:content_type => :json }) ) + rescue => e + error = "Error transforming or posting to ES for #{self.filename(false)}: #{e}" + end + else + error = "Document #{id} did not validate against the elasticsearch schema" end end else - return { "error" => "No file was transformed" } + error = "No file was transformed" end - return { "docs" => transformed } + error ? 
{ "error" => error } : { "docs" => transformed} end def post_solr(url=nil) @@ -119,7 +127,7 @@ def transform_es # check if any xpaths hit before continuing results = file_xml.xpath(*subdoc_xpaths.keys) if results.length == 0 - raise "No possible xpaths found fo file #{self.filename}, check if XML is valid or customize 'subdoc_xpaths' method" + raise "No possible xpaths found for file #{self.filename}, check if XML is valid or customize 'subdoc_xpaths' method" end subdoc_xpaths.each do |xpath, classname| subdocs = file_xml.xpath(xpath) @@ -135,6 +143,8 @@ def transform_es return es_req rescue => e puts "something went wrong transforming #{self.filename}" + puts e + puts e.backtrace raise e end end diff --git a/lib/datura/file_types/file_csv.rb b/lib/datura/file_types/file_csv.rb index 65655a940..92a1cbff6 100644 --- a/lib/datura/file_types/file_csv.rb +++ b/lib/datura/file_types/file_csv.rb @@ -13,7 +13,6 @@ def build_html_from_csv # Note: if overriding this function, it's recommended to use # a more specific identifier for each row of the CSV # but since this is a generic version, simply using the current iteration number - id = index # using XML instead of HTML for simplicity's sake builder = Nokogiri::XML::Builder.new do |xml| xml.div(class: "main_content") { @@ -34,7 +33,7 @@ def present?(item) # override to change encoding def read_csv(file_location, encoding="utf-8") - CSV.read(file_location, { + CSV.read(file_location, **{ encoding: encoding, headers: true, return_headers: true diff --git a/lib/datura/file_types/file_ead.rb b/lib/datura/file_types/file_ead.rb new file mode 100644 index 000000000..a809ab9b4 --- /dev/null +++ b/lib/datura/file_types/file_ead.rb @@ -0,0 +1,46 @@ +require_relative "../helpers.rb" +require_relative "../file_type.rb" +require_relative "../solr_poster.rb" +require "rest-client" + +class FileEad < FileType + # TODO we could include the tei_to_es and other modules directly here + # as a mixin, though then we'll need to namespace them or perish + attr_reader :es_req + + + def initialize(file_location, options) + super(file_location, options) + @script_html = File.join(options["collection_dir"], options["ead_html_xsl"]) + # There needs to be an xsl file to transform into html + # I don't think we need solr at this point) + # @script_solr = File.join(options["collection_dir"], options["tei_solr_xsl"]) + end + + def subdoc_xpaths + # match subdocs against classes + return { + "/ead" => EadToEs, + "//*[@level='item']" => EadToEsItems, + } + end + + # if there should not be any html transformation taking place + # then leave this method empty but uncommented to override default behavior + + # if you would like to use the default transformation behavior + # then comment or remove both of the following methods! + + # def transform_es + # end + + # def transform_html + # end + + def transform_iiif + raise "EAD to IIIF is not yet generalized, please override on a per project basis" + end + + # def transform_solr + # end +end diff --git a/lib/datura/file_types/file_pdf.rb b/lib/datura/file_types/file_pdf.rb new file mode 100644 index 000000000..426fe12fb --- /dev/null +++ b/lib/datura/file_types/file_pdf.rb @@ -0,0 +1,96 @@ +require "pdf-reader" +require_relative "../file_type.rb" + +class FilePdf < FileType + def initialize(file_location, options) + super(file_location, options) + #convert to pdf reading + @pdf = read_pdf(file_location) + end + + def build_html_from_pdf + # #can this be converted? 
not sure + # @csv.each_with_index do |row, index| + # next if row.header_row? + # # Note: if overriding this function, it's recommended to use + # # a more specific identifier for each row of the CSV + # # but since this is a generic version, simply using the current iteration number + # # using XML instead of HTML for simplicity's sake + # builder = Nokogiri::XML::Builder.new do |xml| + # xml.div(class: "main_content") { + # xml.ul { + # @csv.headers.each do |header| + # xml.li("#{header}: #{row[header]}") + # end + # } + # } + # end + # write_html_to_file(builder, index) + # end + end + + def present?(item) + !item.nil? && !item.empty? + end + + # override to change how the PDF is read + def read_pdf(file_location) + # parse the PDF with PDF::Reader + PDF::Reader.new(file_location) + end + + # override as necessary per project + def pdf_to_es(pdf) + PdfToEs.new(pdf, options, self.filename(false)).json + end + + def transform_es + puts "transforming #{self.filename}" + es_doc = [] + es_doc << pdf_to_es(@pdf) + if @options["output"] + filepath = "#{@out_es}/#{self.filename(false)}.json" + File.open(filepath, "w") { |f| f.write(pretty_json(es_doc)) } + end + es_doc + end + + def transform_iiif + raise "PDF to IIIF is not yet generalized, please override on a per project basis" + end + + def transform_html + puts "transforming #{self.filename} to HTML subdocuments (not implemented yet)" + # build_html_from_pdf + # # transform_html method is expected to send back a hash + # # but already wrote to filesystem so just sending back empty + # {} + end + + # I am not sure that this is going to be the best way to set this up + # but until we have more examples of PDFs that need to be ingested + # it will have to do! + def transform_solr + puts "transforming #{self.filename}" + solr_doc = Nokogiri::XML("<add></add>") + doc = Nokogiri::XML::Node.new("doc", solr_doc) + # pdf_to_solr should return an XML::Node object with children + doc = pdf_to_solr(doc, @pdf) + solr_doc.at_css("add").add_child(doc) + + # Uncomment to debug + # puts solr_doc.root.to_xml + if @options["output"] + filepath = "#{@out_solr}/#{self.filename(false)}.xml" + File.open(filepath, "w") { |f| f.write(solr_doc.root.to_xml) } + end + { "doc" => solr_doc.root.to_xml } + end + + def write_html_to_file(builder, index) + filepath = "#{@out_html}/#{index}.html" + puts "writing to #{filepath}" if @options["verbose"] + File.open(filepath, "w") { |f| f.write(builder.to_xml) } + end +end diff --git a/lib/datura/helpers.rb b/lib/datura/helpers.rb index efa0001ff..6e64557e2 100644 --- a/lib/datura/helpers.rb +++ b/lib/datura/helpers.rb @@ -45,6 +45,38 @@ def self.date_standardize(date, before=true) # params: directory (string) # returns: returns array of all files found ([] if none), # returns nil if no directory by that name exists + def self.date_display(date, nd_text="N.D.") + date_hyphen = self.date_standardize(date) + if date_hyphen + y, m, d = date_hyphen.split("-").map { |s| s.to_i } + date_obj = Date.new(y, m, d) + date_obj.strftime("%B %-d, %Y") + else + nd_text + end + end + + # date_standardize + # automatically defaults to setting incomplete dates to the earliest + # date (2016-07 becomes 2016-07-01) but pass in "false" in order + # to set it to the latest available date + def self.date_standardize(date, before=true) + if date + y, m, d = date.split(/-|\//) + if y && y.length == 4 + # use -1 to indicate that this will be the last possible + m_default = before ? "01" : "-1" + d_default = before ?
"01" : "-1" + m = m_default if !m + d = d_default if !d + if Date.valid_date?(y.to_i, m.to_i, d.to_i) + date = Date.new(y.to_i, m.to_i, d.to_i) + date.strftime("%Y-%m-%d") + end + end + end + end + def self.get_directory_files(directory, verbose_flag=false) exists = File.directory?(directory) if exists @@ -139,4 +171,10 @@ def self.should_update?(file, since_date=nil) end end + def self.construct_auth_header(options) + username = options["es_user"] + password = options["es_password"] + { "Authorization" => "Basic #{Base64::encode64("#{username}:#{password}")}" } + end + end diff --git a/lib/datura/options.rb b/lib/datura/options.rb index 36d4e47e2..c478ced42 100644 --- a/lib/datura/options.rb +++ b/lib/datura/options.rb @@ -22,6 +22,24 @@ def initialize(params) # include the collection and datura gem directories in the options @all["collection_dir"] = collection_dir @all["datura_dir"] = datura_dir + + other_configuration + end + + def es_schema_path + internal_path = File.join(@all["es_schema_path"], "#{@all["api_version"]}.yml") + if @all["es_schema_override"] + File.join(@all["collection_dir"], internal_path) + else + File.join(@all["datura_dir"], internal_path) + end + end + + # after all options have been flattened, create customization by + # combining the set options, etc + def other_configuration + # put together the elasticsearch schema path + @all["es_schema"] = es_schema_path end def print_message(variable, name) diff --git a/lib/datura/parser_options/clear_index.rb b/lib/datura/parser_options/clear_index.rb index 75176dbd0..e7e0f9a4b 100644 --- a/lib/datura/parser_options/clear_index.rb +++ b/lib/datura/parser_options/clear_index.rb @@ -1,5 +1,5 @@ module Datura::Parser - def self.clear_index_params + def self.clear_index @usage = "Usage: (es|solr)_clear_index -[options]..." 
options = {} # will hold all the options passed in by user diff --git a/lib/datura/parser_options/es_alias_add.rb b/lib/datura/parser_options/es_alias.rb similarity index 92% rename from lib/datura/parser_options/es_alias_add.rb rename to lib/datura/parser_options/es_alias.rb index 03d88e2c5..bb36b420f 100644 --- a/lib/datura/parser_options/es_alias_add.rb +++ b/lib/datura/parser_options/es_alias.rb @@ -1,6 +1,6 @@ module Datura::Parser - def self.es_alias_add - @usage = "Usage: es_alias_add -a alias -i index -e environment" + def self.es_alias + @usage = "Usage: (command) -a alias -i index -e environment" options = {} optparse = OptionParser.new do |opts| diff --git a/lib/datura/parser_options/es_alias_delete.rb b/lib/datura/parser_options/es_alias_delete.rb deleted file mode 100644 index ea38038b8..000000000 --- a/lib/datura/parser_options/es_alias_delete.rb +++ /dev/null @@ -1,50 +0,0 @@ -module Datura::Parser - def self.es_alias_delete - @usage = "Usage: es_alias_delete -a alias -i index -e environment" - options = {} - - optparse = OptionParser.new do |opts| - opts.banner = @usage - - opts.on( '-h', '--help', 'How does this work?') do - puts opts - exit - end - - options["alias"] = nil - opts.on( '-a', '--alias [input]', 'Alias (cdrhapi-v1)') do |input| - if input && input.length > 0 - options["alias"] = input - else - puts "Must specify an alias with -a flag" - exit - end - end - - options["environment"] = "development" - opts.on( '-e', '--environment [input]', 'Environment (development, production)') do |input| - if input && input.length > 0 - options["environment"] = input - end - end - - options["index"] = nil - opts.on( '-i', '--index [input]', 'Index (cdrhapi-v1.1)') do |input| - if input && input.length > 0 - options["index"] = input - else - puts "Must specify an index with -i flag" - exit - end - end - - end - - optparse.parse! - if options["alias"].nil? || options["index"].nil? - puts "must specify alias and index with -a and -i, respectively" - exit - end - options - end -end diff --git a/lib/datura/parser_options/es_create_delete_index.rb b/lib/datura/parser_options/es_index.rb similarity index 86% rename from lib/datura/parser_options/es_create_delete_index.rb rename to lib/datura/parser_options/es_index.rb index 22ed0d1cc..716835548 100644 --- a/lib/datura/parser_options/es_create_delete_index.rb +++ b/lib/datura/parser_options/es_index.rb @@ -1,6 +1,6 @@ module Datura::Parser - def self.es_create_delete_index - @usage = "Usage: admin_es_(create|delete)_index -e environment" + def self.es_index + @usage = "Usage: (command) -e environment" options = {} # will hold all the options passed in by user optparse = OptionParser.new do |opts| diff --git a/lib/datura/parser_options/es_set_schema.rb b/lib/datura/parser_options/es_set_schema.rb deleted file mode 100644 index 4f3d2388a..000000000 --- a/lib/datura/parser_options/es_set_schema.rb +++ /dev/null @@ -1,26 +0,0 @@ -module Datura::Parser - def self.es_set_schema_params - @usage = "Usage: es_set_schema -e environment" - options = {} - - optparse = OptionParser.new do |opts| - opts.banner = @usage - - opts.on( '-h', '--help', 'How does this work?') do - puts opts - exit - end - - options["environment"] = "development" - opts.on( '-e', '--environment [input]', 'Environment (development, production)') do |input| - if input && input.length > 0 - options["environment"] = input - end - end - - end - - optparse.parse! 
- options - end -end diff --git a/lib/datura/parser_options/post.rb b/lib/datura/parser_options/post.rb index daa9b7408..b6e48880f 100644 --- a/lib/datura/parser_options/post.rb +++ b/lib/datura/parser_options/post.rb @@ -22,12 +22,12 @@ def self.post_params # default to no restricted format options["format"] = nil - opts.on( '-f', '--format [input]', 'Supported formats (csv, html, tei, vra, webs)') do |input| + opts.on( '-f', '--format [input]', 'Supported formats (csv, html, ead, pdf, tei, vra, webs)') do |input| if %w[authority annotations].include?(input) puts "'authority' and 'annotations' are invalid formats".red puts "Please select a supported format or rename your custom format" exit - elsif !%w[csv html tei vra webs].include?(input) + elsif !%w[csv ead html pdf tei vra webs].include?(input) puts "Caution: Requested custom format #{input}.".red puts "See FileCustom class for implementation instructions" end diff --git a/lib/datura/requirer.rb b/lib/datura/requirer.rb index 75c7bb247..b3048c758 100644 --- a/lib/datura/requirer.rb +++ b/lib/datura/requirer.rb @@ -12,4 +12,6 @@ Dir["#{current_dir}/to_es/**/*.rb"].each { |f| require f } # file types -Dir["#{current_dir}/file_types/*.rb"].each { |f| require f } +Dir["#{current_dir}/file_types/*.rb"].each {|f| require f } +# elasticsearch files +Dir["#{current_dir}/elasticsearch/*.rb"].each {|f| require f } diff --git a/lib/datura/to_es/csv_to_es/fields.rb b/lib/datura/to_es/csv_to_es/fields.rb index 0f4a2be61..616470280 100644 --- a/lib/datura/to_es/csv_to_es/fields.rb +++ b/lib/datura/to_es/csv_to_es/fields.rb @@ -5,6 +5,8 @@ class CsvToEs ########## # FIELDS # ########## + # beginning with fields from API 1.0, including those that are unchanged in 2.0 + def id @id end @@ -37,6 +39,12 @@ def collection def collection_desc @options["collection_desc"] || @options["collection"] end + + def container_box + end + + def container_folder + end # nested field def contributor @@ -179,7 +187,7 @@ def text text_all += text_additional text_all = text_all.compact - Datura::Helpers.normalize_space(text_all.join(" ")) + Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]] end # override and add by collection as needed @@ -241,4 +249,74 @@ def works end end + # new/moved fields for API 2.0 + + def cover_image + @row["image_id"] + end + + def date_updated + end + + def fig_location + end + + def category2 + @row["subcategory"] + end + + def category3 + end + + def category4 + end + + def category5 + end + + def notes + end + + def citation + end + + def abstract + end + + def keywords2 + end + + def keywords3 + end + + def keywords4 + end + + def keywords5 + end + + def has_part + end + + def is_part_of + end + + def previous_item + end + + def next_item + end + + def event + end + + def rdf + end + + def has_source + end + + def has_relation + end + end diff --git a/lib/datura/to_es/custom_to_es/fields.rb b/lib/datura/to_es/custom_to_es/fields.rb index 0818f94e7..7a430d3be 100644 --- a/lib/datura/to_es/custom_to_es/fields.rb +++ b/lib/datura/to_es/custom_to_es/fields.rb @@ -85,6 +85,10 @@ def places def publisher end + # nested field + def rdf + end + # nested field def recipient end diff --git a/lib/datura/to_es/ead_to_es.rb b/lib/datura/to_es/ead_to_es.rb new file mode 100644 index 000000000..318053df7 --- /dev/null +++ b/lib/datura/to_es/ead_to_es.rb @@ -0,0 +1,33 @@ +require_relative "xml_to_es.rb" +require_relative "ead_to_es/fields.rb" +require_relative "ead_to_es/request.rb" +require_relative "ead_to_es/xpaths.rb" + 
+########################################################### +# NOTE: DO NOT EDIT EAD_TO_ES FILES IN SCRIPTS DIRECTORY # +########################################################### + +# (unless you are a CDRH dev and then you may do so very cautiously) +# this file provides defaults for ALL of the collections included +# in the API and changing it could alter dozens of sites unexpectedly! +# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION + +# HOW DO I CHANGE XPATHS? +# You may add or modify xpaths in each collection's ead_to_es.rb file +# located in the collections//scripts directory + +# HOW DO I CHANGE FIELD CONTENT? +# You may need to alter an xpath, but otherwise you may also +# copy paste the field defined in ead_to_es/fields.rb and change +# it as needed. If you are dealing with something particularly complex +# you may need to consult with a CDRH dev for help + +# HOW DO I CUSTOMIZE THE FIELDS BEING SENT TO ELASTICSEARCH? +# You will need to look in the ead_to_es/request.rb file, which has +# collections of fields being sent to elasticsearch +# you can override individual chunks of fields in your collection + +class EadToEs < XmlToEs + # Override XmlToEs methods that need to be customized for EAD here + # rather than in one of the files in ead_to_es/ +end diff --git a/lib/datura/to_es/ead_to_es/fields.rb b/lib/datura/to_es/ead_to_es/fields.rb new file mode 100644 index 000000000..95b89976e --- /dev/null +++ b/lib/datura/to_es/ead_to_es/fields.rb @@ -0,0 +1,369 @@ +class EadToEs < XmlToEs + # Note to add custom fields, use "assemble_collection_specific" from request.rb + # and be sure to use the _d, _i, _k, or _t suffix to get the correct field type + + ########## + # FIELDS # + ########## + + def id + get_text(@xpaths["identifier"]) + end + + # def id_dc + # # TODO use api path from config or something? + # "https://cdrhapi.unl.edu/doc/#{@id}" + # end + + def abstract + get_text(@xpaths["abstract"]) + end + + def alternative + end + + def annotations_text + # TODO what should default behavior be?
+ end + + def category + end + + # note this does not sort the creators + def creator + creators = get_list(@xpaths["creator"]) + if creators + return creators.map { |creator| { "name" => Datura::Helpers.normalize_space(creator) } } + end + end + + # returns ; delineated string of alphabetized creators + def creator_sort + return get_text(@xpaths["creators"]) + end + + def collection + @options["collection"] + end + + def collection_desc + @options["collection_desc"] || @options["collection"] + end + + def container_box + end + + def container_folder + end + + def contributor + # contribs = [] + # @xpaths["contributors"].each do |xpath| + # eles = @xml.xpath(xpath) + # eles.each do |ele| + # contribs << { + # "id" => ele["id"], + # "name" => CommonXml.normalize_space(ele.text), + # "role" => CommonXml.normalize_space(ele["role"]) + # } + # end + # end + # contribs.uniq + end + + def data_type + "ead" + end + + def date(before=true) + datestr = get_text(@xpaths["date"]) + if datestr + return Datura::Helpers.date_standardize(datestr, before) + end + end + + def date_display + get_text(@xpaths["date_display"]) + end + + def date_not_after + date(false) + end + + def date_not_before + date(true) + end + + def date_updated + end + + def description + get_text(@xpaths["description"]) + end + + def extent + get_text(@xpaths["extent"]) + end + + def format + get_list(@xpaths["format"]) + end + + def image_id + # Note: don't pull full path because will be pulled by IIIF + # How to deal with this? + images = get_list(@xpaths["image_id"]) + images[0] if images + end + + def keywords + get_list(@xpaths["keywords"]) + end + + def language + get_text(@xpaths["language"]) + end + + def languages + get_list(@xpaths["languages"]) + end + + def medium + # Default behavior is the same as "format" method + format + end + + def person + # TODO will need some examples of how this will work + # and put in the xpaths above, also for attributes, etc + # should contain name, id, and role + # eles = @xml.xpath(@xpaths["person"]) + # people = eles.map do |p| + # { + # "id" => "", + # "name" => CommonXml.normalize_space(p.text), + # "role" => CommonXml.normalize_space(p["role"]) + # } + # end + # return people + end + + def people + # @json["person"].map { |p| CommonXml.normalize_space(p["name"]) } + end + + def places + return get_list(@xpaths["places"]) + end + + def publisher + get_text(@xpaths["publisher"]) + end + + def recipient + # eles = @xml.xpath(@xpaths["recipient"]) + # people = eles.map do |p| + # { + # "id" => "", + # "name" => CommonXml.normalize_space(p.text), + # "role" => "recipient" + # } + # end + # return people + end + + def relation + end + + def rights + # Note: override by collection as needed + get_text(@xpaths["rights"]) + end + + def rights_holder + get_text(@xpaths["rights_holder"]) + end + + def rights_uri + # by default collections have no uri associated with them + # copy this method into collection specific tei_to_es.rb + # to return specific string or xpath as required + end + + def source + get_text(@xpaths["source"]) + end + + def spatial + end + + def subjects + get_list(@xpaths["subjects"]) + end + + def subcategory + # subcategory = get_text(@xpaths["subcategory"]) + # subcategory.length > 0 ? 
subcategory : "none" + end + + def text + # handling separate fields in array + # means no worrying about handling spacing between words + text = [] + @xpaths.keys.each do |xpath| + body = get_text(@xpaths[xpath]) + if body + text << body + end + end + text + # TODO: do we need to preserve tags like in text? if so, turn get_text to true + # text << CommonXml.convert_tags_in_string(body) + # text += text_additional + # return CommonXml.normalize_space(text.join(" ")) + end + + # def text_additional + # # Note: Override this per collection if you need additional + # # searchable fields or information for collections + # # just make sure you return an array at the end! + + # text = [] + # text << title + # end + + def title + get_list(@xpaths["title"]) + end + + def title_sort + Datura::Helpers.normalize_name(title) + end + + def type + get_text(@xpaths["type"]) + end + + def topics + get_list(@xpaths["topic"]) + end + + def uri + # override per collection + # should point at the live website view of resource + end + + def uri_data + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/source/tei" + return "#{base}/#{subpath}/#{@id}.xml" + end + + def uri_html + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" + return "#{base}/#{subpath}/#{@id}.html" + end + + def works + # TODO need to create a list of items, maybe an array of ids + end + + # new/moved fields for API 2.0 + + def cover_image + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end + end + + def date_updated + get_list(@xpaths["date_updated"]) + end + + def fig_location + get_list(@xpaths["fig_location"]) + end + + def category2 + get_list(@xpaths["subcategory"]) + end + + def category3 + get_text(@xpaths["category3"]) + end + + def category4 + get_text(@xpaths["category4"]) + end + + def category5 + get_text(@xpaths["category5"]) + end + + def notes + get_text(@xpaths["notes"]) + end + + def citation + # nested + end + + def container_box + end + + def container_folder + end + + def abstract + get_text(@xpaths["abstract"]) + end + + def keywords2 + get_text(@xpaths["keywords2"]) + end + + def keywords3 + get_text(@xpaths["keywords3"]) + end + + def keywords4 + get_text(@xpaths["keywords4"]) + end + + def keywords5 + get_text(@xpaths["keywords5"]) + end + + def has_part + # nested + end + + def is_part_of + # nested + end + + def previous_item + # nested + end + + def next_item + # nested + end + + def event + # nested + end + + def rdf + # nested + end + + def has_source + # nested + end + + def has_relation + # nested + end +end diff --git a/lib/datura/to_es/ead_to_es/request.rb b/lib/datura/to_es/ead_to_es/request.rb new file mode 100644 index 000000000..e0e0c9899 --- /dev/null +++ b/lib/datura/to_es/ead_to_es/request.rb @@ -0,0 +1,7 @@ +class EadToEs < XmlToEs + + # please refer to generic xml to es request file, request.rb + # and override methods specific to TEI transformation here + # project specific overrides should go in the COLLECTION's overrides! 
+ +end diff --git a/lib/datura/to_es/ead_to_es/xpaths.rb b/lib/datura/to_es/ead_to_es/xpaths.rb new file mode 100644 index 000000000..989d5122c --- /dev/null +++ b/lib/datura/to_es/ead_to_es/xpaths.rb @@ -0,0 +1,26 @@ +class EadToEs < XmlToEs + # These are the default xpaths that are used for collections + # if you require a different xpath, please override the xpath in + # the specific collection's EadToEs file or create a new method + # in that file which returns a different value + def xpaths_list + { + "abstract" => "/ead/archdesc/did/abstract", + "creator" => ["/ead/archdesc/did/origination/persname", "/ead/eadheader/filedesc/titlestmt/creator"], + "date" => "/ead/eadheader/filedesc/publicationstmt/date", + "description" => "/ead/archdesc/scopecontent/p", + "format" => "/ead/archdesc/did/physdesc/genreform", + "identifier" => "/ead/archdesc/did/unitid", + "language" => "/ead/eadheader/profiledesc/langusage/language", + "publisher" => "/ead/eadheader/filedesc/publicationstmt/publisher", + "repository_contact" => "/ead/archdesc/did/repository/address/*", + "rights" => "/ead/archdesc/descgrp/accessrestrict/p", + "rights_holder" => "/ead/archdesc/did/repository/corpname", + "source" => "/ead/archdesc/descgrp/prefercite/p", + "subjects" => "/ead/archdesc/controlaccess/*[not(name()='head')]", + "title" => "/ead/archdesc/did/unittitle", + "text" => "/ead/eadheader/filedesc/titlestmt/*", + "items" => "//*[@level='item']/did/unitid" + }.merge(override_xpaths) + end +end diff --git a/lib/datura/to_es/ead_to_es_items.rb b/lib/datura/to_es/ead_to_es_items.rb new file mode 100644 index 000000000..c4778cf3d --- /dev/null +++ b/lib/datura/to_es/ead_to_es_items.rb @@ -0,0 +1,33 @@ +require_relative "ead_to_es.rb" +require_relative "ead_to_es_items/fields.rb" +require_relative "ead_to_es_items/request.rb" +require_relative "ead_to_es_items/xpaths.rb" + +########################################################### +# NOTE: DO NOT EDIT EAD_TO_ES FILES IN SCRIPTS DIRECTORY # +########################################################### + +# (unless you are a CDRH dev and then you may do so very cautiously) +# this file provides defaults for ALL of the collections included +# in the API and changing it could alter dozens of sites unexpectedly! +# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION + +# HOW DO I CHANGE XPATHS? +# You may add or modify xpaths in each collection's ead_to_es.rb file +# located in the collections//scripts directory + +# HOW DO I CHANGE FIELD CONTENT? +# You may need to alter an xpath, but otherwise you may also +# copy paste the field defined in ead_to_es/fields.rb and change +# it as needed. If you are dealing with something particularly complex +# you may need to consult with a CDRH dev for help + +# HOW DO I CUSTOMIZE THE FIELDS BEING SENT TO ELASTICSEARCH?
+# You will need to look in the ead_to_es_items/request.rb file, which has +# collections of fields being sent to elasticsearch +# you can override individual chunks of fields in your collection + +class EadToEsItems < EadToEs + # Override XmlToEs methods that need to be customized for EAD items here + # rather than in one of the files in ead_to_es_items/ +end diff --git a/lib/datura/to_es/ead_to_es_items/fields.rb b/lib/datura/to_es/ead_to_es_items/fields.rb new file mode 100644 index 000000000..7764e083b --- /dev/null +++ b/lib/datura/to_es/ead_to_es_items/fields.rb @@ -0,0 +1,271 @@ +class EadToEsItems < EadToEs + # Note to add custom fields, use "assemble_collection_specific" from request.rb + # and be sure to use the _d, _i, _k, or _t suffix to get the correct field type + + ########## + # FIELDS # + ########## + + def id + get_text(@xpaths["identifier"]) + end + + # def id_dc + # # TODO use api path from config or something? + # "https://cdrhapi.unl.edu/doc/#{@id}" + # end + + def alternative + end + + def annotations_text + # TODO what should default behavior be? + end + + def category + end + + # note this does not sort the creators + def creator + creators = get_list(@xpaths["creators"]) + if creators + return creators.map { |creator| { "name" => CommonXml.normalize_space(creator) } } + end + end + + # returns ; delineated string of alphabetized creators + def creator_sort + return get_text(@xpaths["creators"]) + end + + def collection + "#{@options["collection"]}_items" + end + + def collection_desc + # @options["collection_desc"] || @options["collection"] + end + + def contributor + # contribs = [] + # @xpaths["contributors"].each do |xpath| + # eles = @xml.xpath(xpath) + # eles.each do |ele| + # contribs << { + # "id" => ele["id"], + # "name" => CommonXml.normalize_space(ele.text), + # "role" => CommonXml.normalize_space(ele["role"]) + # } + # end + # end + # contribs.uniq + end + + def data_type + "ead_item" + end + + def date(before=true) + datestr = get_text(@xpaths["date"]) + if datestr + return Datura::Helpers.date_standardize(datestr, before) + end + end + + def date_display + display = get_text(@xpaths["date_display"]) + if display && !display.empty? + return display + else + return get_text(@xpaths["date"]) + end + end + + def date_not_after + date(false) + end + + def date_not_before + date(true) + end + + def description + get_text(@xpaths["description"]) + end + + def extent + get_text(@xpaths["extent"]) + end + + def format + get_text(@xpaths["format"]) + end + + def get_id + # doc = id + doc = get_text(@xpaths["identifier"]) + if !doc + title = get_text(@xpaths["file"]) + if title + return "#{@filename}_#{title}" + end + end + return "#{@filename}_#{doc}" + end + + def image_id + # # Note: don't pull full path because will be pulled by IIIF + # images = get_list(@xpaths["image_id"]) + # images[0] if images + end + + def keywords + get_list(@xpaths["keywords"]) + end + + def language + get_text(@xpaths["language"]) + end + + def languages + get_list(@xpaths["languages"]) + end + + def medium + # Default behavior is the same as "format" method + format + end + + def person + # TODO will need some examples of how this will work + # and put in the xpaths above, also for attributes, etc + # should contain name, id, and role + # eles = @xml.xpath(@xpaths["person"]) + # people = eles.map do |p| + # { + # "id" => "", + # "name" => CommonXml.normalize_space(p.text), + # "role" => CommonXml.normalize_space(p["role"]) + # } + # end + # return people + end + + def people + # @json["person"].map { |p|
CommonXml.normalize_space(p["name"]) } + end + + def places + return get_list(@xpaths["places"]) + end + + def publisher + get_text(@xpaths["publisher"]) + end + + def recipient + # eles = @xml.xpath(@xpaths["recipient"]) + # people = eles.map do |p| + # { + # "id" => "", + # "name" => CommonXml.normalize_space(p.text), + # "role" => "recipient" + # } + # end + # return people + end + + def relation + end + + def rights + # Note: override by collection as needed + get_text(@xpaths["rights"]) + end + + def rights_holder + get_text(@xpaths["rights_holder"]) + end + + def rights_uri + # by default collections have no uri associated with them + # copy this method into collection specific tei_to_es.rb + # to return specific string or xpath as required + end + + def source + get_text(@xpaths["source"]) + end + + def spatial + end + + def subjects + # TODO default behavior? + end + + def subcategory + subcategory = get_text(@xpaths["subcategory"]) + subcategory && subcategory.length > 0 ? subcategory : "none" + end + + def text + # handling separate fields in array + # means no worrying about handling spacing between words + text = [] + body = get_text(@xpaths["text"]) + if body + text << body + end + # TODO: do we need to preserve tags like in text? if so, turn get_text to true + # text << CommonXml.convert_tags_in_string(body) + text += text_additional + return Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]] + end + + def text_additional + # Note: Override this per collection if you need additional + # searchable fields or information for collections + # just make sure you return an array at the end! + + text = [] + text << title + end + + def title + get_list(@xpaths["title"])&.first + end + + def title_sort + Datura::Helpers.normalize_name(title) if title + end + + def topics + get_list(@xpaths["topic"]) + end + + def type + get_text(@xpaths["type"]) + end + + def uri + # override per collection + # should point at the live website view of resource + end + + def uri_data + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/source/tei" + return "#{base}/#{subpath}/#{@id}.xml" + end + + def uri_html + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" + return "#{base}/#{subpath}/#{@id}.html" + end + + def works + # TODO figure out how this behavior should look + end +end diff --git a/lib/datura/to_es/ead_to_es_items/request.rb b/lib/datura/to_es/ead_to_es_items/request.rb new file mode 100644 index 000000000..45e5f28c5 --- /dev/null +++ b/lib/datura/to_es/ead_to_es_items/request.rb @@ -0,0 +1,7 @@ +class EadToEsItems < EadToEs + + # please refer to generic xml to es request file, request.rb + # and override methods specific to EAD item transformation here + # project specific overrides should go in the COLLECTION's overrides!
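+ + # e.g. a minimal item-level sketch (hypothetical field name; uses the + # default "repository_id" xpath from xpaths.rb below and the _k keyword suffix): + # + # def assemble_collection_specific + # @json["repository_id_k"] = get_text(@xpaths["repository_id"]) + # end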
+ +end diff --git a/lib/datura/to_es/ead_to_es_items/xpaths.rb b/lib/datura/to_es/ead_to_es_items/xpaths.rb new file mode 100644 index 000000000..23b941e29 --- /dev/null +++ b/lib/datura/to_es/ead_to_es_items/xpaths.rb @@ -0,0 +1,21 @@ +class EadToEsItems < EadToEs + # These are the default xpaths that are used for collections + # if you require a different xpath, please override the xpath in + # the specific collection's TeiToEs file or create a new method + # in that file which returns a different value + def xpaths_list + { + "abstract" => "did/abstract", + "date" => "did/unitdate", + "date_display" => "did/unitdate", + "description" => "scopecontent/p", + "extent" => "did/physdesc/extent", + "format" => "did/physdesc/physfacet", + "image_url" => "did/dao/@href", + "identifier" => "did/unitid", + "repository_id" => "did/unitid[@type='repository']", + "title" => "did/unittitle/title", + "type" => "did/physdesc/genreform", + }.merge(override_xpaths) + end +end diff --git a/lib/datura/to_es/es_request.rb b/lib/datura/to_es/es_request.rb index 6aeeef056..ee120bbf7 100644 --- a/lib/datura/to_es/es_request.rb +++ b/lib/datura/to_es/es_request.rb @@ -19,28 +19,51 @@ def assemble_json # below not alphabetical to reflect their position # in the cdrh api schema - - assemble_identifiers - assemble_categories - assemble_locations - assemble_descriptions - assemble_other_metadata - assemble_dates - assemble_publishing - assemble_people - assemble_spatial - assemble_references - assemble_text + if @options["api_version"] == "1.0" + assemble_json_1 + elsif @options["api_version"] == "2.0" + assemble_json_2 + end assemble_collection_specific - + assemble_text @json end + def assemble_json_1 + #fields for API v 1.0 + assemble_identifiers_1 + assemble_categories_1 + assemble_locations_1 + assemble_descriptions_1 + assemble_other_metadata_1 + assemble_dates_1 + assemble_publishing_1 + assemble_people_1 + assemble_spatial_1 + assemble_references_1 + assemble_rdf_1 + end + + def assemble_json_2 + #field for API v 2.0 + assemble_identifiers_2 + assemble_metadata_digital_2 + assemble_metadata_original_2 + assemble_metadata_interpretive_2 + assemble_relations_2 + assemble_additional_2 + end + ############## # components # ############## + def assemble_collection_specific + # add your own per collection + # with format + # @json["fieldname"] = field_contents + end - def assemble_categories + def assemble_categories_1 @json["category"] = category @json["subcategory"] = subcategory @json["data_type"] = data_type @@ -49,20 +72,14 @@ def assemble_categories @json["subjects"] = subjects end - def assemble_collection_specific - # add your own per collection - # with format - # @json["fieldname"] = field_contents - end - - def assemble_dates + def assemble_dates_1 @json["date"] = date @json["date_not_after"] = date_not_after @json["date_not_before"] = date_not_before @json["date_display"] = date_display end - def assemble_descriptions + def assemble_descriptions_1 @json["alternative"] = alternative @json["description"] = description @json["title"] = title @@ -70,18 +87,18 @@ def assemble_descriptions @json["topics"] = topics end - def assemble_identifiers + def assemble_identifiers_1 @json["identifier"] = @id end - def assemble_locations + def assemble_locations_1 @json["uri"] = uri @json["uri_data"] = uri_data @json["uri_html"] = uri_html @json["image_id"] = image_id end - def assemble_other_metadata + def assemble_other_metadata_1 @json["format"] = format @json["language"] = language @json["languages"] = languages 
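The version dispatch in `assemble_json` above means a collection only supplies field methods; `api_version` decides which assemble chunk calls them. A minimal sketch of a collection override (hypothetical values; assumes the `event` nested field from the 2.0 schema earlier in this diff, and that the file lives in the data repo's scripts/overrides directory, which the posting script requires automatically):

```ruby
# scripts/overrides/ead_to_es.rb in a data repo (sketch, not shipped with datura)
class EadToEs < XmlToEs
  # picked up by assemble_additional_2 when api_version is "2.0";
  # keys must match the event properties in the elasticsearch schema
  def event
    [{
      "type" => "publication",
      "date_begin" => date(true),
      "date_end" => date(false)
    }]
  end
end
```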
@@ -91,7 +108,7 @@ def assemble_other_metadata @json["medium"] = medium end - def assemble_people + def assemble_people_1 # container fields @json["person"] = person @json["contributor"] = contributor @@ -99,7 +116,7 @@ def assemble_people @json["recipient"] = recipient end - def assemble_publishing + def assemble_publishing_1 @json["publisher"] = publisher @json["rights"] = rights @json["rights_uri"] = rights_uri @@ -107,20 +124,97 @@ def assemble_publishing @json["source"] = source end - def assemble_references + def assemble_references_1 @json["keywords"] = keywords @json["places"] = places @json["works"] = works end - def assemble_spatial + def assemble_spatial_1 + @json["spatial"] = spatial + end + + def assemble_rdf_1 + @json["rdf"] = rdf + end + + def assemble_identifiers_2 + @json["identifier"] = @id # does this still work? + @json["collection"] = collection + @json["collection_desc"] = collection_desc + @json["uri"] = uri + @json["uri_data"] = uri_data + @json["uri_html"] = uri_html + @json["data_type"] = data_type + @json["fig_location"] = fig_location + @json["cover_image"] = cover_image + @json["title"] = title + @json["title_sort"] = title_sort + @json["alternative"] = alternative + @json["date_updated"] = date_updated + @json["category"] = category + @json["category2"] = category2 + @json["category3"] = category3 + @json["category4"] = category4 + @json["category5"] = category5 + @json["notes"] = notes + end + + def assemble_metadata_digital_2 + @json["contributor"] = contributor + end + + def assemble_metadata_original_2 + @json["creator"] = creator + @json["citation"] = citation + @json["date"] = date + @json["date_display"] = date_display + @json["date_not_before"] = date_not_before + @json["date_not_after"] = date_not_after + @json["format"] = format + @json["medium"] = medium + @json["extent"] = extent + @json["language"] = language + @json["rights_holder"] = rights_holder + @json["rights"] = rights + @json["rights_uri"] = rights_uri + @json["container_box"] = container_box + @json["container_folder"] = container_folder + end + + def assemble_metadata_interpretive_2 + @json["subjects"] = subjects + @json["abstract"] = abstract + @json["description"] = description + @json["type"] = type + @json["topics"] = topics + @json["keywords"] = keywords + @json["keywords2"] = keywords2 + @json["keywords3"] = keywords3 + @json["keywords4"] = keywords4 + @json["keywords5"] = keywords5 + end + + def assemble_relations_2 + @json["has_relation"] = has_relation + @json["has_source"] = has_source + @json["has_part"] = has_part + @json["is_part_of"] = is_part_of + @json["previous_item"] = previous_item + @json["next_item"] = next_item + end + + def assemble_additional_2 @json["spatial"] = spatial + @json["places"] = places + @json["person"] = person + @json["event"] = event + @json["rdf"] = rdf end def assemble_text @json["annotations_text"] = annotations_text @json["text"] = text - # @json["abstract"] end end diff --git a/lib/datura/to_es/html_to_es/fields.rb b/lib/datura/to_es/html_to_es/fields.rb index c95d452e4..babe8adfc 100644 --- a/lib/datura/to_es/html_to_es/fields.rb +++ b/lib/datura/to_es/html_to_es/fields.rb @@ -152,9 +152,11 @@ def text # means no worrying about handling spacing between words text = [] body = get_text(@xpaths["text"]) - text << body + if body + text << body + end text += text_additional - Datura::Helpers.normalize_space(text.join(" ")) + Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]] end def text_additional @@ -218,4 +220,100 @@ def 
works get_list(@xpaths["works"]) end + # new/moved fields for API 2.0 + + def cover_image + # get_list may return nil if there is no match, so guard before calling .first (matches the TEI behavior) + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end + end + + def date_updated + get_list(@xpaths["date_updated"]) + end + + def fig_location + get_list(@xpaths["fig_location"]) + end + + def category2 + get_list(@xpaths["subcategory"]) + end + + def category3 + get_text(@xpaths["category3"]) + end + + def category4 + get_text(@xpaths["category4"]) + end + + def category5 + get_text(@xpaths["category5"]) + end + + def notes + get_text(@xpaths["notes"]) + end + + def citation + # nested + end + + def container_box + end + + def container_folder + end + + def abstract + get_text(@xpaths["abstract"]) + end + + def keywords2 + get_text(@xpaths["keywords2"]) + end + + def keywords3 + get_text(@xpaths["keywords3"]) + end + + def keywords4 + get_text(@xpaths["keywords4"]) + end + + def keywords5 + get_text(@xpaths["keywords5"]) + end + + def has_part + # nested + end + + def is_part_of + # nested + end + + def previous_item + # nested + end + + def next_item + # nested + end + + def event + # nested + end + + def rdf + # nested + end + + def has_source + # nested + end + + def has_relation + # nested + end + end diff --git a/lib/datura/to_es/pdf_to_es.rb b/lib/datura/to_es/pdf_to_es.rb new file mode 100644 index 000000000..80e86b2c4 --- /dev/null +++ b/lib/datura/to_es/pdf_to_es.rb @@ -0,0 +1,53 @@ +require_relative "../helpers.rb" +require_relative "pdf_to_es/fields.rb" +require_relative "pdf_to_es/request.rb" + +######################################### +# NOTE: DO NOT EDIT THIS FILE!!!!!!!!! # +######################################### +# (unless you are a CDRH dev and then you may do so very cautiously) +# this file provides defaults for ALL of the collections included +# in the API and changing it could alter dozens of sites unexpectedly! +# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION + +# WHAT IS THIS FILE? 
+# This file sets up default behavior for transforming PDF +# documents to Elasticsearch JSON documents + +class PdfToEs + + attr_reader :json, :pdf + # variables + # id, row, pdf + + def initialize(pdf, options={}, filename=nil) + @pdf = pdf + @options = options + @filename = filename + @id = get_id + + create_json + end + + # getter for @json response object + def create_json + @json = {} + # if anything needs to be done before processing + # do it here (ex: reading in annotations into memory) + preprocessing + assemble_json + postprocessing + end + + def get_id + @filename.delete_suffix(".pdf") + end + + def preprocessing + # copy this in your pdf_to_es collection file to customize + end + + def postprocessing + # copy this in your pdf_to_es collection file to customize + end +end diff --git a/lib/datura/to_es/pdf_to_es/fields.rb b/lib/datura/to_es/pdf_to_es/fields.rb new file mode 100644 index 000000000..ef0848938 --- /dev/null +++ b/lib/datura/to_es/pdf_to_es/fields.rb @@ -0,0 +1,313 @@ +class PdfToEs + # Note to add custom fields, use "assemble_collection_specific" from request.rb + # and be sure to either use the _d, _i, _k, or _t to use the correct field type + + ########## + # FIELDS # + ########## + # beginning with fields from API 1.0, including those that are unchanged in 2.0 + + def id + get_id + end + + def alternative + # @row["alternative"] + end + + def annotations_text + # @row["annotations_text"] + end + + def category + # @row["category"] + end + + # nested field + def creator + # if @row["creator"] + # @row["creator"].split("; ").map do |p| + # { "name" => p } + # end + # end + end + + def collection + @options["collection"] + end + + def collection_desc + @options["collection_desc"] || @options["collection"] + end + + def container_box + end + + def container_folder + end + + # nested field + def contributor + # if @row["contributor"] + # @row["contributor"].split("; ").map do |p| + # { "name" => p } + # end + # end + end + + def data_type + "pdf" + end + + def date(before=true) + # Datura::Helpers.date_standardize(@row["date"], before) + end + + def date_display + # Datura::Helpers.date_display(date) + end + + def date_not_after + # if @row["date_not_after"] && !@row["date_not_after"].empty? + # Datura::Helpers.date_standardize(@row["date_not_after"], false) + # else + # date(false) + # end + end + + def date_not_before + # if @row["date_not_before"] && !@row["date_not_before"].empty? 
+ # Datura::Helpers.date_standardize(@row["date_not_before"], true) + # else + # date(true) + # end + end + + def description + # @row["description"] + end + + def extent + # @row["extent"] + end + + def format + # @row["format"] + end + + def image_id + # @row["image_id"] + end + + def keywords + # if @row["keywords"] + # @row["keywords"].split("; ") + # end + end + + def language + # @row["language"] + end + + def languages + # if @row["languages"] + # @row["languages"].split("; ") + # end + end + + def medium + # @row["medium"] + end + + # nested field + def person + # if @row["person"] + # @row["person"].split("; ").map do |p| + # { "name" => p } + # end + # end + end + + def places + # if @row["places"] + # @row["places"].split("; ") + # end + end + + def publisher + # @row["publisher"] + end + + # nested field + def recipient + # if @row["recipient"] + # @row["recipient"].split("; ").map do |p| + # { "name" => p } + # end + # end + end + + def relation + # @row["relation"] + end + + def rights + # @row["rights"] + end + + def rights_holder + # @row["rights_holder"] + end + + def rights_uri + # @row["rights_uri"] + end + + def source + # @row["source"] + end + + # nested field + def spatial + end + + def subjects + end + + def subcategory + # PDFs have no @row, so this stays commented out like the other defaults + # @row["subcategory"] + end + + # text is generally going to be pulled from the pages of the PDF itself + def text + text_all = [] + @pdf.pages.each do |page| + text_all << page.text + end + text_all += text_additional + text_all = text_all.compact + Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]] + end + + # override and add by collection as needed + def text_additional + [ title ] + end + + def title + @filename.delete_suffix(".pdf") + end + + def title_sort + Datura::Helpers.normalize_name(title) if title + end + + def topics + # if @row["topics"] + # @row["topics"].split("; ") + # end + end + + def type + # @row["type"] + end + + def uri + File.join( + @options["site_url"], + "item", + @id + ) + end + + def uri_data + File.join( + @options["data_base"], + "data", + @options["collection"], + "source/pdf", + # @id is @filename with any ".pdf" suffix stripped, so add it back here + # exactly once + "#{@id}.pdf" + ) + end + + def works + # if @row["works"] + # @row["works"].split("; ") + # end + end + + # new/moved fields for API 2.0 + + def cover_image + # @row["image_id"] + end + + def date_updated + end + + def fig_location + end + + def category2 + # @row["subcategory"] + end + + def category3 + end + + def category4 + end + + def category5 + end + + def notes + end + + def citation + end + + def abstract + end + + def keywords2 + end + + def keywords3 + end + + def keywords4 + end + + def keywords5 + end + + def has_part + end + + def is_part_of + end + + def previous_item + end + + def next_item + end + + def event + end + + def rdf + end + + def has_relation + end + + def has_source + end + + def uri_html + end + +end diff --git a/lib/datura/to_es/pdf_to_es/request.rb b/lib/datura/to_es/pdf_to_es/request.rb new file mode 100644 index 000000000..dd20f8a62 --- /dev/null +++ b/lib/datura/to_es/pdf_to_es/request.rb @@ -0,0 +1,8 @@ +class PdfToEs + include EsRequest + + # please refer to generic es_request.rb file + # and override the JSON being sent to elasticsearch here, if needed + # project specific overrides should go in the COLLECTION's overrides! 
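+ # + # For illustration only, a hypothetical sketch of such a collection override + # in a data repo: "assemble_collection_specific" is the hook named in the + # note in fields.rb, while the "scanned_by_k" field is invented here and + # simply demonstrates the _k keyword suffix convention: + # + # class PdfToEs + # def assemble_collection_specific + # @json["scanned_by_k"] = "CDRH" + # end + # end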
+ +end diff --git a/lib/datura/to_es/tei_to_es/fields.rb b/lib/datura/to_es/tei_to_es/fields.rb index 861d42d70..960fd1929 100644 --- a/lib/datura/to_es/tei_to_es/fields.rb +++ b/lib/datura/to_es/tei_to_es/fields.rb @@ -21,7 +21,9 @@ def category # nested field def creator creators = get_list(@xpaths["creator"]) - creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } + if creators + creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } + end end def collection @@ -49,8 +51,14 @@ def data_type end def date(before=true) - datestr = get_list(@xpaths["date"]).first - Datura::Helpers.date_standardize(datestr, before) + if get_list(@xpaths["date"]) + datestr = get_list(@xpaths["date"]).first + else + datestr = nil + end + if datestr && !datestr.empty? + # pass the before flag through so date_not_before and date_not_after + # keep their distinct padding behavior + Datura::Helpers.date_standardize(datestr, before) + end end def date_display @@ -84,12 +92,16 @@ def extent end def format - get_list(@xpaths["format"]).first + if get_list(@xpaths["format"]) + get_list(@xpaths["format"]).first + end end def image_id # Note: don't pull full path because will be pulled by IIIF - get_list(@xpaths["image_id"]).first + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end end def keywords @@ -98,7 +110,9 @@ def keywords def language # uses the first language discovered in the document - get_list(@xpaths["language"]).first + if get_list(@xpaths["language"]) + get_list(@xpaths["language"]).first + end end def languages @@ -129,6 +143,10 @@ def publisher get_text(@xpaths["publisher"]) end + # nested field + def rdf + end + # nested field def recipient eles = @xml.xpath(@xpaths["recipient"]) @@ -187,11 +205,13 @@ def text # means no worrying about handling spacing between words text_all = [] body = get_text(@xpaths["text"], keep_tags: false, delimiter: '') - text_all << body + if body + text_all << body + end # TODO: do we need to preserve tags like in text? 
if so, turn get_text to true # text_all << CommonXml.convert_tags_in_string(body) text_all += text_additional - Datura::Helpers.normalize_space(text_all.join(" ")) + Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]] end def text_additional @@ -209,7 +229,9 @@ def title end def title_sort - Datura::Helpers.normalize_name(title) + if title + Datura::Helpers.normalize_name(title) + end end def topics @@ -251,6 +273,104 @@ def works get_list(@xpaths["works"]) end + # new/moved fields for API 2.0 + + def cover_image + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end + end + + def date_updated + get_list(@xpaths["date_updated"]) + end + + def fig_location + get_list(@xpaths["fig_location"]) + end + + def category2 + get_list(@xpaths["subcategory"]) + end + + def category3 + get_text(@xpaths["category3"]) + end + + def category4 + get_text(@xpaths["category4"]) + end + + def category5 + get_text(@xpaths["category5"]) + end + + def notes + get_text(@xpaths["notes"]) + end + + def citation + # nested + end + + def container_box + end + + def container_folder + end + + def abstract + get_text(@xpaths["abstract"]) + end + + def keywords2 + get_text(@xpaths["keywords2"]) + end + + def keywords3 + get_text(@xpaths["keywords3"]) + end + + def keywords4 + get_text(@xpaths["keywords4"]) + end + + def keywords5 + get_text(@xpaths["keywords5"]) + end + + def has_part + # nested + end + + def is_part_of + # nested + end + + def previous_item + # nested + end + + def next_item + # nested + end + + def event + # nested + end + + def rdf + # nested + end + + def has_source + # nested + end + + def has_relation + # nested + end + protected # default behavior is simply to comma delineate fields @@ -267,5 +387,6 @@ def source_to_s(f) .reject! { |value| value.nil? || value.strip.empty? 
} .join(", ") end + end diff --git a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb index 7e4ff79be..4d3d43de2 100644 --- a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb +++ b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb @@ -16,7 +16,9 @@ def category def creator creators = get_list(@xpaths["creators"], false, @parent_xml) - creators.map { |c| { "name" => c } } + if creators + creators.map { |c| { "name" => c } } + end end def creators diff --git a/lib/datura/to_es/tei_to_es/xpaths.rb b/lib/datura/to_es/tei_to_es/xpaths.rb index e9b9140b6..8f4f2605e 100644 --- a/lib/datura/to_es/tei_to_es/xpaths.rb +++ b/lib/datura/to_es/tei_to_es/xpaths.rb @@ -72,6 +72,8 @@ def xpaths_list # "medium" => "", + "notes" => "//note[@type='project']", + # NOTE: if you would like to associate a role you may need the parent element # such as correspAction[@type='deliveredTo'], etc "person" => [ @@ -125,7 +127,7 @@ def xpaths_list # NOTE this xpath will often catch notes, back, etc which a project may wish to # exclude if they are using the annotations_text field for editorial comments - "text" => "//text//text()", + "text" => ["//text//text()", "//note[@type='project']"], "title" => "/TEI/teiHeader/fileDesc/titleStmt/title[1]", diff --git a/lib/datura/to_es/vra_to_es/fields.rb b/lib/datura/to_es/vra_to_es/fields.rb index 75e374e80..bdd3c4fd3 100644 --- a/lib/datura/to_es/vra_to_es/fields.rb +++ b/lib/datura/to_es/vra_to_es/fields.rb @@ -20,8 +20,10 @@ def category # nested field def creator - creators = get_list(@xpaths["creators"]) - creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } + creators = get_list(@xpaths["creator"]) + if creators + creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } + end end def collection @@ -50,7 +52,9 @@ def data_type def date(before=true) datestr = get_list(@xpaths["date"]).first - Datura::Helpers.date_standardize(datestr, before) + if datestr + Datura::Helpers.date_standardize(datestr, before) + end end def date_display @@ -114,16 +118,16 @@ def person # subject element if get_text("@type", xml: p) == "personalName" { - id: nil, - name: get_text(".", xml: p), - role: nil + "id" => nil, + "name" => get_text(".", xml: p), + "role" => nil } # agent element else { - id: nil, - name: get_text("name", xml: p), - role: get_text("role", xml: p) + "id" => nil, + "name" => get_text("name", xml: p), + "role" => get_text("role", xml: p) } end end @@ -138,6 +142,10 @@ def publisher get_text(@xpaths["publisher"]) end + # nested field + def rdf + end + # nested field def recipient eles = get_elements(@xpaths["recipient"]) @@ -185,11 +193,13 @@ def text # handling separate fields in array # means no worrying about handling spacing between words text_all = [] - text_all << get_text(@xpaths["text"]) + if get_text(@xpaths["text"]) + text_all << get_text(@xpaths["text"]) + end # TODO: do we need to preserve tags like in text? 
if so, turn get_text to true # text_all << CommonXml.convert_tags_in_string(body) text_all += text_additional - Datura::Helpers.normalize_space(text_all.join(" ")) + Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]] end def text_additional @@ -251,4 +261,100 @@ def uri_html def works get_list(@xpaths["works"]) end + + # new/moved fields for API 2.0 + + def cover_image + get_list(@xpaths["image_id"]).first + end + + def date_updated + get_list(@xpaths["date_updated"]) + end + + def fig_location + get_list(@xpaths["fig_location"]) + end + + def category2 + get_list(@xpaths["subcategory"]) + end + + def category3 + get_text(@xpaths["category3"]) + end + + def category4 + get_text(@xpaths["category4"]) + end + + def category5 + get_text(@xpaths["category5"]) + end + + def notes + get_text(@xpaths["notes"]) + end + + def citation + # nested + end + + def container_box + end + + def container_folder + end + + def abstract + get_text(@xpaths["abstract"]) + end + + def keywords2 + get_text(@xpaths["keywords2"]) + end + + def keywords3 + get_text(@xpaths["keywords3"]) + end + + def keywords4 + get_text(@xpaths["keywords4"]) + end + + def keywords5 + get_text(@xpaths["keywords5"]) + end + + def has_part + # nested + end + + def is_part_of + # nested + end + + def previous_item + # nested + end + + def next_item + # nested + end + + def event + # nested + end + + def rdf + # nested + end + + def has_source + # nested + end + + def has_relation + # nested + end end diff --git a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb index 8d8f904b1..cab0c5591 100644 --- a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb +++ b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb @@ -13,7 +13,9 @@ def category def creator creators = get_list(@xpaths["creators"], xml: @parent_xml) - creators.map { |c| { "name" => c } } + if creators + creators.map { |c| { "name" => c } } + end end def creator_sort diff --git a/lib/datura/to_es/webs_to_es/fields.rb b/lib/datura/to_es/webs_to_es/fields.rb index 815b78aa8..8721e4462 100644 --- a/lib/datura/to_es/webs_to_es/fields.rb +++ b/lib/datura/to_es/webs_to_es/fields.rb @@ -39,7 +39,11 @@ def data_type end def date(before=true) - datestr = get_list(@xpaths["date"]).first + if get_list(@xpaths["date"]) + datestr = get_list(@xpaths["date"]).first + else + datestr = nil + end if datestr Datura::Helpers.date_standardize(datestr, true) end @@ -80,7 +84,9 @@ def format end def image_id - get_list(@xpaths["image_id"]).first + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end end def keywords @@ -111,6 +117,10 @@ def publisher get_text(@xpaths["publisher"]) end + # nested field + def rdf + end + # nested field def recipient end @@ -152,9 +162,11 @@ def text # means no worrying about handling spacing between words text = [] body = get_text(@xpaths["text"]) - text << body + if body + text << body + end text += text_additional - Datura::Helpers.normalize_space(text.join(" ")) + Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]] end def text_additional @@ -210,4 +222,102 @@ def uri_html def works get_list(@xpaths["works"]) end + + # new/moved fields for API 2.0 + + def cover_image + if get_list(@xpaths["image_id"]) + get_list(@xpaths["image_id"]).first + end + end + + def date_updated + get_list(@xpaths["date_updated"]) + end + + def fig_location + get_list(@xpaths["fig_location"]) + end + + def category2 + get_list(@xpaths["subcategory"]) + 
end + + def category3 + get_text(@xpaths["category3"]) + end + + def category4 + get_text(@xpaths["category4"]) + end + + def category5 + get_text(@xpaths["category5"]) + end + + def container_box + end + + def container_folder + end + + def notes + get_text(@xpaths["notes"]) + end + + def citation + # nested + end + + def abstract + get_text(@xpaths["abstract"]) + end + + def keywords2 + get_text(@xpaths["keywords2"]) + end + + def keywords3 + get_text(@xpaths["keywords3"]) + end + + def keywords4 + get_text(@xpaths["keywords4"]) + end + + def keywords5 + get_text(@xpaths["keywords5"]) + end + + def has_part + # nested + end + + def is_part_of + # nested + end + + def previous_item + # nested + end + + def next_item + # nested + end + + def event + # nested + end + + def rdf + # nested + end + + def has_source + # nested + end + + def has_relation + # nested + end end diff --git a/lib/datura/to_es/xml_to_es.rb b/lib/datura/to_es/xml_to_es.rb index 3cbbb1e46..9c0072af7 100644 --- a/lib/datura/to_es/xml_to_es.rb +++ b/lib/datura/to_es/xml_to_es.rb @@ -34,9 +34,8 @@ def initialize(xml, options={}, parent_xml=nil, filename=nil) @options = options @parent_xml = parent_xml @filename = filename - @id = get_id @xpaths = xpaths_list - + @id = get_id create_json end @@ -93,6 +92,9 @@ def get_elements(*xpaths, xml: nil) def get_list(xpaths, keep_tags: false, xml: nil, sort: false) xpath_array = Array(xpaths) list = get_xpaths(xpath_array, keep_tags: keep_tags, xml: xml) + if !list || list.empty? + return nil + end sort ? list.sort : list end @@ -104,6 +106,9 @@ def get_list(xpaths, keep_tags: false, xml: nil, sort: false) def get_text(xpaths, keep_tags: false, xml: nil, delimiter: ";", sort: false) # ensure all xpaths are an array before beginning list = get_list(xpaths, keep_tags: keep_tags, xml: xml, sort: sort) + if !list || list.empty? 
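+ # get_list (above) now returns nil when an xpath has no matches, so normalize empty results to nil here as well rather than joining an empty list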
+ return nil + end list.join("#{delimiter} ") end diff --git a/lib/xslt/ead_to_html/ead_to_html.xsl b/lib/xslt/ead_to_html/ead_to_html.xsl new file mode 100644 index 000000000..d0aa8f923 --- /dev/null +++ b/lib/xslt/ead_to_html/ead_to_html.xsl @@ -0,0 +1,60 @@ + + + + + + + + + + + + + + + + + + + + + + + + +production + + + + + + + + + + + + + + + + + + diff --git a/test/common_xml_test.rb b/test/common_xml_test.rb index 765f85c97..6d5c22224 100644 --- a/test/common_xml_test.rb +++ b/test/common_xml_test.rb @@ -1,4 +1,4 @@ -require "test_helper" +require_relative "test_helper" require "nokogiri" class CommonXmlTest < Minitest::Test diff --git a/test/datura_test.rb b/test/datura_test.rb index 92db7ee99..f0e256d1d 100644 --- a/test/datura_test.rb +++ b/test/datura_test.rb @@ -1,4 +1,4 @@ -require "test_helper" +require_relative "test_helper" class DaturaTest < Minitest::Test def test_that_it_has_a_version_number diff --git a/test/es_index_test.rb b/test/es_index_test.rb new file mode 100644 index 000000000..5cee19c27 --- /dev/null +++ b/test/es_index_test.rb @@ -0,0 +1,130 @@ +require_relative "test_helper" + +class Datura::ElasticsearchIndexTest < Minitest::Test + + @@options = { + "api_version" => "2.0", + "es_index" => "fake_index", + "es_path" => "fake_path", + "es_schema" => File.join( + File.expand_path(File.dirname(__FILE__)), + "../lib/config/es_api_schemas/2.0.yml" + ) + } + + # stub in get_schema so that we can test get_schema_mapping without + # worrying about integration with actual index + + class Datura::Elasticsearch::Index + def get_schema + raw = File.read( + File.join( + File.expand_path(File.dirname(__FILE__)), + "fixtures/es_mapping_2.0.json" + ) + ) + JSON.parse(raw) + end + end + + def test_initialize + # test that options populate if you pass existing ones in + es = Datura::Elasticsearch::Index.new(@@options) + path = File.join(@@options["es_path"], @@options["es_index"]) + assert_equal path, es.index_url + + # test that schema mapping occurs, although it will be with the stubbed + # in version of get_schema above, rather than index integration + es = Datura::Elasticsearch::Index.new(@@options, schema_mapping: true) + assert es.schema_mapping + end + + def test_get_schema_mapping + # let's just see what happens + es = Datura::Elasticsearch::Index.new(@@options) + es.get_schema_mapping + assert es.schema_mapping["fields"] + assert_equal 46, es.schema_mapping["fields"].length + assert_equal( + /^.*_d$|^.*_i$|^.*_k$|^.*_n$|^.*_t$|^.*_t_en$|^.*_t_es$/, + es.schema_mapping["dynamic"] + ) + end + + def test_valid_document? 
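+ # a document should validate only when every field it contains, including + # nested subfields and dynamic suffixes, appears in the schema mapping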
+ es = Datura::Elasticsearch::Index.new(@@options) + + # basic fields + assert es.valid_document?({ "identifier" => "a" }) + assert es.valid_document?({ + "collection" => "a", + "date_not_before" => "2012-01-01", + "text" => "a", + }) + + # nested fields with child fields not matching top level field + assert es.valid_document?({ + "creator" => [ + { + "id" => "a", + "name" => "a" + } + ] + }) + + # nested fields with child fields matching top level / dynamic + assert es.valid_document?({ + "creator" => [ + { + "subcategory" => "a", + "data_type" => "a", + "keyword_k" => "a" + } + ] + }) + + # dynamic fields, each type + assert es.valid_document?({ "new_field_d" => "2012-01-01" }) + assert es.valid_document?({ "new_field_i" => "1" }) + assert es.valid_document?({ "new_field_k" => "a" }) + assert es.valid_document?({ "new_field_t" => "a" }) + assert es.valid_document?({ "new_field_t_en" => "a" }) + assert es.valid_document?({ "new_field_t_es" => "a" }) + + # test failures of basic and dynamic fields + refute es.valid_document?({ "bad_field" => "a" }) + refute es.valid_document?({ "dynamic_t_bad" => "a" }) + + # test failure of nested field with all bad subfields + refute es.valid_document?({ + "creator" => [ + { + "bad_field" => "a", + "another_one" => "a" + } + ] + }) + + # test failure of nested field with mixture of good / bad + refute es.valid_document?({ + "creator" => [ + { + "id" => "a", + "keyword_k" => "a" + }, + { + "id" => "a", + "bad_field" => "a" + } + ] + }) + + # test that bad fields hidden with good still fail + refute es.valid_document?({ + "collection" => "a", + "keyword_k" => "a", + "bad_field" => "a" + }) + end + +end diff --git a/test/fixtures/es_mapping_2.0.json b/test/fixtures/es_mapping_2.0.json new file mode 100644 index 000000000..f82189503 --- /dev/null +++ b/test/fixtures/es_mapping_2.0.json @@ -0,0 +1,345 @@ +{ + "fake_index" : { + "mappings" : { + "_doc" : { + "dynamic_templates" : [ + { + "date_fields" : { + "match" : "*_d", + "mapping" : { + "format" : "yyyy-MM-dd||epoch_millis", + "type" : "date" + } + } + }, + { + "integer_fields" : { + "match" : "*_i", + "mapping" : { + "type" : "integer" + } + } + }, + { + "keyword_fields" : { + "match" : "*_k", + "mapping" : { + "normalizer" : "keyword_normalized", + "type" : "keyword" + } + } + }, + { + "nested_fields" : { + "match" : "*_n", + "mapping" : { + "type" : "nested" + } + } + }, + { + "text_fields" : { + "match" : "*_t", + "mapping" : { + "analyzer" : "english", + "type" : "text" + } + } + }, + { + "text_english" : { + "match" : "*_t_en", + "mapping" : { + "analyzer" : "english", + "type" : "text" + } + } + }, + { + "text_spanish" : { + "match" : "*_t_es", + "mapping" : { + "analyzer" : "spanish", + "type" : "text" + } + } + } + ], + "properties" : { + "abstract" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "alternative" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "annotations_text" : { + "type" : "text", + "analyzer" : "english" + }, + "category" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "collection" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "collection_desc" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "contributor" : { + "type" : "nested", + "properties" : { + "id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "name" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "role" : { + "type" : "keyword", + "normalizer" : 
"keyword_normalized" + } + } + }, + "coverage-spatial" : { + "type" : "nested", + "properties" : { + "city" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "coordinates" : { + "type" : "geo_point" + }, + "country" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "county" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "place_name" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "postal_code" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "region" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "state" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "street" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + } + } + }, + "creator" : { + "type" : "nested", + "properties" : { + "id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "name" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + } + } + }, + "creator_sort" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "data_type" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "date" : { + "type" : "date", + "format" : "yyyy-MM-dd||epoch_millis" + }, + "date_display" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "date_not_after" : { + "type" : "date", + "format" : "yyyy-MM-dd||epoch_millis" + }, + "date_not_before" : { + "type" : "date", + "format" : "yyyy-MM-dd||epoch_millis" + }, + "description" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "extent" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "format" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "identifier" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "image_id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "image_location" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "keywords" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "language" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "languages" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "medium" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "people" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "person" : { + "type" : "nested", + "properties" : { + "id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "name" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "role" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + } + } + }, + "places" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "publisher" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "recipient" : { + "type" : "nested", + "properties" : { + "id" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "name" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "role" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + } + } + }, + "relation" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "rights" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "rights_holder" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "rights_uri" : { + "type" : "keyword", 
+ "normalizer" : "keyword_normalized" + }, + "source" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "subcategory" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "subjects" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "text" : { + "type" : "text", + "analyzer" : "english" + }, + "title" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "title_sort" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "topics" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "type" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "uri" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "uri_data" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "uri_html" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + }, + "works" : { + "type" : "keyword", + "normalizer" : "keyword_normalized" + } + } + } + } + } +} diff --git a/test/helpers_test.rb b/test/helpers_test.rb index fcc4f9c7d..88d740ecf 100644 --- a/test/helpers_test.rb +++ b/test/helpers_test.rb @@ -1,4 +1,4 @@ -require "test_helper" +require_relative "test_helper" require "nokogiri" class Datura::HelpersTest < Minitest::Test diff --git a/test/options_test.rb b/test/options_test.rb index cd088785b..1bf33b60f 100644 --- a/test/options_test.rb +++ b/test/options_test.rb @@ -1,4 +1,4 @@ -require "test_helper" +require_relative "test_helper" # override the Options class method so that we # can test without real config files @@ -6,7 +6,9 @@ class Datura::Options def read_all_configs fake1, fake2 @general_config_pub = { "default" => { - "a" => "general default public" + "a" => "general default public", + "es_schema_path" => "lib/config", + "api_version" => "2.0" } } @collection_config_pub = { diff --git a/test/tei_to_es_test.rb b/test/tei_to_es_test.rb index abf22a898..19d59c9d1 100644 --- a/test/tei_to_es_test.rb +++ b/test/tei_to_es_test.rb @@ -1,4 +1,5 @@ -require "test_helper" +require_relative "test_helper" +require "nokogiri" class TeiToEsTest < Minitest::Test def setup