Merge pull request #231 from CDRH/new_fields
API v2 support with new fields
techgique authored Sep 26, 2024
2 parents 605da3c + 93ba036 commit e1498c7
Showing 74 changed files with 4,145 additions and 614 deletions.
2 changes: 1 addition & 1 deletion .ruby-version
@@ -1 +1 @@
-2.7.1
+3.1.2
38 changes: 38 additions & 0 deletions CHANGELOG.md
@@ -49,15 +49,49 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- minor test for Datura::Helpers.date_standardize
- documentation for web scraping
- documentation for CsvToEs (transforming CSV files and posting to elasticsearch)
- documentation for adding new ingest formats to Datura
- byebug gem for debugging
- instructions for installing Javascript Runtime files for Saxon
- API schema can be either the original 1.0 or the newly updated 2.0 (which adds new fields, including nested fields); 1.0 runs by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo:
```
api_version: '2.0'
```
See new schema (2.0) documentation [here](https://github.com/CDRH/datura/blob/main/docs/schema_v2.md)
- schema validation with API version 2.0: invalidly constructed documents will not post
- authentication with Elasticsearch 8.5
- field overrides for new fields in the new API schema
- functionality to transform EAD files and post them to elasticsearch
- functionality to transform PDF files (including text and metadata) and post them to elasticsearch
- limiting `text` field to a specific limit: `text_limit` in `public.yml` or `private.yml`
- configuration options related to Elasticsearch, including `es_schema_override` and `es_schema_path` to change the location of the Elasticsearch schema
- more detailed errors including a stack trace
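For reference, the new configuration keys above might be combined in `public.yml` or `private.yml` along these lines (a sketch only — the values and the schema path shown here are illustrative, not defaults):

```
api_version: '2.0'
text_limit: 100000               # truncate the "text" field to this many characters
es_schema_override: true
es_schema_path: config/es_schema_v2.json
```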

### Changed
- update ruby to 3.1.2
- date_standardize now relies on strftime instead of manual zero padding for month, day
- minor corrections to documentation
- XPath: "text" is now ingested as an array and will be displayed delimited by spaces
- "text" field now includes "notes" XPath
- refactored posting script (`Datura.run`)
- refactored command line methods into elasticsearch library
- refactored and moved date_standardize and date_display helper methods
- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays when there are no matches; fields have been changed to check for these nil values
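The nil-returning behavior can be sketched as follows. This is a stand-in using stdlib REXML rather than Datura's actual Nokogiri-based transformer classes, so the method and document here are illustrative only:

```ruby
require "rexml/document"

# Stand-in for Datura's get_text (the real method lives on the TEI
# transformer and uses Nokogiri): return nil, not "", when nothing matches.
def get_text(doc, xpath)
  nodes = REXML::XPath.match(doc, xpath)
  nodes.empty? ? nil : nodes.map(&:text).join(" ")
end

doc = REXML::Document.new("<TEI><title>Example Title</title></TEI>")

title  = get_text(doc, "//title")   # "Example Title"
author = get_text(doc, "//author")  # nil (previously an empty string)

# field overrides now need a nil guard before chaining String methods
display_author = author ? author.strip : nil
```

Any override that previously assumed an empty string (e.g. called `.strip` or `.empty?` unconditionally) needs a guard like the last line.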

### Migration
- check that the "text" XPath produces the desired behavior
- use Elasticsearch 8.5 or higher and add authentication if security is enabled. Add the following to `public.yml` or `private.yml` in the data repo:
```
es_user: username
es_password: ********
```
- upgrade data repos to Ruby 3.1.2
- add api version to config as described above
- make sure fields are consistent with the API schema; many have been renamed or changed in format
- add nil checks with get_text and get_list methods as needed
- add EadToES overrides if ingesting EAD files
- add `byebug` and `pdf-reader` to Gemfile in repos based on Datura
- if overriding the `read_csv` method in `lib/datura/file_type.rb`, the hash must be prefixed with ** (`**{}`).
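The `**` requirement follows from Ruby 3's separation of positional and keyword arguments: `CSV.read` expects keywords, so a plain hash must be double-splatted. A minimal sketch of such an override (the default options shown are illustrative, not Datura's exact defaults):

```ruby
require "csv"
require "tempfile"

# Hypothetical override of read_csv (lib/datura/file_type.rb). Under Ruby 3,
# passing the options hash positionally raises ArgumentError; it must be
# double-splatted into CSV.read's keyword parameters.
def read_csv(file_location, options = {})
  CSV.read(
    file_location,
    **{ encoding: "utf-8", headers: true }.merge(options)
  )
end

# quick demonstration with a throwaway file
file = Tempfile.new(["sample", ".csv"])
file.write("name,qty\napples,3\n")
file.close

table = read_csv(file.path)
table.headers  # ["name", "qty"]
```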

## [v0.2.0-beta](https://github.com/CDRH/datura/compare/v0.1.6...v0.2.0-beta) - 2020-08-17 - Altering field and xpath behavior, adds get_elements

@@ -68,6 +102,8 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- Tests and fixtures for all supported formats except CustomToEs
- `get_elements` returns nodeset given xpath arguments
- `spatial` nested fields `spatial.type` and `spatial.title`
- Versioning system to support multiple elasticsearch schemas
- Validator to check against the elasticsearch copy

### Changed
- Arguments for `get_text`, `get_list`, and `get_xpaths`
@@ -76,12 +112,14 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- Documentation updated
- Changed Install instructions to include RVM and gemset naming conventions
- API field `coverage_spatial` is now just `spatial`
- refactored executables into modules and classes

### Migration
- Change `coverage_spatial` nested field to `spatial`
- `get_text`, `get_list`, and `get_xpaths` require changing arguments to keyword (like `xml` and `keep_tags`)
- Recommend checking xpaths and behavior of fields after updating to this version, as some defaults have changed
- Possible to refactor previous FileCsv overrides to use new CsvToEs abilities, but not necessary
- Config files should specify `api_version` as 1.0 or 2.0
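The keyword-argument migration for `get_text` and friends can be illustrated with a stand-in method (not the real Datura class; the body just echoes its inputs):

```ruby
# Sketch of the signature change: formerly positional arguments such as
# keep_tags and xml are now keywords, so old positional call sites raise
# ArgumentError and must be updated.
def get_text(xpath, keep_tags: false, xml: nil)
  { xpath: xpath, keep_tags: keep_tags, xml: xml }
end

# before: get_text("//p", true) -- now the flag must be named:
result = get_text("//p", keep_tags: true)

begin
  get_text("//p", true)  # old positional style
rescue ArgumentError
  # raised under the new keyword-only signature
end
```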

## [v0.1.6](https://github.com/CDRH/datura/compare/v0.1.5...v0.1.6) - 2020-04-24 - Improvements to CSV, WEBS transformers and adds Custom transformer

53 changes: 35 additions & 18 deletions Gemfile.lock
@@ -1,40 +1,57 @@
PATH
remote: .
specs:
datura (0.2.0.pre.beta)
datura (0.2.0)
byebug (~> 11.0)
colorize (~> 0.8.1)
nokogiri (~> 1.8)
rest-client (~> 2.0.2)
nokogiri (~> 1.10)
pdf-reader (~> 2.12)
rest-client (~> 2.1)

GEM
remote: https://rubygems.org/
specs:
Ascii85 (1.1.1)
afm (0.2.2)
bigdecimal (3.1.8)
byebug (11.1.3)
colorize (0.8.1)
domain_name (0.5.20190701)
unf (>= 0.0.5, < 1.0.0)
http-cookie (1.0.5)
domain_name (0.6.20240107)
hashery (2.1.2)
http-accept (1.7.0)
http-cookie (1.0.7)
domain_name (~> 0.5)
mime-types (3.4.1)
mime-types (3.5.2)
mime-types-data (~> 3.2015)
mime-types-data (3.2022.0105)
mini_portile2 (2.8.0)
minitest (5.15.0)
mime-types-data (3.2024.0903)
mini_portile2 (2.8.7)
minitest (5.16.3)
netrc (0.11.0)
nokogiri (1.13.6)
mini_portile2 (~> 2.8.0)
nokogiri (1.16.7)
mini_portile2 (~> 2.8.2)
racc (~> 1.4)
racc (1.6.0)
nokogiri (1.16.7-x86_64-darwin)
racc (~> 1.4)
pdf-reader (2.12.0)
Ascii85 (~> 1.0)
afm (~> 0.2.1)
hashery (~> 2.0)
ruby-rc4
ttfunk
racc (1.8.1)
rake (13.0.6)
rest-client (2.0.2)
rest-client (2.1.0)
http-accept (>= 1.7.0, < 2.0)
http-cookie (>= 1.0.2, < 2.0)
mime-types (>= 1.16, < 4.0)
netrc (~> 0.8)
unf (0.1.4)
unf_ext
unf_ext (0.0.8.1)
ruby-rc4 (0.1.5)
ttfunk (1.8.0)
bigdecimal (~> 3.1)

PLATFORMS
ruby
x86_64-darwin-20

DEPENDENCIES
bundler (>= 1.16.0, < 3.0)
@@ -43,4 +60,4 @@ DEPENDENCIES
rake (~> 13.0)

BUNDLED WITH
2.1.4
2.2.33
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ Looking for information about how to post documents? Check out the

## Install / Set Up Data Repo

Check that Ruby is installed, preferably 2.7.x or up. If you are using RVM, see the RVM section below.
Check that Ruby is installed, preferably 3.1.2 or up. If you are using RVM, see the RVM section below.

If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).

16 changes: 4 additions & 12 deletions bin/admin_es_create_index
@@ -2,18 +2,10 @@

require "datura"

params = Datura::Parser.es_create_delete_index
options = Datura::Options.new(params).all

put_url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true")
get_url = File.join(options["es_path"], "_cat", "indices?v&pretty=true")

begin
# TODO if we want to add any default settings to the new index,
# we can do that with the payload and then use rest-client again instead of exec
# however, rest-client appears to require a payload and won't allow simple "PUT" with none
puts "Creating new ES index: #{put_url}"
exec("curl -XPUT #{put_url}")
es = Datura::Elasticsearch::Index.new
es.create
es.set_schema
rescue => e
puts "Error: #{e.inspect}"
puts e
end
11 changes: 3 additions & 8 deletions bin/admin_es_delete_index
@@ -1,15 +1,10 @@
#!/usr/bin/env ruby

require "datura"
require "rest-client"

params = Datura::Parser.es_create_delete_index
options = Datura::Options.new(params).all

url = File.join(options["es_path"], "#{options["es_index"]}?pretty=true")

begin
puts JSON.parse(RestClient.delete(url))
es = Datura::Elasticsearch::Index.new
es.delete
rescue => e
puts "Error with request, check that index exists before deleting: #{e}"
puts e
end
25 changes: 2 additions & 23 deletions bin/es_alias_add
@@ -2,29 +2,8 @@

require "datura"

require "json"
require "rest-client"

params = Datura::Parser.es_alias_add
options = Datura::Options.new(params).all

ali = options["alias"]
idx = options["index"]
url = File.join(options["es_path"], "_aliases")

data = {
actions: [
{ remove: { alias: ali, index: "_all" } },
{ add: { alias: ali, index: idx } }
]
}

begin
res = RestClient.post(url, data.to_json, { content_type: :json })
puts "Results of setting alias #{ali} to index #{idx}"
puts res
list = JSON.parse(RestClient.get(url))
puts "\nAll aliases: #{JSON.pretty_generate(list)}"
Datura::Elasticsearch::Alias.add
rescue => e
puts "Error: #{e.response}"
puts e
end
14 changes: 5 additions & 9 deletions bin/es_alias_delete
@@ -2,12 +2,8 @@

require "datura"

require "json"
require "rest-client"

params = Datura::Parser.es_alias_delete
options = Datura::Options.new(params).all
url = File.join(options["es_path"], options["index"], "_alias", options["alias"])

res = JSON.parse(RestClient.delete(url))
puts JSON.pretty_generate(res)
begin
Datura::Elasticsearch::Alias.delete
rescue => e
puts e
end
9 changes: 1 addition & 8 deletions bin/es_alias_list
@@ -2,11 +2,4 @@

require "datura"

require "json"
require "rest-client"

options = Datura::Options.new({}).all
url = File.join(options["es_path"], "_aliases")

res = JSON.parse(RestClient.get(url))
puts JSON.pretty_generate(res)
Datura::Elasticsearch::Alias.list
89 changes: 4 additions & 85 deletions bin/es_clear_index
@@ -2,89 +2,8 @@

require "datura"

require "json"
require "rest-client"

def confirm_basic(options, url)
# verify that the user is really sure about the index they're about to wipe
puts "Are you sure that you want to remove entries from"
puts " #{options["collection"]}'s #{options['environment']} environment?"
puts "url: #{url}"
puts "y/N"
answer = STDIN.gets.chomp
# boolean
return !!(answer =~ /[yY]/)
end

def main

# run the parameters through the option parser
params = Datura::Parser.clear_index_params
options = Datura::Options.new(params).all
if options["collection"] == "all"
clear_all(options)
else
clear_index(options)
end
end

def build_data(options)
if options["regex"]
field = options["field"] || "identifier"
return {
"query" => {
"bool" => {
"must" => [
{ "regexp" => { field => options["regex"] } },
{ "term" => { "collection" => options["collection"] } }
]
}
}
}
else
return {
"query" => { "term" => { "collection" => options["collection"] } }
}
end
end

def clear_all(options)
puts "Please verify that you want to clear EVERY ENTRY from the ENTIRE INDEX\n\n"
puts "== FIELD / REGEX FILTERS NOT AVAILABLE FOR THIS OPTION, YOU'LL WIPE EVERYTHING ==\n\n"
puts "Seriously, you probably do not want to do this"
puts "Are you running this on something other than your local machine? RETHINK IT."
puts "Type: 'Yes I'm sure'"
confirm = STDIN.gets.chomp
if confirm == "Yes I'm sure"
url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true"
post url, { "query" => { "match_all" => {} } }
else
puts "You typed '#{confirm}'. This is incorrect, exiting program"
exit
end
end

def clear_index(options)
url = "#{options["es_path"]}/#{options["es_index"]}/_doc/_delete_by_query?pretty=true"
confirmation = confirm_basic(options, url)

if confirmation
data = build_data(options)
post(url, data)
else
puts "come back anytime!"
exit
end
begin
Datura::Elasticsearch::Index.clear
rescue => e
puts e
end

def post(url, data={})
begin
puts "clearing from #{url}: #{data.to_json}"
res = RestClient.post(url, data.to_json, {:content_type => :json})
puts res.body
rescue => e
puts "error posting to ES: #{e.response}"
end
end

main
16 changes: 3 additions & 13 deletions bin/es_get_schema
@@ -2,19 +2,9 @@

require "datura"

require "json"
require "rest-client"
require "yaml"

params = Datura::Parser.es_set_schema_params
options = Datura::Options.new(params).all

begin
url = File.join(options["es_path"], options["es_index"], "_mapping", "_doc?pretty=true")
res = RestClient.get(url)
puts res.body
puts "environment: #{options["environment"]}"
puts "url: #{url}"
es = Datura::Elasticsearch::Index.new
puts JSON.pretty_generate(es.get_schema)
rescue => e
puts "Error: #{e.response}"
puts e
end