Skip to content

Commit

Permalink
Merge pull request #226 from CDRH/whitman-fixes
Browse files Browse the repository at this point in the history
Whitman fixes
  • Loading branch information
techgique authored Aug 28, 2024
2 parents b680e70 + 40393d8 commit 67eedaa
Show file tree
Hide file tree
Showing 13 changed files with 78 additions and 45 deletions.
16 changes: 12 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,37 +34,45 @@ Versioning](https://semver.org/spec/v2.0.0.html).
- documentation for adding new ingest formats to Datura
- byebug gem for debugging
- instructions for installing Javascript Runtime files for Saxon
- API schema can either be 1.0 or 2.0 (which includes nested fields); 1.0 will be run by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo:
- API schema can either be the original 1.0 or the newly updated 2.0 (which includes new fields including nested fields); 1.0 will be run by default unless 2.0 is specified. Add the following to `public.yml` or `private.yml` in the data repo:
```
api_version: '2.0'
```
See new schema (2.0) documentation [here](https://github.com/CDRH/datura/docs/schema_v2.md)
- schema validation with API version 2.0, invalidly constructed documents will not post
- schema validation with API version 2.0: invalidly constructed documents will not post
- authentication with Elasticesarch 8.5; add the following to `public.yml` or `private.yml` in the data repo:
```
es_user: username
es_password: ********
```
- field overrides for new fields in the new API schema
- functionality to transform EAD files and post them to elasticsearch
- functionality to transform PDF files (including text and metadata) and post them to elasticsearch
- limiting `text` field to a specific limit: `text_limit` in `public.yml` or `private.yml`
- configuration options related to Elasticsearch, including `es_schema_override` and `es_schema_path` to change the location of the Elasticsearch schema
- more detailed errors including a stack trace

### Changed
- update ruby to 3.1.2
- date_standardize now relies on strftime instead of manual zero padding for month, day
- minor corrections to documentation
- XPath: "text" is now ingested as an array and will be displayed delimitted by spaces
- "text" field now includes "notes" XPath
- refactored posting script (`Datura.run`)
- refactored command line methods into elasticsearch library
- refactored and moved date_standardize and date_display helper methods
- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays if there are no matches
- Nokogiri methods `get_text` and `get_list` on TEI now return nil rather than empty strings or arrays if there are no matches. fields have been changed to check for these nil values

### Migration
- check to make sure "text" xpath is doing desired behavior
- use Elasticsearch 8.5 or higher and add authentication as described above if security is enabled. See [dev docs instructions](https://github.com/CDRH/cdrh_dev_docs/blob/update_elasticsearch_documentation/publishing/2_basic_requirements.md#downloading-elasticsearch).
- upgrade data repos to Ruby 3.1.2
-
- add api version to config as described above
- make sure fields are consistent with the api schema, many have been renamed or changed in format
- add nil checks with get_text and get_list methods
- add nil checks with get_text and get_list methods as needed
- add EadToES overrides if ingesting EAD files
- add `byebug` and `pdf-reader` to Gemfile in repos based on Datura
- if overriding the `read_csv` method in `lib/datura/file_type.rb`, the hash must be prefixed with ** (`**{}`).

## [v0.2.0-beta](https://github.com/CDRH/datura/compare/v0.1.6...v0.2.0-beta) - 2020-08-17 - Altering field and xpath behavior, adds get_elements
Expand Down
24 changes: 12 additions & 12 deletions lib/config/public.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,22 +9,20 @@
# the collection specific configuration files:
# (config/public.yml and config/private.yml)


###################
# Defaults #
###################

default:

# SCRIPT POWER
# recommend this be increased in private.yml
# on more powerful systems to improve runtime
threads: 5

# LOGGING
log_old_number: 1 # number of log files before beginning to erase
log_size: 32768000 # size of log file in bytes
log_level: Logger::INFO # available levels: UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG
log_old_number: 1 # number of log files before beginning to erase
log_size: 32768000 # size of log file in bytes
log_level: Logger::INFO # available levels: UNKNOWN, FATAL, ERROR, WARN, INFO, DEBUG

# ELASTICSEARCH SCHEMA CONFIGURATION
# if es_schema_override is false, datura is base directory
Expand All @@ -40,14 +38,17 @@ default:
api_version: "1.0"
# NOTE: es_schema option is set later as combination of above
# es_schema_override, es_schema_path, and api_version
# ES currently has a limited character size for keyword fields of 1000000
# exceeding this limit (generally in text field) will cause errors when searching
text_limit: 900000

# RESOURCE LOCATIONS
data_base: https://cdrhmedia.unl.edu # xml, csv, html snippets, etc
media_base: https://cdrhmedia.unl.edu # images, audio, video
es_index: override_to_set_index # elasticsearch index name
es_path: http://localhost:9200 # elasticsearch path (recommend override)
solr_core: override_to_set_core # solr core name
solr_path: http://localhost:8983/solr # solr path (recommend override)
data_base: https://cdrhmedia.unl.edu # xml, csv, html snippets, etc
media_base: https://cdrhmedia.unl.edu # images, audio, video
es_index: override_to_set_index # elasticsearch index name
es_path: http://localhost:9200 # elasticsearch path (recommend override)
solr_core: override_to_set_core # solr core name
solr_path: http://localhost:8983/solr # solr path (recommend override)

# OUTPUT LOCATION
# default is [environment]/output/[file_type]
Expand Down Expand Up @@ -92,7 +93,6 @@ default:

development:
data_base: https://cdrhdev1.unl.edu/media

##################
# Production #
##################
Expand Down
4 changes: 2 additions & 2 deletions lib/datura/parser_options/post.rb
Original file line number Diff line number Diff line change
Expand Up @@ -22,12 +22,12 @@ def self.post_params

# default to no restricted format
options["format"] = nil
opts.on( '-f', '--format [input]', 'Supported formats (csv, html, pdf, tei, vra, webs)') do |input|
opts.on( '-f', '--format [input]', 'Supported formats (csv, html, ead, pdf, tei, vra, webs)') do |input|
if %w[authority annotations].include?(input)
puts "'authority' and 'annotations' are invalid formats".red
puts "Please select a supported format or rename your custom format"
exit
elsif !%w[csv html pdf tei vra webs].include?(input)
elsif !%w[csv ead html pdf tei vra webs].include?(input)
puts "Caution: Requested custom format #{input}.".red
puts "See FileCustom class for implementation instructions"
end
Expand Down
2 changes: 1 addition & 1 deletion lib/datura/to_es/csv_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ def text

text_all += text_additional
text_all = text_all.compact
Datura::Helpers.normalize_space(text_all.join(" "))
Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]]
end

# override and add by collection as needed
Expand Down
8 changes: 6 additions & 2 deletions lib/datura/to_es/ead_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,9 @@ def data_type

def date(before=true)
datestr = get_text(@xpaths["date"])
return Datura::Helpers.date_standardize(datestr, before)
if datestr
return Datura::Helpers.date_standardize(datestr, before)
end
end

def date_display
Expand Down Expand Up @@ -210,7 +212,9 @@ def text
text = []
@xpaths.keys.each do |xpath|
body = get_text(@xpaths[xpath])
text << body
if body
text << body
end
end
text
# TODO: do we need to preserve tags like <i> in text? if so, turn get_text to true
Expand Down
14 changes: 9 additions & 5 deletions lib/datura/to_es/ead_to_es_items/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ def creator_sort
end

def collection
"whitman-finding_aid_manuscripts"
"#{@options["collection"]}_items"
end

def collection_desc
Expand Down Expand Up @@ -104,9 +104,11 @@ def format
def get_id
# doc = id
doc = get_text(@xpaths["identifier"])
if doc == ""
if !doc
title = get_text(@xpaths["file"])
return "#{@filename}_#{title}"
if title
return "#{@filename}_#{title}"
end
end
return "#{@filename}_#{doc}"
end
Expand Down Expand Up @@ -212,11 +214,13 @@ def text
# means no worrying about handling spacing between words
text = []
body = get_text(@xpaths["text"])
text << body
if body
text << body
end
# TODO: do we need to preserve tags like <i> in text? if so, turn get_text to true
# text << CommonXml.convert_tags_in_string(body)
text += text_additional
return Datura::Helpers.normalize_space(text.join(" "))
return Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]]
end

def text_additional
Expand Down
8 changes: 5 additions & 3 deletions lib/datura/to_es/html_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -152,9 +152,11 @@ def text
# means no worrying about handling spacing between words
text = []
body = get_text(@xpaths["text"])
text << body
if body
text << body
end
text += text_additional
Datura::Helpers.normalize_space(text.join(" "))
Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]]
end

def text_additional
Expand Down Expand Up @@ -278,7 +280,7 @@ def keywords4
get_text(@xpaths["keywords4"])
end

def keywords4
def keywords5
get_text(@xpaths["keywords5"])
end

Expand Down
5 changes: 4 additions & 1 deletion lib/datura/to_es/pdf_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ def text
end
text_all += text_additional
text_all = text_all.compact
Datura::Helpers.normalize_space(text_all.join(" "))[0..999999]
Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]]
end

# override and add by collection as needed
Expand Down Expand Up @@ -319,4 +319,7 @@ def has_relation
def has_source
end

def uri_html
end

end
6 changes: 4 additions & 2 deletions lib/datura/to_es/tei_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -205,11 +205,13 @@ def text
# means no worrying about handling spacing between words
text_all = []
body = get_text(@xpaths["text"], keep_tags: false, delimiter: '')
text_all << body
if body
text_all << body
end
# TODO: do we need to preserve tags like <i> in text? if so, turn get_text to true
# text_all << CommonXml.convert_tags_in_string(body)
text_all += text_additional
Datura::Helpers.normalize_space(text_all.join(" "))
Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]]
end

def text_additional
Expand Down
4 changes: 3 additions & 1 deletion lib/datura/to_es/tei_to_es/tei_to_es_personography.rb
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,9 @@ def category

def creator
creators = get_list(@xpaths["creators"], false, @parent_xml)
creators.map { |c| { "name" => c } }
if creators
creators.map { |c| { "name" => c } }
end
end

def creators
Expand Down
22 changes: 13 additions & 9 deletions lib/datura/to_es/vra_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,9 @@ def data_type

def date(before=true)
datestr = get_list(@xpaths["date"]).first
Datura::Helpers.date_standardize(datestr, before)
if datestr
Datura::Helpers.date_standardize(datestr, before)
end
end

def date_display
Expand Down Expand Up @@ -116,16 +118,16 @@ def person
# subject element
if get_text("@type", xml: p) == "personalName"
{
id: nil,
name: get_text(".", xml: p),
role: nil
"id" => nil,
"name" => get_text(".", xml: p),
"role" => nil
}
# agent element
else
{
id: nil,
name: get_text("name", xml: p),
role: get_text("role", xml: p)
"id" => nil,
"name" => get_text("name", xml: p),
"role" => get_text("role", xml: p)
}
end
end
Expand Down Expand Up @@ -191,11 +193,13 @@ def text
# handling separate fields in array
# means no worrying about handling spacing between words
text_all = []
text_all << get_text(@xpaths["text"])
if get_text(@xpaths["text"])
text_all << get_text(@xpaths["text"])
end
# TODO: do we need to preserve tags like <i> in text? if so, turn get_text to true
# text_all << CommonXml.convert_tags_in_string(body)
text_all += text_additional
Datura::Helpers.normalize_space(text_all.join(" "))
Datura::Helpers.normalize_space(text_all.join(" "))[0..@options["text_limit"]]
end

def text_additional
Expand Down
4 changes: 3 additions & 1 deletion lib/datura/to_es/vra_to_es/vra_to_es_personography.rb
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,9 @@ def category

def creator
creators = get_list(@xpaths["creators"], xml: @parent_xml)
creators.map { |c| { "name" => c } }
if creators
creators.map { |c| { "name" => c } }
end
end

def creator_sort
Expand Down
6 changes: 4 additions & 2 deletions lib/datura/to_es/webs_to_es/fields.rb
Original file line number Diff line number Diff line change
Expand Up @@ -162,9 +162,11 @@ def text
# means no worrying about handling spacing between words
text = []
body = get_text(@xpaths["text"])
text << body
if body
text << body
end
text += text_additional
Datura::Helpers.normalize_space(text.join(" "))
Datura::Helpers.normalize_space(text.join(" "))[0..@options["text_limit"]]
end

def text_additional
Expand Down

0 comments on commit 67eedaa

Please sign in to comment.