Nested aggregations #128

Open
wants to merge 36 commits into
base: dev
561aeeb
upgrade Puma
wkdewey May 19, 2022
deb704a
add specific version of puma to avoid security warnings
wkdewey May 19, 2022
065857a
update .ruby-version
wkdewey May 25, 2022
38b02b7
update to later version of puma
wkdewey May 25, 2022
e763bb1
another round of gem updates
wkdewey May 25, 2022
2769319
add facet for matching nested facet
wkdewey May 24, 2022
3b33d4a
add filter for matching nested facet
wkdewey May 24, 2022
fa5c245
change split character, add missing comma
wkdewey May 24, 2022
21310ef
parse the array for matching nested fields
wkdewey May 26, 2022
4656dc6
change how compound facet name is parsed
wkdewey May 26, 2022
2b3af8e
use facet name as agg name
wkdewey May 26, 2022
100ac90
change query to filter
wkdewey May 26, 2022
3dd6608
fix nested filter aggregation so it doesn't cause 400 error
wkdewey May 27, 2022
fafb3a6
check for deeper nesting of buckets
wkdewey May 31, 2022
2ade65b
Change separator
wkdewey Jun 1, 2022
3c223e4
Fix parsing and query for filter matching
wkdewey Jun 1, 2022
ef22307
rewrite filtered aggregation to be either nested or not
wkdewey Jun 2, 2022
16a490d
filtering on a single item can either be nested or not
wkdewey Jun 2, 2022
b038be0
update config for server
wkdewey Sep 26, 2022
6eaa38b
revise query to match both the facet and the filter
wkdewey Oct 19, 2022
88a8f80
use reverse nested agg for correct item count
wkdewey Oct 20, 2022
dd4bacd
used doc_count from reverse nested if it exists
wkdewey Oct 20, 2022
bd739e0
change key for new elasticsearch version
wkdewey Oct 21, 2022
f0c3124
change order query to avoid deprecated '_term'
wkdewey Oct 24, 2022
7441a96
gitignore master key
wkdewey Oct 26, 2022
255f9db
add basic auth to elasticsearch requests
wkdewey Oct 26, 2022
b8ab6b0
raise number of results per facet
wkdewey Oct 28, 2022
2c68524
use facet_limit instead of facet_num to match Orchid
wkdewey Oct 31, 2022
303598f
revert, will set in Orchid
wkdewey Oct 31, 2022
2eed5b2
change facet_num to facet_limit
wkdewey Nov 1, 2022
a7c4a49
update nested facets documentation
wkdewey Nov 10, 2022
0512771
add links to more detailed documentation
wkdewey Nov 10, 2022
e87e40f
clarify
wkdewey Nov 10, 2022
68f14de
use reverse nested on simple nested aggregations
wkdewey May 25, 2023
b1632cf
fix elasticsearch errors
wkdewey May 26, 2023
b210234
titleize bucket values because ES automatically lowercases them
wkdewey Jul 21, 2023
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -48,3 +48,5 @@ bower.json
.byebug_history

.DS_Store

/config/master.key
2 changes: 1 addition & 1 deletion .ruby-gemset
@@ -1 +1 @@
api
api-v2
2 changes: 1 addition & 1 deletion .ruby-version
@@ -1 +1 @@
ruby-2.6.8
ruby-2.7.6
17 changes: 16 additions & 1 deletion CHANGELOG.md
Expand Up @@ -35,7 +35,8 @@ Markdown Spec](https://github.github.com/gfm/).
- "api_version" added to all response "res" objects

### Changed
- upgraded to Rails 6
- upgraded to Rails 6.1.7 and Ruby 3
- changes reflect new api schemas in Datura, which make heavy use of nested fields
- Added support for aggregating buckets by normalized keyword and returning
the "top_hits" first document result for a non-normalized display
- Changes response format of `facets` key
Expand All @@ -56,6 +57,20 @@ Markdown Spec](https://github.github.com/gfm/).
Not only is the response format itself different, but there may be fewer
facets returned since normalized values which match are combined

### Migration
- In the config files of your Datura repos (`private.yml` or `public.yml`), set the API to `"api_version": "2.0"` to take advantage of the new bucket aggregation functionality (or `"api_version": "1.0"` for legacy repos that have not been updated for the new schema). Please note that a running API can only use one ES index at a time, and each ES index is restricted to one version of the schema. See new schema (2.0) documentation [here](https://github.com/CDRH/datura/docs/schema_v2.md).
- Use Elasticsearch 8.5 or later. See [dev docs instructions](https://github.com/CDRH/cdrh_dev_docs/blob/update_elasticsearch_documentation/publishing/2_basic_requirements.md#downloading-elasticsearch).
- If you are using ES with security enabled, you must configure credentials with Rails in the API repo. See https://guides.rubyonrails.org/v6.1/security.html. With VS Code set up as your editor, run `EDITOR="code --wait" rails credentials:edit` and add:
```
elasticsearch:
  user: username
  password: *****
```
to the secrets file, then close the window to save. Do not commit `config/master.key` (it should be listed in `.gitignore`).
- Orchid apps that connect to the API should use `facet_limit` instead of `facet_num` in options.
- Add nested facets as described above, if desired.


## [v1.0.4](https://github.com/CDRH/api/compare/v1.0....v1.0.4) - Updates & license

### Changed
2 changes: 1 addition & 1 deletion Gemfile
Expand Up @@ -11,7 +11,7 @@ gem 'rails', '~> 6.0.2'
# Use sqlite3 as the database for Active Record
gem 'sqlite3'
# Use Puma as the app server
gem 'puma', '~> 3.7'
gem 'puma', '>= 5.6'
# Build JSON APIs with ease. Read more: https://github.com/rails/jbuilder
# gem 'jbuilder', '~> 2.5'
# Use Redis adapter to run Action Cable in production
7 changes: 4 additions & 3 deletions Gemfile.lock
Expand Up @@ -69,7 +69,7 @@ GEM
globalid (1.0.0)
activesupport (>= 5.0)
http-accept (1.7.0)
http-cookie (1.0.4)
http-cookie (1.0.5)
domain_name (~> 0.5)
i18n (1.10.0)
concurrent-ruby (~> 1.0)
Expand All @@ -96,7 +96,8 @@ GEM
nokogiri (1.13.6)
mini_portile2 (~> 2.8.0)
racc (~> 1.4)
puma (3.12.6)
puma (5.6.4)
nio4r (~> 2.0)
racc (1.6.0)
rack (2.2.3)
rack-test (1.1.0)
Expand Down Expand Up @@ -168,7 +169,7 @@ DEPENDENCIES
bootsnap
byebug
listen (>= 3.0.5, < 3.2)
puma (~> 3.7)
puma (>= 5.6)
rails (~> 6.0.2)
rest-client (>= 2.1.0.rc1, < 2.2)
spring
3 changes: 2 additions & 1 deletion app/controllers/application_controller.rb
Expand Up @@ -3,7 +3,8 @@
class ApplicationController < ActionController::API

def post_search(json, error_method=method(:display_error))
res = RestClient.post("#{ES_URI}/_search", json.to_json, { "content-type" => "json" })
auth_hash = { "Authorization" => "Basic #{Base64::encode64("#{ES_USER}:#{ES_PASSWORD}")}" }
res = RestClient.post("#{ES_URI}/_search", json.to_json, auth_hash.merge({ "content-type" => "json" }))
raise
return JSON.parse(res.body)
rescue => e
Expand Down
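The change above adds an HTTP Basic `Authorization` header to the Elasticsearch request. As a minimal sketch of that header construction, assuming a hypothetical helper named `es_headers` (not part of this PR) and placeholder credentials: `Base64.encode64` wraps its output with a line feed every 60 encoded characters, so this sketch swaps in `Base64.strict_encode64`, which keeps the value on a single line even for long credentials.

```ruby
require "base64"

# Hypothetical helper, not part of this PR: builds the headers for an
# Elasticsearch request that uses HTTP Basic authentication.
def es_headers(user, password)
  # strict_encode64 never inserts line feeds, so the header value is
  # always a single line regardless of credential length
  token = Base64.strict_encode64("#{user}:#{password}")
  {
    "Authorization" => "Basic #{token}",
    "content-type" => "json"
  }
end

# Placeholder credentials for illustration only
headers = es_headers("elastic", "changeme")
puts headers["Authorization"]
```

The returned hash can be passed as the third argument to `RestClient.post`, in the same position as `auth_hash.merge(...)` in the diff.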
137 changes: 120 additions & 17 deletions app/services/search_item_req.rb
Expand Up @@ -51,6 +51,8 @@ def build_request

# add bool to request body
req["query"]["bool"] = bool
# uncomment below line to log ES query for debugging
# puts req.to_json()
return req
end

Expand All @@ -72,19 +74,17 @@ def facets
dir = "desc"
if @params["facet_sort"].present?
sort_type, sort_dir = @params["facet_sort"].split(@@filter_separator)
type = "_term" if sort_type == "term"
type = "term" if sort_type == "term"
dir = sort_dir if sort_dir == "asc"
end

# FACET_SETTINGS["start"]
size = SETTINGS["num"]
size = @params["facet_num"].blank? ? SETTINGS["num"] : @params["facet_num"]
size = @params["facet_limit"].blank? ? SETTINGS["num"] : @params["facet_limit"]

aggs = {}
Array.wrap(@params["facet"]).each do |f|
# histograms use a different ordering terminology than normal aggs
f_type = type == "_term" ? "_key" : "_count"

f_type = (type == "term") ? "_key" : "_count"
if f.include?("date") || f[/_d$/]
# NOTE: if nested fields will ever have dates we will
# need to refactor this to be available to both
Expand All @@ -98,13 +98,76 @@ def facets
aggs[f] = {
"date_histogram" => {
"field" => field,
"interval" => interval,
"calendar_interval" => interval,
"format" => formatted,
"min_doc_count" => 1,
"order" => { f_type => dir },
}
}
# if nested, has extra syntax
#nested facet, matching on another nested facet

elsif f.include?("[")
# will be an array including the original, and an alternate aggregation name


options = JSON.parse(f)
original = options[0]
agg_name = options[1]
facet = original.split("[")[0]
# may or may not be nested
nested = facet.include?(".")
if nested
path = facet.split(".").first
end
condition = original[/(?<=\[).+?(?=\])/]
subject = condition.split("#").first
predicate = condition.split("#").last
aggregation = {
# common to nested and non-nested
"filter" => {
"term" => {
subject => predicate
}
},
"aggs" => {
agg_name => {
"terms" => {
"field" => facet,
"order" => {f_type => dir},
"size" => size
},
"aggs" => {
"field_to_item" => {
"reverse_nested" => {},
"aggs" => {
"top_matches" => {
"top_hits" => {
"_source" => {
"includes" => [ agg_name ]
},
"size" => 1
}
}
}
}
}
}
}
}
#interpolate above hash into nested query
if nested
aggs[agg_name] = {
"nested" => {
"path" => path
},
"aggs" => {
agg_name => aggregation
}
}
else
#otherwise it is the whole query
aggs[agg_name] = aggregation
end
elsif f.include?(".")
path = f.split(".").first
aggs[f] = {
Expand All @@ -115,16 +178,21 @@ def facets
f => {
"terms" => {
"field" => f,
"order" => { type => dir },
"order" => {f_type => dir},
"size" => size
},
"aggs" => {
"top_matches" => {
"top_hits" => {
"_source" => {
"includes" => [ f ]
},
"size" => 1
"field_to_item" => {
"reverse_nested" => {},
"aggs" => {
"top_matches" => {
"top_hits" => {
"_source" => {
"includes" => [ f ]
},
"size" => 1
}
}
}
}
}
Expand All @@ -135,7 +203,7 @@ def facets
aggs[f] = {
"terms" => {
"field" => f,
"order" => { type => dir },
"order" => { f_type => dir },
"size" => size
},
"aggs" => {
Expand All @@ -161,8 +229,43 @@ def filters
# (type 2 will only be used for dates)
filters = fields.map {|f| f.split(@@filter_separator, 3) }
filters.each do |filter|
# NESTED FIELD FILTER
if filter[0].include?(".")
# filter aggregation with nesting
if filter[0].include?("[")
original = filter[0]
facet = original.split("[")[0]
nested = facet.include?(".")
if nested
path = facet.split(".").first
end
condition = original[/(?<=\[).+?(?=\])/]
subject = condition.split("#").first
predicate = condition.split("#").last
term_match = {
# "person.name" => "oliver wendell holmes"
# Remove CR's added by hidden input field values with returns
facet => filter[1].gsub(/\r/, "")
}
term_filter = {
subject => predicate
}
if nested
query = {
"nested" => {
"path" => path,
"query" => {
"bool" => {
"must" => [
{ "match" => term_filter },
{ "match" => term_match }
]
}
}
}
}
end
filter_list << query
#ordinary nested facet
elsif filter[0].include?(".")
path = filter[0].split(".").first
# this is a nested field and must be treated differently
nested = {
Expand Down
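To make the shape of the filtered aggregation easier to follow, here is a minimal standalone sketch of the request body the `facets` branch above builds. The function name `filtered_facet_agg`, its keyword arguments, and the sample facet are assumptions for illustration; the `top_hits` sub-aggregation is left out for brevity. One deliberate difference from the diff: this sketch attaches `reverse_nested` only when the facet is nested, on the assumption that Elasticsearch rejects `reverse_nested` outside a `nested` aggregation.

```ruby
require "json"

# Hypothetical standalone version of the compound-facet aggregation.
# `original` looks like "person.name[person.role#judge]".
def filtered_facet_agg(original, agg_name, order: { "_count" => "desc" }, size: 20)
  facet = original.split("[")[0]
  condition = original[/(?<=\[).+?(?=\])/]
  subject, predicate = condition.split("#", 2)
  nested = facet.include?(".")

  terms = { "terms" => { "field" => facet, "order" => order, "size" => size } }
  # reverse_nested steps back to the parent document, so its doc_count
  # counts items rather than nested objects
  terms["aggs"] = { "field_to_item" => { "reverse_nested" => {} } } if nested

  aggregation = {
    "filter" => { "term" => { subject => predicate } },
    "aggs" => { agg_name => terms }
  }
  # a non-nested facet uses the filter aggregation as the whole query
  return { agg_name => aggregation } unless nested

  {
    agg_name => {
      "nested" => { "path" => facet.split(".").first },
      "aggs" => { agg_name => aggregation }
    }
  }
end

puts JSON.pretty_generate(filtered_facet_agg("person.name[person.role#judge]", "judges"))
```

In the nested case the buckets end up two levels below the aggregation name, which is why the response side needs to look at `info[field][field]["buckets"]`.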
30 changes: 21 additions & 9 deletions app/services/search_item_res.rb
Expand Up @@ -18,7 +18,6 @@ def build_response
# strip out only the fields for the item response
items = combine_highlights
facets = reformat_facets

{
"code" => 200,
"count" => count,
Expand Down Expand Up @@ -50,9 +49,9 @@ def find_source_from_top_hits(top_hits, field, key)
if hit.class == Array
# I don't love this, because we will have to match exactly the logic
# that got us the key to get this to work
match_index = hit
.map { |s| remove_nonword_chars(s) }
.index(remove_nonword_chars(key))
match_index = hit
.map { |s| remove_nonword_chars(s) }
.index(remove_nonword_chars(key))
# if nothing matches the original key, return the entire source hit
# should return a string, regardless
return match_index ? hit[match_index] : hit.join(" ")
Expand All @@ -65,8 +64,8 @@ def find_source_from_top_hits(top_hits, field, key)
def format_bucket_value(facets, field, bucket)
# dates return in wonktastic ways, so grab key_as_string instead of gibberish number
# but otherwise just grab the key if key_as_string unavailable
key = bucket.key?("key_as_string") ? bucket["key_as_string"] : bucket["key"]
val = bucket["doc_count"]
key = bucket.key?("key_as_string") ? bucket["key_as_string"].titleize : bucket["key"].titleize
val = bucket.key?("field_to_item") ? bucket["field_to_item"]["doc_count"] : bucket["doc_count"]
source = key
# top_matches is a top_hits aggregation which returns a list of terms
# which were used for the facet.
Expand All @@ -79,7 +78,7 @@ def format_bucket_value(facets, field, bucket)
end
facets[field][key] = {
"num" => val,
"source" => source
"source" => source.to_s
}
end

Expand All @@ -89,8 +88,7 @@ def reformat_facets
facets = {}
raw_facets.each do |field, info|
facets[field] = {}
# nested fields do not have buckets at this level of response structure
buckets = info.key?("buckets") ? info["buckets"] : info.dig(field, "buckets")
buckets = get_buckets(info, field)
if buckets
buckets.each { |b| format_bucket_value(facets, field, b) }
else
Expand All @@ -110,4 +108,18 @@ def remove_nonword_chars(term)
transliterated.gsub(/<\/?(?:em|strong|u)>|\W/, "").downcase
end

def get_buckets(info, field)
buckets = nil
# ordinary facet
if info.key?("buckets")
buckets = info["buckets"]
# nested facet
elsif info.dig(field, "buckets")
buckets = info.dig(field, "buckets")
# filtered facet
else
buckets = info.dig(field, field, "buckets")
end
buckets
end
end
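As a rough illustration of the three response shapes `get_buckets` distinguishes, and of preferring the `reverse_nested` count when it is present, the sketch below runs the same lookups against trimmed-down hashes. The sample responses, field names, and the `item_count` helper are made up for this example; real Elasticsearch responses carry more keys.

```ruby
# Condensed equivalent of get_buckets from the diff above
def get_buckets(info, field)
  info["buckets"] ||                      # ordinary facet
    info.dig(field, "buckets") ||         # nested facet, or non-nested filtered facet
    info.dig(field, field, "buckets")     # nested filtered facet
end

# Hypothetical helper: prefer the item-level count from reverse_nested
def item_count(bucket)
  bucket.key?("field_to_item") ? bucket["field_to_item"]["doc_count"] : bucket["doc_count"]
end

# Made-up sample responses, one per shape
ordinary = { "buckets" => [{ "key" => "letter", "doc_count" => 4 }] }
nested = {
  "doc_count" => 10,
  "person.name" => {
    "buckets" => [{ "key" => "holmes", "doc_count" => 7, "field_to_item" => { "doc_count" => 3 } }]
  }
}
filtered = {
  "doc_count" => 10,
  "judges" => {
    "doc_count" => 2,
    "judges" => { "buckets" => [{ "key" => "holmes", "doc_count" => 2, "field_to_item" => { "doc_count" => 1 } }] }
  }
}

[[ordinary, "category"], [nested, "person.name"], [filtered, "judges"]].each do |info, field|
  get_buckets(info, field).each { |b| puts "#{field}: #{b['key']} (#{item_count(b)})" }
end
```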
3 changes: 2 additions & 1 deletion app/services/search_service.rb
Expand Up @@ -11,7 +11,8 @@ def initialize(url, params={}, user_req)
end

def post(url_ending, json)
res = RestClient.post("#{@url}/#{url_ending}", json.to_json, { "content-type" => "json" } )
auth_hash = { "Authorization" => "Basic #{Base64::encode64("#{Rails.application.credentials.elasticsearch[:user]}:#{Rails.application.credentials.elasticsearch[:password]}")}" }
res = RestClient.post("#{@url}/#{url_ending}", json.to_json, auth_hash.merge({ "content-type" => "json" } ))
JSON.parse(res.body)
rescue => e
e
1 change: 1 addition & 0 deletions config/environments/development.rb
Expand Up @@ -61,4 +61,5 @@
# CDRH CONFIGURATION

config.hosts << "cdrhdev1.unl.edu"
config.hosts << "whitman-dev.unl.edu"
end
6 changes: 6 additions & 0 deletions docs/README.md
Expand Up @@ -50,6 +50,12 @@ __Nested fields__
facet[]=creator.name
facet[]=creator.name&facet[]=creator.role
```
With the new API schema, you can also restrict a nested facet by matching on another nested field:
`facet[]=nested_field.keyword_field1[nested_field.keyword_field2#value]`
```
facet[]=person.name[person.role#judge]
```
The above selects the names of all persons whose role is "judge".
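To show how the compound syntax breaks down, here is a small sketch that mirrors the parsing used in `search_item_req.rb`. The helper name `parse_compound_facet` and its return shape are assumptions for illustration, not part of the API; it also splits on the first `#` only, whereas the PR takes the first and last segments, which gives the same result when the condition contains a single `#`.

```ruby
# Hypothetical helper: splits "field[subject#predicate]" into its parts
def parse_compound_facet(original)
  facet = original.split("[")[0]
  # text between the first "[" and the next "]"
  condition = original[/(?<=\[).+?(?=\])/]
  subject, predicate = condition.split("#", 2)
  {
    facet: facet,
    # a dot in the facet name marks it as nested; the path is the prefix
    path: facet.include?(".") ? facet.split(".").first : nil,
    subject: subject,
    predicate: predicate
  }
end

p parse_compound_facet("person.name[person.role#judge]")
```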

__Date ranges__ (currently supports days or years)
