Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic/cv2 5050 text vectorization via presto #2030

Merged
merged 53 commits into from
Oct 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
62b3cfd
Cv2 5082 article indexing to presto (#1994)
DGaffney Aug 27, 2024
14ce78d
Cv2 5080 request model to presto (#2015)
DGaffney Sep 4, 2024
543a117
Cv2 5086 smooch nlu to presto 2 (#2019)
DGaffney Sep 11, 2024
c361771
Cv2 5085 move get items with similar text to presto (#2023)
DGaffney Sep 11, 2024
95e0690
Cv2 5084 Update Reindexing to use async presto endpoint (#2031)
DGaffney Sep 11, 2024
5f07379
CV2-5081 switch text to presto-based querying (#2034)
DGaffney Sep 16, 2024
e0ca299
CV2-5050 add explicit callback for text
DGaffney Sep 16, 2024
ff2c526
update stub
DGaffney Sep 17, 2024
aec9c30
Merge remote-tracking branch 'origin' into epic/cv2-5050-text-vectori…
DGaffney Sep 17, 2024
de21f11
more tweaking during testing
DGaffney Sep 17, 2024
8df7969
Merge branch 'develop' into epic/cv2-5050-text-vectorization-via-presto
caiosba Sep 23, 2024
8ab446c
Merge branch 'develop' into epic/cv2-5050-text-vectorization-via-presto
caiosba Sep 25, 2024
ff64e67
Bot events for test endpoint
caiosba Sep 25, 2024
9fd5dd8
move async query to sync
DGaffney Oct 9, 2024
82bf2ad
CV2-5324: create a method for create relationship and use it everywhe…
melsawy Sep 27, 2024
0dd02c0
A couple of improvements for shared feeds (#2056)
caiosba Sep 28, 2024
55b52ce
CV2-5371: fix sentry issue (#2058)
melsawy Sep 30, 2024
a704dda
Update setuptools module and pin to known good version. (#2059)
sonoransun Sep 30, 2024
7ac655d
5120 – Dont create duplicate tags and clean up `#` (#2054)
vasconsaurus Sep 30, 2024
a244a38
CV2-5005: Sentry issue related to ES (#2057)
melsawy Oct 1, 2024
c82bef2
CV2-5371: fix sentry error (#2062)
melsawy Oct 1, 2024
9bf9d4f
Fix setuptools version pin for check-api builds (#2063)
sonoransun Oct 1, 2024
d74e7cc
CV2-5391: use save instead of save! to avoid raising error (#2061)
melsawy Oct 1, 2024
d98cff3
Reset item status to default one when claim/fact-check is detached. (…
caiosba Oct 2, 2024
0352e2c
CV2-5418: fix sentry issue (#2066)
melsawy Oct 3, 2024
35331bb
Add ukrainian translation (#2065)
amoedoamorim Oct 3, 2024
c7a3ed3
CV2-5420: include cached fields that require ES or PG updates (#2067)
melsawy Oct 3, 2024
d0b6163
:create_project_media_tags should be able to ignore tag already added…
vasconsaurus Oct 7, 2024
4203f1f
CV2-5392: export cluster description (#2070)
melsawy Oct 7, 2024
e2154d9
Setting initial value for `last_request_date` for feed clusters. (#2072)
caiosba Oct 8, 2024
c5adeb3
set the annotator to be the “Smooch Bot” (#2069)
danielevalverde Oct 8, 2024
75fb1f8
CV2-5419: rescue ActiveRecord::RecordNotUnique for relationship save …
melsawy Oct 9, 2024
bd734ae
CV2-5451: set confirmed before creation based on relationship type (#…
melsawy Oct 9, 2024
3f1226c
Make sure that an item can't be related to itself. (#2075)
caiosba Oct 10, 2024
1d02098
Report bad relationship structure only if relationship is not nil. (#…
caiosba Oct 10, 2024
473a224
Revert rejecting suggestion if relationship creation fails. (#2076)
caiosba Oct 10, 2024
87693cb
Fixing Sentry error
caiosba Oct 11, 2024
ebcb0e4
CV2-5434: skip process the TeamTask background job if there is a more…
melsawy Oct 11, 2024
a5cbc59
Setting retry interval for GenericWorker. (#2080)
caiosba Oct 11, 2024
fd22585
Adding a log line for outgoing Smooch requests. (#2081)
caiosba Oct 11, 2024
a94c7e8
Do not send report if search results were already received. (#2082)
caiosba Oct 12, 2024
d9e8168
CV2-5348: refactor ES cached field calling and remove retry_on_confli…
melsawy Oct 13, 2024
6e11a99
CV2-5190: Create a Link and Claim from tipline message that contain b…
melsawy Oct 14, 2024
64ef072
Request/5424 add tagalog translations (#2085)
amoedoamorim Oct 14, 2024
a062296
Request/5424 add tagalog translations (#2086)
amoedoamorim Oct 14, 2024
f5a5326
update fixtures on broken tests
DGaffney Oct 15, 2024
67a0fea
Merge branch 'develop' into epic/cv2-5050-text-vectorization-via-presto
DGaffney Oct 15, 2024
bbfb756
add webmock
DGaffney Oct 15, 2024
f813418
review and resolve missing line errors
DGaffney Oct 16, 2024
570459e
Merge remote-tracking branch 'origin' into epic/cv2-5050-text-vectori…
DGaffney Oct 22, 2024
f2a78bc
resolve changes from Sawy
DGaffney Oct 22, 2024
2a45d08
gut source and id from any tests and responses
DGaffney Oct 24, 2024
b8a59e8
more tweaking to resolve broken tests
DGaffney Oct 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 6 additions & 12 deletions app/lib/smooch_nlu.rb
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,6 @@ def enabled?
end

def update_keywords(language, keywords, keyword, operation, doc_id, context)
alegre_operation = nil
alegre_params = nil
common_alegre_params = {
doc_id: doc_id,
context: {
Expand All @@ -44,15 +42,11 @@ def update_keywords(language, keywords, keyword, operation, doc_id, context)
}
if operation == 'add' && !keywords.include?(keyword)
keywords << keyword
alegre_operation = 'post'
alegre_params = common_alegre_params.merge({ text: keyword, models: ALEGRE_MODELS_AND_THRESHOLDS.keys })
Bot::Alegre.index_sync_with_params(common_alegre_params.merge({ text: keyword, models: ALEGRE_MODELS_AND_THRESHOLDS.keys }), "text")
elsif operation == 'remove'
keywords -= [keyword]
alegre_operation = 'delete'
alegre_params = common_alegre_params.merge({ quiet: true })
Bot::Alegre.request_delete_from_raw(common_alegre_params.merge({ quiet: true }), "text")
end
# FIXME: Add error handling and better logging
Bot::Alegre.request(alegre_operation, '/text/similarity/', alegre_params) if alegre_operation && alegre_params
keywords
end

Expand Down Expand Up @@ -91,19 +85,19 @@ def self.alegre_matches_from_message(message, language, context, alegre_result_k
language: language,
}.merge(context)
}
response = Bot::Alegre.request('post', '/text/similarity/search/', params)
response = Bot::Alegre.query_sync_with_params(params, "text")

# One approach would be to take the option that has the most matches
# Unfortunately this approach is influenced by the number of keywords per option
# So, we are not using this approach right now
# Get the `alegre_result_key` of all results returned
# option_counts = response['result'].to_a.map{|o| o.dig('_source', 'context', alegre_result_key)}
# option_counts = response['result'].to_a.map{|o| o.dig('context', alegre_result_key)}
# Count how many of each alegre_result_key we have and sort (high to low)
# ranked_options = option_counts.group_by(&:itself).transform_values(&:count).sort_by{|_k,v| v}.reverse()

# Second approach is to sort the results from best to worst
sorted_options = response['result'].to_a.sort_by{ |result| result['_score'] }.reverse
ranked_options = sorted_options.map{ |o| { 'key' => o.dig('_source', 'context', alegre_result_key), 'score' => o['_score'] } }
sorted_options = response['result'].to_a.sort_by{ |result| result['score'] }.reverse
ranked_options = sorted_options.map{ |o| { 'key' => o.dig('context', alegre_result_key), 'score' => o['score'] } }
matches = ranked_options

# In all cases log for analysis
Expand Down
14 changes: 4 additions & 10 deletions app/models/bot/alegre.rb
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ def similar_items_ids_and_scores(team_ids, thresholds = {})
ALL_TEXT_SIMILARITY_FIELDS.each do |field|
text = self.send(field)
next if text.blank?
threads << Thread.new { ids_and_scores.merge!(Bot::Alegre.get_similar_texts(team_ids, text, Bot::Alegre::ALL_TEXT_SIMILARITY_FIELDS, thresholds[:text]).to_h) }
threads << Thread.new { ids_and_scores.merge!(Bot::Alegre.get_items_from_similar_text(team_ids, text, Bot::Alegre::ALL_TEXT_SIMILARITY_FIELDS, thresholds[:text]).to_h) }
end
threads.map(&:join)
end
Expand Down Expand Up @@ -155,10 +155,8 @@ def self.run(body)
if ['audio', 'image', 'video'].include?(self.get_pm_type(pm))
self.relate_project_media_async(pm)
else
Bot::Alegre.send_to_media_similarity_index(pm)
Bot::Alegre.send_field_to_similarity_index(pm, 'original_title')
Bot::Alegre.send_field_to_similarity_index(pm, 'original_description')
Bot::Alegre.relate_project_media_to_similar_items(pm)
self.relate_project_media_async(pm, 'original_title')
self.relate_project_media_async(pm, 'original_description')
end
self.get_extracted_text(pm)
self.get_flags(pm)
Expand Down Expand Up @@ -206,7 +204,7 @@ def self.get_items_from_similar_text(team_id, text, fields = nil, threshold = ni
threshold ||= self.get_threshold_for_query('text', nil, true)
models ||= [self.matching_model_to_use(team_ids)].flatten
Hash[self.get_similar_items_from_api(
'/text/similarity/search/',
'text',
self.similar_texts_from_api_conditions(text, models, fuzzy, team_ids, fields, threshold),
threshold
).collect{|k,v| [k, v.merge(model: v[:model]||Bot::Alegre.default_matching_model)]}]
Expand Down Expand Up @@ -716,8 +714,4 @@ def self.is_text_too_short?(pm, length_threshold)
is_short
end

class <<self
alias_method :get_similar_texts, :get_items_from_similar_text
end

end
16 changes: 8 additions & 8 deletions app/models/concerns/alegre_similarity.rb
Original file line number Diff line number Diff line change
Expand Up @@ -125,18 +125,18 @@ def send_to_text_similarity_index_package(pm, field, text, doc_id)
doc_id: doc_id,
text: text,
models: models,
context: self.get_context(pm, field)
context: self.get_context(pm, field),
requires_callback: true
}
params[:language] = language if !language.nil?
params
end

def send_to_text_similarity_index(pm, field, text, doc_id)
if !text.blank? && Bot::Alegre::BAD_TITLE_REGEX !~ text
self.request(
'post',
'/text/similarity/',
self.send_to_text_similarity_index_package(pm, field, text, doc_id)
self.query_sync_with_params(
self.send_to_text_similarity_index_package(pm, field, text, doc_id),
"text"
)
end
end
Expand Down Expand Up @@ -207,10 +207,10 @@ def get_merged_similar_items(pm, threshold, fields, value, team_ids = [pm&.team_
es_matches
end

def get_similar_items_from_api(path, conditions, _threshold = {})
Rails.logger.error("[Alegre Bot] Sending request to alegre : #{path} , #{conditions.to_json}")
def get_similar_items_from_api(type, conditions, _threshold = {})
Rails.logger.error("[Alegre Bot] Sending request to alegre : #{type} , #{conditions.to_json}")
response = {}
result = self.request('post', path, conditions)&.dig('result')
result = self.query_sync_with_params(conditions, type)&.dig('result')
project_medias = result.collect{ |r| self.extract_project_medias_from_context(r) } if !result.nil? && result.is_a?(Array)
project_medias.each do |request_response|
request_response.each do |pmid, score_with_context|
Expand Down
101 changes: 72 additions & 29 deletions app/models/concerns/alegre_v2.rb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
require 'active_support/concern'
class AlegreTimeoutError < StandardError; end
class TemporaryProjectMedia
attr_accessor :team_id, :id, :url, :type
attr_accessor :team_id, :id, :url, :text, :type, :field
def media
media_type_map = {
"claim" => "Claim",
Expand Down Expand Up @@ -36,6 +36,10 @@ def is_video?
def is_audio?
self.type == "audio"
end

def is_uploaded_media?
self.is_image? || self.is_audio? || self.is_video?
end
end

module AlegreV2
Expand All @@ -55,11 +59,18 @@ def sync_path_for_type(type)
end

def async_path(project_media)
"/similarity/async/#{get_type(project_media)}"
self.async_path_for_type(get_type(project_media))
end

def async_path_for_type(type)
"/similarity/async/#{type}"
end

def delete_path(project_media)
type = get_type(project_media)
self.delete_path_for_type(get_type(project_media))
end

def delete_path_for_type(type)
"/#{type}/similarity/"
end

Expand Down Expand Up @@ -122,6 +133,10 @@ def request(method, path, params, retries=3)
end
end

def request_delete_from_raw(params, type)
request("delete", delete_path_for_type(type), params)
end

def request_delete(data, project_media)
request("delete", delete_path(project_media), data)
end
Expand All @@ -148,28 +163,32 @@ def get_type(project_media)
type
end

def content_hash_for_value(value)
value.nil? ? nil : Digest::MD5.hexdigest(value)
end

def content_hash(project_media, field)
if Bot::Alegre::ALL_TEXT_SIMILARITY_FIELDS.include?(field)
Digest::MD5.hexdigest(project_media.send(field))
content_hash_for_value(project_media.send(field))
elsif project_media.is_link?
return content_hash_for_value(project_media.media.url)
elsif project_media.is_a?(TemporaryProjectMedia)
return Rails.cache.read("url_sha:#{project_media.url}")
elsif project_media.is_uploaded_media?
return project_media.media.file.filename.split(".").first
else
if project_media.is_link?
return Digest::MD5.hexdigest(project_media.media.url)
elsif project_media.is_a?(TemporaryProjectMedia)
return Rails.cache.read("url_sha:#{project_media.url}")
elsif !project_media.is_text?
return project_media.media.file.filename.split(".").first
else
return Digest::MD5.hexdigest(project_media.send(field).to_s)
end
return content_hash_for_value(project_media.send(field).to_s)
end
end

def generic_package(project_media, field)
{
content_hash: content_hash(project_media, field),
content_hash_value = content_hash(project_media, field)
params = {
doc_id: item_doc_id(project_media, field),
context: get_context(project_media, field)
}
params[:content_hash] = content_hash_value if !content_hash_value.nil?
params
end

def delete_package(project_media, field, params={}, quiet=false)
Expand Down Expand Up @@ -267,6 +286,22 @@ def store_package_text(project_media, field, params)
generic_package_text(project_media, field, params)
end

def index_async_with_params(params, type, suppress_search_response=true)
request("post", async_path_for_type(type), params.merge(suppress_search_response: suppress_search_response))
end

def index_sync_with_params(params, type)
query_sync_with_params(params, type)
end

def query_sync_with_params(params, type)
request("post", sync_path_for_type(type), params)
end

def query_async_with_params(params, type)
request("post", async_path_for_type(type), params)
end

def get_sync(project_media, field=nil, params={})
request_sync(
store_package(project_media, field, params),
Expand All @@ -286,6 +321,10 @@ def delete(project_media, field=nil, params={})
delete_package(project_media, field, params),
project_media
)
rescue StandardError => e
error = Bot::Alegre::Error.new(e)
Rails.logger.error("[Alegre Bot] Exception on Delete for ProjectMedia ##{project_media.id}: #{error.class} - #{error.message}")
CheckSentry.notify(error, bot: "alegre", project_media: project_media, params: params, field: field)
end

def get_per_model_threshold(project_media, threshold)
Expand All @@ -298,7 +337,7 @@ def get_per_model_threshold(project_media, threshold)
end

def isolate_relevant_context(project_media, result)
result["context"].select{|x| ([x["team_id"]].flatten & [project_media.team_id].flatten).count > 0 && !x["temporary_media"]}.first
(result["contexts"]||result["context"]).select{|x| ([x["team_id"]].flatten & [project_media.team_id].flatten).count > 0 && !x["temporary_media"]}.first
end

def get_target_field(project_media, field)
Expand Down Expand Up @@ -485,25 +524,27 @@ def wait_for_results(project_media, args)
end

def get_items_with_similar_media_v2(args={})
text = args[:text]
field = args[:field]
media_url = args[:media_url]
project_media = args[:project_media]
threshold = args[:threshold]
team_ids = args[:team_ids]
type = args[:type]
if ['audio', 'image', 'video'].include?(type)
if project_media.nil?
project_media = TemporaryProjectMedia.new
project_media.url = media_url
project_media.id = Digest::MD5.hexdigest(project_media.url).to_i(16)
project_media.team_id = team_ids
project_media.type = type
end
get_similar_items_v2_async(project_media, nil, threshold)
wait_for_results(project_media, args)
response = get_similar_items_v2_callback(project_media, nil)
delete(project_media, nil) if project_media.is_a?(TemporaryProjectMedia)
return response
if project_media.nil?
project_media = TemporaryProjectMedia.new
project_media.text = text
project_media.field = field
project_media.url = media_url
project_media.id = Digest::MD5.hexdigest(project_media.url).to_i(16)
project_media.team_id = team_ids
project_media.type = type
end
get_similar_items_v2_async(project_media, nil, threshold)
wait_for_results(project_media, args)
response = get_similar_items_v2_callback(project_media, nil)
delete(project_media, nil) if project_media.is_a?(TemporaryProjectMedia)
return response
end

def process_alegre_callback(params)
Expand All @@ -512,9 +553,11 @@ def process_alegre_callback(params)
should_relate = true
if project_media.nil?
project_media = TemporaryProjectMedia.new
project_media.text = params.dig('data', 'item', 'raw', 'text')
project_media.url = params.dig('data', 'item', 'raw', 'url')
project_media.id = params.dig('data', 'item', 'raw', 'context', 'project_media_id')
project_media.team_id = params.dig('data', 'item', 'raw', 'context', 'team_id')
project_media.field = params.dig('data', 'item', 'raw', 'context', 'field')
project_media.type = params['model_type']
should_relate = false
end
Expand Down
10 changes: 5 additions & 5 deletions app/models/concerns/project_media_getters.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,6 @@ def is_link?
self.media.type == "Link"
end

def is_uploaded_image?
self.media.type == "UploadedImage"
end

def is_blank?
self.media.type == "Blank"
end
Expand All @@ -28,7 +24,11 @@ def is_audio?
end

def is_image?
self.is_uploaded_image?
self.media.type == "UploadedImage"
end

def is_uploaded_media?
self.is_image? || self.is_audio? || self.is_video?
end

def is_text?
Expand Down
16 changes: 9 additions & 7 deletions app/models/explainer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -73,24 +73,26 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)

# Index title
params = {
content_hash: Bot::Alegre.content_hash_for_value(explainer.title),
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'title'].join(':')),
context: base_context.merge({ field: 'title' }),
text: explainer.title,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
context: base_context.merge({ field: 'title' })
}
Bot::Alegre.request('post', '/text/similarity/', params)
Bot::Alegre.index_async_with_params(params, "text")

# Index paragraphs
count = 0
explainer.description.to_s.gsub(/\r\n?/, "\n").split(/\n+/).reject{ |paragraph| paragraph.strip.blank? }.each do |paragraph|
count += 1
params = {
content_hash: Bot::Alegre.content_hash_for_value(paragraph.strip),
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'paragraph', count].join(':')),
context: base_context.merge({ paragraph: count }),
text: paragraph.strip,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
context: base_context.merge({ paragraph: count })
}
Bot::Alegre.request('post', '/text/similarity/', params)
Bot::Alegre.index_async_with_params(params, "text")
end

# Remove paragraphs that don't exist anymore (we delete after updating in order to avoid race conditions)
Expand All @@ -101,7 +103,7 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
quiet: true,
context: base_context.merge({ paragraph: count })
}
Bot::Alegre.request('delete', '/text/similarity/', params)
Bot::Alegre.request_delete_from_raw(params, "text")
end
end

Expand All @@ -116,9 +118,9 @@ def self.search_by_similarity(text, language, team_id)
language: language
}
}
response = Bot::Alegre.request('post', '/text/similarity/search/', params)
response = Bot::Alegre.query_sync_with_params(params, "text")
results = response['result'].to_a.sort_by{ |result| result['_score'] }
explainer_ids = results.collect{ |result| result.dig('_source', 'context', 'explainer_id').to_i }.uniq.first(3)
explainer_ids = results.collect{ |result| result.dig('context', 'explainer_id').to_i }.uniq.first(3)
explainer_ids.empty? ? Explainer.none : Explainer.where(team_id: team_id, id: explainer_ids)
end

Expand Down
Loading
Loading