Implements continue for queries in api.rb + Adds deep_merge gem to ge… #33
Conversation
Implements continue for queries in api.rb + Adds deep_merge gem to gemfiles + Updates test expectations
Gemfile
Outdated
@@ -14,3 +14,5 @@ gem 'rubocop', '~> 1.21'
 gem 'mediawiki_api'

 gem 'json'
+
+gem 'deep_merge'
Hmm... In the Dashboard codebase, we use the deep_merge functions (Hash#deep_merge! specifically) that are built into Ruby on Rails via ActiveSupport: https://apidock.com/rails/ActiveSupport/DeepMergeable/deep_merge%21
I think I would prefer adding that as a dependency rather than the separate deep_merge gem, so that this change won't expand the Dashboard's dependencies.
To do this, I added require 'active_support/core_ext/hash/deep_merge'.
But when I ran it with the existing data.deep_merge!(response.data), it processed only 17 revisions instead of 25, which is the same behaviour as just using .merge. This time it is not because of missing revisions; rather, the ActiveSupport deep_merge function doesn't handle the response.data properly for some reason. It probably overwrites the affected revisions because of duplicate keys or something.
In the documentation link you sent, it did say that 'A block can be provided to merge non-DeepMergeable values', so using the response JSON structure for reference, I added the block below, and it works and processes all 25 revisions.
data.deep_merge!(response.data) do |key, old_val, new_val|
  if key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array)
    old_val + new_val
  else
    new_val
  end
end
Comparison: (the ones denoted in red are the ones not processed without the block)
So should I use this block?
So should I use this block?
Maybe. I'd like to get an explanation of why exactly that version of deep_merge! behaves differently first.
From my understanding + checking the docs for the 2 versions: active_support deep_merge (https://www.rubydoc.info/github/rails/rails/ActiveSupport/DeepMergeable#deep_merge!-instance_method) vs the deep_merge! gem (https://www.rubydoc.info/gems/deep_merge/DeepMerge#deep_merge!-class_method):
- For the gem, the docs say:
"Deep Merge core documentation. deep_merge! method permits merging of arbitrary child elements. The two top level elements must be hashes. These hashes can contain unlimited (to stack limit) levels of child elements. These child elements to not have to be of the same types. Where child elements are of the same type, deep_merge will attempt to merge them together."
So as long as the top level elements are hashes, it will recursively merge any nested child elements within the hashes, regardless of how deeply nested they are. So it just merges all the hashes in the revisions array.
- The ActiveSupport deep_merge! method is optimized for merging hashes and not necessarily for handling complex nested arrays in a way that preserves all elements. If the data structure includes arrays (like revisions), the merging logic might inadvertently drop or skip some array elements.
It could be that ActiveSupport's deep_merge! is treating the affected revisions as conflicts that it cannot reconcile.
Or the deep_merge! method is encountering the revisions array at a level where it expects to merge hashes, so it chooses to skip those revisions instead of merging them.
But as to why those specific revisions are skipped/overwritten using the ActiveSupport method, unfortunately I cannot tell. I saved the incomplete/complete responses to analyze them, but the hashes for the missing revision ids are just not present in the revisions array; there isn't a broken structure or anything.
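To make the difference concrete, here is a minimal illustration (not from the PR) using a toy hash shaped like the API response; it assumes ActiveSupport is available:

require 'active_support/core_ext/hash/deep_merge'

old_data = { 'pages' => { '1' => { 'revisions' => [{ 'revid' => 1 }] } } }
new_data = { 'pages' => { '1' => { 'revisions' => [{ 'revid' => 2 }] } } }

# ActiveSupport treats the arrays as plain (non-mergeable) values, so the
# new array simply replaces the old one and the first revision is lost:
old_data.deep_merge(new_data)
# => { 'pages' => { '1' => { 'revisions' => [{ 'revid' => 2 }] } } }

# With a block, conflicting non-hash values can be combined instead,
# e.g. by concatenating the revisions arrays:
old_data.deep_merge(new_data) { |key, old_val, new_val| key == 'revisions' ? old_val + new_val : new_val }
# => { 'pages' => { '1' => { 'revisions' => [{ 'revid' => 1 }, { 'revid' => 2 }] } } }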
example response structure:
"95116959": {
"pageid": 95116959,
"ns": 0,
"title": "Q96417649",
"revisions": [
{
"revid": 2266123060,
"parentid": 2265005040,
"slots": {
"main": {
"contentmodel": "wikibase-item",
"contentformat": "application/json",
"*": "{//insert content here}"
}
},
"comment": "/* wbsetdescription-set:1|jv */ gim vidéo, #quickstatements; #temporary_batch_1730083092317"
},
{
"revid": 2266123123,
"parentid": 2266123060,
"slots": {
"main": {
"contentmodel": "wikibase-item",
"contentformat": "application/json",
"*": "{//insert content here}"
}
},
"comment": "/* wbsetdescription-add:1|su */ kaulinan video, #quickstatements; #temporary_batch_1730083092317"
}, ....
lib/wikidata/diff/api.rb
Outdated
tries -= 1
sleep 1 if too_many_requests?(e)
retry unless tries.zero?
log_error(e)
Where does this method come from?
That function comes from Dashboard code, and shouldn't be used in this gem.
I think raising the error and letting the Dashboard code handle it is fine here.
@ragesoss Please, is simplifying it to this fine:
def self.mediawiki_request(client, action, query)
  client.send(action, query)
rescue StandardError => e
  raise e
end
In that case, you don't even need the rescue. However, I think the retry behavior is worth keeping. The idea would just be to raise the error instead of logging it, if the retries are exhausted.
Oh okay 😅. Then is this fine?:
def self.mediawiki_request(client, action, query)
  tries ||= 3
  client.send(action, query)
rescue StandardError => e
  tries -= 1
  sleep 1 if too_many_requests?(e)
  retry unless tries.zero?
  raise(e)
end
Okay. Since the structure of the expected response is known, I think it might be better to just write our own code to merge the results rather than rely on either deep_merge implementation.
This works:
It uses the inbuilt Ruby .merge method, however.
That's quite verbose. Is there any simplification you can do? Otherwise, perhaps you can extract the part that is specifically for merging?
Note: I've tested them both and they process all revisions. I simplified the merge logic here:
Alternatively, simplifying the whole method:
Alternatively, the merge logic could be moved to a helper method, something like:
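(The actual snippet was attached as a screenshot; the following is a hypothetical reconstruction with assumed names, just to show the shape of such a helper: it keeps the "append revisions instead of overwriting" logic in one place.)

def self.merge_page_data(data, new_data)
  new_data.each do |page_id, new_page|
    existing_page = data[page_id]
    if existing_page
      # Same page returned across batches: concatenate the revisions arrays.
      existing_page['revisions'] = (existing_page['revisions'] || []) + (new_page['revisions'] || [])
    else
      data[page_id] = new_page
    end
  end
  data
end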
Extracting the merge logic is definitely a good idea, in my opinion.
This seems reasonable.
…ge + Removes deep_merge dependencies + Changes request method to raise error instead of logging it
Done! Since we've found a better fix, should I close the PRs I opened in the Dashboard and here for adding the guard statement?
Yes please.
Note the failing test here.
I've determined the issue.
The revisions here have 0 as their parent ids. When .analyze is called, it calls the
So in the case of the spec, the 2 revision ids have parent ids that are 0, so according to the logic above, those ids are skipped and not added to the array, resulting in an empty array, which causes the error: the code I wrote doesn't handle cases where the ids (although parent ids) are nil, which makes the page nil too. So simply adding a guard statement in the merge method works and does not affect the normal processing:
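(The actual diff was also posted as a screenshot; here is a hypothetical sketch of the guard, assuming a merge helper shaped like the one above.)

def self.merge_page_data(data, new_data)
  # Guard: nothing to merge when the response has no page data,
  # e.g. when the requested ids were parent ids of 0.
  return data if new_data.nil? || new_data.empty?

  # ...existing merge logic unchanged...
  data
end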
spec/wikidata/diff/api_spec.rb
Outdated
it 'returns a truncation warning if query exceeds response size limit' do
  result = Api.fetch_revision_data(revision_ids)

  if result['warnings']
I don't like having an if/else in this test. I think this test should assume that the API will behave the way it currently does, and write the test against that with a query that we know results in the warning. If necessary, you can write a test against a single direct query rather than the .fetch_revision_data method, which handles but doesn't return the warning. But the test should also be explicit about handling the warning and returning the correct results in such a case, if possible.
So I changed this test case so that it makes a single direct query and checks for the truncation warning; the second test case then makes the same query through the get_revision_contents method and expects the warning to be handled by returning the correct results, i.e. the result size is the same as the number of revision ids passed in. Does that work?
That sounds good.
spec/wikidata/diff/api_spec.rb
Outdated
require 'rspec'

RSpec.describe Api do
  describe '.get_revision_contents' do
I think this pair of tests is basically doing what we want now, but they should be refactored to be more idiomatic.
The describe block here indicates a particular method that is supposed to be tested, but the first test within it doesn't use that method. It should be in a different describe block.
spec/wikidata/diff/api_spec.rb
Outdated
  ]
end

it 'returns a truncation warning since query exceeds response size limit' do
This test isn't actually running any of the gem code; it's replicating the client setup and just testing against the mediawiki_api gem. It should probably be rewritten to use the .mediawiki_request method, which at least will involve a little bit of gem code and might break if the implementation is changed in a significant way (and could be extended with more tests if we discover and add fixes for additional bugs).
spec/wikidata/diff/api_spec.rb
Outdated
API_URL = 'https://www.wikidata.org/w/api.php'
client = MediawikiApi::Client.new(API_URL)

query_parameters = {
This is essentially duplicating .fetch_revision_data (except for the last line), which is a signal that it might be good to refactor the API file so that taking the revisions and turning them into query params is an isolated function.
A good test is one where it's very hard to break the corresponding code without also causing the test to fail, but here we could make any sort of change to the API code (such as reducing the rvprops to only ids, which would probably make the warning disappear without changing the number of results) without breaking this test.
Noted! It is my first time writing tests, hence my many oversights. Sorry about that 😅 but I am learning a whole lot, so I am thankful for that.
These are the changes I made. Here response is assigned to fetch_all_revisions instead, which calls the get_query_parameters method:
def self.get_revision_contents(revision_ids)
  revision_ids = revision_ids.uniq if revision_ids
  response = fetch_all_revisions(revision_ids)
  # more code
end

def self.get_query_parameters(revision_ids)
  {
    prop: 'revisions',
    revids: revision_ids&.join('|'),
    rvslots: 'main',
    rvprop: 'content|ids|comment',
    format: 'json'
  }
end

def self.fetch_all_revisions(revision_ids)
  query = get_query_parameters(revision_ids)
  # more code
end
Then the test becomes:
describe '.mediawiki_request' do
  it 'returns a truncation warning since query exceeds response size limit' do
    client = Api.api_client
    query = Api.get_query_parameters(revision_ids)
    response = Api.mediawiki_request(client, 'query', query)
    expect(response['warnings']).not_to be_nil
    expect(response['warnings']['result']['*']).to include(
      "This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes."
    )
  end
end

describe '.get_revision_contents' do
  it 'returns the correct result and handles the warning' do
    result = Api.get_revision_contents(revision_ids)
    expect { result }.not_to raise_error
    expect(result).to be_a(Hash)
    expect(result.size).to eq(25)
  end
end
This looks like an improvement!
What this PR does
This PR solves issue #31 by improving the response handling logic: it implements continue for queries that exceed the response size limit.
The original method fetch_revision_data directly called client.action to send a request to the Mediawiki API, retrieving revision data in one go. This approach didn't account for cases where the API response could be paginated, meaning not all revision data might be returned in a single response.
Refactored Logic Walkthrough:
- The request loop keeps fetching while the API returns a continue parameter indicating there is more data to fetch.
- Each response.data is merged into data using deep_merge!.
- The merged data contains the response.data for all the revisions, structured and ready for parsing (a rough sketch of the loop follows).
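A rough sketch of that loop, for reference. This is not the exact code in api.rb: the helper names (api_client, get_query_parameters, mediawiki_request) follow the ones discussed in this PR, and it assumes, as in the example structure quoted earlier, that response.data is keyed by page id and that the response object exposes the top-level 'continue' element via [], as the specs in this thread do for 'warnings'.

def self.fetch_all_revisions(revision_ids)
  client = api_client
  base_query = get_query_parameters(revision_ids)
  data = {}
  continue_param = {}

  loop do
    response = mediawiki_request(client, 'query', base_query.merge(continue_param))
    break if response.nil?

    # Merge this batch into the accumulated data, concatenating the
    # 'revisions' arrays so earlier batches are not overwritten.
    data.merge!(response.data) do |_page_id, old_page, new_page|
      old_page.merge(new_page) do |key, old_val, new_val|
        key == 'revisions' ? old_val + new_val : new_val
      end
    end

    # Keep going while the API signals that more data is available.
    continue_param = response['continue'] || {}
    break if continue_param.empty?
  end

  data
end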
Notes:
- The mediawiki_request method introduces a retry mechanism with a maximum of three attempts, including a short sleep if the error is due to rate limiting (HTTP 429 status code).
- The too_many_requests? method checks for rate-limiting errors, and log_error is used to handle unexpected errors gracefully, instead of allowing exceptions to terminate the process.
- The api_client method creates the MediawikiApi::Client instance, making the code cleaner and potentially more reusable (a minimal sketch follows).
Comparison
Before implementing 'continue':
The 2nd test passes because it returns a truncation warning (I tested removing the else statement to make sure it was not just passing because the result was not nil).
After implementing 'continue':