Implements continue for queries in api.rb + Adds deep_merge gem to ge… #33

Conversation

Contributor

@empty-codes empty-codes commented Nov 11, 2024

What this PR does

This PR addresses issue #31 by improving the response handling logic: it implements continue for queries that exceed the response size limit.

The original method fetch_revision_data directly called client.action to send a request to the MediaWiki API, retrieving revision data in one go. This approach didn't account for cases where the API response could be paginated, meaning not all revision data might be returned in a single response.

Refactored Logic Walkthrough (a rough sketch follows the list):

  1. First API call:
  • The initial API request returns a subset of the data, along with a continue parameter indicating there is more data to fetch.
  • response.data is merged into data.
  2. Subsequent API calls:
  • The loop continues, using the continue parameter to fetch additional data.
  • Each new response's data is merged into the data hash using deep_merge!.
  3. Final result:
  • The data hash contains the merged response.data for all the revisions, structured and ready for parsing.
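
As a rough sketch of the loop described above (method and helper names follow the notes and discussion below; the exact merged code differs in how it merges each response):

def self.fetch_revision_data(query)
  client = api_client
  data = {}
  continue_param = nil

  loop do
    # Resume from where the previous response stopped, if the API asked us to continue.
    query.merge!(continue_param) if continue_param
    response = mediawiki_request(client, 'query', query)
    break unless response

    # Accumulate this batch of revision data into the running hash.
    data.deep_merge!(response.data)

    # The API omits 'continue' once everything has been returned.
    continue_param = response['continue']
    break unless continue_param
  end

  data
end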

Notes:

  • I used the implementation of continue in lib/wiki_api.rb as a reference.
  • Added multiple helper methods as seen in the WikiEdu implementation:
    • The mediawiki_request method introduces a retry mechanism with a maximum of three attempts, including a short sleep if the error is due to rate limiting (HTTP 429 status code).
    • The too_many_requests? method checks for rate-limiting errors, and log_error is used to handle unexpected errors gracefully, instead of allowing exceptions to terminate the process.
    • A new api_client method creates the MediawikiApi::Client instance, making the code cleaner and potentially more reusable.
  • To use data.deep_merge!(response.data) I had to install the deep_merge gem, so I modified the Gemfile. I tried simply using .merge, but then it processed 17 revisions instead of the 25 that .deep_merge handles (see the short illustration after this list).
  • I rewrote the test and added the size expectations before testing the api.rb changes.
  • The revision and page ids were only printed for debugging and are not part of the actual code.
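
A quick illustration of that .merge vs .deep_merge difference, using made-up data (the real responses key pages by pageid, so a plain merge throws away every page that only appeared in an earlier batch):

require 'deep_merge' # the gem this PR originally added to the Gemfile

first_batch  = { 'pages' => { '101' => { 'title' => 'Q1' } } }
second_batch = { 'pages' => { '202' => { 'title' => 'Q2' } } }

# Plain Hash#merge replaces the whole 'pages' value, losing page 101.
first_batch.merge(second_batch)
# => {"pages"=>{"202"=>{"title"=>"Q2"}}}

# The deep_merge gem recurses into the nested hashes, so both pages survive.
first_batch.deep_merge!(second_batch)
# => {"pages"=>{"101"=>{"title"=>"Q1"}, "202"=>{"title"=>"Q2"}}}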

Comparison

Before implementing 'continue':

[screenshot: finalb]

The 2nd test passes because it returns a truncation warning (I tested removing the else statement to make sure it was not just passing because the result was not nil).

After implementing 'continue':

[screenshot: finala]

Gemfile Outdated
@@ -14,3 +14,5 @@ gem 'rubocop', '~> 1.21'
gem 'mediawiki_api'

gem 'json'

gem 'deep_merge'
Member


Hmm... In the Dashboard codebase, we use the deep_merge functions (Hash#deep_merge! specifically) that are built into Ruby on Rails via ActiveSupport: https://apidock.com/rails/ActiveSupport/DeepMergeable/deep_merge%21

I think I would prefer adding that as a dependency rather than the separate deep_merge gem, so that this change won't expand the Dashboard's dependencies.

Contributor Author

@empty-codes empty-codes Nov 12, 2024


To do this, I added require 'active_support/core_ext/hash/deep_merge'

But when I ran it with the existing data.deep_merge!(response.data), it processed only 17 revisions instead of 25, which is the same behaviour as plain .merge. This time it isn't because of missing revisions; rather, the ActiveSupport deep_merge function doesn't handle response.data properly for some reason. It probably overwrites the affected revisions because of duplicate keys or something similar.

In the documentation link you sent, it does say that 'A block can be provided to merge non-DeepMergeable values', so using the response JSON structure for reference, I added the block below. It works and processes all 25 revisions.

data.deep_merge!(response.data) do |key, old_val, new_val|
  if key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array)
    old_val + new_val
  else
    new_val
  end
end

Comparison (the revisions marked in red are the ones not processed without the block):
[screenshot]

So should I use this block?

Member


So should I use this block?

Maybe. I'd like to get an explanation of why exactly that version of deep_merge! behaves differently first.

Contributor Author


From my understanding, and from checking the docs for the two versions (ActiveSupport's deep_merge vs the deep_merge gem):

  • For the gem, the docs say:
    "Deep Merge core documentation. deep_merge! method permits merging of arbitrary child elements. The two top level elements must be hashes. These hashes can contain unlimited (to stack limit) levels of child elements. These child elements do not have to be of the same types. Where child elements are of the same type, deep_merge will attempt to merge them together."
    So as long as the top-level elements are hashes, it will recursively merge any nested child elements within them, regardless of how deeply nested they are; it just merges all the hashes in the revisions array.

  • The ActiveSupport deep_merge! method is optimized for merging hashes and not necessarily for handling complex nested arrays in a way that preserves all elements. If the data structure includes arrays (like revisions), the merging logic can inadvertently drop or skip some array elements.
    It could be that ActiveSupport's deep_merge! is treating the affected revisions as conflicts that it cannot reconcile.
    Or the deep_merge! method is encountering the revisions array at a level where it expects to merge hashes, so it chooses to skip those revisions instead of merging them (see the small illustration below).

But as to why those specific revisions are skipped or overwritten by the ActiveSupport method, unfortunately I cannot tell. I saved the incomplete and complete responses to analyze them, but the hashes for the missing revision ids are simply not present in the revisions array; there isn't a broken structure or anything.
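
As an illustration of that hypothesis, with made-up data (based on the ActiveSupport docs linked above): deep_merge! recurses into hashes but treats arrays as non-DeepMergeable, so when the same page appears in two batches the second revisions array replaces the first unless a block is given.

require 'active_support/core_ext/hash/deep_merge'

first_batch  = { '95116959' => { 'revisions' => [{ 'revid' => 1 }] } }
second_batch = { '95116959' => { 'revisions' => [{ 'revid' => 2 }] } }

# Without a block, the array from the second batch wins and revid 1 is lost.
first_batch.deep_merge(second_batch)
# => {"95116959"=>{"revisions"=>[{"revid"=>2}]}}

# With the block from the comment above, the arrays are concatenated instead.
first_batch.deep_merge(second_batch) do |key, old_val, new_val|
  key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
end
# => {"95116959"=>{"revisions"=>[{"revid"=>1}, {"revid"=>2}]}}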

example response structure:

"95116959": {
      "pageid": 95116959,
      "ns": 0,
      "title": "Q96417649",
      "revisions": [
        {
          "revid": 2266123060,
          "parentid": 2265005040,
          "slots": {
            "main": {
              "contentmodel": "wikibase-item",
              "contentformat": "application/json",
              "*": "{//insert content here}"
            }
          },
          "comment": "/* wbsetdescription-set:1|jv */ gim vidéo, #quickstatements; #temporary_batch_1730083092317"
        },
        {
          "revid": 2266123123,
          "parentid": 2266123060,
          "slots": {
            "main": {
              "contentmodel": "wikibase-item",
              "contentformat": "application/json",
              "*": "{//insert content here}"
            }
          },
          "comment": "/* wbsetdescription-add:1|su */ kaulinan video, #quickstatements; #temporary_batch_1730083092317"
        }, ....

tries -= 1
sleep 1 if too_many_requests?(e)
retry unless tries.zero?
log_error(e)
Member


Where does this method come from?


Member


That function comes from Dashboard code, and shouldn't be used in this gem.

I think raising the error and letting the Dashboard code handle it is fine here.

Contributor Author

@empty-codes empty-codes Nov 12, 2024


@ragesoss Is it fine to simplify it to this?

def self.mediawiki_request(client, action, query)
  client.send(action, query)
rescue StandardError => e
  raise e 
end

Member


In that case, you don't even need the rescue. However, I think the retry behavior is worth keeping. The idea would just be to raise the error instead of logging it, if the retries are exhausted.

Contributor Author


Oh okay 😅. Then is this fine?

  def self.mediawiki_request(client, action, query)
    tries ||= 3
    client.send(action, query)
  rescue StandardError => e
    tries -= 1
    sleep 1 if too_many_requests?(e)
    retry unless tries.zero?
    raise(e)
  end

@ragesoss
Member

ragesoss commented Nov 12, 2024 via email

@empty-codes
Contributor Author

Okay. Since the structure of the expected response is known, I think it might be better to just write our own code to merge

This works:

def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      response.data['pages'].each do |pageid, page_data|
        if data['pages'] && data['pages'][pageid]
          # If the page already exists, merge revisions and other properties
          existing_page_data = data['pages'][pageid]

          # Merge the properties at the top level
          existing_page_data.merge!(page_data) do |key, old_val, new_val| 
            if key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array)
              old_val + new_val  # Append new revisions to the existing array
            else
              new_val  # Overwrite for other keys
            end
          end
        else
          # If the page does not exist, add it to the data
          data['pages'] ||= {}
          data['pages'][pageid] = page_data
        end
      end

      continue_param = response['continue']
      break if continue_param.nil?
    end
    
    data
  end

It uses the built-in Ruby .merge method, however.
If this is okay, can I go ahead and make a commit with all the changes?

@ragesoss
Member

That's quite verbose. Is there any simplification you can do? Otherwise, perhaps you can extract the part that is specifically for merging?

@empty-codes
Contributor Author

empty-codes commented Nov 13, 2024

That's quite verbose. Is there any simplification you can do? Otherwise, perhaps you can extract the part that is specifically for merging?

Note: I've tested them both and they process all revisions

I simplified the merge logic here:

 def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      response.data['pages'].each do |pageid, page_data|
        data['pages'] ||= {}
  
        if data['pages'][pageid]
          # Merge existing page data with the new page data
          data['pages'][pageid].merge!(page_data) do |key, old_val, new_val|
            key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
          end
        else
          # Add new page data if it doesn't already exist
          data['pages'][pageid] = page_data
        end
      end

      continue_param = response['continue']
      break if continue_param.nil?
    end

    data
  end

Alternatively, simplifying the whole method:

  def self.fetch_all_revisions(query)
    client = api_client
    data = { 'pages' => {} }
    continue_param = nil
  
    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response
  
      response.data['pages'].each do |pageid, page_data|
        if data['pages'][pageid]
          # Merge new page data, combining arrays for 'revisions'
          data['pages'][pageid].merge!(page_data) do |key, old_val, new_val|
            key == 'revisions' ? Array(old_val) + Array(new_val) : new_val
          end
        else
          data['pages'][pageid] = page_data
        end
      end
  
      continue_param = response['continue']
      break unless continue_param
    end
  
    data
  end

Alternatively, the merge logic could be moved to a helper method, something like:

def self.fetch_all_revisions(query)
  client = api_client
  data = {}
  continue_param = nil

  loop do
    query.merge!(continue_param) if continue_param
    response = mediawiki_request(client, 'query', query)
    break unless response

    # insert merge helper method call

    continue_param = response['continue']
    break if continue_param.nil?
  end

  data
end

@ragesoss
Member

Extracting the merge logic is definitely a good idea, in my opinion.

@empty-codes
Contributor Author

empty-codes commented Nov 13, 2024

Extracting the merge logic is definitely a good idea, in my opinion.
@ragesoss

  def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      merge_page_data(data, response.data['pages'])

      continue_param = response['continue']
      break unless continue_param
    end
  
    data
  end


def self.merge_page_data(data, pages)
  pages.each do |pageid, page_data|
    if data['pages'] && data['pages'][pageid]
      existing_page_data = data['pages'][pageid]

      existing_page_data.merge!(page_data) do |key, old_val, new_val|
        key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
      end
    else
      data['pages'] ||= {}
      data['pages'][pageid] = page_data
    end
  end
end

@ragesoss
Member

This seems reasonable.

…ge + Removes deep_merge dependencies + Changes request method to raise error instead of logging it
@empty-codes
Contributor Author

This seems reasonable.

Done! Since we've found a better fix, should I close the PRs I opened in the Dashboard and here for adding the guard statement?

@ragesoss
Member

Yes please.

@ragesoss
Member

Note the failing test here.

@empty-codes
Contributor Author

Note the failing test here.

I've determined the issue.
In this part of analyzer_spec:

# testing with the first revisions (with parent id being zero) and create item
    it 'returns the correct result for a revision array' do
      revision_ids = [1_016_232_717, 1_596_231_784]

      analyzed_revisions = WikidataDiffAnalyzer.analyze(revision_ids)

The revisions here have 0 as their parent ids. When .analyze is called, it calls handle_large_batches in LargeBatchesAnalyzer like so: result = LargeBatchesAnalyzer.handle_large_batches(revision_ids, 50), which then calls the get_revision_contents method twice, first to get the parsed_contents and then the parent_contents_batch, which uses the parent ids from the parsed_contents:

revision_ids.each_slice(batch_size) do |batch|
      parent_ids = []
      parsed_contents = Api.get_revision_contents(batch) # first call
      next unless parsed_contents

      # I have to check if any of the revision ids in the parsed content has parentid == 0
      parsed_contents.each do |revid, data|
        if (data[:parentid]).zero?
          first_revisions << revid
          puts "Revision ID with parentid 0: #{revid}"  # Print the revision ID
        else
          parent_ids << data[:parentid]
        end
      end
      puts "Array of parent IDs: #{parent_ids.inspect}"
      revision_contents.merge!(parsed_contents)
      parent_contents_batch = Api.get_revision_contents(parent_ids) #second call
      parent_contents.merge!(parent_contents_batch) if parent_contents_batch
    end

So in the case of the spec, the two revision ids have parent ids that are 0, so according to the logic above each id is skipped and not added to the parent_ids array, resulting in an empty array:

[screenshot]

This causes the error, because the code I wrote doesn't handle cases where the ids (here, the parent ids) are empty, which makes the pages in the response nil too.
The spec worked fine with deep_merge because the logic there would just skip or overwrite if response.data['pages'] was nil.

So simply adding a guard statement in the merge method works and does not affect the normal processing:

def self.merge_page_data(data, pages)
    return unless pages # exits early if pages is nil or false
    pages.each do |pageid, page_data|
      if data['pages'] && data['pages'][pageid]
        existing_page_data = data['pages'][pageid]

it 'returns a truncation warning if query exceeds response size limit' do
result = Api.fetch_revision_data(revision_ids)

if result['warnings']
Member


I don't like having an if/else in this test. I think this test should assume that the API will behave the way it currently does, and write the test against that with a query that we know results in the warning. If necessary, you can write a test against a single direct query rather than the .fetch_revision_data method, which handles but doesn't return the warning. But the test should also be explicit about handling the warning and returning the correct results in such a case, if possible.

Contributor Author


So I changed this test case so it makes a single direct query and checks for the truncation warning. The second test case then makes the same query through the get_revision_contents method and expects the warning to be handled by returning the correct results, i.e. the result size is the same as the number of revision ids passed in. Does that work?

Member


That sounds good.

require 'rspec'

RSpec.describe Api do
describe '.get_revision_contents' do
Member


I think this pair of tests is basically doing what we want now, but they should be refactored to be more idiomatic.

The describe block here indicates a particular method that is supposed to be tested, but the first test within it doesn't use that method. It should be in a different describe block.

]
end

it 'returns a truncation warning since query exceeds response size limit' do
Member


This test isn't actually running any of the gem code; it's replicating the client setup and just testing against the mediawiki_api gem. It should probably be rewritten to use the .mediawiki_request method, which at least will involve a little bit of gem code and might break if the implementation is changed in a significant way (and could be extended with more tests if we discover and add fixes for additional bugs).

API_URL = 'https://www.wikidata.org/w/api.php'
client = MediawikiApi::Client.new(API_URL)

query_parameters = {
Member


This is essentially duplicating .fetch_revision_data (except for the last line), which is a signal that it might be good to refactor the API file so that taking the revisions and turning it into query params is an isolated function.

A good test is one where it's very hard to break the corresponding code without also causing the test to fail, but here we could make any sort of change to the API code (such as reducing the rvprops to only ids, which would probably make the warning disappear without changing the number of results) without breaking this test.

Contributor Author

@empty-codes empty-codes Nov 15, 2024


Noted! It is my first time writing tests, hence my many oversights. Sorry about that 😅 but I am learning a whole lot, so I am thankful for that.

These are the changes I made. Here, response is assigned from fetch_all_revisions instead, which calls the get_query_parameters method:

def self.get_revision_contents(revision_ids)
  revision_ids = revision_ids.uniq if revision_ids
  response = fetch_all_revisions(revision_ids)
  # more code
end

def self.get_query_parameters(revision_ids)
  {
    prop: 'revisions',
    revids: revision_ids&.join('|'),
    rvslots: 'main',
    rvprop: 'content|ids|comment',
    format: 'json'
  }
end

def self.fetch_all_revisions(revision_ids)
  query = get_query_parameters(revision_ids)
  # more code
end

Then the test becomes:

describe '.mediawiki_request' do
  it 'returns a truncation warning since query exceeds response size limit' do
    client = Api.api_client
    query = Api.get_query_parameters(revision_ids)
    
    response = Api.mediawiki_request(client, 'query', query)

    expect(response['warnings']).not_to be_nil
    expect(response['warnings']['result']['*']).to include(
      "This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes."
    )
  end
end

describe '.get_revision_contents' do
  it 'returns the correct result and handles the warning' do
    result = Api.get_revision_contents(revision_ids)
    
    expect { result }.not_to raise_error
    expect(result).to be_a(Hash)
    expect(result.size).to eq(25)
  end
end

Member


This looks like an improvement!

@ragesoss ragesoss merged commit b41c355 into WikiEducationFoundation:main Nov 15, 2024
1 check passed