Implements continue for queries in api.rb + Adds deep_merge gem to ge… #33

Conversation

Contributor

@empty-codes empty-codes commented Nov 11, 2024

What this PR does

This PR addresses issue #31 by improving the response handling logic: it implements continue for queries that exceed the response size limit.

The original method fetch_revision_data directly called client.action to send a request to the MediaWiki API, retrieving revision data in one go. This approach didn't account for cases where the API response could be paginated, meaning not all revision data might be returned in a single response.

Refactored Logic Walkthrough (a rough sketch follows the list):

  1. First API call:
  • The initial API request returns a subset of the data, along with a continue parameter indicating there is more data to fetch.
  • response.data is merged into data.
  2. Subsequent API calls:
  • The loop continues, using the continue parameter to fetch additional data.
  • Each new response's data is merged into the data hash using deep_merge!.
  3. Final result:
  • The data hash contains the merged response.data for all the revisions, structured and ready for parsing.
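
As a rough sketch of the loop described above (method and helper names follow the notes and discussion below; the exact merged code differs in how it merges each response):

def self.fetch_revision_data(query)
  client = api_client
  data = {}
  continue_param = nil

  loop do
    # Resume from where the previous response stopped, if the API asked us to continue.
    query.merge!(continue_param) if continue_param
    response = mediawiki_request(client, 'query', query)
    break unless response

    # Accumulate this batch of revision data into the running hash.
    data.deep_merge!(response.data)

    # The API omits 'continue' once everything has been returned.
    continue_param = response['continue']
    break unless continue_param
  end

  data
end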

Notes:

  • I used the implementation of continue in lib/wiki_api.rb as a reference.
  • Added multiple helper methods as seen in the WikiEdu implementation:
    • The mediawiki_request method introduces a retry mechanism with a maximum of three attempts, including a short sleep if the error is due to rate limiting (HTTP 429 status code).
    • The too_many_requests? method checks for rate-limiting errors, and log_error is used to handle unexpected errors gracefully, instead of allowing exceptions to terminate the process.
    • A new api_client method creates the MediawikiApi::Client instance, making the code cleaner and potentially more reusable.
  • To use data.deep_merge!(response.data) I had to install the deep_merge gem, so I modified the Gemfile. I tried simply using .merge, but then it processed 17 revisions instead of the 25 that .deep_merge handles (see the short illustration after this list).
  • I rewrote the test and added the size expectations before testing the api.rb changes.
  • The revision and page ids were only printed for debugging and are not part of the actual code.
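
A quick illustration of that .merge vs .deep_merge difference, using made-up data (the real responses key pages by pageid, so a plain merge throws away every page that only appeared in an earlier batch):

require 'deep_merge' # the gem this PR originally added to the Gemfile

first_batch  = { 'pages' => { '101' => { 'title' => 'Q1' } } }
second_batch = { 'pages' => { '202' => { 'title' => 'Q2' } } }

# Plain Hash#merge replaces the whole 'pages' value, losing page 101.
first_batch.merge(second_batch)
# => {"pages"=>{"202"=>{"title"=>"Q2"}}}

# The deep_merge gem recurses into the nested hashes, so both pages survive.
first_batch.deep_merge!(second_batch)
# => {"pages"=>{"101"=>{"title"=>"Q1"}, "202"=>{"title"=>"Q2"}}}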

Comparison

Before implementing 'continue':

[screenshot: finalb]

The 2nd test passes because it returns a truncation warning (I tested removing the else statement to make sure it was not just passing because the result was not nil).

After implementing 'continue':

[screenshot: finala]

Gemfile Outdated
@@ -14,3 +14,5 @@ gem 'rubocop', '~> 1.21'
gem 'mediawiki_api'

gem 'json'

gem 'deep_merge'
Member


Hmm... In the Dashboard codebase, we use the deep_merge functions (Hash#deep_merge! specifically) that are built into Ruby on Rails via ActiveSupport: https://apidock.com/rails/ActiveSupport/DeepMergeable/deep_merge%21

I think I would prefer adding that as a dependency rather than the separate deep_merge gem, so that this change won't expand the Dashboard's dependencies.

Contributor Author

@empty-codes empty-codes Nov 12, 2024


To do this, I added require 'active_support/core_ext/hash/deep_merge'

But when I ran it with the existing data.deep_merge!(response.data), it processed only 17 revisions instead of 25, which is the same behaviour as plain .merge. This time it isn't because of missing revisions; rather, the ActiveSupport deep_merge function doesn't handle response.data properly for some reason. It probably overwrites the affected revisions because of duplicate keys or something similar.

In the documentation link you sent, it does say that 'A block can be provided to merge non-DeepMergeable values', so using the response JSON structure for reference, I added the block below. It works and processes all 25 revisions.

data.deep_merge!(response.data) do |key, old_val, new_val|
  if key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array)
    old_val + new_val
  else
    new_val
  end
end

Comparison (the revisions marked in red are the ones not processed without the block):
[screenshot]

So should I use this block?

Member


So should I use this block?

Maybe. I'd like to get an explanation of why exactly that version of deep_merge! behaves differently first.

Contributor Author


From my understanding, and from checking the docs for the two versions (ActiveSupport's deep_merge vs the deep_merge gem):

  • For the gem, the docs say:
    "Deep Merge core documentation. deep_merge! method permits merging of arbitrary child elements. The two top level elements must be hashes. These hashes can contain unlimited (to stack limit) levels of child elements. These child elements do not have to be of the same types. Where child elements are of the same type, deep_merge will attempt to merge them together."
    So as long as the top-level elements are hashes, it will recursively merge any nested child elements within them, regardless of how deeply nested they are; it just merges all the hashes in the revisions array.

  • The ActiveSupport deep_merge! method is optimized for merging hashes and not necessarily for handling complex nested arrays in a way that preserves all elements. If the data structure includes arrays (like revisions), the merging logic can inadvertently drop or skip some array elements.
    It could be that ActiveSupport's deep_merge! is treating the affected revisions as conflicts that it cannot reconcile.
    Or the deep_merge! method is encountering the revisions array at a level where it expects to merge hashes, so it chooses to skip those revisions instead of merging them (see the small illustration below).

But as to why those specific revisions are skipped or overwritten by the ActiveSupport method, unfortunately I cannot tell. I saved the incomplete and complete responses to analyze them, but the hashes for the missing revision ids are simply not present in the revisions array; there isn't a broken structure or anything.
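
As an illustration of that hypothesis, with made-up data (based on the ActiveSupport docs linked above): deep_merge! recurses into hashes but treats arrays as non-DeepMergeable, so when the same page appears in two batches the second revisions array replaces the first unless a block is given.

require 'active_support/core_ext/hash/deep_merge'

first_batch  = { '95116959' => { 'revisions' => [{ 'revid' => 1 }] } }
second_batch = { '95116959' => { 'revisions' => [{ 'revid' => 2 }] } }

# Without a block, the array from the second batch wins and revid 1 is lost.
first_batch.deep_merge(second_batch)
# => {"95116959"=>{"revisions"=>[{"revid"=>2}]}}

# With the block from the comment above, the arrays are concatenated instead.
first_batch.deep_merge(second_batch) do |key, old_val, new_val|
  key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
end
# => {"95116959"=>{"revisions"=>[{"revid"=>1}, {"revid"=>2}]}}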

example response structure:

"95116959": {
      "pageid": 95116959,
      "ns": 0,
      "title": "Q96417649",
      "revisions": [
        {
          "revid": 2266123060,
          "parentid": 2265005040,
          "slots": {
            "main": {
              "contentmodel": "wikibase-item",
              "contentformat": "application/json",
              "*": "{//insert content here}"
            }
          },
          "comment": "/* wbsetdescription-set:1|jv */ gim vidéo, #quickstatements; #temporary_batch_1730083092317"
        },
        {
          "revid": 2266123123,
          "parentid": 2266123060,
          "slots": {
            "main": {
              "contentmodel": "wikibase-item",
              "contentformat": "application/json",
              "*": "{//insert content here}"
            }
          },
          "comment": "/* wbsetdescription-add:1|su */ kaulinan video, #quickstatements; #temporary_batch_1730083092317"
        }, ....

tries -= 1
sleep 1 if too_many_requests?(e)
retry unless tries.zero?
log_error(e)
Member


Where does this method come from?


Member


That function comes from Dashboard code, and shouldn't be used in this gem.

I think raising the error and letting the Dashboard code handle it is fine here.

Contributor Author

@empty-codes empty-codes Nov 12, 2024


@ragesoss Is it fine to simplify it to this?

def self.mediawiki_request(client, action, query)
  client.send(action, query)
rescue StandardError => e
  raise e 
end

Member


In that case, you don't even need the rescue. However, I think the retry behavior is worth keeping. The idea would just be to raise the error instead of logging it, if the retries are exhausted.

Contributor Author


Oh okay 😅. Then is this fine?

  def self.mediawiki_request(client, action, query)
    tries ||= 3
    client.send(action, query)
  rescue StandardError => e
    tries -= 1
    sleep 1 if too_many_requests?(e)
    retry unless tries.zero?
    raise(e)
  end

@ragesoss
Member

ragesoss commented Nov 12, 2024 via email

@empty-codes
Contributor Author

Okay. Since the structure of the expected response is known, I think it might be better to just write our own code to merge

This works:

def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      response.data['pages'].each do |pageid, page_data|
        if data['pages'] && data['pages'][pageid]
          # If the page already exists, merge revisions and other properties
          existing_page_data = data['pages'][pageid]

          # Merge the properties at the top level
          existing_page_data.merge!(page_data) do |key, old_val, new_val| 
            if key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array)
              old_val + new_val  # Append new revisions to the existing array
            else
              new_val  # Overwrite for other keys
            end
          end
        else
          # If the page does not exist, add it to the data
          data['pages'] ||= {}
          data['pages'][pageid] = page_data
        end
      end

      continue_param = response['continue']
      break if continue_param.nil?
    end
    
    data
  end

It uses the built-in Ruby .merge method, however.
If this is okay, can I go ahead and make a commit with all the changes?

@ragesoss
Member

That's quite verbose. Is there any simplification you can do? Otherwise, perhaps you can extract the part that is specifically for merging?

@empty-codes
Contributor Author

empty-codes commented Nov 13, 2024

That's quite verbose. Is there any simplification you can do? Otherwise, perhaps you can extract the part that is specifically for merging?

Note: I've tested them both and they process all revisions

I simplified the merge logic here:

 def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      response.data['pages'].each do |pageid, page_data|
        data['pages'] ||= {}
  
        if data['pages'][pageid]
          # Merge existing page data with the new page data
          data['pages'][pageid].merge!(page_data) do |key, old_val, new_val|
            key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
          end
        else
          # Add new page data if it doesn't already exist
          data['pages'][pageid] = page_data
        end
      end

      continue_param = response['continue']
      break if continue_param.nil?
    end

    data
  end

Alternatively, simplifying the whole method:

  def self.fetch_all_revisions(query)
    client = api_client
    data = { 'pages' => {} }
    continue_param = nil
  
    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response
  
      response.data['pages'].each do |pageid, page_data|
        if data['pages'][pageid]
          # Merge new page data, combining arrays for 'revisions'
          data['pages'][pageid].merge!(page_data) do |key, old_val, new_val|
            key == 'revisions' ? Array(old_val) + Array(new_val) : new_val
          end
        else
          data['pages'][pageid] = page_data
        end
      end
  
      continue_param = response['continue']
      break unless continue_param
    end
  
    data
  end

Alternatively, the merge logic could be moved to a helper method, something like:

def self.fetch_all_revisions(query)
  client = api_client
  data = {}
  continue_param = nil

  loop do
    query.merge!(continue_param) if continue_param
    response = mediawiki_request(client, 'query', query)
    break unless response

    # insert merge helper method call

    continue_param = response['continue']
    break if continue_param.nil?
  end

  data
end

@ragesoss
Member

Extracting the merge logic is definitely a good idea, in my opinion.

@empty-codes
Contributor Author

empty-codes commented Nov 13, 2024

Extracting the merge logic is definitely a good idea, in my opinion.
@ragesoss

  def self.fetch_all_revisions(query)
    client = api_client
    data = {}
    continue_param = nil

    loop do
      query.merge!(continue_param) if continue_param
      response = mediawiki_request(client, 'query', query)
      break unless response

      merge_page_data(data, response.data['pages'])

      continue_param = response['continue']
      break unless continue_param
    end
  
    data
  end


def self.merge_page_data(data, pages)
  pages.each do |pageid, page_data|
    if data['pages'] && data['pages'][pageid]
      existing_page_data = data['pages'][pageid]

      existing_page_data.merge!(page_data) do |key, old_val, new_val|
        key == 'revisions' && old_val.is_a?(Array) && new_val.is_a?(Array) ? old_val + new_val : new_val
      end
    else
      data['pages'] ||= {}
      data['pages'][pageid] = page_data
    end
  end
end

@ragesoss
Member

This seems reasonable.

…ge + Removes deep_merge dependencies + Changes request method to raise error instead of logging it
@empty-codes
Contributor Author

This seems reasonable.

Done! Since we've found a better fix, should I close the PRs I opened in the Dashboard and here for adding the guard statement?

@ragesoss
Member

Yes please.

@ragesoss
Member

Note the failing test here.

@empty-codes
Contributor Author

Note the failing test here.

I've determined the issue.
In this part of analyzer_spec:

# testing with the first revisions (with parent id being zero) and create item
    it 'returns the correct result for a revision array' do
      revision_ids = [1_016_232_717, 1_596_231_784]

      analyzed_revisions = WikidataDiffAnalyzer.analyze(revision_ids)

The revisions here have 0 as their parent ids. When .analyze is called, it calls handle_large_batches in LargeBatchesAnalyzer like so: result = LargeBatchesAnalyzer.handle_large_batches(revision_ids, 50), which then calls the get_revision_contents method twice, first to get the parsed_contents and then the parent_contents_batch, which uses the parent ids from the parsed_contents:

revision_ids.each_slice(batch_size) do |batch|
      parent_ids = []
      parsed_contents = Api.get_revision_contents(batch) # first call
      next unless parsed_contents

      # I have to check if any of the revision ids in the parsed content has parentid == 0
      parsed_contents.each do |revid, data|
        if (data[:parentid]).zero?
          first_revisions << revid
          puts "Revision ID with parentid 0: #{revid}"  # Print the revision ID
        else
          parent_ids << data[:parentid]
        end
      end
      puts "Array of parent IDs: #{parent_ids.inspect}"
      revision_contents.merge!(parsed_contents)
      parent_contents_batch = Api.get_revision_contents(parent_ids) #second call
      parent_contents.merge!(parent_contents_batch) if parent_contents_batch
    end

So in the case of the spec, the two revision ids have parent ids that are 0, so according to the logic above each id is skipped and not added to the parent_ids array, resulting in an empty array:

[screenshot]

This causes the error, because the code I wrote doesn't handle cases where the ids (here, the parent ids) are empty, which makes the pages in the response nil too.
The spec worked fine with deep_merge because the logic there would just skip or overwrite if response.data['pages'] was nil.

So simply adding a guard statement in the merge method works and does not affect the normal processing:

def self.merge_page_data(data, pages)
    return unless pages # exits early if pages is nil or false
    pages.each do |pageid, page_data|
      if data['pages'] && data['pages'][pageid]
        existing_page_data = data['pages'][pageid]

it 'returns a truncation warning if query exceeds response size limit' do
result = Api.fetch_revision_data(revision_ids)

if result['warnings']
Member


I don't like having an if/else in this test. I think this test should assume that the API will behave the way it currently does, and write the test against that with a query that we know results in the warning. If necessary, you can write a test against a single direct query rather than the .fetch_revision_data method, which handles but doesn't return the warning. But the test should also be explicit about handling the warning and returning the correct results in such a case, if possible.

Contributor Author


So I changed this test case so it makes a single direct query and checks for the truncation warning. The second test case then makes the same query through the get_revision_contents method and expects the warning to be handled by returning the correct results, i.e. the result size is the same as the number of revision ids passed in. Does that work?

Member


That sounds good.

require 'rspec'

RSpec.describe Api do
describe '.get_revision_contents' do
Member


I think this pair of tests is basically doing what we want now, but they should be refactored to be more idiomatic.

The describe block here indicates a particular method that is supposed to be tested, but the first test within it doesn't use that method. It should be in a different describe block.

]
end

it 'returns a truncation warning since query exceeds response size limit' do
Member


This test isn't actually running any of the gem code; it's replicating the client setup and just testing against the mediawiki_api gem. It should probably be rewritten to use the .mediawiki_request method, which at least will involve a little bit of gem code and might break if the implementation is changed in a significant way (and could be extended with more tests if we discover and add fixes for additional bugs).

API_URL = 'https://www.wikidata.org/w/api.php'
client = MediawikiApi::Client.new(API_URL)

query_parameters = {
Member


This is essentially duplicating .fetch_revision_data (except for the last line), which is a signal that it might be good to refactor the API file so that taking the revisions and turning it into query params is an isolated function.

A good test is one where it's very hard to break the corresponding code without also causing the test to fail, but here we could make any sort of change to the API code (such as reducing the rvprops to only ids, which would probably make the warning disappear without changing the number of results) without breaking this test.

Contributor Author

@empty-codes empty-codes Nov 15, 2024


Noted! It is my first time writing tests, hence my many oversights. Sorry about that 😅 but I am learning a whole lot, so I am thankful for that.

These are the changes I made. Here, response is assigned from fetch_all_revisions instead, which calls the get_query_parameters method:

def self.get_revision_contents(revision_ids)
  revision_ids = revision_ids.uniq if revision_ids
  response = fetch_all_revisions(revision_ids)
  # more code
end

def self.get_query_parameters(revision_ids)
  {
    prop: 'revisions',
    revids: revision_ids&.join('|'),
    rvslots: 'main',
    rvprop: 'content|ids|comment',
    format: 'json'
  }
end

def self.fetch_all_revisions(revision_ids)
  query = get_query_parameters(revision_ids)
  # more code
end

Then the test becomes:

describe '.mediawiki_request' do
  it 'returns a truncation warning since query exceeds response size limit' do
    client = Api.api_client
    query = Api.get_query_parameters(revision_ids)
    
    response = Api.mediawiki_request(client, 'query', query)

    expect(response['warnings']).not_to be_nil
    expect(response['warnings']['result']['*']).to include(
      "This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes."
    )
  end
end

describe '.get_revision_contents' do
  it 'returns the correct result and handles the warning' do
    result = Api.get_revision_contents(revision_ids)
    
    expect { result }.not_to raise_error
    expect(result).to be_a(Hash)
    expect(result.size).to eq(25)
  end
end

Member


This looks like an improvement!

@ragesoss ragesoss merged commit b41c355 into WikiEducationFoundation:main Nov 15, 2024
1 check passed