Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GitHub tag matching #103

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

GitHub tag matching #103

wants to merge 10 commits into from

Conversation

16Martin
Copy link
Collaborator

This PR addresses #99 and introduces code intended to replace the current combination of get_github_info() followed by get_matching_tag() which exists only in capycli.bom.findsources.

This approach first tries to match a tag using the original get_matching_tag(). If all the guessing does not yield any results, the algo implicitly falls back to analyzing each tag with get_matching_tag().

@16Martin
Copy link
Collaborator Author

16Martin commented Nov 16, 2024

I cannot make the unittest work.

In test_find_golang_url_github() (https://github.com/sw360/capycli/blob/main/tests/test_find_sources.py#L329-L339) we expect the result of find_golang_url() to be 'https://pkg.go.dev/github.com/opencontainers/runc'. I do not understand this.

find_golang_url() has essentially two ways to set source_url, the variable that it ultimately returns:

  1. if len(split_version) == 3, in this case the source_url would contain /archive/
  2. in any other case source_url is set to the result of get_matching_tags, which is mocked to 'https://github.com/opencontainers/runc/archive/refs/tags/v1.0.1.zip', which also contains /archive/
  3. Every non-empty result of get_matching_tag() ends with .zip

How and why does this test work?

As a user, I would not feel good about getting https://pkg.go.dev/github.com/opencontainers/runc as a source URL, but maybe the variable names are misleading?

@16Martin
Copy link
Collaborator Author

Found it. The order in which the unittest sets up its mocks does not match the mock naming.

@16Martin 16Martin force-pushed the martin/fix-github-tag-matching branch from 5c9acdb to 4730fa3 Compare November 19, 2024 17:15
@16Martin
Copy link
Collaborator Author

Some functionality has been moved to protected methods in order to split the task into smaller, more focused parts. There is only one new public method: FindSources.version_to_github_tag().

The core of this PR are the lines in https://github.com/sw360/capycli/blob/martin/fix-github-tag-matching/capycli/bom/findsources.py#L278-L290:

While the current approach first fetch all tags that belong to a specific project and then passes the full list to get_matching_tag(), this new approach calls get_matching_tag() for each tag for each page of tags. As a result, if the new tag-guessing does not yield any matches,, the algo will match (or not match) the same tag the current approach matches.

The most important additions are lines 290 and following. If get_matching_tag() was unable to match the version to the tag, the logic generates eight possible candidates from the version and then checks if any of these candidates exist in the repo, before moving on to the next tag in the list.

The logic to create the candidates is (to some extend) the inverse of to_semver_string(). The idea is that if any of these candidates exists, get_matching_tag() would yield a positive match. The logic generating the candidates is to some extend the inverse of to_semver_string().

The algo then looks for each candidate in the current result-page and if that local lookup does not yield a match, then the algo queries the GitHub API and specifically asks if a tag with the candidate's name exists. If we can find a match through either of these two lookups, we use that match and stop the search.

With my BOMs, I notice a tremendous speedup. On average the guessing part finds a positive match immediately on the first results page. If it doesn't the API query is successful. Using my BOMs, the algo never fetches the second page of tags from GitHub.

@16Martin 16Martin force-pushed the martin/fix-github-tag-matching branch from 9116a54 to 5dd3cbb Compare November 20, 2024 12:51
@16Martin 16Martin force-pushed the martin/fix-github-tag-matching branch 2 times, most recently from aa75947 to bea838f Compare November 30, 2024 00:18
@16Martin 16Martin marked this pull request as ready for review November 30, 2024 00:21
@16Martin
Copy link
Collaborator Author

The first draft of this contribution to address the scaling issue of the current code described in #99

I went through several iterations and attempts and I always ended up in a way to big change to become a sensible contribution. This draft is now a very reduced implementation of the orignal idea. Essentially it runs get_matching_tag() (unchanged!) per results page instead of once on the complete set of results. That way it is still get_matching_tag() that makes all important decisions and therefore we can have a high confidence that the new code produces the same outcome as the old code.

The guessing tags idea has been replaced with a guessing prefix search using the /git/refs endpoint. The idea is that in most cases this will yield a much smaller result than the original get_github_info() and will therefore consume fewer resources and be quicker. In the worst case it makes even more API calls than the original implementation, but usually it is a lot quicker.

I tested this approach in the context of another project of mine where I actually had both implementations enabled concurrently, so I was able to compate the old approach with the new approach. In my test run, there were 1215 calls to get_matching_source_url. In 1210 cases it did yield the exact same result as the original implementation.

In five cases the new approach did yield results while the old apporach didn't yield any results. Since all matching is still done using the unchenged get_matching_tag() I concluded the reason for these 5 additional results is that in the original implementation these tags are not retrieved from Github. It turned out that only 2 out of the 3 use cases use get_github_into() to retrieve the full set of tags. In the third use case get_matching_tag is only passed the first page of results. I assumed this is a bug.

I structured the contribution into smaller commits with extensive commit messages to assist the reviewing.

@tngraf
Copy link
Collaborator

tngraf commented Nov 30, 2024

Hi @16Martin,
I had a look at your code:

  • you may want to run poetry run flake8 and fix all issues.
  • you may want to run poetry run codespell . and fix all issues.
  • using your code in real live did not work: please run
    poetry run capycli bom findsources -i .\testbom1.json -o .\testbom1s.json -v
    with the attached SBOM ... and make it work.

testbom1.json

* there is generic error handling of GitHub API calls in
  FindSources.check_for_github_error which is actually always helpful.
  -> replicated behaviour of FindSources.check_for_github_error in
     FindSources.github_request
* augmented FindSources.github_request with two additional parameters:
  - return_response, boolean, default: False
    if set to true it will return the entire Response object instead of the
    output of Response.json()
    This helps with GitHub API interaction, where we sometimes need to access
    reponse headers.
  - allow_redirects, boolean, default: True
    this is a mere externalization of the allow_redirects parameter
    requests.requests, i.e. requests.get as it is used in github_request()
    This helps with GitHub API interaction, where we sometimes need to access
    reponse headers of 3xx responses.
* fixed a bug in FindSources.version_regex, which prevented it from
  extracting versions with leading underscores  e.g. FOOBAR_1_2_3.
* aligned whitespace padding for a frequent (in my test scenario) capycli output
* refactored TestBasePytest.capture_stdout and TestBasePytest.capture_stderr to
  dump the capture to stdout if the wrapped method raises an unhandled
  exception.
* refactored the github_request-mock in tests/test_find_sources.py:
  - introduced class MockProject which can mock GitHub API responses for
    individual projects. Most important method is MockProject.mock_github_request
  - moved all the static test data of GitHub projects to global GITHUB_PROJECTS
  - added TestFindSources.setUpClass which initializes cls.github_projects as a list
    of objects of type MockProject; one for each project in GITHUB_PROJECTS.
  - Findsources.mock_github_request_side_effect is now a relatively small lookup
    in self.github_projects.
  This allows individual tests to do specific temporary changes. Beware when
  running multithreaded unittests!
* softened a unittest which would make an exact assumption about the installed
  packages of the virtualenv running the unittests. In its old version, the
  test would always break if developers installed additional software without
  correspondingly modifying this particular test. It would also break every time
  the amount of transitive dependencies changes, which is beyond our influence!
This commit only adds new code and a corresponding unittest. It doesn't
alter the current FindSources behaviour at all.

* Introduced member class FindSources.TagCache
  A simple caching mechanism to ensure get_matching_source_url() sends
  each call to the /git/refs endpoint only once. As such and since it is
  only used by get_matching_source_url(), it is merely an optimization of
  the approach implemented in get_matching_source_url(). It is used
  through FindSources.TagCache.filter_and_cache()
* Introduced FindSources.get_matching_source_url, which is intended to
  replace all current calls (there are 3) to FindSources.get_matching_tag
  in capycli.bom.findsources. Especially it is to replace the call to
  FindSources.get_github_info that preceeds get_matching_tag().
  - the replacement for get_matching_tag is called get_matching_source_url,
    because the name get_matching_tag is misleading:
      get_matching_tag() does never return a tag. It either returns an
      empty string in case of errors or it returns a URL that points to
      source archive.
    get_matching_source_url does indeed return a source URL on success.
  - supporting methods _get_github_repo and _get_link_page for focus
  - _get_github_repo centralizes all current sanity checks on references
    to GitHub projects found throughout FindSources, ensuring consistent
    results. In addition this allows the refactoring of all methods in
    FindSources that currently call get_matching_tag to have a more
    unified approach to retrieving source URLs by tag. This contribution
    does not do this refactoring for clarity and focus.
* replace all calls to get_matching_tag to get_matching_source_url
  (3 occurances)
* rewrote get_github_info() and check_for_github_error() to raise
  a NotImplementedError. This will help consumers of these methods
  notify the change in capycli. In capycli, these methods are not
  used outside of FindSources.
* refactored the unittest test_find_golang_url_github for historical
  reasons: Due to a bug in the unittest it was the first test to
  break when I integrated the new approach. Through this the test
  became my workhorse for integrating the new approach.
@16Martin 16Martin force-pushed the martin/fix-github-tag-matching branch from bea838f to c8451a3 Compare November 30, 2024 22:42
@tngraf
Copy link
Collaborator

tngraf commented Dec 1, 2024

@16Martin
change line 143 to elif "message" in response.json(): should help...

@16Martin
Copy link
Collaborator Author

16Martin commented Dec 1, 2024

@16Martin change line 143 to elif "message" in response.json(): should help...

But it will run into problems if we ever see a list with 'message' as an entry. I put isinstance(res, dict) that covers more cases. The issue with my code is that I never tested it without a github token. I am working on a proper unittest for github_request(). Obviously we are lacking coverage.

@16Martin 16Martin force-pushed the martin/fix-github-tag-matching branch from 530f482 to 00b7592 Compare December 5, 2024 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants