GitHub tag matching #103

16Martin · 2024-11-16T12:48:06Z

This PR addresses #99 and introduces code intended to replace the current combination of get_github_info() followed by get_matching_tag() which exists only in capycli.bom.findsources.

This approach first tries to match a tag using the original get_matching_tag(). If all the guessing does not yield any results, the algo implicitly falls back to analyzing each tag with get_matching_tag().

16Martin · 2024-11-16T13:27:43Z

I cannot make the unittest work.

In test_find_golang_url_github() (https://github.com/sw360/capycli/blob/main/tests/test_find_sources.py#L329-L339) we expect the result of find_golang_url() to be 'https://pkg.go.dev/github.com/opencontainers/runc'. I do not understand this.

find_golang_url() has essentially two ways to set source_url, the variable that it ultimately returns:

if len(split_version) == 3, in this case the source_url would contain /archive/
in any other case source_url is set to the result of get_matching_tags, which is mocked to 'https://github.com/opencontainers/runc/archive/refs/tags/v1.0.1.zip', which also contains /archive/
Every non-empty result of get_matching_tag() ends with .zip

How and why does this test work?

As a user, I would not feel good about getting https://pkg.go.dev/github.com/opencontainers/runc as a source URL, but maybe the variable names are misleading?

16Martin · 2024-11-18T23:29:11Z

Found it. The order in which the unittest sets up its mocks does not match the mock naming.

16Martin · 2024-11-19T17:58:01Z

Some functionality has been moved to protected methods in order to split the task into smaller, more focused parts. There is only one new public method: FindSources.version_to_github_tag().

The core of this PR are the lines in https://github.com/sw360/capycli/blob/martin/fix-github-tag-matching/capycli/bom/findsources.py#L278-L290:

While the current approach first fetch all tags that belong to a specific project and then passes the full list to get_matching_tag(), this new approach calls get_matching_tag() for each tag for each page of tags. As a result, if the new tag-guessing does not yield any matches,, the algo will match (or not match) the same tag the current approach matches.

The most important additions are lines 290 and following. If get_matching_tag() was unable to match the version to the tag, the logic generates eight possible candidates from the version and then checks if any of these candidates exist in the repo, before moving on to the next tag in the list.

The logic to create the candidates is (to some extend) the inverse of to_semver_string(). The idea is that if any of these candidates exists, get_matching_tag() would yield a positive match. The logic generating the candidates is to some extend the inverse of to_semver_string().

The algo then looks for each candidate in the current result-page and if that local lookup does not yield a match, then the algo queries the GitHub API and specifically asks if a tag with the candidate's name exists. If we can find a match through either of these two lookups, we use that match and stop the search.

With my BOMs, I notice a tremendous speedup. On average the guessing part finds a positive match immediately on the first results page. If it doesn't the API query is successful. Using my BOMs, the algo never fetches the second page of tags from GitHub.

16Martin · 2024-11-30T00:41:07Z

The first draft of this contribution to address the scaling issue of the current code described in #99

I went through several iterations and attempts and I always ended up in a way to big change to become a sensible contribution. This draft is now a very reduced implementation of the orignal idea. Essentially it runs get_matching_tag() (unchanged!) per results page instead of once on the complete set of results. That way it is still get_matching_tag() that makes all important decisions and therefore we can have a high confidence that the new code produces the same outcome as the old code.

The guessing tags idea has been replaced with a guessing prefix search using the /git/refs endpoint. The idea is that in most cases this will yield a much smaller result than the original get_github_info() and will therefore consume fewer resources and be quicker. In the worst case it makes even more API calls than the original implementation, but usually it is a lot quicker.

I tested this approach in the context of another project of mine where I actually had both implementations enabled concurrently, so I was able to compate the old approach with the new approach. In my test run, there were 1215 calls to get_matching_source_url. In 1210 cases it did yield the exact same result as the original implementation.

In five cases the new approach did yield results while the old apporach didn't yield any results. Since all matching is still done using the unchenged get_matching_tag() I concluded the reason for these 5 additional results is that in the original implementation these tags are not retrieved from Github. It turned out that only 2 out of the 3 use cases use get_github_into() to retrieve the full set of tags. In the third use case get_matching_tag is only passed the first page of results. I assumed this is a bug.

I structured the contribution into smaller commits with extensive commit messages to assist the reviewing.

tngraf · 2024-11-30T17:20:56Z

Hi @16Martin,
I had a look at your code:

you may want to run poetry run flake8 and fix all issues.
you may want to run poetry run codespell . and fix all issues.
using your code in real live did not work: please run
poetry run capycli bom findsources -i .\testbom1.json -o .\testbom1s.json -v
with the attached SBOM ... and make it work.

testbom1.json

* there is generic error handling of GitHub API calls in FindSources.check_for_github_error which is actually always helpful. -> replicated behaviour of FindSources.check_for_github_error in FindSources.github_request * augmented FindSources.github_request with two additional parameters: - return_response, boolean, default: False if set to true it will return the entire Response object instead of the output of Response.json() This helps with GitHub API interaction, where we sometimes need to access reponse headers. - allow_redirects, boolean, default: True this is a mere externalization of the allow_redirects parameter requests.requests, i.e. requests.get as it is used in github_request() This helps with GitHub API interaction, where we sometimes need to access reponse headers of 3xx responses. * fixed a bug in FindSources.version_regex, which prevented it from extracting versions with leading underscores e.g. FOOBAR_1_2_3. * aligned whitespace padding for a frequent (in my test scenario) capycli output * refactored TestBasePytest.capture_stdout and TestBasePytest.capture_stderr to dump the capture to stdout if the wrapped method raises an unhandled exception. * refactored the github_request-mock in tests/test_find_sources.py: - introduced class MockProject which can mock GitHub API responses for individual projects. Most important method is MockProject.mock_github_request - moved all the static test data of GitHub projects to global GITHUB_PROJECTS - added TestFindSources.setUpClass which initializes cls.github_projects as a list of objects of type MockProject; one for each project in GITHUB_PROJECTS. - Findsources.mock_github_request_side_effect is now a relatively small lookup in self.github_projects. This allows individual tests to do specific temporary changes. Beware when running multithreaded unittests! * softened a unittest which would make an exact assumption about the installed packages of the virtualenv running the unittests. In its old version, the test would always break if developers installed additional software without correspondingly modifying this particular test. It would also break every time the amount of transitive dependencies changes, which is beyond our influence!

This commit only adds new code and a corresponding unittest. It doesn't alter the current FindSources behaviour at all. * Introduced member class FindSources.TagCache A simple caching mechanism to ensure get_matching_source_url() sends each call to the /git/refs endpoint only once. As such and since it is only used by get_matching_source_url(), it is merely an optimization of the approach implemented in get_matching_source_url(). It is used through FindSources.TagCache.filter_and_cache() * Introduced FindSources.get_matching_source_url, which is intended to replace all current calls (there are 3) to FindSources.get_matching_tag in capycli.bom.findsources. Especially it is to replace the call to FindSources.get_github_info that preceeds get_matching_tag(). - the replacement for get_matching_tag is called get_matching_source_url, because the name get_matching_tag is misleading: get_matching_tag() does never return a tag. It either returns an empty string in case of errors or it returns a URL that points to source archive. get_matching_source_url does indeed return a source URL on success. - supporting methods _get_github_repo and _get_link_page for focus - _get_github_repo centralizes all current sanity checks on references to GitHub projects found throughout FindSources, ensuring consistent results. In addition this allows the refactoring of all methods in FindSources that currently call get_matching_tag to have a more unified approach to retrieving source URLs by tag. This contribution does not do this refactoring for clarity and focus.

* replace all calls to get_matching_tag to get_matching_source_url (3 occurances) * rewrote get_github_info() and check_for_github_error() to raise a NotImplementedError. This will help consumers of these methods notify the change in capycli. In capycli, these methods are not used outside of FindSources. * refactored the unittest test_find_golang_url_github for historical reasons: Due to a bug in the unittest it was the first test to break when I integrated the new approach. Through this the test became my workhorse for integrating the new approach.

tngraf · 2024-12-01T09:03:47Z

@16Martin
change line 143 to elif "message" in response.json(): should help...

16Martin · 2024-12-01T10:12:22Z

@16Martin change line 143 to elif "message" in response.json(): should help...

But it will run into problems if we ever see a list with 'message' as an entry. I put isinstance(res, dict) that covers more cases. The issue with my code is that I never tested it without a github token. I am working on a proper unittest for github_request(). Obviously we are lacking coverage.

…w360/capycli into martin/fix-github-tag-matching

…is untouched line

…hon 3.10

16Martin force-pushed the martin/fix-github-tag-matching branch from 5c9acdb to 4730fa3 Compare November 19, 2024 17:15

16Martin force-pushed the martin/fix-github-tag-matching branch from 9116a54 to 5dd3cbb Compare November 20, 2024 12:51

16Martin force-pushed the martin/fix-github-tag-matching branch 2 times, most recently from aa75947 to bea838f Compare November 30, 2024 00:18

16Martin marked this pull request as ready for review November 30, 2024 00:21

16Martin added 5 commits November 30, 2024 23:41

fixup: GitHub checks

9928df2

fixup(findsources): flake8 and codespell

c8451a3

16Martin force-pushed the martin/fix-github-tag-matching branch from bea838f to c8451a3 Compare November 30, 2024 22:42

fixup(findsources): flake8 and codespell

ad78faa

16Martin added 2 commits December 5, 2024 16:21

fixup(FindSources): handle github_request() exceptions w/ unittests

b806257

Merge branch 'martin/fix-github-tag-matching' of https://github.com/s…

00b7592

…w360/capycli into martin/fix-github-tag-matching

16Martin force-pushed the martin/fix-github-tag-matching branch from 530f482 to 00b7592 Compare December 5, 2024 15:44

16Martin added 2 commits December 5, 2024 16:51

fixup(FindSources): dunno. mypy suddenly started complaining about th…

ea93377

…is untouched line

fixup(FindSources): AttributeError().name does not exist prior to Pyt…

a1d2009

…hon 3.10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub tag matching #103

GitHub tag matching #103

16Martin commented Nov 16, 2024

16Martin commented Nov 16, 2024 •

edited

Loading

16Martin commented Nov 18, 2024

16Martin commented Nov 19, 2024

16Martin commented Nov 30, 2024

tngraf commented Nov 30, 2024

tngraf commented Dec 1, 2024

16Martin commented Dec 1, 2024

GitHub tag matching #103

Are you sure you want to change the base?

GitHub tag matching #103

Conversation

16Martin commented Nov 16, 2024

16Martin commented Nov 16, 2024 • edited Loading

16Martin commented Nov 18, 2024

16Martin commented Nov 19, 2024

16Martin commented Nov 30, 2024

tngraf commented Nov 30, 2024

tngraf commented Dec 1, 2024

16Martin commented Dec 1, 2024

16Martin commented Nov 16, 2024 •

edited

Loading