-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GitHub tag matching #103
base: main
Are you sure you want to change the base?
GitHub tag matching #103
Conversation
I cannot make the unittest work. In
How and why does this test work? As a user, I would not feel good about getting |
Found it. The order in which the unittest sets up its mocks does not match the mock naming. |
5c9acdb
to
4730fa3
Compare
Some functionality has been moved to protected methods in order to split the task into smaller, more focused parts. There is only one new public method: The core of this PR are the lines in https://github.com/sw360/capycli/blob/martin/fix-github-tag-matching/capycli/bom/findsources.py#L278-L290: While the current approach first fetch all tags that belong to a specific project and then passes the full list to The most important additions are lines 290 and following. If The logic to create the candidates is (to some extend) the inverse of The algo then looks for each candidate in the current result-page and if that local lookup does not yield a match, then the algo queries the GitHub API and specifically asks if a tag with the candidate's name exists. If we can find a match through either of these two lookups, we use that match and stop the search. With my BOMs, I notice a tremendous speedup. On average the guessing part finds a positive match immediately on the first results page. If it doesn't the API query is successful. Using my BOMs, the algo never fetches the second page of tags from GitHub. |
9116a54
to
5dd3cbb
Compare
aa75947
to
bea838f
Compare
The first draft of this contribution to address the scaling issue of the current code described in #99 I went through several iterations and attempts and I always ended up in a way to big change to become a sensible contribution. This draft is now a very reduced implementation of the orignal idea. Essentially it runs get_matching_tag() (unchanged!) per results page instead of once on the complete set of results. That way it is still get_matching_tag() that makes all important decisions and therefore we can have a high confidence that the new code produces the same outcome as the old code. The guessing tags idea has been replaced with a guessing prefix search using the /git/refs endpoint. The idea is that in most cases this will yield a much smaller result than the original get_github_info() and will therefore consume fewer resources and be quicker. In the worst case it makes even more API calls than the original implementation, but usually it is a lot quicker. I tested this approach in the context of another project of mine where I actually had both implementations enabled concurrently, so I was able to compate the old approach with the new approach. In my test run, there were 1215 calls to get_matching_source_url. In 1210 cases it did yield the exact same result as the original implementation. In five cases the new approach did yield results while the old apporach didn't yield any results. Since all matching is still done using the unchenged get_matching_tag() I concluded the reason for these 5 additional results is that in the original implementation these tags are not retrieved from Github. It turned out that only 2 out of the 3 use cases use get_github_into() to retrieve the full set of tags. In the third use case get_matching_tag is only passed the first page of results. I assumed this is a bug. I structured the contribution into smaller commits with extensive commit messages to assist the reviewing. |
Hi @16Martin,
|
* there is generic error handling of GitHub API calls in FindSources.check_for_github_error which is actually always helpful. -> replicated behaviour of FindSources.check_for_github_error in FindSources.github_request * augmented FindSources.github_request with two additional parameters: - return_response, boolean, default: False if set to true it will return the entire Response object instead of the output of Response.json() This helps with GitHub API interaction, where we sometimes need to access reponse headers. - allow_redirects, boolean, default: True this is a mere externalization of the allow_redirects parameter requests.requests, i.e. requests.get as it is used in github_request() This helps with GitHub API interaction, where we sometimes need to access reponse headers of 3xx responses. * fixed a bug in FindSources.version_regex, which prevented it from extracting versions with leading underscores e.g. FOOBAR_1_2_3. * aligned whitespace padding for a frequent (in my test scenario) capycli output * refactored TestBasePytest.capture_stdout and TestBasePytest.capture_stderr to dump the capture to stdout if the wrapped method raises an unhandled exception. * refactored the github_request-mock in tests/test_find_sources.py: - introduced class MockProject which can mock GitHub API responses for individual projects. Most important method is MockProject.mock_github_request - moved all the static test data of GitHub projects to global GITHUB_PROJECTS - added TestFindSources.setUpClass which initializes cls.github_projects as a list of objects of type MockProject; one for each project in GITHUB_PROJECTS. - Findsources.mock_github_request_side_effect is now a relatively small lookup in self.github_projects. This allows individual tests to do specific temporary changes. Beware when running multithreaded unittests! * softened a unittest which would make an exact assumption about the installed packages of the virtualenv running the unittests. In its old version, the test would always break if developers installed additional software without correspondingly modifying this particular test. It would also break every time the amount of transitive dependencies changes, which is beyond our influence!
This commit only adds new code and a corresponding unittest. It doesn't alter the current FindSources behaviour at all. * Introduced member class FindSources.TagCache A simple caching mechanism to ensure get_matching_source_url() sends each call to the /git/refs endpoint only once. As such and since it is only used by get_matching_source_url(), it is merely an optimization of the approach implemented in get_matching_source_url(). It is used through FindSources.TagCache.filter_and_cache() * Introduced FindSources.get_matching_source_url, which is intended to replace all current calls (there are 3) to FindSources.get_matching_tag in capycli.bom.findsources. Especially it is to replace the call to FindSources.get_github_info that preceeds get_matching_tag(). - the replacement for get_matching_tag is called get_matching_source_url, because the name get_matching_tag is misleading: get_matching_tag() does never return a tag. It either returns an empty string in case of errors or it returns a URL that points to source archive. get_matching_source_url does indeed return a source URL on success. - supporting methods _get_github_repo and _get_link_page for focus - _get_github_repo centralizes all current sanity checks on references to GitHub projects found throughout FindSources, ensuring consistent results. In addition this allows the refactoring of all methods in FindSources that currently call get_matching_tag to have a more unified approach to retrieving source URLs by tag. This contribution does not do this refactoring for clarity and focus.
* replace all calls to get_matching_tag to get_matching_source_url (3 occurances) * rewrote get_github_info() and check_for_github_error() to raise a NotImplementedError. This will help consumers of these methods notify the change in capycli. In capycli, these methods are not used outside of FindSources. * refactored the unittest test_find_golang_url_github for historical reasons: Due to a bug in the unittest it was the first test to break when I integrated the new approach. Through this the test became my workhorse for integrating the new approach.
bea838f
to
c8451a3
Compare
@16Martin |
But it will run into problems if we ever see a list with 'message' as an entry. I put |
…w360/capycli into martin/fix-github-tag-matching
530f482
to
00b7592
Compare
This PR addresses #99 and introduces code intended to replace the current combination of get_github_info() followed by get_matching_tag() which exists only in capycli.bom.findsources.
This approach first tries to match a tag using the original get_matching_tag(). If all the guessing does not yield any results, the algo implicitly falls back to analyzing each tag with get_matching_tag().