Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci-matrix] Distribute unknown packages between shards. #339

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Pawka
Copy link
Contributor

@Pawka Pawka commented Jun 16, 2023

Currently all new packages which are not in timing files are assigned to a single bucket with shortest duration. This happens because duration of "unknown" packages is set to 0 but in reality it might be longer. There are also empty packages with no tests which have 0s duration but there is little overhead to run those tests.

This patch changes how those are distributed between shards to avoid all all of them added to a single bucket. If package run duration is lower than threshold - assign it by using hash function instead of minimal bucket. Hashing function is consistent what means the same package is assigned to the same bucket if package's name and count of buckets is not changed. This helps to reuse go build cache.

Bucket name is retrieved by calculating md5 hash of package's name, then taking part of hash, converting to integer and calculating modulus by buckets count. Only a part of hash is converted to integer to avoid overflow - converting md5 hex to integer might give a result bigger than int64. I've tested distribution of such method and it seems to be pretty balanced. The 1st column shows how many packages were assigned to the bucket.

$ go list ./... | gotestsum tool ci-matrix --timing-files="path_to_files/*.json" --partitions 16 --debug 2>debug.log
$ grep "bucket \d+" -Eo < debug.log  | sort | uniq -c
  76 bucket 0
  64 bucket 1
  70 bucket 10
  78 bucket 11
  75 bucket 12
  78 bucket 13
  55 bucket 14
  60 bucket 15
  59 bucket 2
  71 bucket 3
  59 bucket 4
  67 bucket 5
  78 bucket 6
  65 bucket 7
  67 bucket 8
  74 bucket 9

This approach can be used when sharding strategy is not updated on each test run but is reused for extended period (e.g. week) because new tests will be distributed among shards.

Currently all new packages which are not in timing files are assigned to
a single bucket with shortest duration. This happened because duration
of "unknown" packages is set to 0 but in reality it might be longer.
Also empty packages with no tests have 0s duration but there is little
overhead to run those tests.

To avoid all such packages adding into a single bucket, change how those
are distributed between shards. If package run duration is lower than
threshold - assign it by using hash function instead of minimal bucket.
@Pawka
Copy link
Contributor Author

Pawka commented Jun 16, 2023

P.S. We might want to make this feature opt-in by providing threshold via flag.

@dnephin
Copy link
Member

dnephin commented Jun 17, 2023

Thank you for the PR! This would be a great thing to improve. It has been a little while since I've looked at this code. I believe the only special case it was handling previously was when no packages had any timing data. In that case it was using round robin to distribute the packages.

Generalizing it so that all packages with 0 timing data are round robin distributed seems like a great improvement. A couple questions for you about this approach. What's the reason for using a threshold of 5ms instead of using 0? Is it common for your packages to start out with very little test?

What do you think about an approach like this:

  1. add a TotalZeroTime field to the bucket struct, which will count the number of packages assigned to that bucket that had 0 elapsed time
  2. When placing a package with 0 elapsed time, round robin between all the buckets by looking for the bucket with the smallest number of TotalZeroTime. That way new packages should be evenly distributed between the buckets, starting with the bucket with the least amount of time.
  3. As a special case we could avoid the largest bucket when playing packages with 0 elapsed time. Since the largest bucket is likely going to be the bottleneck, we don't want to accidentally assign a new package to that bucket.

@Pawka
Copy link
Contributor Author

Pawka commented Jun 19, 2023

Thanks for reply!

What's the reason for using a threshold of 5ms instead of using 0?

It is just arbitrary number which I've picked to treat a test as negligible. Even 100 of 5ms tests won't increase duration of a bucket significantly so it might make sense to distribute those among buckets. But I feel we might want to make this value configurable via flag. What do you think?

As a special case we could avoid the largest bucket when playing packages with 0 elapsed time. <...>

100% to this. It is better to avoid adding additional tests to the "longest" bucket. I was thinking is it enough just to exclude one or some arbitrary part of buckets such as 20% but no less than 1?

Few comments on round robin vs consistent assigning since we need to agree which path to choose.

I see two ways how ci-matrix can be used:

  1. Run the command before each CI job depending on recent timing files from previous builds. In this way there is high probability packages would be distributed to different partitions each time. Golang build cache hit rate might not be high if agents are not long-living. This approach is fine with not very huge codebases.
  2. Generated ci-matrix distribution file is reused for extended period. Benefit of this approach is that same tests always will be assigned to the same shards what might help to achieve higher Golang cache hit rate.

While cache hit rate for 0s packages is not important but new packages, which are not in distribution file, are also treated as 0s. But in reality those will run longer. If the 2nd approach is used - it is better to keep consistency and assign specific package to the same bucket so build cache could be reused. To stay consistent round robin approach can not be used because a new package would shift all packages after it by one.

Let me know if you think this is something what you'd like to adopt and I can make changes to this PR.

@Pawka
Copy link
Contributor Author

Pawka commented Jun 21, 2023

After some investigation I've found that the best way is just to ignore packages without tests at all. Running go test for a package with no tests still takes time. In my experiment running go test for ~70 empty packages took ~30-120s (with no cache). This might become significant overhead for huge repositories.

@dnephin what do you think about excluding packages from matrix if Elapsed value is 0?

@dnephin
Copy link
Member

dnephin commented Jun 21, 2023

To stay consistent round robin approach can not be used because a new package would shift all packages after it by one.

One important detail, that may be not be obvious, is that the packages are sorted by elapsed time before attempting to place them into buckets. This means that packages with 0 elapsed time will always be last. I think that makes it safe to round-robin them, because only new packages or packages with no tests could come later. It will never impact existing packages with non-zero elapsed time.

ignore packages without tests at all

How do we know if the package has no tests, or if the package is new and has no previous runtime? I'm not sure if this is safe, because it sounds like it could accidentally skip tests in new packages.

In my experiment running go test for ~70 empty packages took ~30-120s

Do you think it is common for projects to have that many packages without tests? I've seen projects with a few small packages with no tests (maybe 10-20 packages), but generally the time to compile those packages is negligible compared to the time required to run all the other tests.

@Pawka
Copy link
Contributor Author

Pawka commented Jun 21, 2023

Yeah, I agree with you - standard project probably won't have much packages without tests.

On a flip side empty packages can be excluded without any changes to gotestsum - by using go list -json ./... | jq 'select(.TestGoFiles | length > 0) | .ImportPath' -r instead of simple go list ./.... So probably it is not worth to extend gotestsum here. Sorry for jumping back and forward :)

I'll update this PR in near time to support round robin approach.

Comment on lines +249 to +255

func consistentBucket(pkg string, n uint) int {
h := md5.New()
io.WriteString(h, pkg)
hash := fmt.Sprintf("%x", h.Sum(nil))
decimal, _ := strconv.ParseUint(hash[:10], 16, 64)
return int(decimal) % int(n)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use maphash.Hash{} ?

Suggested change
func consistentBucket(pkg string, n uint) int {
h := md5.New()
io.WriteString(h, pkg)
hash := fmt.Sprintf("%x", h.Sum(nil))
decimal, _ := strconv.ParseUint(hash[:10], 16, 64)
return int(decimal) % int(n)
var hashSeed = maphash.MakeSeed()
func consistentBucket(pkg string, n uint) int {
return int(maphash.String(hashSeed, str)) % int(n)

Maphash was designed specifically for this case.

Props:

  1. fastest hash from standart library.
  2. zero allocation.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants