The execution time of micro benchmark is not consistent #868

Closed
helloguo opened this issue Sep 5, 2018 · 9 comments
Comments

@helloguo

helloguo commented Sep 5, 2018

I have a micro benchmark that looks like this:

        [Benchmark]
        public void ScaleUPerf0() => CpuMath.Scale(DEFAULT_SCALE, dst, LEN);

        [Benchmark]
        public void ScaleUPerf1() => CpuMath.Scale(DEFAULT_SCALE, dst, LEN);

        [Benchmark]
        public void ScaleUPerf2() => CpuMath.Scale(DEFAULT_SCALE, dst, LEN);

ScaleUPerf0, ScaleUPerf1 and ScaleUPerf2 all test exactly the same function, so I expect their execution times to be similar. However, when I run the benchmark, the perf data is quite different.

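| Method      | LEN   | Mean     | Error     | StdDev    | Median   |
|-------------|-------|----------|-----------|-----------|----------|
| ScaleUPerf0 | 65537 | 5.446 us | 0.0062 us | 0.0058 us | 5.448 us |
| ScaleUPerf1 | 65537 | 7.131 us | 0.0055 us | 0.0046 us | 7.132 us |
| ScaleUPerf2 | 65537 | 5.446 us | 0.0081 us | 0.0072 us | 5.447 us |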

What could be the reason? The whole micro benchmark can be found here: https://github.com/helloguo/tmp-code/tree/master/bench

@EgorBo
Member

EgorBo commented Sep 5, 2018

@helloguo Could it be affected by AVX frequency throttling?

@Tornhoof
Contributor

Tornhoof commented Sep 5, 2018

This might be a problem with the inprocess toolchain. Try the following config, which creates a netcoreapp3.0 toolchain from fairly recent daily bits.

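private static IConfig CreateClrVsCoreConfig()
{
    // Build a netcoreapp3.0 toolchain from specific daily CoreCLR/CoreFX NuGet packages.
    var toolchain = CustomCoreClrToolchainBuilder.Create()
        .UseCoreClrNuGet("3.0.0-preview1-26905-04")
        .UseCoreFxNuGet("3.0.0-preview1-26904-01")
        .TargetFrameworkMoniker("netcoreapp3.0")
        .ToToolchain();

    // Run the default job once, in a separate process built with that toolchain.
    return DefaultConfig.Instance.With(Job.Default.With(toolchain).WithLaunchCount(1));
}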

Edit: With your repro I saw the same behaviour a few times (not always) with the inprocess toolchain, but never with the custom one.

@helloguo
Author

helloguo commented Sep 5, 2018

@EgorBo Thank you for your input. I did not see the frequency change that much when I profiled the benchmark.

@Tornhoof Thank you for your suggestion. Yes, CustomCoreClrToolchainBuilder makes the results more consistent. I guess I should use CustomCoreClrToolchainBuilder instead of InProcessToolchain. It would be interesting to know why the variance happens with InProcessToolchain. Any idea how to find the root cause?

@Tornhoof
Contributor

Tornhoof commented Sep 6, 2018

@helloguo As you use netcoreapp3.0, I would first rule out the tiered JIT by setting SET COMPlus_TieredCompilation=0 (no spaces around the equals sign) and running it again. I think I saw an issue over at coreclr saying that the HW intrinsics behave slightly differently with tiered compilation. If that does not change anything, I have no idea. Maybe @adamsitnik knows a good way to debug that.

@adamsitnik
Member

Hi @helloguo

In this case, I would expect it to be an alignment issue. This great blog post from @Metalnem explains a similar issue.

@helloguo
Author

helloguo commented Sep 6, 2018

Thank you for your suggestion. Judging from that blog post, alignment (cache-line splits) seems to make the difference. But if we use CustomCoreClrToolchainBuilder, we do not see much variance. In that case, are we probably just getting lucky that the arrays are 32-byte aligned most of the time?

Ideally, we could specify whether the array is 32-byte aligned or not. That way, I could test the function against both aligned and non-aligned arrays. Unfortunately, I'm not aware of any way to do that.

Is there a way to reduce the impact of the tested array's alignment? Maybe measure it multiple times and take the median?
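
One way to at least see which case a given run hit is to log the array's misalignment once during setup. The following is only a minimal sketch: AlignmentProbe and MisalignmentInBytes are hypothetical names, not part of BenchmarkDotNet or CpuMath, and the address is only guaranteed to be stable while the array is pinned. Arrays as large as the one in this issue normally live on the large object heap, which is not compacted by default, so the value will likely stay valid afterwards, but that is an assumption rather than a guarantee.

```csharp
using System;
using System.Runtime.InteropServices;

public static class AlignmentProbe
{
    // Hypothetical helper: returns how many bytes the first element of the array
    // lies past the previous `alignment`-byte boundary (0 means it is aligned).
    public static int MisalignmentInBytes(float[] array, int alignment = 32)
    {
        // Pin the array so the GC cannot move it while its address is read.
        GCHandle handle = GCHandle.Alloc(array, GCHandleType.Pinned);
        try
        {
            long address = handle.AddrOfPinnedObject().ToInt64();
            return (int)(address % alignment);
        }
        finally
        {
            handle.Free();
        }
    }
}
```

Calling something like Console.WriteLine(AlignmentProbe.MisalignmentInBytes(dst)) from the benchmark's setup method would show whether a slow run coincides with a non-zero value.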

@adamsitnik
Member

> we just get lucky that arrays are 32 bytes aligned for most of the time?

I guess so.

We had a very long discussion about this in #756

What I think you could do:

  1. Instead of allocating an n-element array (dst = new float[LEN];), allocate a bigger array with room for alignment padding, like dst = new float[LEN + 32];
  2. In your benchmark, don't start from float* pd = pd0; but from the first 32-byte-aligned element of the array (sth like float* pd = (float*)(((ulong)pd0 + 31) & ~31UL);)
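
The offset computation in step 2 can be sketched without pointer arithmetic as below. This is an illustration under stated assumptions, not code from the benchmark: AlignedStart and ElementsToSkip are hypothetical names, and the address is assumed to be a multiple of sizeof(float), which holds for the data of a float array.

```csharp
using System;

public static class AlignedStart
{
    // Hypothetical helper: number of float elements to skip so that
    // address + skip * sizeof(float) is a multiple of `alignment` bytes.
    // Assumes `address` is already a multiple of sizeof(float).
    public static int ElementsToSkip(long address, int alignment = 32)
    {
        long misalignment = address % alignment;
        // The outer modulo keeps the result at 0 when the address is already aligned.
        long bytesToSkip = (alignment - misalignment) % alignment;
        return (int)(bytesToSkip / sizeof(float));
    }
}
```

Inside the existing fixed block this would be used as float* pd = pd0 + AlignedStart.ElementsToSkip((long)pd0);. Because the skip is at most 7 floats for 32-byte alignment, the LEN + 32 padding from step 1 is more than sufficient.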

@helloguo
Author

helloguo commented Sep 6, 2018

Thank you. I will close this issue if there are no more concerns.

@helloguo helloguo closed this as completed Sep 6, 2018
@adamsitnik
Member

@helloguo I am glad I could help! Please let me know if it helps or not.
