Add benchmarks to CI #130

Open
SemyonSinchenko opened this issue Aug 4, 2024 · 10 comments

@SemyonSinchenko
Collaborator

It would be very cool to have some performance benchmarks in CI.
The benchmarks mentioned in the README are a nice place to start:
https://github.com/mrpowers-io/spark-fast-tests?tab=readme-ov-file#why-is-this-library-fast

@zeotuan
Collaborator

zeotuan commented Aug 7, 2024

@SemyonSinchenko
Collaborator Author

Looks cool, thank you! But we definitely should have it as a subproject.

@zeotuan
Collaborator

zeotuan commented Sep 3, 2024

@SemyonSinchenko I added a draft with an initial project structure for the Benchmarks module, as well as an additional benchmark step in CI.
Things that still need additional work:

  • Test data generation + parameterized benchmarks (a rough sketch follows below)
  • Investigating the most appropriate and important metrics for the project
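
A rough sketch of what such a parameterized benchmark could look like, assuming sbt-jmh is wired into the Benchmarks subproject; the package name, row counts, and method bodies are placeholders rather than the actual implementation:

```scala
// Hypothetical sketch: the package name, @Param values and the stubbed bodies
// are placeholders; real test data generation is still an open item above.
package benchmarks

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
class ColumnComparerBenchmark {

  // Parameterize the data size so one benchmark covers several DataFrame shapes.
  @Param(Array("200", "2000", "20000"))
  var rows: Int = _

  @Setup(Level.Trial)
  def generateTestData(): Unit = {
    // build a DataFrame with `rows` rows of a simple data type here
  }

  // Measure both average time (warmed up) and single-shot (cold) time.
  @Benchmark
  @BenchmarkMode(Array(Mode.AverageTime, Mode.SingleShotTime))
  def assertColumnEqualityBenchmark(): Unit = {
    // call assertColumnEquality on the generated DataFrame here
  }
}
```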

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

[info] ColumnComparerBenchmark.assertColumnEqualityBenchmarks                      avgt   20    5361225.521 ±    20905.951  ns/op
[info] DataFrameComparerBenchmark.assertApproximateDataFrameEqualityWithPrecision  avgt   20   64024117.133 ±   768249.992  ns/op
[info] ColumnComparerBenchmark.assertColumnEqualityBenchmarks                        ss   20   32886862.500 ±  3606312.983  ns/op
[info] DataFrameComparerBenchmark.assertApproximateDataFrameEqualityWithPrecision    ss   20  155991154.500 ± 17649852.089  ns/op

Surprisingly, our comparers are performing better than what is shown in the README:

  • assertColumnEquality takes about 5 ms / 32 ms (warmed up vs. single shot) per run for 200 rows of a simple data type
  • assertApproximateDataFrameEquality takes about 64 ms / 156 ms (warmed up vs. single shot) per run for 100 rows and 3 columns (though typical tests will probably run on DataFrames that are much wider)

I will temporarily disable the average-time benchmark mode until we can come up with a better plan for running it.
We will probably take a hybrid approach in the future, where the single-shot run, which only takes a few seconds, happens on push and the whole suite runs on merge, or maybe in the release flow as @alfonsorr suggested.

@SemyonSinchenko
Collaborator Author

@zeotuan Can we save the results somehow? For example, could we create another branch and force-push to it from CI, with the results as MD files or even plain text?

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

I can make JMH store the results in JSON (or CSV and some other formats) and have a GitHub Action upload the artifact, which we can then download in the same or a different workflow to update the MD file.
Should we use a GitHub Action to auto-commit? Maybe auto-raising a PR to update the MD file is a safer option.
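
For reference, a minimal sketch of driving JMH programmatically so the results land in a JSON file the workflow can upload as an artifact; the include pattern and output path are placeholders, and the same can be done with JMH command-line options instead:

```scala
// Sketch only: the include pattern and output path are placeholders.
import org.openjdk.jmh.results.format.ResultFormatType
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.options.{Options, OptionsBuilder}

object RunBenchmarks {
  def main(args: Array[String]): Unit = {
    val opts: Options = new OptionsBuilder()
      .include(".*Comparer.*")              // which benchmark classes to run
      .resultFormat(ResultFormatType.JSON)  // JMH also supports CSV, SCSV, TEXT, LATEX
      .result("benchmark-results.json")     // file the CI job uploads as an artifact
      .build()

    new Runner(opts).run()
  }
}
```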

@SemyonSinchenko
Collaborator Author

I can make JMH store the results in JSON (or CSV and some other formats) and have a GitHub Action upload the artifact, which we can then download in the same or a different workflow to update the MD file.
Should we use a GitHub Action to auto-commit? Maybe auto-raising a PR to update the MD file is a safer option.

My idea was to make a simple Scala script that parses the results and uses them with a templating tool to fill in a Markdown file. The file would then become part of the existing static website.
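
As a sketch of what that script could look like (assuming the JMH JSON output from above and the ujson library; the file name and table layout are placeholders):

```scala
// Sketch only: assumes ujson (com.lihaoyi::ujson) on the classpath and a JMH
// result file produced with the JSON result format; names are placeholders.
import scala.io.Source

object BenchmarkReport {
  def main(args: Array[String]): Unit = {
    val raw  = Source.fromFile("benchmark-results.json").mkString
    val runs = ujson.read(raw).arr

    // JMH JSON output is an array of runs, each with a benchmark name,
    // a mode, and a primaryMetric holding the score, error and unit.
    val rows = runs.map { run =>
      val name   = run("benchmark").str.split('.').takeRight(2).mkString(".")
      val metric = run("primaryMetric")
      f"| $name | ${run("mode").str} | ${metric("score").num}%.2f ± ${metric("scoreError").num}%.2f ${metric("scoreUnit").str} |"
    }

    val table = (
      "| Benchmark | Mode | Score |" ::
      "| --- | --- | --- |" ::
      rows.toList
    ).mkString("\n")

    // The templating tool would drop this table into the Markdown page.
    println(table)
  }
}
```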

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

Sure, I think either way it's a two-step process:

  • have a separate workflow that runs all the benchmarks and publishes the result artifact (this can be accessed by the same or a different workflow, and even via the REST API)
  • download all the results and run our Scala tool to parse and aggregate them and produce input for the templating tool

@SemyonSinchenko
Collaborator Author

Sure, I think either way it's a two-step process:

  • have a separate workflow that runs all the benchmarks and publishes the result artifact (this can be accessed by the same or a different workflow, and even via the REST API)
  • download all the results and run our Scala tool to parse and aggregate them and produce input for the templating tool

I just thought that maybe we could do it even more simply: run such a script in the docs CI and fill in the template, so for the docs CI it would just be another MD file...

@zeotuan
Collaborator

zeotuan commented Sep 19, 2024

I see, so there is no need to store intermediate results. I have no objection.
