Add benchmarks to CI #130

Open
SemyonSinchenko opened this issue Aug 4, 2024 · 10 comments

@SemyonSinchenko
Collaborator

It would be very cool to have some performance benchmarks in CI.
The benchmarks mentioned in the README are a nice place to start:
https://github.com/mrpowers-io/spark-fast-tests?tab=readme-ov-file#why-is-this-library-fast

@zeotuan
Collaborator

zeotuan commented Aug 7, 2024

@SemyonSinchenko
Collaborator Author

Looks cool, thank you! But we definitely should have it as a subproject.

@zeotuan
Collaborator

zeotuan commented Sep 3, 2024

@SemyonSinchenko I added a draft with an initial project structure for the Benchmarks module, as well as an additional benchmark step in CI.
Things that still need additional work:

  • Test data generation + parameterized benchmarks (a rough sketch follows below)
  • Investigating the most appropriate and important metrics for the project
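
A rough sketch of what such a parameterized benchmark could look like, assuming sbt-jmh is wired into the Benchmarks subproject; the package name, row counts, and method bodies are placeholders rather than the actual implementation:

```scala
// Hypothetical sketch: the package name, @Param values and the stubbed bodies
// are placeholders; real test data generation is still an open item above.
package benchmarks

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
class ColumnComparerBenchmark {

  // Parameterize the data size so one benchmark covers several DataFrame shapes.
  @Param(Array("200", "2000", "20000"))
  var rows: Int = _

  @Setup(Level.Trial)
  def generateTestData(): Unit = {
    // build a DataFrame with `rows` rows of a simple data type here
  }

  // Measure both average time (warmed up) and single-shot (cold) time.
  @Benchmark
  @BenchmarkMode(Array(Mode.AverageTime, Mode.SingleShotTime))
  def assertColumnEqualityBenchmark(): Unit = {
    // call assertColumnEquality on the generated DataFrame here
  }
}
```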

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

[info] ColumnComparerBenchmark.assertColumnEqualityBenchmarks                      avgt   20    5361225.521 ±    20905.951  ns/op
[info] DataFrameComparerBenchmark.assertApproximateDataFrameEqualityWithPrecision  avgt   20   64024117.133 ±   768249.992  ns/op
[info] ColumnComparerBenchmark.assertColumnEqualityBenchmarks                        ss   20   32886862.500 ±  3606312.983  ns/op
[info] DataFrameComparerBenchmark.assertApproximateDataFrameEqualityWithPrecision    ss   20  155991154.500 ± 17649852.089  ns/op

Surprisingly, our comparers are performing better than what is shown in the README:

  • assertColumnEquality takes about 5 ms / 32 ms (warmed up vs. single shot) per run for 200 rows of a simple data type
  • assertApproximateDataFrameEquality takes about 64 ms / 156 ms (warmed up vs. single shot) per run for 100 rows and 3 columns (though typical tests will probably run on DataFrames that are much wider)

I will temporarily disable the average-time benchmark mode until we can come up with a better plan for running it.
We will probably take a hybrid approach in the future, where the single-shot run, which only takes a few seconds, happens on push and the whole suite runs on merge, or maybe in the release flow as @alfonsorr suggested.

@SemyonSinchenko
Collaborator Author

@zeotuan Can we save the results somehow? For example, could we create another branch and force-push to it from CI, with the results as MD files or even plain text?

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

I can make JMH store the results in JSON (or CSV and some other formats) and have a GitHub Action upload the artifact, which we can then download in the same or a different workflow to update the MD file.
Should we use a GitHub Action to auto-commit? Maybe auto-raising a PR to update the MD file is a safer option.
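
For reference, a minimal sketch of driving JMH programmatically so the results land in a JSON file the workflow can upload as an artifact; the include pattern and output path are placeholders, and the same can be done with JMH command-line options instead:

```scala
// Sketch only: the include pattern and output path are placeholders.
import org.openjdk.jmh.results.format.ResultFormatType
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.options.{Options, OptionsBuilder}

object RunBenchmarks {
  def main(args: Array[String]): Unit = {
    val opts: Options = new OptionsBuilder()
      .include(".*Comparer.*")              // which benchmark classes to run
      .resultFormat(ResultFormatType.JSON)  // JMH also supports CSV, SCSV, TEXT, LATEX
      .result("benchmark-results.json")     // file the CI job uploads as an artifact
      .build()

    new Runner(opts).run()
  }
}
```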

@SemyonSinchenko
Collaborator Author

I can make JMH store the results in JSON (or CSV and some other formats) and have a GitHub Action upload the artifact, which we can then download in the same or a different workflow to update the MD file.
Should we use a GitHub Action to auto-commit? Maybe auto-raising a PR to update the MD file is a safer option.

My idea was to make a simple Scala script that parses the results and uses them with a templating tool to fill in a Markdown file. The file would then become part of the existing static website.
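
As a sketch of what that script could look like (assuming the JMH JSON output from above and the ujson library; the file name and table layout are placeholders):

```scala
// Sketch only: assumes ujson (com.lihaoyi::ujson) on the classpath and a JMH
// result file produced with the JSON result format; names are placeholders.
import scala.io.Source

object BenchmarkReport {
  def main(args: Array[String]): Unit = {
    val raw  = Source.fromFile("benchmark-results.json").mkString
    val runs = ujson.read(raw).arr

    // JMH JSON output is an array of runs, each with a benchmark name,
    // a mode, and a primaryMetric holding the score, error and unit.
    val rows = runs.map { run =>
      val name   = run("benchmark").str.split('.').takeRight(2).mkString(".")
      val metric = run("primaryMetric")
      f"| $name | ${run("mode").str} | ${metric("score").num}%.2f ± ${metric("scoreError").num}%.2f ${metric("scoreUnit").str} |"
    }

    val table = (
      "| Benchmark | Mode | Score |" ::
      "| --- | --- | --- |" ::
      rows.toList
    ).mkString("\n")

    // The templating tool would drop this table into the Markdown page.
    println(table)
  }
}
```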

@zeotuan
Collaborator

zeotuan commented Sep 18, 2024

Sure, I think either way it's a two-step process:

  • have a separate workflow that runs all the benchmarks and publishes the result artifact (this can be accessed by the same or a different workflow, and even via the REST API)
  • download all the results and run our Scala tool to parse and aggregate them and produce input for the templating tool

@SemyonSinchenko
Collaborator Author

Sure, I think either way it's a two-step process:

  • have a separate workflow that runs all the benchmarks and publishes the result artifact (this can be accessed by the same or a different workflow, and even via the REST API)
  • download all the results and run our Scala tool to parse and aggregate them and produce input for the templating tool

I just thought that maybe we could do it even more simply: run such a script in the docs CI and fill in the template, so for the docs CI it would just be another MD file...

@zeotuan
Collaborator

zeotuan commented Sep 19, 2024

I see, so there is no need to store intermediate results. I have no objection.
