This repository has been archived by the owner on May 17, 2024. It is now read-only.

benchmark: add suite #125

Merged: 1 commit into master on Jun 29, 2022

Conversation

@sirupsen (Contributor) commented Jun 28, 2022

Cleanup of the resolved merge conflict from #67, based on all the changes in master. I'm trying to split it up into chunks to make it easier to review (and to do an extra pass on it all).

The idea is to dual-purpose a subset of the type test suite for benchmarking, to ensure it doesn't rust. In particular, as a follow-up to #106, the changes are to:

  • In BENCHMARK=1 mode, do not drop tables, to avoid waiting for millions of rows to be inserted again; not even across revisions when we're testing the performance change between two revisions
  • Create indexes, to have good performance
  • Write benchmark results to stdout and JSONL file(s) for graphing/analysis after
  • Insert in larger batches for performance (follow-up on CSV)
  • When benchmarking, we want to download segments in parallel for fairness
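The BENCHMARK-mode result logging described above (results to stdout plus `benchmark_*.jsonl` files) could be sketched roughly like this. This is a hypothetical illustration, not the PR's actual code; the env-var name, function name, and record fields are assumptions:

```python
import json
import os
import time

# Assumed flag: the PR description mentions a BENCHMARK=1 mode.
BENCHMARK = os.environ.get("BENCHMARK") == "1"

def run_benchmark(name, func, results_path="benchmark_default.jsonl"):
    """Time one benchmark case, print it, and append a JSONL record."""
    start = time.monotonic()
    func()
    elapsed = time.monotonic() - start
    record = {"name": name, "seconds": round(elapsed, 4)}
    print(json.dumps(record))            # results go to stdout...
    with open(results_path, "a") as f:   # ...and to a benchmark_*.jsonl file
        f.write(json.dumps(record) + "\n")
    return record
```

One JSON object per line keeps the files append-friendly across runs and easy to load later for graphing (e.g. with `pandas.read_json(..., lines=True)`).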

There are plenty of follow-ups to this, likely not all of which I will be able to complete:

  • Enable threading > 1 in the test suite, which likely requires some connection-pooling logic changes I'd rather do in parallel
  • Do CSV insertion for the Cloud databases, to make it much faster to reproduce tests for 100M rows (currently takes hours)
  • Add steps on how to reproduce the graph in the README based on benchmark_*.jsonl
  • Add benchmarks with an increasing % of changed/deleted rows

@erezsh

cc @abhinav1912 due to questions in #103

# When benchmarking, we want the ability to download the segments in
# parallel. By default, data-diff will checksum the section first (when
# it's below the threshold) and _then_ download it.
if BENCHMARK:

sirupsen (Contributor Author):

FWIW I don't think it would be crazy to make this the default...

I ran the benchmark with the default bisection threshold, and checksumming/downloading with the default threshold (~10k rows) is extremely competitive. And the current approach is twice as slow when there is a difference.


@erezsh (Contributor) left a comment:


When running the tests I get

mysql.connector.errors.ProgrammingError: 1059 (42000): Identifier name 'idx_dst_postgresql_double_precision_to_mysql_numeric65_10_50_id_col' is too long

@@ -403,6 +404,15 @@ def _diff_tables(self, table1, table2, level=0, segment_index=None, segment_coun
f"size: {table2.max_key-table1.min_key}"
)

# When benchmarking, we want the ability to download the segments in

erezsh (Contributor):


I don't understand the word "parallel" in this context.

Wouldn't it be more correct to say "we want the ability to skip the checksum, and force the download of all segments" ?


@sirupsen (Contributor Author) commented Jun 29, 2022:


Sure, the parallel/concurrency wording is redundant. I think you're right that checksum-skipping should be the prominent thing in this comment.

assert BENCHMARK, "Table should've been deleted, or we should be in BENCHMARK mode"
return
elif current_n_rows > 0:
logging.error(f"{conn.name}: Table {table} has {current_n_rows}, expected {N_SAMPLES} or 0. Recreating table")

erezsh (Contributor):


Maybe this should be a warning?


@sirupsen (Contributor Author) commented Jun 29, 2022:


The default logging level is ERROR to keep the test output quiet (the WARNING level includes all the precision warnings, etc., which make the test output noisy). Of course, we could set this differently per logger, but I think that just adds more confusion.


erezsh (Contributor):


Maybe I'm missing something, but it should be simple to set the test logger to warning, and the data-diff loggers to error, no?
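For what it's worth, the per-logger setup suggested here is a couple of stdlib calls. The logger names below ("data_diff", "tests") are assumptions for illustration:

```python
import logging

# Test output stays at WARNING via the root logger, while the (assumed)
# "data_diff" package logger is raised to ERROR so its precision warnings
# don't drown out the test output.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("data_diff").setLevel(logging.ERROR)

test_logger = logging.getLogger("tests")  # inherits WARNING from the root
```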


@sirupsen (Contributor Author) commented Jun 29, 2022:


Yeah, that's what I was getting at in my last sentence. I will just remove it then.

@sirupsen (Contributor Author):

mysql.connector.errors.ProgrammingError: 1059 (42000): Identifier name 'idx_dst_postgresql_double_precision_to_mysql_numeric65_10_50_id_col' is too long

Ugh, not sure how I missed that! Fixed with another .replace
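MySQL caps identifier names, including index names, at 64 characters, which is what error 1059 (ER_TOO_LONG_IDENT) reports. A minimal sketch of the kind of `.replace`-based shortening such a fix might use; the exact substitutions here are assumptions, not the PR's actual code:

```python
# MySQL's identifier length limit (error 1059, ER_TOO_LONG_IDENT).
MAX_IDENTIFIER_LEN = 64

def shorten_index_name(name: str) -> str:
    """Abbreviate verbose type names, then hard-truncate as a last resort."""
    name = name.replace("double_precision", "dp").replace("numeric", "num")
    return name[:MAX_IDENTIFIER_LEN]
```

The final slice guarantees the limit is respected even if a future type name slips past the abbreviation step.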

@sirupsen (Contributor Author):

@erezsh tried to resolve your comments! Let me know if it's good to go now :)

@sirupsen (Contributor Author):

Good to merge?

@sirupsen sirupsen merged commit 16d6c7c into master Jun 29, 2022
@sirupsen sirupsen deleted the benchmark-suite2 branch June 29, 2022 13:18