# When benchmarking, we want the ability to download the segments in
# parallel. By default, data-diff will checksum the section first (when
# it's below the threshold) and _then_ download it.
if BENCHMARK:
FWIW I don't think it would be crazy to make this the default...
I ran the benchmark with the default bisection threshold, and checksumming / downloading with the default threshold and ~10k rows is extremely competitive. And the current approach is twice as slow when there is a difference.
(force-push: 23614f7 to c5c2a9d)
When running the tests I get:
mysql.connector.errors.ProgrammingError: 1059 (42000): Identifier name 'idx_dst_postgresql_double_precision_to_mysql_numeric65_10_50_id_col' is too long
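Error 1059 is MySQL rejecting an identifier longer than its 64-character limit. One common workaround, sketched below, is to truncate over-long index names and append a short hash so distinct long names stay distinct. The helper name `safe_index_name` is hypothetical, not necessarily what this PR does.

```python
import hashlib

MAX_IDENTIFIER_LEN = 64  # MySQL's hard limit on identifier names

def safe_index_name(name: str) -> str:
    """Truncate an over-long identifier, appending a short hash of the
    full name so that two different long names do not collide."""
    if len(name) <= MAX_IDENTIFIER_LEN:
        return name
    digest = hashlib.md5(name.encode()).hexdigest()[:8]
    # Reserve 9 characters for "_" plus the 8-character digest.
    return name[: MAX_IDENTIFIER_LEN - 9] + "_" + digest
```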
data_diff/diff_tables.py (Outdated)
@@ -403,6 +404,15 @@ def _diff_tables(self, table1, table2, level=0, segment_index=None, segment_coun
    f"size: {table2.max_key-table1.min_key}"
)
# When benchmarking, we want the ability to download the segments in |
I don't understand the word "parallel" in this context.
Wouldn't it be more correct to say "we want the ability to skip the checksum, and force the download of all segments" ?
Sure, the parallel/concurrency wording is redundant. I think you're right to make checksum-skipping the prominent thing in this comment.
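The behavior under discussion could be sketched as follows. Names and the threshold value here are illustrative; in data-diff the checksum is actually computed database-side via SQL, and `BISECTION_THRESHOLD` stands in for the real bisection-threshold setting.

```python
import zlib

BENCHMARK = False          # when True: skip the checksum, force the download
BISECTION_THRESHOLD = 16   # illustrative segment-size cutoff, not the real default

def checksum(rows):
    # Stand-in for the database-side checksum data-diff actually computes.
    return zlib.crc32(repr(sorted(rows)).encode())

def diff_segment(rows1, rows2):
    small_enough = max(len(rows1), len(rows2)) <= BISECTION_THRESHOLD
    if not BENCHMARK and small_enough:
        if checksum(rows1) == checksum(rows2):
            return []      # checksums match: nothing to download
    # Benchmark mode (or a large/differing segment): download and compare rows.
    return sorted(set(rows1) ^ set(rows2))
```

This also shows why the current approach is slower when there *is* a difference: the checksum round-trip is pure overhead before the download happens anyway.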
tests/test_database_types.py (Outdated)
assert BENCHMARK, "Table should've been deleted, or we should be in BENCHMARK mode"
return
elif current_n_rows > 0:
logging.error(f"{conn.name}: Table {table} has {current_n_rows} rows, expected {N_SAMPLES} or 0. Recreating table")
Maybe this should be a warning?
The default logging level is ERROR to keep the test output quieter (the WARNING level includes all the precision warnings, etc., that make the test output noisy). Of course, we could set this differently per logger, but I think that just adds more confusion.
Maybe I'm missing something, but it should be simple to set the test logger to warning, and the data-diff loggers to error, no?
Yeah, that was what I was getting at in my last sentence.. I will just remove it then
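The per-logger setup being suggested is a few lines of standard `logging` configuration. A sketch, with the logger names assumed rather than taken from the codebase:

```python
import logging

# Tests log at WARNING and above; the noisier data-diff loggers stay at ERROR,
# so precision warnings from the library are suppressed in test output.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("data_diff").setLevel(logging.ERROR)

test_log = logging.getLogger("tests.test_database_types")
test_log.warning("row count mismatch; recreating table")        # emitted
logging.getLogger("data_diff").warning("precision truncated")   # suppressed
```

Child loggers such as `data_diff.diff_tables` inherit the ERROR level through the logger hierarchy, so one `setLevel` call covers the whole library.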
Ugh, not sure how I missed that! Fixed with another force-push (c5c2a9d to ae8599a).
@erezsh tried to resolve your comments! Let me know if it's good to go now :)
(force-push: ae8599a to 90de3df)
Good to merge?
(force-push: 90de3df to 36fd574)
Cleanup of the resolved merge conflict from #67, based on all the changes in master. Trying to split it up into chunks to make it easier to review (and to do an extra pass on it all).
The idea is to dual-purpose a subset of the type test-suite for benchmarking, to ensure it doesn't rust. In particular, to follow up from #106, the changes are to:
- In BENCHMARK=1 mode, we do not want to drop tables, to prevent waiting for millions of rows to be inserted again. Not even across revisions, if we're testing the performance change between two revisions.

There are plenty of follow-ups to this, likely not all of which I will be able to complete:

- benchmark_*.jsonl
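The table-reuse rule from the bullet above mirrors the check in the diff: a fully populated table is kept (in benchmark mode), an empty one is populated, and a partially populated one is recreated. A sketch with hypothetical names (`table_action`, the string return values, and the sample count are illustrative):

```python
import logging

N_SAMPLES = 1_000_000  # illustrative target row count per test table

def table_action(current_n_rows: int, benchmark: bool) -> str:
    """Decide what to do with a pre-existing test table."""
    if current_n_rows == N_SAMPLES:
        # Fully populated: only benchmark mode is allowed to reuse it,
        # even across revisions, to avoid re-inserting millions of rows.
        assert benchmark, "Table should've been deleted, or we should be in BENCHMARK mode"
        return "reuse"
    if current_n_rows == 0:
        return "populate"  # empty table: just insert the samples
    # Partially populated: unusable for a fair comparison, so recreate it.
    logging.error(f"table has {current_n_rows} rows, expected {N_SAMPLES} or 0; recreating")
    return "recreate"
```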
@erezsh
cc @abhinav1912 due to questions in #103