fix(cmSketch test) and add better cmSketch test #325

FrankReh · 2022-12-23T12:59:43Z

One Sketch test was intended to check that the seeds had caused each sketch row to be unique, but because of the way the test iterated, the failure could never be triggered.

With this fix, the test passes, which seems to indicate that the seeds are
working as intended on row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the counters of each row to ensure that no row is essentially a copy of another row, differing only in which positions the counters occupy.

The new Sketch test is run twice: once with high-entropy hashes, which come from a PRNG, and once with low-entropy hashes, which in many cases will be the normal input. Low-entropy hashes are all small numbers, good for indexing but not good for getting a spread when simply applied to a bitwise XOR operation.

These new tests show a problem with the counter increment logic
within the cmSketch.Increment method which was most likely
introduced by commit f823dc4a.

A subsequent commit addresses the problems surfaced. But as the
discussion from issue #108 shows (discussion later moved to
https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
other ideas for incrementing the counters were considered
by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can commit a fix to the cmSketch increment method to get the new test to pass - if a maintainer agrees there is a problem to be fixed. See #108. I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each sketch row to be unique, but because of the way the test iterated, the failure could never be triggered.

With this fix, the test passes, which seems to indicate that the seeds are
working as intended on row creation. But there is a problem with
the actual increment method. A new Sketch test is added that goes
further and sorts the counters of each row to ensure that no row
is essentially a copy of another row, differing only in which
positions the counters occupy.

The new Sketch test is run twice: once with high-entropy hashes, which come from a PRNG, and once with low-entropy hashes, which in many cases will be the normal input. Low-entropy hashes are all small numbers, good for indexing but not good for getting a spread when simply applied with a bitwise XOR operation.

These new tests show a problem with the counter increment logic
within the cmSketch.Increment method which was most likely
introduced by commit f823dc4a.

A subsequent commit addresses the problems surfaced. But as the
discussion from issue #108 shows (discussion later moved to
https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
other ideas for incrementing the counters were considered by
previous authors of the original Java code as well.

CLAassistant · 2022-12-23T12:59:48Z

All committers have signed the CLA.

github-actions · 2024-07-18T14:03:31Z

This PR has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

FrankReh · 2024-07-20T02:43:27Z

It's too bad things like this aren't being fixed. I've seen the logic error copied as part of other language ports.

github-actions · 2024-09-19T02:21:44Z

This PR has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

FrankReh · 2024-09-20T13:22:48Z

Anybody?

FrankReh · 2024-10-16T01:43:45Z

I hope it is clear to anyone reviewing that this test is failing because of the logic bug I pointed out years ago. I had introduced a separate PR to fix that. This PR just highlights the problem more clearly for someone new.

github-actions · 2024-12-15T02:29:37Z

This PR has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

FrankReh · 2024-12-16T20:26:43Z

Would like to keep this open until a maintainer can address it or say they don't care. Thanks!

mangalaman93 · 2024-12-17T06:40:15Z

@FrankReh Thank you for your PR. I understand this may be frustrating, but we appreciate the change and will look at it as soon as we can. It is on our list, and we need to spend some time reviewing it. Let me see if we can prioritize it sooner.

github-actions bot added the Stale label Jul 18, 2024

github-actions bot removed the Stale label Jul 20, 2024

github-actions bot added the Stale label Sep 19, 2024

github-actions bot removed the Stale label Sep 21, 2024

mangalaman93 force-pushed the frankreh/sketch-effectiveness-tests2 branch from 98bb8ef to 994e81b October 14, 2024 11:07

mangalaman93 self-assigned this Oct 14, 2024

mangalaman93 marked this pull request as ready for review October 14, 2024 11:08

mangalaman93 requested a review from a team as a code owner October 14, 2024 11:08

github-actions bot added the Stale label Dec 15, 2024

github-actions bot removed the Stale label Dec 17, 2024

mangalaman93 force-pushed the frankreh/sketch-effectiveness-tests2 branch from 994e81b to bca7b61 December 17, 2024 06:42
