Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional HLL for distinct value count to StatisticsBuilder #11928

Closed
wants to merge 1 commit into from

Conversation

oerling
Copy link
Contributor

@oerling oerling commented Dec 21, 2024

Adds an optional HLL counter for distinct values to StatisticsBuilder. This is used in Verax sampling to estimate column cardinalities for scalar types.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 21, 2024
Copy link

netlify bot commented Dec 21, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 94eb25f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/676dd625c603c800086a519f

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oerling overall looks good % nits. Thanks!

velox/dwio/common/Statistics.h Show resolved Hide resolved
/// 'disableNumDistinct' is true, distinct values will not be counted.
explicit StatisticsBuilder(
const StatisticsBuilderOptions& options,
bool disableNumDistinct = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we want to disable num distinct but options set it? And I don't see it has been used. Thanks!

}

int64_t cardinality() {
VELOX_CHECK(hll_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_CHECK_NOTNULL

}
}

int64_t cardinality() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const method?

@@ -93,6 +93,13 @@ void StatisticsBuilder::merge(
// Merge size
mergeCount(size_, other.getSize());
}
if (hll_) {
if (auto* otherBuilder = dynamic_cast<const StatisticsBuilder*>(&other)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add VELOX NOT NULL check here? We should expect all other statistic builder has the same type?

@@ -93,6 +93,13 @@ void StatisticsBuilder::merge(
// Merge size
mergeCount(size_, other.getSize());
}
if (hll_) {
if (auto* otherBuilder = dynamic_cast<const StatisticsBuilder*>(&other)) {
if (otherBuilder->hll_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly shall we check the other build has set hll?

velox/dwio/dwrf/writer/StatisticsBuilder.cpp Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

velox/dwio/dwrf/writer/StatisticsBuilder.h Show resolved Hide resolved
*/
class StatisticsBuilder : public virtual dwio::common::ColumnStatistics {
public:
/// Constructs with 'options'. If 'options' enable count distinct and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment here.

@@ -409,7 +444,7 @@ class MapStatisticsBuilder : public StatisticsBuilder,
MapStatisticsBuilder(
const Type& type,
const StatisticsBuilderOptions& options)
: StatisticsBuilder{options},
: StatisticsBuilder{options.withoutNumDistinct()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need to disable here?

Adds an optional HLL counter for distinct values to
StatisticsBuilder. This is used in Verax sampling to estimate column
cardinalities for scalar types.
@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@oerling merged this pull request in 0a4c7d5.

athmaja-n pushed a commit to athmaja-n/velox that referenced this pull request Jan 10, 2025
…ookincubator#11928)

Summary:
Adds an optional HLL counter for distinct values to StatisticsBuilder. This is used in Verax sampling to estimate column cardinalities for scalar types.

Pull Request resolved: facebookincubator#11928

Reviewed By: xiaoxmeng

Differential Revision: D67565289

Pulled By: oerling

fbshipit-source-id: e5cbe96059a932377a3d977bca816be6ec52e038
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants