Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Iceberg] Add Histogram Statistic Support #22365

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Mar 29, 2024

Description

This changes adds support for the histogram statistic type to Iceberg. There are 3 related commits

  1. Adds support for collecting a superset of supported statistics in the HMS and Iceberg. This is required to allow histograms to be supported while the HMS is in use.
  2. Adds the histogram implementation for Iceberg.
  3. Adds documentation for histograms statistics.

Motivation and Context

Histograms result in better accuracy for output row counts from filter nodes when calculating statistics. This should result in closer estimates in the CBO, resulting in better query plans.

Impact

  1. ANALYZE for Iceberg and Hive will now store statistics in the table's Puffin files.
  2. Writes new puffin blob type containing the histogram.

Test Plan

  • New tests to ensure histograms are collected for supported types
  • Tests that ensure the histograms result in more accurate plan stat estimates for non-uniform data

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add support for the histogram statistic type. :pr:`22365`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 58cfe30 to e1a2caa Compare April 17, 2024 13:37
@steveburnett
Copy link
Contributor

Consider this suggested change for the release note entry:

== RELEASE NOTES ==

Iceberg Connector Changes
* Add support for the histogram statistic type. :pr:`22365`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 6 times, most recently from bac6b3f to 0e3a87c Compare April 23, 2024 15:26
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the docs! A few nits suggested for clarity and style. Let me know what you think.

presto-docs/src/main/sphinx/admin/properties.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/properties.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/properties.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/optimizer/statistics.rst Outdated Show resolved Hide resolved
@ZacBlanco ZacBlanco marked this pull request as ready for review April 23, 2024 17:13
Copy link

github-actions bot commented Apr 23, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff f9f884d...ee5e3fd.

No notifications.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 2 times, most recently from 3d551e9 to 65b6fe8 Compare April 26, 2024 03:02
steveburnett
steveburnett previously approved these changes Apr 26, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local docs build, everything looks good. Thanks!

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a rough glance, and bring some questions for discussion. Will take a detailed look in the next one or two days.

presto-iceberg/pom.xml Outdated Show resolved Hide resolved
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 24398a5 to 56b79b0 Compare June 20, 2024 17:51
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated doc! Two nits about formatting, no issues with the content in any of the three doc files.

presto-docs/src/main/sphinx/admin/properties.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/optimizer/statistics.rst Outdated Show resolved Hide resolved
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 56b79b0 to 492ab5f Compare June 20, 2024 19:31
steveburnett
steveburnett previously approved these changes Jun 20, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Thanks for the quick fix! Pull updated branch, new local doc build, looks good.

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. The changes look good to me, only some nits.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from 492ab5f to 3fccf39 Compare July 2, 2024 15:57
steveburnett
steveburnett previously approved these changes Jul 2, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)
Pull updated branch, new local docs build, everything looks good. Thanks!

steveburnett
steveburnett previously approved these changes Sep 6, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)
Pull updated branch, new local docs build, everything looks good. Thanks!

The set of ColumnStatisticsMetadata defined by the Hive
and non-Hive connectors are not equivalent. However, it is possible
to collect the superset of the relevant metadata and use it for ANALYZE.
The returned statistics just need to be filtered out to contain only
the relevant column statistics.

This may include duplicate calculations for some statistics. For
example, with distinct values Iceberg puffin files can store the
result of sketch_theta for distinct values, but the code path for
storing the statistic in the HMS requires a direct value from
approx_distinct. Thus, ANALYZE may compute a value twice.
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch 2 times, most recently from 56b02fd to d041947 Compare September 23, 2024 17:48
Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a short first pass


The set of statistics available for a particular query depends on the connector
being used and can also vary by table or even by table layout. For example, the
Hive connector does not currently provide statistics on data size.

Table statistics can be displayed via the Presto SQL interface using the
Table statistics can be displayed using a SQL statement with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can be fetched using the below SQL query sounds better to me

presto-docs/src/main/sphinx/optimizer/statistics.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/sql/show-stats.rst Outdated Show resolved Hide resolved
}
}

public static ToStringHelper toStringHelper(Object self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's attribute this to Guava ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we just use Guava for all of this? We already depend on it. If there is some reason we want to fork this code, then full tests are needed. But it should be much easier just to call Guava

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guava dependency is not included in presto-common as the presto-common module is a dependency of the presto-spi module which we avoid putting additional dependencies so that when creating connectors or other plugins there it is less likely for downstream users of presto to encounter issues or conflicts

Copy link
Contributor

@elharo elharo Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's reasonable. However if we're forking this in, we also MUST include the tests as well, and be prepared to support this for the foreseeable future. I'm not sure it's worth the effort. I'd be inclined to just manually build the strings in the toString method like we've been doing since 1995. :-)

}
}

public static ToStringHelper toStringHelper(Object self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we just use Guava for all of this? We already depend on it. If there is some reason we want to fork this code, then full tests are needed. But it should be much easier just to call Guava

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from d041947 to eae161e Compare October 4, 2024 21:12
}
}

public static ToStringHelper toStringHelper(Object self)
Copy link
Contributor

@elharo elharo Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's reasonable. However if we're forking this in, we also MUST include the tests as well, and be prepared to support this for the foreseeable future. I'm not sure it's worth the effort. I'd be inclined to just manually build the strings in the toString method like we've been doing since 1995. :-)

Utilizes the sketch_kll function to generate histograms and store them
into the Iceberg table's puffin files for table-level statistic storage.

Histograms are always collected by ANALYZE, but they are not used by the
cost calculator unless enabled via optimizer.use-histograms
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-histogram-storage branch from eae161e to ee5e3fd Compare October 8, 2024 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants