Add 0.289 docs
wanglinsong authored and alileclerc committed Sep 19, 2024
1 parent 45eb2ed commit 1f5c572
Showing 1,814 changed files with 1,840,591 additions and 3 deletions.
18 changes: 18 additions & 0 deletions website/static/docs/0.289/_sources/admin.rst.txt
@@ -0,0 +1,18 @@
**************
Administration
**************

.. toctree::
    :maxdepth: 1

    admin/web-interface
    admin/tuning
    admin/properties
    admin/spill
    admin/exchange-materialization
    admin/cte-materialization
    admin/resource-groups
    admin/session-property-managers
    admin/function-namespace-managers
    admin/dist-sort
    admin/verifier
138 changes: 138 additions & 0 deletions website/static/docs/0.289/_sources/admin/cte-materialization.rst.txt
@@ -0,0 +1,138 @@

===================
CTE Materialization
===================

Common Table Expressions (CTEs) are subqueries defined in the WITH clause of a query.
When a CTE is referenced several times in a query, it can lead to redundant computation, excessive data retrieval, and high resource consumption.

To address this, Presto supports CTE materialization, which allows intermediate CTE results to be reused within the scope of the same query.
Materializing a CTE can improve performance when the CTE is used multiple times in a query, because it reduces recomputation of the CTE. However, writing to and reading from disk also has a cost, so the optimization may not be beneficial for very simple CTEs
or for CTEs that are used only a few times in a query.

Materialized CTEs are stored in temporary tables that are bucketed based on random hashing.
To use this feature, the connector used by the query must support the creation of temporary tables. Currently, only the :doc:`/connector/hive` offers this capability.
The ``writtenIntermediateBytes`` metric in ``QueryStatistics`` (``com.facebook.presto.spi.eventlistener.QueryStatistics#writtenIntermediateBytes``) is exposed to the event listener and reports the bytes written to intermediate storage by temporary tables.

How to use CTE Materialization
------------------------------

The following configurations and session properties enable CTE materialization and modify its settings.

``cte-materialization-strategy``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``string``
* **Allowed values:** ``ALL``, ``NONE``, ``HEURISTIC``, ``HEURISTIC_COMPLEX_QUERIES_ONLY``
* **Default value:** ``NONE``

Specifies the strategy for materializing Common Table Expressions (CTEs) in queries.

``NONE`` - no CTEs will be materialized.

``ALL`` - all CTEs in the query will be materialized.

``HEURISTIC`` - greedily materializes the earliest parent CTE that is repeated at least ``cte_heuristic_replication_threshold`` times.

``HEURISTIC_COMPLEX_QUERIES_ONLY`` - greedily materializes the earliest parent CTE that meets the ``HEURISTIC`` criteria and contains a join or an aggregation.

Use the ``cte_materialization_strategy`` session property to set on a per-query basis.
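
The following is a minimal sketch of enabling materialization for a single session. It assumes a Hive connector catalog named ``hive`` and a hypothetical ``orders`` table with ``custkey`` and ``totalprice`` columns; adjust both to match your deployment.

.. code-block:: sql

    -- Materialize every CTE in queries run in this session.
    SET SESSION cte_materialization_strategy = 'ALL';
    -- Assumption: the Hive connector catalog is named "hive".
    SET SESSION cte_partitioning_provider_catalog = 'hive';

    -- customer_totals is referenced twice, which makes it a candidate
    -- for being written to a temporary table and read back.
    WITH customer_totals AS (
        SELECT custkey, SUM(totalprice) AS total
        FROM orders
        GROUP BY custkey
    )
    SELECT t.custkey, t.total
    FROM customer_totals t
    CROSS JOIN (SELECT AVG(total) AS avg_total FROM customer_totals) a
    WHERE t.total > a.avg_total;

Whether this is faster than recomputing the CTE depends on how expensive the CTE is compared with the cost of writing and reading the temporary table.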

``cte-heuristic-replication-threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``integer``
* **Minimum value:** ``0``
* **Default value:** ``4``

When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, a CTE is materialized if it appears in a query at least ``cte-heuristic-replication-threshold`` times.

Use the ``cte_heuristic_replication_threshold`` session property to set on a per-query basis.
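
As an illustration, the sketch below lowers the threshold for one session so that a CTE referenced twice qualifies under the ``HEURISTIC`` strategy. The value ``2`` is only an example, not a recommendation.

.. code-block:: sql

    -- Only materialize CTEs that are referenced often enough.
    SET SESSION cte_materialization_strategy = 'HEURISTIC';
    -- Example value: a CTE referenced at least twice now meets the
    -- threshold. With the default of 4 it would need four references.
    SET SESSION cte_heuristic_replication_threshold = 2;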

``query.cte-partitioning-provider-catalog``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``string``
* **Default value:** ``system``

The name of the catalog used for CTE materialization.
This catalog provides the custom partitioning for the materialized CTEs.

Use the ``cte_partitioning_provider_catalog`` session property to set on a per-query basis.

``cte-filter-and-projection-pushdown-enabled``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``boolean``
* **Default value:** ``true``

Flag to enable or disable the pushdown of common filters and projections into the materialized CTE.

Use the ``cte_filter_and_projection_pushdown_enabled`` session property to set on a per-query basis.

``hive.cte-virtual-bucket-count``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``integer``
* **Default value:** ``128``

The number of buckets used when materializing CTEs.
This value can affect the performance of queries that use CTE materialization:
a higher number of buckets might improve parallelism, but it also increases memory and network communication overhead.

Recommended value: 4 to 10 times the size of the cluster.

Use the ``hive.cte_virtual_bucket_count`` session property to set on a per-query basis.
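
As a worked example of the recommendation, assume that "size of the cluster" means the number of nodes and take a hypothetical 32-node cluster: 4 x 32 = 128 and 10 x 32 = 320, so a value between 128 and 320 would fall in the recommended range. The sketch below also assumes the Hive catalog is named ``hive``.

.. code-block:: sql

    -- Hypothetical 32-node cluster: the recommended range is 128 to 320.
    -- 256 is an arbitrary value inside that range.
    SET SESSION hive.cte_virtual_bucket_count = 256;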

``hive.temporary-table-storage-format``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``string``
* **Allowed values:** ``PAGEFILE``, ``ORC``, ``DWRF``, ``ALPHA``, ``PARQUET``, ``AVRO``, ``RCBINARY``, ``RCTEXT``, ``SEQUENCEFILE``, ``JSON``, ``TEXTFILE``, ``CSV``
* **Default value:** ``ORC``

This setting determines the data format for temporary tables generated by CTE materialization. The recommended value is ``PAGEFILE`` (see :doc:`/develop/serialized-page`), which is the most performant:
it stores Presto pages directly and so avoids serialization and deserialization during reads and writes.

Use the ``hive.temporary_table_storage_format`` session property to set on a per-query basis.

``hive.temporary-table-compression-codec``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``string``
* **Allowed values:** ``SNAPPY``, ``NONE``, ``GZIP``, ``LZ4``, ``ZSTD``
* **Default value:** ``SNAPPY``

This property defines the compression codec to be used for temporary tables generated by CTE materialization.

Use the ``hive.temporary_table_compression_codec`` session property to set on a per-query basis.
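
A minimal sketch of setting the temporary table format and compression codec for one session, assuming the Hive catalog is named ``hive``. The codec shown is only an illustration taken from the allowed values; which combination performs best depends on the workload.

.. code-block:: sql

    -- Use the recommended format for temporary tables.
    SET SESSION hive.temporary_table_storage_format = 'PAGEFILE';
    -- Illustrative choice from the allowed values; the default is SNAPPY.
    SET SESSION hive.temporary_table_compression_codec = 'ZSTD';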

``hive.bucket-function-type-for-cte-materialization``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``string``
* **Allowed values:** ``HIVE_COMPATIBLE``, ``PRESTO_NATIVE``
* **Default value:** ``PRESTO_NATIVE``

This setting specifies the type of hash function used for bucketing in CTE materialization.

Use the ``hive.bucket_function_type_for_cte_materialization`` session property to set on a per-query basis.


``query.max-written-intermediate-bytes``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``DataSize``
* **Default value:** ``2TB``

This setting defines a cap on the amount of data that can be written during CTE Materialization. If a query exceeds this limit, it will fail.

Use the ``query_max_written_intermediate_bytes`` session property to set on a per-query basis.
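
For example, a session that should fail early rather than write large amounts of intermediate data could lower the cap. The ``100GB`` value below is purely illustrative.

.. code-block:: sql

    -- Illustrative cap: queries in this session fail if CTE
    -- materialization writes more than 100GB of intermediate data.
    SET SESSION query_max_written_intermediate_bytes = '100GB';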


How to Participate in Development
---------------------------------

List of issues: https://github.com/prestodb/presto/labels/cte_materialization


17 changes: 17 additions & 0 deletions website/static/docs/0.289/_sources/admin/dist-sort.rst.txt
@@ -0,0 +1,17 @@
================
Distributed sort
================

Distributed sort allows sorting of data that exceeds ``query.max-memory-per-node``.
Distributed sort is enabled with the ``distributed_sort`` session property or the
``distributed-sort`` configuration property set in
``etc/config.properties`` of the coordinator. Distributed sort is enabled by
default.

When distributed sort is enabled, the sort operator executes in parallel on multiple
nodes in the cluster. Partially sorted data from each Presto worker node is then streamed
to a single worker node for a final merge. This technique makes it possible to use the memory of multiple
Presto worker nodes for sorting. The primary purpose of distributed sort is to allow sorting
of data sets that don't normally fit into the memory of a single node. A performance improvement
can be expected, but it won't scale linearly with the number of nodes because the
data needs to be merged by a single node.
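
As a short sketch, the session property can be set explicitly before running a large sort. The ``orders`` table and its columns are hypothetical and used only for illustration.

.. code-block:: sql

    -- Distributed sort is on by default; setting it explicitly makes the
    -- intent visible. Set it to false to sort on a single node instead.
    SET SESSION distributed_sort = true;

    -- Each worker sorts part of the data, and a single node performs
    -- the final merge.
    SELECT custkey, totalprice
    FROM orders
    ORDER BY totalprice DESC;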
@@ -0,0 +1,59 @@
========================
Exchange Materialization
========================

Presto supports exchange materialization for memory-intensive queries.
This mechanism brings MapReduce-style execution to Presto's MPP runtime
and can be combined with :doc:`/admin/spill`.

Introduction
------------

As with other MPP databases, Presto leverages RPC shuffle to achieve efficient and
low-latency query execution for join and aggregation. However, RPC shuffle
also requires all the producers and consumers to be executed concurrently until the
query is finished.

To illustrate this, consider the aggregation query:

.. code-block:: sql

    SELECT custkey, SUM(totalprice)
    FROM orders
    GROUP BY custkey

The following figure demonstrates how this query executes in Presto classic mode:

.. figure:: ../images/rpc_shuffle_execution.png
    :align: center

With exchange materialization, the intermediate shuffle data is written to disk (currently,
it is always a temporary Hive bucketed table). This opens the opportunity for flexible scheduling policies
on the aggregation side, as only a subset of aggregation data needs to be held in memory at the
same time -- this execution strategy is called "grouped execution" in Presto.

.. figure:: ../images/materialized_shuffle_execution.png
    :align: center

Using Exchange Materialization
------------------------------

Exchange materialization can be enabled on a per-query basis by setting the following three session properties:
``exchange_materialization_strategy``, ``partitioning_provider_catalog``, and ``hash_partition_count``:

.. code-block:: sql

    SET SESSION exchange_materialization_strategy='ALL';
    -- Set partitioning_provider_catalog to the Hive connector catalog
    SET SESSION partitioning_provider_catalog='hive';
    -- We recommend setting hash_partition_count to at least 5 to 10 times the cluster size
    -- when exchange materialization is enabled.
    SET SESSION hash_partition_count = 4096;

To make exchange materialization easy for users to adopt, the admin can use :doc:`/admin/session-property-managers`
to set the session properties automatically based on client tags. The example in :doc:`/admin/session-property-managers`
demonstrates how to automatically enable exchange materialization for queries with the ``high_mem_etl`` tag.

@@ -0,0 +1,97 @@
===========================
Function Namespace Managers
===========================

.. warning::

    This is an experimental feature being actively developed. The way
    Function Namespace Managers are configured might be changed.

Function namespace managers support storing and retrieving SQL
functions, allowing the Presto engine to perform actions such as
creating, altering, and deleting functions.

A function namespace has the format ``catalog.schema`` (e.g.
``example.test``). It can be thought of as a schema for storing
functions. However, it is not a full-fledged schema, as it supports
storing only functions, not tables or views.

Each Presto function, whether built-in or user-defined, resides in
a function namespace. All built-in functions reside in the
``presto.default`` function namespace. The qualified function name of
a function is the function namespace in which it resides followed by
its function name (e.g. ``example.test.func``). Built-in functions can
be referenced in queries with their function namespaces omitted, while
user-defined functions need to be referenced by their qualified function
names. A function is uniquely identified by its qualified function name
and parameter type list.

Each function namespace manager binds to a catalog name and manages all
functions within that catalog. Using the catalog name of an existing
connector is discouraged, as the behavior is neither defined nor tested,
and will be disallowed in the future.

Currently, these catalog names do not correspond to real catalogs.
They cannot be specified as the catalog in a session, nor do they
support :doc:`/sql/create-schema`, :doc:`/sql/alter-schema`,
:doc:`/sql/drop-schema`, or :doc:`/sql/show-schemas`. Instead,
namespaces can be added using the methods described below.


Configuration
-------------

Presto currently stores all function namespace manager related
information in MySQL.

To instantiate a MySQL-based function namespace manager that manages
catalog ``example``, the administrator first needs a running MySQL
server. Assuming the MySQL server can be reached at ``example.net:3306``,
add a file ``etc/function-namespace/example.properties`` with the
following contents::

    function-namespace-manager.name=mysql
    database-url=jdbc:mysql://example.net:3306/database?user=root&password=password
    function-namespaces-table-name=example_function_namespaces
    functions-table-name=example_sql_functions

When Presto first starts with the above MySQL function namespace
manager configuration, two MySQL tables will be created if they do
not exist.

- ``example_function_namespaces`` stores function namespaces of
  the catalog ``example``.
- ``example_sql_functions`` stores SQL-invoked functions of the
  catalog ``example``.

Multiple function namespace managers can be instantiated by placing
multiple properties files under ``etc/function-namespace``. They
may be configured to use the same tables. If so, each manager will
only create and interact with entries of the catalog to which it binds.

To create a new function namespace, insert into the
``example_function_namespaces`` table::

    INSERT INTO example_function_namespaces (catalog_name, schema_name)
    VALUES('example', 'test');
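
Once the ``example.test`` namespace exists, a function can be created in it and called by its qualified name. The sketch below is illustrative: the function name ``tan`` and its body are hypothetical, and the statement follows the syntax described in :doc:`../sql/create-function`.

.. code-block:: sql

    -- Create a SQL-invoked function in the example.test namespace.
    CREATE FUNCTION example.test.tan(x double)
    RETURNS double
    DETERMINISTIC
    RETURNS NULL ON NULL INPUT
    RETURN sin(x) / cos(x);

    -- User-defined functions are referenced by their qualified names.
    SELECT example.test.tan(0.5);

    -- Built-in functions can omit the presto.default namespace.
    SELECT abs(-1);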


Configuration Reference
-----------------------

``function-namespace-manager.name`` is the type of the function namespace manager to instantiate. Currently, only ``mysql`` is supported.

The following table lists all configuration properties supported by the MySQL function namespace manager.

=========================================== ==================================================================================================
Name                                        Description
=========================================== ==================================================================================================
``database-url``                            The URL of the MySQL database used by the MySQL function namespace manager.
``function-namespaces-table-name``          The name of the table that stores all the function namespaces managed by this manager.
``functions-table-name``                    The name of the table that stores all the functions managed by this manager.
=========================================== ==================================================================================================

See Also
--------

:doc:`../sql/create-function`, :doc:`../sql/alter-function`, :doc:`../sql/drop-function`, :doc:`../sql/show-functions`