Add 0.289 docs

prestodb · Sep 19, 2024 · 1f5c572 · 1f5c572
1 parent 45eb2ed
commit 1f5c572
Showing 1,814 changed files with 1,840,591 additions and 3 deletions.
diff --git a/website/static/docs/0.289/_images/functions_color_bar.png b/website/static/docs/0.289/_images/functions_color_bar.png
diff --git a/website/static/docs/0.289/_images/materialized_shuffle_execution.png b/website/static/docs/0.289/_images/materialized_shuffle_execution.png
diff --git a/website/static/docs/0.289/_images/presto_console.png b/website/static/docs/0.289/_images/presto_console.png
diff --git a/website/static/docs/0.289/_images/query_details_json.png b/website/static/docs/0.289/_images/query_details_json.png
diff --git a/website/static/docs/0.289/_images/rpc_shuffle_execution.png b/website/static/docs/0.289/_images/rpc_shuffle_execution.png
diff --git a/website/static/docs/0.289/_images/serialized-page-array-column.png b/website/static/docs/0.289/_images/serialized-page-array-column.png
diff --git a/website/static/docs/0.289/_images/serialized-page-header.png b/website/static/docs/0.289/_images/serialized-page-header.png
diff --git a/website/static/docs/0.289/_images/serialized-page-int-array.png b/website/static/docs/0.289/_images/serialized-page-int-array.png
diff --git a/website/static/docs/0.289/_images/serialized-page-int-column.png b/website/static/docs/0.289/_images/serialized-page-int-column.png
diff --git a/website/static/docs/0.289/_images/serialized-page-layout.png b/website/static/docs/0.289/_images/serialized-page-layout.png
diff --git a/website/static/docs/0.289/_images/serialized-page-map-column.png b/website/static/docs/0.289/_images/serialized-page-map-column.png
diff --git a/website/static/docs/0.289/_images/serialized-page-nulls.png b/website/static/docs/0.289/_images/serialized-page-nulls.png
diff --git a/website/static/docs/0.289/_images/serialized-page-row-column.png b/website/static/docs/0.289/_images/serialized-page-row-column.png
diff --git a/website/static/docs/0.289/_images/serialized-page-string-column.png b/website/static/docs/0.289/_images/serialized-page-string-column.png
diff --git a/website/static/docs/0.289/_images/worker-protocol-output-buffers.png b/website/static/docs/0.289/_images/worker-protocol-output-buffers.png
diff --git a/website/static/docs/0.289/_images/worker-protocol-results.png b/website/static/docs/0.289/_images/worker-protocol-results.png
diff --git a/website/static/docs/0.289/_sources/admin.rst.txt b/website/static/docs/0.289/_sources/admin.rst.txt
@@ -0,0 +1,18 @@
+**************
+Administration
+**************
+
+.. toctree::
+    :maxdepth: 1
+
+    admin/web-interface
+    admin/tuning
+    admin/properties
+    admin/spill
+    admin/exchange-materialization
+    admin/cte-materialization
+    admin/resource-groups
+    admin/session-property-managers
+    admin/function-namespace-managers
+    admin/dist-sort
+    admin/verifier
diff --git a/website/static/docs/0.289/_sources/admin/cte-materialization.rst.txt b/website/static/docs/0.289/_sources/admin/cte-materialization.rst.txt
@@ -0,0 +1,138 @@
+
+===================
+CTE Materialization
+===================
+
+Common Table Expressions (CTEs) are subqueries that appear in a WITH clause provided by the user.
+Their repeated usage in a query can lead to redundant computations, excessive data retrieval, and high resource consumption.
+
+To address this, Presto supports CTE materialization, which allows intermediate CTEs to be reused within the scope of the same query.
+Materializing a CTE can improve performance when the CTE is referenced multiple times, because it is computed once instead of once per reference. However, writing to and reading from disk also has a cost, so the optimization may not pay off for very simple CTEs
+or for CTEs that are referenced only a few times in a query.
+
+Materialized CTEs are stored in temporary tables that are bucketed based on random hashing.
+To use this feature, the connector used by the query must support the creation of temporary tables. Currently, only the :doc:`/connector/hive` offers this capability.
+The query statistics (``com.facebook.presto.spi.eventlistener.QueryStatistics#writtenIntermediateBytes``) expose a metric to the event listener for monitoring the bytes written to intermediate storage by temporary tables.
+
+How to use CTE Materialization
+------------------------------
+
+The following configurations and session properties enable CTE materialization and modify its settings.
+
+``cte-materialization-strategy``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``string``
+    * **Allowed values:** ``ALL``, ``NONE``, ``HEURISTIC``, ``HEURISTIC_COMPLEX_QUERIES_ONLY``
+    * **Default value:** ``NONE``
+
+Specifies the strategy for materializing Common Table Expressions (CTEs) in queries.
+
+``NONE`` - no CTEs are materialized.
+
+``ALL`` - all CTEs in the query are materialized.
+
+``HEURISTIC`` - greedily materializes the earliest parent CTE that is repeated at least ``cte_heuristic_replication_threshold`` times.
+
+``HEURISTIC_COMPLEX_QUERIES_ONLY`` - greedily materializes the earliest parent CTE that meets the ``HEURISTIC`` criteria and contains a join or an aggregate.
+
+Use the ``cte_materialization_strategy`` session property to set on a per-query basis.
+
+``cte-heuristic-replication-threshold``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``integer``
+    * **Minimum value:** ``0``
+    * **Default value:** ``4``
+
+When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, a CTE is materialized only if it appears in the query at least ``cte-heuristic-replication-threshold`` times.
+
+Use the ``cte_heuristic_replication_threshold`` session property to set on a per-query basis.
+
+``query.cte-partitioning-provider-catalog``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``string``
+    * **Default value:** ``system``
+
+The name of the catalog used for CTE materialization, which provides the custom partitioning for materialized CTEs.
+
+Use the ``cte_partitioning_provider_catalog`` session property to set on a per-query basis.
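+
+As a minimal sketch of the three session properties described so far, the following session enables heuristic materialization for a single query. It assumes the Hive connector is mounted as a catalog named ``hive``; the ``orders`` table and the threshold value of ``2`` are illustrative only.
+
+.. code-block:: sql
+
+    SET SESSION cte_materialization_strategy = 'HEURISTIC';
+    SET SESSION cte_heuristic_replication_threshold = 2;
+    -- Assumption: the Hive catalog is named "hive"
+    SET SESSION cte_partitioning_provider_catalog = 'hive';
+
+    -- customer_totals is referenced twice, which meets the threshold of 2
+    WITH customer_totals AS (
+        SELECT custkey, SUM(totalprice) AS total
+        FROM orders
+        GROUP BY custkey
+    )
+    SELECT custkey, total
+    FROM customer_totals
+    WHERE total > (SELECT AVG(total) FROM customer_totals);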
+
+``cte-filter-and-projection-pushdown-enabled``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``boolean``
+    * **Default value:** ``true``
+
+Flag to enable or disable the pushdown of common filters and projects into the materialized CTE.
+
+Use the ``cte_filter_and_projection_pushdown_enabled`` session property to set on a per-query basis.
+
+``hive.cte-virtual-bucket-count``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``integer``
+    * **Default value:** ``128``
+
+The number of buckets used when materializing CTEs, which can affect the performance of queries that involve CTE materialization.
+A higher number of buckets might improve parallelism, but it also increases memory and network communication overhead.
+
+Recommended value: 4 to 10 times the size of the cluster.
+
+Use the ``hive.cte_virtual_bucket_count`` session property to set on a per-query basis.
+
+``hive.temporary-table-storage-format``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``string``
+    * **Allowed values:** ``PAGEFILE``, ``ORC``, ``DWRF``, ``ALPHA``, ``PARQUET``, ``AVRO``, ``RCBINARY``, ``RCTEXT``, ``SEQUENCEFILE``, ``JSON``, ``TEXTFILE``, ``CSV``
+    * **Default value:** ``ORC``
+
+This setting determines the data format for temporary tables generated by CTE materialization. The recommended value is ``PAGEFILE`` (see :doc:`/develop/serialized-page`), which is the most performant
+because it stores Presto pages directly and avoids serialization and deserialization during reads and writes.
+
+Use the ``hive.temporary_table_storage_format`` session property to set on a per-query basis.
+
+``hive.temporary-table-compression-codec``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``string``
+    * **Allowed values:** ``SNAPPY``, ``NONE``, ``GZIP``, ``LZ4``, ``ZSTD``
+    * **Default value:** ``SNAPPY``
+
+This property defines the compression codec to be used for temporary tables generated by CTE materialization.
+
+Use the ``hive.temporary_table_compression_codec`` session property to set on a per-query basis.
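+
+A short sketch of how the Hive temporary table settings above might be combined in one session. It assumes the Hive catalog is named ``hive`` and a hypothetical cluster of 32 workers, so a bucket count of ``256`` is 8 times the cluster size; the choice of ``ZSTD`` is only an example of an allowed codec, not a recommendation.
+
+.. code-block:: sql
+
+    -- Assumption: 32 workers, so 256 buckets is 8x the cluster size
+    SET SESSION hive.cte_virtual_bucket_count = 256;
+    -- PAGEFILE is the value recommended above
+    SET SESSION hive.temporary_table_storage_format = 'PAGEFILE';
+    -- Any allowed codec can be used here; ZSTD is illustrative
+    SET SESSION hive.temporary_table_compression_codec = 'ZSTD';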
+
+``hive.bucket-function-type-for-cte-materialization``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``string``
+    * **Allowed values:** ``HIVE_COMPATIBLE``, ``PRESTO_NATIVE``
+    * **Default value:** ``PRESTO_NATIVE``
+
+This setting specifies the hash function type for CTE materialization.
+
+Use the ``hive.bucket_function_type_for_cte_materialization`` session property to set on a per-query basis.
+
+
+``query.max-written-intermediate-bytes``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    * **Type:** ``DataSize``
+    * **Default value:** ``2TB``
+
+This setting defines a cap on the amount of data that can be written during CTE Materialization. If a query exceeds this limit, it will fail.
+
+Use the ``query_max_written_intermediate_bytes`` session property to set on a per-query basis.
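+
+For example, a session could lower the cap for an exploratory query. The ``500GB`` value is purely illustrative, and this sketch assumes the property accepts a data size string in the same form as the default shown above.
+
+.. code-block:: sql
+
+    -- Fail the query if CTE materialization writes more than 500GB
+    SET SESSION query_max_written_intermediate_bytes = '500GB';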
+
+
+How to Participate in Development
+---------------------------------
+
+List of issues: https://github.com/prestodb/presto/labels/cte_materialization
+
+
diff --git a/website/static/docs/0.289/_sources/admin/dist-sort.rst.txt b/website/static/docs/0.289/_sources/admin/dist-sort.rst.txt
@@ -0,0 +1,17 @@
+================
+Distributed sort
+================
+
+Distributed sort allows sorting data that exceeds ``query.max-memory-per-node``.
+Distributed sort is enabled via the ``distributed_sort`` session property or the
+``distributed-sort`` configuration property set in
+``etc/config.properties`` of the coordinator. Distributed sort is enabled by
+default.
+
+When distributed sort is enabled, the sort operator executes in parallel on multiple
+nodes in the cluster. Partially sorted data from each Presto worker node is then streamed
+to a single worker node for a final merge. This technique makes it possible to use the memory of multiple
+Presto worker nodes for sorting. The primary purpose of distributed sort is to allow sorting
+of data sets that do not normally fit into the memory of a single node. A performance improvement
+can be expected, but it won't scale linearly with the number of nodes because the
+data needs to be merged by a single node.
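+
+As a brief illustration, the session property can be toggled per query. The ``orders`` table and its columns are assumed here for the sake of the example.
+
+.. code-block:: sql
+
+    -- Explicitly enable distributed sort for this session (it is on by default)
+    SET SESSION distributed_sort = true;
+
+    -- The ORDER BY is sorted in parallel on the workers,
+    -- then merged on a single node
+    SELECT orderkey, totalprice
+    FROM orders
+    ORDER BY totalprice DESC;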
diff --git a/website/static/docs/0.289/_sources/admin/exchange-materialization.rst.txt b/website/static/docs/0.289/_sources/admin/exchange-materialization.rst.txt
@@ -0,0 +1,59 @@
+========================
+Exchange Materialization
+========================
+
+Presto allows exchange materialization to support memory-intensive queries.
+This mechanism brings MapReduce-style execution to Presto's MPP runtime
+and can be applied together with :doc:`/admin/spill`.
+
+Introduction
+------------
+
+As with other MPP databases, Presto leverages RPC shuffle to achieve efficient and
+low-latency query execution for join and aggregation. However, RPC shuffle
+also requires all the producers and consumers to be executed concurrently until the
+query is finished.
+
+To illustrate this, consider the aggregation query:
+
+.. code-block:: sql
+
+    SELECT custkey, SUM(totalprice)
+    FROM orders
+    GROUP BY custkey
+
+
+The following figure demonstrates how this query executes in Presto classic mode:
+
+.. figure:: ../images/rpc_shuffle_execution.png
+   :align: center
+
+With exchange materialization, the intermediate shuffle data is written to disk (currently,
+it is always a temporary Hive bucketed table). This opens the opportunity for flexible scheduling policies
+on the aggregation side, as only a subset of aggregation data needs to be held in memory at the
+same time -- this execution strategy is called "grouped execution" in Presto.
+
+.. figure:: ../images/materialized_shuffle_execution.png
+   :align: center
+
+Using Exchange Materialization
+------------------------------
+
+Exchange materialization can be enabled on a per-query basis by setting the following three session properties:
+``exchange_materialization_strategy``, ``partitioning_provider_catalog``, and ``hash_partition_count``:
+
+.. code-block:: sql
+
+    SET SESSION exchange_materialization_strategy='ALL';
+
+    -- Set partitioning_provider_catalog to the Hive connector catalog
+    SET SESSION partitioning_provider_catalog='hive';
+
+    -- We recommend setting hash_partition_count to at least 5 to 10 times the cluster size
+    -- when exchange materialization is enabled.
+    SET SESSION hash_partition_count = 4096;
+
+To make it easy for users to use exchange materialization, the admin can leverage :doc:`/admin/session-property-managers`
+to set the session properties automatically based on client tags. The example in :doc:`/admin/session-property-managers`
+demonstrates how to automatically enable exchange materialization for queries with the ``high_mem_etl`` tag.
+
diff --git a/website/static/docs/0.289/_sources/admin/function-namespace-managers.rst.txt b/website/static/docs/0.289/_sources/admin/function-namespace-managers.rst.txt
@@ -0,0 +1,97 @@
+===========================
+Function Namespace Managers
+===========================
+
+.. warning::
+
+    This is an experimental feature under active development. The way
+    function namespace managers are configured might change.
+
+Function namespace managers support storing and retrieving SQL
+functions, allowing the Presto engine to perform actions such as
+creating, altering, and deleting functions.
+
+A function namespace has the format ``catalog.schema`` (e.g.
+``example.test``). It can be thought of as a schema for storing
+functions. However, it is not a full-fledged schema, because it
+supports storing only functions, not tables or views.
+
+Each Presto function, whether built-in or user-defined, resides in
+a function namespace. All built-in functions reside in the
+``presto.default`` function namespace. The qualified function name of
+a function is the function namespace in which it resides followed by
+its function name (e.g. ``example.test.func``). Built-in functions can
+be referenced in queries with their function namespaces omitted, while
+user-defined functions need to be referenced by their qualified function
+names. A function is uniquely identified by its qualified function name
+and parameter type list.
+
+Each function namespace manager binds to a catalog name and manages all
+functions within that catalog. Using the catalog name of an existing
+connector is discouraged, as the behavior is neither defined nor tested,
+and will be disallowed in the future.
+
+Currently, those catalog names do not correspond to real catalogs.
+They cannot be specified as the catalog in a session, nor do they
+support :doc:`/sql/create-schema`, :doc:`/sql/alter-schema`,
+:doc:`/sql/drop-schema`, or :doc:`/sql/show-schemas`. Instead,
+namespaces can be added using the methods described below.
+
+
+Configuration
+-------------
+
+Presto currently stores all function namespace manager related
+information in MySQL.
+
+To instantiate a MySQL-based function namespace manager that manages
+catalog ``example``, an administrator first needs a running MySQL
+server. Supposing the MySQL server can be reached at ``example.net:3306``,
+add a file ``etc/function-namespace/example.properties`` with the
+following contents::
+
+    function-namespace-manager.name=mysql
+    database-url=jdbc:mysql://example.net:3306/database?user=root&password=password
+    function-namespaces-table-name=example_function_namespaces
+    functions-table-name=example_sql_functions
+
+When Presto first starts with the above MySQL function namespace
+manager configuration, two MySQL tables will be created if they do
+not exist.
+
+- ``example_function_namespaces`` stores function namespaces of
+  the catalog ``example``.
+- ``example_sql_functions`` stores SQL-invoked functions of the
+  catalog ``example``.
+
+Multiple function namespace managers can be instantiated by placing
+multiple properties files under ``etc/function-namespace``. They
+may be configured to use the same tables. If so, each manager will
+only create and interact with entries of the catalog to which it binds.
+
+To create a new function namespace, insert into the
+``example_function_namespaces`` table::
+
+    INSERT INTO example_function_namespaces (catalog_name, schema_name)
+    VALUES('example', 'test');
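+
+Once the namespace exists, a function can be created in it and then called by its qualified name. The following is a minimal sketch rather than a required pattern: the function name ``square`` and its body are hypothetical, and the statement follows the syntax described in :doc:`/sql/create-function`.
+
+.. code-block:: sql
+
+    -- Hypothetical function stored in the example.test namespace
+    CREATE FUNCTION example.test.square(x double)
+    RETURNS double
+    DETERMINISTIC
+    RETURNS NULL ON NULL INPUT
+    RETURN x * x;
+
+    -- User-defined functions are referenced by their qualified name
+    SELECT example.test.square(DOUBLE '3.0');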
+
+
+Configuration Reference
+-----------------------
+
+``function-namespace-manager.name`` is the type of the function namespace manager to instantiate. Currently, only ``mysql`` is supported.
+
+The following table lists all configuration properties supported by the MySQL function namespace manager.
+
+=========================================== ==================================================================================================
+Name                                        Description
+=========================================== ==================================================================================================
+``database-url``                            The URL of the MySQL database used by the MySQL function namespace manager.
+``function-namespaces-table-name``          The name of the table that stores all the function namespaces managed by this manager.
+``functions-table-name``                    The name of the table that stores all the functions managed by this manager.
+=========================================== ==================================================================================================
+
+See Also
+--------
+
+:doc:`../sql/create-function`, :doc:`../sql/alter-function`, :doc:`../sql/drop-function`, :doc:`../sql/show-functions`