Skip to content

Latest commit

 

History

History
104 lines (87 loc) · 52 KB

supported-aggregations.md

File metadata and controls

104 lines (87 loc) · 52 KB
description
Aggregate functions return a single result for a group of rows.

Aggregation Functions

Aggregate functions return a single result for a group of rows. The following table shows supported aggregate functions in Pinot.

Function Description Example Default Value When No Record Selected

ARG_MIN

/ARG_MAX

Project a column where the maxima appears in a series of measuring columns. ARG_MAX(measuring1, measuring2, measuring3, projection) Will return no result
AVGVALUEINTEGERSUMTUPLESKETCH See Cardinality Estimation 0
COUNT Returns the count of the records as Long COUNT(*) 0
COVAR_POP Returns the population covariance between of 2 numerical columns as Double COVAR_POP(col1, col2) Double.NEGATIVE_INFINITY
COVAR_SAMP Returns the sample covariance between of 2 numerical columns as Double COVAR_SAMP(col1, col2) Double.NEGATIVE_INFINITY
HISTOGRAM Calculate the histogram of a numeric column as Double[] HISTOGRAM(numberOfGames,0,200,10) 0, 0, ..., 0
MIN Returns the minimum value of a numeric column as Double MIN(playerScore) Double.POSITIVE_INFINITY
MAX Returns the maximum value of a numeric column as Double MAX(playerScore) Double.NEGATIVE_INFINITY
SUM Returns the sum of the values for a numeric column as Double SUM(playerScore) 0
SUMPRECISION Returns the sum of the values for a numeric column with optional precision and scale as BigDecimal SUMPRECISION(salary), SUMPRECISION(salary, precision, scale) 0.0
AVG Returns the average of the values for a numeric column as Double AVG(playerScore) Double.NEGATIVE_INFINITY
MODE Returns the most frequent value of a numeric column as Double. When multiple modes are present it gives the minimum of all the modes. This behavior can be overridden to get the maximum or the average mode.

MODE(playerScore)

MODE(playerScore, 'MIN')

MODE(playerScore, 'MAX')

MODE(playerScore, 'AVG')

Double.NEGATIVE_INFINITY
MINMAXRANGE Returns the max - min value for a numeric column as Double MINMAXRANGE(playerScore) Double.NEGATIVE_INFINITY
PERCENTILE(column, N) Returns the Nth percentile of the values for a numeric column as Double. N is a decimal number between 0 and 100 inclusive. PERCENTILE(playerScore, 50) PERCENTILE(playerScore, 99.9) Double.NEGATIVE_INFINITY
PERCENTILEEST(column, N) Returns the Nth percentile of the values for a numeric column using Quantile Digest as Long

PERCENTILEEST(playerScore, 50)

PERCENTILEEST(playerScore, 99.9)

Long.MIN_VALUE
PERCENTILETDIGEST(column, N) Returns the Nth percentile of the values for a numeric column using T-digest as Double

PERCENTILETDIGEST(playerScore, 50)

PERCENTILETDIGEST(playerScore, 99.9)

Double.NaN
PERCENTILETDIGEST(column, N, CF) Returns the Nth percentile (using compression factor of CF) of the values for a numeric column using T-digest as Double

PERCENTILETDIGEST(playerScore, 50, 1000)

PERCENTILETDIGEST(playerScore, 99.9, 500)

Double.NaN
PERCENTILESMARTTDIGEST Returns the Nth percentile of the values for a numeric column as Double. When there are too many values, automatically switch to approximate percentile using TDigest. The switch threshold (100_000 by default) and compression (100 by default) for the TDigest can be configured via the optional second argument.

PERCENTILESMARTTDIGEST(playerScore, 50)

PERCENTILESMARTTDIGEST(playerScore, 99.9, 'threshold=100;compression=50)

Double.NEGATIVE_INFINITY
DISTINCTCOUNT Returns the count of distinct values of a column as Integer DISTINCTCOUNT(playerName) 0
DISTINCTCOUNTBITMAP Returns the count of distinct values of a column as Integer. This function is accurate for INT column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collisions. DISTINCTCOUNTBITMAP(playerName) 0
DISTINCTCOUNTHLL Returns an approximate distinct count using HyperLogLog as Long. It also takes an optional second argument to configure the log2m for the HyperLogLog. DISTINCTCOUNTHLL(playerName, 12) 0
DISTINCTCOUNTRAWHLL Returns HyperLogLog response serialized as String. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching. DISTINCTCOUNTRAWHLL(playerName) 0
DISTINCTCOUNTHLLPLUS Returns an approximate distinct count using HyperLogLogPlus as Long. It also takes an optional second and third arguments to configure the p and sp for the HyperLogLogPlus. DISTINCTCOUNTHLLPLUS(playerName) 0
DISTINCTCOUNTRAWHLLPLUS Returns HyperLogLogPlus response serialized as String. The serialized HLLPlus can be converted back into an HLLPlus and then aggregated with other HLLPluses. A common use case may be to merge HLLPlus responses from different Pinot tables, or to allow aggregation after client-side batching. DISTINCTCOUNTRAWHLLPLUS(playerName) 0
DISTINCTCOUNTSMARTHLL Returns the count of distinct values of a column as Integer. When there are too many distinct values, automatically switch to approximate distinct count using HyperLogLog. The switch threshold (100_000 by default) and log2m (12 by default) for the HyperLogLog can be configured via the optional second argument.

DISTINCTCOUNTSMARTHLL(playerName),

DISTINCTCOUNTSMARTHLL(playerName, 'threshold=100;log2m=8')

0
DISTINCTCOUNTCPCSKETCH See Cardinality Estimation 0
DISTINCTCOUNTRAWCPCSKETCH See Cardinality Estimation 0
DISTINCTCOUNTRAWINTEGERSUMTUPLESKETCH See Cardinality Estimation 0
DISTINCTCOUNTTHETASKETCH See Cardinality Estimation 0
DISTINCTCOUNTRAWTHETASKETCH See Cardinality Estimation 0
DISTINCTCOUNTTUPLESKETCH See Cardinality Estimation 0
DISTINCTCOUNTULL See Cardinality Estimation 0
DISTINCTCOUNTRAWULL See Cardinality Estimation 0
SEGMENTPARTITIONEDDISTINCTCOUNT Returns the count of distinct values of a column as Long when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result. SEGMENTPARTITIONEDDISTINCTCOUNT(playerName) 0
SEGMENTPARTITIONEDDISTINCTCOUNT Returns the count of distinct values of a column as Long when the column is pre-partitioned for each segment, where there is no common value within different segments. This function calculates the exact count of distinct values within the segment, then simply sums up the results from different segments to get the final result. SEGMENTPARTITIONEDDISTINCTCOUNT(playerName) 0
SUMVALUESINTEGERSUMTUPLESKETCH See Cardinality Estimation 0
LASTWITHTIME(dataColumn, timeColumn, 'dataType') Get the last value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING

LASTWITHTIME(playerScore, timestampColumn, 'BOOLEAN')

LASTWITHTIME(playerScore, timestampColumn, 'INT')

LASTWITHTIME(playerScore, timestampColumn, 'LONG')

LASTWITHTIME(playerScore, timestampColumn, 'FLOAT')

LASTWITHTIME(playerScore, timestampColumn, 'DOUBLE')

LASTWITHTIME(playerScore, timestampColumn, 'STRING')

INT: Int.MIN_VALUE LONG: Long.MIN_VALUE FLOAT: Float.NaN DOUBLE: Double.NaN STRING: ""
FIRSTWITHTIME(dataColumn, timeColumn, 'dataType') Get the first value of dataColumn where the timeColumn is used to define the time of dataColumn and the dataType specifies the type of dataColumn, which can be BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING

FIRSTWITHTIME(playerScore, timestampColumn, 'BOOLEAN')

FIRSTWITHTIME(playerScore, timestampColumn, 'INT')

FIRSTWITHTIME(playerScore, timestampColumn, 'LONG')

FIRSTWITHTIME(playerScore, timestampColumn, 'FLOAT')

FIRSTWITHTIME(playerScore, timestampColumn, 'DOUBLE')

FIRSTWITHTIME(playerScore, timestampColumn, 'STRING')

INT: Int.MIN_VALUE LONG: Long.MIN_VALUE FLOAT: Float.NaN DOUBLE: Double.NaN STRING: ""

Deprecated functions:

Function Description Example
FASTHLL FASTHLL stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format FASTHLL(playerName)

Multi-value column functions

The following aggregation functions can be used for multi-value columns

Function

COUNTMV
Returns the count of a multi-value column as Long

MINMV
Returns the minimum value of a numeric multi-value column as Double

MAXMV
Returns the maximum value of a numeric multi-value column as Double

SUMMV
Returns the sum of the values for a numeric multi-value column as Double

AVGMV
Returns the average of the values for a numeric multi-value column as Double

MINMAXRANGEMV
Returns the max - min value for a numeric multi-value column as Double

PERCENTILEMV(column, N)
Returns the Nth percentile of the values for a numeric multi-value column as Double

PERCENTILEESTMV(column, N)
Returns the Nth percentile using Quantile Digest as Long

PERCENTILETDIGESTMV(column, N)
Returns the Nth percentile using T-digest as Double

PERCENTILETDIGESTMV(column, N, CF)
Returns the Nth percentile (using compression factor CF) using T-digest as Double

DISTINCTCOUNTMV
Returns the count of distinct values for a multi-value column as Integer

DISTINCTCOUNTBITMAPMV
Returns the count of distinct values for a multi-value column as Integer. This function is accurate for INT or dictionary encoded column, but approximate for other cases where hash codes are used in distinct counting and there may be hash collision.

DISTINCTCOUNTHLLMV
Returns an approximate distinct count using HyperLogLog as Long

DISTINCTCOUNTRAWHLLMV
Returns HyperLogLog response serialized as string. The serialized HLL can be converted back into an HLL and then aggregated with other HLLs. A common use case may be to merge HLL responses from different Pinot tables, or to allow aggregation after client-side batching.

DISTINCTCOUNTHLLPLUSMV
Returns an approximate distinct count using HyperLogLogPlus as Long

DISTINCTCOUNTRAWHLLPLUSMV
Returns HyperLogLogPlus response serialized as string. The serialized HLLPlus can be converted back into an HLLPlus and then aggregated with other HLLPluses. A common use case may be to merge HLLPlus responses from different Pinot tables, or to allow aggregation after client-side batching.

FILTER Clause in aggregation

Pinot supports FILTER clause in aggregation queries as follows:

SELECT SUM(COL1) FILTER (WHERE COL2 > 300),
       AVG(COL2) FILTER (WHERE COL2 < 50) 
FROM MyTable WHERE COL3 > 50

In the query above, COL1 is aggregated only for rows where COL2 > 300 and COL3 > 50 . Similarly, COL2 is aggregated where COL2 < 50 and COL3 > 50.

With NULL Value Support enabled, this allows to filter out the null values while performing aggregation as follows:

SELECT SUM(COL1) FILTER (WHERE COL1 IS NOT NULL)
FROM MyTable WHERE COL3 > 50

In the above query, COL1 is aggregated only for the non-null values. Without NULL value support, we would have to filter using the default null value.

Deprecated functions:

Function Description Example
FASTHLLMV (Deprecated) stores serialized HyperLogLog in String format, which performs worse than DISTINCTCOUNTHLL, which supports serialized HyperLogLog in BYTES (byte array) format FASTHLLMV(playerNames)