Skip to content

Commit

Permalink
Add non-determinstic behavior page
Browse files Browse the repository at this point in the history
Fixes #3534
  • Loading branch information
szarnyasg committed Sep 5, 2024
1 parent bd4e9cb commit dae0c5e
Show file tree
Hide file tree
Showing 2 changed files with 92 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _data/menu_docs_dev.json
Original file line number Diff line number Diff line change
Expand Up @@ -1375,6 +1375,10 @@
"page": "Limits",
"url": "limits"
},
{
"page": "Non-Deterministic Behavior",
"url": "non-deterministic_behavior"
},
{
"page": "DuckDB's Footprint",
"slug": "footprint_of_duckdb",
Expand Down
88 changes: 88 additions & 0 deletions docs/operations_manual/non-deterministic_behavior.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
---
layout: docu
title: Non-Deterministic Behavior
---

Several operators in DuckDB exhibit non-deterministic behavior.
Most notably, SQL uses set semantics, which allows results to be returned in a different order.
DuckDB exploits this, particularly when performing multi-threaded execution.
Other factors, such as using different compilers, operating systems, hardware architectures can also cause changes in ordering.
This page documents the cases where non-determinism is an _expected behavior_.
If you would like to make your queries determinisic, see the [“Working Around Non-Determinism” section](#working-around-non-determinism).

## Set Semantics

One of the most common sources of non-determinism is the set semantics used by SQL.
E.g., if you run the following query repeatedly, you may get two different results:

```sql
SELECT *
FROM (
SELECT 'A' AS x
UNION
SELECT 'B' AS x
);
```

Both results `A`, `B` and `B`, `A` are correct.

## Different Results on Different Platforms: `array_distinct`

The `array_distinct` function may return results [in a different order on different platforms](https://github.com/duckdb/duckdb/issues/13746):

```sql
SELECT array_distinct(['A', 'A', 'B', NULL, NULL]) AS arr;
```

For this query, both `[A, B]` and `[B, A]` are valid results.

## Floating-Point Aggregate Operations with Multi-Threading

Floating-point inaccuracies may produce different results when run in a multi-threaded configurations:
For example, [`stddev` and `corr` may produce non-deterministic results](https://github.com/duckdb/duckdb/issues/13763):

```sql
CREATE TABLE tbl AS
SELECT 'ABCDEFG'[floor(random() * 7 + 1)::INT] AS s, 3.7 AS x, i AS y
FROM range(1, 1_000_000) r(i);

SELECT s, stddev(x) AS standard_deviation, corr(x, y) AS correlation FROM tbl
GROUP BY s
ORDER BY s;
```

The expected standard deviations and correlations from this query are 0 for all values of `s`.
However, when executed on multiple threads, the query may return small numbers (`0 <= z < 10e-16`) due to floating-point inaccuracies.

## Working Around Non-Determinism

For the majority of use cases, non-determinism is not causing any issues.
However, there are some cases where deterministic results are desirable.
In these cases, try the following workarounds:

1. Limit the number of threads to prevent non-determinism introduced by multi-threading.

```sql
SET threads = 1;
```

2. Enforce ordering. For example, you can use the [`ORDER BY ALL` clause]({% link docs/sql/query_syntax/orderby.md %}#order-by-all):

```sql
SELECT *
FROM (
SELECT 'A' AS x
UNION
SELECT 'B' AS x
)
ORDER BY ALL;
```

You can also sort lists using [`list_sort`]({% link docs/sql/functions/list.md %}#list_sortlist)

```sql
SELECT list_sort(array_distinct(['A', 'A', 'B', NULL, NULL])) AS i
ORDER BY i;
```

It's also possible to introduce a [deterministic shuffling]({% post_url 2024-08-19-duckdb-tricks-part-1 %}#shuffling-data).

0 comments on commit dae0c5e

Please sign in to comment.