Skip to content

Commit d6e016e

Browse files
jcsherinalamb
andauthored
doc: why nullable of list item is set to true (#11626)
* doc: why nullable of list item is set to true * Adds an external doc to avoid repeating text * rewrite * redirects to external doc * Adds ASF license * Minor: formatting fixes * Minor: copy edits * Retrigger CI * Fixes: name of aggregation in example In `array_agg` the list is nullable, so changed the example to `nth_value` where the list is not nullable to be correct. * Disambiguates list item nullability in copy --------- Co-authored-by: Andrew Lamb <[email protected]>
1 parent 2eaf1ea commit d6e016e

File tree

6 files changed

+83
-4
lines changed

6 files changed

+83
-4
lines changed
+77
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# Why Is List Item Always Nullable?
21+
22+
## Motivation
23+
24+
There were independent proposals to make the `nullable` setting of list
25+
items in accumulator state be configurable. This meant adding additional
26+
fields which captured the `nullable` setting from schema in planning for
27+
the first argument to the aggregation function, and the returned value.
28+
29+
These fields were to be added to `StateFieldArgs`. But then we found out
30+
that aggregate computation does not depend on it, and it can be avoided.
31+
32+
This document exists to make that reasoning explicit.
33+
34+
## Background
35+
36+
The list data type is used in the accumulator state for a few aggregate
37+
functions like:
38+
39+
- `sum`
40+
- `count`
41+
- `array_agg`
42+
- `bit_and`, `bit_or` and `bit_xor`
43+
- `nth_value`
44+
45+
In all of the above cases the data type of the list item is equivalent
46+
to either the first argument of the aggregate function or the returned
47+
value.
48+
49+
For example, in `array_agg` the data type of item is equivalent to the
50+
first argument and the definition looks like this:
51+
52+
```rust
53+
// `args` : `StateFieldArgs`
54+
// `input_type` : data type of the first argument
55+
let mut fields = vec![Field::new_list(
56+
format_state_name(self.name(), "nth_value"),
57+
Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),
58+
false, // nullable of list itself
59+
)];
60+
```
61+
62+
For all the aggregates listed above, the list item is always defined as
63+
nullable.
64+
65+
## Computing Intermediate State
66+
67+
By setting `nullable` (of list item) to be always `true` like this we
68+
ensure that the aggregate computation works even when nulls are
69+
present. The advantage of doing it this way is that it eliminates the
70+
need for additional code and special treatment of nulls in the
71+
accumulator state.
72+
73+
## Nullable Of List Itself
74+
75+
The `nullable` of list itself depends on the aggregate. In the case of
76+
`array_agg` the list is nullable(`true`), meanwhile for `sum` the list
77+
is not nullable(`false`).

datafusion/functions-aggregate/src/array_agg.rs

+2
Original file line numberDiff line numberDiff line change
@@ -89,13 +89,15 @@ impl AggregateUDFImpl for ArrayAgg {
8989
if args.is_distinct {
9090
return Ok(vec![Field::new_list(
9191
format_state_name(args.name, "distinct_array_agg"),
92+
// See COMMENTS.md to understand why nullable is set to true
9293
Field::new("item", args.input_type.clone(), true),
9394
true,
9495
)]);
9596
}
9697

9798
let mut fields = vec![Field::new_list(
9899
format_state_name(args.name, "array_agg"),
100+
// See COMMENTS.md to understand why nullable is set to true
99101
Field::new("item", args.input_type.clone(), true),
100102
true,
101103
)];

datafusion/functions-aggregate/src/bit_and_or_xor.rs

+1
Original file line numberDiff line numberDiff line change
@@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
203203
args.name,
204204
format!("{} distinct", self.name()).as_str(),
205205
),
206+
// See COMMENTS.md to understand why nullable is set to true
206207
Field::new("item", args.return_type.clone(), true),
207208
false,
208209
)])

datafusion/functions-aggregate/src/count.rs

+1
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,7 @@ impl AggregateUDFImpl for Count {
125125
if args.is_distinct {
126126
Ok(vec![Field::new_list(
127127
format_state_name(args.name, "count distinct"),
128+
// See COMMENTS.md to understand why nullable is set to true
128129
Field::new("item", args.input_type.clone(), true),
129130
false,
130131
)])

datafusion/functions-aggregate/src/nth_value.rs

+1-4
Original file line numberDiff line numberDiff line change
@@ -134,10 +134,7 @@ impl AggregateUDFImpl for NthValueAgg {
134134
fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<Field>> {
135135
let mut fields = vec![Field::new_list(
136136
format_state_name(self.name(), "nth_value"),
137-
// TODO: The nullability of the list element should be configurable.
138-
// The hard-coded `true` should be changed once the field for
139-
// nullability is added to `StateFieldArgs` struct.
140-
// See: https://github.com/apache/datafusion/pull/11063
137+
// See COMMENTS.md to understand why nullable is set to true
141138
Field::new("item", args.input_type.clone(), true),
142139
false,
143140
)];

datafusion/functions-aggregate/src/sum.rs

+1
Original file line numberDiff line numberDiff line change
@@ -174,6 +174,7 @@ impl AggregateUDFImpl for Sum {
174174
if args.is_distinct {
175175
Ok(vec![Field::new_list(
176176
format_state_name(args.name, "sum distinct"),
177+
// See COMMENTS.md to understand why nullable is set to true
177178
Field::new("item", args.return_type.clone(), true),
178179
false,
179180
)])

0 commit comments

Comments
 (0)