Skip to content

doc: why nullable of list item is set to true #11626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 26, 2024
77 changes: 77 additions & 0 deletions datafusion/functions-aggregate/COMMENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Why Is List Item Always Nullable?

## Motivation

There were independent proposals to make the `nullable` setting of list
items in accumulator state be configurable. This meant adding additional
fields which captured the `nullable` setting from schema in planning for
the first argument to the aggregation function, and the returned value.

These fields were to be added to `StateFieldArgs`. But then we found out
that aggregate computation does not depend on it, and it can be avoided.

This document exists to make that reasoning explicit.

## Background

The list data type is used in the accumulator state for a few aggregate
functions like:

- `sum`
- `count`
- `array_agg`
- `bit_and`, `bit_or` and `bit_xor`
- `nth_value`

In all of the above cases the data type of the list item is equivalent
to either the first argument of the aggregate function or the returned
value.

For example, in `array_agg` the data type of item is equivalent to the
first argument and the definition looks like this:

```rust
// `args` : `StateFieldArgs`
// `input_type` : data type of the first argument
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),
false, // nullable of list itself
)];
```

For all the aggregates listed above, the list item is always defined as
nullable.

## Computing Intermediate State

By setting `nullable` (of list item) to be always `true` like this we
ensure that the aggregate computation works even when nulls are
present. The advantage of doing it this way is that it eliminates the
need for additional code and special treatment of nulls in the
accumulator state.

## Nullable Of List Itself

The `nullable` of list itself depends on the aggregate. In the case of
`array_agg` the list is nullable(`true`), meanwhile for `sum` the list
is not nullable(`false`).
2 changes: 2 additions & 0 deletions datafusion/functions-aggregate/src/array_agg.rs
Original file line number Diff line number Diff line change
Expand Up @@ -89,13 +89,15 @@ impl AggregateUDFImpl for ArrayAgg {
if args.is_distinct {
return Ok(vec![Field::new_list(
format_state_name(args.name, "distinct_array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)]);
}

let mut fields = vec![Field::new_list(
format_state_name(args.name, "array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/bit_and_or_xor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
args.name,
format!("{} distinct", self.name()).as_str(),
),
// See COMMENTS.md to understand why nullable is set to true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/count.rs
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ impl AggregateUDFImpl for Count {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "count distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)])
Expand Down
5 changes: 1 addition & 4 deletions datafusion/functions-aggregate/src/nth_value.rs
Original file line number Diff line number Diff line change
Expand Up @@ -134,10 +134,7 @@ impl AggregateUDFImpl for NthValueAgg {
fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<Field>> {
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
// TODO: The nullability of the list element should be configurable.
// The hard-coded `true` should be changed once the field for
// nullability is added to `StateFieldArgs` struct.
// See: https://github.com/apache/datafusion/pull/11063
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/sum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,7 @@ impl AggregateUDFImpl for Sum {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "sum distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down