feat: Add `flatten` array function #562

mobley-trent · 2024-01-16T11:20:11Z

Which issue does this PR close?

Refer to issue #463

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

mobley-trent · 2024-01-16T11:47:30Z

Hello @andygrove, do you mind giving me a hand with this PR? I exposed Flatten in functions.rs, but the Python array function test for flatten is failing like so:

name = 'flatten'

    def __getattr__(name):
>       return getattr(functions, name)
E       AttributeError: module 'functions' has no attribute 'flatten'

ongchi · 2024-02-06T09:23:03Z

Hi @mobley-trent
Did you try to rebuild the package before running pytest? Like this:

# build and install package
maturin develop

Also, don't forget to activate the venv before running this command.
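To illustrate why a stale build produces this error, here is a minimal sketch in plain Python. It assumes only what the traceback shows: the Python wrapper forwards attribute lookups to a `functions` module. The two module objects, `make_wrapper`, and `has_function` are hypothetical stand-ins written for this illustration, not the project's actual code.

```python
import types

# Hypothetical stand-in for the compiled extension from an old build:
# it was built before `flatten` was exposed, so it only offers `array`.
stale_functions = types.ModuleType("functions")
stale_functions.array = lambda *args: list(args)

# Hypothetical stand-in for the same module after a rebuild.
rebuilt_functions = types.ModuleType("functions")
rebuilt_functions.array = stale_functions.array
rebuilt_functions.flatten = lambda nested: [x for inner in nested for x in inner]


def make_wrapper(functions):
    """Build a forwarding __getattr__ like the one shown in the traceback."""
    def __getattr__(name):
        return getattr(functions, name)
    return __getattr__


def has_function(functions, name):
    """Return True if the forwarding lookup succeeds, False on AttributeError."""
    lookup = make_wrapper(functions)
    try:
        lookup(name)
        return True
    except AttributeError:
        return False


print(has_function(stale_functions, "flatten"))
print(has_function(rebuilt_functions, "flatten"))
```

The first lookup fails in the same way as the test, because the Rust change only becomes visible to Python once the extension has been rebuilt and reinstalled.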

mobley-trent · 2024-02-10T12:07:57Z

Hey @ongchi, I tested the flatten function and it's failing. Here is the code:

from datafusion import SessionContext, column
from datafusion import functions as f
import numpy as np
import pyarrow as pa


def py_flatten(arr):
    # Testing helper function
    result = []
    for elem in arr:
        if isinstance(elem, list):
            result.extend(py_flatten(elem))
        else:
            result.append(elem)
    return result

ctx = SessionContext()
data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

batch = pa.RecordBatch.from_arrays(
    [np.array(data, dtype=object)], names=["arr"]
)
df = ctx.create_dataframe([[batch]])
col = column("arr")


stmt = f.flatten(col)
py_expr = lambda: [py_flatten(data)]

result = df.select(stmt).collect()[0].column(0).tolist()

print(f"flatten query: {result}")
print(f"py_expr: {py_expr()}")

Results:

>>> flatten query: [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
>>> py_expr: [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]

I expected the flatten query result to be identical to py_expr. Is there something I overlooked, or is this an underlying bug?

mobley-trent · 2024-02-10T12:48:55Z

Using a regular flatten query:

ctx = SessionContext()
ctx.sql("select flatten([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]);")

Result:

DataFrame()
+----------------------------------------------------------------------------------------------------------------------------+
| flatten(make_array(make_array(Float64(1),Float64(2),Float64(3)),make_array(Float64(4),Float64(5)),make_array(Float64(6)))) |
+----------------------------------------------------------------------------------------------------------------------------+
| [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                                                                                             |
+----------------------------------------------------------------------------------------------------------------------------+

ongchi · 2024-02-11T01:13:08Z

Hi @mobley-trent
The df created in the test case may be a bit misleading. It is equivalent to this:

❯ SELECT column1 AS arr FROM (VALUES ([1.0, 2.0, 3.0, 3.0]), ([4.0, 5.0, 3.0]), ([6.0]));
+----------------------+
| arr                  |
+----------------------+
| [1.0, 2.0, 3.0, 3.0] |
| [4.0, 5.0, 3.0]      |
| [6.0]                |
+----------------------+

It contains multiple rows of one-dimensional array values. To test the flatten function, either the existing df should be modified or a new dataframe should be created for this test case.
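The difference can be sketched in plain Python, without DataFusion. This is only a model, under the assumption that the function is evaluated independently for each row; `flatten_value` and `apply_per_row` are hypothetical helpers written for this illustration.

```python
def flatten_value(value):
    """Recursively flatten one (possibly nested) list value."""
    out = []
    for elem in value:
        if isinstance(elem, list):
            out.extend(flatten_value(elem))
        else:
            out.append(elem)
    return out


def apply_per_row(rows, func):
    """Model a scalar function: evaluate it independently for every row."""
    return [func(row) for row in rows]


data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Three rows, each holding a one-dimensional array: nothing to flatten,
# so every row comes back unchanged.
three_rows = apply_per_row(data, flatten_value)

# One row holding the whole nested array: flattening collapses it.
one_row = apply_per_row([data], flatten_value)

print(three_rows)
print(one_row)
```

In this model the first call reproduces the unchanged result from the failing test, while the second matches the expected py_expr output, which is why the test input needs to be a single nested value rather than three separate rows.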

andygrove

Thanks @mobley-trent

mobley-trent · 2024-02-21T09:56:35Z

Fixed the merge conflicts

Updated functions.rs, test_functions.py - Flatten

f141dae

mobley-trent marked this pull request as ready for review January 16, 2024 11:44

Updated functions.rs - Formatting

514a5de

mobley-trent added 2 commits February 12, 2024 14:52

Updated test_functions.py - Converted test array to a literal for testing

9fec85b

Updated test_functions.py - Linting

8721103

andygrove approved these changes Feb 21, 2024

View reviewed changes

Merge branch 'main' of https://github.com/apache/arrow-datafusion-python into flatten

8545f4a

mobley-trent requested a review from andygrove February 21, 2024 08:59

andygrove merged commit 27a9264 into apache:main Feb 25, 2024
10 checks passed
