Updating documentation
timsaucer committed Aug 31, 2024
1 parent 6ce4cfe commit 497d490
Showing 3 changed files with 44 additions and 8 deletions.
37 changes: 37 additions & 0 deletions docs/source/user-guide/common-operations/expressions.rst
@@ -60,6 +60,43 @@ examples for the and, or, and not operations.
    heavy_red_units = (col("color") == lit("red")) & (col("weight") > lit(42))
    not_red_units = ~(col("color") == lit("red"))

Arrays
------

For columns that contain arrays of values, you can access individual elements of the array by index
using bracket indexing. This is similar to calling the function
:py:func:`datafusion.functions.array_element`, except that bracket indexing is 0-based, like
Python lists, whereas ``array_element`` uses 1-based indexing to be compatible with other SQL
approaches.

.. ipython:: python

    from datafusion import SessionContext, col

    ctx = SessionContext()
    df = ctx.from_pydict({"a": [[1, 2, 3], [4, 5, 6]]})
    df.select(col("a")[0].alias("a0"))

.. warning::

    Indexing an element of an array via ``[]`` starts at index 0 whereas
    :py:func:`~datafusion.functions.array_element` starts at index 1.
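
The off-by-one relationship in the warning above can be modeled in plain Python. This is only a sketch of the index convention, not DataFusion code: ``bracket_index`` and ``array_element_like`` are hypothetical stand-ins for ``col("a")[i]`` and ``array_element(col("a"), n)``, and they say nothing about how the library treats nulls or out-of-range indexes.

```python
# Plain-Python model of the two index conventions (not DataFusion code).
rows = [[1, 2, 3], [4, 5, 6]]  # same values as column "a" in the example


def bracket_index(values: list, i: int) -> int:
    """Hypothetical stand-in for col("a")[i]: 0-based."""
    return values[i]


def array_element_like(values: list, n: int) -> int:
    """Hypothetical stand-in for array_element(col("a"), n): 1-based."""
    return values[n - 1]


# Bracket index 0 and array_element position 1 address the same element.
first_by_bracket = [bracket_index(r, 0) for r in rows]
first_by_function = [array_element_like(r, 1) for r in rows]
print(first_by_bracket, first_by_function)
```

Under this model, bracket index ``i`` corresponds to ``array_element`` position ``i + 1``.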

Structs
-------

Fields of columns that contain structs can be accessed using bracket notation, as if the values
were Python dictionaries. The key passed in the brackets must be a string.

.. ipython:: python

    ctx = SessionContext()
    data = {"a": [{"size": 15, "color": "green"}, {"size": 10, "color": "blue"}]}
    df = ctx.from_pydict(data)
    df.select(col("a")["size"].alias("a_size"))

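
For comparison, the same field lookup can be written against the raw Python dictionaries. This is a plain-Python sketch of the "dictionary style" access the text describes, not DataFusion code; the variable name ``a_size`` simply mirrors the alias used above.

```python
# Plain-Python analogue of selecting the "size" field from each struct.
data = {"a": [{"size": 15, "color": "green"}, {"size": 10, "color": "blue"}]}

# One value per row, looked up by string key as with col("a")["size"].
a_size = [row["size"] for row in data["a"]]
print(a_size)
```
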
Functions
---------

14 changes: 6 additions & 8 deletions python/datafusion/dataframe.py
@@ -548,17 +548,15 @@ def __arrow_c_stream__(self, requested_schema: pa.Schema) -> Any:
     def transform(self, func: Callable[..., DataFrame], *args: Any) -> DataFrame:
         """Apply a function to the current DataFrame which returns another DataFrame.
 
-        This is useful for chaining together multiple functions. For example
+        This is useful for chaining together multiple functions. For example::
 
-        ```python
-        def add_3(df: DataFrame) -> DataFrame:
-            return df.with_column("modified", lit(3))
+            def add_3(df: DataFrame) -> DataFrame:
+                return df.with_column("modified", lit(3))
 
-        def within_limit(df: DataFrame, limit: int) -> DataFrame:
-            return df.filter(col("a") < lit(limit)).distinct()
+            def within_limit(df: DataFrame, limit: int) -> DataFrame:
+                return df.filter(col("a") < lit(limit)).distinct()
 
-        df = df.transform(modify_df).transform(within_limit, 4)
-        ```
+            df = df.transform(modify_df).transform(within_limit, 4)
 
         Args:
             func: A callable function that takes a DataFrame as it's first argument
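
Note that the docstring example calls ``modify_df``, which is never defined in it; ``add_3`` was presumably intended. The chaining pattern itself can be sketched without DataFusion. In the sketch below, ``MiniFrame`` is a hypothetical stand-in for ``DataFrame`` (a list of row dicts), and its ``transform`` assumes the semantics the docstring describes: call ``func`` with the frame as the first argument, followed by any extra arguments.

```python
from typing import Any, Callable


class MiniFrame:
    """Hypothetical stand-in for DataFrame: a list of row dicts."""

    def __init__(self, rows: list) -> None:
        self.rows = rows

    def transform(self, func: Callable[..., "MiniFrame"], *args: Any) -> "MiniFrame":
        # Assumed semantics, per the docstring: func receives the frame first.
        return func(self, *args)


def add_3(df: MiniFrame) -> MiniFrame:
    # Analogue of with_column("modified", lit(3)).
    return MiniFrame([{**row, "modified": 3} for row in df.rows])


def within_limit(df: MiniFrame, limit: int) -> MiniFrame:
    # Analogue of filter(col("a") < lit(limit)).distinct().
    unique = []
    for row in df.rows:
        if row["a"] < limit and row not in unique:
            unique.append(row)
    return MiniFrame(unique)


frame = MiniFrame([{"a": 1}, {"a": 1}, {"a": 5}])
result = frame.transform(add_3).transform(within_limit, 4)
print(result.rows)
```

Because ``transform`` returns whatever ``func`` returns, each call can be chained directly onto the previous one, and extra positional arguments such as ``4`` are forwarded to the function unchanged.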
1 change: 1 addition & 0 deletions python/datafusion/tests/test_dataframe.py
@@ -876,6 +876,7 @@ def test_dataframe_export(df) -> None:
        failed_convert = True
    assert failed_convert


def test_dataframe_transform(df):
    def add_string_col(df_internal) -> DataFrame:
        return df_internal.with_column("string_col", literal("string data"))