Commit afb169c

yyy1000 and alamb authored
[Document] Adding UDF by impl ScalarUDFImpl (apache#9172)
* doc: update scalar udf
* Update docs/source/library-user-guide/adding-udfs.md (Co-authored-by: Andrew Lamb <[email protected]>)
* Update docs/source/library-user-guide/adding-udfs.md (Co-authored-by: Andrew Lamb <[email protected]>)
* apply suggestions and format

Co-authored-by: Andrew Lamb <[email protected]>
1 parent b2ff63f commit afb169c

1 file changed

+82
-5
lines changed

docs/source/library-user-guide/adding-udfs.md

+82-5
@@ -34,7 +34,87 @@ First we'll talk about adding an Scalar UDF end-to-end, then we'll talk about th
3434

3535
## Adding a Scalar UDF
3636

37-
A Scalar UDF is a function that takes a row of data and returns a single value. For example, this function takes a single i64 and returns a single i64 with 1 added to it:
37+
A Scalar UDF is a function that takes a row of data and returns a single value. For good performance,
38+
such functions are "vectorized" in DataFusion, meaning they get one or more Arrow Arrays as input and produce
39+
an Arrow Array with the same number of rows as output.
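
To make "vectorized" concrete, here is a minimal plain-Rust sketch that does not use Arrow at all. It models a nullable Int64 column as a `Vec<Option<i64>>` (an assumption made purely for illustration; `Int64Column` and `add_one_column` are hypothetical names, not DataFusion or Arrow APIs). The function processes a whole column per call and returns a column with the same number of rows.

```rust
/// Plain-Rust stand-in for a nullable Arrow Int64 column.
/// `None` models a null slot. This is NOT an Arrow type.
type Int64Column = Vec<Option<i64>>;

/// Vectorized `add_one`: takes a whole column at once and returns a new
/// column with the same number of rows. Null slots stay null.
fn add_one_column(input: &[Option<i64>]) -> Int64Column {
    input.iter().map(|slot| slot.map(|value| value + 1)).collect()
}
```

Calling `add_one_column(&[Some(1), None, Some(3)])` yields `vec![Some(2), None, Some(4)]`: one function call handles every row, rather than one call per row.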
40+
41+
To create a Scalar UDF, you:
42+
43+
1. Implement the `ScalarUDFImpl` trait to tell DataFusion about your function, such as what types of arguments it takes and how to calculate the results.
44+
2. Create a `ScalarUDF` and register it with `SessionContext::register_udf` so it can be invoked by name.
45+
46+
In the following example, we will add a function that takes a single `i64` and returns a single `i64` with 1 added to it.
47+
48+
For brevity, we'll skip some error handling, but you may want to check, for example, that `args.len()` is the expected number of arguments.
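
An argument-count check of the kind mentioned above could look roughly like the following sketch. `check_arg_count` is a hypothetical helper written in plain Rust with a `String` error. In a real UDF you would return a DataFusion error instead, for example via `plan_err!`.

```rust
/// Hypothetical helper (not part of DataFusion): returns an error unless
/// exactly `expected` arguments were passed to the function `fn_name`.
fn check_arg_count<T>(fn_name: &str, args: &[T], expected: usize) -> Result<(), String> {
    if args.len() != expected {
        return Err(format!(
            "{fn_name} expects {expected} argument(s), got {}",
            args.len()
        ));
    }
    Ok(())
}
```

Running the check before touching `args[0]` turns a potential out-of-bounds panic into an ordinary error that the caller can report.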
49+
50+
### Adding by `impl ScalarUDFImpl`
51+
52+
This is a lower-level API that offers more functionality but is more complex. It is also documented in [`advanced_udf.rs`].
53+
54+
```rust
use std::any::Any;
use std::sync::Arc;
use arrow::array::Int64Array;
use arrow::datatypes::DataType;
use datafusion_common::cast::as_int64_array;
use datafusion_common::{DataFusionError, plan_err, Result};
use datafusion_expr::{col, ColumnarValue, Signature, Volatility};
use datafusion_expr::{ScalarUDFImpl, ScalarUDF};
// NOTE: `columnar_values_to_array` (used in `invoke` below) must also be in
// scope; this guide does not show its import path.

#[derive(Debug)]
struct AddOne {
    signature: Signature,
}

impl AddOne {
    fn new() -> Self {
        Self {
            // Exactly one Int64 argument; the same input always gives the same output
            signature: Signature::uniform(1, vec![DataType::Int64], Volatility::Immutable),
        }
    }
}

/// Implement the ScalarUDFImpl trait for AddOne
impl ScalarUDFImpl for AddOne {
    fn as_any(&self) -> &dyn Any { self }
    fn name(&self) -> &str { "add_one" }
    fn signature(&self) -> &Signature { &self.signature }
    fn return_type(&self, args: &[DataType]) -> Result<DataType> {
        if !matches!(args.get(0), Some(&DataType::Int64)) {
            return plan_err!("add_one only accepts Int64 arguments");
        }
        Ok(DataType::Int64)
    }
    // Add one to each element of the argument array, preserving nulls
    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
        let args = columnar_values_to_array(args)?;
        let i64s = as_int64_array(&args[0])?;

        let new_array = i64s
            .iter()
            .map(|array_elem| array_elem.map(|value| value + 1))
            .collect::<Int64Array>();
        // Wrap the result array so it matches the declared return type
        Ok(ColumnarValue::Array(Arc::new(new_array)))
    }
}
```
98+
99+
We now need to register the function with DataFusion so that it can be used in the context of a query.
100+
101+
```rust
// Create a new ScalarUDF from the implementation
let add_one = ScalarUDF::from(AddOne::new());

// Register the UDF with the context so it can be invoked by name and from SQL
let mut ctx = SessionContext::new();
ctx.register_udf(add_one.clone());

// Call the function `add_one(col)`
let expr = add_one.call(vec![col("a")]);
```
112+
113+
### Adding a Scalar UDF by [`create_udf`]
114+
115+
There is also an older, more concise, but more limited API: [`create_udf`].
116+
117+
#### Adding a Scalar UDF
38118

39119
```rust
40120
use std::sync::Arc;
@@ -58,8 +138,6 @@ pub fn add_one(args: &[ColumnarValue]) -> Result<ArrayRef> {
58138
}
59139
```
60140

61-
For brevity, we'll skipped some error handling, but e.g. you may want to check that `args.len()` is the expected number of arguments.
62-
63141
This "works" in isolation, i.e. if you have a slice of `ArrayRef`s, you can call `add_one` and it will return a new `ArrayRef` with 1 added to each value.
64142

65143
```rust
@@ -74,11 +152,10 @@ assert_eq!(result, &Int64Array::from(vec![Some(2), None, Some(4)]));
74152

75153
The challenge however is that DataFusion doesn't know about this function. We need to register it with DataFusion so that it can be used in the context of a query.
76154

77-
### Registering a Scalar UDF
155+
#### Registering a Scalar UDF
78156

79157
To register a Scalar UDF, you need to wrap the function implementation in a [`ScalarUDF`] struct and then register it with the `SessionContext`.
80158
DataFusion provides the [`create_udf`] helper function to make this easier.
81-
There is a lower level API with more functionality but is more complex, that is documented in [`advanced_udf.rs`].
82159

83160
```rust
84161
use datafusion::logical_expr::{Volatility, create_udf};
