Commit 9d35893

alamb, ozankabak
authored and committed
Docs: Document creating new extension APIs (apache#11425)
* Docs: Document creating new extension APIs
* fix
* Add clarification about extension APIs. Thanks @ozankabak
* Apply suggestions from code review

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Add a paragraph on datafusion-contrib
* prettier

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
1 parent 9a084a2 commit 9d35893

2 files changed: +75 -1 lines changed

datafusion/core/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@
174174
//!
175175
//! DataFusion is designed to be highly extensible, so you can
176176
//! start with a working, full featured engine, and then
177-
//! specialize any behavior for their usecase. For example,
177+
//! specialize any behavior for your usecase. For example,
178178
//! some projects may add custom [`ExecutionPlan`] operators, or create their own
179179
//! query language that directly creates [`LogicalPlan`] rather than using the
180180
//! built in SQL planner, [`SqlToRel`].

docs/source/contributor-guide/architecture.md

Lines changed: 74 additions & 0 deletions
@@ -25,3 +25,77 @@ possible. You can find the most up to date version in the [source code].
2525

2626
[crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
2727
[source code]: https://github.com/apache/datafusion/blob/main/datafusion/core/src/lib.rs
28+
29+
## Forks vs Extension APIs
30+
31+
DataFusion is a fast moving project, which results in frequent internal changes.
32+
This benefits DataFusion by allowing it to evolve and respond quickly to
33+
requests, but also means that maintaining a fork with major modifications
34+
sometimes requires non-trivial work.
35+
36+
The public API (what is accessible if you use the DataFusion releases from
37+
crates.io) is typically much more stable (though it does change from release to
38+
release as well).
39+
40+
Thus, rather than forks, we recommend using one of the many extension APIs (such
41+
as `TableProvider`, `OptimizerRule`, or `ExecutionPlan`) to customize
42+
DataFusion. If you cannot do what you want with the existing APIs, we would
43+
welcome you working with us to add new APIs to enable your use case, as
44+
described in the next section.
45+
46+
## `datafusion-contrib`
47+
48+
While DataFusion comes with enough features "out of the box" to quickly start
49+
with a working system, it can't include every useful feature (e.g.
50+
`TableProvider`s for all data formats). The [`datafusion-contrib`] project
51+
contains a collection of community maintained extensions that are not part of
52+
the core DataFusion project, and not under Apache Software Foundation governance
53+
but may be useful to others in the community. If you are interested in adding a
54+
feature to DataFusion, a new extension in `datafusion-contrib` is likely a good
55+
place to start. Please [contact] us via GitHub issue, Slack, or Discord and
56+
we'll gladly set up a new repository for your extension.
57+
58+
[`datafusion-contrib`]: https://github.com/datafusion-contrib
59+
[contact]: ../contributor-guide/communication.md
60+
61+
## Creating new Extension APIs
62+
63+
DataFusion aims to be a general-purpose query engine, and thus the core crates
64+
contain features that are useful for a wide range of use cases. Use case specific
65+
functionality (such as very specific time series or stream processing features)
66+
is typically implemented using the extension APIs.
67+
68+
If you have a use case that is not covered by the existing APIs, we would love to
69+
work with you to design a new general purpose API. There are often others who are
70+
interested in similar extensions and the act of defining the API often improves
71+
the code overall for everyone.
72+
73+
Extension APIs that provide "safe" default behaviors are more likely to be
74+
suitable for inclusion in DataFusion, while APIs that require major changes to
75+
built-in operators are less likely. For example, it might make less sense
76+
to add an API to support a stream processing feature if that would result in
77+
slower performance for built-in operators. It may still make sense to add
78+
extension APIs for such features, but leave implementation of such operators in
79+
downstream projects.
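
As a rough illustration of what a "safe" default could look like, here is a minimal, self-contained sketch. It does not use any DataFusion types: `ExprRewriteExtension`, `NoExtension`, `UppercaseKeywords`, and `plan` are hypothetical names invented for this example. The point it tries to show is that the built-in path is unchanged unless an extension explicitly opts in.

```rust
// Hypothetical sketch only: none of these names are DataFusion APIs.

/// A hypothetical extension point. The default method is a no-op, so an
/// implementation that overrides nothing leaves built-in behavior untouched.
trait ExprRewriteExtension {
    /// Return `Some(rewritten)` to take over, or `None` to use the built-in path.
    fn rewrite(&self, _expr: &str) -> Option<String> {
        None
    }
}

/// Uses the "safe" default: no custom behavior at all.
struct NoExtension;
impl ExprRewriteExtension for NoExtension {}

/// A downstream extension that opts in to custom behavior.
struct UppercaseKeywords;
impl ExprRewriteExtension for UppercaseKeywords {
    fn rewrite(&self, expr: &str) -> Option<String> {
        Some(expr.replace("select", "SELECT"))
    }
}

/// Stand-in for a built-in code path that consults the extension first
/// and otherwise falls back to its own unchanged logic.
fn plan(expr: &str, ext: &dyn ExprRewriteExtension) -> String {
    ext.rewrite(expr).unwrap_or_else(|| expr.to_string())
}

fn main() {
    // With the default, the input passes through unchanged.
    assert_eq!(plan("select 1", &NoExtension), "select 1");
    // With the opt-in extension, the custom rewrite is applied.
    assert_eq!(plan("select 1", &UppercaseKeywords), "SELECT 1");
}
```

In this sketch the use-case-specific logic lives entirely in the downstream type, which mirrors the idea of adding the extension API upstream while leaving the specialized operator implementation to downstream projects.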
80+
81+
The process to create a new extension API is typically:
82+
83+
- Look for an existing issue describing what you want to do, and file one if it
84+
doesn't yet exist.
85+
- Discuss what the API would look like. Feel free to ask contributors (via `@`
86+
mentions) for feedback (you can find such people by looking at the most
87+
recently changed PRs and issues)
88+
- Prototype the new API, typically by adding an example (in
89+
`datafusion-examples` or refactoring existing code) to show how it would work
90+
- Create a PR with the new API, and work with the community to get it merged
91+
92+
Some benefits of using an example-based approach are:
93+
94+
- Any future API changes will also keep your example working, ensuring no
95+
regression in functionality
96+
- There will be a blueprint of any needed changes to your code if the APIs do change
97+
(just look at what changed in your example)
98+
99+
An example of this process was [creating a SQL Extension Planning API].
100+
101+
[creating a sql extension planning api]: https://github.com/apache/datafusion/issues/11207
