2 files changed, +42 -1 lines changed
datafusion/core/tests/fuzz_cases/aggregation_fuzzer

@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
  /// Dataset generator
  ///
- /// It will generate one random [`Dataset`] when `generate` function is called.
+ /// It will generate random [`Dataset`]s when the `generate` function is called. For each
+ /// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+ /// dataset will be chunked into staggered batches.
+ ///
+ /// # Example
+ /// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+ /// datasets. The first one will be sorted by column `a` and get randomly chunked
+ /// into staggered batches. It might look like the following:
+ /// ```text
+ /// a b
+ /// ----
+ /// 1 2 <-- batch 1
+ /// 1 1
+ ///
+ /// 2 1 <-- batch 2
+ ///
+ /// 3 3 <-- batch 3
+ /// 4 3
+ /// 4 1
+ /// ```
+ ///
+ /// # Implementation details:
  ///
  /// The generation logic in `generate`:
  ///
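The "sorted, then chunked into staggered batches" behavior described in the new doc comment can be illustrated with a small standalone sketch. This is not the `DatasetGenerator` code from the PR: the `Row` alias, the `Lcg` helper, and `sort_and_stagger` are hypothetical names, rows are plain tuples instead of Arrow `RecordBatch`es, and a tiny linear congruential generator stands in for a real random number source so that the example needs no external crates.

```rust
/// Hypothetical stand-in for a dataset row with two integer columns `a` and `b`.
type Row = (i64, i64);

/// Minimal linear congruential generator, used only to keep this sketch
/// dependency-free. It is not suitable for anything beyond illustration.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in `0..upper`. `upper` must be non-zero.
    fn next_in_range(&mut self, upper: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % upper
    }
}

/// Sorts the rows by column `a`, then splits them into consecutive batches
/// whose sizes are chosen pseudo-randomly in `1..=max_batch_rows`.
fn sort_and_stagger(mut rows: Vec<Row>, max_batch_rows: usize, seed: u64) -> Vec<Vec<Row>> {
    assert!(max_batch_rows > 0, "batches must hold at least one row");
    rows.sort_by_key(|row| row.0);

    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut remaining = rows.as_slice();
    while !remaining.is_empty() {
        // Never ask for more rows than are left.
        let limit = max_batch_rows.min(remaining.len());
        let size = 1 + rng.next_in_range(limit);
        let (head, tail) = remaining.split_at(size);
        batches.push(head.to_vec());
        remaining = tail;
    }
    batches
}
```

Calling `sort_and_stagger(rows, 3, 42)` on the six rows from the doc comment's example yields batches of one to three rows each. Concatenating the batches gives back all rows in ascending order of `a`, which is the property a streaming group-by implementation relies on.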

  // specific language governing permissions and limitations
  // under the License.

+ //! Fuzzer for aggregation functions
+ //!
+ //! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+ //! specialized implementations for performance. For example, when the group cardinality
+ //! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+ //! the input is ordered by the group key, there is a separate implementation to perform
+ //! streaming group by.
+ //! This fuzzer checks the results of different specialized implementations and
+ //! ensures their results are consistent. The execution path can be controlled by
+ //! changing the input ordering or by setting related configuration parameters in
+ //! `SessionContext`.
+ //!
+ //! # Architecture
+ //! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+ //! - `QueryBuilder` is used to generate candidate queries.
+ //! - `DatasetGenerator` is used to generate random datasets.
+ //! - `SessionContextGenerator` is used to generate `SessionContext` with
+ //!   different configuration parameters to control the execution path of aggregate
+ //!   queries.
+
  use arrow::array::RecordBatch;
  use arrow::util::pretty::pretty_format_batches;
  use datafusion::prelude::SessionContext;
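The consistency check described in the module documentation can be sketched without DataFusion at all. The example below is an illustration under simplifying assumptions, not the fuzzer's code: `hash_group_sum`, `streaming_group_sum`, and `results_consistent` are hypothetical names, the only aggregate is `SUM`, and rows are `(key, value)` tuples. It shows two implementations of the same group-by, one hash-based and one streaming over sorted input, and compares their results.

```rust
use std::collections::HashMap;

/// Hash-based group-by sum: accepts input in any order.
/// The result is sorted by key so that it can be compared deterministically.
fn hash_group_sum(rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut sums: HashMap<i64, i64> = HashMap::new();
    for &(key, value) in rows {
        *sums.entry(key).or_insert(0) += value;
    }
    let mut out: Vec<(i64, i64)> = sums.into_iter().collect();
    out.sort();
    out
}

/// Streaming group-by sum: assumes the input is already sorted by key,
/// so each group is a contiguous run and only the current group is tracked.
fn streaming_group_sum(sorted_rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out: Vec<(i64, i64)> = Vec::new();
    for &(key, value) in sorted_rows {
        if let Some(last) = out.last_mut() {
            if last.0 == key {
                // Same group as the previous row: extend the running sum.
                last.1 += value;
                continue;
            }
        }
        // A new key starts a new group.
        out.push((key, value));
    }
    out
}

/// Runs both implementations on the same data and reports whether they agree.
fn results_consistent(rows: &[(i64, i64)]) -> bool {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|row| row.0);
    hash_group_sum(rows) == streaming_group_sum(&sorted)
}
```

A fuzzer built on this idea would call something like `results_consistent` on many randomly generated inputs and report any input for which the two paths disagree.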