Commit 34efd1f

More comment to aggregation fuzzer (#15048)
1 parent 986be19 commit 34efd1f

2 files changed: +42 -1 lines changed

datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs

Lines changed: 22 additions & 1 deletion
@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
 
 /// Dataset generator
 ///
-/// It will generate one random [`Dataset`] when `generate` function is called.
+/// It will generate random [`Dataset`]s when the `generate` function is called. For each
+/// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+/// dataset will be chunked into staggered batches.
+///
+/// # Example
+/// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+/// datasets. The first one will be sorted by column `a` and get randomly chunked
+/// into staggered batches. It might look like the following:
+/// ```text
+/// a b
+/// ----
+/// 1 2 <-- batch 1
+/// 1 1
+///
+/// 2 1 <-- batch 2
+///
+/// 3 3 <-- batch 3
+/// 4 3
+/// 4 1
+/// ```
+///
+/// # Implementation details:
 ///
 /// The generation logic in `generate`:
 ///
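
The sort-then-stagger behavior described in the doc comment above can be sketched without Arrow. This is a minimal illustration under stated assumptions, not the code in `data_generator.rs`: the `Row` alias, the `Lcg` helper (a stand-in for a real random number generator), and the functions `sort_by_column`, `stagger`, and `generate` are hypothetical names introduced only for this sketch. It follows the example in the comment, where two sort keys yield two datasets.

```rust
/// A row with two integer columns, `a` and `b`.
/// Hypothetical stand-in for a row of a `RecordBatch`.
type Row = (i64, i64);

/// Tiny linear congruential generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in `1..=max` (`max` must be at least 1).
    fn next_in(&mut self, max: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize % max) + 1
    }
}

/// Returns a copy of `rows` sorted by column `a` (index 0) or `b` (any other index).
fn sort_by_column(rows: &[Row], column: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|row| if column == 0 { row.0 } else { row.1 });
    sorted
}

/// Splits `rows` into consecutive batches of random, uneven sizes.
fn stagger(rows: &[Row], max_batch_size: usize, rng: &mut Lcg) -> Vec<Vec<Row>> {
    let mut batches = Vec::new();
    let mut rest = rows;
    while !rest.is_empty() {
        let size = rng.next_in(max_batch_size).min(rest.len());
        let (batch, tail) = rest.split_at(size);
        batches.push(batch.to_vec());
        rest = tail;
    }
    batches
}

/// Produces one sorted, staggered dataset per sort column,
/// loosely mirroring `sort_keys_set = [["a"], ["b"]]`.
fn generate(rows: &[Row], sort_columns: &[usize], seed: u64) -> Vec<Vec<Vec<Row>>> {
    let mut rng = Lcg(seed);
    sort_columns
        .iter()
        .map(|&column| stagger(&sort_by_column(rows, column), 3, &mut rng))
        .collect()
}
```

Calling `generate(&rows, &[0, 1], seed)` returns two datasets. Concatenating the batches of the first gives the rows ordered by `a`, and of the second ordered by `b`; batch boundaries vary with the seed, which is the point of staggering.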

datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs

Lines changed: 20 additions & 0 deletions
@@ -15,6 +15,26 @@
 // specific language governing permissions and limitations
 // under the License.
 
+//! Fuzzer for aggregation functions
+//!
+//! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+//! specialized implementations for performance. For example, when the group cardinality
+//! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+//! the input is ordered by the group key, there is a separate implementation to perform
+//! streaming group by.
+//! This fuzzer checks the results of different specialized implementations and
+//! ensures their results are consistent. The execution path can be controlled by
+//! changing the input ordering or by setting related configuration parameters in
+//! `SessionContext`.
+//!
+//! # Architecture
+//! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+//! - `QueryBuilder` is used to generate candidate queries.
+//! - `DatasetGenerator` is used to generate random datasets.
+//! - `SessionContextGenerator` is used to generate `SessionContext` with
+//! different configuration parameters to control the execution path of aggregate
+//! queries.
+
 use arrow::array::RecordBatch;
 use arrow::util::pretty::pretty_format_batches;
 use datafusion::prelude::SessionContext;
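
The consistency check described in the module comment above amounts to differential testing: run the same aggregation through two implementations and compare results. The following is a minimal sketch of that idea in plain Rust, assuming a `SUM(value) GROUP BY key` query over `(key, value)` pairs. It does not use DataFusion, and `hash_group_sum`, `streaming_group_sum`, and `paths_agree` are hypothetical names for illustration; in the actual fuzzer the execution path is selected through input ordering and `SessionContext` configuration rather than by calling separate functions.

```rust
use std::collections::HashMap;

/// Hash-based `SUM(value) GROUP BY key`; works for any input order.
/// The output is sorted by key so it can be compared directly.
fn hash_group_sum(rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut sums: HashMap<i64, i64> = HashMap::new();
    for &(key, value) in rows {
        *sums.entry(key).or_insert(0) += value;
    }
    let mut out: Vec<(i64, i64)> = sums.into_iter().collect();
    out.sort();
    out
}

/// Streaming `SUM(value) GROUP BY key`; only correct when the input is
/// sorted by key, because a new group starts whenever the key changes.
fn streaming_group_sum(sorted_rows: &[(i64, i64)]) -> Vec<(i64, i64)> {
    let mut out: Vec<(i64, i64)> = Vec::new();
    for &(key, value) in sorted_rows {
        if let Some(last) = out.last_mut() {
            if last.0 == key {
                // Same group as the previous row: accumulate in place.
                last.1 += value;
                continue;
            }
        }
        // Key changed (or first row): open a new group.
        out.push((key, value));
    }
    out
}

/// Runs both paths on the same data and reports whether the results agree.
fn paths_agree(rows: &[(i64, i64)]) -> bool {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|&(key, _)| key);
    hash_group_sum(rows) == streaming_group_sum(&sorted)
}
```

A fuzzer built on this idea would call a check like `paths_agree` on many random datasets and report any dataset for which it returns `false`. Feeding unsorted rows to `streaming_group_sum` splits groups incorrectly, which illustrates why the streaming path is valid only for input ordered by the group key.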
