custom_datasource example panicked during RepartitionExec planning #15493

Closed
2010YOUY01 opened this issue Mar 30, 2025 · 10 comments · Fixed by #15496
Labels
bug Something isn't working

Comments

@2010YOUY01
Contributor

Describe the bug

I have a PR that does not touch the repartition code, yet it triggered an assertion failure inside RepartitionExec's execute() method while running the custom_datasource.rs example.
The failed CI job run: https://github.com/apache/datafusion/actions/runs/14152369014/job/39647355093
The CI passed after a re-run, so this might be a heisenbug that occurs only rarely.

To Reproduce

No response

Expected behavior

No response

Additional context

No response

2010YOUY01 added the bug label Mar 30, 2025
@2010YOUY01
Contributor Author

Likely caused by #15302, which is being reverted in #15494.

@goldmedal
Contributor

I saw similar error messages when running the tpch sqllogictest on the latest main branch:

INCLUDE_TPCH=true cargo test --test sqllogictests -- tpch
   Compiling datafusion-common v46.0.1 (/Users/jax/git/datafusion/datafusion/common)
   Compiling datafusion-expr-common v46.0.1 (/Users/jax/git/datafusion/datafusion/expr-common)
   Compiling datafusion-physical-expr-common v46.0.1 (/Users/jax/git/datafusion/datafusion/physical-expr-common)
   Compiling datafusion-functions-aggregate-common v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-aggregate-common)
   Compiling datafusion-functions-window-common v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-window-common)
   Compiling datafusion-expr v46.0.1 (/Users/jax/git/datafusion/datafusion/expr)
   Compiling datafusion-macros v46.0.1 (/Users/jax/git/datafusion/datafusion/macros)
   Compiling datafusion-physical-expr v46.0.1 (/Users/jax/git/datafusion/datafusion/physical-expr)
   Compiling datafusion-execution v46.0.1 (/Users/jax/git/datafusion/datafusion/execution)
   Compiling datafusion-sql v46.0.1 (/Users/jax/git/datafusion/datafusion/sql)
   Compiling datafusion-functions v46.0.1 (/Users/jax/git/datafusion/datafusion/functions)
   Compiling datafusion-physical-plan v46.0.1 (/Users/jax/git/datafusion/datafusion/physical-plan)
   Compiling datafusion-functions-aggregate v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-aggregate)
   Compiling datafusion-functions-window v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-window)
   Compiling datafusion-optimizer v46.0.1 (/Users/jax/git/datafusion/datafusion/optimizer)
   Compiling datafusion-functions-nested v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-nested)
   Compiling datafusion-session v46.0.1 (/Users/jax/git/datafusion/datafusion/session)
   Compiling datafusion-physical-optimizer v46.0.1 (/Users/jax/git/datafusion/datafusion/physical-optimizer)
   Compiling datafusion-datasource v46.0.1 (/Users/jax/git/datafusion/datafusion/datasource)
   Compiling datafusion-catalog v46.0.1 (/Users/jax/git/datafusion/datafusion/catalog)
   Compiling datafusion-datasource-csv v46.0.1 (/Users/jax/git/datafusion/datafusion/datasource-csv)
   Compiling datafusion-datasource-json v46.0.1 (/Users/jax/git/datafusion/datafusion/datasource-json)
   Compiling datafusion-datasource-parquet v46.0.1 (/Users/jax/git/datafusion/datafusion/datasource-parquet)
   Compiling datafusion-functions-table v46.0.1 (/Users/jax/git/datafusion/datafusion/functions-table)
   Compiling datafusion-catalog-listing v46.0.1 (/Users/jax/git/datafusion/datafusion/catalog-listing)
   Compiling datafusion-datasource-avro v46.0.1 (/Users/jax/git/datafusion/datafusion/datasource-avro)
   Compiling datafusion v46.0.1 (/Users/jax/git/datafusion/datafusion/core)
   Compiling datafusion-sqllogictest v46.0.1 (/Users/jax/git/datafusion/datafusion/sqllogictest)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 59.21s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-09ecded463d2aafd)
[00:00:02] ################------------------------      33/83      "tpch/tpch.slt" - 3 took > 500 ms                                                                                                                            
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet
Completed 1 test files in 29 seconds    
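For readers unfamiliar with this panic: the message reads like an `expect` asserting that an output partition has not been consumed before. Below is a minimal sketch of that failure mode, assuming (this thread does not confirm it) that each partition's state is removed from a map on first use, so a second `execute` call for the same partition panics. `OneShotPartitions` and `second_execute_panics` are hypothetical names for illustration, not DataFusion code.

```rust
use std::collections::HashMap;
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for an operator that hands out each output
/// partition exactly once. This is NOT DataFusion's RepartitionExec.
struct OneShotPartitions {
    // One entry per output partition; the entry is removed when consumed.
    channels: HashMap<usize, Vec<i32>>,
}

impl OneShotPartitions {
    /// Creates `num_partitions` entries, each holding one dummy value.
    fn new(num_partitions: usize) -> Self {
        let channels = (0..num_partitions)
            .map(|p| (p, vec![p as i32]))
            .collect();
        Self { channels }
    }

    /// Takes ownership of the data for `partition`.
    /// Panics if that partition was already taken or never existed.
    fn execute(&mut self, partition: usize) -> Vec<i32> {
        self.channels
            .remove(&partition)
            .expect("partition not used yet")
    }
}

/// Returns true if a second `execute` call on the same partition panics.
fn second_execute_panics(partition: usize) -> bool {
    let mut op = OneShotPartitions::new(4);
    let _first = op.execute(partition);
    // The assumed invariant is violated here: the entry is already gone.
    panic::catch_unwind(AssertUnwindSafe(|| op.execute(partition))).is_err()
}
```

Under this assumption, any plan change that makes the same output partition get executed more than once would surface as this panic, even if the repartition code itself is untouched.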

@goldmedal
Contributor

I guess #15476 is the root cause.

When I checked out 9071503, the tests passed.

14635da (HEAD -> main, origin/main, origin/HEAD, goldmedal/main) perf: Reuse row converter during sort (#15302)
e2b7919 Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single Utf8View ) (#15447)
7e0738a Remove CoalescePartitions insertion from HashJoinExec (#15476)
9071503 Update ClickBench queries to avoid to_timestamp_seconds (#15475)

@goldmedal
Contributor

By the way, the failing example is dataframe, not custom_datasource.

@zhuqi-lucas
Contributor

zhuqi-lucas commented Mar 30, 2025

Interesting, some benchmarks also fail with the same error at the same line, datafusion/physical-plan/src/repartition/mod.rs:618:22:

cargo bench -p datafusion --bench topk_aggregate --profile release-nonlto
    Finished `release-nonlto` profile [optimized] target(s) in 0.34s
     Running benches/topk_aggregate.rs (target/release-nonlto/deps/topk_aggregate-cbbaaf4e04209381)
Gnuplot not found, using plotters backend
Benchmarking aggregate 10000000 time-series rows: Warming up for 3.0000 s
thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet

#15213

@acking-you
Contributor

Interesting, I just noticed that the CI also failed at this point.

@zhuqi-lucas
Contributor

The tpch benchmark also fails on the main branch:

RUST_BACKTRACE=1 ./bench.sh run tpch10
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: tpch10
DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
BRANCH_NAME: main
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
RESULTS_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/results/main
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/zhuqi/arrow-datafusion/benchmarks/results/main/tpch_sf10.json
Running tpch benchmark...
    Finished `release` profile [optimized] target(s) in 0.20s
     Running `/Users/zhuqi/arrow-datafusion/target/release/tpch benchmark datafusion --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10 --prefer_hash_join true --format parquet -o /Users/zhuqi/arrow-datafusion/benchmarks/results/main/tpch_sf10.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10", file_format: "parquet", mem_table: false, output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/main/tpch_sf10.json"), disable_statistics: false, prefer_hash_join: true }
Query 1 iteration 0 took 690.9 ms and returned 4 rows
Query 1 iteration 1 took 598.7 ms and returned 4 rows
Query 1 iteration 2 took 601.4 ms and returned 4 rows
Query 1 iteration 3 took 581.8 ms and returned 4 rows
Query 1 iteration 4 took 637.5 ms and returned 4 rows
Query 1 avg time: 622.06 ms

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

thread 'tokio-runtime-worker' panicked at datafusion/physical-plan/src/repartition/mod.rs:618:22:
partition not used yet
stack backtrace:
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@acking-you
Contributor

I'm quite curious as to why the PR that caused the panic can pass the CI.

@goldmedal
Contributor

I'm quite curious as to why the PR that caused the panic can pass the CI.

I think its branch https://github.com/ctsk/datafusion/tree/remove-hj-coalesce was based on an old main branch (about 2 weeks old). #15476 was created 2 days ago, and some other PRs were merged into main after its CI passed. I'm not sure which of them conflicts with it.

@zhuqi-lucas
Contributor

Revert #15476.

I just tested the revert, and it also fixes the tpch benchmark.
