v2.0.0
What's Changed
Major changes
- Support Spark 3.4 and 3.5
- Add functions for left and left semi joins with overlap criteria accelerated by Databricks' range join optimization
- Register SQL functions via SQL extension service provider interface, so
glow.register
is no longer necessary if Glow is on the classpath when Spark is launched
Other user facing changes
- Remove Hail integration
- Remove features that frequently cause incompatibilities between versions (
aggregate_by_index
, CSV pipe transformer). Workarounds are provided in the documentation.
Internal changes
- Future proof for Spark 4.0 / Scala 2.13 / JDK 17
- Migrate CI and release process to GitHub Actions
Overlap join benchmarks
On a dataset with 1B left rows and 1M right rows and varying percentages of SNPs in the left table (tested with 1 4 core executor due to quota):
Inner range join + left join, all SNP percentages: 4h
Glow join, 0% SNPs: 4h
Glow join, 50% SNPs: 2h9m
Glow join, 90% SNPs: 0h42m
Other notes
The Python source artifact is built from tag v2.0.0-conda
in order to fix Glow's conda recipe.
New Contributors
- @dvcastillo made their first contribution in #505
- @dtzeng made their first contribution in #519
- @srowen made their first contribution in #524
- @a-li made their first contribution in #522
- @scala-steward-projectglow made their first contribution in #555
Full Changelog: v1.2.1...v2.0.0