pipedag compatibility with pydiverse.transform >=0.2 #237

Open · wants to merge 19 commits into base: `main`
1 change: 0 additions & 1 deletion .github/workflows/nightly_tests.yml
@@ -58,7 +58,6 @@ jobs:
- py312all
- py311all
- py310all
- py39all

> **Review comment:** could you please add to the PR description why we decided to drop Python 3.9 from the tests?

- py39pdsa1all
- py311pdsa1all
uses: ./.github/workflows/test.yml
7 changes: 3 additions & 4 deletions .github/workflows/tests.yml
@@ -58,7 +58,6 @@
os:
- ubuntu-latest
environment:
- py39all
- py310all
- py311all
- py312all
@@ -80,7 +79,7 @@
os:
- ubuntu-latest
environment:
- py39all
- py310all
- py312all
uses: ./.github/workflows/test.yml
with:
@@ -98,7 +97,7 @@
os:
- ubuntu-latest
environment:
- py39ibm
- py310ibm
- py312ibm
uses: ./.github/workflows/test.yml
with:
@@ -118,7 +117,7 @@
os:
- ubuntu-latest
environment:
- py39all
- py310all
- py311all
uses: ./.github/workflows/test.yml
with:
7 changes: 6 additions & 1 deletion README.md
@@ -308,6 +308,11 @@ Same happened for MacOS. The driver was installed in `/opt/homebrew/etc/odbcinst

Furthermore, make sure you use 127.0.0.1 instead of localhost. It seems that /etc/hosts is ignored.

### Incompatibility with pydiverse.transform 0.2.0

pydiverse.pipedag currently doesn't support pydiverse.transform 0.2.0. Please consider using a newer version
(0.2.3 or later) or downgrading to pydiverse.transform 0.1.6.
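To illustrate the constraint described above, a downstream project could check the installed pydiverse.transform version at import time and fail fast. This is a hypothetical sketch, not part of pipedag's API: `check_transform_compat`, `parse_version`, and `BROKEN` are illustrative names, and the set of broken versions follows this PR's notes.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as incompatible in this PR (hypothetical helper, not pipedag API).
BROKEN = {(0, 2, 0), (0, 2, 1), (0, 2, 2)}


def parse_version(raw: str) -> tuple[int, ...]:
    """Parse "0.2.3" into (0, 2, 3), stopping at any non-numeric segment."""
    parts = []
    for p in raw.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)


def check_transform_compat() -> None:
    """Raise if a known-incompatible pydiverse.transform version is installed."""
    try:
        raw = version("pydiverse-transform")
    except PackageNotFoundError:
        return  # not installed, nothing to check
    if parse_version(raw) in BROKEN:
        raise NotImplementedError(
            f"pydiverse.transform {raw} isn't supported; "
            "use 0.2.3 or later, or downgrade to 0.1.6"
        )
```

Calling `check_transform_compat()` early in a pipeline module would surface the same kind of `NotImplementedError` this PR raises, instead of failing later inside a table hook.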

## Packaging and publishing to pypi and conda-forge using github actions

- bump version number in [pyproject.toml](pyproject.toml)
@@ -345,4 +350,4 @@ Conda-forge packages are updated via:
- https://github.com/conda-forge/pydiverse-pipedag-feedstock#updating-pydiverse-pipedag-feedstock
- update `recipe/meta.yaml`
- test meta.yaml in pipedag repo: `conda-build build ../pydiverse-pipedag-feedstock/recipe/meta.yaml`
- commit `recipe/meta.yaml` to branch of fork and submit PR
- commit `recipe/meta.yaml` to branch of fork and submit PR
6 changes: 6 additions & 0 deletions docs/package/README.md
@@ -150,3 +150,9 @@ But `odbcinst -j` revealed that it installed the configuration in `/etc/unixODBC
its own `odbcinst` executable and that shows odbc config files are expected in `/etc/*`. Symlinks were enough to fix the
problem. Try `python -c 'import pyodbc;print(pyodbc.drivers())'` and see whether you get more than an empty list.
Furthermore, make sure you use 127.0.0.1 instead of localhost. It seems that /etc/hosts is ignored.

### Incompatibility with specific pydiverse.transform versions

pydiverse.pipedag currently doesn't support pydiverse.transform versions 0.2.0, 0.2.1, and 0.2.2, due to major
differences between them, pdt 0.2.3, and pdt <0.2.
However, it still works with pdt <0.2 as well as pdt >=0.2.3.
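The PR itself implements this detection with nested try/except imports rather than version-string parsing. Below is a condensed sketch of that logic; the `eager` and `extended` module paths are taken from the diff and are pydiverse.transform internals, so they may change.

```python
# Condensed form of the detection added in parquet.py by this PR.
# Degrades gracefully when pydiverse.transform is not installed at all.
try:
    import pydiverse.transform as pdt

    try:
        # only pdt <0.2 ships the eager pandas backend
        from pydiverse.transform.eager import PandasTableImpl  # noqa: F401

        pdt_old, pdt_new = pdt, None
    except ImportError:
        try:
            # only pdt >=0.2.3 ships the extended Polars export target
            from pydiverse.transform.extended import Polars  # noqa: F401

            pdt_old, pdt_new = None, pdt
        except ImportError:
            # 0.2.0-0.2.2: neither import path exists
            raise NotImplementedError(
                "pydiverse.transform 0.2.0 isn't supported"
            ) from None
except ImportError:
    pdt = pdt_old = pdt_new = None
```

Registering one table hook against `pdt_old` and another against `pdt_new` then lets at most one of them become active for the installed version.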
9 changes: 9 additions & 0 deletions docs/source/changelog.md
@@ -1,5 +1,14 @@
# Changelog

## 0.9.9.1 (2025-02-11)
- Fix incompatibility with pydiverse.transform 0.2.3 (<0.2 is still supported; 0.2.0, 0.2.1, and 0.2.2 remain incompatible):
  - Added a new TableHook for the new API and kept the old one.
  - Added version detection (new/old) via try-except imports.
  - Added the same version detection to tests/test_table_hooks/test_pdtransform.
  - Added the version detection and a new TableHook to parquet.py as well.
  - Raise `NotImplementedError("pydiverse.transform 0.2.0 isn't supported")` when an unsupported pydiverse.transform 0.2.x version is used.
- Pinned prefect to ">=2.13.5, <3.0.0", because prefect >=3.0 is incompatible.

## 0.9.9 (2025-02-05)
- Fix incompatibility with DuckDB 1.1.

57,380 changes: 27,851 additions & 29,529 deletions pixi.lock

Large diffs are not rendered by default.

38 changes: 21 additions & 17 deletions pixi.toml
@@ -37,7 +37,7 @@ kazoo = ">=2.8.0"
dask = ">=2022.1.0"

[feature.prefect.dependencies]
prefect = ">=2.13.5"
prefect = ">=2.13.5, <3.0.0"

> **Review comment:** and could you please also add a note that/why we pinned the prefect versions?


[feature.snowflake.dependencies]
snowflake-sqlalchemy = ">=1.6.1"
@@ -58,7 +58,6 @@ tidypolars = ">=0.2.19"

[feature.all-tests.dependencies]
# Table Hooks
pydiverse-transform = ">=0.1.3, <0.2"
ibis-framework = ">=5.1.0, <9"
ibis-mssql = ">=5.1.0, <9"
ibis-postgres = ">=5.1.0, <9"
@@ -68,15 +67,21 @@ arrow-odbc = ">=7.0.4"
pyodbc = ">=4.0.35"
pytsql = ">=1.1.4"

[feature.pdtransform-new.dependencies]
pydiverse-transform = ">0.2.0"

[feature.pdtransform-old.dependencies]
pydiverse-transform = "<0.2"

# IBM DB2
[feature.ibm-db]
# ibm-db is not available on Apple Silicon
# https://ibm-data-and-ai.ideas.ibm.com/ideas/DB2CON-I-92
platforms = ["linux-64", "osx-64", "win-64"]
[feature.ibm-db.target.osx.pypi-dependencies]
# ibm_db is not on conda-forge for macOS
# https://github.com/ibmdb/db2drivers/issues/3
ibm-db = ">=3.1.4"
platforms = ["linux-64", "win-64"] # TODO "osx-arm64"

> **Review comment:** ...and that/why we removed ibm-on-osx support?
>
> **Review comment:** @pavelzw maybe that's also something you could elaborate on. I'm wondering if we should maybe unbundle this particular change. wdyt?
>
> **Contributor reply:** The issue was that ibm-db on macOS didn't have a conda-forge build back then (conda-forge/ibm_db-feedstock#77) and just got one recently in conda-forge/ibm_db-feedstock#79. In a future PR (#231) I will re-add macOS support for ibm-db with the new package.
>
> This led to us using the ibm-db build on PyPI for macOS (in combination with telling people to install gcc next to clang) and thus activating pypi-dependencies in pixi. This introduces some issues, namely in some cases (I haven't found a minimal reproducer for the issue here, though) where pixi thinks the lockfile is out of date (which leads to `pixi install --locked` failing). xref prefix-dev/pixi#1295, prefix-dev/pixi#1417
>
> > I'm wondering if we should maybe unbundle this particular change
>
> This would require a change of the lockfile, which will likely lead to CI breakages. If it's not high effort to fix these breakages (or to introduce corresponding pinnings in pixi.toml), we could do it 🤷🏻

#[feature.ibm-db.target.osx.pypi-dependencies]
## ibm_db is not on conda-forge for macOS
## https://github.com/ibmdb/db2drivers/issues/3
#ibm-db = ">=3.1.4"
[feature.ibm-db.target.linux.dependencies]
ibm_db = ">=3.1.4"
[feature.ibm-db.target.win.dependencies]
@@ -132,8 +137,8 @@ connectorx = ">=0.3.3"
connectorx = ">=0.3.3"

[environments]
default = ["py312", "all-tests", "dev", "filelock", "zookeeper", "dask", "prefect", "pd2", "sa2", "snowflake"]
all-tests = ["py312", "all-tests", "dev", "filelock", "zookeeper", "dask", "prefect", "pd2", "sa2", "snowflake"]
default = ["py312", "all-tests", "pdtransform-new", "dev", "filelock", "zookeeper", "dask", "prefect", "pd2", "sa2", "snowflake"]
all-tests = ["py312", "all-tests", "pdtransform-new", "dev", "filelock", "zookeeper", "dask", "prefect", "pd2", "sa2", "snowflake"]
docs = ["docs"]
release = { features=["release"], no-default-feature=true }
py39 = ["py39", "dev", "zookeeper", "pd2", "sa2", "dask", "duckdb1"]
@@ -143,11 +148,10 @@ py312 = ["py312", "dev", "zookeeper", "pd2", "sa2", "dask", "duckdb1"]
py39pdsa1 = ["py39", "dev", "zookeeper", "pd1", "sa1", "dask", "connectorx-noarm", "duckdb0"]
py311pdsa1 = ["py311", "dev", "zookeeper", "pd1", "sa1", "dask", "connectorx-noarm", "duckdb0"]
py312pdsa1 = ["py39", "dev", "zookeeper", "pd1", "sa1", "dask", "connectorx-noarm", "duckdb0"]
py39all = ["py39", "dev", "zookeeper", "all-tests", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake", "tidypolars"]
py310all = ["py310", "dev", "zookeeper", "all-tests", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake", "tidypolars"]
py311all = ["py311", "dev", "zookeeper", "all-tests", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake"]
py312all = ["py312", "dev", "zookeeper", "all-tests", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake"]
py39pdsa1all = ["py39", "dev", "zookeeper", "pd1", "sa1", "connectorx-noarm", "all-tests", "filelock", "dask", "prefect", "snowflake", "tidypolars"]
py311pdsa1all = ["py311", "dev", "zookeeper", "pd1", "sa1", "connectorx-noarm", "all-tests", "filelock", "dask", "prefect", "snowflake"]
py39ibm = ["py39", "dev", "zookeeper", "pd2", "sa2", "dask", "ibm-db", "all-tests", "tidypolars"]
py312ibm = ["py312", "dev", "zookeeper", "pd2", "sa2", "dask", "ibm-db", "all-tests"]
py310all = ["py310", "dev", "zookeeper", "all-tests", "pdtransform-new", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake", "tidypolars"]
py311all = ["py311", "dev", "zookeeper", "all-tests", "pdtransform-new", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake"]
py312all = ["py312", "dev", "zookeeper", "all-tests", "pdtransform-new", "filelock", "dask", "prefect", "pd2", "sa2", "snowflake"]
py39pdsa1all = ["py39", "dev", "zookeeper", "pd1", "sa1", "connectorx-noarm", "all-tests", "pdtransform-old", "filelock", "dask", "prefect", "snowflake", "tidypolars"]
py311pdsa1all = ["py311", "dev", "zookeeper", "pd1", "sa1", "connectorx-noarm", "all-tests", "pdtransform-old", "filelock", "dask", "prefect", "snowflake"]
py310ibm = ["py310", "dev", "zookeeper", "pd2", "sa2", "dask", "ibm-db", "all-tests", "pdtransform-old", "tidypolars"]
py312ibm = ["py312", "dev", "zookeeper", "pd2", "sa2", "dask", "ibm-db", "all-tests", "pdtransform-old"]
2 changes: 1 addition & 1 deletion src/pydiverse/pipedag/backend/table/cache/base.py
@@ -94,7 +94,7 @@ def _retrieve_table_obj(self, table: Table, as_type: type[T]) -> T:
            self.logger.info("Retrieved table from local table cache", table=table)
            return obj
        except Exception as e:
            self.logger.info(
            self.logger.warning(
                "Failed to retrieve table from local table cache",
                table=table,
                cause=str(e),
78 changes: 76 additions & 2 deletions src/pydiverse/pipedag/backend/table/cache/parquet.py
@@ -190,12 +190,37 @@ def retrieve(

try:
    import pydiverse.transform as pdt

    try:
        from pydiverse.transform.eager import PandasTableImpl

        _ = PandasTableImpl

        pdt_old = pdt
        pdt_new = None
    except ImportError:
        try:
            # detect whether an unsupported 0.2.0-0.2.2 or a supported >=0.2.3
            # version is active; this import only works in >=0.2.3
            from pydiverse.transform.extended import Polars

            # ensures a "used" state for the import, preventing black from deleting it
            _ = Polars

            pdt_old = None
            pdt_new = pdt
        except ImportError:
            raise NotImplementedError(
                "pydiverse.transform 0.2.0 isn't supported"
            ) from None
except ImportError:
    pdt = None
    pdt_old = None
    pdt_new = None


@ParquetTableCache.register_table(pdt)
class PydiverseTransformTableHook(TableHook[ParquetTableCache]):
@ParquetTableCache.register_table(pdt_old)
class PydiverseTransformTableHookOld(TableHook[ParquetTableCache]):
> **Review comment (Contributor):** Is there no more indicative naming possible here instead of just Old/New? 😅

    @classmethod
    def can_materialize(cls, type_) -> bool:
        return issubclass(type_, pdt.Table)
@@ -240,3 +265,52 @@ def retrieve(
            return pdt.Table(PandasTableImpl(table.name, df))

        raise ValueError(f"Invalid type {as_type}")


@ParquetTableCache.register_table(pdt_new)
class PydiverseTransformTableHookNew(TableHook[ParquetTableCache]):
    @classmethod
    def can_materialize(cls, type_) -> bool:
        return issubclass(type_, pdt.Table)

    @classmethod
    def can_retrieve(cls, type_) -> bool:
        from pydiverse.transform.extended import Pandas

        return issubclass(type_, Pandas)

    @classmethod
    def materialize(
        cls, store: ParquetTableCache, table: Table[pdt.Table], stage_name: str
    ):
        from pydiverse.transform.extended import (
            Polars,
            export,
        )

        t = table.obj
        table = table.copy_without_obj()

        try:
            table.obj = t >> export(Polars())
            hook = store.get_hook_subclass(PolarsTableHook)
            return hook.materialize(store, table, stage_name)
        except Exception as e:
            raise TypeError(f"Unsupported type {type(t._ast).__name__}") from e

    @classmethod
    def retrieve(
        cls,
        store: ParquetTableCache,
        table: Table,
        stage_name: str | None,
        as_type: type,
    ):
        from pydiverse.transform.extended import Polars

        if isinstance(as_type, Polars):
            hook = store.get_hook_subclass(PandasTableHook)
            df = hook.retrieve(store, table, stage_name, pd.DataFrame)
            return pdt.Table(df)

        raise ValueError(f"Invalid type {as_type}")