Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mistaken documentation for dask_geopandas.read_parquet #235

Open
barbuz opened this issue Jan 24, 2023 · 2 comments
Open

Mistaken documentation for dask_geopandas.read_parquet #235

barbuz opened this issue Jan 24, 2023 · 2 comments

Comments

@barbuz
Copy link

barbuz commented Jan 24, 2023

I am trying to use the read_parquet function in my code, and I've encountered some trouble where the documentation says one thing and the code seems to be doing something different.

These are the two discrepancies I have found, but there could be others:

  • The engine parameter cannot really be set. dask-geopandas uses a custom GeoArrowEngine engine, and trying to specify anything else raises an exception.
  • The documentation says that for split_row_groups "Default is True if a _metadata file is available or if the dataset is composed of a single file (otherwise defult is False).". It looks like the code does not even try to provide a default value and just passes it directly to dask.dataframe.read_parquet, which defaults to False in all cases.
@TomAugspurger
Copy link
Contributor

I think the read_parquet docstring is just copied directly over from dask.dataframe.read_parquet at

read_parquet.__doc__ = dd.read_parquet.__doc__
.

Your first point is correct, only the custom engine is allowed in dask-geopandas I think.

For your second point, maybe raise an issue in dask/dask if you think that documentation isn't accurate there too. I know there's been some churn around that parameter recently and it may be out of date.


As for what to do here, I'm not sure. dask-geopandas mostly aligns with dask.dataframe.read_parquet, so it's nice to pick up the changes from there automatically. Dask does include a derived_from decorator that can be used to copy over docstrings, with some control over things like "unused arugments": https://github.com/dask/dask/blob/34a1e88bb3f6196361f398ddab55e59d315d8d40/dask/utils.py#L818. Perhaps that could be used to add a caveat about the engine.

@barbuz
Copy link
Author

barbuz commented Jan 24, 2023

I see, thank you. What confused me is the fact that the dask documentation is different than the dask-geopandas one for split_row_groups. But maybe as you said it could just be out of date.
I've noticed the effects of the derived_from decorator, where some pages (example) have a disclaimer saying that this docstring was copied from somewhere else. Using the decorator for read_parquet and to_parquet should make the same disclaimer appear there, but I believe that would involve rewriting the function definitions so that the arguments used are listed explicitly and not just packed into *args and *kwargs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants