Environment #3

Open
martinfleis opened this issue Feb 22, 2022 · 14 comments

@martinfleis
Contributor

Hi, this is a great initiative.

As geopandas is currently in the middle of a performance-related migration, the out-of-the-box performance is not necessarily the best it can offer (I'll open another issue on that). I wanted to check the environment to see whether you have the pygeos engine installed and which versions of GEOS and the other libraries are used, but it doesn't seem to be listed.

How do you create an environment for these tests?
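
For reference, a minimal sketch of how the relevant package versions could be collected in one go, using only the Python standard library. The helper name `installed_versions` and the package list are assumptions for illustration, not something taken from the benchmark code.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The package list is an assumption; adjust it to the benchmark's dependencies.
    packages = ["geopandas", "shapely", "pygeos", "pyproj", "Fiona"]
    for name, version in installed_versions(packages).items():
        print(f"{name:<10} {version or 'not installed'}")
```

This only covers Python packages. `geopandas.show_versions()` additionally reports the GEOS, GDAL and PROJ versions that geopandas sees.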

@martinfleis
Contributor Author

These are the results from my environment, which includes pygeos, with no changes to the code (although some code changes would also make a significant difference).
report.html.zip

@kadyb
Owner

kadyb commented Feb 22, 2022

Thanks for the comment and your results! Generally, I don't expect super performance from Python and R - this is the domain of low-level languages. My idea was a simple comparison of packages for vector data processing without code optimization, i.e. I used simple functions available in the packages.

I used Pop!_OS 20.04 LTS (based on Ubuntu 20.04 Focal Fossa) and the software available in its repositories by default. I installed the Python packages from PyPI with pip and the R packages from CRAN.

I didn't use {pygeos}. However, correct me if I'm wrong, doesn't {pygeos} use multithreading by default, hence the speedup? All tested R packages are single-threaded, so such a comparison would be unfair. There are separate packages, e.g. sfurrr, that allow parallel computation, or you can write the parallel code yourself, but I did not include these cases in this benchmark. There is a feature request for this in {terra}, but it is still not implemented.

I'm surprised how much the distance calculation performance has improved in particular, nice.

@kadyb
Owner

kadyb commented Feb 22, 2022

Here is more information about the environment used. Let me know if anything more is needed.

> apt list --installed | grep libgeos
libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]

> python3 -VV
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0]

terra::gdal(lib = "all")
#>    gdal    proj    geos
#> "3.0.4" "6.3.1" "3.8.0"
sf::sf_extSoftVersion()
#>    GEOS     GDAL   proj.4  GDAL_with_GEOS  USE_PROJ_H     PROJ
#> "3.8.0"  "3.0.4"  "6.3.1"          "true"      "true"  "6.3.1"
geos::geos_version()
#> [1] ‘3.10.0’

Python packages

```
> pip list
Package                 Version
----------------------- ----------------------------------
affine                  2.3.0
appdirs                 1.4.4
attrs                   19.3.0
beautifulsoup4          4.8.2
blinker                 1.4
Brlapi                  0.7.0
cachetools              4.2.2
certifi                 2019.11.28
cftime                  1.4.1
chardet                 3.0.4
chrome-gnome-shell      0.0.0
click                   8.0.3
click-plugins           1.1.1
cligj                   0.7.2
cloudpickle             1.6.0
colorama                0.4.3
command-not-found       0.3
cryptography            2.8
cupshelpers             1.0
cycler                  0.10.0
dask                    2021.4.1
datacube                1.8.3
dbus-python             1.2.16
decorator               4.4.2
defer                   1.0.6
distributed             2021.4.1
distro                  1.4.0
entrypoints             0.3
Fiona                   1.8.21
fsspec                  2021.4.0
future                  0.18.2
GDAL                    3.0.4
geocube                 0.0.16
geopandas               0.9.0
gpg                     1.13.1-unknown
greenlet                1.0.0
HeapDict                1.0.1
hidpidaemon             18.4.6
html5lib                1.0.1
httplib2                0.14.0
idna                    2.8
importlib-metadata      1.5.0
ipython-genutils        0.2.0
Jinja2                  2.10.1
jsonschema              3.2.0
jupyter-core            4.6.3
keyring                 18.0.1
kiwisolver              1.0.1
language-selector       0.1
lark-parser             0.11.2
launchpadlib            1.10.13
lazr.restfulclient      0.14.2
lazr.uri                1.0.3
locket                  0.2.1
louis                   3.12.0
lxml                    4.5.0
macaroonbakery          1.3.1
MarkupSafe              1.1.0
matplotlib              3.1.2
more-itertools          4.2.0
msgpack                 1.0.2
munch                   2.5.0
nbformat                5.0.4
netCDF4                 1.5.6
netifaces               0.10.4
numpy                   1.17.4
oauthlib                3.1.0
olefile                 0.46
OWSLib                  0.19.1
packaging               21.2
pandas                  1.2.2
partd                   1.2.0
Pillow                  7.0.0
pip                     20.0.2
plotly                  4.4.1
pop-transition          1.1.2
protobuf                3.6.1
psutil                  5.8.0
psycopg2                2.8.4
pycairo                 1.16.2
pycups                  1.9.73
pydbus                  0.6.0
Pygments                2.3.1
PyGObject               3.36.0
PyJWT                   1.7.1
pymacaroons             0.13.0
PyNaCl                  1.3.0
PyOpenGL                3.1.0
pyparsing               2.4.6
pyproj                  2.5.0
PyQt5                   5.14.1
pyRFC3339               1.1
pyrsistent              0.15.5
python-apt              2.1.2pop0-1587756471-20.04-cd2988e
python-dateutil         2.7.3
python-debian           0.1.36ubuntu1
python-xlib             0.23
pytz                    2019.3
pyxdg                   0.26
PyYAML                  5.3.1
rasterio                1.2.10
rasterstats             0.16.0
repoman                 1.2.2
requests                2.22.0
requests-unixsocket     0.2.0
retrying                1.3.3
rioxarray               0.10.0
scipy                   1.6.3
screen-resolution-extra 0.0.0
SecretStorage           2.3.1
sessioninstaller        0.0.0
setuptools              45.2.0
Shapely                 1.7.1
simplejson              3.16.0
sip                     4.19.21
six                     1.14.0
snuggs                  1.4.7
sortedcontainers        2.3.0
soupsieve               1.9.5
SQLAlchemy              1.4.12
ssh-import-id           5.10
systemd-python          234
tblib                   1.7.0
toolz                   0.11.1
tornado                 6.1
traitlets               4.3.3
ubuntu-advantage-tools  27.6
ubuntu-drivers-common   0.0.0
ufw                     0.36
urllib3                 1.25.8
wadllib                 1.3.3
webencodings            0.5.1
wheel                   0.34.2
wxPython                4.0.7
xarray                  0.17.0
xkit                    0.0.0
zict                    2.0.0
zipp                    1.0.0
```

@martinfleis
Contributor Author

martinfleis commented Feb 22, 2022

> However, correct me if I'm wrong, doesn't {pygeos} use multithreading by default, hence the speedup?

No, it doesn't. Dask-geopandas would, but pygeos is single-threaded, just vectorized. It is going to become shapely 2.0 and, once released as such, the default geometry engine in geopandas. At the moment it is treated as experimental (though stable).

@kadyb
Owner

kadyb commented Feb 22, 2022

Thanks for the clarification! Honestly, I've never used {pygeos}, I've always used {geopandas} alone. So it will be added as a default dependency in the near future?

@martinfleis
Contributor Author

Yes and no :D. GeoPandas' default geometry engine is shapely, and pygeos has been integrated into shapely. So while we will never require pygeos to be installed explicitly, it will effectively be installed when you install shapely 2.0 (to be released soon-ish, 95% of the work is done). It is a long process aimed at consolidating the ecosystem. Users of geopandas will essentially get the speedup you see in my results for free, without needing to change anything in their code, just as you get it now if pygeos is installed.

@kadyb
Owner

kadyb commented Feb 22, 2022

Great, so the best solution is if I install {pygeos} now and rerun the benchmark.

@martinfleis
Contributor Author

> Great, so the best solution is if I install {pygeos} now and rerun the benchmark.

Ideally with the changes proposed in #5, as some of the code does not follow the ideal pattern at the moment.

@kadyb
Owner

kadyb commented Dec 14, 2022

@martinfleis, could you check if the results are reproducible for {sf} and {geopandas} (in particular, I mean with the new version of {shapely})? Do you also recommend removing {pygeos} now?

The only problem I haven't noticed before is:

sys:1: FutureWarning: The 'cascaded_union' attribute is deprecated, use 'unary_union' instead
/home/krzdyb/.local/lib/python3.8/site-packages/geopandas/_vectorized.py:653: UserWarning: Only Polygon objects have interior rings. For other geometry types, None is returned.

when I want to plot the points from sample.py. I see this is related to GEOS 3.3 (shapely/shapely#1001) but I have GEOS 3.8.

apt list --installed | grep libgeos
#> libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]

@jorisvandenbossche

> Do you also recommend removing {pygeos} now?

Yes, if you make sure you have shapely >= 2.0, then it's best to remove pygeos (otherwise geopandas will still use pygeos for now, which adds some overhead from converting between pygeos and shapely objects).
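
As a rough sketch of that rule, the check could be written as below. The helper names `version_tuple` and `should_remove_pygeos` are made up for illustration and are not part of geopandas or shapely. The version parsing is deliberately simple and ignores pre-release suffixes.

```python
import re


def version_tuple(version):
    """Parse the leading numeric part of a version string, e.g. '2.0.1' -> (2, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def should_remove_pygeos(shapely_version, pygeos_installed):
    """True when pygeos is installed alongside shapely >= 2.0."""
    return bool(pygeos_installed) and version_tuple(shapely_version) >= (2, 0)
```

With the environment listed earlier in this thread (Shapely 1.7.1, no pygeos), `should_remove_pygeos("1.7.1", False)` is `False`, so there is nothing to remove until shapely is upgraded.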

@jorisvandenbossche

> The only problem I haven't noticed before is: ... when I want to plot the points from sample.py.

What code are you using to plot?

@kadyb
Owner

kadyb commented Dec 15, 2022

> What code are you using to plot?

n = 10
smp = sample(gdf, n)
smp.plot()

@jorisvandenbossche

But that result is supposed to contain only points, right? I'm not sure how that can trigger that warning.
(You get that warning if you try to get the interiors from a GeoSeries that contains both polygons and non-polygons, and we do call that in the plotting code. But in the latest versions of geopandas, we first split the input by geometry type and then plot the geometries of each type with a custom function for that type, so we should never try to get the interiors of points.)
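
If it helps to narrow this down, one option is to record the warnings raised around the plotting call and inspect them. Below is a minimal sketch using only the standard `warnings` module. `collect_warnings` is a hypothetical helper, and `noisy_plot` is a stand-in that merely emits a similar warning, since the real `smp.plot()` call depends on the benchmark data.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, [(category name, message), ...])."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are recorded too
        result = func(*args, **kwargs)
    return result, [(w.category.__name__, str(w.message)) for w in caught]


def noisy_plot():
    """Hypothetical stand-in for smp.plot(); it only emits a similar warning."""
    warnings.warn("Only Polygon objects have interior rings.", UserWarning)
    return "axes"
```

In the real session this would be `collect_warnings(smp.plot)`. Alternatively, running the script with `python -W error` turns the warning into an exception, so the traceback shows which call raised it.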

@kadyb
Owner

kadyb commented Dec 15, 2022

Yes, points only. Anyway, the figure looks correct.


3 participants