Using Dask in reprojection #119

Closed · davidbrochart opened this issue Jun 11, 2020 · 35 comments
Labels: dask (Dask related issue), proposal (Idea for a new feature)

@davidbrochart
Contributor

I believe rioxarray loads the whole data array into memory before reprojecting. Reprojecting a data array chunk by chunk would make it possible to handle much bigger data, and it would also stay within the Dask lazy execution model.

@davidbrochart added the proposal label on Jun 11, 2020
@snowman2
Member

snowman2 commented Jun 11, 2020

It would definitely be a nice feature to have. One gotcha to watch out for is that reprojection considers points outside of the dask chunk due to resampling.

@djhoese
Contributor

djhoese commented Jun 11, 2020

This is something we are constantly working on in the pyresample project (https://github.com/pytroll/pyresample/). We have a lot of dask-compatible solutions but no explicitly defined interfaces. We're working towards a "Resampler" class that would make it easier to control dask-based resampling algorithms. Note that we aren't calling into the GDAL resampling algorithms.

It should be noted that resampling is not the easiest operation to port to dask friendliness. One issue, as @snowman2 mentioned, is that you often have to deal with an overlap of source data chunks. Dask has some functions that can help with that, like map_overlap (https://docs.dask.org/en/latest/array-overlap.html). The other issue is just the nature of resampling and trying to do it per chunk: you typically have to build a custom dask graph if you want to do anything efficient. For example, I have an Elliptical Weighted Averaging (EWA) algorithm in pyresample that I just recently rewrote this way. You can see the progress here: pytroll/pyresample#284. The hard part is efficiently saying "this input chunk should map to this output chunk". This type of check typically means you end up with M * N tasks, where M is the number of input chunks and N is the number of output chunks. I've had to do some ugly things in that PR to try to do things "intelligently".
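For reference, a minimal sketch of dask's map_overlap, which handles the fixed-size halo case (the array, kernel, and depth below are illustrative; real reprojection needs a data-dependent overlap, which is why custom graphs come up):

import dask.array as da
import numpy as np

src = da.random.random((8192, 8192), chunks=(2048, 2048))

def per_chunk_kernel(block):
    # Placeholder for per-chunk work that needs a halo of neighboring
    # source pixels (e.g. bilinear-style resampling near chunk edges).
    return block

# depth=2 gives each task a 2-pixel halo from neighboring chunks;
# boundary=np.nan pads the outer edges of the whole array.
result = src.map_overlap(per_chunk_kernel, depth=2, boundary=np.nan)
result.compute()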

CC the other pyresample core devs @pnuu and @mraspaud

@davidbrochart
Contributor Author

Thanks @djhoese, I'll check out pyresample.

@djhoese
Contributor

djhoese commented Jun 11, 2020

Oh, I should have mentioned that the interfaces that aren't fully defined in pyresample are being used in Satpy. That might be a better starting point if you're curious. Of course, we're always working towards better rioxarray, rasterio, and plain old xarray compatibility.

@snowman2
Member

I wonder if WarpedVRT could be useful here? I bet there is a way to have each chunk lazily loaded from disk and reprojected at read time, with dask doing this in parallel. This would remove any/all buffer problems, as it would have access to all of the data on disk.

@djhoese
Contributor

djhoese commented Jun 12, 2020

Very interesting. I'm not sure whether virtual warping is doing something extremely magical or just doing the warping when you ask for it. Your concurrency link has been brought up before in a dask issue and I had some concerns that I discussed here: dask/dask#3255 (comment)

I'll quote it here:

I'm very confused by that rasterio example. I've never used a ThreadPoolExecutor, but it doesn't seem like any of the multi-threaded/parallel code in that example has to do with rasterio. The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread and written to a geotiff in the main thread. The only thing that looks possibly multi-threaded for rasterio is the data_gen generator used in:

                for window, result in zip(
                    windows, executor.map(compute, data_gen)
                ):

But that would mean that .map is passing generator context to each thread which seems dangerous rather than iterating over the generator and giving the resulting object (a numpy array) to the compute function in a separate thread.
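For what it's worth, concurrent.futures.Executor.map collects its iterables immediately in the submitting thread, so only the yielded objects (here, numpy arrays) cross thread boundaries. A tiny sketch of that behavior:

from concurrent.futures import ThreadPoolExecutor

def data_gen():
    # Consumed in the main thread: Executor.map pulls items from this
    # generator as it submits tasks, so no generator state is shared
    # across threads.
    for i in range(4):
        yield i

def compute(item):
    # Runs in a worker thread on the already-yielded item.
    return item * 2

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(compute, data_gen()))

print(results)  # [0, 2, 4, 6]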

@snowman2
Member

just doing the warping when you ask for it

It should just do it when you ask for it - lazy loading as I understand it.

@snowman2
Member

The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread

This is exactly the bit I am plugging for - read in and warp the data into a dask array in parallel.

@snowman2
Member

https://rasterio.readthedocs.io/en/latest/api/rasterio.vrt.html

Abstracts the details of raster warping and allows access to data that is reprojected when read.

@snowman2
Member

I read through the code, and it seems like this is already possible to do -
you can pass a rasterio.WarpedVRT into rioxarray.open_rasterio: https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray-open-rasterio

I haven't tested loading it in parallel with dask personally. Would be interesting to test sometime.
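A minimal sketch of that pattern (file path, target CRS, and chunk sizes here are illustrative):

import rasterio
import rioxarray
from rasterio.vrt import WarpedVRT

with rasterio.open("input.tif") as src:
    with WarpedVRT(src, crs="EPSG:4326") as vrt:
        # chunks=... returns a dask-backed DataArray; each task reads
        # (and therefore warps) only its own window of the VRT.
        xds = rioxarray.open_rasterio(vrt, chunks={"x": 512, "y": 512})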

@pnuu

pnuu commented Jun 12, 2020

Hi,

I worked on the Pyresample gradient search resampler (originally implemented by @mraspaud) in pytroll/pyresample#282 a while back. The resampler now maps the source chunks to those, and only those, target chunks they have any overlap with. All the chunks that do not overlap are discarded to reduce the amount of computation. As there are often multiple input chunks contributing to an output chunk, these contributions are stacked before the full output array is formed.

The source -> target chunk mapping needs some non-dask computations to check for the overlap, but these are cheap operations for cases where the source data has a proper projection. The actual resampling (implemented in Cython) is delayed and the delayed results are concatenated (and padded) to form the resulting 2/3D array.

There are certainly some optimizations I missed, but I'm pretty happy with the results so far :-)

@snowman2
Member

Looks like the ODC group has a nice solution for dask reprojection: #130 (comment)

@davidbrochart
Contributor Author

Nice! You think we can steal that code? 😄

@snowman2
Member

snowman2 commented Jun 18, 2020

You think we can steal that code?

It would be a lot of code to take over. There are some bits to pull in from datacube-core as well as from odc-tools. If we did that, we would need to add the odc-tools license into the repo and make sure to credit them for the code taken.

My preference would be for them to put it into a separate package that we could import in and use. Might be a worthwhile discussion to have.

@snowman2
Member

Opened an issue: opendatacube/odc-tools#58 🤞

@rmg55

rmg55 commented Aug 15, 2020

+1 on integrating dask/lazy arrays into the reprojection method. Suggested above was using WarpedVRT, which I have used on some projects. Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2...

I also noticed that GDAL 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if it could bring some additional GDAL functionality to xarray (rioxarray) objects.

@snowman2
Member

Suggested above was using WarpedVRT, which I have used on some projects ...

That is pretty neat. Thanks for sharing 👍. How much did it improve performance?

Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?

I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects.

It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.

@snowman2
Member

Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2...

Is this something you would be interested in adding to the examples section in the documentation of rioxarray?

@rmg55

rmg55 commented Aug 16, 2020

Happy to share @snowman2, and yes, I would be happy to add the example to the docs.

Also, I just saw this PR in xarray that allows xarray to read the WarpedVRT object directly, so I have removed some unnecessary lines of code in the gist I posted above.

That is pretty neat. Thanks for sharing 👍. How much did it improve performance?

This is a pretty small file (26 MB), so I think the dask/scheduling overhead is greater than the parallel speedup in IO.

Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?

Yes, you can use the chunks argument in open_rasterio as well.

It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.

Thanks for linking to the rasterio discussion on multi-dim rasters.

The thing I have not been able to figure out is how to build a WarpedVRT from an xarray object. I think it could be useful to represent xarray objects as a VRT that, rather than pointing to on-disk files, points to a numpy or dask array. However, I am not sure that is possible with the VRT format...

@snowman2
Member

Happy to share @snowman2, and yes, I would be happy to add the example to the docs.

That would be great 👍

The thing I have not been able to figure out is how to build a warped_vrt from a xarray object.

Interesting. Not sure if it makes sense to do so as the VRT is file based IIRC, but I haven't given it much thought. Feel free to open a new issue to look into and discuss this further.

@calexat-123

Looks like the ODC group has a nice solution for dask reprojection: #130 (comment)

I'm working on something based on this and with generous advice from Kirill Kouzoubov (though I'm sure I've made errors of my own, and I haven't set up the error handling very well, yet). I added multiband chunking and support for applying separate kwargs to different band chunks.

https://github.com/lex-c/gdal_dask_reproject

@RichardScottOZ
Contributor

It took a few hours with the rio command line to warp a 40 GB VRT dataset to a set bounds, with a 400 GB upscale. It will be interesting to see how the dask/rioxarray variety does - match chunk size to the rasterio memory use 'defaults'?

@RichardScottOZ
Contributor

Related:

Yes, looking forward to trying the odc-geo version out sometime.

@ljstrnadiii

ljstrnadiii commented Feb 1, 2023

Any progress on this? I am looking to follow the large-reproject suggestions, but with a VRT we build, e.g.:

import rasterio
import rioxarray
from rasterio.vrt import WarpedVRT

env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
    CPL_VSIL_CURL_USE_HEAD=False,
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS="VRT",
)
with env:
    with rasterio.open('gs://some/path.vrt') as src:
        with WarpedVRT(src, crs="EPSG:3857") as vrt:
            darray = rioxarray.open_rasterio(
                vrt, chunks={"x": 512, "y": 512}
            )

but I get RasterioIOError: '/vsigs/some/path.vrt' does not exist in the file system, and is not recognized as a supported dataset name.

I am getting a sense that there is something more to it than this...

@mraspaud
Contributor

mraspaud commented Feb 3, 2023

To report back here (should have done it earlier, sorry):
We have implemented a resample_blocks function in pyresample:
https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks

This makes it possible to do various operations (e.g. resampling) by blocks when the source and target arrays are in different CRSs. resample_blocks ensures, in principle, that the input data passed to the custom func covers the chunk at hand in the target domain, but not much more. So there can be some overlap, but not much.
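For reference, a sketch of calling resample_blocks, adapted from the example in its docstring (the area definitions below are illustrative; gradient_resampler_indices_block computes, for each target chunk, the fractional source-pixel indices it needs):

from pyresample import create_area_def
from pyresample.gradient import gradient_resampler_indices_block
from pyresample.resampler import resample_blocks

src_area = create_area_def("src", "EPSG:4326", shape=(1800, 3600),
                           area_extent=(-180, -90, 180, 90))
dst_area = create_area_def("dst", "EPSG:3857", shape=(2048, 2048),
                           area_extent=(-2e7, -2e7, 2e7, 2e7))

# Only source chunks that actually overlap a given target chunk end up
# in that chunk's task graph; the result is a lazy dask array.
indices = resample_blocks(gradient_resampler_indices_block,
                          src_area, [], dst_area,
                          chunk_size=(2, 1024, 1024), dtype=float)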

@Kirill888
Contributor

Just wanted to let people know that odc-geo>=0.4.0 now supports Dask inputs/outputs. It uses GDAL via rasterio to perform the actual pixel work; pyproj and shapely are used for determining the minimal dependency graph from source to output chunks. Xarray inputs with an arbitrary number of dimensions are supported as long as the Y,X axes are next to each other and in that order.
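A sketch of what the dask-backed path looks like from the user side (the file name, CRS, and chunking are illustrative; xx.odc.reproject comes from the accessor registered by odc.geo.xr):

import rioxarray
import odc.geo.xr  # noqa: F401 - registers the .odc accessor on xarray objects

xx = rioxarray.open_rasterio("input.tif", chunks=2048)
# With a dask-backed input the output is also lazy; odc-geo builds the
# minimal source-chunk -> output-chunk graph before any pixels are read.
yy = xx.odc.reproject("EPSG:3857")
yy = yy.compute()  # GDAL (via rasterio) does the per-block warping here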

@RichardScottOZ
Contributor

Thanks very much Kirill.

@RichardScottOZ
Contributor

How big a thing have you tested? :)

@Kirill888
Contributor

How big a thing have you tested? :)

Haven't really done any real data processing with it yet. But for testing I was using a zoom level 8 COG of Blue Marble as input:

Driver: GTiff/GeoTIFF
Files: Data/land_shallow_topo_slippy8_z6.tif
Size is 65536, 65536
Coordinate System is:
PROJCRS["WGS 84 / Pseudo-Mercator",
...

but extracting smaller regions in various projections/resolutions.

@djhoese
Contributor

djhoese commented May 22, 2023

@Kirill888 This is great! This makes me hopeful that we're converging on a common practice for all of this type of work. Everyone is generally dependent on pyproj or rasterio CRS objects, and these new resampling functions (the GDAL-based one in odc-geo and the custom algorithms in pyresample) are doing things chunk by chunk. What you describe with shapely polygons and CRSes being used to determine the graph/combination of source->target chunks sounds a lot like what @mraspaud started doing in resample_blocks in pyresample.

I'm excited to read more about odc-geo's Geometry (shapely + CRS) objects too, since this is something that pyresample kind of does in a hacky, everything-is-separate kind of way, and we really want something like what odc-geo is describing.

@Kirill888
Contributor

@djhoese thanks for the kind words.

odc-geo intentionally has a rather slim compulsory dependency set of libraries. Basically it's just this:

    affine
    cachetools
    numpy
    pyproj>=3.0.0
    shapely

although, for "Dask reproject" to work you also need xarray, rasterio, and dask. It is also a "Python only" library (thanks to pyproj, shapely, and rasterio for handling the binary distribution part). So one could potentially depend on this library without making life harder for your users.

The design goal is to allow minimal functionality to work with just those dependencies and to expand the available feature set when more libraries are installed. We extracted the useful geometry/shape related utilities out of datacube precisely because we think they are useful on their own, and datacube just has way too many dependencies, some of which can be tricky to install.

odc-geo provides an odc.geo.geom.Geometry, which is basically just a wrapper around shapely.geometry.shape with a reference system defined using a pyproj.crs.CRS object. It exposes the geometric operations implemented by GEOS and made available to Python by shapely, but with an extra check that multi-geometry operations are only allowed on shapes defined within the same CRS. In addition, a .to_crs(...) method is provided to perform mapping between projections, supporting convenience things like adding more points before the projection change.

But probably the most useful class in odc.geo is GeoBox. This abstraction captures the exact geo-referencing of every pixel of some bounded raster, provided the raster uses a linear mapping from pixel space to the CRS. There is a similar GCPGeoBox abstraction for the non-linear case that uses ground control points instead of an Affine matrix to map between pixel and world space.

[GCP]GeoBox offers a number of handy operations like cropping, padding, construction from geometries or bounding boxes, union, intersection, resolution change, coordinate reference change, etc. But most importantly, one can (A) compute the GeoBox for xarray objects that were constructed by rioxarray or odc-geo or datacube or odc-stac or stackstac, and (B) "attach" a geobox to an existing xarray object with a compatible pixel shape along the Y/X dimensions. We aim to stay "data compatible" with the rioxarray approach for creating and locating geo-referencing information inside xarray objects^. So you can do xx.rio.to_raster on an array coming out of odc.geo and expect correct geo-referencing in the output file.

^ we differ in the way we handle non-axis aligned footprints
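A small sketch of the GeoBox operations described above (the file name is illustrative; method names are from odc-geo):

import rioxarray
import odc.geo.xr  # noqa: F401 - registers the .odc accessor

xx = rioxarray.open_rasterio("input.tif")
gbox = xx.odc.geobox       # (A) recover the pixel grid + CRS from the array
padded = gbox.pad(16)      # same grid, grown by 16 pixels on every side
coarse = gbox.zoom_out(2)  # same extent, 2x coarser resolution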

@mraspaud I have looked only briefly at pyresample. I'm not sure how you "know the location of each input pixel" - or is this meant to be provided by the user via src_area? I'll need to have a deeper look at your spherical classes; I suspect pyresample might be more robust in handling projections with non-trivial valid regions.

@mraspaud
Contributor

@Kirill888 yes, the src_area defines the geographical domain and projection for the data you provide.

Regarding the spherical module, indeed we developed it to cope with the poles and date line cases that shapely didn't handle well. There is also this library which may be able to help: https://spherely.readthedocs.io/en/latest/

@rhugonnet

Hi @snowman2, all,

We also implemented a Dask-delayed reprojection in our package, in the frame of an ongoing Xarray accessor (#687), see here: GlacioHack/geoutils#537, with part of the code inspired by @Kirill888's odc-geo implementation (thanks for the great work!).

The code is stand-alone (it does not depend on any other class of our package: https://github.com/rhugonnet/geoutils/blob/fafd2399d35ff22bf79bc631a9d805ab19a18a94/geoutils/raster/delayed.py#L387), so it could be moved easily to Rioxarray if you think it makes sense 🙂.
However, it would require an optional shapely (or geopandas) dependency to be installed (which I guess is true for most users, but I don't know if you want to go through the trouble of managing optional deps).

Also, while writing the test suite for the function, I noticed some pretty big discrepancies between rio.warp.reproject applied to a full output array versus a subset of that output array, especially for bilinear/cubic resampling: rasterio/rasterio#2052 (comment). Hopefully I just made a mistake somewhere, but it seems unlikely, with the "nearest" resampling being a quasi-perfect match... 😅
