
catchment weights #28

Closed
JordanLaserGit opened this issue Jan 11, 2024 · 15 comments
@JordanLaserGit

In order to create nextgen forcing files, weights (cell indices) are calculated from the hydrofabric and National Water Model files. Currently this is done by rasterizing each catchment, which is a time-consuming process. Would subsetting from a conus_weight.json file be a better solution? Would hfsubset be a good place to do this? These weights might change with each hydrofabric release, so it would be nice to have the weights computed automatically along with the geopackages.
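
(For context, a minimal sketch of that per-catchment rasterization approach; this is not the actual forcing-generation code, and the paths, the divides layer/divide_id field, and the RAINRATE variable name are illustrative assumptions.)

# Minimal sketch of per-catchment rasterization; paths and names are illustrative.
import json

import geopandas as gpd
import numpy as np
import rasterio
from rasterio.features import rasterize

gpkg_path = "nextgen_01.gpkg"        # hydrofabric geopackage (assumed path)
nwm_path = "nwm_forcing_example.nc"  # NWM forcing file (assumed path)

# Read the NWM grid shape, affine transform, and CRS via GDAL's netCDF driver.
with rasterio.open(f"netcdf:{nwm_path}:RAINRATE") as src:
    transform, shape, crs = src.transform, (src.height, src.width), src.crs

divides = gpd.read_file(gpkg_path, layer="divides").to_crs(crs)

weights = {}
for divide_id, geom in zip(divides["divide_id"], divides.geometry):
    # Burn one catchment at a time onto the full grid; doing this per polygon
    # over the whole CONUS grid is what makes the naive approach slow.
    mask = rasterize(
        [(geom, 1)],
        out_shape=shape,
        transform=transform,
        fill=0,
        all_touched=True,
        dtype="uint8",
    )
    rows, cols = np.where(mask == 1)
    weights[divide_id] = [rows.tolist(), cols.tolist()]

with open("weights.json", "w") as f:
    json.dump(weights, f)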

@program--
Member

@JordanLaserGit I can't say for certain whether this would fit into hfsubset or some other utility, but this is a good place to start that discussion since it sounds like a reasonable need. I'm in the process of doing a small rewrite of the hfsubset service to an actual API to address concurrency issues (which is why there's been a lack of updates).

I haven't had the chance to generate forcing files, only use them. Would you be able to give me an example of the weights files?

@JordanLaserGit
Author

https://ngenresourcesdev.s3.us-east-2.amazonaws.com/01_weights.json
https://ngenresourcesdev.s3.us-east-2.amazonaws.com/nextgen_01.gpkg

@program-- Here's a weight file I generated a few months ago, along with the geopackage it came from (v20 hydrofabric, I believe). This isn't an immediate need; I just figured these weight files should be standardized and live somewhere the community can access readily. If you're curious to see the implementation in the forcing generation, you can find it here: https://github.com/CIROH-UA/ngen-datastream/tree/main/forcingprocessor

@program--
Member

@JordanLaserGit Is this file generating the weights? If so, I think it's possible to integrate something similar on the service-side (or maybe client-side? since that'll let the user pick what forcing to use?).

For now, let's keep this issue open in case of further discussion, and I'll note the requirement as an addition to the hfsubset API service. 😄

@program-- program-- added the enhancement New feature or request label Jan 11, 2024
@program-- program-- self-assigned this Jan 11, 2024
@JordanLaserGit
Author

@program-- Yup, that'll create the weights. Assuming we do something like run that script once on CONUS and then subset the resulting file, the speed of that script probably isn't too big of an issue. Though if we want it to run upon user request, we may want to optimize it. CONUS took something like 18 hours (with no parallelization whatsoever).
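
(A rough illustration of the compute-once-then-subset idea, under the assumption that the CONUS file is a flat JSON object keyed by divide_id; the file names and helper are hypothetical.)

# Hypothetical sketch: pull only the requested catchments out of a
# precomputed CONUS-wide weights file keyed by divide_id.
import json

import geopandas as gpd

def subset_weights(conus_weights_path, gpkg_path, out_path):
    # divide_ids present in the subset geopackage.
    divides = gpd.read_file(gpkg_path, layer="divides")
    wanted = set(divides["divide_id"])

    # CONUS-wide weights, e.g. {"cat-18": [[rows], [cols]], ...}.
    with open(conus_weights_path) as f:
        conus = json.load(f)

    subset = {k: v for k, v in conus.items() if k in wanted}
    with open(out_path, "w") as f:
        json.dump(subset, f)

subset_weights("conus_weights.json", "nextgen_01.gpkg", "weights_01.json")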

Sounds good, thanks for taking a look at this!

@JordanLaserGit
Author

Ask and ye shall receive! @JoshCu just sped things up. CIROH-UA/ngen-datastream#36

@JoshCu

JoshCu commented Jan 11, 2024

In theory a further 2.5x speedup is possible by using geometry windows to rasterize only a subset of the whole region. But I'm having some interesting issues with transforms: depending on whether I open the source NetCDF with rasterio or with xarray, the y axis is flipped and there's some rounding error, by the looks of it. It's entirely possible I'm doing something weird, though; I don't have much experience with this. Code to generate incorrect weightings quickly.

rasterio, using netcdf4 (I think)

| 1000.00, 0.00,-2303999.25|
| 0.00,-1000.00, 1919999.62|
| 0.00, 0.00, 1.00|

xarray definitely using netcdf4

| 1000.00, 0.00,-2303999.25|
| 0.00, 1000.00,-1920000.38|
| 0.00, 0.00, 1.00|
Starting at 2024-01-11 13:58:41.329457
Finished at 2024-01-11 13:58:51.810312
Total time: 0:00:10.480928
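
(For reference, a hedged sketch of how those two transforms typically come about; the x/y coordinate names and the RAINRATE variable are assumptions about the NWM file.)

# Sketch: two ways of obtaining an affine transform for the same NWM grid.
import rasterio
import xarray as xr
from affine import Affine

nwm_path = "nwm_forcing_example.nc"  # assumed path

# 1) rasterio (GDAL netCDF driver): reports a north-up transform, i.e. a
#    negative y pixel size with the origin at the top-left cell edge.
with rasterio.open(f"netcdf:{nwm_path}:RAINRATE") as src:
    print(src.transform)

# 2) xarray: building the transform from the coordinate arrays, which are
#    typically stored south-to-north, yields a positive y scale and an origin
#    at the bottom-left edge, hence the flipped y axis.  Coordinates are cell
#    centres, so half a cell is subtracted to recover the edges; sub-metre
#    differences in the offsets are likely float precision of those centres.
ds = xr.open_dataset(nwm_path)
dx = float(ds.x[1] - ds.x[0])
dy = float(ds.y[1] - ds.y[0])
xarray_transform = (
    Affine.translation(float(ds.x[0]) - dx / 2, float(ds.y[0]) - dy / 2)
    * Affine.scale(dx, dy)
)
print(xarray_transform)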

current weights

"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}

weights with rasterio transform (y axis flip has been corrected for)

"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]}

The column indices are all shifted by 11, but otherwise the result is the same.

If those rasterio weightings are correct, then it only takes ~10 minutes to create CONUS weightings.

@program--
Member

program-- commented Jan 12, 2024

@JoshCu For my own purposes, and to better integrate with hfsubset's subsetting code, I rewrote the weights generator in R... though if I throw it on the client side, I guess I'll have to rewrite it again in Go.

R version
#!/usr/bin/env Rscript

gpkg_path <- "..."
grid_path <- "..."
output    <- "..."

# microbenchmark::microbenchmark({

# Read the NWM grid (RAINRATE variable) through GDAL's netCDF driver.
grd  <- terra::rast(paste0(
    "netcdf:",
    grid_path,
    ":RAINRATE"
))

# Load the divides layer and project it to the grid's CRS.
gpkg <-
    terra::vect(gpkg_path, "divides") |>
    terra::project(terra::crs(grd))

# Burn divide_id onto the grid (all touched cells), then flip the result
# vertically to match the y-axis orientation of the netCDF grid.
weight_rast <-
    terra::rasterize(
        x = gpkg,
        y = grd,
        field = "divide_id",
        background = NA_character_,
        touches = TRUE
    ) |>
    terra::flip(direction = "vertical")

# Convert covered cells to (row, col) indices per divide_id.
weights <-
    terra::as.data.frame(weight_rast, cells = TRUE) |>
    dplyr::mutate(
        row = terra::rowFromCell(weight_rast, cell),
        col = terra::colFromCell(weight_rast, cell)
    ) |>
    dplyr::select(-cell) |>
    dplyr::arrange(divide_id, row, col)

# Write one [rows, cols] pair per divide_id; dropping the names makes
# jsonlite emit plain arrays rather than objects.
split(weights, weights$divide_id) |>
    lapply(FUN = function(part) {
        setNames(as.list(part[, c("row", "col")]), c())
    }) |>
    jsonlite::write_json(output)

# }, times = 5L)

Which seems to give a decent speedup, if I'm interpreting your results correctly.

Unit: seconds
      min       lq    mean   median       uq      max neval
 4.356602 4.376032 4.75269 4.406561 4.420441 6.203815     5

for VPU 01. However, I'm experiencing something similar to your issue, where my coordinates are (seemingly) slightly off. rasterio uses GDAL under the hood, just like terra in my case (hence why I needed to flip my raster too), and I know that xarray (particularly rioxarray) and GDAL perform operations differently, so that might be the root of the issue. Whether the weights are correct or not... well I guess that's up to @JordanLaserGit or someone else haha

I've attached my example weights output from R here: weights-r.json

@JoshCu

JoshCu commented Jan 12, 2024

@program-- I'm just running bash time and Python prints like a caveman, but I'm pretty sure your implementation is way faster, especially if that's single-process; R's pretty alien to me so I can't tell.
Am I right in assuming that https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?

VPU 01 on 56 cores: ~12.5s. (I had no idea you could have folding sections in md; I'll be overusing them from now on.)
Starting at 2024-01-11 22:12:28.170569
Time spent opening files 0:00:00.344885
Total processing time: 0:00:08.741575
Total writing time: 0:00:00.318865
Total time: 0:00:11.566039

real    0m12.570s
user    6m55.572s
sys     0m12.635s

We do seem to have different weightings outputs. I'm not sure what the best way to diff two large JSON files is, but they're different sizes and grep spot checks show different results.
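
(One quick way to compare them, sketched under the assumption that both files are flat JSON objects keyed by catchment id; the file names are placeholders.)

# Quick comparison of two weights JSON files keyed by divide_id.
import json

with open("weights_a.json") as f:
    a = json.load(f)
with open("weights_b.json") as f:
    b = json.load(f)

print("keys only in A:", sorted(a.keys() - b.keys()))
print("keys only in B:", sorted(b.keys() - a.keys()))

shared = a.keys() & b.keys()
mismatched = [k for k in shared if a[k] != b[k]]
print(f"{len(mismatched)} of {len(shared)} shared catchments differ")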

xarray transform

"cat-18": [[2421, 2422, 2422, 2423, 2423, 2424, 2424, 2424, 2424, 2424, 2425, 2425, 2425, 2425, 2425, 2426, 2426, 2426, 2426, 2426, 2426, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2427, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2429, 2429, 2429, 2429, 2429, 2429, 2430, 2430, 2431, 2431], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]

rasterio transform

"cat-18": [[2432, 2433, 2433, 2434, 2434, 2435, 2435, 2435, 2435, 2435, 2436, 2436, 2436, 2436, 2436, 2437, 2437, 2437, 2437, 2437, 2437, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2438, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2439, 2440, 2440, 2440, 2440, 2440, 2440, 2441, 2441, 2442, 2442], [4323, 4323, 4324, 4323, 4324, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4325, 4326, 4322, 4323, 4324, 4327, 4328, 4329, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4324, 4325, 4326, 4327, 4328, 4329, 4327, 4328, 4327, 4328]]

terra transform

"cat-18":[[2424,2425,2425,2425,2426,2426,2426,2426,2426,2427,2427,2427,2427,2427,2428,2428,2428,2428,2428,2428,2428,2429,2429,2429,2429,2429,2430,2430,2430,2431,2432],[4324,4324,4325,4326,4323,4324,4325,4326,4327,4324,4325,4328,4329,4330,4323,4324,4325,4326,4327,4328,4329,4326,4327,4328,4329,4330,4328,4329,4330,4329,4329]]

It would be easier to see if I could figure out how to wrap lines in code blocks. I seem to have double the number of coordinates for the catchments; maybe the all-touched implementation is different too?

RTI have a weightings generator we could compare to, but I think I'll leave that to someone whose understanding extends beyond my insightful "the numbers are different" comments.
weights from nextgen_01.gpkg -> weights_01.json

@hellkite500

@JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got Python bindings integrated (officially)... Check it out for weight generation; it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)
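
(A hedged sketch of what that could look like with the exactextract Python bindings; the availability of the "cell_id"/"coverage" operations, the include_cols/output arguments, and the file paths are assumptions about the installed version rather than verified usage.)

# Hedged sketch: coverage-fraction weights via the exactextract Python
# bindings.  Operation names, arguments, and paths are assumptions.
import geopandas as gpd
import rasterio
from exactextract import exact_extract

divides = gpd.read_file("nextgen_01.gpkg", layer="divides")
rast = rasterio.open("netcdf:nwm_forcing_example.nc:RAINRATE")

# One row per divide, with arrays of covered cell ids and the fraction of
# each cell actually covered (partial coverage included, unlike all-touched).
weights = exact_extract(
    rast,
    divides,
    ["cell_id", "coverage"],
    include_cols=["divide_id"],
    output="pandas",
)
print(weights.head())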

@program--
Member

program-- commented Jan 12, 2024

> @program-- I'm just running bash time and Python prints like a caveman, but I'm pretty sure your implementation is way faster, especially if that's single-process; R's pretty alien to me so I can't tell. Am I right in assuming that https://lynker-spatial.s3.amazonaws.com/v20/gpkg/nextgen_01.gpkg is VPU 01?
>
> ...

Yeah, VPU 01 is that link you sent (err, I might have used v20.1, but they should be effectively the same for this...). The size difference is most likely down to the formatting of the JSON files; minifying sometimes has a drastic effect, and jsonlite does it by default.

It's interesting to see that all three generators produce different results 😅 Maybe @mikejohnson51 can weigh in (haha) to verify whether my R implementation is correct, since he's better with gridded data than I am. (Only if you have time, Mike!)

> @JordanLaserGit @JoshCu https://github.com/isciences/exactextract just recently got Python bindings integrated (officially)... Check it out for weight generation; it's super fast and accurate. (And likely how the R code mentioned above is so quick 😉)

How dare you, I did it by hand 😛 But that would be useful; I didn't actually know you could generate weights with exactextract, so I'll have to check that out. I've only used it a bit.

@mikejohnson51
Member

mikejohnson51 commented Jan 12, 2024

Hi all!

Yes, we used to deliver the catchment weights for a few forcing products with each release, as a forcing_weights.parquet file. The code to do this is still around. We did it with a slightly modified version of exactextract (used in zonal). If you want to use Python, check out the exactextract bindings that Dan Baston is working on through NASA (as @hellkite500 mentioned) or gdptools by Rich McDonald (@rmcd-mscb). In R (as @program-- did), terra can work, but zonal and exactextractr will be faster and more accurate.

The reason it is more accurate is that it takes partial coverage into account (e.g. a polygon that covers half a cell counts for half, rather than all or nothing).
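
(In other words, the coverage-fraction catchment mean of a gridded variable is the area-weighted average below; the notation is mine, not taken from the zonal/exactextract docs.)

$$\bar{x}_{\text{cat}} = \frac{\sum_i f_i\, x_i}{\sum_i f_i}$$

where $x_i$ is the value of grid cell $i$ and $f_i \in [0, 1]$ is the fraction of cell $i$ covered by the catchment polygon; an all-touched rasterization effectively rounds every touched cell up to $f_i = 1$.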

We stopped sharing those as the forcing workstream became more mature and all groups (forcing, model engine, etc.) wanted to do it themselves.

Overall, I am not opposed to reintroducing the weights so that they would be accessible with hfsubset (as a layer), if it is of high value and we can determine one (or two) primary datasets. Is AORC the main dataset of interest, @JordanLaserGit?

@JordanLaserGit
Author

The short-term goal is to get daily ngen outputs via ngen-datastream. In that workflow, a forcings engine takes the weights as an input. I've already generated the weights for v2.1, and we will re-use the same file until the next hydrofabric version comes out. So a CONUS weights file that could be subsetted would be the ideal dataset. It's not a pressing need, though, as ngen-datastream is still under development.

The primary motivation here is standardizing the catchment weights. We could all agree on using the same tool (like exactextract), but it seems like the most robust way would be to generate the file once and have it live somewhere everyone can reference. Otherwise we may be setting people up for the problem where comparing two research projects over the "same" catchment yields different results just because weight generation wasn't consistent.

@mikejohnson51
Member

Makes perfect sense to me. Remember, though (and I'm sure you've got this), that a weights file is only valid for one grid (same grid size and CRS).
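
(As a purely hypothetical illustration of that caveat, a weights file could carry the grid it was built against so a mismatch is detectable; this layout is not an agreed or proposed format, and the paths are placeholders.)

# Hypothetical: embed grid metadata alongside the weights so a consumer can
# refuse to apply them to a grid with a different size, CRS, or transform.
import json

import rasterio

with rasterio.open("netcdf:nwm_forcing_example.nc:RAINRATE") as src:  # assumed path
    grid_meta = {
        "crs": str(src.crs),
        "shape": [src.height, src.width],
        "transform": src.transform.to_gdal(),
    }

payload = {
    "grid": grid_meta,
    "weights": {"cat-18": [[2432], [4323]]},  # truncated example
}

with open("weights_01.json", "w") as f:
    json.dump(payload, f)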

We can add this to the releases. Would you mind adding an issue to the Hydrofabric repo to track this with a pointer to a sample forcing file?

@JoshCu

JoshCu commented Jan 25, 2024

@hellkite500 exactextract looks great! I didn't realise quite how low-resolution the NetCDFs are compared to the water basin geometries until I overlaid them just now. Do you know if there's a recommended way to parallelise it? I think it caches the percentage of each pixel covered by the geometries, so I've been chunking by geometry rather than by time to avoid recalculating the polygons. I'm just using Python's multiprocessing pool since it's what I'm familiar with, but that's probably not the best way of doing it?
[screenshot: NWM grid cells overlaid on the catchment geometries]
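
(For what it's worth, a minimal sketch of the chunk-by-geometry approach with multiprocessing.Pool; the exact_extract arguments and paths are the same assumptions as above, and each worker opens its own raster handle since open datasets don't pickle.)

# Sketch: split the divides into chunks and run exactextract on each chunk
# in its own process, so per-polygon coverage is only computed once.
from multiprocessing import Pool

import geopandas as gpd
import rasterio
from exactextract import exact_extract

GPKG = "nextgen_01.gpkg"                         # assumed path
RAST = "netcdf:nwm_forcing_example.nc:RAINRATE"  # assumed path

def process_chunk(chunk):
    with rasterio.open(RAST) as rast:
        return exact_extract(rast, chunk, ["cell_id", "coverage"],
                             include_cols=["divide_id"], output="pandas")

if __name__ == "__main__":
    divides = gpd.read_file(GPKG, layer="divides")
    n_workers = 8
    chunks = [divides.iloc[i::n_workers] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)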

@mikejohnson51
Member

See here for a conus weights parquet file: s3://lynker-spatial/v20.1/forcing_weights.parquet
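
(A small sketch of pulling that file down with pandas; s3fs being installed and anonymous access to the bucket are assumptions.)

# Read the CONUS weights parquet directly from S3.
import pandas as pd

weights = pd.read_parquet(
    "s3://lynker-spatial/v20.1/forcing_weights.parquet",
    storage_options={"anon": True},
)
print(weights.head())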
