-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 75cc2fb
Showing
2 changed files
with
1,989 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,220 @@ | ||
--- | ||
title: "Parallel raster processing in **stars**" | ||
author: "Krzysztof Dyba" | ||
output: | ||
html_document: | ||
toc: yes | ||
toc_float: true | ||
css: style.css | ||
--- | ||
|
||
Prediction on large datasets can be time-consuming, but with enough computing | ||
power, this task can be parallelized easily. Some algorithms provide native | ||
multithreading like `predict()` function in `{ranger}` package (but this is | ||
not standard). In this situation, we have to parallelize the calculations | ||
ourselves. There are several approaches to do this, but for this example we will | ||
divide a large out of memory raster into smaller blocks and make predictions | ||
using multithreading. | ||
|
||
## Data loading | ||
|
||
```{r message=FALSE} | ||
library("stars") | ||
set.seed(1) # define a random seed | ||
``` | ||
|
||
The first step is to list the rasters representing the blue (B2), green (B3), | ||
red (B4), and near infrared (B5) bands to be loaded using the `list.files()` | ||
function. We need to define two additional arguments, i.e. `pattern = "\\.TIF$"` | ||
and `full.names = TRUE`. | ||
|
||
```{r} | ||
files = list.files("Landsat", pattern = "\\.TIF$", full.names = TRUE) | ||
files | ||
``` | ||
|
||
Since this data does not fit in memory, we have to load it as a proxy (actually, | ||
we will only load the metadata describing these rasters). To do this, we use the | ||
`read_stars()` function with the arguments `proxy = TRUE` and `along = 3`. This | ||
second argument will cause the bands (layers) to be combined into a | ||
three-dimensional matrix: $[longitude, \ latitude, \ spectral \ band]$. | ||
|
||
```{r} | ||
rasters = read_stars(files, proxy = TRUE, along = 3) | ||
``` | ||
|
||
We can replace the default filenames with band names, and rename the third | ||
dimension as follows: | ||
|
||
```{r} | ||
bands = c("Blue", "Green", "Red", "NIR") | ||
# rename bands and dimension | ||
rasters = st_set_dimensions(rasters, 3, values = bands, names = "band") | ||
names(rasters) = "Landsat" # rename the object | ||
``` | ||
|
||
## Sampling | ||
|
||
For modeling, we need to use a smaller sample rather than the entire dataset. | ||
Drawing training points consists of several steps, which are shown below: | ||
|
||
```{r} | ||
bbox = st_as_sfc(st_bbox(rasters)) # define a bounding box | ||
smp = st_sample(bbox, size = 20000) # sample points from the extent of the polygon | ||
smp = st_extract(rasters, smp) # extract the pixel values for those points | ||
smp = st_drop_geometry(st_as_sf(smp)) # convert to a data frame (remove geometry) | ||
smp = na.omit(smp) # remove missing values (NA) | ||
head(smp) | ||
``` | ||
|
||
## Modelling | ||
|
||
The aim of our analysis is to perform unsupervised classification (clustering) | ||
of raster pixels based on spectral bands. In other words, we want to group similar | ||
cells together to make homogeneous clusters. A popular method is the Gaussian | ||
mixture models available in the **mclust** package. Clustering can be done using | ||
the `Mclust()` function and requires the target number of clusters to be defined | ||
in advance (e.g. `G = 6`). | ||
|
||
```{r message=FALSE} | ||
library("mclust") | ||
mdl = Mclust(smp, G = 6) # train the model | ||
``` | ||
|
||
## Prediction | ||
|
||
As stated earlier, the entire raster is too large to be loaded into memory. | ||
We need to split it into smaller blocks. For this, we can use the `st_tile()` | ||
function, which requires the total number of rows and columns, and the number | ||
of rows and columns of a small block (for example, it could be 2048 x 2048, | ||
but usually we should find the optimal size). | ||
|
||
```{r} | ||
tiles = st_tile(nrow(rasters), ncol(rasters), 2048, 2048) | ||
head(tiles) | ||
``` | ||
|
||
Finally, the input raster will be divided into `r nrow(tiles)` smaller blocks. | ||
In the following sections, we will compare the performance of single-threaded | ||
and multi-threaded processing. | ||
|
||
Now let's discuss what steps we need to take to make a prediction: | ||
|
||
1. We have `r nrow(tiles)` blocks, so we need to make predictions in a loop. | ||
2. We have to load the rasters, but this time into memory (`proxy = FALSE`) and | ||
only the block (`RasterIO = tiles[iterator, ]`). | ||
3. We use the `predict()` function for clustering. Note we need to define the | ||
`drop_dimensions = TRUE` argument to remove the coordinates from the data frame. | ||
4. Finally, we have to save the clustering results to disk (it can be a temporary | ||
directory). Be sure to specify the missing values (`NA_value = 0`) and the block | ||
size as the input (`chunk_size = dim(tile)`). | ||
|
||
**Note!** The `read_stars()` function opens the connection to the file and closes | ||
it each time, which causes overhead. With a large number of blocks, this can | ||
significantly affect performance. In this case, it is better to use a low-level | ||
API, e.g. **[gdalraster](https://github.com/USDAForestService/gdalraster)**. | ||
|
||
### Single thread | ||
|
||
```{r} | ||
start_time = Sys.time() | ||
for (i in seq_len(nrow(tiles))) { | ||
tile = read_stars(files, proxy = FALSE, RasterIO = tiles[i, ]) | ||
names(tile) = bands # rename bands | ||
pr = predict(tile, mdl, drop_dimensions = TRUE) | ||
pr = pr[1] # select only clusters from output | ||
save_path = file.path(tempdir(), paste0("tile_", i, ".tif")) | ||
write_stars(pr, save_path, NA_value = 0, options = "COMPRESS=NONE", | ||
type = "Byte", chunk_size = dim(tile)) | ||
} | ||
end_time_1 = difftime(Sys.time(), start_time, units = "secs") | ||
end_time_1 = round(end_time_1) | ||
``` | ||
|
||
The prediction took `r as.integer(end_time_1)` seconds. | ||
|
||
### Multiple threads | ||
|
||
Multithreaded prediction is a bit more complicated. This requires the use of | ||
two additional packages, i.e. **foreach** and **doParallel**. Now we need to | ||
setup a parallel backend by defining the number of available threads (or by | ||
detecting them automatically using `detectCores()`) in `makeCluster()`, and then | ||
registering the cluster using the `registerDoParallel()` function. This can be | ||
accomplished, for instance, with: | ||
|
||
```{r message=FALSE} | ||
library("foreach") | ||
library("doParallel") | ||
threads = 3 # specify number of threads | ||
cl = makeCluster(threads) | ||
registerDoParallel(cl) | ||
``` | ||
|
||
Once we have the computing cluster prepared, we can write a loop that will be | ||
executed in parallel. Instead of the standard `for()` loop, we will use `foreach()` | ||
with the `%dopar%` operator. In `foreach()` we need to define an iterator and | ||
packages that will be exported to each worker. Then we use the `%dopar%` operator | ||
and write the core of the function (it is identical as in the single-threaded | ||
example). | ||
|
||
```{r} | ||
packages = c("stars", "mclust") | ||
start_time = Sys.time() | ||
tifs = foreach(i = seq_len(nrow(tiles)), .packages = packages) %dopar% { | ||
tile = read_stars(files, proxy = FALSE, RasterIO = tiles[i, ], NA_value = 0) | ||
names(tile) = bands | ||
pr = predict(tile, mdl, drop_dimensions = TRUE) | ||
pr = pr[1] | ||
save_path = file.path(tempdir(), paste0("tile_", i, ".tif")) | ||
write_stars(pr, save_path, NA_value = 0, options = "COMPRESS=NONE", | ||
type = "Byte", chunk_size = dim(tile)) | ||
return(save_path) | ||
} | ||
end_time_2 = difftime(Sys.time(), start_time, units = "secs") | ||
end_time_2 = round(end_time_2) | ||
``` | ||
|
||
The prediction took `r as.integer(end_time_2)` seconds. By using three threads | ||
instead of one, we have cut the operation time by half! | ||
|
||
## Post-processing | ||
|
||
We have our raster blocks saved in a temporary directory. The final step is to | ||
make a mosaic (that is, to combine the blocks into a single raster). I recommend | ||
using GDAL tools, as they provide the best performance when there are a large | ||
number of blocks. We can use: | ||
|
||
1. `buildvrt` to create a virtual mosaic. | ||
2. `translate` to save as a geotiff. | ||
|
||
We can call these tools using the `gdal_utils()` function from the **sf** package. | ||
|
||
```{r} | ||
vrt = tempfile(fileext = ".vrt") | ||
gdal_utils(util = "buildvrt", unlist(tifs), destination = vrt) | ||
gdal_utils(util = "translate", vrt, destination = "predict.tif") | ||
``` | ||
|
||
Once the parallel computation is complete, we can close the computing cluster | ||
using `stopCluster()` function (this will delete the temporary files). | ||
|
||
```{r} | ||
stopCluster(cl) | ||
``` | ||
|
||
So let's see what our final map looks like. | ||
|
||
```{r message=FALSE} | ||
clustering = read_stars("predict.tif") | ||
colors = c("#29a329", "#cbcbcb", "#ffffff", "#086209", "#fdd327", "#064d06") | ||
names = c("Low vegetation", "Bare soil", "Cloud", "Forest 1", "Cropland", "Forest 2") | ||
clustering[[1]] = factor(clustering[[1]], labels = names) | ||
plot(clustering, main = NULL, col = colors) | ||
``` |
Large diffs are not rendered by default.
Oops, something went wrong.