Commit

Place all figures at the bottom of summary notebook
maxrjones committed Nov 1, 2024
1 parent 5a78d05 commit 5b9a487
Showing 2 changed files with 53 additions and 25 deletions.
2 changes: 1 addition & 1 deletion _quarto.yml
@@ -125,6 +125,6 @@ format:
code-overflow: wrap
css: styles.css
toc: true
toc-depth: 3
toc-depth: 4
filters:
- quarto
76 changes: 52 additions & 24 deletions examples/summarize-results.ipynb
@@ -23,6 +23,49 @@
"- Rasterio was not included as a resampling method for GPM IMERG due to a lack of simple methods for handling the non-standard axis order (e.g., (time, x, y) instead of (time, y, x)). Non-standard data and metadata would likely be a barrier to use for many NetCDF files."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variability related to I/O\n",
"\n",
"The figures below show the resampling duration and peak memory allocation for data stored as NetCDF and accessed through the H5NetCDF library, data stored as NetCDF but virtualized into Zarr and accessed via the Zarr and Icechunk libraries, and data transformed to Zarr and accessed via the Zarr and Icechunk libraries. Here are some key takeaways:\n",
"\n",
"- Virtualizing the data as Zarr gives a >2x performance improvement relative to loading with the H5NetCDF library.\n",
"- If the chunk sizes remain the same, virtualization gives the same performance benefit as conversion to a cloud-optimized data format like Zarr. Differences would be observed if the chunk configuration and size were optimized for the particular workflow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variability related to web-optimization\n",
"\n",
"The figures below show the resampling duration and peak memory allocation for tile generation from cloud-optimized GeoTIFF relative to virtualized NetCDF and \"web-optimized Zarr\" (WOZ), which in this case are Zarr data spoofed to contain overviews. Here are some key takeaways:\n",
"\n",
"- Overviews dramatically improve the performance of tile generation at all zoom levels. For example, tile generation was ~20x faster at zoom level 0 and ~3x faster at zoom level 6.\n",
"- Resampling from WOZ using rioxarray added overhead relative to resampling from WOZ using rasterio, due to the increased import and object instantiation times in Xarray relative to using Zarr, NumPy, and Rasterio alone. While the performance differences between COG and WOZ resampling with rasterio could likely be eliminated with future development, rasterio will likely always be faster than rioxarray when using overviews."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implications for future development\n",
"\n",
"- Virtualizing archival file formats greatly improves performance relative to archival file readers such as h5netcdf and motivates the generation of virtual references whenever possible.\n",
"- The Web-Optimized Zarr example shows the potential for Zarr overviews to enable highly performant visualization and motivates the development of the GeoZarr and multi-scales Zarr specifications.\n",
"- Pyinstrument showed that a significant fraction of the total time when resampling Web-Optimized Zarr using rioxarray went towards Xarray importing Pandas and guessing the chunk manager. Both of these components could be improved or removed through future development.\n",
"- The dramatic difference between using xESMF with and without pre-generated weights raises the question of whether similar relative performance improvements could be gained by pre-generating weights for reprojection with GDAL. Given that pyinstrument shows only ~1/4 of the time is spent on the actual resampling operation when using COGs, building specifications for web-optimizing Zarr (i.e., GeoZarr and multi-scales), virtualizing existing datasets, and reducing import times would likely be much simpler and more fruitful activities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary figures"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -162,6 +205,13 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summary figures for comparing resampling methods"
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -594,12 +644,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variability related to I/O\n",
"\n",
"The figures below show the resampling duration and peak memory allocation for data stored as NetCDF and accessed through the H5NetCDF library, data stored as NetCDF but virtualized into Zarr and accessed via the Zarr and Icechunk libraries, and data transformed to Zarr and accessed via the Zarr and Icechunk libraries. Here are some key takeaways:\n",
"\n",
"- Virtualizing the data as Zarr gives a >2x performance improvement relative to loading with the H5NetCDF library.\n",
"- If the chunk sizes remain the same, virtualization gives the same performance benefit as conversion to a cloud-optimized data format like Zarr. Differences would be observed if the chunk configuration and size were optimized for the particular workflow."
"#### Summary figures for comparing storage formats and I/O libraries"
]
},
{
@@ -810,12 +855,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variability related to web-optimization\n",
"\n",
"The figures below show the resampling duration and peak memory allocation for tile generation from cloud-optimized GeoTIFF relative to virtualized NetCDF and \"web-optimized Zarr\" (WOZ), which in this case are Zarr data spoofed to contain overviews. Here are some key takeaways:\n",
"\n",
"- Overviews dramatically improve the performance of tile generation at all zoom levels. For example, tile generation was ~20x faster at zoom level 0 and ~3x faster at zoom level 6.\n",
"- Resampling from WOZ using rioxarray added overhead relative to resampling from WOZ using rasterio, due to the increased import and object instantiation times in Xarray relative to using Zarr, NumPy, and Rasterio alone. While the performance differences between COG and WOZ resampling with rasterio could likely be eliminated with future development, rasterio will likely always be faster than rioxarray when using overviews."
"#### Summary figures for exploring web-optimization"
]
},
{
@@ -1021,18 +1061,6 @@
"source": [
"plot_memory_by_weboptimization()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implications for future development\n",
"\n",
"- Virtualizing archival file formats greatly improves performance relative to archival file readers such as h5netcdf and motivates the generation of virtual references whenever possible.\n",
"- The Web-Optimized Zarr example shows the potential for Zarr overviews to enable highly performant visualization and motivates the development of the GeoZarr and multi-scales Zarr specifications.\n",
"- Pyinstrument showed that a significant fraction of the total time when resampling Web-Optimized Zarr using rioxarray went towards Xarray importing Pandas and guessing the chunk manager. Both of these components could be improved or removed through future development.\n",
"- The dramatic difference between using xESMF with and without pre-generated weights raises the question of whether similar relative performance improvements could be gained by pre-generating weights for reprojection with GDAL. Given that pyinstrument shows only ~1/4 of the time is spent on the actual resampling operation when using COGs, building specifications for web-optimizing Zarr (i.e., GeoZarr and multi-scales), virtualizing existing datasets, and reducing import times would likely be much simpler and more fruitful activities."
]
}
],
"metadata": {
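
The notebook text in this commit reports that overviews help tile generation most at low zoom levels (~20x at zoom 0 versus ~3x at zoom 6). The trend, though not the measured numbers, can be motivated with simple pixel arithmetic. The sketch below is not taken from the benchmark code. It assumes a hypothetical square global raster, power-of-two overview factors, and 256-pixel tiles; the function names, raster width, and overview factors are all illustrative assumptions.

```python
import math

TILE_SIZE = 256  # assumed output tile width/height in pixels


def pixels_read_per_tile(native_width, zoom, overview_factors=(1,)):
    """Estimate how many source pixels are read to render one tile.

    Assumes a square global raster `native_width` pixels wide and a
    quadtree tiling scheme in which zoom z splits the world into 2**z
    tiles per side. `overview_factors` lists the available decimation
    factors (1 means native resolution only).
    """
    # Width of one tile's footprint, measured in native-resolution pixels.
    footprint = native_width / 2**zoom
    # Pick the coarsest level that still provides at least one source
    # pixel per output pixel; otherwise fall back to native resolution.
    usable = [f for f in overview_factors if footprint / f >= TILE_SIZE]
    factor = max(usable) if usable else 1
    side = math.ceil(footprint / factor)
    return side * side


def overview_speedup(native_width, zoom, overview_factors):
    """Ratio of pixels read without overviews to pixels read with them."""
    without = pixels_read_per_tile(native_width, zoom, (1,))
    with_overviews = pixels_read_per_tile(native_width, zoom, overview_factors)
    return without / with_overviews


if __name__ == "__main__":
    width = 32768  # hypothetical native width (256 * 2**7)
    factors = tuple(2**k for k in range(8))  # 1, 2, 4, ..., 128
    for z in (0, 3, 6):
        print(z, overview_speedup(width, z, factors))
```

The pixel-count ratio shrinks as the zoom level approaches the native resolution, which matches the direction of the reported results. It is far larger than the measured speedups, presumably because fixed costs such as imports, object instantiation, and reprojection are not modeled here.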
