
Enhancement to ERA5 Data Retrieval and Download Process #397

Open
wants to merge 9 commits into base: master

Conversation

@yndevops2 commented Oct 28, 2024

This update introduces an optimized approach to retrieving and caching ERA5 data from the Climate Data Store (CDS). Key changes include:

  1. Caching Mechanism: Added a caching mechanism to prevent repeated downloads for identical data requests. The cache files are named based on a unique hash of the request parameters, making subsequent retrievals faster by using pre-downloaded data.

  2. Custom Download Function: Integrated a custom download function with a progress bar. The function downloads in chunks and includes error handling and retries for a robust download process.

  3. Progress Bar: A dynamic progress bar displays the download status of multiple files, with completed files removed from the display to improve readability.
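The caching described in point 1 could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: `cache_key`, `retrieve`, and the `era5_cache` directory are invented names, and the real implementation presumably wraps the cdsapi client.

```python
import hashlib
import json
import os

CACHE_DIR = "era5_cache"  # hypothetical cache location

def cache_key(request: dict) -> str:
    """Derive a stable filename stem from the CDS request parameters.

    Sorting the keys makes the JSON serialisation deterministic, so
    identical requests always map to the same cache file.
    """
    payload = json.dumps(request, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

def retrieve(request: dict, download, cache_dir: str = CACHE_DIR) -> str:
    """Return the cached file for `request`, downloading only on a cache miss.

    `download` is any callable that fetches the data for `request` and
    writes it to the given target path (e.g. a wrapper around cdsapi).
    """
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, cache_key(request) + ".nc")
    if not os.path.exists(target):
        download(request, target)
    return target
```

Hashing a key-sorted serialisation of the request means two requests with the same parameters in a different order still hit the same cache file.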
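Points 2 and 3 (chunked downloading with retries, plus a progress display) might look roughly like this sketch. `download_with_progress` is a hypothetical name; the actual PR presumably streams from the CDS endpoint rather than plain `urllib`, and its progress bar handles multiple files at once.

```python
import sys
import urllib.request

def download_with_progress(url: str, target: str,
                           chunk_size: int = 1 << 16, retries: int = 3) -> str:
    """Download `url` to `target` in chunks, printing a simple progress line.

    Retries the whole download up to `retries` times on I/O errors.
    """
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url) as response:
                total = int(response.headers.get("Content-Length") or 0)
                done = 0
                with open(target, "wb") as f:
                    while True:
                        chunk = response.read(chunk_size)
                        if not chunk:
                            break
                        f.write(chunk)
                        done += len(chunk)
                        if total:  # size known: show percentage
                            pct = 100 * done // total
                            sys.stderr.write(f"\r{target}: {pct:3d}%")
            if total:
                sys.stderr.write("\n")
            return target
        except OSError:
            if attempt == retries:
                raise
```

Chunked reads keep memory use flat regardless of file size, which matters for multi-gigabyte ERA5 responses.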

These improvements aim to make data retrieval more efficient and user-friendly.

Closes # (if applicable).

Changes proposed in this Pull Request

Checklist

  • Code changes are sufficiently documented; i.e. new functions contain docstrings and further explanations may be given in doc.
  • Newly introduced dependencies are added to environment.yaml, environment_docs.yaml and setup.py (if applicable).
  • A note for the release notes doc/release_notes.rst of the upcoming release is included.
  • Unit tests for new features were added (if applicable).
  • I consent to the release of this PR's code under the MIT license.

@yndevops2 yndevops2 changed the title Update era5.py Enhancement to ERA5 Data Retrieval and Download Process Nov 3, 2024
@fneum fneum requested a review from lkstrp November 4, 2024 11:03
@lkstrp (Member) left a comment

Thanks @yndevops2 for the contribution!
In general you can contact us via GitHub and PRs; you don't need to send any emails. Still, I haven't heard back from you.

A caching feature like this adds a lot of overhead, and I do not know whether it is needed. Also, the CDS API already provides caching, and there is no real use case for re-downloading data.

@awongel commented Jan 28, 2025

Hi @lkstrp,
We asked @yndevops2 to help us speed up the download, because with the main-branch version of Atlite it was not possible to download global, grid-scale, multi-year capacity factor time series, which we needed for a project. With this upgrade, that is now possible.

The caching is an optional flag anyway, but we can talk about whether you'd be interested in integrating only the sped-up download. The idea behind the caching was that when you change the region of interest to something smaller than what was downloaded before, one could avoid re-downloading the data.

@yndevops2 (Author)

Hi @lkstrp,
Downloading multi-year time series at a global scale is slow because the .nc files are downloaded one by one, and caching only works once all the data has been fully downloaded.

Please check this code:

import atlite
import logging
import geopandas as gpd

def main(year):
    
    logging.basicConfig(level=logging.INFO)

    url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"

    world = gpd.read_file(url)
    # Drop uninhabited regions and Antarctica
    world = world[(world["POP_EST"] > 0) & (world["NAME"] != "Antarctica")]

    region = world
    region_name = "world"

    # Process the requested year
    logging.info(f"Processing {year}")

    # Define the cutout; this will not yet trigger any major operations
    cutout = atlite.Cutout(
        path=f"{region_name}-{year}_timeseries", module="era5", 
        bounds=region.unary_union.bounds, 
        time=f"{year}",
        chunks={"time": 100,},)
    # This is where all the work happens (this can take some time).
    cutout.prepare(
        compression={"zlib": True, "complevel": 9},
        monthly_requests=True,
        concurrent_requests=True)

    # Extract the wind power generation capacity factors
    wind_power_generation = cutout.wind(
        "Vestas_V112_3MW", 
        capacity_factor_timeseries=True,
        )

    # Extract the solar power generation capacity factors
    solar_power_generation = cutout.pv(
        panel="CSi", 
        orientation='latitude_optimal', 
        tracking="horizontal",
        capacity_factor_timeseries=True,)
    
    # Extract the concentrated solar power (CSP) generation capacity factors
    csp_power_generation = cutout.csp(
        installation="SAM_parabolic_trough", 
        capacity_factor_timeseries=True,)

    # Save gridded data as netCDF files
    wind_power_generation.to_netcdf(f"{region_name}_wind_CF_timeseries_{year}.nc")
    solar_power_generation.to_netcdf(f"{region_name}_solar_CF_timeseries_{year}.nc")
    csp_power_generation.to_netcdf(f"{region_name}_csp_CF_timeseries_{year}.nc")

if __name__ == "__main__":
    main("2023")
