
Add support for GOES and Himawari Satellite Imagery #222

Open
jacobbieker opened this issue Feb 4, 2024 · 17 comments · Fixed by #240
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@jacobbieker
Member

jacobbieker commented Feb 4, 2024

It would be great if we could support GOES and Himawari satellite imagery. Between the two GOES satellites and Himawari, this would give global geostationary coverage. The idea is essentially to make Satip not EUMETSAT-specific, but more akin to NWP-Consumer, only for satellite imagery.

Detailed Description

Support for GOES-16, -17, and -18 could be fairly easy through Microsoft's Planetary Computer, which has the NetCDF or GeoTIFF files readily available; those could be opened with rioxarray, which is already part of this repo. There are live Himawari NetCDF images on AWS as well (GOES is also there), so data access should not be a problem. The public data is available from 2017 for GOES and from 2020 for Himawari.
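
As a rough illustration of the AWS route, here is a minimal sketch of opening one GOES-16 ABI file straight from the public NOAA bucket with s3fs and xarray (the prefix and date are just example values):

import s3fs
import xarray as xr

# Anonymous access to the public NOAA GOES-16 bucket on AWS
fs = s3fs.S3FileSystem(anon=True)

# List one hour of full-disk ABI L1b radiance files (prefix layout: product/year/day-of-year/hour)
files = fs.ls("noaa-goes16/ABI-L1b-RadF/2023/001/12/")

# GOES NetCDFs are HDF5-based, so h5netcdf can read the remote file object directly
ds = xr.open_dataset(fs.open(files[0]), engine="h5netcdf")
print(ds)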

Ideally, it would also support older GOES and Himawari imagery. Satpy supports GOES-13, -14, and -15 imagery, which goes back to around 2010, and Himawari is available in archives from JAXA. This could give a global, decade-ish archive of imagery, which could be quite useful for a lot of studies and for training models nearly anywhere in the world.

Context

I would use it in either Dagster or planetary-datasets for processing archival imagery and making it available on Hugging Face, which, combined with the public near-real-time archives, would be quite an impactful project. Having the entire EUMETSAT MSG RSS dataset available (for just about a year) has already resulted in one paper we know of that used the whole archive for solar forecasting. Extending that to global coverage is a natural next step.

Possible Implementation

It could be a combination of unifying access to the satellite imagery in different ways. For example, for Himawari data, I am creating a kerchunk dataset of the public archival NetCDF files for Himawari-8 and -9 here: https://huggingface.co/datasets/jacobbieker/himawari8-kerchunk, so OCF could copy that and keep extending it. GOES-17, -18, -19 could be pulled from Planetary Computer, or we could do something similar to Himawari and pull the data straight from the AWS NetCDF archive.
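
For reference, a kerchunk index like that can be opened through fsspec's reference filesystem; a minimal sketch, assuming a hypothetical reference file name:

import xarray as xr

# Open a kerchunk reference file (hypothetical filename) that indexes NetCDF chunks on S3
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "himawari8_reference.json",   # the kerchunk index itself
            "remote_protocol": "s3",
            "remote_options": {"anon": True},   # the underlying bucket is public
        },
    },
)
print(ds)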

For older GOES imagery, the archive is available from NOAA's CLASS archive, which is free to download and redistribute. The data actually goes back at least two generations of GOES satellites, so the archive could go back to the early 2000s or earlier. But I would propose just going back one generation of GOES satellites, so 2011-ish, which would mostly match the EUMETSAT archive (~2008).

For older Himawari imagery, it is available through DIAS, although there are more licensing restrictions, and I'm not sure whether we could redistribute the data. But we could at least include the more recent Himawari imagery that is publicly available.

@jacobbieker jacobbieker added enhancement New feature or request good first issue Good for newcomers labels Feb 4, 2024
@abhijelly

Hey @jacobbieker - for my understanding, is the task to download the GOES/Himawari data as NetCDF and upload it as a dataset on Hugging Face, similar to what you are trying to accomplish by creating the kerchunk dataset? Thanks

@jacobbieker
Member Author

Hi,

Somewhat; we want to add the ability to convert the native files from Himawari and GOES to Zarr with a similar format to the Google Public Dataset version of the EUMETSAT imagery, so it can all be accessed in the same way. Ideally it would also work for the GOES-13 to -15 imagery, which is not available on AWS and has to be accessed from the NOAA CLASS archive.

@Rishikesh-Reddy

Rishikesh-Reddy commented Feb 28, 2024

Hi @jacobbieker - my understanding is that the task is to add the capability for Satip to process and convert native GOES/Himawari NetCDF files to Zarr format, similar to how it currently handles EUMETSAT data.

Additions to be made are :

  • Implement a download manager specifically for GOES and Himawari data (goes2go by Brian Blaylock could be useful for GOES)
  • Change app.py to accept a parameter and allow processing of all three data sources (EUMETSAT, GOES, and Himawari).
  • Enable the download of raw Himawari and GOES data in their native format and subsequent conversion to Zarr format.

As someone very new to the field of satellite image processing, I'd like to begin with a small task to gain a comprehensive understanding of the codebase. I would greatly appreciate any help you can offer in this regard.

@jacobbieker
Member Author

Hi, yes, that is all correct! The smallest first task would probably be to use goes2go to add a download manager for GOES, or alternatively just download from the AWS bucket directly. The conversion to Zarr should also be fairly straightforward, as satpy can already load the NetCDF files from Himawari and GOES, so what Satip needs to do is take that output and save it in a similar format to the current EUMETSAT data.

Of the two, I would probably go with trying to get goes2go to download the data first; that might be the most straightforward.
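
For reference, a minimal sketch of loading downloaded GOES ABI L1b files with satpy and writing them out with xarray (the file path, channel choice, and output name are placeholders, and the attribute clean-up needed before writing to Zarr will depend on the data):

import glob
from satpy import Scene

# Collect downloaded GOES ABI L1b NetCDF files for one timestep
files = glob.glob("/path/to/downloads/OR_ABI-L1b-RadF*.nc")

# satpy's 'abi_l1b' reader handles GOES ABI; 'ahi_hsd' would be the Himawari HSD equivalent
scn = Scene(filenames=files, reader="abi_l1b")
scn.load(["C08", "C13"])  # two 2 km IR channels, chosen here so they share a grid

# Merge into a single xarray Dataset
ds = scn.to_xarray_dataset()

# satpy attaches objects (area definitions, CRS) that Zarr cannot serialise, so strip them first
for var in ds.variables.values():
    var.attrs.clear()
ds.attrs.clear()
ds = ds.drop_vars("crs", errors="ignore")

ds.to_zarr("goes_example.zarr", mode="w")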

@14Richa
Contributor

14Richa commented Mar 22, 2024

@jacobbieker Thanks for the steps. I took a stab at it. The code snippet below can download the latest files from GOES.

from goes2go.data import goes_latest
data = goes_latest()

This works for me locally. Would the next step be to convert this data to Zarr format? Let me know if my understanding is correct.

@jacobbieker
Member Author

@jacobbieker Thanks for the steps. I took a stab at it. The code snippet below can download the latest files from GOES.

from goes2go.data import goes_latest
data = goes_latest()

This works for me locally. Would the next step be to convert this data to Zarr format? Let me know if my understanding is correct.

Hi,

That is a good start, but we want to be able to give a datetime or a range of dates and have the downloader fetch all the images during that time, not just the latest image. Once you can select dates and download them, the next step would be to convert the data to Zarr. For that, you should be able to open the downloaded NetCDF files with xarray and then save them out to Zarr format. Some preprocessing might be needed, but that would be the first step.
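
A rough sketch of what that could look like, assuming goes2go's goes_timerange helper (please check the goes2go docs for the exact API; the product, dates, and save directory below are just example values):

import os
import pandas as pd
import xarray as xr
from goes2go.data import goes_timerange  # assumed helper; verify against the goes2go docs

# Download every GOES-16 file in a one-hour window; the multi-band L2 CMI product keeps all
# 16 channels on one 2 km grid in a single file per timestep, which makes concatenation easy
files = goes_timerange(
    start=pd.Timestamp("2024-01-01 00:00"),
    end=pd.Timestamp("2024-01-01 01:00"),
    satellite=16,
    product="ABI-L2-MCMIPF",
    return_as="filelist",
)

# goes2go saves files under a local data directory; join it with the returned relative paths
# (the default directory and column name may differ between goes2go versions)
save_dir = os.path.expanduser("~/data")
local_paths = [os.path.join(save_dir, p) for p in files["file"]]

# Open the downloaded NetCDFs with xarray and write them out as one Zarr store
ds = xr.open_mfdataset(local_paths, combine="nested", concat_dim="t")
ds.to_zarr("goes_20240101.zarr", mode="w")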

@14Richa
Contributor

14Richa commented Mar 24, 2024

@jacobbieker Thanks for the steps. I took a stab at it. The code snippet below can download the latest files from GOES.

from goes2go.data import goes_latest
data = goes_latest()

This works for me locally. Would the next step be to convert this data to Zarr format? Let me know if my understanding is correct.

Hi,

That is a good start, but we want to be able to give a datetime or a range of dates and have the downloader fetch all the images during that time, not just the latest image. Once you can select dates and download them, the next step would be to convert the data to Zarr. For that, you should be able to open the downloaded NetCDF files with xarray and then save them out to Zarr format. Some preprocessing might be needed, but that would be the first step.

Hey @jacobbieker, thanks for your reply.

I have raised a PR that adds the GOES Data Download Manager Script. I also have a couple of questions regarding its integration:

  1. I'm thinking about whether to incorporate the GOES Data Download Manager script into the existing download manager for EUMETSAT. Would it be more practical to merge these functionalities into a single manager, or would it be preferable to keep them separate?

  2. Additionally, we need to decide how to differentiate between commands for GOES and EUMETSAT downloads. One approach could be to use flags within the command structure to specify which satellite data to retrieve. However, I wanted to get your thoughts on whether this approach aligns with our objectives or if you have alternative ideas.

@jacobbieker
Member Author

Thanks for the PR! I'll look over it soon. For this architecture, we want the same interface for getting all the different satellite imagery, so integrating it with the current DownloadManager is my preferred way of doing it. The differentiation could be passing the provider to the DownloadManager (i.e. goes, eumetsat, with future ones adding jaxa or gk2a) and internally picking the right code path for each provider. So yes, the idea in 2 is more what I was thinking.
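
A rough sketch of that shape, using hypothetical class and method names (none of this matches the current Satip code exactly):

from abc import ABC, abstractmethod
from datetime import datetime


class BaseDownloader(ABC):
    """Common interface every satellite provider implements (hypothetical)."""

    @abstractmethod
    def download(self, start: datetime, end: datetime, output_dir: str) -> list:
        """Download all native files between start and end; return the local paths."""


class EUMETSATDownloader(BaseDownloader):
    def download(self, start, end, output_dir):
        ...  # the existing EUMETSAT logic would live here


class GOESDownloader(BaseDownloader):
    def download(self, start, end, output_dir):
        ...  # e.g. goes2go or direct AWS access


class DownloadManager:
    """Single entry point; picks the provider-specific downloader by name."""

    _PROVIDERS = {"eumetsat": EUMETSATDownloader, "goes": GOESDownloader}

    def __init__(self, provider: str):
        self.downloader = self._PROVIDERS[provider]()

    def download(self, start, end, output_dir):
        return self.downloader.download(start, end, output_dir)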

@suleman1412
Contributor

Hi @jacobbieker, is this issue still open? From my understanding, we have to merge EUMETSAT and GOES in the DownloadManager file itself. One way I thought of doing this is by creating a common or base class which could be used by the EUMETSAT and GOES classes individually, with the actual DownloadManager acting as the main entry point. Please let me know if I'm heading in the right direction.

@iyui1223

Hi @jacobbieker, @suleman1412, while you are working on the common download class for the different satellites,
I will be looking into possible data sources for the Himawari satellites.
I'll start implementing the Himawari downloader after the direction is set.

The best source I could find is the WorldScienceDataBank run by NICT: https://sc-web.nict.go.jp/himawari/himawari-archive.html
https://sc-nc-web.nict.go.jp/wsdb_osndisk/shareDirDownload/03ZzRnKS
This site provides an archive for the entire Himawari series 1-9. According to JMA's data policy, data obtained from there can only be used for non-profit purposes, but there is no restriction on re-distribution.

@jacobbieker
Member Author

Hi @jacobbieker, is this issue still open? From my understanding, we have to merge EUMETSAT and GOES in the DownloadManager file itself. One way I thought of doing this is by creating a common or base class which could be used by the EUMETSAT and GOES classes individually, with the actual DownloadManager acting as the main entry point. Please let me know if I'm heading in the right direction.

Hi, sorry for the delayed response, but yes, this would be the way to go.

@jacobbieker
Member Author

Hi @jacobbieker, @suleman1412, while you are working on the common download class for the different satellites,
I will be looking into possible data sources for the Himawari satellites.
I'll start implementing the Himawari downloader after the direction is set.

The best source I could find is the WorldScienceDataBank run by NICT: https://sc-web.nict.go.jp/himawari/himawari-archive.html
https://sc-nc-web.nict.go.jp/wsdb_osndisk/shareDirDownload/03ZzRnKS
This site provides an archive for the entire Himawari series 1-9. According to JMA's data policy, data obtained from there can only be used for non-profit purposes, but there is no restriction on re-distribution.

Yeah, that would be great; that data source is the same one I found too. So if you want to go ahead and start adding that downloader, we can integrate it with the above later.

@iyui1223

iyui1223 commented Jul 3, 2024

I've made (or adapted from Satip) a simple downloader for the Himawari satellite. I noticed NOAA's AWS-based data provision has better flexibility, so I changed my plan and downloaded data from there instead of from NICT. The currently available time range is rather limited, but if I am not misinformed, this AWS archive will become the mainstream source for Himawari and other satellite data in the near future, so it is better to invest time there than in NICT.

@jacobbieker
Now I am trying to cook up a simple compression/Zarr conversion tool for the raw images. May I ask how the scaling min/max values for EUMETSAT were decided? I am wondering if it is better to ignore outlier values and set the upper/lower clipping values as the top/bottom 0.3% from 10 randomly taken samples. I want to know the criteria you set for EUMETSAT, just for the sake of data-processing consistency - assuming the numbers were not chosen completely arbitrarily.

@jacobbieker
Member Author

Hey, yeah, AWS seems to be the future for Himawari satellite data, but being able to access data from NICT will also be helpful in the future, as it allows for a longer archive going back before the start date of the AWS datasets. But for a first pass, just the AWS data is still a good thing to add!

For EUMETSAT, we calculated the min and max across ~1000 randomly sampled raw images and then used those to scale between 0 and 1-ish (there is and will be data outside those ranges in the final dataset, but most values fall within them). I'd prefer not to throw away any of the data or information, just rescale it. If it's not perfectly between 0 and 1, that is also okay.
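
A minimal sketch of that kind of rescaling with xarray (the variable name and min/max constants are placeholders, not the values used for the EUMETSAT dataset):

import xarray as xr

# Per-channel min/max estimated from a large random sample of raw images (placeholder values)
CHANNEL_MIN = 0.0
CHANNEL_MAX = 1023.0

ds = xr.open_dataset("example_raw_image.nc")  # hypothetical input file

# Rescale towards [0, 1]; values outside the sampled range simply fall slightly outside [0, 1],
# so nothing is clipped and no information is thrown away
ds["data_scaled"] = (ds["data"] - CHANNEL_MIN) / (CHANNEL_MAX - CHANNEL_MIN)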

@iyui1223

Thank you for the explanation. Integrating both NICT and AWS data sounds ideal, and I'll consider adding NICT data after completing my current implementation.

Regarding data compression, I'm not familiar with how rescaling to a 0-1 range reduces data size. Could you provide a link or more details on this method? Is it related to using fewer exponent bits in the floating-point representation?

I understand the priority is to keep the original data as unchanged as possible. I'm currently compressing the data into uint8 by normalizing values from min to max (0 to 255), reducing the size to a quarter compared with saving as float. However, this may lose precision for small value differences, since the resolution is (max - min)/256. Using the 0.3 percentile was sufficient because differences smaller than that disappear with the compression anyway; however, to cover rare extreme values, this method may not be enough. If storage allows, I can use a more generous compression method to better preserve data quality.

@jacobbieker
Member Author

We rescale the data to 0-1 for a few reasons: 1. We aren't allowed to redistribute EUMETSAT data in the original data values, so rescaling addresses that. 2. It puts the different bands in the same range of values, already normalized for ML usage, where we usually want the data between 0 and 1. It doesn't particularly help with compression, but it helps with downstream tasks.

As for converting to ints, that's fine, but in the code we'd like to keep it as float and retain all the information we can; since we use lossless compression, we won't be throwing any of the data away anyway, and it can be helpful to have that extra information. If storage space is a concern, we could always set you up to push Zarrs to our Hugging Face or Source Cooperative, so the data can be stored there and be easily accessible to everyone.
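
For reference, a minimal sketch of writing float data to Zarr with a lossless Blosc/zstd compressor via numcodecs (the file names and compression level are placeholders):

import numcodecs
import xarray as xr

ds = xr.open_dataset("example_scaled_image.nc")  # hypothetical rescaled dataset

# Lossless compression: zstd inside Blosc with the shuffle filter
compressor = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)

encoding = {var: {"compressor": compressor} for var in ds.data_vars}
ds.to_zarr("example_image.zarr", mode="w", encoding=encoding)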

@iyui1223

Now I understand the purpose of rescaling. I'll add an option to normalize the data by maybe 4 sigma or more, to achieve an equivalent result with less computational cost. However, since redistribution in the original format is not restricted for Himawari, I will keep the data as-is by default.
I will also switch to a lossless compression algorithm. It is good to know that your Hugging Face is available as storage.

I'm currently working on my laptop and will have access to a research computing system in October. After I have tested my code there, I will send a pull request.
