ICESat-2 ML tutorial - photon classification on ATL07 sea ice data #17
Conversation
First draft with a rough layout of sections for the ICESat-2 ML photon classification tutorial. Included learning objectives, and some initial code to read ATL07 sea ice data from HDF5 to a geopandas.GeoDataFrame. Deciding to do a reimplementation of the Koo et al., 2023 paper with code at https://github.com/YoungHyunKoo/IS2_ML.
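A minimal sketch of what that HDF5-to-GeoDataFrame step could look like; the file path, beam group, and variable names below are placeholder assumptions, not necessarily the tutorial's exact ones:

```python
import geopandas as gpd
import h5py

# File path, beam group and variable names are assumptions for illustration;
# adjust them to the ATL07 granule and beam actually downloaded
atl07_file = "ATL07-01_20191031_example.h5"
with h5py.File(atl07_file, mode="r") as h5:
    beam = h5["gt2l/sea_ice_segments"]
    gdf = gpd.GeoDataFrame(
        data={"height_segment_height": beam["heights/height_segment_height"][:]},
        geometry=gpd.points_from_xy(x=beam["longitude"][:], y=beam["latitude"][:]),
        crs="EPSG:4326",
    )
```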
Show how to save geopandas.GeoDataFrame to a GeoParquet file, and load it back again. Also put down some notes about compression codecs.
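For example, a round-trip sketch using ZSTD compression (other codecs such as snappy, gzip or lz4 are also accepted by `GeoDataFrame.to_parquet`); the output filename is just a placeholder:

```python
# Write the points to GeoParquet with ZSTD compression and the stable 1.0.0 schema
gdf.to_parquet("ATL07_point_cloud.gpq", compression="zstd", schema_version="1.0.0")

# Load it back to check the round-trip
gdf = gpd.read_parquet("ATL07_point_cloud.gpq")
```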
Some quick code to convert the geopandas.GeoDataFrame to a torch.Tensor and put it in a torch DataLoader. Showing how to move data from CPU to GPU using the `.to` method. Might modify this section's title/subtitle later depending on how the code goes.
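Roughly, the conversion could look like the sketch below; the column names and batch size are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Pick numeric columns from the GeoDataFrame and turn them into a float32 tensor
tensor = torch.from_numpy(
    gdf[["height_segment_height", "x_atc"]].to_numpy(dtype="float32")
)
dataloader = DataLoader(TensorDataset(tensor), batch_size=128, shuffle=True)

# Move a mini-batch from CPU to GPU with the `.to` method (if a GPU is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for (batch,) in dataloader:
    batch = batch.to(device)
    break
```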
One more entry in the tutorial index page. Putting down Machine Learning and Pytorch as the topics, and ATL07 as the dataset used for now.
Less boilerplate s3fs code to manage, and not using icepyx means this should run on the Pangeo pytorch-notebook docker image too!
book/tutorials/photon_classifier.py
Outdated
# Authenticate using NASA EarthData login
auth = earthaccess.login()
s3 = earthaccess.get_s3fs_session(daac="NSIDC")  # Start an AWS S3 session
Continuing from #17 (comment), this is the current error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 3
1 # Authenticate using NASA EarthData login
2 auth = earthaccess.login()
----> 3 s3 = earthaccess.get_s3fs_session(daac="NSIDC") # Start an AWS S3 session
File ~/micromamba/envs/hackweek/lib/python3.11/site-packages/earthaccess/api.py:352, in get_s3fs_session(daac, provider, results)
350 session = earthaccess.__store__.get_s3fs_session(endpoint=endpoint)
351 return session
--> 352 session = earthaccess.__store__.get_s3fs_session(daac=daac, provider=provider)
353 return session
AttributeError: 'NoneType' object has no attribute 'get_s3fs_session'
Not sure if there's a way to pass auth credentials to GitHub Actions and/or Netlify build so that this line works?
Can definitely add secrets for `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD`, which would enable Earthdata login during Actions. However, GitHub Actions run on Azure, so S3 access isn't available. I think we'd have to build a self-hosted runner that deploys on AWS.
Ideally we could execute these notebooks on a remotely booted CryoCloud user server. I think all the pieces exist to do this nowadays? @yuvipanda I saw this relevant issue NASA-IMPACT/veda-jupyterhub#46 as I was exploring your latest awesome 2i2c tech including https://github.com/yuvipanda/jupyter-sshd-proxy. Is there a straightforward way to start a user server via CI? Then you'd just have to fire up an ssh connection, execute a notebook, copy and commit the rendered version.
Another option is if CryoCloud had a BinderHub, we could follow Project Pythia and use `execute_notebooks: binder` in `_config.yml` (example here, xref https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705/14). The ssh option would be pretty cool to get working though!
Looking for a coincident alignment of two satellites (ICESat-2 and Sentinel-2) capturing data at the same time! Managed to find a coincident capture on 2019-02-24, though haven't checked if the spatial extent matches yet. Can improve the search algorithm later by expanding the search time window (+/- X minutes) and using a more exact bounding box search in the STAC API query.
Temporal match wasn't enough, so adding the spatial match as well. Metadata on ICESat-2 was lacking unfortunately, so need to open the ATL07 HDF5 file to get the xy coordinates and build a linestring from it to pass to the STAC query. Managed to find a lucky coincident match on 2019-10-31, and have verified that the crossover is valid.
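A hedged sketch of such a STAC query using `pystac_client` against the Earth Search API; the endpoint, bounding box, time window and cloud-cover filter below are illustrative assumptions rather than the tutorial's exact values:

```python
import pystac_client

# Endpoint, bbox, datetime range and cloud-cover limit are all placeholders
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-176.0, -78.0, -174.0, -77.0],  # lon/lat box around the ATL07 track
    datetime="2019-10-31T00:00:00Z/2019-10-31T23:59:59Z",  # +/- window around the ICESat-2 pass
    query={"eo:cloud_cover": {"lt": 30}},
)
items = search.item_collection()
print(f"Found {len(items)} candidate Sentinel-2 scenes")
```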
Add the `x_atc`, `layer_flag` and `height_segment_ssh_flag` data variables to the GeoDataFrame which will be useful for plotting/filtering later. Using `height_segment_ssh_flag` to remove points that might be affected by clouds.
Get the Sentinel-2 RGB image, reproject the ATL07 points and subset to the image's bounding box, then plot them both using PyGMT! The plot colors sea ice points as blue, and sea surface (water) points as orange.
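A rough sketch of that plot, assuming `da_image` is the Sentinel-2 `xarray.DataArray` and `gdf` holds the reprojected ATL07 points; the flag semantics and symbol sizes are assumptions, not the tutorial's exact styling:

```python
import pygmt

fig = pygmt.Figure()
# Greyscale background from the Red band (plotting a single band keeps this simple)
fig.grdimage(grid=da_image.sel(band="red"), cmap="gray", frame=True)
# Assume ssh_flag == 0 marks sea ice and ssh_flag == 1 marks sea surface (water)
ice = gdf[gdf.height_segment_ssh_flag == 0]
water = gdf[gdf.height_segment_ssh_flag == 1]
fig.plot(x=ice.geometry.x, y=ice.geometry.y, style="c0.05c", fill="blue")  # sea ice
fig.plot(x=water.geometry.x, y=water.geometry.y, style="c0.05c", fill="orange")  # sea surface
fig.show()
```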
Use PyGMT's grdtrack to get the Sentinel-2 Red band's pixel values sampled at every ATL07 xy point, and then apply a simple threshold to classify into water (dark), thin ice (gray) and thick ice (white).
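The thresholding itself can be as simple as a `numpy.digitize` call; the cut-off values below are illustrative placeholders, not the tuned ones:

```python
import numpy as np

# Illustrative thresholds on the sampled Red band digital numbers:
# below the first edge -> open water (dark), between -> thin ice (gray), above -> thick ice (white)
df_red["surface_class"] = np.digitize(x=df_red.red_band_value, bins=[2000, 4000])
```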
Reorganizing some content so Part 2 is focused on preparing the DataLoader and neural network model architecture. Have now moved the dataloader for-loop to Part 3 'Training' and commented out the CUDA transfer parts. Also calculated the "hist_mean_median_h_diff" column, which is the actual variable we want to use in training.
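The new column is just the difference of two existing height statistics; the source column names here (`hist_mean_h`, `hist_median_h`) are assumptions for illustration:

```python
# Difference between the mean and median of each segment's photon height distribution
gdf["hist_mean_median_h_diff"] = gdf["hist_mean_h"] - gdf["hist_median_h"]
```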
Writing up section about choosing a machine learning algorithm, including ML models with different levels of complexity from decision trees to neural networks and state-of-the-art models. Also implemented a simple multi-layer perceptron model based on the description in Koo et al., 2023's paper (but without the tanh activation).
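A minimal sketch of such a multi-layer perceptron in PyTorch; the layer widths and feature/class counts are assumptions, not necessarily the exact Koo et al., 2023 architecture:

```python
import torch

class PointCloudClassifier(torch.nn.Module):
    """Small MLP mapping per-segment features to 3 surface-type classes."""

    def __init__(self, num_features: int = 6, num_classes: int = 3):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(in_features=num_features, out_features=50),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=50, out_features=50),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=50, out_features=num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)
```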
Finally got to the actual neural network model training! Now properly splitting the mini-batch data into input and target tensors, passing the input into the model to get the prediction, and minimizing the loss between prediction and target. Needed to do some ugly dtype casting to prevent `RuntimeError`s. Trying to keep this fairly basic without train/validation splits, and only ran this for 3 epochs. Have shifted some markdown blocks up where they belong too.
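In outline, the training loop looks something like the sketch below, assuming the DataLoader yields (input, target) mini-batch pairs and `device` is set as before:

```python
# Bare-bones training loop sketch (no train/validation split), 3 epochs only
model = PointCloudClassifier().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):
    for inputs, targets in dataloader:
        inputs = inputs.to(device=device, dtype=torch.float32)
        targets = targets.to(device=device, dtype=torch.long)  # CrossEntropyLoss expects int64 targets

        predictions = model(inputs)          # forward pass
        loss = loss_fn(predictions, targets)  # compare prediction with target

        optimizer.zero_grad()
        loss.backward()   # backpropagate
        optimizer.step()  # update weights
    print(f"Epoch {epoch}: loss = {loss.item():.4f}")
```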
Default CryoCloud docker image won't have Pytorch, so will need to install it at the first step.
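In the notebook that can be a single cell at the top (shown here as an assumption about how the install is done, using the plain CPU wheel):

```python
%pip install torch
```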
Default CryoCloud image now has Geopandas 1.x, so can save to a non-beta version of GeoParquet schema now.
This should be ready for an initial round of reviews. If possible, I'd appreciate some help with the authentication issue at #17 (comment) (need to grab both an ATL07 file and also Sentinel-2), but I can also try to sort that out over the weekend. There are a few other things I'd like to add to the notebook such as more explanation text at the start, and also show what the trained model's predicted ATL07 photon classifications look like, but that can be done in a follow-up PR.
book/tutorials/photon_classifier.py
Outdated
df_red = pygmt.grdtrack(
    grid=da_image.sel(band="red").compute(),  # Choose only the Red band
    points=gdf.get_coordinates(),  # x/y coordinates from ATL07
    newcolname="red_band_value",
    interpolation="n",  # nearest neighbour
)
The first time I ran this on cryocloud I got a traceback:
RuntimeError: Error opening 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/2/C/ND/2019/10/S2B_2CND_20191031_0_L2A/B04.tif': RasterioIOError('No driver registered.')
But re-running succeeded... might be an intermittent thing
Yes, I have to re-run this sometimes too. Maybe there's a short-lived token or something.
Thanks for this @weiji14! I won't have time for an in-depth review, but to me this looks like a really great tutorial! I wish we could do the direct S3 access in the CI workflow, but I think the easiest thing for now is just to commit a rendered notebook. Please go ahead and merge once you're happy with it.
Adding an overview diagram of the ATL07 + Sentinel-2 processing pipeline (illustrated using Excalidraw) to the start of the notebook. Made some minor edits to some of the markdown cells to include more references and explanatory text.
Pushing the photon_classifier Jupyter Notebook with pre-rendered cells that were run on CryoCloud. Putting the files under a 'machine-learning' folder, to be consistent with the other tutorials using subfolders.
Thanks Scott, I've added an overview diagram at the top, and have pushed the pre-rendered notebook (6.0MB) 🙈. Will merge this in now, and might work on some extra stuff in the last section (e.g. showing the model training results a bit better) in a follow-up PR if there's time.
I know this PR is closed, but is this a photon classifier or a segment classifier? Photon classifiers label the individual photon events from ATL03 (e.g. YAPC or the land/veg classifier).
Ah yes, I should probably have called this a point cloud classifier since ATL07 is based on an aggregate of ATL03 points. Let me fix that in a follow-up PR later. Edit: Updates happening at #17
Draft tutorial for doing ICESat-2 ATL07 photon classification into 3 surface types (open water, thin ice, thick/snow-covered ice). Deciding to do a reimplementation of @YoungHyunKoo's code at https://github.com/YoungHyunKoo/IS2_ML.
Preview at https://deploy-preview-17--icesat2-website2024.netlify.app/tutorials/machine-learning/photon_classifier
Excalidraw file: ATL07_point_cloud_classifier.excalidraw.tar.gz
TODO:
geopandas.GeoDataFrame
Xref: uwhackweek/schedule-2024#38
References: