ICESat-2 ML tutorial - photon classification on ATL07 sea ice data #17

Merged: 21 commits merged into main from photon_classifier on Aug 18, 2024
Changes from 7 commits

Commits (21)
c8b1690 Initial jupytext notebook with outline and ATL07 to geopandas script (weiji14, Aug 6, 2024)
0c892ed Save ATL07 photon data to GeoParquet file with ZSTD compression (weiji14, Aug 7, 2024)
18336b9 Writeup sub-section on moving data from CPU to GPU (weiji14, Aug 7, 2024)
5db2d77 Add ICESat-2 photon classification tutorial to index table (weiji14, Aug 7, 2024)
561529c Merge branch 'main' into photon_classifier (weiji14, Aug 7, 2024)
6836bcd Add pytorch (cpu build) to conda environment (weiji14, Aug 7, 2024)
d616d77 Refactor to get ATL07 using earthaccess instead of icepyx (weiji14, Aug 8, 2024)
1858273 Search for Sentinel-2 imagery captured at same time as ATL07 track (weiji14, Aug 10, 2024)
f445e25 Second exact spatial intersection search using ATL07 line track (weiji14, Aug 12, 2024)
fce139d Add more columns to GeoDataFrame and filter out cloudy points (weiji14, Aug 12, 2024)
770bb83 Plot ATL07 tracks on top of Sentinel-2 image (weiji14, Aug 12, 2024)
1a11529 Label surface type of ATL07 points using Sentinel-2 Red band pixel value (weiji14, Aug 12, 2024)
87cf493 Rename Part 2 to DataLoader and Model architecture (weiji14, Aug 13, 2024)
1a42bcc Architect PhotonClassificationModel and writeup ML model choices (weiji14, Aug 13, 2024)
d599a37 Construct main training loop for ML model (weiji14, Aug 13, 2024)
bdccda1 Merge branch 'main' into photon_classifier (weiji14, Aug 14, 2024)
395982e Add instructions to install pytorch in first code cell (weiji14, Aug 16, 2024)
2a0ae41 Save geoparquet schema version 1.1.0 and reword note on zstd compression (weiji14, Aug 16, 2024)
c3cace6 Merge branch 'main' into photon_classifier (weiji14, Aug 16, 2024)
db55b60 Add overview flowchart to top of notebook and minor edits (weiji14, Aug 18, 2024)
aac4747 Pre-render Jupyter notebook and move files to machine-learning folder (weiji14, Aug 18, 2024)
book/_toc.yml (1 addition, 0 deletions)

@@ -21,6 +21,7 @@ parts:
       sections:
         - file: tutorials/example/tutorial-notebook
         - file: tutorials/nb-to-package/index.md
+        - file: tutorials/photon_classifier
   - caption: Projects
     chapters:
       - file: projects/index
book/tutorials/index.md (1 addition, 0 deletions)

@@ -7,3 +7,4 @@ Below you'll find a table keeping track of all tutorials presented at this event
 | Tutorial | Topics | Datasets | Recording Link |
 | - | - | - | - |
 | [Example Notebook](./example/tutorial-notebook.ipynb) | Jupyter Book formatting, ipyleaflet | n/a | Not recorded |
+| [ICESat-2 photon classification](./photon_classifier) | Machine Learning, PyTorch | ATL07 | TODO |
book/tutorials/photon_classifier.py (243 additions, 0 deletions)

@@ -0,0 +1,243 @@
# ---
# jupyter:
# jupytext:
# formats: py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.16.2
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---

# %% [markdown]
# # Machine Learning with ICESat-2 data
#
# A machine learning pipeline from point clouds to photon classifications.
#
# Reimplementation of
# https://github.com/YoungHyunKoo/IS2_ML/blob/main/01_find_overlapped_data.ipynb

# %% [markdown]
# ```{admonition} Learning Objectives
# By the end of this tutorial, you should be able to:
# - Convert ICESat-2 point cloud data into an analysis/ML-ready format
# - Recognize the different levels of complexity of ML approaches and the
# benefits/challenges of each
# - Understand the potential of using ML for ICESat-2 photon classification
# ```

# %% [markdown]
# ## Part 0: Setup

# %%
import earthaccess
import geopandas as gpd
import h5py
import torch

# %% [markdown]
# ## Part 1: Convert ICESat-2 data into ML-ready format
#
# Steps:
# - Get ATL07 data using [earthaccess](https://earthaccess.readthedocs.io)
# - Filter to only strong beams
# - Subset to 6 data variables only
#
# TODO: copy Table 1 from Koo et al., 2023 paper

# %%
# Authenticate using NASA EarthData login
auth = earthaccess.login()
s3 = earthaccess.get_s3fs_session(daac="NSIDC") # Start an AWS S3 session
Comment by @weiji14 (Member, Author) on Aug 8, 2024:
Continuing from #17 (comment), this is the current error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 3
      1 # Authenticate using NASA EarthData login
      2 auth = earthaccess.login()
----> 3 s3 = earthaccess.get_s3fs_session(daac="NSIDC")  # Start an AWS S3 session

File ~/micromamba/envs/hackweek/lib/python3.11/site-packages/earthaccess/api.py:352, in get_s3fs_session(daac, provider, results)
    350         session = earthaccess.__store__.get_s3fs_session(endpoint=endpoint)
    351         return session
--> 352 session = earthaccess.__store__.get_s3fs_session(daac=daac, provider=provider)
    353 return session

AttributeError: 'NoneType' object has no attribute 'get_s3fs_session'

Not sure if there's a way to pass auth credentials to GitHub Actions and/or Netlify build so that this line works?

Reply from a Member:
Can definitely add secrets for EARTHDATA_USERNAME and EARTHDATA_PASSWORD, which would enable earthdata to login during actions. However GitHub actions run on azure, so s3 access isn't available. I think we'd have to build a self-hosted runner that deploys on AWS.
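
As an aside, a minimal sketch of how the notebook could then authenticate non-interactively (assuming EARTHDATA_USERNAME and EARTHDATA_PASSWORD are exposed to the job environment; earthaccess supports an environment-based login strategy):

    # Read NASA Earthdata credentials from environment variables, no prompt
    auth = earthaccess.login(strategy="environment")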

Reply from a Member:
Ideally we could execute these notebooks on a remotely booted CryoCloud user server. I think all the pieces exist to do this nowadays? @yuvipanda I saw this relevant issue NASA-IMPACT/veda-jupyterhub#46 as I was exploring your latest awesome 2i2c tech including https://github.com/yuvipanda/jupyter-sshd-proxy. Is there a straightforward way to start a user server via CI? Then you'd just have to fire up an ssh connection, execute a notebook, copy and commit the rendered version.

Reply from @weiji14 (Member, Author):
Another option is if CryoCloud had a BinderHub, we could follow Project Pythia and use execute_notebooks: binder in _config.yml (Example here, xref https://discourse.pangeo.io/t/statement-of-need-integrating-jupyterbook-and-jupyterhubs-via-ci/2705/14). The ssh option would be pretty cool to get working though!

Reply from a Member:
@weiji14 For now I recommend executing the notebook on CryoCloud, saving with outputs and adding an entry to not execute it in CI here:

exclude_patterns:
- "**/geospatial-advanced.ipynb"


# %%
# Set up spatiotemporal query for ATL07 sea ice product
granules = earthaccess.search_data(
    short_name="ATL07",
    cloud_hosted=True,
    bounding_box=(-180, -78, -140, -70),  # (West, South, East, North)
    temporal=("2018-09-15", "2019-03-31"),
    version="006",
)
granules[0]  # visualize first data granule

# %%
granule0 = granules[0:1] # get just 1 granule for now
file_obj = earthaccess.open(granules=granule0)[0]

# %%
# %%time
atl_file = h5py.File(name=file_obj, mode="r")
atl_file.keys()

# %% [markdown]
# ### Get strong beams only
#
# Ref: https://github.com/ICESAT-2HackWeek/strong-beams

# %%
# orientation - 0: backward, 1: forward, 2: transition
orient = atl_file["orbit_info"]["sc_orient"][:]
if orient == 0:
    strong_beams = ["gt1l", "gt2l", "gt3l"]  # strong beams are on the left
elif orient == 1:
    strong_beams = ["gt3r", "gt2r", "gt1r"]  # strong beams are on the right
strong_beams

# %%
for beam in strong_beams:
    print(beam)

# %% [markdown]
# Data variables to use:
# 1. `photon_rate`: photon rate
# 2. `hist_w`: width of the photon height distribution
# 3. `background_r_norm`: background photon rate
# 4. `height_segment_height`: relative surface height
# 5. `height_segment_n_pulse_seg`: number of laser pulses
# 6. `hist_mean_h` - `hist_median_h`: difference between mean and median height
#
# TODO link to data dictionary

# %%
# Build a GeoDataFrame with the data variables listed above plus point geometry.
# Note: `beam` here is the last strong beam from the loop above.
gdf = gpd.GeoDataFrame(
    data={
        "photon_rate": atl_file[f"{beam}/sea_ice_segments/stats/photon_rate"][:],
        "hist_w": atl_file[f"{beam}/sea_ice_segments/stats/hist_w"][:],
        "background_r_norm": atl_file[
            f"{beam}/sea_ice_segments/stats/background_r_norm"
        ][:],
        "height_segment_height": atl_file[
            f"{beam}/sea_ice_segments/heights/height_segment_height"
        ][:],
        "height_segment_n_pulse_seg": atl_file[
            f"{beam}/sea_ice_segments/heights/height_segment_n_pulse_seg"
        ][:],
        "hist_mean_h": atl_file[f"{beam}/sea_ice_segments/stats/hist_mean_h"][:],
        "hist_median_h": atl_file[f"{beam}/sea_ice_segments/stats/hist_median_h"][:],
    },
    geometry=gpd.points_from_xy(
        x=atl_file[f"{beam}/sea_ice_segments/longitude"][:],
        y=atl_file[f"{beam}/sea_ice_segments/latitude"][:],
    ),
    crs="OGC:CRS84",
)
print(f"Total number of rows: {len(gdf)}")

# %%
gdf
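
# %% [markdown]
# The sixth variable in the list above is a derived one. As a quick sketch (the
# variable name below is just an illustrative choice), it can be computed from
# the two stored columns:

# %%
# Derived feature: difference between the mean and median of the photon height
# distribution (a large difference hints at a skewed distribution)
hist_mean_median_h_diff = gdf.hist_mean_h - gdf.hist_median_h
hist_mean_median_h_diff.describe()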

# %% [markdown]
# ### Save to GeoParquet

# %% [markdown]
# Let's save the ATL07 photon data to a GeoParquet file so we don't have to run all the
# download and filtering code above again.

# %%
gdf.to_parquet(
    path="ATL07_photons.gpq", compression="zstd", schema_version="1.0.0-beta.1"
)

# %% [markdown]
# ```{admonition} To compress or not?
# :class: note
# When storing your data, note that there is a tradeoff in terms of compression and
# read speeds. Uncompressed data would typically be fastest to read (assuming no
# network transfer), but results in large file sizes. We'll choose Zstandard (zstd)
# as the compression method here, as it is typically faster to read than the default
# 'snappy' compression codec, while still compressing into a small file size.
# ```
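
# %% [markdown]
# As a quick (hypothetical) way to see this tradeoff yourself, you could write the
# same GeoDataFrame with a few different codecs and compare file sizes on disk (the
# file names below are just illustrative):

# %%
import os

for codec in [None, "snappy", "zstd"]:
    tmp_path = f"ATL07_photons_{codec}.gpq"  # illustrative temporary file name
    gdf.to_parquet(path=tmp_path, compression=codec)
    print(f"{str(codec):>6}: {os.path.getsize(tmp_path) / 1e6:.1f} MB")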

# %%
# Load GeoParquet file back into geopandas.GeoDataFrame
gdf = gpd.read_parquet(path="ATL07_photons.gpq")

# %%

# %% [markdown]
# ## Part 2: Choosing a Machine Learning algorithm

# %% [markdown]
# ### Moving data from CPU to GPU
#
# Machine learning models are compute intensive, and typically run on specialized
# hardware called Graphical Processing Units (GPUs) instead of ordinary CPUs.
# Depending on your input data format (images, tables, audio, etc.) and the machine
# learning library/framework you'll use (e.g. PyTorch, TensorFlow, RAPIDS AI cuML),
# there will be different ways to transfer data from disk storage -> CPU -> GPU.
#
# For this exercise, we'll be using [PyTorch](https://pytorch.org), and do the following
# data conversions:
#
# [`geopandas.GeoDataFrame`](https://geopandas.org/en/v1.0.0/docs/reference/api/geopandas.GeoDataFrame.html) ->
# [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.DataFrame.html) ->
# [`torch.Tensor`](https://pytorch.org/docs/2.4/tensors.html#torch.Tensor) ->
# [torch `Dataset`](https://pytorch.org/docs/2.4/data.html#torch.utils.data.Dataset) ->
# [torch `DataLoader`](https://pytorch.org/docs/2.4/data.html#torch.utils.data.DataLoader)

# %%
# Select data variables from the DataFrame that will be used for training
df = gdf[
    [
        "photon_rate",
        "hist_w",
        "background_r_norm",
        "height_segment_height",
        "height_segment_n_pulse_seg",
        "hist_mean_h",
        "hist_median_h",
    ]
]
tensor = torch.tensor(data=df.values)  # convert pandas.DataFrame to torch.Tensor
assert tensor.shape == torch.Size([221346, 7])  # (rows, columns)
dataset = torch.utils.data.TensorDataset(tensor)  # turn torch.Tensor into torch Dataset
dataloader = torch.utils.data.DataLoader(  # put torch Dataset in a DataLoader
    dataset=dataset,
    batch_size=128,  # mini-batch size
    shuffle=True,
)

# %% [markdown]
# PyTorch's [`DataLoader`](https://pytorch.org/docs/2.4/data.html#torch.utils.data.DataLoader)
# is a convenient container to hold tensor data, and makes it easy for us to iterate
# over mini-batches using a for-loop.

# %%
for batch in dataloader:
    minibatch: torch.Tensor = batch[0]
    assert minibatch.shape == (128, 7)
    assert minibatch.device == torch.device("cpu")  # Data is on CPU

    minibatch = minibatch.to(device="cuda")  # Move data to GPU
    assert minibatch.device == torch.device("cuda:0")  # Data is on GPU now!

    break
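
# %% [markdown]
# Note that the cell above assumes a CUDA-capable GPU is available. A minimal
# sketch (not in the original notebook) of a common fallback pattern that picks
# the GPU when present and stays on the CPU otherwise:

# %%
# Pick a compute device at runtime, so CPU-only machines can still run the code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
minibatch = minibatch.to(device=device)  # move the last mini-batch over
print(device)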

# %%

# %% [markdown]
# ## Part 3: Training the neural network
#
# We'll use a multi-layer perceptron with:
# - 2 hidden layers, 50 nodes each
# - tanh activation function
# - final layer with 3 nodes, for 3 surface types (open water, thin ice, thick/snow-covered ice)
# - Adam optimizer
#
# A hedged sketch of such a model follows below.

# %%

# %%
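
# %% [markdown]
# Below is a minimal sketch of the architecture described above. The layer names,
# learning rate, and the commented-out training loop are assumptions (this draft
# has no labels yet; later commits derive surface type labels from Sentinel-2
# imagery), not the final implementation.

# %%
class PhotonClassificationModel(torch.nn.Module):
    """Multi-layer perceptron: 2 hidden layers of 50 nodes each, tanh activation."""

    def __init__(self, in_features: int = 7, num_classes: int = 3):
        super().__init__()
        self.linear1 = torch.nn.Linear(in_features=in_features, out_features=50)
        self.linear2 = torch.nn.Linear(in_features=50, out_features=50)
        self.linear3 = torch.nn.Linear(in_features=50, out_features=num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(self.linear1(x))
        x = torch.tanh(self.linear2(x))
        return self.linear3(x)  # raw logits for the 3 surface types


model = PhotonClassificationModel()
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.001)

# Hypothetical training loop, assuming a DataLoader yielding (features, labels)
# pairs where labels are surface type indices (0, 1, or 2):
# for epoch in range(max_epochs):
#     for features, labels in labelled_dataloader:
#         logits = model(features.float())
#         loss = torch.nn.functional.cross_entropy(input=logits, target=labels)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()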

# %% [markdown]
# ## References
# - Koo, Y., Xie, H., Kurtz, N. T., Ackley, S. F., & Wang, W. (2023).
# Sea ice surface type classification of ICESat-2 ATL07 data by using data-driven
# machine learning model: Ross Sea, Antarctic as an example. Remote Sensing of
# Environment, 296, 113726. https://doi.org/10.1016/j.rse.2023.113726


# %%