
Commit 83f9be8

Add mursst lithops example (#475)
* Add files
* Refactoring some things
* Fix some refactoring things
* Update PR number
* Use configure_zarr in helpers
* Remove some env specific configuration in lithops.yaml
1 parent fd3a916 commit 83f9be8

22 files changed: +1286 −0 lines

docs/examples.md

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ The following examples demonstrate the use of VirtualiZarr to create virtual dat

1. [Appending new daily NOAA SST data to Icechunk](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/append/noaa-cdr-sst.ipynb)
2. [Parallel reference generation using Coiled Functions](https://github.com/zarr-developers/VirtualiZarr/blob/main/examples/coiled/terraclimate.ipynb)
3. [Serverless parallel reference generation using Lithops](https://github.com/zarr-developers/VirtualiZarr/tree/main/examples/virtualizarr-with-lithops)
+ 4. [MUR SST Virtual and Zarr Icechunk Store Generation using Lithops](https://github.com/zarr-developers/VirtualiZarr/tree/main/examples/mursst-icechunk-with-lithops)

docs/releases.rst

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,9 @@ Bug fixes

Documentation
~~~~~~~~~~~~~

+ - Added MUR SST virtual and zarr icechunk store generation using lithops example.
+   (:pull:`475`) by `Aimee Barciauskas <https://github.com/abarciauskas-bgse>`_.

Internal Changes
~~~~~~~~~~~~~~~~

Lines changed: 55 additions & 0 deletions

```dockerfile
# Use AWS Lambda base image for Python 3.11
FROM public.ecr.aws/lambda/python:3.11

ARG FUNCTION_DIR

# Set working directory
WORKDIR /var/task

# Update system libraries and install necessary utilities
RUN yum update -y && \
    yum install -y wget unzip tar gzip git && \
    yum clean all

# Install uv package manager and move it to /usr/local/bin
RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
    mv ~/.local/bin/uv /usr/local/bin/uv && \
    chmod +x /usr/local/bin/uv

# Verify uv installation
RUN uv --version

RUN uv pip install --upgrade pip wheel six setuptools --system \
    && uv pip install --upgrade --no-cache-dir --system \
        awslambdaric \
        boto3 \
        redis \
        httplib2 \
        requests \
        numpy \
        scipy \
        pandas \
        pika \
        kafka-python \
        cloudpickle \
        ps-mem \
        tblib \
        psutil

# Set environment variables for Lambda
ENV PYTHONPATH="/var/lang/lib/python3.11/site-packages:${FUNCTION_DIR}"

# Copy and install dependencies from requirements.txt using uv
COPY requirements.txt /tmp/requirements.txt
RUN uv pip install --no-cache-dir -r /tmp/requirements.txt --system

# Copy application code
COPY lithops_lambda.zip ${FUNCTION_DIR}
RUN unzip lithops_lambda.zip \
    && rm lithops_lambda.zip \
    && mkdir handler \
    && touch handler/__init__.py \
    && mv entry_point.py handler/

# Set Lambda entry point
CMD [ "handler.entry_point.lambda_handler" ]
```
Lines changed: 121 additions & 0 deletions

# Lithops Package for MUR SST Data Processing

This package provides functionality for processing MUR SST (Multi-scale Ultra-high Resolution Sea Surface Temperature) data using [Lithops](https://lithops-cloud.github.io/), a framework for serverless computing.

## Environment + Lithops Setup

1. Set up a Python environment. The example below uses [`uv`](https://docs.astral.sh/uv/), but other environment managers should work as well:

   ```sh
   uv venv virtualizarr-lithops --python 3.11
   source virtualizarr-lithops/bin/activate
   uv pip install -r requirements.txt
   ```
14+
15+
2. Follow the [AWS Lambda Configuration](https://lithops-cloud.github.io/docs/source/compute_config/aws_lambda.html#configuration) instructions, unless you already have an appropriate AWS IAM role to use.
16+
17+
3. Follow the [AWS Credential setup](https://lithops-cloud.github.io/docs/source/compute_config/aws_lambda.html#aws-credential-setup) instructions.
18+
19+
4. Check and modify as necessary compute and storage backends for [lithops](https://lithops-cloud.github.io/docs/source/configuration.html) in `lithops.yaml`.
20+
21+
22+
5. Build the lithops lambda runtime if it does not exist in your target AWS environemnt.
23+
```bash
24+
export LITHOPS_CONFIG_FILE=$(pwd)/lithops.yaml
25+
lithops runtime build -b aws_lambda -f Dockerfile vz-runtime
26+
```
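
For reference, a minimal `lithops.yaml` pairing the AWS Lambda compute backend with S3 storage might look like the sketch below. The region, role ARN, and bucket name are placeholders, not values from this example; consult the lithops configuration docs for the full set of keys.

```yaml
lithops:
    backend: aws_lambda
    storage: aws_s3

aws:
    region: us-west-2

aws_lambda:
    # IAM role the Lambda functions execute under (placeholder ARN)
    execution_role: arn:aws:iam::<ACCOUNT_ID>:role/<LAMBDA_ROLE>
    # Name given to `lithops runtime build` above
    runtime: vz-runtime
    runtime_memory: 2048

aws_s3:
    # Bucket lithops uses for intermediate storage (placeholder)
    storage_bucket: <YOUR_BUCKET>
```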

For various reasons, you may want to build the lambda runtime on EC2 (for example, docker can be a resource hog, and pushing to ECR is faster from EC2). If you wish to use EC2, please see the scripts in `ec2_for_lithops_runtime/` in this directory.

> [!IMPORTANT]
> If the runtime was created with a different IAM identity, an appropriate `user_id` will need to be included in the lithops configuration under `aws_lambda`.

> [!TIP]
> You can configure the AWS Lambda architecture via the `architecture` key under `aws_lambda` in the lithops configuration file.

6. (Optional) To rebuild the Lithops Lambda runtime image, delete the existing one first:

   ```bash
   lithops runtime delete -b aws_lambda -d vz-runtime
   ```
## Package Structure

The package is organized into the following modules:

- `__init__.py`: Package initialization and exports
- `config.py`: Configuration settings and constants
- `models.py`: Data models and structures
- `url_utils.py`: URL generation and file listing
- `repo.py`: Icechunk repository management
- `virtual_datasets.py`: Virtual dataset operations
- `zarr_operations.py`: Zarr array operations
- `helpers.py`: Data helpers
- `lithops_functions.py`: Lithops execution wrappers
- `cli.py`: Command-line interface

## Usage

### Command-line Interface

The package provides a command-line interface for running various functions:

```bash
python main.py <function> [options]
```
68+
Available functions:
69+
70+
- `write_to_icechunk`: Write data to Icechunk
71+
- `check_data_store_access`: Check access to the data store
72+
- `calc_icechunk_store_mean`: Calculate the mean of the Icechunk store
73+
- `calc_original_files_mean`: Calculate the mean of the original files
74+
- `list_installed_packages`: List installed packages
75+
76+
Options:
77+
78+
- `--start_date`: Start date for data processing (YYYY-MM-DD)
79+
- `--end_date`: End date for data processing (YYYY-MM-DD)
80+
- `--append_dim`: Append dimension for writing to Icechunk
81+
82+
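
Both date options expect ISO `YYYY-MM-DD` strings. A minimal sketch of validating such a value before dispatch (a hypothetical helper, not part of this package):

```python
from datetime import date


def parse_iso_date(value: str) -> date:
    """Validate a --start_date/--end_date value of the form YYYY-MM-DD."""
    try:
        return date.fromisoformat(value)
    except ValueError as err:
        # Surface a CLI-friendly error instead of a traceback.
        raise SystemExit(f"invalid date {value!r}: expected YYYY-MM-DD") from err


print(parse_iso_date("2022-01-01"))  # 2022-01-01
```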
### Examples

#### Writing Data to Icechunk

```bash
python main.py write_to_icechunk --start_date 2022-01-01 --end_date 2022-01-02
```

#### Calculating the Mean of the Icechunk Store

```bash
python main.py calc_icechunk_store_mean --start_date 2022-01-01 --end_date 2022-01-31
```

#### Checking Data Store Access

```bash
python main.py check_data_store_access
```
102+
## Programmatic Usage
103+
104+
You can also use the package programmatically:
105+
106+
```python
107+
from lithops_functions import write_to_icechunk
108+
109+
# Write data to Icechunk
110+
write_to_icechunk(start_date="2022-01-01", end_date="2022-01-31")
111+
```
112+
113+
## Testing
114+
115+
To test the package, you can use the provided test functions:
116+
117+
```bash
118+
python main.py check_data_store_access
119+
```
120+
121+
This will verify that the package can access the data store.
Lines changed: 26 additions & 0 deletions

```python
"""
Lithops package for MUR SST data processing.

This package provides functionality for processing MUR SST data using Lithops,
a framework for serverless computing.
"""

from . import (
    config,
    data_processing,
    lithops_functions,
    models,
    repo,
    url_utils,
    virtual_datasets,
)

__all__ = [
    "config",
    "data_processing",
    "lithops_functions",
    "models",
    "repo",
    "url_utils",
    "virtual_datasets",
]
```
Lines changed: 79 additions & 0 deletions

```python
"""
Command-line interface.

This module provides a command-line interface for the package.
"""

import argparse

from lithops_functions import (
    lithops_calc_icechunk_store_mean,
    lithops_calc_original_files_mean,
    lithops_check_data_store_access,
    lithops_list_installed_packages,
    write_to_icechunk,
)


def parse_args():
    """
    Parse command-line arguments.

    Returns:
        The parsed arguments
    """
    parser = argparse.ArgumentParser(description="Run lithops functions.")
    parser.add_argument(
        "function",
        choices=[
            "write_to_icechunk",
            "check_data_store_access",
            "calc_icechunk_store_mean",
            "calc_original_files_mean",
            "list_installed_packages",
        ],
        help="The function to run.",
    )
    parser.add_argument(
        "--start_date",
        type=str,
        help="Start date for data processing (YYYY-MM-DD).",
    )
    parser.add_argument(
        "--end_date",
        type=str,
        help="End date for data processing (YYYY-MM-DD).",
    )
    parser.add_argument(
        "--append_dim",
        type=str,
        help="Append dimension for writing to icechunk.",
    )
    return parser.parse_args()


def main():
    """
    Main entry point for the command-line interface.
    """
    args = parse_args()
    start_date = args.start_date
    end_date = args.end_date
    append_dim = args.append_dim

    if args.function == "write_to_icechunk":
        write_to_icechunk(
            start_date=start_date, end_date=end_date, append_dim=append_dim
        )
    elif args.function == "check_data_store_access":
        lithops_check_data_store_access()
    elif args.function == "calc_icechunk_store_mean":
        lithops_calc_icechunk_store_mean(start_date=start_date, end_date=end_date)
    elif args.function == "calc_original_files_mean":
        lithops_calc_original_files_mean(start_date=start_date, end_date=end_date)
    elif args.function == "list_installed_packages":
        lithops_list_installed_packages()


if __name__ == "__main__":
    main()
```
Lines changed: 51 additions & 0 deletions

```python
"""
Configuration settings for MUR SST data processing.

This module contains all the configuration settings and constants used
throughout the package.
"""

import fsspec

# S3 filesystem for reading data
fs_read = fsspec.filesystem("s3", anon=False, skip_instance_cache=True)

# Data source configuration
base_url = "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1"
data_vars = ["analysed_sst", "analysis_error", "mask", "sea_ice_fraction"]
drop_vars = ["dt_1km_data", "sst_anomaly"]

# Storage configuration
bucket = "nasa-eodc-scratch"
store_name = "MUR-JPL-L4-GLOB-v4.1-virtual-v1"
directory = "test"

# Spatial subset configuration
lat_slice = slice(48.5, 48.7)
lon_slice = slice(-124.7, -124.5)

# Date range processing dictionary
date_process_dict = {
    ("2002-06-30", "2003-09-10"): "virtual_dataset",
    ("2003-09-11", "2003-09-11"): "zarr",
    ("2003-09-12", "2021-02-19"): "virtual_dataset",
    ("2021-02-20", "2021-02-21"): "zarr",
    ("2021-02-22", "2021-12-23"): "virtual_dataset",
    ("2021-12-24", "2022-01-26"): "zarr",
    ("2022-01-27", "2022-11-08"): "virtual_dataset",
    ("2022-11-09", "2022-11-09"): "zarr",
    ("2022-11-10", "2023-02-23"): "virtual_dataset",
    ("2023-02-24", "2023-02-28"): "zarr",
    ("2023-03-01", "2023-04-21"): "virtual_dataset",
    ("2023-04-22", "2023-04-22"): "zarr",
    ("2023-04-23", "2023-09-03"): "virtual_dataset",
}

zarr_concurrency = 4

mursst_var_chunks = {
    "analysed_sst": {"time": 1, "lat": 1023, "lon": 2047},
    "analysis_error": {"time": 1, "lat": 1023, "lon": 2047},
    "mask": {"time": 1, "lat": 1447, "lon": 2895},
    "sea_ice_fraction": {"time": 1, "lat": 1447, "lon": 2895},
}
```
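
The `date_process_dict` maps inclusive date ranges to a processing mode (`"virtual_dataset"` or `"zarr"`). A small sketch of resolving a single date against a mapping of that shape (a hypothetical helper, not part of the package; only the first two ranges are reproduced here):

```python
from datetime import date

# Inclusive (start, end) date ranges mapped to a processing mode,
# mirroring the shape of date_process_dict in config.py.
date_process_dict = {
    ("2002-06-30", "2003-09-10"): "virtual_dataset",
    ("2003-09-11", "2003-09-11"): "zarr",
}


def processing_mode(day: str) -> str:
    """Return the processing mode for an ISO date string like '2003-09-11'."""
    d = date.fromisoformat(day)
    for (start, end), mode in date_process_dict.items():
        if date.fromisoformat(start) <= d <= date.fromisoformat(end):
            return mode
    raise ValueError(f"{day} falls outside all configured ranges")


print(processing_mode("2003-09-11"))  # zarr
```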
Lines changed: 3 additions & 0 deletions

```sh
export SECURITY_GROUP_NAME=XXX
export VPC_ID=XXX
aws ec2 create-security-group --group-name $SECURITY_GROUP_NAME --description "security group for lithops runtime builder ec2" --vpc-id $VPC_ID
```
Lines changed: 14 additions & 0 deletions

```sh
# look up the group id created
export SECURITY_GROUP_ID=XXX
export YOUR_IP=$(curl -s https://checkip.amazonaws.com)
export AMI_ID=ami-027951e78de46a00e
export SSH_KEY_NAME=XXX
aws ec2 authorize-security-group-ingress --group-id $SECURITY_GROUP_ID --ip-permissions '{"IpProtocol":"tcp","FromPort":22,"ToPort":22,"IpRanges":[{"CidrIp":"'$YOUR_IP'/32"}]}'
aws ec2 run-instances --image-id $AMI_ID \
  --instance-type "t3.medium" --key-name $SSH_KEY_NAME \
  --block-device-mappings '{"DeviceName":"/dev/xvda","Ebs":{"Encrypted":false,"DeleteOnTermination":true,"Iops":3000,"SnapshotId":"snap-01783d80c688baa0f","VolumeSize":30,"VolumeType":"gp3","Throughput":125}}' \
  --network-interfaces '{"AssociatePublicIpAddress":true,"DeviceIndex":0,"Groups":["'$SECURITY_GROUP_ID'"]}' \
  --credit-specification '{"CpuCredits":"unlimited"}' \
  --metadata-options '{"HttpEndpoint":"enabled","HttpPutResponseHopLimit":2,"HttpTokens":"required"}' \
  --private-dns-name-options '{"HostnameType":"ip-name","EnableResourceNameDnsARecord":true,"EnableResourceNameDnsAAAARecord":false}' \
  --count "1"
```
