Update docs for latest configuration options / syntax #590

Closed
wants to merge 2 commits into from
162 changes: 79 additions & 83 deletions docs/docs/icechunk-python/configuration.md
@@ -1,13 +1,13 @@
# Configuration

When creating and opening Icechunk stores, there are a two different sets of configuration to be aware of:
When creating and opening Icechunk repositories, there are two different sets of configuration to be aware of:

- [`StorageConfig`](./reference.md#icechunk.StorageConfig) - for configuring access to the object store or filesystem
- [`Storage`](./reference.md#icechunk.Storage) - for configuring access to the object store or filesystem
- [`RepositoryConfig`](./reference.md#icechunk.RepositoryConfig) - for configuring the behavior of the Icechunk Repository itself

## Storage Config
## Storage

Icechunk can be confirgured to work with both object storage and filesystem backends. The storage configuration defines the location of an Icechunk store, along with any options or information needed to access data from a given storage type.
Icechunk can be configured to work with both object storage and filesystem backends. The storage configuration defines the location of an Icechunk store, along with any options or information needed to access data from a given storage type.

### S3 Storage

@@ -16,62 +16,82 @@ When using Icechunk with s3 compatible storage systems, credentials must be prov
=== "From environment"

With this option, the credentials for connecting to S3 are detected automatically from your environment.
This is usually the best choice if you are connecting from within an AWS environment (e.g. from EC2). [See the API](./reference.md#icechunk.StorageConfig.s3_from_env)
This is usually the best choice if you are connecting from within an AWS environment (e.g. from EC2). [See the API](./reference.md#icechunk.Storage.new_s3)

```python
icechunk.StorageConfig.s3_from_env(
bucket="icechunk-test",
prefix="quickstart-demo-1"
icechunk.Storage.new_s3(
bucket="my-demo",
prefix="my-prefix",
config=icechunk.S3Options(),
)
```

=== "Provide credentials"

With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.StorageConfig.s3_from_config)
With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.Storage.new_s3)

```python
icechunk.StorageConfig.s3_from_config(
bucket="icechunk-test",
prefix="quickstart-demo-1",
region='us-east-1',
credentials=S3Credentials(
access_key_id='my-access-key',
secret_access_key='my-secret-key',
# session token is optional
session_token='my-token',
credentials = icechunk.s3_credentials(
access_key_id='my-access-key',
secret_access_key='my-secret-key',
# session token is optional
session_token='my-token',
)
icechunk.Storage.new_s3(
bucket="my-demo",
prefix="my-prefix",
config=icechunk.S3Options(
region='us-east-1',
endpoint_url=None,
allow_http=False,
),
endpoint_url=None,
allow_http=False,
credentials=credentials
)
```

=== "Anonymous"

With this option, you connect to S3 anonymously (without credentials).
This is suitable for public data. [See the API](./reference.md#icechunk.StorageConfig.s3_anonymous)
This is suitable for public data. [See the API](./reference.md#icechunk.Storage.s3_anonymous)

```python
icechunk.StorageConfig.s3_anonymous(
bucket="icechunk-test",
prefix="quickstart-demo-1",
region='us-east-1,
credentials = icechunk.s3_credentials(
anonymous=True,
)
icechunk.Storage.new_s3(
bucket="my-demo",
prefix="my-prefix",
config=icechunk.S3Options(
region='us-east-1',
endpoint_url=None,
allow_http=False,
),
credentials=credentials
)
```

### Filesystem Storage

Icechunk can also be used on a [local filesystem](./reference.md#icechunk.StorageConfig.filesystem) by providing a path to the location of the store
Icechunk can also be used on a [local filesystem](./reference.md#icechunk.Storage.new_local_filesystem) by providing a path to the location of the store

=== "Local filesystem"

```python
icechunk.StorageConfig.filesystem("/path/to/my/dataset")
config = RepositoryConfig.default()
repo = Repository.create(
storage=Storage.new_local_filesystem("./icechunk-repo"),
config=config,
)
```

## Repository Config

Separate from the storage config, the Repository can also be configured with options which control its runtime behavior.

```python
config = icechunk.RepositoryConfig.default()
```

### Writing chunks inline

Chunks can be written inline alongside the store metadata if the size of a given chunk falls within the configured threshold.
@@ -81,24 +101,21 @@ This is the default behavior for chunks smaller than 512 bytes, but it can be ov
=== "Never write chunks inline"

```python
RepositoryConfig(
inline_chunk_threshold_bytes=0,
...
)
repo_config.inline_chunk_threshold_bytes = 0
```

=== "Write bigger chunks inline"

```python
RepositoryConfig(
inline_chunk_threshold_bytes=1024,
...
)
repo_config.inline_chunk_threshold_bytes = 1024

```
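
As a rough illustration of the threshold rule described above, the toy function below decides where a chunk would be placed. It is a sketch for intuition only and is not Icechunk's implementation: the `placement` helper, its return labels, and the strict less-than comparison at the boundary are assumptions made here. Only the 512-byte default and the `0` and `1024` settings come from the docs in this diff.

```python
# Toy model of the inline-chunk decision. NOT Icechunk's implementation.
# Assumption: a chunk is inlined when its size is strictly smaller than
# the threshold ("chunks smaller than 512 bytes" in the docs text).

DEFAULT_INLINE_CHUNK_THRESHOLD_BYTES = 512  # default quoted in the docs


def placement(chunk: bytes,
              threshold: int = DEFAULT_INLINE_CHUNK_THRESHOLD_BYTES) -> str:
    """Return "inline" if the chunk would sit next to the metadata,
    otherwise "separate" (hypothetical labels for this sketch)."""
    return "inline" if len(chunk) < threshold else "separate"


small = b"x" * 100
large = b"x" * 1000

print(placement(small))                  # default threshold
print(placement(large))                  # default threshold
print(placement(large, threshold=1024))  # "write bigger chunks inline"
print(placement(small, threshold=0))     # "never write chunks inline"
```

With a threshold of `0` no chunk length can be smaller than the threshold, which is why that setting disables inlining in this model.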

### Virtual Reference Storage Config

Icechunk allows for reading "Virtual" data from [existing archival datasets](./virtual.md). This requires creating a distinct `VirtualRefConfig` (similar to `StorageConfig`) giving Icechunk the necessary permissions to access the archival data. This can be configured using the `virtual_ref_config` option:
Icechunk allows for reading "Virtual" data from [existing archival datasets](./virtual.md). This requires creating a distinct `VirtualRefConfig` (similar to `Storage`) giving Icechunk the necessary permissions to access the archival data. This can be configured using the `virtual_ref_config` option:

This section needs work!!!

=== "S3 from environment"

@@ -146,27 +163,22 @@ Now we can now create or open an Icechunk store using our config.
=== "Creating with S3 storage"

```python
storage = icechunk.StorageConfig.s3_from_env(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
region='us-east-1',
)

repo = icechunk.Repository.create(
storage=storage,
storage=icechunk.Storage.new_s3(
bucket="my-bucket",
prefix="my-prefix",
config=icechunk.S3Options(
region="us-east-1"
),
)
)
```

=== "Creating with local filesystem"

```python
storage = icechunk.StorageConfig.filesystem("/path/to/my/dataset")
config = icechunk.RepositoryConfig(
inline_chunk_threshold_bytes=1024,
)

repo = icechunk.Repository.create(
storage=storage,
storage=icechunk.Storage.new_local_filesystem("./icechunk-local")
)
```

@@ -175,27 +187,22 @@ If you are not sure if the repo exists yet, an `icechunk Repository` can created
=== "Open or creating with S3 storage"

```python
storage = icechunk.StorageConfig.s3_from_env(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
region='us-east-1',
)

repo = icechunk.Repository.open_or_create(
storage=storage,
storage=icechunk.Storage.new_s3(
bucket="my-bucket",
prefix="my-prefix",
config=icechunk.S3Options(
region="us-east-1"
),
)
)
```

=== "Open or creating with local filesystem"

```python
storage = icechunk.StorageConfig.filesystem("/path/to/my/dataset")
config = icechunk.RepositoryConfig(
inline_chunk_threshold_bytes=1024,
)

repo = icechunk.Repository.open_or_create(
storage=storage,
storage=icechunk.Storage.new_local_filesystem("./icechunk-local")
)
```

@@ -204,32 +211,21 @@ If you are not sure if the repo exists yet, an `icechunk Repository` can created
=== "Opening from S3 Storage"

```python
storage = icechunk.StorageConfig.s3_anonymous(
bucket='earthmover-sample-data',
prefix='icechunk/oisst.2020-2024/',
region='us-east-1',
)

config = icechunk.RepositoryConfig(
virtual_ref_config=icechunk.VirtualRefConfig.s3_anonymous(region='us-east-1'),
)

repo = icechunk.Repository.open_existing(
storage=storage,
config=config,
repo = icechunk.Repository.open(
storage=icechunk.Storage.new_s3(
bucket="my-bucket",
prefix="my-prefix",
config=icechunk.S3Options(
region="us-east-1"
),
)
)
```

=== "Opening from local filesystem"

```python
storage = icechunk.StorageConfig.filesystem("/path/to/my/dataset")
config = icechunk.RepositoryConfig(
inline_chunk_threshold_bytes=1024,
)

store = icechunk.IcechunkStore.open_existing(
storage=storage,
config=config,
repo = icechunk.Repository.open(
storage=icechunk.Storage.new_local_filesystem("./icechunk-local")
)
```
45 changes: 25 additions & 20 deletions docs/docs/icechunk-python/quickstart.md
@@ -15,8 +15,7 @@ pip install icechunk
!!! note

Icechunk is currently designed to support the [Zarr V3 Specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html).
Using it today requires installing the latest pre-release of Zarr Python 3.

Using it today requires installing `zarr>=3`.

## Create a new Icechunk repository

@@ -27,18 +26,24 @@ However, you can also create a repo on your local filesystem.
=== "S3 Storage"

```python
storage_config = icechunk.StorageConfig.s3_from_env(
bucket="icechunk-test",
prefix="quickstart-demo-1"
from icechunk import Repository, S3Options, Storage

repo = Repository.create(
storage=Storage.new_s3(
bucket="icechunk-test",
prefix="quickstart-demo-1",
config=S3Options(),
)
)
repo = icechunk.Repository.create(storage_config)

```

=== "Local Storage"

```python
storage_config = icechunk.StorageConfig.filesystem("./icechunk-local")
repo = icechunk.Repository.create(storage_config)
repo = Repository.create(
storage=Storage.new_local_filesystem("./icechunk-local")
)
```

## Accessing the Icechunk store
@@ -53,7 +58,7 @@ session = repo.writable_session("main")
Now that we have a session, we can access the `IcechunkStore` from it to interact with the underlying data using `zarr`:

```python
store = session.store()
store = session.store # A Zarr Store
```

## Write some data and commit
@@ -62,8 +67,8 @@ We can now use our Icechunk `store` with Zarr.
Let's first create a group and an array within it.

```python
group = zarr.group(store)
array = group.create("my_array", shape=10, dtype=int)
group = zarr.create_group(store)
array = group.create_array(name="my_array", shape=10, dtype='int32', chunks=(5,))
```

Now let's write some data
@@ -91,7 +96,7 @@ At this point, we have already committed using our session, so we need to get a

```python
session_2 = repo.writable_session("main")
store_2 = session_2.store()
store_2 = session_2.store
group = zarr.open_group(store_2)
array = group["my_array"]
```
@@ -113,14 +118,14 @@ snapshot_id_2 = session_2.commit("overwrite some values")
We can see the full version history of our repo:

```python
hist = repo.ancestry(snapshot_id_2)
hist = repo.ancestry(snapshot=snapshot_id_2)
for anc in hist:
print(anc.id, anc.message, anc.written_at)

# Output:
# AHC3TSP5ERXKTM4FCB5G overwrite some values 2024-10-14 14:07:27.328429+00:00
# Q492CAPV7SF3T1BC0AA0 first commit 2024-10-14 14:07:26.152193+00:00
# T7SMDT9C5DZ8MP83DNM0 Repository initialized 2024-10-14 14:07:22.338529+00:00
# W0EJE9HP0SQJV5540GPG overwrite some values 2025-01-17 17:58:49.393455+00:00
# 9CS51QMKZT23NXEJZMA0 first commit 2025-01-17 17:57:41.291348+00:00
# SDCT2B6F7TTX3N7QXYT0 Repository initialized 2025-01-17 17:54:27.423602+00:00
```
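
To make the ordering of that history easier to picture, here is a small stand-alone model of walking a chain of snapshots from the newest back to the initial one. It is only a sketch under stated assumptions: the `Snapshot` dataclass, the `parent_id` field, the `walk_ancestry` helper, and the short ids are all hypothetical and are not part of Icechunk's API. The commit messages are taken from the example output above.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one entry in a repository's history."""
    id: str
    message: str
    parent_id: Optional[str]  # None marks the initial snapshot


def walk_ancestry(snapshots: Dict[str, Snapshot],
                  start_id: str) -> Iterator[Snapshot]:
    """Yield the snapshot `start_id`, then each parent, newest first."""
    current: Optional[str] = start_id
    while current is not None:
        snap = snapshots[current]
        yield snap
        current = snap.parent_id


# Hypothetical ids; messages mirror the quickstart output.
snapshots = {
    "s0": Snapshot("s0", "Repository initialized", None),
    "s1": Snapshot("s1", "first commit", "s0"),
    "s2": Snapshot("s2", "overwrite some values", "s1"),
}

history = list(walk_ancestry(snapshots, "s2"))
for snap in history:
    print(snap.id, snap.message)
```

In this model `history[0]` is the latest commit and `history[1]` is the one before it, which matches how the quickstart picks an earlier version by index.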

...and we can go back in time to the earlier version.
@@ -129,12 +134,12 @@ for anc in hist:
# latest version
assert array[0] == 2
# check out earlier snapshot
earlier_session = repo.readonly_session(snapshot_id=hist[1].id)
store = earlier_session.store()
earlier_session = repo.readonly_session(snapshot=hist[1].id)
store = earlier_session.store

# get the array
group = zarr.open_group(store)
array = group["my_array]
group = zarr.open_group(store, mode="r")
array = group["my_array"]

# verify data matches first version
assert array[0] == 1