Enhancing and Standardizing ABFS docs #603

Merged
merged 12 commits into from
Nov 15, 2024
146 changes: 82 additions & 64 deletions spiceaidocs/docs/components/data-connectors/abfs.md
@@ -4,12 +4,9 @@ sidebar_label: 'Azure BlobFS Data Connector'
description: 'Azure BlobFS Data Connector Documentation'
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.

The Azure BlobFS (ABFS) Data Connector enables federated SQL query on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.

If a folder path is provided, all child files will be loaded.
When a folder path is provided, all the contained files will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

@@ -18,36 +15,43 @@ datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      azure_account: spiceadls
      azure_access_key: abc123==
      abfs_account: spiceadls
      abfs_access_key: abc123==
      file_format: csv
```

## Dataset Schema Reference
## Configuration

### `from`

The ABFS-compatible URI to a folder or object in one of two forms:
Defines the ABFS-compatible URI to a folder or object:

- `from: abfs://<container>/<path>` with the account name configured using the `abfs_account` parameter, or
- `from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>`
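
As a minimal sketch of the second form, the account name can be carried in the URI itself so that `abfs_account` is left out. The container, account, and file names reuse the illustrative values from the example above and are placeholders, not real resources.

```yaml
datasets:
  # The account name (`spiceadls`) is embedded in the URI, so `abfs_account` is not set.
  - from: abfs://foocontainer@spiceadls.dfs.core.windows.net/taxi_sample.csv
    name: azure_test
    params:
      # Placeholder credential, as in the earlier example.
      abfs_access_key: abc123==
      file_format: csv
```

Both forms should resolve to the same object; which one to use is mostly a question of whether the account name is better kept in the URI or in `params`.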

:::note

A valid URI must always be specified in the `from` field, even if you are setting the account or container name using [secrets](/components/secret-stores/index.md). When using secrets use a dummy account/container name and the values will be replaced with the values contained by the secrets at runtime.
A valid URI must always be specified in the `from` field, even if you are setting the account or container name using [secrets](/components/secret-stores/index.md). When using secrets, a dummy account/container name must be used. The values will be replaced at runtime with the values contained in the secrets.

See the example [below](#using-secrets-for-container-and-account-name).
See the example [below](#using-secrets-for-account-and-container).

:::

### `name`

The dataset name. This will be used as the table name within Spice.
Defines the dataset name, which is used as the table name within Spice.

Example: `name: cool_dataset`
Example:
```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: cool_dataset
    params:
      ...
```

```sql
SELECT COUNT(*) FROM cool_dataset
SELECT COUNT(*) FROM cool_dataset;
```

```shell
@@ -62,54 +66,62 @@ SELECT COUNT(*) FROM cool_dataset

#### Basic parameters

| Parameter name | Description |
| --------------------------- | --------------------------------------------------------------------------------------- |
| `abfs_account` | Azure storage account name |
| `abfs_container_name` | Azure storage container name |
| `abfs_sas_string` | SAS Token to use for authorization |
| `abfs_endpoint` | Storage endpoint to connect to. Defaults to `https://{account}.blob.core.windows.net` |
| `abfs_use_emulator` | Connect to a locally-running Azure Storage emulator. Valid values are `true` or `false` |
| `abfs_allow_http` | Allow insecure HTTP connections |
| `abfs_authority_host` | Use an alternative authority host. Defaults to `https://login.microsoftonline.com` |
| `abfs_proxy_url` | Proxy URL to use when connecting |
| `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy |
| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Ignore any tags provided to `put_opts` |
| `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
| Parameter name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------- |
| `file_format` | Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. |
| `abfs_account` | Azure storage account name |
| `abfs_container_name` | Azure storage container name |
| `abfs_sas_string` | SAS (Shared Access Signature) Token to use for authorization |
| `abfs_endpoint` | Storage endpoint, default: `https://{account}.blob.core.windows.net` |
| `abfs_use_emulator`         | Connect to a locally running Azure Storage emulator. Valid values: `true`, `false`            |
| `abfs_allow_http` | Allow insecure HTTP connections |
| `abfs_authority_host` | Alternative authority host, default: `https://login.microsoftonline.com` |
| `abfs_proxy_url` | Proxy URL |
| `abfs_proxy_ca_certificate` | CA certificate for the proxy |
| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Ignore tags in `put_opts` |
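
A minimal sketch that combines a few of the basic parameters above. It assumes a SAS token is stored in a secret named `MY_SAS_TOKEN` and that the container holds a Parquet file named `taxi_sample.parquet`; both names are hypothetical and used only for illustration.

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.parquet
    name: azure_sas_test
    params:
      abfs_account: spiceadls
      # Hypothetical secret name; the SAS token is resolved at runtime.
      abfs_sas_string: ${ secrets:MY_SAS_TOKEN }
      # Stated explicitly here, although it may be inferable from the file extension.
      file_format: parquet
```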


#### Authentication parameters

The following parameters are used when authenticating with Azure. Only one of `abfs_access_key`, `abfs_bearer_token`, `abfs_client_secret` or `abfs_skip_signature` can be set at the same time. If none of these are set the connector will default to using a [managed identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview)
The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:

| Parameter name | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------ |
| `abfs_access_key` | Secret access key to use when authenticating |
| `abfs_bearer_token` | `BEARER` token to use when authenticating |
| `abfs_client_id` | Client ID to use with the client authentication flow |
| `abfs_client_secret` | Client Secret to use with the client authentication flow |
| `abfs_tenant_id` | Tenant ID to use with client authentication flow |
| `abfs_skip_signature` | Skip fetching credentials and skip signing requests. Used for interacting with public containers |
| `abfs_msi_endpoint`         | The endpoint to use for acquiring managed identity tokens                                        |
| `abfs_federated_token_file` | File path for acquiring Azure federated identity token in Kubernetes |
| `abfs_use_cli` | Set to `true` to use the Azure CLI to acquire access tokens |
* `abfs_access_key`
* `abfs_bearer_token`
* `abfs_client_secret`
* `abfs_skip_signature`

If none of these are set, the connector defaults to using a [managed identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview).

| Parameter name | Description |
| --------------------------- | ----------------------------------------------------------- |
| `abfs_access_key` | Secret access key |
| `abfs_bearer_token` | `BEARER` token |
| `abfs_client_id` | Client ID for client authentication flow |
| `abfs_client_secret` | Client Secret to use for client authentication flow |
| `abfs_tenant_id` | Tenant ID to use for client authentication flow |
| `abfs_skip_signature` | Skip credentials and request signing for public containers |
| `abfs_msi_endpoint` | Endpoint for managed identity tokens |
| `abfs_federated_token_file` | File path for federated identity token in Kubernetes |
| `abfs_use_cli` | Set to `true` to use the Azure CLI to acquire access tokens |
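
As a minimal sketch of one of the authentication options in the table, the following dataset relies on the Azure CLI for access tokens. It assumes the Azure CLI is installed on the machine running Spice and that `az login` has already been completed; the dataset name is a placeholder.

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_cli_test
    params:
      abfs_account: spiceadls
      # Acquire access tokens through the Azure CLI instead of a key or client secret.
      abfs_use_cli: true
      file_format: csv
```

This can be convenient for local development, where a signed-in CLI session is often already available and no secret needs to be stored.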

#### Retry parameters

| Parameter name | Description |
| ------------------------------- | -------------------------------------------------------------------------------------------- |
| `abfs_max_retries` | Maximum number of retries |
| `abfs_retry_timeout` | Timeout for all retries. Accepts any duration string (i.e `5s`, `1m`, etc) |
| `abfs_backoff_initial_duration` | How long to wait before the initial retry. Accepts any duration string (i.e `5s`, `1m`, etc) |
| `abfs_backoff_max_duration` | Maximum length to wait for a retry. Accepts any duration string (i.e `5s`, `1m`, etc) |
| `abfs_backoff_base` | Floating-point base of the exponential to use when backing off retries |
| Parameter name | Description |
| ------------------------------- | -------------------------------------------- |
| `abfs_max_retries` | Maximum retries |
| `abfs_retry_timeout` | Total timeout for retries (e.g., `5s`, `1m`) |
| `abfs_backoff_initial_duration` | Initial retry delay (e.g., `5s`) |
| `abfs_backoff_max_duration` | Maximum retry delay (e.g., `1m`) |
| `abfs_backoff_base` | Exponential backoff base (e.g., `0.1`) |
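
A minimal sketch of the retry parameters in use. The values are illustrative only, not recommended defaults, and `MY_ACCESS_KEY` is a hypothetical secret name.

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_retry_test
    params:
      abfs_account: spiceadls
      abfs_access_key: ${ secrets:MY_ACCESS_KEY }
      # Give up after 3 retries or 1 minute in total, whichever comes first.
      abfs_max_retries: 3
      abfs_retry_timeout: 1m
      # Wait 5s before the first retry and cap any single wait at 30s.
      abfs_backoff_initial_duration: 5s
      abfs_backoff_max_duration: 30s
      file_format: csv
```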

#### File format parameters
## Supported file formats

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
Specify the file format using the `file_format` parameter. For more details, see [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

## Examples

### Reading a CSV file using an Access Key
### Reading a CSV file with an Access Key

```yaml
datasets:
@@ -121,7 +133,7 @@ datasets:
      file_format: csv
```

### Reading from a public container
### Using Public Containers

```yaml
datasets:
@@ -133,27 +145,29 @@ datasets:
      file_format: csv
```

### Using secrets for container and account name
### Connecting to the Storage Emulator

```yaml
datasets:
  # dummy_container will be overridden by the value in `abfs_container`
  - from: abfs://dummy_container/my_csv.csv
    name: prod_data
  - from: abfs://test_container/test_csv.csv
    name: test_data
    params:
      abfs_account: ${ secrets:PROD_ACCOUNT }
      abfs_container: ${ secrets:PROD_CONTAINER }
      abfs_use_emulator: true
      file_format: csv
```

### Connecting to the Storage Emulator
### Using secrets for Account and Container

When using secrets for `abfs_container`, a dummy container name needs to be provided in the `from` field. This dummy value will be replaced by the value in the secret at runtime.

```yaml
datasets:
  - from: abfs://test_container/test_csv.csv
    name: test_data
  # dummy_container will be overridden by the value in `abfs_container`
  - from: abfs://dummy_container/my_csv.csv
    name: prod_data
    params:
      abfs_use_emulator: true
      abfs_account: ${ secrets:PROD_ACCOUNT }
      abfs_container: ${ secrets:PROD_CONTAINER }
      file_format: csv
```

@@ -165,6 +179,10 @@ datasets:
    name: my_data
    params:
      abfs_tenant_id: B3E1A8F4-9D5B-4D3B-8D2E-1F4A9D5B4D3B
      abfs_client_id: ${ secrets:MY_CLIENT_ID }
      abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
```
      abfs_client_id: A587D13A-7E4E-46AB-BB87-E7A8AAFB42F3
      abfs_client_secret: qoiwdjqidj213094103213o0~!!
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/secret-stores/index.md
@@ -38,7 +38,7 @@ secrets:
    name: env
```

## Using referenced secrets in component parameters
## Using referenced secrets in component parameters {#using-secrets}

Secrets may be used by components with the syntax `${<secret_store_name>:<key_name>}`. For example, to reference a secret stored as an environment variable named `MY_SECRET` in the `env` secret store, use `${env:MY_SECRET}`.
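
A minimal sketch of this syntax inside a dataset definition. It assumes the `env` secret store is configured as shown above and that an environment variable named `MY_SECRET` holds a storage access key; the dataset, account, and container names are placeholders.

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      abfs_account: spiceadls
      # Resolved at runtime from the `MY_SECRET` key in the `env` secret store.
      abfs_access_key: ${env:MY_SECRET}
      file_format: csv
```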
