Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in-memory arrow accelerator quickstart #218

Merged
merged 1 commit into from
Oct 24, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 141 additions & 0 deletions arrow/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# In-Memory Arrow Data Accelerator Quickstart

Create a connector instance using sample data and accelerate it using In-Memory Arrow Data Accelerator.

## Requirements

- Spice CLI installed (see [Getting Started](https://docs.spiceai.org/getting-started)).

## Follow these steps

**Step 1.** Initialize a new Spice app.

```bash
spice init arrow-acceleration-qs
cd arrow-acceleration-qs
```

**Step 2.** Configure s3 dataset: copy and paste the YAML below to `spicepod.yaml` in the Spice app.

```yaml
version: v1beta1
kind: Spicepod
name: arrow-acceleration-qs
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
description: taxi trips in s3
params:
file_format: parquet
```

**Step 3.** Start the Spice runtime.

```bash
spice run
```

Confirm in the terminal output the `taxi_trips` dataset has been loaded:

```bash
2024/10/22 12:27:07 INFO Checking for latest Spice runtime release...
2024/10/22 12:27:07 INFO Spice.ai runtime starting...
2024-10-22T19:27:08.372600Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-10-22T19:27:08.372667Z INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2024-10-22T19:27:08.372790Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-10-22T19:27:08.373723Z INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052
2024-10-22T19:27:08.572674Z INFO runtime: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
2024-10-22T19:27:08.582201Z INFO runtime: Tool [document_similarity] ready to use
2024-10-22T19:27:08.582226Z INFO runtime: Tool [table_schema] ready to use
2024-10-22T19:27:08.582231Z INFO runtime: Tool [sql] ready to use
2024-10-22T19:27:08.582236Z INFO runtime: Tool [list_datasets] ready to use
2024-10-22T19:27:08.582242Z INFO runtime: Tool [random_sample] ready to use
2024-10-22T19:27:08.582245Z INFO runtime: Tool [sample_distinct_columns] ready to use
2024-10-22T19:27:08.582251Z INFO runtime: Tool [top_n_sample] ready to use
2024-10-22T19:27:09.991363Z INFO runtime: Dataset taxi_trips registered (s3://spiceai-demo-datasets/taxi_trips/2024/), results cache enabled.
```

**Step 4.** Run queries against the dataset using the Spice SQL REPL.

_In a new terminal_, start the Spice SQL REPL

```bash
spice sql
```

Query the `taxi_trips` dataset, observing the long query time.

```sql
select "VendorID", tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count from taxi_trips limit 10;
+----------+----------------------+-----------------------+-----------------+
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
+----------+----------------------+-----------------------+-----------------+
| 2 | 2024-01-29T19:28:41 | 2024-01-29T19:36:46 | 2 |
| 1 | 2024-01-29T19:22:21 | 2024-01-29T19:28:45 | 2 |
| 1 | 2024-01-29T19:50:24 | 2024-01-29T20:09:21 | 2 |
| 1 | 2024-01-29T19:43:52 | 2024-01-29T20:01:40 | 2 |
| 1 | 2024-01-29T19:09:57 | 2024-01-29T19:55:36 | 2 |
| 1 | 2024-01-29T19:51:28 | 2024-01-29T20:09:16 | 2 |
| 1 | 2024-01-29T19:23:46 | 2024-01-29T19:31:06 | 2 |
| 2 | 2024-01-29T19:01:27 | 2024-01-29T19:09:07 | 2 |
| 1 | 2024-01-29T19:13:53 | 2024-01-29T19:23:09 | 2 |
| 1 | 2024-01-29T19:53:55 | 2024-01-29T20:06:56 | 2 |
+----------+----------------------+-----------------------+-----------------+

Time: 4.291336125 seconds. 10 rows.
```

**Step 5.** Update the `spicepod.yaml` to enable In-Memory Arrow acceleration.

```yaml
version: v1beta1
kind: Spicepod
name: arrow-acceleration-qs
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
description: taxi trips in s3
params:
file_format: parquet
acceleration:
enabled: true
```

**Step 6.** Save the changes in Spice app and observe the dataset updating and accelerating.

```bash
2024-10-22T19:28:24.204608Z INFO runtime: Updating accelerated dataset taxi_trips...
2024-10-22T19:28:25.202828Z INFO runtime::accelerated_table::refresh_task: Loading data for dataset taxi_trips
2024-10-22T19:29:07.729346Z INFO runtime::accelerated_table::refresh_task: Loaded 2,964,624 rows (398.86 MiB) for dataset taxi_trips in 42s 525ms.
2024-10-22T19:29:09.217425Z INFO runtime: Dataset taxi_trips registered (s3://spiceai-demo-datasets/taxi_trips/2024/), acceleration (arrow), results cache enabled.
```

**Step 7.** Run a query against the `taxi_trips` dataset again, observing the fast query time.

```sql
select "VendorID", tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count from taxi_trips limit 10;
+----------+----------------------+-----------------------+-----------------+
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count |
+----------+----------------------+-----------------------+-----------------+
| 1 | 2024-01-12T19:14:53 | 2024-01-12T19:28:34 | 2 |
| 2 | 2024-01-12T19:03:12 | 2024-01-12T19:17:19 | 2 |
| 2 | 2024-01-12T19:34:22 | 2024-01-12T19:38:11 | 2 |
| 2 | 2024-01-12T19:44:51 | 2024-01-12T19:49:40 | 2 |
| 1 | 2024-01-12T19:31:54 | 2024-01-12T19:38:43 | 2 |
| 2 | 2024-01-12T19:54:37 | 2024-01-12T19:59:55 | 2 |
| 1 | 2024-01-12T19:02:32 | 2024-01-12T19:12:25 | 2 |
| 1 | 2024-01-12T19:22:38 | 2024-01-12T19:37:30 | 2 |
| 2 | 2024-01-12T19:34:30 | 2024-01-12T20:31:18 | 2 |
| 2 | 2024-01-12T19:54:06 | 2024-01-12T20:07:29 | 2 |
+----------+----------------------+-----------------------+-----------------+

Time: 0.013083584 seconds. 10 rows.
```

## Learn more

- [In-Memory Arrow Data Accelerator Documentation](https://docs.spiceai.org/components/data-accelerators/arrow).

- For using `spice sql`, see the [CLI reference](https://docs.spiceai.org/cli/reference/sql).

- See the [datasets reference](https://docs.spiceai.org/reference/spicepod/datasets) for additional dataset configuration options.
Loading