update
Jeadie committed Oct 14, 2024
1 parent dcfdce9 commit 70f6a5b
Showing 2 changed files with 95 additions and 112 deletions.
175 changes: 63 additions & 112 deletions search/README.md
@@ -1,116 +1,67 @@
# Quickstart: Mastering Search in Spice

Welcome to this quickstart guide! Here, you'll learn how to use Spice's powerful search capabilities, combining both SQL-based and advanced vector-based search functionalities. Whether you're new or experienced, follow these steps to get started and unlock the full potential of your data.

## Introduction to Searching with Spice

**Spice** integrates traditional SQL and cutting-edge vector-based search technologies to empower users with flexible and efficient data exploration.

---

## Getting Started

### Setting Up Your Environment

1. **Install Spice:**
- Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.

2. **Create a New Spice Pod:**
- Initialize a new spicepod to organize your datasets and configurations:
```bash
spice create my_first_spicepod
cd my_first_spicepod
```

3. **Configure Your Spicepod:**
- Edit the `spicepod.yaml` configuration file:
```yaml
embeddings:
- from: openai
name: remote_service
params:
openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
- name: local_embedding_model
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
```

4. **Load Sample Datasets:**
- Add datasets to your Spice pod. For example, to create a dataset from GitHub issues:
```yaml
datasets:
- from: github:github.com/spiceai/spiceai/issues
name: spiceai.issues
acceleration:
enabled: true
embeddings:
- column: body
use: local_embedding_model
```

### **Ensure API Keys Are Set:**
- Export or set necessary environment variables before proceeding (e.g., `SPICE_OPENAI_API_KEY`).

---

## Performing SQL-Based Search

Spice allows you to perform traditional SQL searches efficiently:

### Execute a Basic SQL Query

1. **Run a Query:**
- Use SQL to perform keyword searches within your dataset:
```sql
SELECT id, text_column
FROM spice.public.quickstarts
WHERE
LOWER(text_column) LIKE '%search_term%'
AND
date_published > '2021-01-01'
```

Run this via your SQL interface connected to Spice.

---

## Utilizing Vector-Based Search

Vector-based search in Spice enables semantic and similarity-based searches, enhancing your search capabilities beyond traditional keywords.

### Configure Vector Search

1. **Embedding Configuration:**
- Make sure your dataset column is configured for vector search in your `spicepod.yaml`.

2. **Perform a Search Query:**
- Execute a vector-based query using curl from the command line:
```shell
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["spiceai.issues"],
"text": "cutting edge AI",
"where": "author=\"jeadie\"",
"additional_columns": ["title", "state"],
"limit": 2
}'
```

This command returns results based on semantic similarities in your data.
# Quickstart: Searching with Spice

## Prerequisites
- Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.
- Populate `.env` with the secrets that `spicepod.yaml` references (`SPICE_OPENAI_API_KEY` and `GITHUB_TOKEN`).

### SQL Search
1. Start the Spice SQL REPL to run a keyword search against your dataset:
```shell
spice sql
```

Then:
```sql
SELECT path
FROM spiceai.files
WHERE
LOWER(content) LIKE '%errors%'
AND NOT contains(path, 'docs/release_notes')
```

### Utilizing Vector-Based Search

1. In `spicepod.yaml`, uncomment `datasets[0].embeddings`.
2. Restart `spiced`.
3. Perform a basic search:
```shell
curl -XPOST http://localhost:8090/v1/search \
-H "Content-Type: application/json" \
-d "{
\"datasets\": [\"spiceai.files\"],
\"text\": \"testing\",
\"where\": \"not contains(path, 'docs/release_notes')\",
\"additional_columns\": [\"download_url\"],
\"limit\": 2
}"
```
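The same request can be issued from a script, which avoids the escaped quotes needed in the shell. The following is a minimal sketch in Python using only the standard library. The helper names `build_search_payload` and `search` are hypothetical (they are not part of Spice), the URL is the one from the curl command above, and the shape of the response is not assumed.

```python
import json
import urllib.request

# Assumption: the runtime listens on the same address used by the curl command.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_payload(text, datasets, where=None, additional_columns=None, limit=2):
    """Build the JSON body for /v1/search from the fields shown in the curl example."""
    payload = {"datasets": list(datasets), "text": text, "limit": limit}
    if where is not None:
        payload["where"] = where
    if additional_columns:
        payload["additional_columns"] = list(additional_columns)
    return payload


def search(payload, url=SEARCH_URL):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build the same body as the curl command; no request is sent at this point.
payload = build_search_payload(
    text="testing",
    datasets=["spiceai.files"],
    where="not contains(path, 'docs/release_notes')",
    additional_columns=["download_url"],
)
print(json.dumps(payload, indent=2))
```

With `spiced` running, `search(payload)` would send the request; the payload is built separately so it can be inspected before anything goes over the network.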

### Additional Configuration - Chunking

- Spice supports chunking large text fields for more precise searches.

Example configuration:
```yaml
datasets:
- ...
embeddings:
- column: body
use: local_embedding_model
chunking:
enabled: true
target_chunk_size: 512
```
1. In `spicepod.yaml`, set `datasets[0].embeddings.chunking.enabled: true`.
2. Restart `spiced`.
3. Rerun the search:
```shell
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d "{
\"datasets\": [\"spiceai.files\"],
\"text\": \"errors\",
\"where\": \"not contains(path, 'docs/release_notes')\",
\"additional_columns\": [\"download_url\"],
\"limit\": 2
}"
```

4. Rerun the search and retrieve the full document by adding `content` to `additional_columns`:
```shell
curl -XPOST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d "{
\"datasets\": [\"spiceai.files\"],
\"text\": \"errors\",
\"where\": \"not contains(path, 'docs/release_notes')\",
\"additional_columns\": [\"download_url\", \"content\"],
\"limit\": 2
}"
```
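The `chunking` settings in `search/spicepod.yaml` (`target_chunk_size: 256`, `overlap_size: 64`) suggest overlapping windows over each document. As a rough illustration only, and not Spice's actual chunker, the sketch below splits a sequence into windows of a target size where consecutive windows share a fixed overlap. The function name `chunk_with_overlap` is hypothetical, and whether Spice measures sizes in tokens or characters is not stated here, so the units are left abstract.

```python
def chunk_with_overlap(units, target_chunk_size=256, overlap_size=64):
    """Split a sequence into windows of target_chunk_size sharing overlap_size units."""
    if target_chunk_size <= 0:
        raise ValueError("target_chunk_size must be positive")
    if not 0 <= overlap_size < target_chunk_size:
        raise ValueError("overlap_size must be in [0, target_chunk_size)")

    # Each new window starts this many units after the previous one.
    step = target_chunk_size - overlap_size
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + target_chunk_size])
        # Stop once a window reaches the end of the input.
        if start + target_chunk_size >= len(units):
            break
    return chunks


# Illustrative input: 600 integers standing in for tokens or characters.
units = list(range(600))
chunks = chunk_with_overlap(units)
print([len(chunk) for chunk in chunks])
```

Under these assumptions, a smaller `target_chunk_size` yields more, shorter chunks per document, which is why a search over chunked content can point at a narrower passage than a search over whole files.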
32 changes: 32 additions & 0 deletions search/spicepod.yaml
@@ -0,0 +1,32 @@
version: v1beta1
kind: Spicepod
name: sharepoint-qs

models:
  - from: openai
    name: remote_service
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }

embeddings:
  - name: local_embedding_model
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: github:github.com/spiceai/spiceai/files/trunk
    name: spiceai.files
    params:
      github_token: ${secrets:GITHUB_TOKEN}
      include: 'docs/**/*.md'
    acceleration:
      enabled: true
    embeddings:
      - column: content
        use: local_embedding_model
        column_pk:
          - path
        chunking:
          enabled: true
          target_chunk_size: 256
          overlap_size: 64
          file_format: md
