update

spiceai · Oct 14, 2024 · 70f6a5b
1 parent dcfdce9
commit 70f6a5b
Showing 2 changed files with 95 additions and 112 deletions.
diff --git a/search/README.md b/search/README.md
@@ -1,116 +1,67 @@
-# Quickstart: Mastering Search in Spice
-
-Welcome to this quickstart guide! Here, you'll learn how to use Spice's powerful search capabilities, combining both SQL-based and advanced vector-based search functionalities. Whether you're new or experienced, follow these steps to get started and unlock the full potential of your data.
-
-## Introduction to Searching with Spice
-
-**Spice** integrates traditional SQL and cutting-edge vector-based search technologies to empower users with flexible and efficient data exploration.
-
----
-
-## Getting Started
-
-### Setting Up Your Environment
-
-1. **Install Spice:**
-   - Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.
-
-2. **Create a New Spice Pod:**
-   - Initialize a new spicepod to organize your datasets and configurations:
-     ```bash
-     spice create my_first_spicepod
-     cd my_first_spicepod
-     ```
-
-3. **Configure Your Spicepod:**
-   - Edit the `spicepod.yaml` configuration file:
-     ```yaml
-     embeddings:
-       - from: openai
-         name: remote_service
-         params:
-           openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
-
-       - name: local_embedding_model
-         from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
-     ```
-
-4. **Load Sample Datasets:**
-   - Add datasets to your Spice pod. For an example, creating a dataset from GitHub issues:
-     ```yaml
-     datasets:
-       - from: github:github.com/spiceai/spiceai/issues
-         name: spiceai.issues
-         acceleration:
-           enabled: true
-         embeddings:
-           - column: body
-             use: local_embedding_model
-     ```
-
-### **Ensure API Keys Are Set:**
-   - Export or set necessary environment variables before proceeding (e.g., `SPICE_OPENAI_API_KEY`).
-
----
-
-## Performing SQL-Based Search
-
-Spice allows you to perform traditional SQL searches efficiently:
-
-### Execute a Basic SQL Query
-
-1. **Run a Query:**
-   - Use SQL to perform keyword searches within your dataset:
-     ```sql
-     SELECT id, text_column
-     FROM spice.public.quickstarts
-     WHERE
-         LOWER(text_column) LIKE '%search_term%'
-       AND
-         date_published > '2021-01-01'
-     ```
-
-Run this via your SQL interface connected to Spice.
-
----
-
-## Utilizing Vector-Based Search
-
-Vector-based search in Spice enables semantic and similarity-based searches, enhancing your search capabilities beyond traditional keywords.
-
-### Configure Vector Search
-
-1. **Embedding Configuration:**
-   - Make sure your dataset column is configured for vector search in your `spicepod.yaml`.
-
-2. **Perform a Search Query:**
-   - Execute a vector-based query using curl from the command line:
-     ```shell
-     curl -XPOST http://localhost:8090/v1/search \\
-       -H 'Content-Type: application/json' \\
-       -d '{
-         "datasets": ["spiceai.issues"],
-         "text": "cutting edge AI",
-         "where": "author=\"jeadie\"",
-         "additional_columns": ["title", "state"],
-         "limit": 2
-       }'
-     ```
-
-This command returns results based on semantic similarities in your data.
+# Quickstart: Searching with Spice
+
+## Prerequisites
+ - Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.
+ - Populate `.env` with the secrets referenced in `spicepod.yaml` (`SPICE_OPENAI_API_KEY` and `GITHUB_TOKEN`).
+
+### SQL Search
+1. Run a basic SQL query to perform a keyword search within your dataset:
+```shell
+spice sql
+```
+
+Then:
+```sql
+SELECT path
+FROM spiceai.files
+WHERE
+    LOWER(content) LIKE '%errors%'
+    AND NOT contains(path, 'docs/release_notes')
+```
+
+### Utilizing Vector-Based Search
+
+1. In `spicepod.yaml`, uncomment `datasets[0].embeddings`.
+2. Restart `spiced`.
+3. Perform a basic search:
+```shell
+  curl -XPOST http://localhost:8090/v1/search \
+    -H "Content-Type: application/json" \
+    -d "{
+      \"datasets\": [\"spiceai.files\"],
+      \"text\": \"testing\",
+      \"where\": \"not contains(path, 'docs/release_notes')\",
+      \"additional_columns\": [\"download_url\"],
+      \"limit\": 2
+    }"
+```
 
 ### Additional Configuration - Chunking
 
-- Spice supports chunking large text fields for more precise searches.
-
-Example configuration:
-```yaml
-datasets:
-  - ...
-    embeddings:
-      - column: body
-        use: local_embedding_model
-        chunking:
-          enabled: true
-          target_chunk_size: 512
+1. In `spicepod.yaml`, set `datasets[0].embeddings.chunking.enabled: true`.
+2. Restart `spiced`.
+3. Rerun the search:
+```shell
+curlie -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d "{
+    \"datasets\": [\"spiceai.files\"],
+    \"text\": \"errors\",
+    \"where\": \"not contains(path, 'docs/release_notes')\",
+    \"additional_columns\": [\"download_url\"],
+    \"limit\": 2
+  }"
+```
+
+4. Rerun the search and retrieve the full document by adding `content` to `additional_columns`:
+```shell
+curlie -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d "{
+    \"datasets\": [\"spiceai.files\"],
+    \"text\": \"errors\",
+    \"where\": \"not contains(path, 'docs/release_notes')\",
+    \"additional_columns\": [\"download_url\" , \"content\"],
+    \"limit\": 2
+  }"
 ```
diff --git a/search/spicepod.yaml b/search/spicepod.yaml
@@ -0,0 +1,32 @@
+version: v1beta1
+kind: Spicepod
+name: sharepoint-qs
+
+models:
+  - from: openai
+    name: remote_service
+    params:
+      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
+
+embeddings:
+  - name: local_embedding_model
+    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+datasets:
+  - from: github:github.com/spiceai/spiceai/files/trunk
+    name: spiceai.files
+    params:
+      github_token: ${secrets:GITHUB_TOKEN}
+      include: 'docs/**/*.md'
+    acceleration:
+      enabled: true
+    embeddings:
+      - column: content
+        use: local_embedding_model
+        column_pk:
+          - path
+        chunking:
+          enabled: true
+          target_chunk_size: 256
+          overlap_size: 64
+          file_format: md
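
The search requests in the README above build their JSON bodies inside double-quoted shell strings, so every inner quote has to be escaped by hand. As a minimal sketch of an alternative, the same body can be assembled as a Python dictionary and serialized with `json.dumps`. The endpoint URL and field names are taken from the README's `curl` examples; the helper names `build_search_payload` and `search` are hypothetical, and the assumption that the endpoint responds with JSON is not confirmed by this commit.

```python
import json
import urllib.request

# Endpoint used by the README's curl examples.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_payload(text, datasets, where=None, additional_columns=None, limit=2):
    """Assemble the request body used by the README's search examples.

    Hypothetical helper: the field names mirror the README, nothing more.
    """
    payload = {"datasets": list(datasets), "text": text, "limit": limit}
    if where is not None:
        payload["where"] = where
    if additional_columns:
        payload["additional_columns"] = list(additional_columns)
    return payload


def search(payload, url=SEARCH_URL):
    """POST the payload to a locally running spiced instance.

    Assumes the response body is JSON; adjust if it is not.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Same request as step 4 of the chunking section in the README.
payload = build_search_payload(
    text="errors",
    datasets=["spiceai.files"],
    where="not contains(path, 'docs/release_notes')",
    additional_columns=["download_url", "content"],
)
print(json.dumps(payload, indent=2))
# results = search(payload)  # uncomment once spiced is running locally
```

Because `json.dumps` handles the quoting, the `where` clause keeps its single-quoted SQL string literal without any backslash escapes.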