diff --git a/search/README.md b/search/README.md
index 50a14f2..058d4dc 100644
--- a/search/README.md
+++ b/search/README.md
@@ -1,116 +1,67 @@
-# Quickstart: Mastering Search in Spice
-
-Welcome to this quickstart guide! Here, you'll learn how to use Spice's powerful search capabilities, combining both SQL-based and advanced vector-based search functionalities. Whether you're new or experienced, follow these steps to get started and unlock the full potential of your data.
-
-## Introduction to Searching with Spice
-
-**Spice** integrates traditional SQL and cutting-edge vector-based search technologies to empower users with flexible and efficient data exploration.
-
----
-
-## Getting Started
-
-### Setting Up Your Environment
-
-1. **Install Spice:**
-   - Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.
-
-2. **Create a New Spice Pod:**
-   - Initialize a new spicepod to organize your datasets and configurations:
-   ```bash
-   spice create my_first_spicepod
-   cd my_first_spicepod
-   ```
-
-3. **Configure Your Spicepod:**
-   - Edit the `spicepod.yaml` configuration file:
-   ```yaml
-   embeddings:
-     - from: openai
-       name: remote_service
-       params:
-         openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
-
-     - name: local_embedding_model
-       from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
-   ```
-
-4. **Load Sample Datasets:**
-   - Add datasets to your Spice pod. For an example, creating a dataset from GitHub issues:
-   ```yaml
-   datasets:
-     - from: github:github.com/spiceai/spiceai/issues
-       name: spiceai.issues
-       acceleration:
-         enabled: true
-       embeddings:
-         - column: body
-           use: local_embedding_model
-   ```
-
-### **Ensure API Keys Are Set:**
-   - Export or set necessary environment variables before proceeding (e.g., `SPICE_OPENAI_API_KEY`).
-
----
-
-## Performing SQL-Based Search
-
-Spice allows you to perform traditional SQL searches efficiently:
-
-### Execute a Basic SQL Query
-
-1. **Run a Query:**
-   - Use SQL to perform keyword searches within your dataset:
-   ```sql
-   SELECT id, text_column
-   FROM spice.public.quickstarts
-   WHERE
-     LOWER(text_column) LIKE '%search_term%'
-     AND
-     date_published > '2021-01-01'
-   ```
-
-Run this via your SQL interface connected to Spice.
-
----
-
-## Utilizing Vector-Based Search
-
-Vector-based search in Spice enables semantic and similarity-based searches, enhancing your search capabilities beyond traditional keywords.
-
-### Configure Vector Search
-
-1. **Embedding Configuration:**
-   - Make sure your dataset column is configured for vector search in your `spicepod.yaml`.
-
-2. **Perform a Search Query:**
-   - Execute a vector-based query using curl from the command line:
-   ```shell
-   curl -XPOST http://localhost:8090/v1/search \\
-   -H 'Content-Type: application/json' \\
-   -d '{
-     "datasets": ["spiceai.issues"],
-     "text": "cutting edge AI",
-     "where": "author=\"jeadie\"",
-     "additional_columns": ["title", "state"],
-     "limit": 2
-   }'
-   ```
-
-This command returns results based on semantic similarities in your data.
+# Quickstart: Searching with Spice
+
+## Prerequisites
+ - Ensure you have the Spice CLI installed. Follow the [Spice installation guide](link_to_installation_guide) if you haven't done so.
+ - Populate `.env`.
+
+### SQL Search
+1. Execute a Basic SQL Query to perform keyword searches within your dataset:
+```shell
+spice sql
+```
+
+Then:
+```sql
+SELECT path
+FROM spiceai.files
+WHERE
+  LOWER(content) LIKE '%errors%'
+  AND NOT contains(path, 'docs/release_notes')
+```
+
+### Utilizing Vector-Based Search
+
+1. In the `spicepod.yaml`, uncomment the `datasets[0].embeddings`.
+2. Restart `spiced`.
+3. Perform a basic search:
+```shell
+  curl -XPOST http://localhost:8090/v1/search \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"datasets\": [\"spiceai.files\"],
+    \"text\": \"testing\",
+    \"where\": \"not contains(path, 'docs/release_notes')\",
+    \"additional_columns\": [\"download_url\"],
+    \"limit\": 2
+  }"
+```
 
 ### Additional Configuration - Chunking
 
-- Spice supports chunking large text fields for more precise searches.
-
-Example configuration:
-```yaml
-datasets:
-  - ...
-    embeddings:
-      - column: body
-        use: local_embedding_model
-        chunking:
-          enabled: true
-          target_chunk_size: 512
+1. Update the spicepod `datasets[0].embeddings.chunking.enabled: true`.
+2. Restart `spiced`.
+3. Rerun the search:
+```shell
+curlie -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d "{
+    \"datasets\": [\"spiceai.files\"],
+    \"text\": \"errors\",
+    \"where\": \"not contains(path, 'docs/release_notes')\",
+    \"additional_columns\": [\"download_url\"],
+    \"limit\": 2
+  }"
+```
+
+4. Rerun the search, and retrieve the full document (as an entry in `additional_columns`).
+```shell
+  curlie -XPOST http://localhost:8090/v1/search \
+  -H 'Content-Type: application/json' \
+  -d "{
+    \"datasets\": [\"spiceai.files\"],
+    \"text\": \"errors\",
+    \"where\": \"not contains(path, 'docs/release_notes')\",
+    \"additional_columns\": [\"download_url\" , \"content\"],
+    \"limit\": 2
+  }"
 ```
\ No newline at end of file
diff --git a/search/spicepod.yaml b/search/spicepod.yaml
new file mode 100644
index 0000000..ab59ed0
--- /dev/null
+++ b/search/spicepod.yaml
@@ -0,0 +1,32 @@
+version: v1beta1
+kind: Spicepod
+name: sharepoint-qs
+
+models:
+  - from: openai
+    name: remote_service
+    params:
+      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
+
+embeddings:
+  - name: local_embedding_model
+    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+
+datasets:
+  - from: github:github.com/spiceai/spiceai/files/trunk
+    name: spiceai.files
+    params:
+      github_token: ${secrets:GITHUB_TOKEN}
+      include: 'docs/**/*.md'
+    acceleration:
+      enabled: true
+    embeddings:
+      - column: content
+        use: local_embedding_model
+        column_pk:
+          - path
+        chunking:
+          enabled: true
+          target_chunk_size: 256
+          overlap_size: 64
+          file_format: md
\ No newline at end of file
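
Note on the search requests in the README above: the hand-escaped JSON in the `curl`/`curlie` commands is easy to get wrong. As a minimal sketch, the same request body could be built in Python and serialized with `json.dumps`. The endpoint, dataset name, filter, and column names are copied from the commands above; the helper names `build_search_payload` and `search` are hypothetical, and nothing is assumed about the shape of the response, which is returned as raw text.

```python
import json
import urllib.request

# Endpoint copied from the curl commands in the README above.
SEARCH_URL = "http://localhost:8090/v1/search"


def build_search_payload(text, include_content=False, limit=2):
    """Build the JSON body used by the README's search requests (hypothetical helper)."""
    columns = ["download_url"]
    if include_content:
        # Matches step 4: also return the full document text.
        columns.append("content")
    return {
        "datasets": ["spiceai.files"],
        "text": text,
        "where": "not contains(path, 'docs/release_notes')",
        "additional_columns": columns,
        "limit": limit,
    }


def search(payload, url=SEARCH_URL):
    """POST the payload and return the raw response body (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Print the request body only; no network call is made here.
print(json.dumps(build_search_payload("errors", include_content=True), indent=2))
```

With the runtime listening on port 8090, calling `search(build_search_payload("errors", include_content=True))` would send the same request as step 4.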