Vespa Documentation Search

Vespa Documentation Search is a Vespa Cloud instance for searching documents in:

This sample app is auto-deployed to Vespa Cloud, see deploy-vespa-documentation-search.yaml

Figure: Vespa Documentation Search architecture

Deployment status:

  • Deploy vespa-documentation-search to Vespa Cloud
  • Vespa Cloud Documentation Search Deployment

Query API

Open API endpoints:

Example requests:

$ curl "https://doc-search.vespa.oath.cloud/document/v1/open/doc/docid/open%2Fen%2Freference%2Fquery-api-reference.html"
$ curl --data-urlencode 'yql=select * from doc where userInput(@userinput)' \
  --data-urlencode 'userinput=vespa ranking is great' \
  https://doc-search.vespa.oath.cloud/search/

Using these endpoints is a good way to get started with Vespa. To run your own instance, see the GitHub deploy action (use vespa:deploy to deploy to a dev instance), or the quick-start to deploy using Docker.

Refer to getting-started-ranking for example use of the Query API.
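The second example request can also be built programmatically. The following is a minimal sketch that only constructs the request URL, using the same yql and userinput parameters as the curl example but placing them in a GET query string. The function name build_search_url and the extra_params argument are illustrative and not part of the application.

```python
from urllib.parse import urlencode

# Endpoint taken from the example requests above.
SEARCH_ENDPOINT = "https://doc-search.vespa.oath.cloud/search/"


def build_search_url(user_input, extra_params=None):
    """Return a search URL with the yql and userinput parameters URL-encoded.

    build_search_url and extra_params are hypothetical names for this sketch.
    """
    params = {
        "yql": "select * from doc where userInput(@userinput)",
        "userinput": user_input,
    }
    # Optional additional query parameters, e.g. {"hits": "5"}
    params.update(extra_params or {})
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url("vespa ranking is great"))
```

The resulting URL can be passed to curl or any HTTP client.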

Feed your own instance

It is easy to set up your own instance on Vespa Cloud and feed documents from vespa-engine/documentation:

1: Generate the open_index.json feed file: cd vespa-engine/documentation && bundle exec jekyll build -p _plugins-vespafeed. Refer to the vespa_index_generator.rb for how the feed file is generated.

2: Add data plane credentials:

$ pwd; ll *.pem
/Users/myuser/github/vespa-engine/documentation
-rwxr-xr-x@ 1 myuser  staff  3272 Mar 17 09:30 data-plane-private-key.pem
-rwxr-xr-x@ 1 myuser  staff  1696 Mar 17 09:30 data-plane-public-key.pem

3: Set endpoint in _config.yml (get this from the Vespa Cloud Console):

diff --git a/_config.yml b/_config.yml
...
     feed_endpoints:
-        - url: https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/
-          indexes:
-              - open_index.json
-        - url: https://vespacloud-docsearch.vespa-team.aws-ap-northeast-1a.z.vespa-app.cloud/
+        - url: https://myinstance.vespacloud-docsearch.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/
           indexes:

4: Feed open_index.json:

$ ./feed_to_vespa.py

Ranking

The ranking is deliberately simple, and serves as an introduction to using query rank features and summary features:

    rank-profile documentation inherits default {
        inputs {
            query(titleWeight) double: 2.0
            query(headersWeight) double: 1.0
            query(contentWeight) double: 1.0
            query(keywordsWeight) double: 10.0
            query(pathWeight) double: 1.0
        }
        first-phase {
            expression {
                query(titleWeight) * bm25(title) +
                query(contentWeight) * bm25(content) +
                query(headersWeight) * bm25(headers) +
                query(pathWeight) * bm25(path) +
                query(keywordsWeight) * bm25(keywords)
            }
        }
        summary-features {
            query(titleWeight)
            query(contentWeight)
            query(headersWeight)
            query(pathWeight)
            fieldMatch(title)
            fieldMatch(content)
            fieldMatch(content).matches
            fieldLength(title)
            fieldLength(content)
            bm25(title)
            bm25(content)
            bm25(headers)
            bm25(path)
            bm25(keywords)
        }
    }

With this it is easy to experiment with ranking by sending rank-properties in the query and observing the values in summary-features, like:

doc-search.vespa.oath.cloud/search/?yql=select * from doc where userInput(@userinput)&ranking=documentation&input.query(pathWeight)=10&userinput=vespa ranking is great
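To see what changing a weight does, the first-phase expression can be reproduced offline as a weighted sum. This is a sketch only: the default weights are copied from the rank profile above, while the bm25 values are made up for illustration and would normally be read from the summary-features of a hit.

```python
# Default weights from the inputs section of the rank profile above.
DEFAULT_WEIGHTS = {
    "title": 2.0,
    "headers": 1.0,
    "content": 1.0,
    "keywords": 10.0,
    "path": 1.0,
}


def first_phase_score(bm25, weights=None):
    """Weighted sum of per-field bm25 values, mirroring the first-phase expression."""
    w = {**DEFAULT_WEIGHTS, **(weights or {})}
    return sum(w[field] * bm25.get(field, 0.0) for field in w)


# Hypothetical bm25 values for one document (not real query output).
bm25 = {"title": 1.5, "content": 3.0, "headers": 0.5, "path": 2.0, "keywords": 0.0}

print(first_phase_score(bm25))                   # default weights: 8.5
print(first_phase_score(bm25, {"path": 10.0}))   # pathWeight=10:   26.5
```

Overriding the path weight here corresponds to sending input.query(pathWeight)=10 in the query.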

See approximate-nn-hnsw.md for an example of (comma-separated) keywords set in the frontmatter; a document ranks higher for queries that match its keywords, e.g.

---
title: "Approximate Nearest Neighbor Search using HNSW Index"
keywords: "ann, approximate nearest neighbor"
---

Document feed automation

Vespa Documentation is stored in GitHub:

Jekyll is used to serve the documentation; it rebuilds at each commit.

A change also triggers GitHub Actions. The Build step in the workflow uses the Jekyll Generator plugin to build a JSON feed, used in the Feed step:

Security

Vespa Cloud secures endpoints using mTLS. Secrets can be stored in GitHub Settings for a repository. Here, the private key secret is accessed in the GitHub Actions workflow that feeds to Vespa Cloud: feed.yml

Query integration

Query results are open to the internet. Because the Vespa Cloud endpoint requires mTLS, an AWS Lambda function is used in front of Vespa Documentation Search: it gets the private key secret from AWS Parameter Store, then adds it to the HTTPS request to Vespa Cloud:

The lambda needs AmazonSSMReadOnlyAccess added to its Role to access the Parameter Store.

Note that JSON-P is used (jsoncallback=), which simplifies the search result page: search.html.
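With jsoncallback set, the JSON result is wrapped in a call to the named callback function. A script that consumes such a response outside a browser needs to strip that wrapper first. The following is a sketch under that assumption; unwrap_jsonp and the sample body are illustrative, not taken from the application.

```python
import json
import re


def unwrap_jsonp(text, callback):
    """Strip a 'callback(...)' wrapper and parse the JSON payload inside it."""
    pattern = re.compile(
        r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$", re.DOTALL
    )
    match = pattern.match(text)
    if match is None:
        raise ValueError("response is not wrapped in callback " + callback)
    return json.loads(match.group(1))


# Hypothetical response body for jsoncallback=handleResults
body = 'handleResults({"root": {"children": []}});'
print(unwrap_jsonp(body, "handleResults"))
```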

Vespa Cloud Development and Deployments

This is a Vespa Cloud application, and therefore uses automated deployments.

The feed can contain an array of links from each document. The OutLinksDocumentProcessor is custom Java code that adds an in-link to each target document using the Vespa Document API.
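The idea of turning out-links into in-links can be sketched as an inversion of a mapping. This is not the OutLinksDocumentProcessor implementation, which works per document through the Document API; the function name and document paths below are hypothetical.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Map each link target to the sorted list of documents that link to it."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    return {target: sorted(sources) for target, sources in inlinks.items()}


# Hypothetical documents and their out-links.
outlinks = {
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/c.html"],
}
print(invert_links(outlinks))
# {'/b.html': ['/a.html'], '/c.html': ['/a.html', '/b.html']}
```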

To test this functionality, the VespaDocSystemTest runs for each deployment.

Creating a System Test is also a great way to develop a Vespa application:

  • Use this application as a starting point
  • Create a Vespa Cloud tenant (i.e. account), and set tenant in pom.xml
  • Deploy the application to Vespa Cloud
  • Run the System Test from Maven or an IDE using the Endpoint

Generate and feed search suggestions

Use the script in search-suggestions to generate suggestions (generate ../../../documentation/open_index.json first):

$ python3 ../../incremental-search/search-suggestions/count_terms.py \
  ../../../documentation/open_index.json feed_terms.json 2 ../../incremental-search/search-suggestions/top100en.txt
$ curl -L -o vespa-feed-client-cli.zip \
  https://search.maven.org/remotecontent?filepath=com/yahoo/vespa/vespa-feed-client-cli/7.527.20/vespa-feed-client-cli-7.527.20-zip.zip
$ unzip vespa-feed-client-cli.zip
$ ./vespa-feed-client-cli/vespa-feed-client \
  --file feed_terms.json \
  --certificate ../../../documentation/data-plane-public-key.pem --private-key ../../../documentation/data-plane-private-key.pem \
  --endpoint https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/

The above feeds single terms and two-term phrases, with stop-word removal based on top100en.txt. Suggestions with 3 terms create a lot of noise. As a workaround, add such suggestions to extra_suggestions.json and feed that file:

$ ./vespa-feed-client-cli/vespa-feed-client \
  --file extra_suggestions.json \
  --certificate ../../../documentation/data-plane-public-key.pem --private-key ../../../documentation/data-plane-private-key.pem \
  --endpoint https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/
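The counting of single terms and two-term phrases with stop-word removal can be sketched as follows. This is an illustration of the general technique, not the code in count_terms.py; the tokenization and the rule of dropping phrases that start or end with a stop word are assumptions.

```python
import re
from collections import Counter


def count_terms(texts, stop_words, max_phrase_len=2):
    """Count single terms and phrases of up to max_phrase_len terms."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_phrase_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase = tokens[i:i + n]
                # Assumption: skip phrases that start or end with a stop word.
                if phrase[0] in stop_words or phrase[-1] in stop_words:
                    continue
                counts[" ".join(phrase)] += 1
    return counts


# Hypothetical input; the real script reads documents from open_index.json.
texts = ["Vespa ranking is great", "ranking in Vespa"]
print(count_terms(texts, stop_words={"is", "in"}))
```

Raising max_phrase_len to 3 shows how quickly the number of candidate phrases grows, which is the noise the text above refers to.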

Feed grouping examples

cat << EOF | vespa-feed-client \
  --certificate data-plane-public-cert.pem --private-key data-plane-private-key.pem \
  --stdin --endpoint https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud
{"fields": {"customer": "Smith","date": 1157526000,"item": "Intake valve","price": "1000","tax": "0.24"},"put": "id:purchase:purchase::0"}
{"fields": {"customer": "Smith","date": 1157616000,"item": "Rocker arm","price": "1000","tax": "0.12"},"put": "id:purchase:purchase::1"}
{"fields": {"customer": "Smith","date": 1157619600,"item": "Spring","price": "2000","tax": "0.24"},"put": "id:purchase:purchase::2"}
{"fields": {"customer": "Jones","date": 1157709600,"item": "Valve cover","price": "3000","tax": "0.12"},"put": "id:purchase:purchase::3"}
{"fields": {"customer": "Jones","date": 1157702400,"item": "Intake port","price": "5000","tax": "0.24"},"put": "id:purchase:purchase::4"}
{"fields": {"customer": "Brown","date": 1157706000,"item": "Head","price": "8000","tax": "0.12"},"put": "id:purchase:purchase::5"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Coolant","price": "1300","tax": "0.24"},"put": "id:purchase:purchase::6"}
{"fields": {"customer": "Jones","date": 1157788800,"item": "Engine block","price": "2100","tax": "0.12"},"put": "id:purchase:purchase::7"}
{"fields": {"customer": "Brown","date": 1157792400,"item": "Oil pan","price": "3400","tax": "0.24"},"put": "id:purchase:purchase::8"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Oil sump","price": "5500","tax": "0.12"},"put": "id:purchase:purchase::9"}
{"fields": {"customer": "Jones","date": 1157875200,"item": "Camshaft","price": "8900","tax": "0.24"},"put": "id:purchase:purchase::10"}
{"fields": {"customer": "Brown","date": 1157878800,"item": "Exhaust valve","price": "1440","tax": "0.12"},"put": "id:purchase:purchase::11"}
{"fields": {"customer": "Brown","date": 1157882400,"item": "Rocker arm","price": "2330","tax": "0.24"},"put": "id:purchase:purchase::12"}
{"fields": {"customer": "Brown","date": 1157875200,"item": "Spring","price": "3770","tax": "0.12"},"put": "id:purchase:purchase::13"}
{"fields": {"customer": "Smith","date": 1157878800,"item": "Spark plug","price": "6100","tax": "0.24"},"put": "id:purchase:purchase::14"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Exhaust port","price": "9870","tax": "0.12"},"put": "id:purchase:purchase::15"}
{"fields": {"customer": "Brown","date": 1157961600,"item": "Piston","price": "1597","tax": "0.24"},"put": "id:purchase:purchase::16"}
{"fields": {"customer": "Smith","date": 1157965200,"item": "Connection rod","price": "2584","tax": "0.12"},"put": "id:purchase:purchase::17"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Rod bearing","price": "4181","tax": "0.24"},"put": "id:purchase:purchase::18"}
{"fields": {"customer": "Jones","date": 1157972400,"item": "Crankshaft","price": "6765","tax": "0.12"},"put": "id:purchase:purchase::19"}
EOF

Simplified node.js Lambda code

'use strict';
const https = require('https')
const AWS = require('aws-sdk')

const publicCert = `-----BEGIN CERTIFICATE-----
MIIFbDCCA1QCCQCTyf46/BIdpDANBgkqhkiG9w0BAQsFADB4MQswCQYDVQQGEwJO
...
NxoOxvYcP8Pnxn8UGILy7sKl3VRQWIMrlOfXK4DEg8EGqeQzlFVScfSdbH0i6gQz
-----END CERTIFICATE-----`;

exports.handler = async (event, context) => {
    console.log('Received event:', JSON.stringify(event, null, 4));
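    // queryStringParameters can be null when the request has no query string
    const params = event.queryStringParameters || {};
    const query = params.query ? params.query : '';
    const jsoncallback = params.jsoncallback;
    // hits and ranking were not defined in the original snippet; reading them from
    // the request with these defaults is an assumption
    const hits = params.hits ? params.hits : '10';
    const ranking = params.ranking ? params.ranking : 'documentation';
    const path = encodeURI(`/search/?jsoncallback=${jsoncallback}&query=${query}&hits=${hits}&ranking=${ranking}`);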

    const ssm = new AWS.SSM();
    const privateKeyParam = await new Promise((resolve, reject) => {
        ssm.getParameter({
            Name: 'ThePrivateKey',
            WithDecryption: true
        }, (err, data) => {
            if (err) { return reject(err); }
            return resolve(data);
        });
    });

    var options = {
        hostname: 'vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud',
        port: 443,
        path: path,
        method: 'GET',
        headers: { 'accept': 'application/json' },
        key: privateKeyParam.Parameter.Value,
        cert: publicCert
    }

    var body = '';
    const response = await new Promise((resolve, reject) => {
        const req = https.get(
            options,
            res => {
                res.setEncoding('utf8');
                res.on('data', (chunk) => {body += chunk})
                res.on('end', () => {
                    resolve({
                        statusCode: 200,
                        body: body
                    });
                });
            });
        req.on('error', (e) => {
          reject({
              statusCode: 500,
              body: 'Something went wrong!'
          });
        });
    });
    return response
};