Locust-pinecone is a load-testing tool for Pinecone, built on the Locust load-testing framework. Its aim is to make it easy for users to simulate realistic load against a Pinecone Index, to aid in workload sizing.
It can:
- Simulate an arbitrary number of clients issuing different kinds of requests against a Pinecone Index.
- Measure the throughput & latency of requests from the client's point of view.
- Populate the Index with a user-specified dataset before generating load, or operate against existing vectors.
- Be run interactively via Locust's Web-based UI or via the command-line.
locust-pinecone is highly scalable - it can generate from 1 to 10,000 Queries per second (QPS) by making use of multiple client processes to drive the load.
- Clone this repo:

  ```shell
  git clone https://github.com/pinecone-io/locust-pinecone.git
  ```

- Use Poetry to install dependencies:

  ```shell
  cd locust-pinecone
  pip3 install poetry
  poetry install
  ```

- Activate the environment containing the dependencies:

  ```shell
  poetry shell
  ```

- Prepare the environment variables and start Locust:

  ```shell
  ./locust.sh
  ```
This assumes you already have a Pinecone account and an index defined.
The `locust.sh` script starts with a setup shell tool which helps you configure the app. You should provide this script the following:

- A Pinecone API key, such as `fb1b20bc-d702-4248-bb72-1fcd50f03616`. This can be found in your Pinecone console under Projects.
- An Index Host, such as `https://squad-p2-2a849c7.svc.us-east-1-aws.pinecone.io`. This can be found in the index section of your Pinecone console.

Note: once you configure the app, the API key and Index Host are written to a `.env` file in the current directory and used automatically the next time you run the script. You can edit this file as needed if your API key or Index Host changes.

After writing the `.env` file, the script starts the Locust Web UI, which can be accessed by following the instructions printed in the shell (default: http://localhost:8089).
The next time you run `locust.sh`, it will load the previously saved environment variables and start Locust immediately.
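For reference, the `.env` file is a plain key/value file. A minimal sketch of its contents is shown below; the exact variable names are an assumption, so treat the file written by `locust.sh` as authoritative:

```shell
# .env - illustrative contents only; variable names are assumptions
PINECONE_API_KEY=fb1b20bc-d702-4248-bb72-1fcd50f03616
PINECONE_HOST=https://squad-p2-2a849c7.svc.us-east-1-aws.pinecone.io
```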
Locust provides a WebUI to configure the workload and monitor the results once started. On opening, you will be presented with the 'Start new load test' dialog, which allows the Number of Users, User Ramp up, and Host to be specified.
Click on "Start Swarm" to begin the load test. The UI switches to show details of the load test, initially showing a table summarising all requests so far, including count, error rate, and various latency statistics. Switching to the Charts tab shows graphs of the Requests per Second, and Latency of those requests:
The workload can be changed dynamically by selecting "Edit" in the menubar and adjusting the number of users.
See Locust's own Quickstart guide for full details on the Web UI.
Locust-pinecone can also be used non-interactively via the command-line, for scripting specific workloads or as part of a larger pipeline. This is done by calling locust with the `--headless` option and including the mandatory `--host=` option:

```shell
locust --host=https://demo-ngx3w25.svc.apw5-4e34-81fa.pinecone.io --headless
```
Locust will print periodic statistics on the workload as it runs. By default, it will generate load forever; to terminate, press Ctrl-C, at which point it will print metrics on all requests issued:
```
Type       Name                           # reqs  # fails  |  Avg  Min  Max  Med  |  req/s  failures/s
-----------|------------------------------|-------|---------|------|-----|------|-----|--------|-----------
Pine gRPC   Fetch                             36  0(0.00%) |  183  179  231  180  |   0.98        0.00
Pine gRPC   Vector (Query only)               26  0(0.00%) |  197  186  308  190  |   0.70        0.00
Pine gRPC   Vector + Metadata                 41  0(0.00%) |  194  185  284  190  |   1.11        0.00
-----------|------------------------------|-------|---------|------|-----|------|-----|--------|-----------
            Aggregated                       163  0(0.00%) |  194  179  737  190  |   4.42        0.00

Response time percentiles (approximated)
Type       Name                           50%  66%  75%  80%  90%  95%  98%  99%  99.9%  99.99%  100%  # reqs
-----------|------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
Pine gRPC   Fetch                         180  180  180  180  180  210  230  230   230    230    230      36
Pine gRPC   Vector (Query only)           190  190  190  190  200  250  310  310   310    310    310      26
Pine gRPC   Vector + Metadata             190  190  190  200  200  210  280  280   280    280    280      41
-----------|------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
            Aggregated                    190  190  190  190  200  210  250  310   740    740    740     163
```
Here are some example workloads which can be performed by Locust-pinecone.
See Customising the workload for ways to customise how the workload is run.
Populate an index with 100,000 vectors taken from the quora_all-MiniLM-L6-bm25 dataset, then have a single user perform a mix of fetch and different query requests for 60 seconds:
```shell
locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K --run-time=60s
```
Output:
```
[2024-03-04 12:16:54,288] localhost/INFO/locust.main: --run-time limit reached, shutting down
[2024-03-04 12:16:54,371] localhost/INFO/locust.main: Shutting down (exit code 0)
...
Response time percentiles (approximated)
Type           Name                                          50%  66%  75%  80%  90%  95%  98%  99%  99.9%  99.99%  100%  # reqs
---------------|---------------------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
Pinecone gRPC   Fetch                                          27   29   30   30   31   34   36   40    47     47     47     455
Pinecone gRPC   Vector (Query only)                            26   27   27   28   29   33   35   37    43     43     43     453
Pinecone gRPC   Vector + Metadata                              24   24   24   25   25   26   29   32   120    120    120     405
Pinecone gRPC   Vector + Metadata + Namespace (namespace1)     24   24   24   25   25   26   29   30   100    100    100     425
Pinecone gRPC   Vector + Namespace (namespace1)                23   24   24   24   25   27   31   32    35     35     35     443
---------------|---------------------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
                Aggregated                                     24   26   27   27   29   31   34   36    47    120    120    2181
```
This is useful as a basic smoke-test to check the latencies of different operations from a single client to a Pinecone index.
However, there's only a single User (and each user issues requests sequentially), so it gives little information about the throughput which can be achieved by the given Index.
Populate an index with 100,000 vectors taken from the quora_all-MiniLM-L6-bm25 dataset, then have 100 users perform a mix of fetch and different query requests for 60 seconds:
```shell
locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K \
    --run-time=60s --users=100 --spawn-rate=100
```
Output:
```
Type           Name                                          # reqs  # fails  |  Avg  Min  Max  Med  |    req/s  failures/s
---------------|---------------------------------------------|-------|---------|------|-----|------|-----|---------|-----------
Pinecone gRPC   Fetch                                          18295  0(0.00%) |   57   20  163   57  |   304.66        0.00
Pinecone gRPC   Vector (Query only)                            18623  0(0.00%) |   57   21  169   57  |   310.12        0.00
Pinecone gRPC   Vector + Metadata                              18113  0(0.00%) |   56   20  161   57  |   301.63        0.00
Pinecone gRPC   Vector + Metadata + Namespace (namespace1)     18080  0(0.00%) |   56   20  183   57  |   301.08        0.00
Pinecone gRPC   Vector + Namespace (namespace1)                18083  0(0.00%) |   56   20  193   57  |   301.13        0.00
---------------|---------------------------------------------|-------|---------|------|-----|------|-----|---------|-----------
                Aggregated                                     91194  0(0.00%) |   56   20  193   57  |  1518.63        0.00
```
This Index is capable of at least 1518 requests per second, with average latencies of 56ms. It may be capable of more, depending on whether 100 Users provide sufficient concurrency, and whether the current client machine is already saturated.
Indeed, monitoring the client machine while the workload is running shows %CPU reaching 100%. Locust is effectively limited to a single core, so the client may be a bottleneck currently.
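One simple way to confirm client-side saturation is to watch the CPU usage of the locust processes while the test runs. A minimal sketch, assuming a Linux client with the sysstat package installed:

```shell
# Report CPU usage of all processes whose command name matches "locust",
# sampled every 5 seconds
pidstat -u 5 -C locust
```

If the reported %CPU sits near 100% of a core, the client (not the index) is likely the limiting factor.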
We can increase the number of processes across which Locust generates the load, to see if this is the current throughput bottleneck. Spreading the workload across 2 processes:
```shell
locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K \
    --run-time=60s --users=100 --spawn-rate=100 --processes=2
```
Output:
```
...
Type           Name                                          # reqs  # fails  |  Avg  Min  Max  Med  |    req/s  failures/s
---------------|---------------------------------------------|-------|---------|------|-----|------|-----|---------|-----------
Pinecone gRPC   Fetch                                          23661  0(0.00%) |   44   20  174   37  |   394.00        0.00
Pinecone gRPC   Vector (Query only)                            23484  0(0.00%) |   47   22  178   45  |   391.05        0.00
Pinecone gRPC   Vector + Metadata                              23441  0(0.00%) |   46   20  175   41  |   390.34        0.00
Pinecone gRPC   Vector + Metadata + Namespace (namespace1)     23654  0(0.00%) |   46   20  184   41  |   393.88        0.00
Pinecone gRPC   Vector + Namespace (namespace1)                23731  0(0.00%) |   46   21  170   41  |   395.17        0.00
---------------|---------------------------------------------|-------|---------|------|-----|------|-----|---------|-----------
                Aggregated                                    117971  0(0.00%) |   46   20  184   40  |  1964.44        0.00
```
We see that doubling the number of processes from 1 to 2 has increased throughput from 1518 to 1964 requests per second (1.29x), with average latency as good as or better than the previous run. This suggests that 1 process with 100 Users was insufficient to fully utilise the Index - we needed to increase to 2 Locust client processes to achieve higher throughput.
However, we did not see a linear scaling (i.e. 2x throughput). This suggests that we have hit some new bottleneck - perhaps the network between the client and the Index, perhaps the amount of Users (request concurrency), or perhaps the capacity of the current Index configuration.
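To probe the client-side limit further, the load can be spread across even more processes. For example, a hypothetical run with 4 client processes (recent Locust versions also accept `--processes=-1` to fork one worker per logical core):

```shell
# Spread the same 100-User workload across 4 Locust worker processes
locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K \
    --run-time=60s --users=100 --spawn-rate=100 --processes=4
```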
Populate an index with 9.99M vectors, then apply a fixed Query load of 100 QPS spread across 100 concurrent Users.
This is useful to determine what kind of latency can be achieved for a given expected client workload, on a dataset which is more representative of a small production system:
```shell
locust --host=<HOST> --headless --pinecone-dataset=ANN_DEEP1B_d96_angular --run-time=60s \
    --users=100 --spawn-rate=100 --pinecone-throughput-per-user=1 --tags query
```
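The target rate here follows from a simple product: total QPS = users × per-user throughput, so 100 Users at `--pinecone-throughput-per-user=1` gives a fixed 100 QPS. To target a different rate, scale either factor; for example, a hypothetical 500 QPS run:

```shell
# 500 Users x 1 request/second/user = 500 QPS target
locust --host=<HOST> --headless --pinecone-dataset=ANN_DEEP1B_d96_angular --run-time=60s \
    --users=500 --spawn-rate=100 --pinecone-throughput-per-user=1 --tags query
```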
Locust-pinecone provides a wide range of options to customise the generated workload, along with Pinecone-specific options. See the output of `locust --help` and Locust's own Command-line Options documentation for full details; some of the more common options are listed below:
Run non-interactively for a fixed amount of time by specifying `--run-time=TIME`, where TIME is given as a count and unit, e.g. `60s`, `5m`, `1h`, etc. Requires `--headless`:
```shell
$ locust --host=<HOST> --headless --run-time=60s
```
By default, locust-pinecone will generate random query vectors to issue requests against the specified index. It can also use a pre-defined Dataset to provide both the documents to index, and the queries to issue.
To use a pre-defined dataset, specify `--pinecone-dataset=<DATASET>` with the name of the Pinecone Public Dataset to use. Specifying `list` as the name of the dataset will list all available datasets:
```
$ locust --pinecone-dataset=list
Fetching list of available datasets for --pinecone-dataset...
Name                                          Documents    Queries    Dimension
--------------------------------------------  -----------  ---------  -----------
ANN_DEEP1B_d96_angular                            9990000      10000           96
ANN_Fashion-MNIST_d784_euclidean                    60000      10000          784
ANN_GIST_d960_euclidean                           1000000       1000          960
ANN_GloVe_d100_angular                            1183514      10000          100
quora_all-MiniLM-L6-bm25-100K                      100000      15000          384
...
```
Passing one of the available names via `--pinecone-dataset=` will download that dataset (caching it locally in `.dataset_cache/`), upsert the documents into the specified index, and generate queries.
For example, to load the `quora_all-MiniLM-L6-bm25-100K` dataset consisting of 100,000 vectors, then perform requests for 60s using the pre-defined 15,000 query vectors:
```
$ locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K --run-time=60s
[2024-02-28 11:28:59,977] localhost/INFO/locust.main: Starting web interface at http://0.0.0.0:8089
[2024-02-28 11:28:59,981] localhost/INFO/root: Loading Dataset quora_all-MiniLM-L6-bm25-100K into memory for Worker 66062...
Downloading datset: 100%|███████████████████████████████████████████████████████| 200M/200M [00:34<00:00, 5.75MBytes/s]
[2024-02-28 11:29:36,020] localhost/INFO/root: Populating index <HOST> with 100000 vectors from dataset 'quora_all-MiniLM-L6-bm25-100K'
Populating index: 100%|█████████████████████████████████████████████████| 100000/100000 [02:36<00:00, 639.83 vectors/s]
[2024-02-28 11:51:15,757] localhost/INFO/locust.main: Run time limit set to 60 seconds
[2024-02-28 11:51:15,758] localhost/INFO/locust.main: Starting Locust 2.23.1
...
Response time percentiles (approximated)
Type       Name                           50%  66%  75%  80%  90%  95%  98%  99%  99.9%  99.99%  100%  # reqs
-----------|------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
Pine gRPC   Fetch                         240  270  300  310  320  330  360  570   570    570    570      62
Pine gRPC   Vector (Query only)           190  190  190  200  210  260  710  710   710    710    710      44
Pine gRPC   Vector + Metadata             180  180  180  180  200  220  340  340   340    340    340      35
-----------|------------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
            Aggregated                    190  200  220  230  300  310  330  570   770    770    770     273
```
Population can be used in either WebUI or headless mode.
When a dataset is specified, the index will be populated with it if the existing Index vector count differs from the dataset's document count. This behaviour can be overridden using the `--pinecone-populate-index` option, which takes one of three values:

- `always`: Always populate from the dataset.
- `never`: Never populate from the dataset.
- `if-count-mismatch` (default): Populate if the number of items in the index differs from the number of items in the dataset, otherwise skip population.
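For example, to re-run a test against an index which is already loaded, skipping the (potentially slow) population step regardless of vector counts:

```shell
# Never populate: assume the index already contains the dataset's vectors
locust --host=<HOST> --headless --pinecone-dataset=quora_all-MiniLM-L6-bm25-100K \
    --run-time=60s --pinecone-populate-index=never
```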
In addition to latency, the Recall of Query requests can also be measured and reported via the `--pinecone-recall` option. This requires that a Dataset which includes a queries set is used (`--pinecone-dataset=`), so there is an exact-NN result to evaluate the Query results against.
Tip: `--pinecone-dataset=list` can be used to examine which Datasets have a Queries set. For example, the `ANN_LastFM_d64_angular` dataset meets these requirements:
```
$ locust --pinecone-dataset=list
...
Name                                          Documents    Queries    Dimension
--------------------------------------------  -----------  ---------  -----------
ANN_DEEP1B_d96_angular                            9990000      10000           96
ANN_Fashion-MNIST_d784_euclidean                    60000      10000          784
ANN_LastFM_d64_angular                             292385      50000           65
...
```
Locust doesn't currently have a way to add additional metrics to the statistics it reports, so Recall values are reported in place of latencies, expressed as a value between 0 and 100 (Locust doesn't support fractional latency values).
Example of measuring Recall for the "ANN_LastFM_d64_angular" dataset over a 10s runtime:

```
$ locust --host=<HOST> --headless --pinecone-dataset=ANN_LastFM_d64_angular --pinecone-recall --tags query --run-time=10s
...
Type           Name                       # reqs  # fails  |  Avg  Min  Max  Med  |  req/s  failures/s
---------------|--------------------------|-------|---------|------|-----|------|-----|--------|-----------
Pinecone gRPC   Vector (Query only)         1117  0(0.00%) |   96   10  100  100  |  18.59        0.00
---------------|--------------------------|-------|---------|------|-----|------|-----|--------|-----------
                Aggregated                  1117  0(0.00%) |   96   10  100  100  |  18.59        0.00

Response time percentiles (approximated)
Type           Name                        50%  66%  75%  80%  90%  95%  98%  99%  99.9%  99.99%  100%  # reqs
---------------|--------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
Pinecone gRPC   Vector (Query only)       100  100  100  100  100  100  100  100   100    100    100    1117
---------------|--------------------------|-----|----|-----|----|-----|----|-----|-----|------|-------|------|-------
                Aggregated                100  100  100  100  100  100  100  100   100    100    100    1117
```
- While this can run locally on a machine in your home network, you will experience additional latency depending on your location. It is recommended to run it on a VM in the cloud, preferably on the same cloud provider (GCP, AWS) and in the same region as your index, to minimize latency. This will give a more accurate picture of how your infrastructure will perform with Pinecone when you go to production.
- The test cases included in the locust.py file are designed to generate random queries along with random categories. They also exercise several endpoints, such as query, fetch, and delete, and demonstrate metadata filtering. You should consider your use case and adjust these tests to reflect the real-world scenarios you expect to encounter (see the example after this list).
- There is a lot of functionality built into Locust, and we encourage you to review its documentation and make use of all the functionality it offers.
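Because the built-in test cases are tagged, one lightweight way to tailor the workload without editing locust.py is Locust's `--tags` / `--exclude-tags` filtering. The `query` tag appears in the examples above; any other tag names should be checked against locust.py before relying on them:

```shell
# Run only the query-style tasks, skipping the other operations
locust --host=<HOST> --headless --run-time=60s --tags query
```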
PRs are welcome!
The project defines a set of pre-commit hooks, which should be enabled to check that commits pass all checks. After installing the project dependencies (via `poetry install`), run:

```shell
pre-commit install
```