```
python3 -m venv .
. bin/activate
pip install sqlite-vec sqlean.py flask click jsonlines
```

Or, if you want to use the `requirements.txt` file, run

```
python3 -m pip install -r requirements.txt
```

instead of the `pip install ...` line.
```
flask --app vec_search init-db
```

Note that the above command populates the sqlite db with the vectors from the given file.

```
flask --app vec_search run --debug
```
The vectors are read from a file at application startup time. This allows the vectors to be embedded elsewhere, i.e. by any application and any vector embedding method.
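Since the jsonl file is the only contract, any tool that writes one JSON object per line can produce the index. A minimal sketch of reading such a file (the field names here are hypothetical; match them to whatever your jsonl file actually contains):

```python
import json

# Hypothetical jsonl records: one JSON object per line, each carrying an id,
# the code text, and a pre-computed embedding (field names are an assumption).
lines = [
    '{"id": 1, "code": "def push(xs, x): xs.append(x)", "embedding": [0.1, 0.2, 0.3]}',
    '{"id": 2, "code": "def pop(xs): return xs.pop()", "embedding": [0.0, 0.4, 0.1]}',
]
records = [json.loads(line) for line in lines]
```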
You may have noticed the search form at the top of the listing page. If you enter a natural language description, you'll see something like

```
"GET /?q=a+function+to+add+an+element+to+an+array HTTP/1.1" 200
```

in the stdout of the terminal where the flask app is running. Until now we did not have embeddings for the natural language text; computing the query embeddings requires a few more ML dependencies:

```
pip install numpy==1.24 torch transformers
```
If you execute searches locally, you probably want to use your own index rather than the sampled jsonl file, which contains only a few records. We provide the sampled jsonl file to keep the size of the repo small whilst maintaining a usable prototype.
A search engine results page (SERP) shows the results of a natural language query in a format that is useful for a human. For semantic search we use the cosine distance between the query embedding and the entity embeddings; in the example jsonl file provided, these entities are programming-language code and associated documentation.
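The ranking by cosine distance can be sketched in plain numpy (independent of the actual sqlite-vec implementation):

```python
import numpy as np

def cosine_distances(query, entities):
    """Cosine distance between a query vector and each row of an entity matrix."""
    q = query / np.linalg.norm(query)
    e = entities / np.linalg.norm(entities, axis=1, keepdims=True)
    return 1.0 - e @ q

# toy 2-d embeddings; a real index would hold model-produced vectors
entities = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.0])
order = np.argsort(cosine_distances(query, entities))  # closest first
```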
The embedding model provided for the query and the code is the CodeBERT model from Microsoft. This keeps the model size small while still working end to end. The code model is configurable via the `config.py` module's `AI_MODEL` entry. However, if you use a different model, you may need to re-embed the code entities to maintain proper alignment of the vector spaces of the query embeddings and entity embeddings.
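For reference, swapping the model might look like this in `config.py` (the key name `AI_MODEL` comes from the text above; the value shown is an assumption):

```python
# config.py (sketch)
# Hugging Face model id used to embed both queries and code entities.
# If you change this, re-embed the code entities so both sides share a vector space.
AI_MODEL = "microsoft/codebert-base"
```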
Typical inspection workflow:

1. Listing page: all entries from your jsonl file appear in arbitrary order, because no search has been performed. Use the search bar to enter a natural language query.
2. SERP: the entities matching your search are ordered by semantic distance, closest first and ascending from there.
3. Detail page: after clicking the inspect button, the dropdown menus provide a configurable view of the attention weights from the semantic code search. Hovering the mouse over code tokens or words enables inspection of particular terms in the query or the code.

After sensemaking, the human may iteratively modify their natural language query and repeat the 3-part workflow above.
The intent of this workflow is to enable a user to generate a benchmark, AKA golden dataset, from their code-query corpus.

Purpose: The annotation workflow, and the benchmark data it generates, are useful for comparing a base model (say, from Hugging Face) against a model fine-tuned for the user's needs, or for comparing two distinct models, neither of which is fine-tuned.
Steps:
- Create a user (if necessary) and log in
- Execute a query
- The results of the query for the given corpus are shown in the SERP
- Select a relevance annotation for each result, clicking Done once you are sure of your relevance determination
- Repeat steps 1-4 for each query
Notes:
The queries and query relevances are stored in the `queries` and `query_relevances` tables in the sqlite db. Once the human has completed the annotations, these may be exported from the sqlite db for further processing. The sqlite db file is located in the `var/vec_search-instance/` directory, or more generally in the path specified in the application's `config.py` module.
To collect the data for the annotations:

```
flask --app vec_search export-rad-to-csv rad.csv
```

Note that this exports to a file named `rad.csv` in the current working directory. If you want a different filename, provide the alternate filename. If the file already exists in the working directory, it will be overwritten.
The click command is similar to this workflow:

```
# opens a REPL environment for sqlite3; if you modify the config.py then change the path
sqlite3 var/vec_search-instance/vec_search.sqlite

# display the field names in query results and use a comma separator
.headers on
.mode csv

# output results to a csv file
.output relevance_annotation_details.csv

# annotation results
# we concatenate duplicates in a comma sep list (post_id, query_id, user_id)
SELECT
  qr.query_id,
  qr.post_id,
  q.user_id,
  GROUP_CONCAT(qr.relevance) AS relevances,
  qr.rank,
  qr.distance,
  q.query
FROM query_relevances AS qr
INNER JOIN (
  SELECT query_id, query, user_id FROM queries
) AS q ON qr.query_id = q.query_id
GROUP BY
  q.query_id,
  user_id,
  post_id
;

# exit sqlite REPL env
.quit
```
The file `relevance_annotation_details.csv` should contain the results of the above query. This file is placed in the directory where you invoked the sqlite3 command.

For debugging purposes it is sometimes helpful to see the schema of all tables from the REPL environment:

```
SELECT * FROM sqlite_master WHERE type='table';
```
A backend/batch workflow, where relevances are assessed outside of a human workflow, is currently supported. Assuming in the previous step you wrote the human-generated annotations to the file `rad.csv`, and there is no file named `llm_gen_rel.csv` in the current working directory, then run:

```
flask --app vec_search gen-llm-rels rad.csv llm_gen_rel.csv
```

and you will find the LLM-generated relevances in the csv file, along with the data from `rad.csv` and other llm client metadata, e.g. token usage.
In general, the arguments for the `gen-llm-rels` command look like:

```
flask --app vec_search gen-llm-rels <input-csv> <output-csv> <llm-model-name> <dup-strategy>
```

The defaults for the last 2 are `openai` and `takelast`.
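The `<dup-strategy>` handles repeated annotations for the same (query_id, post_id, user_id) key, which the export step concatenates into a comma-separated list. A plausible sketch of a `takelast`-style strategy, keeping only the most recent annotation per key (the exact behavior in the code may differ):

```python
def take_last(rows):
    """Keep only the last-seen row for each (query_id, post_id, user_id) key."""
    latest = {}
    for row in rows:  # rows assumed oldest-first, so later rows overwrite earlier ones
        latest[(row["query_id"], row["post_id"], row["user_id"])] = row
    return list(latest.values())

rows = [
    {"query_id": 1, "post_id": 7, "user_id": 3, "relevance": 0},
    {"query_id": 1, "post_id": 7, "user_id": 3, "relevance": 2},
]
deduped = take_last(rows)
```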
There are 2 prompts in the `llm_rel_gen.py` module; the default is to use the umbrella prompt.
TODO: add support for the other prompt via click.
For details cf. this open issue
WIP:
- IR retrieval metrics for the data, once placed into pandas df(s).
For details cf. this open issue
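As a preview of the planned metrics, precision@k over one query's ranked relevance labels can be computed as follows (a sketch; the threshold semantics are an assumption):

```python
def precision_at_k(relevances, k, threshold=1):
    """Fraction of the top-k ranked results whose label meets the relevance threshold."""
    top = relevances[:k]
    return sum(1 for r in top if r >= threshold) / k

# labels for one query, closest result first (0 = irrelevant, 2 = highly relevant)
p = precision_at_k([2, 0, 1, 0], k=2)
```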
TODO... design and write up the workflow
TODO... design and write up the workflow for data analysis