This Metaflow implementation evaluates various HuggingFace embedding models using a given corpus and queries.
Refer to corpus.txt and queries.txt to view sample data
The following models will be evaluated:
- sentence-transformers/all-MiniLM-L6-v2
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/multi-qa-mpnet-base-cos-v1
- intfloat/e5-large-v2
- clips/mfaq
For the intfloat/e5-large-v2
model, the input text should start with either "query: " or "passage: ". For tasks other than retrieval, the "query: " prefix can be used. The data is formatted accordingly before encoding.
For the clips/mfaq
model, the questions need to be prepended with "", and the answers with "". The data is formatted accordingly before encoding.
For all other models, the data is formatted as-is before encoding.
The program flow can be described as follows:
-
Start: The flow begins by reading the corpus and queries from text files. It then moves to the
encode
step for each model in thetxt_embed_models
list. -
Encode: In this step, the flow encodes the corpus and queries using the specified model. It checks if the model is 'intfloat/e5-large-v2' or 'clips/mfaq' and applies the appropriate encoding method. Otherwise, it uses the default encoding method. After encoding, the flow moves to the
join
step. -
Join: In this step, the flow gathers the corpus embeddings and query embeddings from all the
encode
steps. It then moves to thecalculate_similarity
step. -
Calculate Similarity: In this step, the flow calculates cosine similarity scores between the query embeddings and corpus embeddings for each model. It then moves to the
end
step. -
End: This is the final step of the flow. It prints the top 3 hits for each query for each model.
Here is a mermaid diagram that depicts the program flow:
graph LR
A(Start) -- all-MiniLM-L6-v2 --> B(encode)
A(Start) -- all-mpnet-base-v2 --> B(encode)
A(Start) -- multi-qa-mpnet-base-cos-v1 --> B(encode)
A(Start) -- e5-large-v2 --> B(encode)
A(Start) -- mfaq --> B(encode)
B(encode) -- all-MiniLM-L6-v2 --> C(Join)
B(encode) -- all-mpnet-base-v2 --> C(Join)
B(encode) -- multi-qa-mpnet-base-cos-v1 --> C(Join)
B(encode) -- e5-large-v2 --> C(Join)
B(encode) -- mfaq --> C(Join)
C(Join) --> D(Cosine Similarity)
D(Cosine Similarity) --> E(End)
Please note that you need to have a Mermaid extension installed in your Markdown viewer to view the diagram. If you're viewing this on GitHub, you can use a browser extension like Mermaid Markdown Viewer for Google Chrome.
To run this script, you will need to install the following Python packages:
- metaflow
- transformers
- torch
- sklearn
- numpy
You can install these packages using pip:
pip install metaflow transformers torch sklearn numpy
For an interactive application using intfloat/e5-base-v2
for encoding and displaying top k hits on sagemaker faqs is available at streamlit_sagemaker_faqs_e5.py.
To run this app:
streamlit run streamlit_sagemaker_faqs_e5.py
This script provides a way to evaluate various Hugging Face text embedding models on a given corpus and set of queries. It handles variations in the encoding methods required by different models and calculates cosine similarity scores between the query and corpus embeddings. The results are printed as a table for each model, showing the top 3 hits for each query. The script is implemented using Metaflow, a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects.
To run the script, execute the following command:
python3 evaluate_text_embedding_models.py --no-pylint run; \
python3 evaluate_text_embedding_models.py card view end
The results will be displayed in the console output.
Note: Update the corpus_file
and queries_file
parameters in the script to point to the actual file paths of the corpus and queries.
For more information on Metaflow, refer to the Metaflow documentation.