PR to apply for E2E OLS evaluation framework for AAP chatbot #47

Merged 5 commits on Jan 31, 2025
6 changes: 6 additions & 0 deletions scripts/evaluation/README.md
@@ -11,6 +11,7 @@ Currently we have 2 types of evaluations.
- QnAs were generated from OCP docs by LLMs, so some questions/answers may not be entirely correct. We continually verify both questions and answers manually; if you find a QnA pair that should be modified or removed, please create a PR.
- The OLS API should be up and running with all the required provider+model combinations configured.
- We may want to run both consistency and model evaluation together. To avoid multiple API calls for the same query, *model* evaluation first checks the .csv file generated by *consistency* evaluation and calls the API only when a response is not already present there (a sketch of this reuse follows after this list).
- Install the Python packages `matplotlib` and `rouge_score` before running the evaluation.

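A minimal sketch of the response-reuse behaviour described above. The csv columns (`query_id`, `response`), file name, and OLS endpoint are assumptions for illustration, not the framework's actual schema or code.

```python
import os

import pandas as pd
import requests

OLS_URL = "http://localhost:8080/v1/query"  # assumed local OLS endpoint


def call_ols_api(query: str) -> str:
    """Hypothetical helper that asks a live OLS instance for a response."""
    resp = requests.post(OLS_URL, json={"query": query}, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]


def get_response(query_id: str, query: str,
                 consistency_csv: str = "consistency_results.csv") -> str:
    """Reuse a response recorded by the consistency evaluation, if present."""
    if os.path.exists(consistency_csv):
        cached = pd.read_csv(consistency_csv)
        hit = cached.loc[cached["query_id"] == query_id, "response"]
        if not hit.empty:
            return hit.iloc[0]  # already answered during consistency evaluation
    return call_ols_api(query)  # only fall back to the live API
```
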
### e2e test case

@@ -21,6 +22,11 @@ These evaluations are also part of **e2e test cases**. Currently *consistency* e
python -m scripts.evaluation.driver
```

### Sample run command
```
OPENAI_API_KEY=IGNORED python -m scripts.evaluation.driver --qna_pool_file ./scripts/evaluation/eval_data/aap-sample.parquet --eval_provider_model_id my_rhoai+granite3-8b --eval_metrics answer_relevancy answer_similarity_llm cos_score rougeL_precision --eval_modes vanilla --judge_model granite3-8b --judge_provider my_rhoai3 --eval_query_ids qna1
```
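
The `--eval_provider_model_id`, `--judge_provider`, and `--judge_model` values need to match providers/models configured in `olsconfig.yaml`, and the provider+model ID is resolved through `PROVIDER_MODEL_MAP` (see the `constants.py` change below). A minimal sketch of that lookup; the error handling is assumed, not taken from the driver:

```python
# Resolve a "provider+model" ID into its (provider, model) pair, mirroring
# the PROVIDER_MODEL_MAP entries added in scripts/evaluation/utils/constants.py.
PROVIDER_MODEL_MAP = {
    "ollama+mistral": ("ollama", "mistral"),
    "my_rhoai+granite3-8b": ("my_rhoai", "granite3-8b"),
    "my_rhoai3+granite3-1-8b": ("my_rhoai3", "granite3-1-8b"),
}


def resolve_provider_model(provider_model_id: str) -> tuple[str, str]:
    if provider_model_id not in PROVIDER_MODEL_MAP:
        raise ValueError(f"Unknown provider+model id: {provider_model_id}")
    return PROVIDER_MODEL_MAP[provider_model_id]


print(resolve_provider_model("my_rhoai+granite3-8b"))  # ('my_rhoai', 'granite3-8b')
```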

### Input Data/QnA pool
[Json file](eval_data/question_answer_pair.json)

Binary file added scripts/evaluation/eval_data/aap-sample.parquet
Binary file added scripts/evaluation/eval_data/aap.parquet
56 changes: 56 additions & 0 deletions scripts/evaluation/olsconfig.yaml
@@ -0,0 +1,56 @@
# olsconfig.yaml sample for local ollama server
#
# 1. Install a local ollama server from https://ollama.com/
# 2. Install the llama3.1:latest model with:
#      ollama pull llama3.1:latest
# 3. Copy this file to the project root of the cloned lightspeed-service repo
# 4. Install dependencies with:
#      make install-deps
# 5. Start lightspeed-service with:
#      OPENAI_API_KEY=IGNORED make run
# 6. Open https://localhost:8080/ui in your web browser
#
llm_providers:
  - name: ollama
    type: openai
    url: "http://localhost:11434/v1/"
    models:
      - name: "mistral"
      - name: 'llama3.2:latest'
  - name: my_rhoai
    type: openai
    url: "https://granite3-8b-wisdom-model-staging.apps.stage2-west.v2dz.p1.openshiftapps.com/v1"
    credentials_path: ols_api_key.txt
    models:
      - name: granite3-8b
ols_config:
  # max_workers: 1
  reference_content:
    # product_docs_index_path: "./vector_db/vector_db/aap_product_docs/2.5"
    # product_docs_index_id: aap-product-docs-2_5
    # embeddings_model_path: "./vector_db/embeddings_model"
  conversation_cache:
    type: memory
    memory:
      max_entries: 1000
  logging_config:
    app_log_level: info
    lib_log_level: warning
    uvicorn_log_level: info
  default_provider: ollama
  default_model: 'llama3.2:latest'
  query_validation_method: llm
  user_data_collection:
    feedback_disabled: false
    feedback_storage: "/tmp/data/feedback"
    transcripts_disabled: false
    transcripts_storage: "/tmp/data/transcripts"
dev_config:
  # config options specific to dev environment - launching OLS in local
  enable_dev_ui: true
  disable_auth: true
  disable_tls: true
  pyroscope_url: "https://pyroscope.pyroscope.svc.cluster.local:4040"
  # llm_params:
  #   temperature_override: 0
  # k8s_auth_token: optional_token_when_no_available_kube_config
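
For a quick sanity check of the config above, something like the following works (a minimal sketch; it assumes PyYAML is installed and that the file is saved as `olsconfig.yaml` in the current directory):

```python
import yaml  # pip install pyyaml

with open("olsconfig.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# List the configured providers/models and the defaults OLS will use.
for provider in cfg["llm_providers"]:
    models = [m["name"] for m in provider.get("models", [])]
    print(f"provider={provider['name']} models={models}")
print("default:", cfg["ols_config"]["default_provider"], cfg["ols_config"]["default_model"])
```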
2 changes: 2 additions & 0 deletions scripts/evaluation/utils/constants.py
@@ -11,6 +11,8 @@
"azure_openai+gpt-4o": ("azure_openai", "gpt-4o"),
"ollama+llama3.1:latest": ("ollama", "llama3.1:latest"),
"ollama+mistral": ("ollama", "mistral"),
"my_rhoai+granite3-8b": ("my_rhoai", "granite3-8b"),
"my_rhoai3+granite3-1-8b": ("my_rhoai3", "granite3-1-8b"),
}

NON_LLM_EVALS = {
2 changes: 1 addition & 1 deletion scripts/evaluation/utils/relevancy_score.py
@@ -42,7 +42,7 @@ def get_score(
            # raise
            sleep(time_to_breath)

        if out:
        if out and isinstance(out, dict):
            valid_flag = out["Valid"]
            gen_questions = out["Question"]
            score = 0
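
The `isinstance` guard above presumably protects against the judge LLM returning output that is truthy but not a mapping. A rough illustration of the failure it prevents (the sample values are placeholders, not the module's actual data):

```python
# Without the isinstance check, a truthy non-dict value (e.g. a bare string
# left over from a failed JSON parse) would raise a TypeError on out["Valid"].
for out in ({"Valid": True, "Question": ["q1", "q2"]}, "not-a-dict", None):
    if out and isinstance(out, dict):
        print("valid:", out["Valid"], "questions:", out["Question"])
    else:
        print("skipping malformed judge output:", out)
```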