
[WIP] Evaluating Large Language Models with MLflow!

See the technical blog here for more information!


This collection is meant to get you started quickly with evaluating your large language models and retrieval-augmented generation (RAG) chains using mlflow.evaluate()!

DISCLAIMER:

This is for reference only and not meant for production environments. There is no SLA or continued support, as this is not an official Databricks asset. For more information, please contact your Databricks representative.

NOTE: This repo currently works on Azure Databricks and would need slight configuration changes for AWS and GCP.

Table of Contents

  • Get Started
  • Requirements
  • Notebooks
  • Examples
    • Example with Foundation Model APIs
    • Example with Langchain for Retrieval-Augmented Generation
    • Example with OpenAI Models


Get Started

  • Clone the repo using Databricks Repos
  • Create a cluster config (see Requirements)
  • Open the notebooks within Databricks
  • Attach them to your interactive cluster
  • Modify the variables for your workspace
  • Run the notebooks to completion

Requirements

Tested with:
  • Databricks Runtime: 14.3 LTS ML (includes Apache Spark 3.5.0, Scala 2.12)
  • Driver: Standard_E8_v3
  • Workers: Standard_D4ds_v5 [1 - 8]
  • Autoscaling enabled
Python Packages:

%pip install -U langchain langchain_community langchain_openai databricks-vectorsearch

  • Pinned versions, if needed:
langchain==0.2.5 
langchain-community==0.2.5 
mlflow==2.14.1 
langchain_openai==0.1.9 
databricks-vectorsearch==0.38
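
To reproduce the tested environment exactly, you can pin those versions at install time (a minimal sketch mirroring the install line above; mlflow ships with the ML runtime, so pin it only if you need a different version):

%pip install langchain==0.2.5 langchain-community==0.2.5 mlflow==2.14.1 langchain_openai==0.1.9 databricks-vectorsearch==0.38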

Notebooks

FMAPI-Langchain-MLflow-Text-QA
  • Construct a RAG chain using Databricks Foundation Model APIs!
    • DBRX
    • Databricks-BGE-Large
  • Use Databricks Foundation Model APIs for LLM-as-a-judge.
    • DBRX
    • Llama-3-70b-Instruct
  • Evaluate the chain using mlflow.evaluate().

Custom-Model-Langchain-MLflow-Text-QA
  • Pull meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face and log the model using mlflow.
  • Deploy the model to a Databricks Model Serving endpoint for use in our RAG chain.
  • Construct a RAG chain using Langchain and our custom model endpoint.
    • Llama-3-8b-Instruct
    • Databricks-BGE-Large
  • Use Databricks Foundation Model APIs for LLM-as-a-judge.
    • Llama-3-8b-Instruct [Custom]
    • DBRX
    • Llama-3-70b-Instruct
  • Evaluate the chain using mlflow.evaluate().


External-Models-OpenAI-Langchain-MLflow-Text-QA
  • Construct a RAG chain using Langchain and Azure OpenAI models.
    • GPT-3.5 Turbo
    • Text Embedding Ada 002
  • Use Databricks Foundation Model APIs or GPT-3.5 Turbo for LLM-as-a-judge.
  • Evaluate the chain using mlflow.evaluate().

Examples

Foundation Model APIs and RAG

Evaluation of a RAG (Retrieval-Augmented Generation) chain using Databricks Foundation Model APIs and MLflow!
  • We will use Langchain to pull MLflow documentation and chunk it.
  • We will use the Databricks Foundation Model APIs to automatically compute embeddings from the chunks.
  • We will then create an index within Databricks Vector Search to hold the embeddings and act as the retriever for our RAG chain (see the index-creation sketch after this list).
  • DBRX from the Databricks Foundation Model APIs will be the primary model for our RAG chain.
  • We log all of this in MLflow so that the run history and associated artifacts are stored!
  • After creating the RAG chain, we will set up our evaluation metrics, including toxicity and faithfulness.
    • We will use an additional LLM from the Foundation Model APIs to perform LLM-as-a-judge on our outputs.
  • Finally, we will evaluate our RAG chain and display the results!
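
Creating that Vector Search index could look like the following (a minimal sketch with hypothetical endpoint, source table, and index names; adjust them for your workspace):

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Hypothetical names: replace the endpoint, index, and source table with your own.
index = vsc.create_delta_sync_index(
    endpoint_name="vector-search-endpoint",
    index_name="main.default.mlflow_docs_index",
    source_table_name="main.default.mlflow_docs_chunks",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)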
Using the Foundation Model APIs is as easy as the following:

from langchain_community.llms import Databricks

# Foundation Model API chat endpoints expect a "messages" payload, so map
# the LLM's plain "prompt" input into a single user message.
def transform_input(**request):
    request["messages"] = [
        {
            "role": "user",
            "content": request["prompt"],
        }
    ]
    del request["prompt"]
    return request

# Works with databricks-meta-llama-3-70b-instruct or databricks-dbrx-instruct.
llm = Databricks(
    endpoint_name="databricks-dbrx-instruct",
    transform_input_fn=transform_input,
    extra_params={"temperature": 0.1, "max_tokens": 512},
)
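
Setting up the judged metrics and running the evaluation could then look roughly like this (a sketch; eval_df is an assumed pandas DataFrame holding the questions, chain outputs, and retrieved context):

import mlflow
from mlflow.metrics.genai import faithfulness

# Use a Foundation Model API endpoint as the LLM judge for faithfulness.
faithfulness_metric = faithfulness(model="endpoints:/databricks-dbrx-instruct")

results = mlflow.evaluate(
    data=eval_df,             # assumed columns: inputs, predictions, context
    predictions="predictions",
    model_type="text",        # built-in text metrics include toxicity
    extra_metrics=[faithfulness_metric],
    evaluator_config={"col_mapping": {"context": "context"}},
)
print(results.metrics)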

[Screenshot: evaluation result table]


Custom Model Serving and RAG

Evaluation of a RAG (Retrieval-Augmented Generation) chain using Databricks Model Serving (Llama-3-8b) and MLflow!

  • We will use Langchain to pull MLflow documentation and chunk it.
  • We will pull meta-llama/Meta-Llama-3-8B-Instruct from Hugging Face and deploy it to Databricks Model Serving (see the logging sketch after this list).
  • We will use a Databricks Model Serving endpoint to automatically compute embeddings from the chunks.
  • We will then create an index within Databricks Vector Search to hold the embeddings and act as the retriever for our RAG chain.
  • We log all of this in MLflow so that the run history and associated artifacts are stored!
  • After creating the RAG chain, we will set up our evaluation metrics, including toxicity and faithfulness.
    • We will use an additional LLM from the Foundation Model APIs to perform LLM-as-a-judge on our outputs.
  • Finally, we will evaluate our RAG chain and display the results!
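
Logging the Hugging Face model so it can be served could look like the following (a minimal sketch; the registered model name is a hypothetical Unity Catalog path):

import mlflow
from transformers import pipeline

# Load the chat model as a transformers text-generation pipeline.
pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=pipe,
        artifact_path="model",
        task="llm/v1/chat",  # expose an OpenAI-compatible chat signature
        registered_model_name="main.default.llama3_8b_instruct",  # hypothetical UC path
    )

The registered model can then be deployed to a Model Serving endpoint from the Serving UI or the Databricks SDK.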

OpenAI Models

Evaluation of a RAG (Retrieval-Augmented Generation) chain using Azure OpenAI [Databricks External Models] and MLflow!
  • We will register the Azure OpenAI gpt-35-turbo and text-embedding-ada-002 models as external models.
  • We will use Langchain to pull MLflow documentation and chunk it.
  • We will use Azure OpenAI text-embedding-ada-002 to automatically compute embeddings from the chunks.
  • We will then create an index within Databricks Vector Search to hold the embeddings and act as the retriever for our RAG chain.
  • GPT-3.5 Turbo from Azure OpenAI will be the primary model for our RAG chain.
  • We log all of this in MLflow so that the run history and associated artifacts are stored!
  • After creating the RAG chain, we will set up our evaluation metrics, including toxicity and faithfulness.
    • We will use an additional LLM from the Foundation Model APIs to perform LLM-as-a-judge on our outputs.
  • Finally, we will evaluate our RAG chain and display the results!
Registering an external model is as easy as the following:

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Pull the Azure OpenAI API key from a Databricks secret scope
# (placeholder scope/key names; use your own).
API_KEY = dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>")

client.create_endpoint(
    name=dbutils.widgets.get("LLM_ENDPOINT_NAME"),
    config={
        "served_entities": [
            {
                "external_model": {
                    "name": dbutils.widgets.get("LLM_MODEL_NAME"),
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        "openai_api_type": dbutils.widgets.get("OPENAI_API_TYPE"),
                        "openai_api_key": API_KEY,
                        "openai_api_base": dbutils.widgets.get("API_BASE"),
                        "openai_deployment_name": dbutils.widgets.get("DEPLOYMENT_NAME"),
                        "openai_api_version": dbutils.widgets.get("OPENAI_API_VERSION"),
                    },
                }
            }
        ]
    },
)
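
Once registered, the endpoint can be sanity-checked with the same deployments client (a quick sketch reusing the client and widget values from above):

# Quick smoke test of the newly registered external model endpoint.
response = client.predict(
    endpoint=dbutils.widgets.get("LLM_ENDPOINT_NAME"),
    inputs={"messages": [{"role": "user", "content": "What is MLflow?"}]},
)
print(response)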
You can see the evaluation results after running mlflow.evaluate():

[Screenshot: evaluation result table in the notebook cell UI]
