Optimized prompt for multi-class classification contains only a subset of classifiers #1509

Open

aaronbriel opened this issue Sep 18, 2024 · 5 comments

I followed the tutorials for optimizing a DSPy program for multi-class classification, and the "optimized" prompt covered only a small subset of the available classifiers, making it unsuitable for consideration in a production environment.

I'll provide the relevant chunks of notebook code, but I can't share the prompt itself as it contains production data. Hopefully this is sufficient to identify the issue.

ISSUE 1:
The main issue is that the final "optimized" prompt contains a single few-shot sample for only 8 of the 41 classifiers (one of those has 2 samples). I expected it to contain multiple few-shot samples for each of the 41 classifiers.

ISSUE 2:
The secondary issue is that the evaluation metric showed a rather low score of 64.34. I expected this to be much higher, since I trained with a reasonably sized, manually curated ground-truth dataset of 50 samples per classifier.

I'm guessing this is related to my optimizer configuration but I'm not sure what to adjust. Please advise. Thank you!

# Source the .env file
import os
import sys
from dotenv import load_dotenv
load_dotenv()

# Add the project directory to PYTHONPATH
sys.path.append('/Users/abriel/repos/projectname/')
sys.path.append(os.getenv('PYTHONPATH'))

import dspy
from dspy.datasets import DataLoader
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Load the intent keys from an external source
from src.variables import intent_keys

# Set up the model using OpenAI's GPT
gpt4o = dspy.OpenAI(model=os.environ['DEFAULT_MODEL'])
dspy.configure(lm=gpt4o)

# Define the Intent Classifier Signature
class IntentClassifier(dspy.Signature):
    """
    Classifies a person's response into one of the given intents based on the conversation
    between a two people, person1 and person2.
    """
    conversation = dspy.InputField(
        desc="A conversation between person1 and person2.",
        prefix="Conversation: "
    )
    script_question = dspy.InputField(
        desc="Person1 question.",
        prefix="Question: "
    )
    response = dspy.InputField(
        desc="Person2's response to the question from person1.",
        prefix="Response: "
    )
    intent = dspy.OutputField(desc="One of the following intents: " + ", ".join(intent_keys))

# Create the IntentClassifierModule that incorporates ChainOfThought
class IntentClassifierModule(dspy.Module):
    """
    A module that defines the intent classification process.
    """
    def __init__(self):
        super().__init__()
        self.signature = IntentClassifier
        self.predictor = dspy.ChainOfThought(signature=self.signature)

    def forward(self, conversation, question, response):
        """
        Runs the forward pass for classifying intents.
        """
        # Map the dataset's `question` field onto the signature's
        # `script_question` input.
        result = self.predictor(
            conversation=conversation,
            script_question=question,
            response=response
        )
        return dspy.Prediction(
            intent=result.intent
        )
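
As a quick sanity check, the module can be run on a single example before optimizing (the values below are placeholders, not production data):

# Sanity check with placeholder inputs
classifier = IntentClassifierModule()
pred = classifier(
    conversation="person1: How did the demo go? person2: It went well.",
    question="How did the demo go?",
    response="It went well.",
)
print(pred.intent)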

# Load and split datasets
dl = DataLoader()

full_dataset = dl.from_csv(
    "dataset_name.csv",
    fields=("conversation", "question", "response", "intent"),
    input_keys=("conversation", "question", "response")
)
splits = dl.train_test_split(full_dataset, train_size=0.8)
train_dataset = splits['train']
test_dataset = splits['test']

# Validation function to compare predicted and actual intents
def validate_answer(example, pred, trace=None):
    """
    Validates the prediction by comparing it to the actual intent.
    """
    return example.intent.lower() == pred.intent.lower()

# Configure the optimizer
config_ = {
    "max_bootstrapped_demos": 8,   # max demos bootstrapped from the teacher
    "max_labeled_demos": 8,        # max demos taken directly from the trainset
    "num_candidate_programs": 10,  # candidate programs tried by random search
    "num_threads": 4
}

# Use BootstrapFewShotWithRandomSearch to optimize the prompt
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    **config_
)

# Compile and save the optimized program
optimized_program = teleprompter.compile(IntentClassifierModule(), trainset=train_dataset)
optimized_program.save('/Users/abriel/repos/projectname/optimized_intent_classifier.json')
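
To reuse the saved program later, it can be loaded back into a fresh module instance (a sketch using Module.load):

# Load the saved program back into an uncompiled module instance
loaded_program = IntentClassifierModule()
loaded_program.load('/Users/abriel/repos/projectname/optimized_intent_classifier.json')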

This resulted in successful "training", running across 8 sets. I then ran an evaluation:

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=test_dataset, num_threads=1, display_progress=True, display_table=5)
evaluator(optimized_program, metric=validate_answer)

I then checked the optimized prompt by doing:

gpt4o.inspect_history(n=1)

ISSUE 1:
The resulting optimized_intent_classifier.json had single few-shot samples for only 8 intents, with one of the intents having 2 samples. There are 41 intents, so I expected multiple few-shot samples for each of the 41 intents.

ISSUE 2:
This showed a final score of 64.34, which was far lower than expected, given that I provided a ground-truth dataset of 50 samples per intent.

arnavsinghvi11 (Collaborator) commented

Hi @aaronbriel ,

The optimized_program currently includes few-shot examples from only 8 of the classifiers because the BootstrapFewShotWithRandomSearch configuration is set to select:
"max_bootstrapped_demos": 8, "max_labeled_demos": 8

To get unique few-shot examples for all 41 classifiers, you can increase these parameters to 41.

However, note that the selection of few-shot examples in BootstrapFewShot doesn't guarantee uniqueness across all 41 demos (the optimizer just selects a set of 41 few-shots that pass the metric).

Some potential solutions for this could be:

  1. Adjusting the metric to include a global check for each unique classifier, modifying the validate_answer function to ensure that only examples unique to each classifier are selected and not repeated (e.g. return example.intent.lower() == pred.intent.lower() and global_class_check(example); see the sketch after this list).

  2. Filtering the train_dataset by the 41 classifier types, and then running the optimizer on each of the 41 train sets (bootstrapping 41x!):

# Chain the compiles: each round starts from the previous round's program
bootstrap_program_0 = teleprompter.compile(IntentClassifierModule(), trainset=train_dataset_0)
bootstrap_program_1 = teleprompter.compile(bootstrap_program_0, trainset=train_dataset_1)

The second solution is likely more expensive, but it may ensure more diversity by providing multiple sets of few-shot examples for the unique classifiers, which can potentially raise performance.
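
A minimal sketch of the global check from option 1 (the seen_intents set and global_class_check helper are illustrative, not part of DSPy):

# Illustrative only: pass each intent through the metric at most once so the
# bootstrapped demos cover distinct intents.
seen_intents = set()

def global_class_check(example):
    """Returns True the first time an intent is seen, False afterwards."""
    intent = example.intent.lower()
    if intent in seen_intents:
        return False
    seen_intents.add(intent)
    return True

def validate_answer(example, pred, trace=None):
    # Note: with num_threads > 1 this shared set is updated from multiple
    # threads, so selections may vary between runs.
    return example.intent.lower() == pred.intent.lower() and global_class_check(example)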

Let me know if this helps!

aaronbriel (Author) commented

@arnavsinghvi11 thanks for the quick response! I will try this and let you know the results.

aaronbriel (Author) commented

@arnavsinghvi11 I keep running into the error below. I thought I had resolved it by adding format=str to each of the signature's InputFields; that progressed a bit further but failed again several intent iterations later. Nothing jumps out in the data for that specific intent, since the text across all intents contains special characters.

Do you know of any other tricks people have used to resolve this?
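
For reference, the format=str change looked roughly like this (one field shown; the same keyword was added to each InputField):

conversation = dspy.InputField(
    desc="A conversation between person1 and person2.",
    prefix="Conversation: ",
    format=str,  # force the field value to be rendered as a plain string
)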

Traceback (most recent call last):
  File "/home/ubuntu/repos/project/experiments/dspy/build_intent_classifier_prompt.py", line 260, in <module>
    optimize_intent_classifier()
  File "/home/ubuntu/repos/project/experiments/dspy/build_intent_classifier_prompt.py", line 237, in optimize_intent_classifier
    bootstrap_program = teleprompter.compile(bootstrap_program, trainset=training_data_intent)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/random_search.py", line 95, in compile
    program2 = program.compile(student, teacher=teacher, trainset=trainset2)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/bootstrap.py", line 82, in compile
    self._prepare_student_and_teacher(student, teacher)
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/bootstrap.py", line 99, in _prepare_student_and_teacher
    assert getattr(self.student, "_compiled", False) is False, "Student must be uncompiled."
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Student must be uncompiled.
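
For context, the assertion fires because a previously compiled program is passed back in as the student. One pattern that may sidestep it, an untested sketch assuming the prior round's program can be supplied as the teacher instead, is to give each round a fresh, uncompiled student:

# Untested sketch: keep the student uncompiled on every round and pass the
# previous round's program as the teacher instead.
bootstrap_program = None
for trainset_i in per_intent_trainsets:  # hypothetical per-intent splits
    bootstrap_program = teleprompter.compile(
        IntentClassifierModule(),      # fresh, uncompiled student
        teacher=bootstrap_program,     # carry over the prior round's program
        trainset=trainset_i,
    )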

aaronbriel (Author) commented

Using the recommended solution in (1) above, the resulting prompt was still missing 20 intents, so that is not a feasible solution for a production release. The "Student must be uncompiled" error may simply not have surfaced this time because the data for one of the missing intents was never reached.

I'm going to hold off on leveraging this tool until I or someone else finds a solution to this issue.

chiragshah285 commented

@aaronbriel this may be helpful https://github.com/KarelDO/xmc.dspy
