Benchmarking GLUE tasks for in-context learning #707

Open
ashim95 opened this issue Oct 31, 2023 · 2 comments
Labels: question (Further information is requested)

Comments


ashim95 commented Oct 31, 2023

❓ Question

I am trying to benchmark llama-2-7b on the GLUE benchmark for in-context learning, but the accuracy I get on MNLI (mismatched validation) is 35.22% for both zero-shot and 8-shot. My questions are:

  • During your benchmarking, did you run the models on any classification tasks? Any experimental results you could share would be great.
  • Currently, I am using the format prescribed by InContextLearningMultipleChoiceTaskDataset. Is there another recommended way to implement this?

PS: I also ran the evaluation for the qqp task: 36.82% for 0-shot and 63.09% for 8-shot.

Any help would be greatly appreciated.

Thank you,


ashim95 commented Nov 1, 2023

Update: Here is the yaml file we used:

max_seq_len: 4096
seed: 28
model_name_or_path: ~/huggingface_cache/Llama-2-7b-hf 

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

models:
-
  model_name: ${model_name_or_path}
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: ${model_name_or_path}
    init_device: mixed
    pretrained: true
    token: <HF Token>
  tokenizer:
    name: ${model_name_or_path}
    kwargs:
      model_max_length: ${max_seq_len}

load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: mnli_mismatched
  dataset_uri: scripts/eval/local_data/mnli_mismatched.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [8]
  icl_task_type: multiple_choice
  metric_names:
  - InContextLearningMultipleChoiceAccuracy
  prompt_string: '' # this goes at the beginning of each input
  example_delimiter: "\n" # this goes between fewshot examples
  continuation_delimiter: '' # this separates questions from answers
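
For context, my understanding is that the multiple_choice ICL task scores each choice by the language model's likelihood of the choice tokens appended to the context and picks the best-scoring one. Below is a rough, hypothetical sketch of that kind of scoring with plain transformers, just to illustrate the idea; it is not llm-foundry's actual implementation, and the length normalization and model path are assumptions.

# Hypothetical sketch of multiple-choice ICL scoring; NOT llm-foundry's code.
# Picks the choice whose continuation tokens get the highest average log-prob.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = os.path.expanduser("~/huggingface_cache/Llama-2-7b-hf")  # path from the YAML above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Average log-prob of the choice tokens given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # predict token t from tokens < t
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_choice = full_ids.shape[1] - ctx_ids.shape[1]          # tokens belonging to the choice
    return token_lp[0, -n_choice:].mean().item()             # length-normalized (assumption)

context = "Premise:\n...\n\nHypothesis:\n...\n\nLabel:\n"    # same shape as the jsonl "context" field
choices = ["entailment", "neutral", "contradiction"]
pred = max(range(len(choices)), key=lambda i: choice_logprob(context, choices[i]))
print(choices[pred])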

And here is a sample record from the jsonl file:

{
  "premise": "Your contribution helped make it possible for us to provide our students with a quality education.",
  "hypothesis": "Your contributions were of no help with our students' education.",
  "label": 2,
  "idx": 0,
  "query": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:",
  "choices": [
    "entailment",
    "neutral",
    "contradiction"
  ],
  "gold": 2,
  "context": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:\n"
}
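
For reference, records in this format can be generated from the Hugging Face glue/mnli dataset roughly as follows. This is a simplified sketch and the actual preprocessing may have differed slightly; the output path matches the dataset_uri in the YAML above.

# Sketch: build mnli_mismatched.jsonl records from glue/mnli (validation_mismatched split).
import json
from datasets import load_dataset

LABELS = ["entailment", "neutral", "contradiction"]  # glue/mnli label ids 0/1/2

ds = load_dataset("glue", "mnli", split="validation_mismatched")

with open("scripts/eval/local_data/mnli_mismatched.jsonl", "w") as f:
    for ex in ds:
        query = (
            f"Premise:\n{ex['premise']}\n\n"
            f"Hypothesis:\n{ex['hypothesis']}\n\nLabel:"
        )
        record = {
            "premise": ex["premise"],
            "hypothesis": ex["hypothesis"],
            "label": ex["label"],
            "idx": ex["idx"],
            "query": query,
            "choices": LABELS,
            "gold": ex["label"],
            "context": query + "\n",
        }
        f.write(json.dumps(record) + "\n")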

Please let me know if you need any more details.

Thanks,
-- ashim


ashim95 commented Nov 1, 2023

We also tried running the evaluation using lm-evaluation-harness. Here are the numbers with the two libraries:
[image: table of results from llm-foundry and lm-evaluation-harness]
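
For completeness, an lm-evaluation-harness run of this kind can be launched roughly as follows. This is a sketch against the harness's Python API in recent versions (older versions use main.py on the command line); the exact model args, task names, and result keys are assumptions and may differ by version.

# Sketch of an equivalent run with lm-evaluation-harness (version-dependent).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=~/huggingface_cache/Llama-2-7b-hf,dtype=float32",  # path as in the YAML
    tasks=["mnli", "qqp"],  # the mismatched MNLI split has its own task name,
                            # whose spelling differs across harness versions
    num_fewshot=8,
    batch_size=4,
)
print(results["results"])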
