
Openai compatible gauntlet #1017

Draft. Wants to merge 167 commits into base: main
@bmosaicml commented Mar 8, 2024

OpenAI run: api-eval-Ik2iMA

| Category   | Benchmark       | Subtask                             |   Accuracy | Number few shot   | Model                         |
|:-----------|:----------------|:------------------------------------|-----------:|:------------------|:------------------------------|
|            | gsm8k           |                                     |   0.482942 | 0-shot            | openai/gpt-3.5-turbo-instruct |
|            | lambada_openai  |                                     |   0.782651 | 0-shot            | openai/gpt-3.5-turbo-instruct |
|            | triviaqa_sm_sub |                                     |   0.727667 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            | jeopardy        | Average                             |   0.553084 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | american_history                    |   0.602906 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | literature                          |   0.714286 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | science                             |   0.434874 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | word_origins                        |   0.372603 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | world_history                       |   0.640751 | 3-shot            | openai/gpt-3.5-turbo-instruct |
|            | arc_challenge   |                                     |   0.687713 | 25-shot           | openai/gpt-3.5-turbo-instruct |
|            | mmlu            | Average                             |   0.713291 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | abstract_algebra                    |   0.47     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | anatomy                             |   0.674074 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | astronomy                           |   0.776316 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | business_ethics                     |   0.79     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | clinical_knowledge                  |   0.750943 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_biology                     |   0.763889 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_chemistry                   |   0.53     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_computer_science            |   0.57     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_mathematics                 |   0.47     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_medicine                    |   0.699422 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | college_physics                     |   0.54902  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | computer_security                   |   0.81     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | conceptual_physics                  |   0.67234  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | econometrics                        |   0.570175 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | electrical_engineering              |   0.662069 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | elementary_mathematics              |   0.608466 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | formal_logic                        |   0.642857 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | global_facts                        |   0.48     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_biology                 |   0.809677 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_chemistry               |   0.571429 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_computer_science        |   0.8      | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_european_history        |   0.70303  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_geography               |   0.818182 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_government_and_politics |   0.906736 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_macroeconomics          |   0.720513 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_mathematics             |   0.507407 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_microeconomics          |   0.785714 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_physics                 |   0.509934 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_psychology              |   0.838532 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_statistics              |   0.564815 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_us_history              |   0.823529 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | high_school_world_history           |   0.763713 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | human_aging                         |   0.7713   | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | human_sexuality                     |   0.847328 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | international_law                   |   0.859504 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | jurisprudence                       |   0.768519 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | logical_fallacies                   |   0.809816 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | machine_learning                    |   0.625    | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | management                          |   0.815534 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | marketing                           |   0.884615 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | medical_genetics                    |   0.88     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | miscellaneous                       |   0.872286 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | moral_disputes                      |   0.710983 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | moral_scenarios                     |   0.436871 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | nutrition                           |   0.761438 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | philosophy                          |   0.713826 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | prehistory                          |   0.783951 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_accounting             |   0.56383  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_law                    |   0.557366 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_medicine               |   0.768382 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | professional_psychology             |   0.73366  | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | public_relations                    |   0.790909 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | security_studies                    |   0.763265 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | sociology                           |   0.850746 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | us_foreign_policy                   |   0.93     | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | virology                            |   0.662651 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            |                 | world_religions                     |   0.883041 | 5-shot            | openai/gpt-3.5-turbo-instruct |
|            | hellaswag       |                                     |   0.706333 | 10-shot           | openai/gpt-3.5-turbo-instruct |
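
As a quick sanity check on the table above, the Jeopardy "Average" row appears to be the unweighted mean of its five subtask accuracies. This is inferred from the reported numbers, not stated by the run output, and the variable names below are illustrative only. A minimal sketch:

```python
from statistics import mean

# Jeopardy subtask accuracies copied from the table above (3-shot).
jeopardy_subtasks = {
    "american_history": 0.602906,
    "literature": 0.714286,
    "science": 0.434874,
    "word_origins": 0.372603,
    "world_history": 0.640751,
}

# Assumption: "Average" is a plain unweighted mean over subtasks.
jeopardy_average = round(mean(jeopardy_subtasks.values()), 6)

# Value reported in the "Average" row of the table.
reported_average = 0.553084

print(jeopardy_average, abs(jeopardy_average - reported_average) < 1e-6)
```

The same check could be repeated for the MMLU rows, though whether that average is weighted by subtask size is not verified here.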

@maxisawesome left a comment

Looking good. I have a couple of questions, but they mostly point to things that look like WIP rather than code intended to be merged.

early_stopping_criteria:
- "\n\n"
- "Question:"
# - label: mmlu

Should these be uncommented or deleted?
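
For context on the `early_stopping_criteria` entries in the snippet above: assuming these strings act as stop sequences that cut a generation at the first match (an assumption about the eval harness, not something confirmed in this thread), the effect can be sketched as follows. `truncate_at_stop` is a hypothetical helper, not a function from the codebase.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# The two stop strings from the YAML snippet above.
stops = ["\n\n", "Question:"]

# Made-up generation, for illustration only.
generation = "The answer is 42.\n\nQuestion: What is next?"
truncated = truncate_at_stop(generation, stops)
print(repr(truncated))
```

Under this reading, the model's answer is kept and any follow-on few-shot style continuation is discarded before scoring.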

cd llm-foundry/scripts
composer eval/eval.py /mnt/config/parameters.yaml

# Mosaic Cloud will use run_name (with a unique suffix) to populate the env var $RUN_NAME
run_name: mpt-eval
name: mpt-eval-logging

Is this a temporary change or permanent?

This goes for the whole file, I think?

name: api-eval
cluster: r1z1 # replace with your cluster here!
gpu_num: 8 #
gpu_type: a100_80gb #

Probably temporary?

if fsdp_config and model_cfg.model.get('load_in_8bit', False):
    raise ValueError(
        'The FSDP config block is not supported when loading ' +
        'Hugging Face models in 8bit.')

This doesn't seem related to the OpenAI gauntlet, but it should probably be merged anyway?

name: hotpotqa
context_length: 65536
section: beginning
split: test

We probably don't want to delete these, do we?

model_name: gpt-3.5-turbo
-
model_name: openai/davinci
model_name: openai/gpt-3.5-turbo-instruct

Delete the old gpt-3.5-turbo throughout the codebase in favor of the instruct variant?

@@ -114,8 +114,7 @@
 ]

 extra_deps['openai'] = [
-'openai==1.3.8',
-'tiktoken==0.4.0',
+'openai==1.3.8', 'tiktoken==0.4.0', 'google-generativeai'

Should google-generativeai be here or in its own extra_deps category?
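
One way the second option could look, as a rough sketch: the `gemini` key is a hypothetical name chosen for illustration, and the aggregate `all` entry is an assumption about how the extras might be combined, not a description of the actual setup.py.

```python
# Hypothetical layout: keep the OpenAI pins together and give the Google
# client its own extra. The 'gemini' key is an illustrative name only.
extra_deps = {}

extra_deps['openai'] = [
    'openai==1.3.8',
    'tiktoken==0.4.0',
]

extra_deps['gemini'] = [
    'google-generativeai',
]

# Assumed aggregate extra that installs every optional dependency,
# built from the extras defined so far.
extra_deps['all'] = sorted({dep for deps in extra_deps.values() for dep in deps})

print(extra_deps['all'])
```

If `extra_deps` is what gets passed to `extras_require`, a split like this would let `pip install .[openai]` skip `google-generativeai` for users who only need the OpenAI client.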
