Possibly wrong checkpoints for M2 and L #2

jglaser · 2024-06-06T07:12:51Z

I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's LM evaluation harness.

These are my results:

Model	SciQ	PIQA
forge-bio	0.788
forge-che	0.821
forge-eng	0.793
forge-mat	0.777
forge-phy	0.761
forge-soc	0.82
forge-s1	0.787
forge-s2	0.783
forge-s3	0.805
forge-s4	0.86
forge-m1	0.82
forge-m2	0.574	0.5577
forge-l	0.242

The highlighted scores (forge-m2 and forge-l) are much lower than the others, and much lower than what Table 8 of the paper leads one to expect. A quick check of the evaluation logs (data/eval/forge-m2) suggests that the forge-m2 numbers are roughly the scores of the m2 checkpoint at iteration 1000, and that the forge-l number probably comes from some very early checkpoint of forge-l.

I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.

Command line

 lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
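
To repeat the same run for several checkpoints, the command can be generated per model. This is only a minimal sketch: it reuses exactly the flags shown above, and the helper name `build_lm_eval_command` and the assumption that each checkpoint sits in a local directory named after the model are mine, not something from the repository.

```python
import shlex


def build_lm_eval_command(checkpoint, task="sciq", device="cuda"):
    """Return the lm_eval argument list for one checkpoint directory.

    Hypothetical helper: it only re-assembles the flags from the command
    line quoted in this issue and does not add any new options.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={checkpoint},parallelize=True",
        "--tasks", task,
        "--device", device,
    ]


# Model names taken from the results table; that they are also the local
# directory names of the downloaded checkpoints is an assumption.
checkpoints = ["forge-m1", "forge-m2", "forge-l"]
commands = [build_lm_eval_command(name) for name in checkpoints]

for cmd in commands:
    # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True)
    # to actually launch the evaluation.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to confirm that every model is evaluated with identical settings before any GPU time is spent.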

jqyin · 2024-11-03T15:26:24Z

This is likely due to a version mismatch in lm_eval. For the original evaluation, we used the eval_adapter from gpt-neox#e48b0c45.

To allow direct evaluation with lm_eval (v0.3.0), we also released an instructed version of Forge-m2; you can find more details at https://github.com/jqyin/chatHPC if interested.
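
When comparing numbers across setups, it may help to record which harness version is actually installed. The following is a small sketch using only the Python standard library; the distribution name `lm_eval` is an assumption about how the package was installed, and the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "lm_eval" is the assumed distribution name of the evaluation harness.
harness_version = installed_version("lm_eval")
print("lm_eval version:", harness_version or "not installed")
```

Including this line of output alongside reported scores makes a version mismatch easier to spot.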

AlpinDale · 2024-11-10T19:03:23Z

I attempted to use the large model for one of my projects and tested it with version 0.3.0 of lm_eval, but observed significant discrepancies in the results. Could you provide context on why these differences might occur? Could they be due to corrupted checkpoints or differences between model versions? Also, is an instruct version of the 22.4B model available? I'd like to use the instruct version for the larger model.

Tasks	Version	Filter	Metric	Value		Stderr
arc_challenge	Yaml	none	acc	0.2227	±	0.0122
		none	acc_norm	0.2662	±	0.0129
arc_easy	Yaml	none	acc	0.2689	±	0.0091
		none	acc_norm	0.2795	±	0.0092
hellaswag	Yaml	none	acc	0.2587	±	0.0044
		none	acc_norm	0.2560	±	0.0044
lambada_openai	Yaml	none	perplexity	23236265.9191	±	2223918.5377
		none	acc	0.0000	±	0.0000
openbookqa	Yaml	none	acc	0.1200	±	0.0145
		none	acc_norm	0.2800	±	0.0201
piqa	Yaml	none	acc	0.5169	±	0.0117
		none	acc_norm	0.5027	±	0.0117
sciq	Yaml	none	acc	0.2420	±	0.0136
		none	acc_norm	0.2400	±	0.0135
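
For context, here is a rough sketch that compares the `acc` values above with the accuracy expected from random guessing. The number of answer options per task is my assumption (four for most of these tasks, two for PIQA; a few ARC questions have a different count), so the chance levels are approximate.

```python
# acc values copied from the table above.
reported_acc = {
    "arc_challenge": 0.2227,
    "arc_easy": 0.2689,
    "hellaswag": 0.2587,
    "openbookqa": 0.1200,
    "piqa": 0.5169,
    "sciq": 0.2420,
}

# Assumed number of answer options per question (approximate for ARC).
num_choices = {
    "arc_challenge": 4,
    "arc_easy": 4,
    "hellaswag": 4,
    "openbookqa": 4,
    "piqa": 2,
    "sciq": 4,
}

# Accuracy of uniform random guessing and the margin of each score over it.
chance = {task: 1.0 / n for task, n in num_choices.items()}
margin = {task: reported_acc[task] - chance[task] for task in reported_acc}

for task in reported_acc:
    print(f"{task:15s} acc={reported_acc[task]:.4f} "
          f"chance={chance[task]:.4f} margin={margin[task]:+.4f}")
```

A margin close to zero on every task would be consistent with a model that has barely been trained, though it does not by itself distinguish a wrong checkpoint from an evaluation setup problem.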

cc @jqyin
