I am trying to reproduce the SciQ results from the SC'23 paper using Eleuther's LM evaluation harness.
These are my results:
| Model        | SciQ      | PIQA       |
|--------------|-----------|------------|
| forge-bio    | 0.788     |            |
| forge-che    | 0.821     |            |
| forge-eng    | 0.793     |            |
| forge-mat    | 0.777     |            |
| forge-phy    | 0.761     |            |
| forge-soc    | 0.82      |            |
| forge-s1     | 0.787     |            |
| forge-s2     | 0.783     |            |
| forge-s3     | 0.805     |            |
| forge-s4     | 0.86      |            |
| forge-m1     | 0.82      |            |
| **forge-m2** | **0.574** | **0.5577** |
| **forge-l**  | **0.242** |            |
The highlighted scores are much lower than the others, and lower than what Table 8 of the paper reports. A quick check of the evaluation logs (data/eval/forge-m2) suggests that these are roughly the scores of the m2 checkpoint at iteration 1000, and probably of some very early checkpoint of forge-l.
I downloaded the checkpoints from the links in the README.md. I suspect that the Dropbox versions were somehow mixed up.
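One quick way to test the mixed-up-checkpoints theory is to hash the downloaded weight files and compare the digests against a known-good copy (or just against each other, to confirm two "different" checkpoints aren't byte-identical). This sketch is not part of the original report; the checkpoint paths are hypothetical placeholders, and the demo at the bottom uses throwaway temp files:

```python
# Hedged sketch: stream files through SHA-256 to compare downloaded
# checkpoints. Replace the temp-file demo with real paths such as
# "forge-m2/pytorch_model.bin" (hypothetical filename).
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with throwaway files standing in for two checkpoint shards.
with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "ckpt_a.bin")
    b = os.path.join(d, "ckpt_b.bin")
    with open(a, "wb") as f:
        f.write(b"weights-a")
    with open(b, "wb") as f:
        f.write(b"weights-b")
    # Distinct contents should yield distinct digests.
    print(file_sha256(a) != file_sha256(b))  # → True
```

If two checkpoints that should differ hash to the same digest, the upload (not the evaluation) is the problem.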
Command line

```shell
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
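For reference, the table above was produced by repeating that command for each checkpoint. A small driver like the following can generate the per-model commands (a sketch, not part of the original report; it assumes `lm_eval` is on `PATH`, and by default it only prints the commands rather than running them):

```python
# Hedged sketch: build the lm_eval command for every checkpoint in the
# results table. Set dry_run=False to actually execute each one.
import shlex
import subprocess

MODELS = [
    "forge-bio", "forge-che", "forge-eng", "forge-mat", "forge-phy",
    "forge-soc", "forge-s1", "forge-s2", "forge-s3", "forge-s4",
    "forge-m1", "forge-m2", "forge-l",
]

def sweep(dry_run=True):
    """Return the command for each model; optionally run them."""
    cmds = []
    for m in MODELS:
        cmd = (f"lm_eval --model hf --model_args pretrained={m},"
               f"parallelize=True --tasks sciq --device cuda")
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(shlex.split(cmd), check=True)
    return cmds

for c in sweep():
    print(c)
```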
I attempted to use the large model for one of my projects and tested it with lm_eval version 0.3.0, but observed significant discrepancies in the results. Could you provide context on why these differences might occur? Could it be due to corrupted checkpoints or variations in the model versions? Also, is an instruct version of the 22.4B model available? I'd like to use the instruct version of the larger model.