
[bug] mlx_lm.evaluate ValueError on generative tasks #1266

Closed
deathcoder opened this issue Feb 8, 2025 · 2 comments · Fixed by #1277

Comments

@deathcoder

I think there is a bug in mlx_lm.evaluate, or I am doing something wrong. When I run mlx_lm.evaluate on a multiple-choice task like arc_easy, everything works fine. For example:

mlx_lm.evaluate --model mlx-community/Qwen2.5-3B-4bit --tasks arc_easy                                                                                                   (lm_eval) 
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 101203.05it/s]
2025-02-08:22:03:51,348 INFO     [evaluator.py:165] Setting random seed to 123 | Setting numpy seed to 123 | Setting torch manual seed to 123 | Setting fewshot manual seed to 123
2025-02-08:22:03:51,348 INFO     [evaluator.py:218] Using pre-initialized model
2025-02-08:22:04:00,519 WARNING  [evaluator.py:271] Overwriting default num_fewshot of arc_easy from None to 0
2025-02-08:22:04:00,519 WARNING  [evaluator.py:417] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2025-02-08:22:04:00,519 INFO     [task.py:420] Building contexts for arc_easy on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 174.73it/s]
2025-02-08:22:04:00,525 INFO     [evaluator.py:513] Running loglikelihood requests
2025-02-08:22:04:00,525 INFO     [evaluate.py:183] Estimating loglikelihood for 4 pairs.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.05it/s]
Results:
{
    "alias": "arc_easy",
    "acc,none": 1.0,
    "acc_stderr,none": "N/A",
    "acc_norm,none": 1.0,
    "acc_norm_stderr,none": "N/A"
}

But when I run it with a generative task, like arc_challenge_chat:

mlx_lm.evaluate --model mlx-community/Qwen2.5-3B-4bit --tasks arc_challenge_chat                                                                                         (lm_eval) 
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 107240.73it/s]
2025-02-08:22:07:37,660 INFO     [evaluator.py:165] Setting random seed to 123 | Setting numpy seed to 123 | Setting torch manual seed to 123 | Setting fewshot manual seed to 123
2025-02-08:22:07:37,661 INFO     [evaluator.py:218] Using pre-initialized model
train-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190k/190k [00:00<00:00, 3.98MB/s]
test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 204k/204k [00:00<00:00, 9.85MB/s]
validation-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.7k/55.7k [00:00<00:00, 40.8MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1119/1119 [00:00<00:00, 145473.95 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1172/1172 [00:00<00:00, 303946.35 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 299/299 [00:00<00:00, 144381.41 examples/s]
2025-02-08:22:07:47,489 INFO     [evaluator.py:267] num_fewshot has been set to 0 for arc_challenge_chat in its config. Manual configuration will be ignored.
2025-02-08:22:07:47,489 WARNING  [evaluator.py:417] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2025-02-08:22:07:47,489 INFO     [task.py:420] Building contexts for arc_challenge_chat on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 159.21it/s]
2025-02-08:22:07:47,495 INFO     [evaluator.py:513] Running generate_until requests
2025-02-08:22:07:47,496 INFO     [evaluate.py:288] Generating continuation for 1 sequences.
  0%|                                                                                                                                                          | 0/1 [00:00<?, ?it/s]max_tokens:  130412
  0%|                                                                                                                                                          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/bin/mlx_lm.evaluate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/evaluate.py", line 372, in main
    results = lm_eval.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/evaluator.py", line 304, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/evaluator.py", line 524, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/evaluate.py", line 305, in generate_until
    for response in stream_generate(
                    ^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/utils.py", line 529, in stream_generate
    for n, (token, logprobs) in enumerate(token_generator):
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/utils.py", line 299, in generate_step
    model(y[:prefill_step_size][None], cache=prompt_cache)
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 184, in __call__
    out = self.model(inputs, mask, cache)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 164, in __call__
    h = layer(h, mask, c)
        ^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 129, in __call__
    r = self.self_attn(self.input_layernorm(x), mask, cache)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 75, in __call__
    B, L, D = x.shape
    ^^^^^^^
ValueError: too many values to unpack (expected 3)

The same thing happens with other generative tasks, so I think the issue is related to generation rather than to any one task. I'm not exactly sure how to debug this problem; if you can point me in the right direction, I'm happy to try to put up a PR to fix it.
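For context, here is my guess at what's going on (not confirmed): the traceback shows the attention layer unpacking x.shape into (B, L, D), so the hidden state reaching it must have more than three dimensions. That could happen if the tokenized prompt handed to stream_generate already carries a batch dimension, since generate_step then adds another one with y[:prefill_step_size][None]. A minimal sketch of just the failing unpack, under that assumption (the shapes below are made up for illustration, not taken from the actual code path):

import mlx.core as mx

def attention_shape_check(x):
    # Mirrors qwen2.py line 75 from the traceback: expects (batch, seq_len, hidden_dim)
    B, L, D = x.shape
    return B, L, D

good = mx.zeros((1, 10, 64))    # (B, L, D): what the layer expects
bad = mx.zeros((1, 1, 10, 64))  # extra leading dim, e.g. from an already-batched prompt

print(attention_shape_check(good))  # (1, 10, 64)
attention_shape_check(bad)          # raises ValueError: too many values to unpack (expected 3)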

@awni
Member

awni commented Feb 11, 2025

I'm not able to load a task called arc_challenge_chat. Could you share more detailed steps for how to reproduce what you are seeing?

@awni
Member

awni commented Feb 11, 2025

Nvm, I found a task to repro it. The fix is in #1277.
