
[bug] mlx_lm.evaluate ValueError on generative tasks #1266

Closed
deathcoder opened this issue Feb 8, 2025 · 2 comments · Fixed by #1277

Comments

@deathcoder

I think there is a bug in mlx_lm.evaluate, or I am doing something wrong. When I run mlx_lm.evaluate on a multiple-choice task like arc_easy, everything works fine. For example:

mlx_lm.evaluate --model mlx-community/Qwen2.5-3B-4bit --tasks arc_easy                                                                                                   (lm_eval) 
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 101203.05it/s]
2025-02-08:22:03:51,348 INFO     [evaluator.py:165] Setting random seed to 123 | Setting numpy seed to 123 | Setting torch manual seed to 123 | Setting fewshot manual seed to 123
2025-02-08:22:03:51,348 INFO     [evaluator.py:218] Using pre-initialized model
2025-02-08:22:04:00,519 WARNING  [evaluator.py:271] Overwriting default num_fewshot of arc_easy from None to 0
2025-02-08:22:04:00,519 WARNING  [evaluator.py:417] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2025-02-08:22:04:00,519 INFO     [task.py:420] Building contexts for arc_easy on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 174.73it/s]
2025-02-08:22:04:00,525 INFO     [evaluator.py:513] Running loglikelihood requests
2025-02-08:22:04:00,525 INFO     [evaluate.py:183] Estimating loglikelihood for 4 pairs.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.05it/s]
Results:
{
    "alias": "arc_easy",
    "acc,none": 1.0,
    "acc_stderr,none": "N/A",
    "acc_norm,none": 1.0,
    "acc_norm_stderr,none": "N/A"
}

But when I run it with a generative task, like arc_challenge_chat:

mlx_lm.evaluate --model mlx-community/Qwen2.5-3B-4bit --tasks arc_challenge_chat                                                                                         (lm_eval) 
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 107240.73it/s]
2025-02-08:22:07:37,660 INFO     [evaluator.py:165] Setting random seed to 123 | Setting numpy seed to 123 | Setting torch manual seed to 123 | Setting fewshot manual seed to 123
2025-02-08:22:07:37,661 INFO     [evaluator.py:218] Using pre-initialized model
train-00000-of-00001.parquet: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190k/190k [00:00<00:00, 3.98MB/s]
test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 204k/204k [00:00<00:00, 9.85MB/s]
validation-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.7k/55.7k [00:00<00:00, 40.8MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1119/1119 [00:00<00:00, 145473.95 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1172/1172 [00:00<00:00, 303946.35 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 299/299 [00:00<00:00, 144381.41 examples/s]
2025-02-08:22:07:47,489 INFO     [evaluator.py:267] num_fewshot has been set to 0 for arc_challenge_chat in its config. Manual configuration will be ignored.
2025-02-08:22:07:47,489 WARNING  [evaluator.py:417] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2025-02-08:22:07:47,489 INFO     [task.py:420] Building contexts for arc_challenge_chat on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 159.21it/s]
2025-02-08:22:07:47,495 INFO     [evaluator.py:513] Running generate_until requests
2025-02-08:22:07:47,496 INFO     [evaluate.py:288] Generating continuation for 1 sequences.
  0%|                                                                                                                                                          | 0/1 [00:00<?, ?it/s]max_tokens:  130412
  0%|                                                                                                                                                          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/bin/mlx_lm.evaluate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/evaluate.py", line 372, in main
    results = lm_eval.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/evaluator.py", line 304, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/utils.py", line 402, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/dev/experiments/lm-evaluation-harness/lm_eval/evaluator.py", line 524, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/evaluate.py", line 305, in generate_until
    for response in stream_generate(
                    ^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/utils.py", line 529, in stream_generate
    for n, (token, logprobs) in enumerate(token_generator):
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/utils.py", line 299, in generate_step
    model(y[:prefill_step_size][None], cache=prompt_cache)
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 184, in __call__
    out = self.model(inputs, mask, cache)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 164, in __call__
    h = layer(h, mask, c)
        ^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 129, in __call__
    r = self.self_attn(self.input_layernorm(x), mask, cache)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/admin/devtools/miniconda3/envs/lm_eval/lib/python3.12/site-packages/mlx_lm/models/qwen2.py", line 75, in __call__
    B, L, D = x.shape
    ^^^^^^^
ValueError: too many values to unpack (expected 3)

The same thing happens with other generative tasks, so I think the issue is related to generation rather than to any one task. I'm not exactly sure how to debug this problem; if you can point me in the right direction, I'm happy to try to put up a PR to fix it.
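For context, here is my guess at what's going on (not confirmed): the traceback shows the attention layer unpacking x.shape into (B, L, D), so the hidden state reaching it must have more than three dimensions. That could happen if the tokenized prompt handed to stream_generate already carries a batch dimension, since generate_step then adds another one with y[:prefill_step_size][None]. A minimal sketch of just the failing unpack, under that assumption (the shapes below are made up for illustration, not taken from the actual code path):

import mlx.core as mx

def attention_shape_check(x):
    # Mirrors qwen2.py line 75 from the traceback: expects (batch, seq_len, hidden_dim)
    B, L, D = x.shape
    return B, L, D

good = mx.zeros((1, 10, 64))    # (B, L, D): what the layer expects
bad = mx.zeros((1, 1, 10, 64))  # extra leading dim, e.g. from an already-batched prompt

print(attention_shape_check(good))  # (1, 10, 64)
attention_shape_check(bad)          # raises ValueError: too many values to unpack (expected 3)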

@awni
Member

awni commented Feb 11, 2025

I'm not able to load a task called arc_challenge_chat. Could you share more detailed steps for how to reproduce what you are seeing?

@awni
Member

awni commented Feb 11, 2025

Nvm, I found a task to repro it. The fix is in #1277.
