fix models path #25

Merged
merged 22 commits into embeddings-benchmark:main on Sep 10, 2024

Conversation

@Samoed (Contributor) commented Sep 5, 2024

Since #18, you can't load results for models that are in the author__model folder format because the script only looks for folders that match the model name. For example:

from datasets import load_dataset
ds = load_dataset("mteb/results", "e5-small-v2")

This would result in an error because there is only an intfloat__e5-small-v2 folder. However, models without a revision still work since their folder names haven't changed.

I've also added a test to load all model results. I'm not sure how long this will take to run on CI, but locally, with a hard-coded local JSON, it took 2 minutes. Currently, it is failing due to errors like the one mentioned above.

@KennethEnevoldsen, could you please take a look at this?

@Samoed (Contributor, Author) commented Sep 5, 2024

Interesting. Locally I can run make test, but CI fails

@KennethEnevoldsen (Contributor) commented Sep 6, 2024

This sounds like a change that should be fixed within mteb's load function. Will add a fix there and then return to this issue.

@Samoed (Contributor, Author) commented Sep 6, 2024

I don't think so, because this repo doesn't use mteb to work with the data; the source of the problem is a wrong paths.json.

Comment on lines 34 to 38:

@pytest.mark.parametrize("model", MODELS)
def test_load_results_from_datasets(model):
    """Ensures that all models can be imported from dataset"""
    path = Path(__file__).parent.parent / "results.py"
    ds = load_dataset(str(path.absolute()), model, trust_remote_code=True)

Why is this required? Does the leaderboard load it using datasets?

@KennethEnevoldsen (Contributor) commented:

@Samoed if you address a comment, please mark it as resolved - it makes it easier for me to see what you have fixed.

Also, it seems like there are still unresolved files.

@Samoed (Contributor, Author) commented Sep 6, 2024

For now I have resolved only the __init__ comment and was waiting for the tests, but they didn't run. The other comments need your review.

Samoed and others added 3 commits September 6, 2024 18:29
@KennethEnevoldsen (Contributor) commented:

> The other comments need your review.

Took a look over it again. Are you referring to:

> I don't think so, because this repo doesn't use mteb to work with the data; the source of the problem is a wrong paths.json.

Are you saying that the leaderboard repo CI assumes something about the file structure of this repo, and that is what is causing the issue?

The current file format is definitely the one we want to stick with, so we should update the leaderboard code to match if that is the case.

@Samoed (Contributor, Author) commented Sep 6, 2024

> Are you saying that the leaderboard repo CI assumes something about the file structure of this repo, and that is what is causing the issue?

A bit, yes. Right now paths.json can contain two versions of model result file paths:

  1. before the author was added to the model name:
    https://github.com/embeddings-benchmark/results/blob/45604e5b10b383a06d8725125e485273211da8e1/paths.json#L592-L596
  2. and after that:
    https://github.com/embeddings-benchmark/results/blob/45604e5b10b383a06d8725125e485273211da8e1/paths.json#L305-L309

The results repo only looks for models that are defined in the split, so right now, for bge-m3, only the result files without a revision are found, while the others with a revision can't be imported. My PR merges these results by looking up the model_name after the author name, so results in the new format can be used together with results in the old format.
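
A minimal sketch (not the PR's actual code) of the kind of suffix matching described above; the results directory layout and function name are assumptions:

from pathlib import Path


def find_model_folders(results_dir: Path, model_name: str) -> list[Path]:
    """Match old-style folders ("e5-small-v2") as well as new-style
    "author__model" folders ("intfloat__e5-small-v2") for a bare model name."""
    matches = []
    for folder in results_dir.iterdir():
        if not folder.is_dir():
            continue
        # The part after "__" is the model name; old-style folders have no "__".
        if folder.name == model_name or folder.name.split("__")[-1] == model_name:
            matches.append(folder)
    return matches

With this kind of lookup, both results/e5-small-v2 and results/intfloat__e5-small-v2 would resolve to the same model.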

@KennethEnevoldsen (Contributor) commented:

Hmm right, getting a better hang of the code here - the integration with mteb is quite poor, which leads to a few issues.

So ideal situation:

from datasets import load_dataset
ds = load_dataset("mteb/results", "intfloat/e5-small-v2") # using just e5-small-v2 can lead to multiple matches

and those results should be compatible with

results = mteb.load_results(models=["intfloat/e5-small-v2"])

For splits etc., this duplicates information already within mteb. A benchmark should only be specified once (and that is within the package). Splits etc. should all be derived from mteb.

I don't see any reason why we should filter it within this repository (I might be missing something).


That is the ideal case. We might need to take some shortcuts here.

@Samoed (Contributor, Author) commented Sep 6, 2024

I understand your idea and agree that it should be implemented this way. However, it will require more time to integrate the old results into the new format, and some refactoring of the leaderboard will be needed

@Samoed (Contributor, Author) commented Sep 6, 2024

Now the tests fail because of huggingface/datasets#7141

@KennethEnevoldsen (Contributor) left a comment

> I understand your idea and agree that it should be implemented this way. However, it will require more time to integrate the old results into the new format, and some refactoring of the leaderboard will be needed

Thanks. Yep, it definitely requires more changes - I think this looks reasonable as is, though we need to resolve the failing tests.

@orionw since you are more familiar with the leaderboard integration, can I ask you to review this PR as well?

@orionw (Contributor) left a comment

Thanks for this PR @Samoed!! There was definitely a problem before that you fixed, thanks! I may be having a hard time following this PR, but it seems like we just make it so that any shortened model name also works, so we don't miss any or have duplicates.

If so, did we make sure we kept all the existing results? I see the GritLM no-instruction one got deleted, for example.

That's my only concern, otherwise LGTM.

@KennethEnevoldsen (Contributor) commented:

@Samoed we still have to resolve the failing test before we can merge this in. Also please respond to the comments.

You also write:

> Currently, it is failing due to errors like the one mentioned above.

Which, when I read it along with the failing tests, suggests there is only partly a solution here. In that case, please describe what is stopping it from becoming a finished solution.

@Samoed (Contributor, Author) commented Sep 9, 2024

I've added a test to load all model results by splits, but the loader takes paths.json from the HF repository, where the paths are broken. To make this test pass, it needs to be fixed in the HF repository.

TIL: I wrote replies to all your comments, but I didn't know I needed to press the submit button in the files tab to make them visible for you

@Samoed (Contributor, Author) commented Sep 9, 2024

Currently, there is one model that won't be fixed by my PR: text-embedding-preview-0409-768. At the moment, there are two folders with the same model name, but the author name is separated by a "." instead. I think this should be standardized, but I'm not sure which model this corresponds to.

  1. https://github.com/embeddings-benchmark/results/tree/main/results/google-gecko.text-embedding-preview-0409/no_revision_available
  2. https://github.com/embeddings-benchmark/results/tree/main/results/google-gecko-256.text-embedding-preview-0409/no_revision_available

@KennethEnevoldsen (Contributor) commented:

> I've added a test to load all model results by splits, but the loader takes paths.json from the HF repository, where the paths are broken. To make this test pass, it needs to be fixed in the HF repository.

Thanks for the clarification, I should have noted that. It seems like a test that will never fail when we want it to (when we introduce an error, it will use the current working version), and once that error is merged, the next (working) PR will fail. Shouldn't it just reference the paths.json within the PR?

> TIL: I wrote replies to all your comments, but I didn't know I needed to press the submit button in the files tab to make them visible for you

Ahh we have all been there :)

Re names, we should use '__' to separate the author from the model name. For now, let us treat 256 as part of the author (though my guess is that it refers to the embed size), so:

google-gecko-256.text-embedding-preview-0409/no_revision_available -> google-gecko-256__text-embedding-preview-0409/no_revision_available
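
A minimal sketch of such a one-off rename for the two folders mentioned in this thread (assuming a local checkout with a results/ directory; not part of this PR):

from pathlib import Path

results_dir = Path("results")  # assumed local checkout layout
for name in [
    "google-gecko.text-embedding-preview-0409",
    "google-gecko-256.text-embedding-preview-0409",
]:
    src = results_dir / name
    if src.is_dir():
        # Split on the first "." so the part before it becomes the author,
        # e.g. google-gecko-256.text-... -> google-gecko-256__text-...
        author, _, model = name.partition(".")
        src.rename(results_dir / f"{author}__{model}")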

Re: text-embedding-preview-0409-768

I can't seem to find that folder in the results repo?

@Samoed (Contributor, Author) commented Sep 9, 2024

> I can't seem to find that folder in the results repo?

I can't find it either, but I think it might be google-gecko.text-embedding-preview-0409. However, I'm not sure. I found these models in the leaderboard config: https://github.com/embeddings-benchmark/leaderboard/blob/12fdafc4eb41fb4e4dddfc6a6a202f13e23b734a/model_meta.yaml#L814-L831

> Shouldn't it just reference the paths.json within the PR?

I thought the same regarding #24, but I believe that when HF tries to download the repository, it will only download results.py and README, not the paths file. So, the paths file is being downloaded by the results file at runtime, and I'm not sure how to improve this.
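
One way such a runtime download could look (a hedged sketch; whether results.py actually uses huggingface_hub like this is an assumption):

import json

from huggingface_hub import hf_hub_download

# Fetch paths.json from the dataset repo at runtime, since loading via
# load_dataset("mteb/results") only ships the loading script, not the
# data files sitting next to it in the repository.
paths_file = hf_hub_download(
    repo_id="mteb/results", filename="paths.json", repo_type="dataset"
)
with open(paths_file) as f:
    paths = json.load(f)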

Commits added:
- Revert "rename" (this reverts commit e252add)
- rename
@KennethEnevoldsen (Contributor) commented Sep 9, 2024

Let us delete text-embedding-preview-0409-768 for now. The results are not there and the two on the leaderboard fall under different names.

> I thought the same regarding #24, but I believe that when HF tries to download the repository, it will only download results.py and README, not the paths file. So, the paths file is being downloaded by the results file at runtime, and I'm not sure how to improve this.

hmm, not sure either; the code:

ds = load_dataset(str(path.absolute()), model, trust_remote_code=True)

does seem to follow the docs. It does say legacy, though. If we can't get it to work, we should just remove the test.

@Samoed (Contributor, Author) commented Sep 9, 2024

> Let us delete text-embedding-preview-0409-768 for now. The results are not there and the two on the leaderboard fall under different names.
>
> > I thought the same regarding #24, but I believe that when HF tries to download the repository, it will only download results.py and README, not the paths file. So, the paths file is being downloaded by the results file at runtime, and I'm not sure how to improve this.
>
> hmm, not sure either; the code:
>
> ds = load_dataset(str(path.absolute()), model, trust_remote_code=True)
>
> does seem to follow the docs. It does say legacy, though. If we can't get it to work, we should just remove the test.

Yes, this will work for local development if the full repo is provided, but when using load_dataset("mteb/results") the code will give an error, because it won't have the results file.

@orionw (Contributor) commented Sep 9, 2024

Thanks for the response @Samoed, I understand now.

> google-gecko.text-embedding-preview-0409

I have been confused by the difference between Google Gecko and this text-embedding-004 model on the Arena. Google's naming conventions make this very hard to understand.

@KennethEnevoldsen (Contributor) commented:

Thanks @Samoed for taking the time on this - I will merge it in now to allow the other PRs to be resolved as well

@KennethEnevoldsen merged commit 16d7a28 into embeddings-benchmark:main on Sep 10, 2024
2 checks passed