fix: evaluate missing splits #1268
base: main
Conversation
@thivyanth Thanks for taking a first stab at it! For any new functionality, we usually add a test / test cases to help confirm that the added code works. Could you please add a test for this? Specifically the following cases:
- Results exist, but no missing splits
- Results exist, and 1 missing split.
I can review the PR afterwards.
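A minimal sketch of the two requested cases, using a plain dict as a stand-in for the saved results object (the helper name and the result layout here are assumptions for illustration, not the PR's actual code):

```python
# Stand-in helper mirroring the PR's split-comparison idea; the real
# implementation would operate on an MTEBResults object loaded from disk.
def missing_splits(existing_results: dict, task_eval_splits: list[str]) -> list[str]:
    """Return the requested splits that have no saved results yet."""
    return [split for split in task_eval_splits if split not in existing_results]

# Case 1: results exist and no splits are missing -> nothing to re-evaluate.
existing = {"train": {"ndcg_at_10": 0.31}, "test": {"ndcg_at_10": 0.34}}
assert missing_splits(existing, ["train", "test"]) == []

# Case 2: results exist but one split is missing -> only that split is rerun.
existing = {"test": {"ndcg_at_10": 0.34}}
assert missing_splits(existing, ["train", "test"]) == ["train"]
```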
Looking good, but still a few things to adjust.
mteb/evaluation/MTEB.py
Outdated
if save_path.exists() and not overwrite_results:
    logger.info(
        f"{task.metadata.name} results already exists. Loading results from disk. Set overwrite_results=True to overwrite."
    )
    existing_results = self.load_existing_results(save_path)
You should use MTEBResults.from_disk(path) to load results (see line 354).
mteb/evaluation/MTEB.py
Outdated
logger.info(
    f"{task.metadata.name} results exist but missing splits: {missing_splits}. Running evaluation for missing splits."
)
task_eval_splits = missing_splits
This will overwrite the existing file further down resulting in a new file where the splits are missing. You will have to merge the results objects.
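One way to picture the merge at the dict level (a sketch only; the actual fix should go through the results object, and the key layout here is an assumption):

```python
def merge_split_results(existing: dict, new: dict) -> dict:
    """Combine per-split scores so re-running one split keeps the others."""
    merged = dict(existing)  # start from what is already on disk
    merged.update(new)       # newly evaluated splits take precedence
    return merged

on_disk = {"test": {"ndcg_at_10": 0.34}}   # placeholder scores
just_ran = {"train": {"ndcg_at_10": 0.31}}
merged = merge_split_results(on_disk, just_ran)
assert set(merged) == {"train", "test"}    # nothing is lost when saving
```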
mteb/evaluation/MTEB.py
Outdated
def compare_splits_and_subsets(self, existing_results, task_eval_splits):
    missing_splits = []
    for split in task_eval_splits:
        if split not in existing_results:
            missing_splits.append(split)
    return missing_splits
Suggested change:
@staticmethod
def compare_splits_and_subsets(existing_results: MTEBResults, task_eval_splits: list[str]) -> list[str]:
    missing_splits = []
    for split in task_eval_splits:  # this will need to be adapted to the MTEBResults object
        if split not in existing_results:
            missing_splits.append(split)
    return missing_splits
@KennethEnevoldsen Thank you for the suggestions. I've implemented the changes as requested, but from scratch. I manually checked the logs to confirm that only the missing splits are evaluated, rather than writing a pytest. Could you suggest a way to verify this with pytest, specifically that only the missing splits are evaluated? @isaac-chung @Muennighoff (The latest commit passed make test and make lint.)
Wrote a test as well. (The latest commit passed make test and make lint.)
@thivyanth Rerunning the failed test to ensure that everything passes.
This is already looking much better. A few more changes are needed though; let me know if there are any questions.
self.last_evaluated_splits[task.metadata_dict["name"]] = set()
self.last_evaluated_splits[task.metadata_dict["name"]].add(split)

new_results = MTEBResults.from_task_results(
We need to consider what to do with the MTEB version specification. Ideally we should record multiple versions, e.g. "1.2.11, 1.2.16". We should also add a flag to only append to existing files if the version matches (probably on by default).
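A sketch of what that version bookkeeping could look like (the function names and the comma-separated version string are assumptions drawn from the comment above, not existing mteb API):

```python
def merge_mteb_versions(existing: str, current: str) -> str:
    """Append the current mteb version to a comma-separated version string."""
    versions = [v.strip() for v in existing.split(",") if v.strip()]
    if current not in versions:
        versions.append(current)
    return ", ".join(versions)

def can_append(existing: str, current: str, require_match: bool = True) -> bool:
    """With the flag on (the suggested default), only append to an existing
    results file when the current version is already recorded in it."""
    if not require_match:
        return True
    return current in [v.strip() for v in existing.split(",")]

assert merge_mteb_versions("1.2.11", "1.2.16") == "1.2.11, 1.2.16"
assert can_append("1.2.11, 1.2.16", "1.2.16") is True
assert can_append("1.2.11", "1.2.16") is False
```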
    f"{task.metadata.name} results already exist. Loading results from disk."
)
evaluation_results.append(existing_results)
self.last_evaluated_splits[task.metadata.name] = []  # Add this line
Seems like an error?
@@ -488,3 +563,11 @@ def _save_model_metadata(model_meta: ModelMeta, output_folder: Path) -> None:

    with save_path.open("w") as f:
        json.dump(model_meta.to_dict(), f)


def get_last_evaluated_splits(self):
Is this needed? (I would just add some logging messages instead)
if existing_results:
    merged_results = self._merge_results(existing_results, new_results)
Probably worth merging using the MTEBResults object, instead of directly on the dict:
new_results = MTEBResults(...)
if existing_results:
    new_results.update(existing_results)
@pytest.fixture
def model():
    return SentenceTransformer("all-MiniLM-L6-v2")
Use mock model instead (see e.g. MockNumpyEncoder here)
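A minimal mock in the spirit of MockNumpyEncoder (the class name comes from the comment above; this exact interface is an assumption, sketched so the test avoids downloading a real SentenceTransformer):

```python
import numpy as np

class MockNumpyEncoder:
    """Returns fixed-size random vectors so tests need no model download."""

    def __init__(self, dim: int = 8, seed: int = 42):
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    def encode(self, sentences, **kwargs):
        # One embedding row per input sentence, deterministic via the seed.
        return self.rng.random((len(sentences), self.dim))

model = MockNumpyEncoder()
embeddings = model.encode(["a", "b", "c"])
assert embeddings.shape == (3, 8)
```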
@pytest.fixture
def nfcorpus_tasks():
    return mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
Use MockTask as well (see above)
evaluation.run(
    model,
    eval_splits=["train", "test"],
    save_predictions=True,
    output_folder=str(tmp_path / "testcase1"),
    verbosity=2,
)
last_evaluated_splits = evaluation.get_last_evaluated_splits()
print(last_evaluated_splits)
assert "NFCorpus" in last_evaluated_splits
assert set(last_evaluated_splits["NFCorpus"]) == {"train", "test"}
assert len(last_evaluated_splits["NFCorpus"]) == 2
We can simplify tests a bit (this applies across tests). Suggested change:
result_obj = evaluation.run(
    model,
    eval_splits=["train", "test"],
    save_predictions=True,
    output_folder=str(tmp_path / "testcase1"),
    verbosity=2,
)
# check splits here based on the object - no need for last_evaluated_splits
Any reason why save_predictions is True?
@thivyanth Great work thus far! It's a pretty useful PR, so it would be amazing to have it merged soon if you have time to address the comments?
I just saw this message; I emailed you a few hours ago.
Any update here @thivyanth? Would be great to get this in! 🙌
@thivyanth Would love to get this merged in. Do you still have time to work on it?
Addresses #1260
Checklist
@Muennighoff @isaac-chung
make test
944 passed, 236 skipped, 55 warnings in 232.13s (0:03:52)
make lint