Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC2024] Resolved issue #7523: Fixed error caused by filenames with same names #7602

Closed
wants to merge 11 commits into from
31 changes: 30 additions & 1 deletion cvat/apps/dataset_manager/formats/coco.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,8 @@
# SPDX-License-Identifier: MIT

import zipfile

import json
import uuid
from datumaro.components.dataset import Dataset
from datumaro.components.annotation import AnnotationType

Expand Down Expand Up @@ -32,8 +33,16 @@ def _import(src_file, temp_dir, instance_data, load_data_callback=None, **kwargs
load_data_callback(dataset, instance_data)
import_dm_annotations(dataset, instance_data)
else:
rename_mapping, src_file = update_annotation_data(src_file)
#To get reverse map with image names without extension
reverse_mapping = {value: key for key, value in rename_mapping.items()}
reverse_no_ext_mapping = {key.rsplit('.', 1)[0]: value.rsplit('.', 1)[0] for key, value in reverse_mapping.items()}

dataset = Dataset.import_from(src_file.name,
'coco_instances', env=dm_env)
#Reverse mapping names
for item in dataset:
item.id = reverse_no_ext_mapping[item.id]
Copy link
Contributor

@zhiltsov-max zhiltsov-max Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please try to implement the solution I described in #7523 (comment)? This code currently misses the most important part, that, in fact, should address the problem. Currently, as I see, the code doesn't change the situation, as it just modifies the input dataset and then changes everything back on the Datumaro side. As the problem is on the Datumaro side, it's not possible to solve it just by changing the Datumaro dataset object. Please also check the zip archive case as well, and then COCO Keypoints format in this file, it will have the same implementation for the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhiltsov-max

I've been experimenting with the implementation provided in #7523(comment), and I encountered a circular import error while attempting to load the rename_mapping variable into bindings.py.

ImportError: cannot import name 'GetCVATDataExtractor' from partially initialized module 'cvat.apps.dataset_manager.bindings' (most likely due to a circular import) (/home/adkbbx/projects/CVAT/cvat/cvat/apps/dataset_manager/bindings.py)

Which I thought would be the way of implementing this logic please do correct me if I was wrong with the implementation.

On furthur debugging I found that the

self._frame_mapping variable inside the bindings.py only contains the value {"image1": 1} may be since same keys were used since path "Image1" was same for both the images in the dataset and only the value of fram got update from 0->1 which I felt would be the cause of multiple annotations being placed on the same image on my current implementation.

To resolve this, I attempted to updating the _frame_mapping attribute with the instance_data and this seems to resolve the issue of multiple annotations being placed on the same image on my current implementation but while I was testing the export of datasets the annotations aren't getting updated into json file.
"annotations": []

So I think updating _frame_mapping might be the cause of this export error which I am unable to debug.

I would appreciate any hints, insights or suggestions for the implementation of this solution further. I'm sorry for the frequent contact and thank you for your time and patience! 🙇

Copy link
Contributor

@zhiltsov-max zhiltsov-max Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been experimenting with the implementation provided in #7523 (comment), and I encountered a circular import error while attempting to load the rename_mapping variable into bindings.py.

ImportError: cannot import name 'GetCVATDataExtractor' from partially initialized module 'cvat.apps.dataset_manager.bindings' (most likely due to a circular import) (/home/adkbbx/projects/CVAT/cvat/cvat/apps/dataset_manager/bindings.py)

Which I thought would be the way of implementing this logic please do correct me if I was wrong with the implementation.

The suggestion in the comment was to add a new function parameter to import_dm_annotations() in bindings.py. I believe, this doesn't require importing any modules. Then, you can supply the constructed mapping into this function, when it's called in coco.py. After this, in import_dm_annotations(), the match_dm_item() function gets called. It also needs a new parameter. match_dm_item() needs to be updated to use this new parameter to map the item name back into the original one, and find the matching frame id in the job or task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhiltsov-max
I've pushed a new commit attempting to address the issue and think I was able to incorporated the suggestions you provided earlier, including adding a new parameter to import_dm_annotations() in bindings.py and updating match_dm_item() accordingly. Despite these changes, It seems that the annotations are still being duplicated on some images even though the anotations are being accepted now. As of what I am able to understand from my debugging the way instance_dataset._frame_mapping updates when there are images with same name and different extensions might be the cause of this issue.
Any further guidance you can provide would be greatly appreciated. Thank you for your continued support and feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very nice to see advances in solving this problem.

As of what I am able to understand from my debugging the way instance_dataset._frame_mapping updates when there are images with same name and different extensions might be the cause of this issue.

Yeah, it seems to be the issue now. From what I've checked, we can try to change what's stored in these frame mappings, as they seem to only be used for frame matching. Probably, we can include file extensions in these mappings as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to get the inital tests setup and am able to run the tests sucessfully but using pytest ./tests/python and am getting theese results where some test cases are failing as well as status code 500 error which I think is an internal server error and am a little unsure about how to deal with this any pointers would be helpful.

Please try to check server logs with docker logs test_cvat_server_1, docker logs test_cvat_worker_import_1, docker logs test_cvat_worker_export_1. You can also attach a debugger, if it's needed.

please note that I came up with this refering to the current implementation of the test so do let me know if it is incorrect. Thank you for your support!

  • Seems like it does quite a few unnecessary actions. Please consider minimizing the amount of actions done in the test setup.
  • Importing the exported dataset makes no sense, as it doesn't reproduce the problem (since the export is broken too wrt. this issue).
  • Why the test is checking that there are no annotations in the task after uploading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhiltsov-max I am sorry I think that was a mistake from me I will update the tests and check the logs as requested thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, any updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhiltsov-max I am sorry for the delayed response and must inform you that progress on this task has been slower than anticipated since I had been a little busy with the proposal submission for GSOC. I plan to complete working on this issue as soon as possible. I apologize for any inconvenience this may have caused, and I assure you that I am dedicated to completing this task promptly and effectively. Thank you for your understanding and continued support. 🙇

Copy link
Contributor

@zhiltsov-max zhiltsov-max Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, do as you see fits best to you, it's your PR after all. Just keep in mind the program deadlines, if you want it to be a part of GSoC.

import_dm_annotations(dataset, instance_data)

@exporter(name='COCO Keypoints', ext='ZIP', version='1.0')
Expand Down Expand Up @@ -65,3 +74,23 @@ def remove_extra_annotations(dataset):
'coco_person_keypoints', env=dm_env)
remove_extra_annotations(dataset)
import_dm_annotations(dataset, instance_data)


def update_annotation_data(src_file):
rename_mapping = {}
with open(src_file.name, 'r') as f:
annotation_data = json.load(f)

for image in annotation_data['images']:
original_filename = image['file_name']
unique_id = str(uuid.uuid4())[:8] # Generating a unique ID
name, extension = original_filename.split('.')
new_name = f"{name}_{unique_id}.{extension}"
rename_mapping[original_filename] = new_name
image['file_name'] = new_name

with open(src_file.name, 'w') as f:
json.dump(annotation_data, f, indent=4)

# Returning src_file along with rename_mapping
return rename_mapping, src_file
Loading