Add wildreceipt dataset #1359
Codecov Report
@@            Coverage Diff             @@
##             main    #1359      +/-   ##
==========================================
+ Coverage   95.78%   95.80%   +0.01%
==========================================
  Files         154      155       +1
  Lines        6910     6954      +44
==========================================
+ Hits         6619     6662      +43
- Misses        291      292       +1
Hi @HamzaGbada 👋,
Thanks a lot, this looks pretty good overall 👍
I have added a few comments.
Could you also update the docs, please? :)
https://github.com/mindee/doctr/blob/main/docs/source/index.rst -> Supported Datasets
https://github.com/mindee/doctr/blob/main/docs/source/using_doctr/using_datasets.rst -> Tables
When you are done, please run
make style
make quality
to fix formatting, etc.
NOTE: Don't worry about the failing CI TF detection test; I have already opened a fix for it :)
doctr/datasets/wildreceipt.py
Outdated
class WILDRECEIPT(AbstractDataset):
    """WildReceipt is a collection of receipts. It contains, for each photo, of a list of OCRs - with bounding box, text, and class."
"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"
doctr/datasets/wildreceipt.py
Outdated
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download/generate the dataset from the repository.
download/generate -> download
doctr/datasets/wildreceipt.py
Outdated
self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
reco_folder_name = "WILDRECEIPT_recognition_train" if self.train else "WILDRECEIPT_recognition_test"
How many samples are in the train and test splits?
Do we really need to save the crops locally? If the dataset is small, we can store them directly in RAM.
Example:
Line 94 in bc2d3c5
if recognition_task:
Certainly, given the limited number of samples available – specifically, 1268 samples for the training set and 472 samples for the test set – I've opted to store the data directly in RAM.
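For reference, keeping the crops in memory could look roughly like the sketch below. It is only an illustration: `crop_box` and `build_recognition_samples` are hypothetical helpers (not part of docTR), and nested lists stand in for the image array so the snippet runs without dependencies.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for an image array: rows of pixel values.
Image = List[List[int]]


def crop_box(image: Image, box: Sequence[int]) -> Image:
    """Cut the region (xmin, ymin, xmax, ymax) out of the image."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in image[ymin:ymax]]


def build_recognition_samples(
    image: Image, boxes: Sequence[Sequence[int]], labels: Sequence[str]
) -> List[Tuple[Image, str]]:
    """Collect (crop, label) pairs in a list instead of writing them to disk."""
    data: List[Tuple[Image, str]] = []
    for box, label in zip(boxes, labels):
        data.append((crop_box(image, box), label))
    return data


# Tiny 3x4 "image" with one annotated word.
image = [[10 * r + c for c in range(4)] for r in range(3)]
samples = build_recognition_samples(image, [(1, 0, 3, 2)], ["TOTAL"])
```

With only a few thousand source images, a list like `samples` avoids the extra folder of crop and label files entirely.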
doctr/datasets/wildreceipt.py
Outdated
np_dtype = np.float32
self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
IMGUR5K -> WildReceipt
doctr/datasets/wildreceipt.py
Outdated
        dtype=np_dtype
    )
else:
    box = self._convert_xmin_ymin(coordinates)
No need to write your own function; you can use the functions from doctr.utils:
from .utils import polygon_to_bbox
box_targets = polygon_to_bbox(tuple((coordinates[i], coordinates[i + 1]) for i in range(0, len(coordinates), 2)))
box = [coord for coords in box_targets for coord in coords]
OR
write the logic directly here (the function is only used once):
x, y = coordinates[::2], coordinates[1::2]
box = [min(x), min(y), max(x), max(y)]
I would prefer the second way :)
@HamzaGbada can we use the second suggestion, please? After reading this again, I really don't like the first one 😅
OK
doctr/datasets/wildreceipt.py
Outdated
    img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
    with open(os.path.join(reco_folder_path, f"{reco_images_counter}.txt"), "w") as f:
As mentioned, I don't think we need to save it locally. WDYT?
About fixing formatting, these two commands return an error:
Do you have any idea why?
You have installed doctr with its dev dependencies, correct?
Looks like you are in the
@HamzaGbada Close to merge, really good job 👍🏼 😃
Only some minor stuff left and make
@@ -84,6 +86,8 @@ This datasets contains the information to train or validate a text recognition m
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| IIITHWS                     | 7141797                         | 793533                          | english / handwritten / external resources  |
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| WILDRECEIPT                 | 1268                            | 472                             | english / external resources                |
This does not look correct: here we should report the number of samples we get when the dataset is used for recognition :)
So there should be many more samples.
doctr/datasets/wildreceipt.py
Outdated
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download the dataset from the repository.
Change to: You need to download the dataset first.
crops = crop_bboxes_from_image(
    img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
Do you know if there is any text that we need to filter out?
For example, text that contains whitespace?
Ref.:
Line 100 in f22f6dd
if not any(char in label for char in ["☑", "☐", "\uf703", "\uf702"]):
It's worth noting that this dataset contains small text elements that might not be useful for the recognition task. For instance, we could filter out text elements that are empty or consist of characters such as "-", "*", "/", "=", "#", or "@" to improve the quality of the recognition samples.
@HamzaGbada
Hmm, in this case I think it would be enough to filter out empty elements and labels that contain whitespace.
We can handle all of the punctuation above :)
"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.
Optional:
It would be great to have an image that gives a general overview of the dataset.
See:
https://mindee.github.io/doctr/modules/datasets.html
Line 24 in f22f6dd
.. image:: https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0
Where should I put the image?
@HamzaGbada you can post it here
@odulcy-mindee Could you upload it please ?
@felixT2K @HamzaGbada Here you go:
https://doctr-static.mindee.com/models?id=v0.7.0/wildreceipt-dataset.jpg&src=0
It would be enough if you post the mentioned image here; we can update the docstring later :)
No, it returns:
What happens if you run the following commands individually (without make)?
doctr/datasets/wildreceipt.py
Outdated
@@ -99,7 +99,7 @@ def __init__(
     img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
 )
 for crop, label in zip(crops, list(text_targets)):
-    if not any(char in label for char in ["", "-", "*", "/", "=", "#", "@"]):
+    if not any(char in label for char in ["", " "]):
if label and " " not in label:
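To see why the suggested condition is needed: the empty string is a substring of every string, so `"" in label` is always true and the `any(...)` check above rejects every label. A small sketch comparing both filters on made-up labels (the function names are hypothetical):

```python
from typing import List


def keep_with_any(labels: List[str]) -> List[str]:
    # "" in label is True for every string, so nothing passes this filter.
    return [label for label in labels if not any(char in label for char in ["", " "])]


def keep_non_empty_without_space(labels: List[str]) -> List[str]:
    # Keep labels that are non-empty and contain no space character.
    return [label for label in labels if label and " " not in label]


labels = ["TOTAL", "", "12.50", "THANK YOU", "#"]
kept_any = keep_with_any(labels)
kept_suggested = keep_non_empty_without_space(labels)
```

The first filter drops all five labels, while the suggested condition keeps the single-token ones, including punctuation such as "#".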
Got it, the issue was related to my Linux distribution.
@HamzaGbada
Looks good now, thanks a lot 🤗
Thank you @HamzaGbada for this contribution! 👏