Add wildreceipt dataset #1359

Merged: 33 commits, Oct 27, 2023

HamzaGbada
Contributor

No description provided.

codecov bot commented Oct 26, 2023

Codecov Report

Merging #1359 (e257a29) into main (e83c3ab) will increase coverage by 0.01%.
Report is 6 commits behind head on main.
The diff coverage is 97.77%.

❗ Current head e257a29 differs from pull request most recent head 478a420. Consider uploading reports for the commit 478a420 to get more accurate results

@@            Coverage Diff             @@
##             main    #1359      +/-   ##
==========================================
+ Coverage   95.78%   95.80%   +0.01%     
==========================================
  Files         154      155       +1     
  Lines        6910     6954      +44     
==========================================
+ Hits         6619     6662      +43     
- Misses        291      292       +1     
Flag Coverage Δ
unittests 95.80% <97.77%> (+0.01%) ⬆️

Files Coverage Δ
doctr/datasets/__init__.py 100.00% <100.00%> (ø)
doctr/datasets/wildreceipt.py 97.72% <97.72%> (ø)

... and 4 files with indirect coverage changes

@felixdittrich92 added this to the 0.7.1 milestone Oct 26, 2023
@felixdittrich92 added the topic: documentation, ext: tests, and module: datasets labels Oct 26, 2023
@felixdittrich92 self-assigned this Oct 26, 2023
Contributor

@felixdittrich92 left a comment

Hi @HamzaGbada 👋,

Thanks a lot, this looks pretty good overall 👍
I have added a few comments.

Furthermore, could you also update the docs, please? :)
https://github.com/mindee/doctr/blob/main/docs/source/index.rst -> Supported Datasets
https://github.com/mindee/doctr/blob/main/docs/source/using_doctr/using_datasets.rst -> Tables

When you are done, please run

make style
make quality

to fix formatting, etc.

NOTE: Don't worry about the failing CI TF detection test; I have already opened a fix for it :)



class WILDRECEIPT(AbstractDataset):
"""WildReceipt is a collection of receipts. It contains, for each photo, of a list of OCRs - with bounding box, text, and class."
Contributor

"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download/generate the dataset from the repository.
Contributor

download/generate -> download

self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
reco_folder_name = "WILDRECEIPT_recognition_train" if self.train else "WILDRECEIPT_recognition_test"
Contributor

How many samples are in the train and test splits?
Do we really need to save them locally, or can we keep them in RAM?

If the dataset is small enough, we can store it directly in RAM.
Example:

if recognition_task:

Contributor Author

Certainly, given the limited number of samples available – specifically, 1268 samples for the training set and 472 samples for the test set – I've opted to store the data directly in RAM.
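
A minimal sketch of the in-RAM approach discussed here, under stated assumptions: nested lists stand in for a decoded image so the example has no dependencies, boxes are `[xmin, ymin, xmax, ymax]` in absolute pixels, and `crop_boxes` and `build_recognition_samples` are hypothetical helpers written for illustration (they are not doctr's `crop_bboxes_from_image`).

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # nested lists stand in for a decoded image (rows of pixels)


def crop_boxes(img: Image, boxes: Sequence[Sequence[int]]) -> List[Image]:
    """Cut one crop per [xmin, ymin, xmax, ymax] box out of the image."""
    height, width = len(img), len(img[0])
    crops = []
    for xmin, ymin, xmax, ymax in boxes:
        # Clamp to the image so slightly out-of-bounds annotations still slice cleanly
        xmin, xmax = max(0, xmin), min(width, xmax)
        ymin, ymax = max(0, ymin), min(height, ymax)
        crops.append([row[xmin:xmax] for row in img[ymin:ymax]])
    return crops


def build_recognition_samples(
    img: Image, boxes: Sequence[Sequence[int]], labels: Sequence[str]
) -> List[Tuple[Image, str]]:
    """Keep (crop, label) pairs in a plain list instead of writing files to disk."""
    return list(zip(crop_boxes(img, boxes), labels))


# Synthetic 10-row x 20-column image with two word boxes (made-up values)
image = [[0] * 20 for _ in range(10)]
samples = build_recognition_samples(image, [[0, 0, 5, 4], [10, 2, 25, 8]], ["TOTAL", "9.99"])
print(len(samples))  # 2
```

With only a few thousand word crops, holding the pairs in a list like this avoids the extra folder of image and text files that the original draft wrote to disk.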

np_dtype = np.float32
self.data: List[Tuple[Union[str, Path, np.ndarray], Union[str, Dict[str, Any]]]] = []

# define folder to write IMGUR5K recognition dataset
Contributor

IMGUR5K -> WildReceipt

dtype=np_dtype
)
else:
box = self._convert_xmin_ymin(coordinates)
Contributor

No need to write your own function; you can use the functions from doctr.utils:

from .utils import polygon_to_bbox
box_targets = polygon_to_bbox(tuple((coordinates[i], coordinates[i + 1]) for i in range(0, len(coordinates), 2)))
box = [coord for coords in box_targets for coord in coords]

OR

write the logic directly here (the function is only used once):

x, y = coordinates[::2], coordinates[1::2]
box = [min(x), min(y), max(x), max(y)]

I would prefer the second way :)
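
For reference, a small self-contained sketch of the second option, assuming the annotation stores each polygon as a flat `[x1, y1, x2, y2, ...]` list. The function name `flat_polygon_to_box` is hypothetical and only used here for illustration.

```python
from typing import List, Sequence


def flat_polygon_to_box(coordinates: Sequence[float]) -> List[float]:
    """Convert a flat [x1, y1, x2, y2, ...] polygon into [xmin, ymin, xmax, ymax]."""
    if len(coordinates) < 2 or len(coordinates) % 2:
        raise ValueError("expected a non-empty, even-length list of x/y values")
    # Even indices hold x values, odd indices hold y values
    x, y = coordinates[::2], coordinates[1::2]
    return [min(x), min(y), max(x), max(y)]


# A slightly rotated quadrilateral given as x1, y1, ..., x4, y4
print(flat_polygon_to_box([12, 5, 48, 7, 47, 20, 11, 18]))  # [11, 5, 48, 20]
```

Taking the min and max over the x and y slices gives the axis-aligned box that encloses the polygon, which is why no separate helper method is needed.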

Contributor

@HamzaGbada Can we use the second suggestion, please? After reading this again, I really don't like it 😅

Contributor Author

OK

img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
with open(os.path.join(reco_folder_path, f"{reco_images_counter}.txt"), "w") as f:
Contributor

As mentioned, I don't think we need to save it locally. What do you think?

@HamzaGbada
Contributor Author

Regarding the formatting fixes, these two commands return an error:

make style
make quality
Sphinx error:
Builder name style not registered or available through entry point
make: *** [Makefile:20: style] Error 2

Do you have any idea what causes it?

@felixT2K
Contributor

You have installed doctr with its dev dependencies, correct?

cd doctr
pip3 install -e .[dev]

It looks like you are in the docs directory. Run the commands from the repository root instead:

cd doctr
make style
make quality

https://github.com/mindee/doctr/blob/main/Makefile

Contributor

@felixT2K left a comment

@HamzaGbada Close to merge, really good job 👍🏼 😃

Only some minor stuff left and make

@@ -84,6 +86,8 @@ This datasets contains the information to train or validate a text recognition m
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| IIITHWS                     | 7141797                         | 793533                          | english / handwritten / external resources  |
+-----------------------------+---------------------------------+---------------------------------+---------------------------------------------+
| WILDRECEIPT                 | 1268                            | 472                             | english / external resources                |
Contributor

@felixT2K Oct 27, 2023

This does not look correct: here we should list the number of samples we get when the dataset is used for recognition :)
So this should be many more samples.
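
To illustrate why the recognition count is larger than the image count, here is a rough sketch. It assumes one JSON object per image with an `annotations` list holding one entry per text box; that layout and the sample records below are assumptions made up for the example, not data taken from the dataset.

```python
import json
from typing import Iterable, Tuple

# Two made-up annotation lines, one JSON object per image (assumed layout)
lines = [
    '{"file_name": "a.jpg", "annotations": [{"text": "TOTAL"}, {"text": "9.99"}, {"text": ""}]}',
    '{"file_name": "b.jpg", "annotations": [{"text": "CASH"}, {"text": "THANK YOU"}]}',
]


def count_samples(annotation_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of images, number of annotated text boxes)."""
    n_images, n_boxes = 0, 0
    for line in annotation_lines:
        record = json.loads(line)
        n_images += 1
        # Every annotated box becomes one crop when the dataset is used for recognition
        n_boxes += len(record["annotations"])
    return n_images, n_boxes


print(count_samples(lines))  # (2, 5)
```

The table entry for a recognition dataset should report the second number (after any label filtering), since each cropped word is one training sample.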

<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

>>> # NOTE: You need to download the dataset from the repository.
Contributor

@felixT2K Oct 27, 2023

Change to: You need to download the dataset first.

crops = crop_bboxes_from_image(
img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
)
for crop, label in zip(crops, list(text_targets)):
Contributor

Do you know if it contains any text we need to filter out?
For example, text which contains whitespace?

Ref.:

if not any(char in label for char in ["☑", "☐", "\uf703", "\uf702"]):

Contributor Author

@HamzaGbada Oct 27, 2023

It's worth noting that this dataset contains small text elements that might not be conducive to the recognition task. For instance, we could consider filtering out text elements that are empty or consist of characters such as "-", "*", "/", "=", "#", or "@" to enhance the quality of the recognition process.

Contributor

@HamzaGbada
Hm, in this case I think it would be enough to filter out labels that are empty or contain whitespace.
We can handle all of the punctuation above :)

"""WildReceipt dataset from `"Spatial Dual-Modality Graph Reasoning for Key Information Extraction"
<https://arxiv.org/abs/2103.14470v1>`_ |
`repository <https://download.openmmlab.com/mmocr/data/wildreceipt.tar>`_.

Contributor

Optional:
It would be great to have an image that gives a general overview of the dataset.

See:
https://mindee.github.io/doctr/modules/datasets.html

.. image:: https://doctr-static.mindee.com/models?id=v0.5.0/funsd-grid.png&src=0

Contributor Author

Where should I put the image?

Contributor

@HamzaGbada you can post it here

@odulcy-mindee Could you upload it please ?

Contributor Author

[Image: combined_image]

Collaborator

@felixT2K @HamzaGbada Here you go:

https://doctr-static.mindee.com/models?id=v0.7.0/wildreceipt-dataset.jpg&src=0

@felixT2K
Contributor

It would be enough if you post the mentioned image here; we can update the docstring later :)
Do make style and make quality work now?

@HamzaGbada
Contributor Author

No, it returns:

isort .
make: isort: No such file or directory
make: *** [Makefile:12: style] Error 127

@felixT2K
Contributor

What happens if you run the following commands individually, without make?

isort .
black .
ruff --fix .

@@ -99,7 +99,7 @@ def __init__(
 img_path=os.path.join(tmp_root, img_path), geoms=np.asarray(box_targets, dtype=int).clip(min=0)
 )
 for crop, label in zip(crops, list(text_targets)):
-if not any(char in label for char in ["", "-", "*", "/", "=", "#", "@"]):
+if not any(char in label for char in ["", " "]):
Contributor

if label and " " not in label:
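
A short sketch of why the suggested condition matters. In Python the empty string is a substring of every string, so a membership check against a list containing `""` matches every label and the `not any(...)` test rejects all of them. The helper names `keep_with_any` and `keep_suggested` and the sample labels are made up for illustration.

```python
def keep_with_any(label: str) -> bool:
    """Condition from the diff: never True, because "" is contained in every string."""
    return not any(char in label for char in ["", " "])


def keep_suggested(label: str) -> bool:
    """Suggested condition: keep non-empty labels that contain no space."""
    return bool(label) and " " not in label


labels = ["TOTAL", "", "THANK YOU", "9.99", "-"]
print([label for label in labels if keep_with_any(label)])   # []
print([label for label in labels if keep_suggested(label)])  # ['TOTAL', '9.99', '-']
```

The suggested form keeps punctuation-only labels such as `-`, which matches the reviewer's note that those can be handled, and drops only empty labels and labels with spaces.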

@HamzaGbada
Contributor Author

Got it, the issue was related to my Linux distribution.

Contributor

@felixdittrich92 left a comment

Looks good now, thanks a lot 🤗

@odulcy-mindee
Collaborator

Thank you @HamzaGbada for this contribution ! 👏
Thanks @felixdittrich92 for the review !

@odulcy-mindee merged commit 7222fe8 into mindee:main Oct 27, 2023
66 of 68 checks passed
Labels: ext: tests, module: datasets, topic: documentation, type: new feature
4 participants