How to have new Vocab and its Training #608

osman-aktepe · 2021-11-11T23:19:50Z

osman-aktepe
Nov 11, 2021

Hi all,

First of all, I tested your project and I admire it. I want to add a Turkish vocab and retrain the recognition model.

My question is:

Can I use the classification training under the references section, or do I have to use the recognition training with prepared data? If the answer is recognition, do you have a data generator that produces that format from text?

I could not find any documentation, and I did not see an explanation in the discussions. If one exists, please point me to it, and forgive me for missing it.

Regards

charlesmindee · 2021-11-12T09:24:13Z

charlesmindee
Nov 12, 2021
Maintainer

Hi @osman-aktepe, thank you for your interest in docTR!

If you want to train with a Turkish vocab, we first need to integrate this vocab into docTR. Then you need to retrain a recognition model with this vocab on a Turkish dataset (images of word boxes + corresponding annotations).
If you don't have such a dataset, you can either collect pictures of Turkish words and annotate them manually (or pass them through another OCR to get annotations), or generate a fully synthetic dataset by writing words on images with different fonts, sizes, colors, ... In that case you get the annotations directly, because you know the words you just drew.

I hope this answers your question! 😄
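
For readers who want to try the synthetic route, here is a minimal sketch using Pillow. It is not docTR's own generator: `render_word` and `build_dataset` are hypothetical helpers, and the output layout (an `images/` folder plus a `labels.json` mapping file name to word) is an assumption about what the recognition training script expects, so check the training reference before relying on it. The colour ranges are arbitrary.

```python
import json
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont


def render_word(word, font, fg=(0, 0, 0), bg=(255, 255, 255), pad=4):
    """Draw `word` with `font` and return a tightly cropped RGB image."""
    # Draw on an oversized grayscale mask first, then crop to the inked area.
    # The canvas size is a rough guess that is assumed to be large enough.
    size = getattr(font, "size", 16)
    mask = Image.new("L", (size * (len(word) + 4), size * 4), 0)
    ImageDraw.Draw(mask).text((size, size), word, fill=255, font=font)
    box = mask.getbbox()  # None when nothing was drawn
    if box is None:
        raise ValueError(f"font drew nothing for {word!r}")
    glyphs = mask.crop(box)
    img = Image.new("RGB", (glyphs.width + 2 * pad, glyphs.height + 2 * pad), bg)
    # Paste the ink colour through the glyph mask onto the background.
    img.paste(Image.new("RGB", glyphs.size, fg), (pad, pad), mask=glyphs)
    return img


def build_dataset(words, font, out_dir, seed=0):
    """Render every word and write images/ plus labels.json (assumed layout)."""
    rng = random.Random(seed)
    images_dir = Path(out_dir) / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    labels = {}
    for i, word in enumerate(words):
        # Light background and dark ink, both varied a little per image.
        bg = tuple(rng.randint(200, 255) for _ in range(3))
        fg = tuple(rng.randint(0, 80) for _ in range(3))
        name = f"word_{i:06d}.png"
        render_word(word, font, fg=fg, bg=bg).save(images_dir / name)
        labels[name] = word  # the annotation is simply the word that was drawn
    with open(Path(out_dir) / "labels.json", "w", encoding="utf-8") as f:
        json.dump(labels, f, ensure_ascii=False, indent=2)
    return labels
```

In practice you would load several fonts that actually contain the Turkish glyphs, for example with `ImageFont.truetype("path/to/font.ttf", 28)` (the path is a placeholder), and loop over fonts and sizes as well as words.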

7 replies

osman-aktepe Nov 14, 2021
Author

Hi @charlesmindee,

I would like to clarify a few things as well:

Is it enough to add an extra line for the Turkish vocab here: https://github.com/mindee/doctr/blob/b27c3a664e99989e11d7bc527278abc33a539db0/doctr/datasets/vocabs.py, or do I need to do something more?
Are 500-600 images enough, or should I create more?
Should I create words with punctuation?
Thanks

charlesmindee Nov 15, 2021
Maintainer

Hi @osman-aktepe,

Yes, it is enough to add the extra line there (with a PR).
I think 500-600 word images are too few. We use several million on our side, but a few thousand, tens of thousands, or hundreds of thousands may be enough. It depends on the maximum word length you want your model to be robust to: if you want it to predict 20-character words with many different characters in your vocab, you really need a lot of data.
If you want your model to learn punctuation, you need to insert punctuation into your words, for instance: "home:", "coffee.", "..."
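
To make the vocab and punctuation points concrete, here is a small standard-library sketch. The character set in `TURKISH_VOCAB` is only a guess at what such a vocab line might contain, not the entry that was merged into docTR, and `out_of_vocab` and `add_punctuation` are hypothetical helpers for preparing labels.

```python
import random
import string

# Assumed vocab: ASCII letters, digits and punctuation plus the
# Turkish-specific letters. The real docTR entry may differ.
TURKISH_EXTRA = "çğıöşüÇĞİÖŞÜ"
TURKISH_VOCAB = (
    string.ascii_letters + string.digits + string.punctuation + TURKISH_EXTRA
)


def out_of_vocab(word, vocab=TURKISH_VOCAB):
    """Return the characters of `word` that the vocab cannot encode."""
    return sorted(set(word) - set(vocab))


def add_punctuation(word, rng, prob=0.3, marks=".,:;!?"):
    """With probability `prob`, append one punctuation mark to `word`."""
    if rng.random() < prob:
        return word + rng.choice(marks)
    return word
```

Running `out_of_vocab` over every label before training is a cheap way to catch words the model could never predict, and `add_punctuation` gives samples such as "home:" or "coffee." without a separate word list.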

osman-aktepe Nov 17, 2021
Author

Hi @charlesmindee,

I know I am bothering you, but I am now writing a synthetic dataset generator and have some questions. I am trying to follow your guidance and do not want to do something ridiculous :)

Should the overall image size be 32 x 128? For example, should I fit 20 characters into 32 x 128?
If I create a bigger image, will it be resized or split?
Do you have a small dataset I could download to understand what kinds of backgrounds and images I should create?

Thanks for your help

charlesmindee Nov 17, 2021
Maintainer

Hi @osman-aktepe,
This is a sample of our dataset:
images.zip
We then resize each image to 32 x 128 (the input size of the recognition model) when we feed the model. We have a preprocessing pipeline that performs resizing, normalization, standardization, image augmentation, and batching. You can either train with our script and our pipeline, or create your own preprocessing pipeline. 😄
I hope this answers your question!
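
As a rough illustration of what resizing to 32 x 128 means for long words, the sketch below computes the resized dimensions for an aspect-preserving fit with padding. Whether the docTR pipeline preserves the aspect ratio and pads, or simply stretches the image, is not stated in this thread, so treat that as an assumption; `fit_to_input` is a hypothetical helper, not part of docTR.

```python
def fit_to_input(width, height, target_h=32, target_w=128):
    """Return (new_w, new_h, pad_right, pad_bottom) for an aspect-preserving fit."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the largest scale at which both sides still fit in the target box.
    scale = min(target_w / width, target_h / height)
    new_w = max(1, min(target_w, round(width * scale)))
    new_h = max(1, min(target_h, round(height * scale)))
    # The remaining space would be filled with padding.
    return new_w, new_h, target_w - new_w, target_h - new_h
```

For a 600 x 40 crop of a 20-character word, this gives a 128 x 9 image, roughly 6 pixels of width per character, which is one way to see why long words are harder and need more training data.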

osman-aktepe Nov 17, 2021
Author

Hi @charlesmindee ,

This explains everything :) I will use your pipeline.

Thanks a lot
