Beginner questions #1139
-
Hello everyone, first of all thank you for this excellent software! I'm a beginner in the field of machine learning and have played around with doctr for a while now, with some pretty good results on my own documents. However, since I'm trying to analyse documents in German, I quickly realized that I would need to do some training so that the German special characters äöüß are recognized as well. Over the last few days I experimented with the packaged training scripts and tried to train the `crnn_vgg16_bn` model. I got my training scripts to work and also wrote some scripts to test the results on some of my documents. So far the results ranged from "a bit worse" (when using the pretrained model as a base for training) to "a lot worse" (when starting from scratch). Since training is quite time- (and power-) consuming, I wanted to kindly ask whether my approach is generally OK or if I'm completely on the wrong track here:
So far I tried:
1. Fine-tuning, starting from the pretrained `crnn_vgg16_bn` model.
2. Training the same architecture from scratch.
In both runs I used the standard settings from the packaged training script (except setting the vocab to `german`, of course) and reached a validation loss of 0.00592982 (Exact: 97.36% | Partial: 97.36%) in the second run, and pretty similar numbers in the first. I know there are lots of parameters that could be changed; maybe you could guide me a bit as to what I should change in my general approach, or whether I simply need more training runs, more training data, a better validation set, or something else entirely. Thank you for your input! :)
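For reference, here is roughly how the German vocab is selected (a minimal sketch following the packaged recognition training script; `pretrained` toggles between my two runs, and everything beyond that is a placeholder):

```python
from doctr.datasets import VOCABS
from doctr.models import recognition

# VOCABS["german"] extends the default character set with ä, ö, ü, ß.
# pretrained=True corresponds to run 1 (fine-tuning the released weights);
# pretrained=False corresponds to run 2 (training from scratch).
model = recognition.crnn_vgg16_bn(pretrained=True, vocab=VOCABS["german"])
```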
-
Hi @DrRSatzteil 👋,
3. This question is hard to answer (it depends on the model you want to train, the data quality, and whether it's a regular text scene, an irregular text scene, or handwritten).
Cheers ✌️
-
Hi @DrRSatzteil,
-
I'm currently experiencing a problem with my new training data and I don't understand what I'm doing wrong. Training goes smoothly with only regular words, but my current dataset also contains a lot of other characters (mainly punctuation, but also currency symbols, the paragraph symbol §, URLs and the like). Whenever I include those, I get an error:
Since everything works fine with only regular words, I'm pretty sure that some of my training data is causing this. However, I don't really know how to find out which particular samples lead to this issue. Does anyone have a clue what might be causing it? I have a feeling it might be related to the length of my words; there are some pretty long URLs in the training set.
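One way to narrow this down might be to scan the labels for characters outside the vocab and for unusually long targets before training. Here is a small sketch, assuming the recognition dataset's `labels.json` format (image name → label string); `MAX_LEN` is a hypothetical cutoff, since the real limit depends on the model's output sequence length:

```python
import json

from doctr.datasets import VOCABS

vocab = set(VOCABS["german"])
MAX_LEN = 32  # hypothetical cutoff: CTC fails if the target is longer than the output sequence

with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

for img_name, word in labels.items():
    unknown = sorted(set(word) - vocab)
    if unknown:
        print(f"{img_name}: characters not in vocab: {unknown}")
    if len(word) > MAX_LEN:
        print(f"{img_name}: label is {len(word)} chars, may exceed the sequence length")
```

If only a handful of samples show up, dropping or cleaning them should quickly tell whether the special characters or the long URLs are the culprit.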