Beginner questions #1139
-
Hello everyone, first of all thank you for this excellent software! I'm a beginner in the field of machine learning and have played around with doctr for a while now, with some pretty good results on my own documents. However, since I'm trying to analyse documents in German, I quickly realized that I would need to do some training so that the German special characters äöüß are recognized as well. Over the last few days I experimented with the packaged training scripts and tried to train the `crnn_vgg16_bn` model. I got my training scripts to work and also wrote some scripts to test the results on some of my documents. So far the results ranged from "a bit worse" (when using the pretrained model as a base for training) to "a lot worse" (when starting from scratch). Since training is quite time- (and power-) consuming, I wanted to kindly ask whether my approach is generally OK or if I'm completely on the wrong track here:
So far I tried:
1. Fine-tuning, starting from the pretrained `crnn_vgg16_bn` model.
2. Training the same architecture from scratch.
In both runs I used the standard settings from the packaged training script (except setting the vocab to `german`, of course) and reached a validation loss of 0.00592982 (Exact: 97.36% | Partial: 97.36%) in the second run, and pretty similar numbers in the first. I know there are lots of parameters that could be changed; maybe you could guide me a bit as to what I should change in my general approach, or whether I simply need more training runs, more training data, a better validation set, or something else entirely. Thank you for your input! :)
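For reference, here is roughly how the German vocab is selected (a minimal sketch following the packaged recognition training script; `pretrained` toggles between my two runs, and everything beyond that is a placeholder):

```python
from doctr.datasets import VOCABS
from doctr.models import recognition

# VOCABS["german"] extends the default character set with ä, ö, ü, ß.
# pretrained=True corresponds to run 1 (fine-tuning the released weights);
# pretrained=False corresponds to run 2 (training from scratch).
model = recognition.crnn_vgg16_bn(pretrained=True, vocab=VOCABS["german"])
```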
-
Hi @DrRSatzteil 👋,
3. This question is hard to answer (it depends on the model you want to train, the data quality, and whether it's a regular text scene, an irregular text scene, or handwritten).
Cheers ✌️
-
Hi @DrRSatzteil,
-
I'm currently experiencing a problem with my new training data and I don't understand what I'm doing wrong. Training goes smoothly with only regular words, but my current dataset also contains a lot of other characters (mainly punctuation, but also currency symbols, the paragraph symbol §, URLs and the like). Whenever I include those, I get an error:
Since everything works fine with only regular words, I'm pretty sure that some of my training data is causing this. However, I don't really know how to find out which particular samples lead to this issue. Does anyone have a clue what might be causing it? I have a feeling it might be related to the length of my words; there are some pretty long URLs in the training set.
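One way to narrow this down might be to scan the labels for characters outside the vocab and for unusually long targets before training. Here is a small sketch, assuming the recognition dataset's `labels.json` format (image name → label string); `MAX_LEN` is a hypothetical cutoff, since the real limit depends on the model's output sequence length:

```python
import json

from doctr.datasets import VOCABS

vocab = set(VOCABS["german"])
MAX_LEN = 32  # hypothetical cutoff: CTC fails if the target is longer than the output sequence

with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

for img_name, word in labels.items():
    unknown = sorted(set(word) - vocab)
    if unknown:
        print(f"{img_name}: characters not in vocab: {unknown}")
    if len(word) > MAX_LEN:
        print(f"{img_name}: label is {len(word)} chars, may exceed the sequence length")
```

If only a handful of samples show up, dropping or cleaning them should quickly tell whether the special characters or the long URLs are the culprit.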