Train master model on multiple language #1103

khawar-islam · 2022-11-14T00:31:02Z


I hope everything is going well. My documents contain Korean characters, English characters, and digits. I would like to train a model with vocab = Korean + English + digits + alphanumeric, but I only have a Korean dataset (9M images).

How can I collect English, digit, and alphanumeric data? Do I need to collect it myself, or is there another way to train the model on English + digits + alphanumeric?
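
The combined vocabulary described above can be pictured as a single string of unique characters. The sketch below is only an illustration under stated assumptions, not docTR's own VOCABS definition: the Korean part is taken from the Unicode Hangul Syllables block (U+AC00 to U+D7A3), and the helper name build_vocab is hypothetical.

```python
import string

# Hypothetical helper, not part of docTR: builds one vocab string for
# Korean + English + digits. "Alphanumeric" is already covered by
# letters + digits, so it needs no separate character set.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # Unicode Hangul Syllables block


def build_vocab(extra: str = "") -> str:
    korean = "".join(chr(cp) for cp in range(HANGUL_FIRST, HANGUL_LAST + 1))
    parts = string.digits + string.ascii_letters + korean + extra
    # dict.fromkeys drops duplicates while keeping first-seen order
    return "".join(dict.fromkeys(parts))


vocab = build_vocab(extra=string.punctuation)
print(len(vocab))
```

Restricting the Korean part to the characters that actually occur in the 9M-image dataset would give a smaller vocab; whether that is preferable depends on the documents being targeted.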

felixdittrich92 · 2022-11-14T07:34:05Z

Maintainer

Hi @khawar-islam 👋,

We have already implemented the MJSynth and SynthText datasets, which are commonly used to train English recognition models from scratch. More information:
https://mindee.github.io/doctr/using_doctr/using_datasets.html
https://mindee.github.io/doctr/modules/datasets.html#doctr.datasets.MJSynth
https://mindee.github.io/doctr/modules/datasets.html#doctr.datasets.SynthText

With recognition_task=True you can easily adapt the training script to load directly from them. (NOTE: SynthText will save the crops on your local machine, which takes ~9GB of space and some time.)

Changes in the training script (MJSynth example; the same changes apply to SynthText, and you could also merge both datasets):

from doctr.datasets import VOCABS, DataLoader, RecognitionDataset, WordGenerator, MJSynth

if isinstance(args.val_path, str):
    with open(os.path.join(args.val_path, 'labels.json'), 'rb') as f:
        val_hash = hashlib.sha256(f.read()).hexdigest()

    # Load val data generator
    val_set = RecognitionDataset(
        img_folder=os.path.join(args.val_path, 'images'),
        labels_path=os.path.join(args.val_path, 'labels.json'),
        img_transforms=T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
    )
else:
    val_hash = None
    # Original synthetic data generator, disabled in favour of MJSynth below
    """
    val_set = WordGenerator(
        vocab=vocab,
        min_chars=args.min_chars,
        max_chars=args.max_chars,
        num_samples=args.val_samples * len(vocab),
        font_family=fonts,
        img_transforms=T.Compose([
            T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
            # Ensure we have a 90% split of white-background images
            T.RandomApply(T.ColorInversion(), 0.9),
        ])
    )
    """
    # Paths are specific to my machine; point them at your local MJSynth copy
    val_set = MJSynth(
        train=False,
        img_folder='/home/felix/.cache/doctr/datasets/mjsynth/mnt/ramdisk/max/90kDICT32px',
        label_path='/home/felix/.cache/doctr/datasets/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt',
        img_transforms=T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
    )

if isinstance(args.train_path, str):
    # Load train data generator
    base_path = Path(args.train_path)
    parts = [base_path] if base_path.joinpath('labels.json').is_file() else [
        base_path.joinpath(sub) for sub in os.listdir(base_path)
    ]
    with open(parts[0].joinpath('labels.json'), 'rb') as f:
        train_hash = hashlib.sha256(f.read()).hexdigest()

    train_set = RecognitionDataset(
        parts[0].joinpath('images'),
        parts[0].joinpath('labels.json'),
        img_transforms=T.Compose([
            T.RandomApply(T.ColorInversion(), .1),
            T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
            # Augmentations
            T.RandomJpegQuality(60),
            T.RandomSaturation(.3),
            T.RandomContrast(.3),
            T.RandomBrightness(.3),
        ]),
    )
    # Merge any additional subfolders into the first dataset
    if len(parts) > 1:
        for subfolder in parts[1:]:
            train_set.merge_dataset(RecognitionDataset(
                subfolder.joinpath('images'), subfolder.joinpath('labels.json')))
else:
    train_hash = None
    # Original synthetic data generator, disabled in favour of MJSynth below
    """
    train_set = WordGenerator(
        vocab=vocab,
        min_chars=args.min_chars,
        max_chars=args.max_chars,
        num_samples=args.train_samples * len(vocab),
        font_family=fonts,
        img_transforms=T.Compose([
            T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
            # Ensure we have a 90% split of white-background images
            T.RandomApply(T.ColorInversion(), 0.9),
            T.RandomJpegQuality(60),
            T.RandomSaturation(.3),
            T.RandomContrast(.3),
            T.RandomBrightness(.3),
        ])
    )
    """
    # Paths are specific to my machine; point them at your local MJSynth copy
    train_set = MJSynth(
        train=True,
        img_folder='/home/felix/.cache/doctr/datasets/mjsynth/mnt/ramdisk/max/90kDICT32px',
        label_path='/home/felix/.cache/doctr/datasets/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt',
        img_transforms=T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
    )

Side note: I currently have a lot of other things to do, which is why I have had to pause my contributions to docTR for now.

35 replies

felixdittrich92 Feb 23, 2023
Maintainer

You can copy the snippet I posted into a test.py and run it (the line from doctr.datasets import SynthText is missing from the snippet).

khawar-islam Feb 23, 2023
Author

Thanks @felixdittrich92, I did it and am now waiting for the extraction to finish.

felixdittrich92 Feb 23, 2023
Maintainer

Nice :)

felixdittrich92 Feb 23, 2023
Maintainer

Afterwards you can use your training script as before :)

khawar-islam Mar 6, 2023
Author

@felixdittrich92 thank you, the problem is solved. I will train a model soon, once the DataParallel code is done.

We no longer use the --fonts argument in our training script because of the multiple languages. Do we still need it?
The variable from fonts = args.font.split(",") is no longer used anywhere.
