# LayoutLM_pytorch

LayoutLM, by Microsoft, is a text and layout understanding solution for document images. It is built on top of the BERT transformer architecture with two additional input embeddings. The first is a 2-D positional embedding that captures the relative position of each token on the page.

Taking form understanding as an example: given a key in a form (e.g., “Passport ID:”), its corresponding value is much more likely to appear to its right or below it than to its left or above it.

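A minimal sketch of how these extra embeddings can be added on top of BERT-style token embeddings (the hidden size, vocabulary size, and the 0–1000 coordinate grid are illustrative assumptions, not taken from this repo):

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Sketch: sum token, 1-D position and 2-D (layout) position embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1001):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)   # 1-D sequence position
        self.x_emb = nn.Embedding(max_coord, hidden)   # shared by x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)   # shared by y0 and y1

    def forward(self, input_ids, bbox):
        # bbox: (batch, seq_len, 4) integer tensor with coordinates scaled to [0, 1000]
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        emb = (self.word_emb(input_ids)
               + self.pos_emb(positions)
               + self.x_emb(bbox[..., 0]) + self.y_emb(bbox[..., 1])   # top-left corner
               + self.x_emb(bbox[..., 2]) + self.y_emb(bbox[..., 3]))  # bottom-right corner
        return emb
```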

The second is an image embedding of the token image regions that correspond to the text features. Since a document combines textual and visual information, the image embedding allows the model to capture information that is not normally present in the text itself, such as font style and boldness.

LayoutLM can be used to extract content and structure information from forms. Here the model is fine-tuned on the FUNSD dataset, which contains almost 200 scanned documents, over 9K semantic entities, and 31K+ words. Each semantic entity consists of a unique identifier, a label (header, question, or answer), and a bounding box.
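FUNSD annotations are JSON files whose entities roughly have the shape below (field names follow the public FUNSD release; the values are made up for illustration):

```json
{
  "id": 3,
  "label": "question",
  "box": [84, 109, 192, 124],
  "text": "Passport ID:",
  "words": [
    {"text": "Passport", "box": [84, 109, 150, 124]},
    {"text": "ID:", "box": [154, 109, 192, 124]}
  ],
  "linking": [[3, 4]]
}
```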

In pre-training, the document image is passed through an OCR engine (Tesseract OCR), which returns the recognized words along with their locations. The text tokens are passed into the LayoutLM architecture, while the location information is used to generate image embeddings for the corresponding token image regions using Faster R-CNN.
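As a hedged sketch of that pipeline, assuming pytesseract for the OCR step and a torchvision (≥ 0.13) Faster R-CNN backbone for the region features; the repo's own pre-processing code may differ:

```python
import pytesseract
import torch
import torchvision
from PIL import Image

def ocr_words_and_boxes(image_path):
    """Run Tesseract OCR and return the recognized words with pixel bounding boxes."""
    image = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words, boxes = [], []
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        if text.strip():
            words.append(text)
            boxes.append([x, y, x + w, y + h])
    return image, words, boxes

# Hypothetical region-feature extractor: pool Faster R-CNN backbone features
# over each word's bounding box to obtain a per-token image feature.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def token_image_features(image, boxes):
    # A production pipeline would first apply the detector's normalization and
    # resizing transform; it is skipped here for brevity.
    tensor = torchvision.transforms.functional.to_tensor(image)
    with torch.no_grad():
        feature_maps = detector.backbone(tensor.unsqueeze(0))   # FPN feature maps
        rois = [torch.tensor(boxes, dtype=torch.float32)]
        pooled = detector.roi_heads.box_roi_pool(
            feature_maps, rois, [tensor.shape[1:]])             # RoIAlign per word box
    return pooled  # one pooled feature map per word region
```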

For inference, a scanned document image is once again passed through Tesseract OCR to extract the text and location information. This information is used to generate the text, image, and positional embeddings that are passed to the model. For each token, the model predicts one of ['B-ANSWER', 'B-HEADER', 'B-QUESTION', 'E-ANSWER', 'E-HEADER', 'E-QUESTION', 'I-ANSWER', 'I-HEADER', 'I-QUESTION', 'O', 'S-ANSWER', 'S-HEADER', 'S-QUESTION']. The B, I, E, O, and S prefixes indicate whether the token is at the Beginning, Inside, or End of an entity, Outside any entity, or a Single-token entity.
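A minimal sketch of turning those per-token tags back into entity spans (a generic BIOES decoder; names and the example values are illustrative, not taken from infer.py):

```python
def decode_bioes(tokens, tags):
    """Group (token, tag) pairs such as ('Passport', 'B-QUESTION') into entities."""
    entities, current, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":                          # outside any entity
            current, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":                       # single-token entity
            entities.append((label, [token]))
            current, current_label = [], None
        elif prefix == "B":                     # start a new entity
            current, current_label = [token], label
        elif prefix in ("I", "E") and label == current_label:
            current.append(token)
            if prefix == "E":                   # entity is complete
                entities.append((label, current))
                current, current_label = [], None
    return entities

# decode_bioes(["Passport", "ID:", "X1234567"],
#              ["B-QUESTION", "E-QUESTION", "S-ANSWER"])
# -> [("QUESTION", ["Passport", "ID:"]), ("ANSWER", ["X1234567"])]
```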

### Results - Original

*(figure: original scanned document)*

### Output

*(figure: document with predicted entity labels)*

To run the model:

```
python model.py --model_dir PATH_TO_PARAMS_FILE
```

For inference:

```
python infer.py --model_dir PATH_TO_PARAMS_FILE
```

### params

The params.json file contains the model parameters (a hypothetical example is sketched below the list), including:

- number of epochs
- batch size and learning rate
- train/evaluation document folders
- inference file folder
- model and inference save folders
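A hypothetical params.json illustrating those fields (the key names and values are guesses for illustration; the repo's actual params.json is authoritative):

```json
{
  "epochs": 20,
  "batch_size": 8,
  "learning_rate": 5e-5,
  "train_data_dir": "data/training_data",
  "eval_data_dir": "data/testing_data",
  "infer_data_dir": "data/inference",
  "model_save_dir": "output/model",
  "inference_save_dir": "output/inference"
}
```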

### References