FUNSD Datasets

Table of contents

FUNSD Datasets
References

This repository's content is intended to provide a guide on the original FUNSD dataset and its versions. Alongside the original dataset, there are two additional datasets derived from the it: the FUNSD Revised Dataset and the FUNSD+ Dataset.

The FUNSD (Form Understanding in Noisy Scanned Documents) dataset is commonly used to understand documents and extract information from scanned documents. It is particularly useful for tasks involving Text Detection, Optical Character Recognition, Spatial Layout Analysis, and Form Understanding.

Figure 1: Example of forms from the FUNSD dataset from https://guillaumejaume.github.io/FUNSD/img/two_forms.png. The labels indicate different types of information: questions (blue), answers (green), headers (yellow), and other elements (orange).

FUNSD Dataset (Original) [1] ↑

The FUNSD dataset was published by Jaume et al. (2019) for form understanding in noisy scanned documents. It is available on Guillaume Jaume's homepage and is licensed for non-commercial, research, and educational purposes.

The FUNSD dataset consists of 199 document images, which is a subset sampled from the RVL-CDIP dataset introduced by Harley et al. (2015)[4]. The RVL-CDIP dataset comprises 400,000 grayscale images of various documents from the 1980s and 1990s. These scanned documents have a low resolution of 100 dpi and suffer from low quality due to various types of noise introduced by successive scanning and printing procedures. The RVL-CDIP dataset categorizes its images into four classes: letter, email, magazine, and form.

The authors of the FUNSD dataset manually reviewed 25,000 images from the "form" category, discarding unreadable and duplicate images. This process resulted in a refined set of 3,200 images, from which 199 images were randomly sampled for annotation to create the FUNSD dataset. The RVL-CDIP dataset is itself a subset of another dataset called the Truth Tobacco Industry Document (TTID)[5], an archive collection of scientific research, marketing, and advertising documents from the largest tobacco companies in the US.

Data statistics

In the original FUNSD dataset, the metadata of each image is contained in a separate JSON file, in which the form is represented as a list of interlinked semantic entities. Each entity consists of a group of words that belong together both semantically and spatially.

Each semantic entity is associated with:

a unique identifier,
a label which can be header, question, answer, or other,
a bounding box,
a list of words belonging to said entity,
and a list of links with other entities.

Each word is described by its contextual content: a OCR label and its bounding box.

Please keep in mind the following information:

the question and answer labels are similar to key and value;
the bounding boxes are represented as [left, top, right, bottom];
and the links are formatted as a pair of entity identifiers, with the identifier of the current entity being the first element.

The dataset contains 199 images, with over 30,000 word-level annotations and approximately 10,000 entities, and it was pre-divided into a training set of 149 images and a testing set of 50 images.

Getting the Dataset

The FUNSD dataset can be downloaded from the original source.
We generated annotations from Azure AI Document Intelligence service for the images in the FUNSD dataset.
You also can use git with dvc to download the dataset containing the Azure annotations.:
```
# in your virtual environment
pip install dvc[ssh]
dvc get https://github.com/crcresearch/FUNSD datasets/FUNSD
```
Note: After running the last command, you may be prompted to enter your password to SSH into the CRC Cluster.
You can also download it from the following link: https://www.crc.nd.edu/~pmoreira/funsd.zip
To plot the annotations on the image, you can use the following notebook: FUNSD notebook Instructions
Find the code used to generate the Azure annotations at here.

FUNSD Revised Dataset [2] ↑

FUNSD Revised notebook Instructions

Vu et al. (2020) in [2] reported several inconsistencies in labeling, which could hinder the applicability of FUNSD to the key-value extraction problem. They revised the dataset by correcting the labels and adding new annotations.

FUNSD+ Dataset [3] ↑

The FUNSD+ dataset is an enhanced version of the original FUNSD (Form Understanding in Noisy Scanned Documents) dataset, designed for more comprehensive document understanding tasks.

FUNSD+ notebook Instructions

Get the Dataset via DVC

# in your virtual environment
pip install dvc[ssh]
dvc get https://github.com/crcresearch/FUNSD datasets/FUNSD_plus

Note: After running the last command, you may be prompted to enter your password to SSH into the CRC Cluster.

References

[1] Jaume, Guillaume, Ekenel, Hazim Kemal, and Thiran, Jean-Philippe. "FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents," 2019. https://arxiv.org/pdf/1905.13538

[2] Vu, Hieu M., and Nguyen, Diep Thi-Ngoc. "Revising FUNSD Dataset for Key-Value Detection in Document Images," 2020. https://arxiv.org/pdf/2010.05322

[3] Zagami, Davide, and Helm, Christopher. "FUNSD+: A Larger and Revised FUNSD Dataset," 2022. https://konfuzio.com/de/funsd-plus/

[4] A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015. https://arxiv.org/abs/1502.07058

[5] Truth Tobacco Industry Documents Library. https://www.industrydocuments.ucsf.edu/tobacco

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
.dvc		.dvc
data		data
datasets		datasets
nbs		nbs
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pdm.lock		pdm.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

FUNSD Datasets

FUNSD Dataset (Original) [1] ↑

Data statistics

Getting the Dataset

FUNSD Revised Dataset [2] ↑

FUNSD+ Dataset [3] ↑

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

crcresearch/FUNSD

Folders and files

Latest commit

History

Repository files navigation

FUNSD Datasets

FUNSD Dataset (Original) [1] ↑

Data statistics

Getting the Dataset

FUNSD Revised Dataset [2] ↑

FUNSD+ Dataset [3] ↑

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages