How does the ID recognition work?
One of the main quality-of-life features of Instructor Pilot is its ability to detect the student associated with each scanned submission in a matter of seconds. This article describes how this is made possible by recognizing handwritten IDs.
Below is a flowchart of the identification pipeline.
Counter-intuitively, converting the PDFs to images and cropping them is the bottleneck of the identification pipeline. We utilize the Python PDF library pymupdf to handle all PDF-related IO. In our testing, the creation of multiprocessing pools speeds up this process by at least a factor of two.
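For concreteness, here is a minimal sketch of this step. It assumes a recent PyMuPDF release that exposes the `pymupdf` module name; the file names, DPI, and default crop fraction are illustrative, not the app's exact settings.

```python
import pymupdf  # PyMuPDF; older releases expose the same API under `fitz`
from multiprocessing import Pool

def render_and_crop(pdf_path):
    """Render each page of one PDF and crop the region of interest."""
    images = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            rect = page.rect
            # Default region of interest: top 25%, left 50% of the page.
            clip = pymupdf.Rect(rect.x0, rect.y0,
                                rect.x0 + 0.5 * rect.width,
                                rect.y0 + 0.25 * rect.height)
            pix = page.get_pixmap(dpi=150, clip=clip)
            images.append(pix.tobytes("png"))
    return images

if __name__ == "__main__":
    pdf_paths = ["submission_01.pdf", "submission_02.pdf"]  # hypothetical inputs
    with Pool() as pool:  # one worker per CPU core by default
        cropped_pages = pool.map(render_and_crop, pdf_paths)
```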
The problem of detecting a row of N connected cells containing the N handwritten digits lies in the general domain of table detection. We choose to avoid requiring a template PDF, which would make the detection trivial, in order to maximize the generality of the pipeline and the ease of use for the end user. Our approach makes the following assumption:
The row of cells is a closed contour that has 2 long, mostly horizontal lines and 2 short, mostly vertical lines. No larger closed contour with these properties exists in the region of interest.
Therefore, one needs to avoid adding rectangular contours of similar aspect ratio and size close to the ID input. By default, the region of interest (i.e. the crop box) is defined to be the top 25%, left 50% of the page, but the user interface allows choosing any rectangular crop box, and even separate regions for each page. We make heavy use of OpenCV at this stage.
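The following is a hedged sketch of such a contour search, not the exact implementation; the Otsu binarization and the aspect-ratio bound of 4 are assumptions made for illustration, and the input is assumed to be a greyscale crop.

```python
import cv2
import numpy as np

def find_id_row(crop_gray):
    """Return the bounding box (x, y, w, h) of the best candidate row contour."""
    # Binarize with Otsu so the table lines become white on black.
    _, binary = cv2.threshold(crop_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    best, best_area = None, 0.0
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Look for a wide, shallow rectangle: two long, mostly horizontal
        # sides and two short, mostly vertical sides; keep the largest one.
        if w / max(h, 1) > 4 and w * h > best_area:
            best, best_area = (x, y, w, h), float(w * h)
    return best
```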
We can now divide the contour into N cells to extract the digit images. Before doing so, however, it is helpful to remove the inner and outer borders of the table. One has to be careful here, especially when removing the vertical lines: if the handwritten digit is a "9", there is a danger of confusing the mostly vertical right side of the digit with a cell border, and if that side is removed the digit could start resembling a "0". We therefore choose to be very conservative and remove only the outermost portions of the borders at this stage. We take additional actions regarding the borders during digit preprocessing and vision model training.
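A minimal sketch of the cell split with conservative trimming; the equal-width cells and the 3-pixel margin are illustrative assumptions.

```python
import numpy as np

def split_into_cells(row_img, n_digits, margin=3):
    """Split the detected row into n_digits cells, trimming a thin outer margin."""
    h, w = row_img.shape[:2]
    cell_w = w // n_digits
    cells = []
    for i in range(n_digits):
        cell = row_img[:, i * cell_w:(i + 1) * cell_w]
        # Conservative border removal: drop only the outermost pixels, so
        # strokes like the right side of a "9" survive intact.
        cells.append(cell[margin:h - margin, margin:cell_w - margin])
    return cells
```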
We have found that the trickiest aspect of the pipeline is preparing the located digits before they are fed to the digit recognition model. We have to make sure that their properties closely resemble those of the training set used for the recognition model. Since our training set is based on the MNIST dataset, our digit preprocessing attempts to bring them as close as possible to MNIST. The MNIST dataset consists of 28x28 grey-scale images in which the digit is completely isolated, its center of mass is at the center of the image, and it fits in a 20x20 pixel box. We convert our images to grey-scale. In our case it is not feasible to completely isolate each digit, since it is not possible to perfectly remove the cell borders. To improve the situation, we examine all the contours found in each digit image and keep only those whose center of mass is at least a couple of pixels away from the four edges of the image. Additionally, and crucially, as we will discuss below, we add features resembling "border fragments" to our MNIST training images, which helps the neural net learn to ignore these border features.
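The contour filter could look roughly like the sketch below; the 2-pixel distance is an illustrative stand-in for the "couple of pixels" mentioned above, and the input is assumed to be a binarized digit image.

```python
import cv2
import numpy as np

def drop_edge_contours(digit, min_dist=2):
    """Keep only contours whose center of mass is at least min_dist pixels
    away from every edge; leftover border fragments are discarded."""
    h, w = digit.shape
    contours, _ = cv2.findContours(digit, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    keep = np.zeros_like(digit)
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] == 0:
            continue
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
        if min_dist <= cx <= w - min_dist and min_dist <= cy <= h - min_dist:
            cv2.drawContours(keep, [c], -1, 255, thickness=cv2.FILLED)
    return keep
```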
We also apply a set of morphological transformations to the digits in an attempt to restore pixel-level details that might have been cropped out or removed. Some of these transformations are also equivalent to trying multiple font weights.
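As an illustration, dilation and erosion with a small kernel act like trying heavier or lighter stroke weights; the kernel size here is a guess, not the app's exact choice.

```python
import cv2
import numpy as np

kernel = np.ones((2, 2), np.uint8)

def stroke_variants(digit):
    """Return the digit plus thicker and thinner morphological variants."""
    return [digit,
            cv2.dilate(digit, kernel, iterations=1),  # heavier stroke weight
            cv2.erode(digit, kernel, iterations=1)]   # lighter stroke weight
```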
Finally, we resize and center the digit contours. Resizing with anti-aliasing converts the 0/1 black & white images to grey-scale, which matches the properties of MNIST.
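A sketch of this MNIST-style normalization; the interpolation choice and shift logic are illustrative rather than the app's exact ones.

```python
import cv2
import numpy as np

def to_mnist_frame(digit):
    """Fit the digit into a 20x20 box inside a 28x28 canvas and shift it so
    that its center of mass lands at the image center, as in MNIST."""
    ys, xs = np.nonzero(digit)
    if ys.size == 0:
        return np.zeros((28, 28), dtype=np.uint8)
    cropped = digit[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = cropped.shape
    scale = 20.0 / max(h, w)
    # INTER_AREA resizing anti-aliases, turning the 0/1 image into grey-scale.
    resized = cv2.resize(cropped,
                         (max(1, round(w * scale)), max(1, round(h * scale))),
                         interpolation=cv2.INTER_AREA)
    canvas = np.zeros((28, 28), dtype=np.uint8)
    rh, rw = resized.shape
    y0, x0 = (28 - rh) // 2, (28 - rw) // 2
    canvas[y0:y0 + rh, x0:x0 + rw] = resized
    # Translate so the center of mass sits at pixel (14, 14).
    m = cv2.moments(canvas)
    if m["m00"] > 0:
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
        shift = np.float32([[1, 0, 14 - cx], [0, 1, 14 - cy]])
        canvas = cv2.warpAffine(canvas, shift, (28, 28))
    return canvas
```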
Now our 28x28 digits are ready to be fed into the convolutional neural network (CNN). The details of our convolutional architecture closely follow this guide, while our MNIST data augmentation is largely based on this guide. While it is true that "MNIST is a solved problem", we found data augmentation to be a crucial step in reaching 97%+ accuracy on our real-world dataset. We add linear segments with a small probability to the edges of the MNIST images and apply some salt-and-pepper noise.
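The augmentation idea can be sketched as follows; the probabilities, edge columns, and fragment lengths are illustrative stand-ins for the values used in the notebook.

```python
import numpy as np

rng = np.random.default_rng()

def augment(img, p_border=0.2, p_noise=0.02):
    """Add a fake border fragment near an edge with probability p_border,
    then sprinkle salt-and-pepper noise."""
    out = img.copy()
    if rng.random() < p_border:
        # Short vertical segment hugging the left or right edge, mimicking
        # a leftover piece of a cell border.
        col = int(rng.choice([0, 1, 26, 27]))
        y0 = int(rng.integers(0, 20))
        out[y0:y0 + int(rng.integers(4, 9)), col] = 255
    mask = rng.random(out.shape)
    out[mask < p_noise / 2] = 0          # pepper
    out[mask > 1 - p_noise / 2] = 255    # salt
    return out
```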
One can reproduce (up to a random seed) the weights of our model by following the model_digits.ipynb notebook.
In order to make the app more lightweight and to load the trained model faster for inference, it is crucial to export the trained model to a format that is easier to serve. We export the model to the ONNX format and serve it server-side with ONNX Runtime. This makes the model quite fast even on a CPU, to the point that the initial PDF IO is currently the bottleneck of the pipeline.
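A minimal sketch of this export-and-serve round trip, using a stand-in `nn.Sequential` in place of the real architecture from model_digits.ipynb:

```python
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn

# Stand-in CNN; the real architecture lives in model_digits.ipynb.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 28 * 28, 10), nn.Softmax(dim=1),
)
model.eval()

# Export with a dynamic batch axis so any number of digits can be scored.
dummy = torch.zeros(1, 1, 28, 28)
torch.onnx.export(model, dummy, "digits.onnx",
                  input_names=["input"], output_names=["probs"],
                  dynamic_axes={"input": {0: "batch"}, "probs": {0: "batch"}})

# Serve on CPU with ONNX Runtime.
session = ort.InferenceSession("digits.onnx",
                               providers=["CPUExecutionProvider"])
batch = np.zeros((8, 1, 28, 28), dtype=np.float32)  # e.g. an 8-digit ID
(probs,) = session.run(["probs"], {"input": batch})
print(probs.shape)  # (8, 10): one categorical distribution per digit
```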
Now that we have the probabilities of the categorical distribution for each of the N digit images, we can compare them to the student IDs (remember, each ID is of length N). Here we make another assumption:
The IDs are i.i.d. draws from a discrete uniform distribution on the integer interval from 0 to 10^N - 1.
This means that each ID has an a priori probability of 10^{-N} of appearing, and for large enough values of N and a small number of students, the probability of near-miss collisions is very small. For example, at the University of Florida we use uniform 8-digit IDs. Even near-misses (cases where most digits of two student IDs match) are safe-guarded against by refusing to mark as identified any submission with multiple confident student detections. We have set the detection probability threshold for an ID to be 10^{4-N}, i.e. ten thousand times greater than the random probability. As a side note, this means that IDs of four digits or fewer would never be identified.
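As a sketch of how this matching could work under the assumption above (the array shapes and the single-match rule are illustrative):

```python
import numpy as np

def identify(probs, student_ids):
    """probs: (N, 10) array of per-digit categorical probabilities.
    student_ids: roster of N-digit ID strings. Returns the matched ID,
    or None when zero or multiple IDs clear the confidence threshold."""
    n = probs.shape[0]
    threshold = 10.0 ** (4 - n)  # 10^{4-N}: 10,000x the random probability
    confident = []
    for sid in student_ids:
        # The probability of an ID is the product of its digits' probabilities.
        p = float(np.prod([probs[i, int(d)] for i, d in enumerate(sid)]))
        if p > threshold:
            confident.append(sid)
    # Refuse to identify on multiple confident detections (near-miss guard).
    return confident[0] if len(confident) == 1 else None
```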