<html>
<head>
<meta name="robots" content="noindex">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Rainfall Rescue Writeup</title>
<link href="style.css" rel="stylesheet">
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-174660312-1"></script>
<script>
window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-174660312-1');
</script>
</head>
<body>
<div class="container">
<h1 class="font-proper">
Rainfall Rescue - A Writeup
</h1>
<div class="profile">
<picture>
<source type="image/webp" srcset="images/webp/profile_1x.webp 1x, images/webp/profile_1.5x.webp 1.5x, images/webp/profile_2x.webp 2x, images/webp/profile_2.5x.webp 2.5x, images/webp/profile_3x.webp 3x">
<source type="image/jpeg" srcset="images/profile_1x.jpg 1x, images/profile_1.5x.jpg 1.5x, images/profile_2x.jpg 2x, images/profile_2.5x.jpg 2.5x, images/profile_3x.jpg 3x">
<img src="images/profile_1x.jpg">
</picture>
<div class="details">
<div class="name font-relaxed">Cian Yong Leow</div>
<div class="meta font-relaxed">August 4th, 2020 · 10 min read</div>
</div>
<div class="social">
<a href="https://www.linkedin.com/in/cianyleow/">
<svg role="img" viewBox="0 0 24 24" class="icon">
<path d="M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433c-1.144 0-2.063-.926-2.063-2.065 0-1.138.92-2.063 2.063-2.063 1.14 0 2.064.925 2.064 2.063 0 1.139-.925 2.065-2.064 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z"/>
</svg>
</a>
<a href="https://github.com/cianyleow">
<svg role="img" viewBox="0 0 24 24" class="icon">
<path d="M12 .297c-6.63 0-12 5.373-12 12 0 5.303 3.438 9.8 8.205 11.385.6.113.82-.258.82-.577 0-.285-.01-1.04-.015-2.04-3.338.724-4.042-1.61-4.042-1.61C4.422 18.07 3.633 17.7 3.633 17.7c-1.087-.744.084-.729.084-.729 1.205.084 1.838 1.236 1.838 1.236 1.07 1.835 2.809 1.305 3.495.998.108-.776.417-1.305.76-1.605-2.665-.3-5.466-1.332-5.466-5.93 0-1.31.465-2.38 1.235-3.22-.135-.303-.54-1.523.105-3.176 0 0 1.005-.322 3.3 1.23.96-.267 1.98-.399 3-.405 1.02.006 2.04.138 3 .405 2.28-1.552 3.285-1.23 3.285-1.23.645 1.653.24 2.873.12 3.176.765.84 1.23 1.91 1.23 3.22 0 4.61-2.805 5.625-5.475 5.92.42.36.81 1.096.81 2.22 0 1.606-.015 2.896-.015 3.286 0 .315.21.69.825.57C20.565 22.092 24 17.592 24 12.297c0-6.627-5.373-12-12-12"/>
</svg>
</a>
</div>
</div>
<div class="article">
<picture class="main">
<source type="image/webp" srcset="images/webp/FullPage_Top_1x.webp 1x, images/webp/FullPage_Top_1.5x.webp 1.5x, images/webp/FullPage_Top_2x.webp 2x, images/webp/FullPage_Top_2.5x.webp 2.5x, images/webp/FullPage_Top_3x.webp 3x">
<source type="image/jpg" srcset="images/FullPage_Top_1x.jpg 1x, images/FullPage_Top_1.5x.jpg 1.5x, images/FullPage_Top_2x.jpg 2x, images/FullPage_Top_2.5x.jpg 2.5x, images/FullPage_Top_3x.jpg 3x">
<img src="images/FullPage_Top_1x.jpg">
</picture>
<p>At the start of the COVID-19 lockdowns I came across an <a href="https://www.bbc.com/news/science-environment-52040822">article on the BBC</a> requesting help digitizing historical rainfall records in the UK. A little bored at home, I managed to spend five minutes transcribing records manually before I started thinking about ways to make the job more efficient (and interesting for me).</p>
<p>The records were all pretty much identical and highly organized – at the core of each page, a grid of handwritten records organized by month/year. Could I just feed the high-quality scans into the <a href="https://cloud.google.com/vision/docs/ocr">Google Cloud Vision API</a> and get the results in seconds with OCR?</p>
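<p>For context, the whole experiment amounted to a handful of lines against the Python client library. A minimal sketch (the file name and credentials setup are assumptions):</p>
<pre><code># Quick Google Cloud Vision OCR experiment (sketch).
# Assumes GOOGLE_APPLICATION_CREDENTIALS is configured and that
# 'scan.jpg' is a local copy of a rainfall record scan.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open('scan.jpg', 'rb') as f:
    image = vision.Image(content=f.read())

# document_text_detection is the variant tuned for dense document text
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
</code></pre>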
<div class="figure-group">
<figure>
<picture>
<source type="image/webp" srcset="images/webp/Google_OCR_1x.webp 1x, images/webp/Google_OCR_1.5x.webp 1.5x">
<source type="image/jpeg" srcset="images/Google_OCR_1x.jpg 1x, images/Google_OCR_1.5x.jpg 1.5x">
<img src="images/Google_OCR_1x.jpg">
</picture>
<figcaption>The output from the Google Cloud Vision API, unable to process some handwritten numbers and unaware of spatial context.</figcaption>
</figure>
</div>
<p>While the Cloud Vision API is accurate at recognizing characters and text, there were two fairly major issues:</p>
<ol>
<li>Handwriting is difficult for OCR algorithms to identify</li>
<li>The spatial context of these records is very important – each cell represents a specific time record and, more annoyingly, not every cell has to be filled in</li>
</ol>
<p>With the failure of the easy route, I started down the harder, but admittedly more interesting, route of putting together an image processing workflow to identify handwritten numbers in each of the temporally indexed cells of the main table.</p>
<h2>The Plan</h2>
<p>The original plan was simple:</p>
<ol>
<li>Define a grid of cells on the page (one for each month/year combination)</li>
<li>Run some handwriting compatible OCR on each cell</li>
<li>Present my results and collect the Nobel Prize for contributions to the Environment</li>
</ol>
<p>As always, the best laid plans are foiled, and I ran into a few problems along the way. In the end, the plan looked something more like this:</p>
<div class="figure-group">
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/0-1_1x.webp 1x, images/webp/0-1_1.5x.webp 1.5x, images/webp/0-1_2x.webp 2x, images/webp/0-1_2.5x.webp 2.5x, images/webp/0-1_3x.webp 3x">
<source type="image/png" srcset="images/0-1_1x.png 1x, images/0-1_1.5x.png 1.5x, images/0-1_2x.png 2x, images/0-1_2.5x.png 2.5x, images/0-1_3x.png 3x">
<img src="images/0-1_1x.png">
</picture>
<figcaption>Extract the Primary Data Region</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/0-2_1x.webp 1x, images/webp/0-2_1.5x.webp 1.5x, images/webp/0-2_2x.webp 2x, images/webp/0-2_2.5x.webp 2.5x">
<source type="image/png" srcset="images/0-2_1x.png 1x, images/0-2_1.5x.png 1.5x, images/0-2_2x.png 2x, images/0-2_2.5x.png 2.5x">
<img src="images/0-2_1x.png">
</picture>
<figcaption>Identify the boundaries of each grid cell</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/0-3_1x.webp 1x, images/webp/0-3_1.5x.webp 1.5x, images/webp/0-3_2x.webp 2x, images/webp/0-3_2.5x.webp 2.5x, images/webp/0-3_3x.webp 3x">
<source type="image/png" srcset="images/0-3_1x.png 1x, images/0-3_1.5x.png 1.5x, images/0-3_2x.png 2x, images/0-3_2.5x.png 2.5x, images/0-3_3x.png 3x">
<img src="images/0-3_1x.png">
</picture>
<figcaption>Use a <i>homemade</i> neural network to identify digits</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/0-4_1x.webp 1x, images/webp/0-4_1.5x.webp 1.5x, images/webp/0-4_2x.webp 2x, images/webp/0-4_2.5x.webp 2.5x, images/webp/0-4_3x.webp 3x">
<source type="image/png" srcset="images/0-4_1x.png 1x, images/0-4_1.5x.png 1.5x, images/0-4_2x.png 2x, images/0-4_2.5x.png 2.5x, images/0-4_3x.png 3x">
<img src="images/0-4_1x.png">
</picture>
<figcaption>Perform manual verification and continuous model training</figcaption>
</figure>
</div>
<h2>The Work</h2>
<p>In my opinion, the most difficult part of adopting any new tool is figuring out how to present your problems in a way it can <i>efficiently</i> solve them – in essence, creating the proverbial nail for a hammer. My tool of choice this time? OpenCV2, the de facto standard for computer vision.</p>
<p>While it is a wonderfully powerful library, OpenCV2 only shines when the source image has been adequately prepared. Trying to jump from a page full of text to a spreadsheet of numbers would be as pointless as my Google Cloud Vision experiment, so an image processing workflow would be required!</p>
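<p>In practice, 'adequately prepared' starts with the usual grayscale-and-threshold groundwork before any structure detection. A minimal sketch of that preparation (the file name and threshold parameters are illustrative):</p>
<pre><code># Typical OpenCV preparation before structure detection (sketch).
import cv2

img = cv2.imread('page.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding copes with uneven scan lighting better
# than a single global threshold.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 15, 10)
</code></pre>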
<h3>1) Extracting the Primary Data Region</h3>
<p>Step one of the workflow was normalizing the scanned sheets into a semi-consistent state (i.e. a smart crop). By isolating the primary data region, as seen below, the next steps in the workflow became significantly simpler.</p>
<div class="figure-group">
<figure>
<picture>
<source type="image/webp" srcset="images/webp/PDR_1x.webp 1x, images/webp/PDR_1.5x.webp 1.5x">
<source type="image/jpeg" srcset="images/PDR_1x.jpg 1x, images/PDR_1.5x.jpg 1.5x">
<img src="images/PDR_1x.jpg">
</picture>
<figcaption>The 'primary data region', or data table, of a scanned rainfall record.</figcaption>
</figure>
</div>
<p>The smart crop process included a couple of key steps:</p>
<ol>
<li>Transforming the image to ensure it was parallel to the view frame (i.e. perfectly horizontal and vertical lines)</li>
<li>Extracting the primary data region with the <i>full</i> border included (an important aspect of the next stage)</li>
</ol>
<p>The entire process can be seen below and essentially uses computer vision to ‘look’ for a large enough rectangle in the right position on the page.</p>
<div class="figure-group">
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/1-1_1x.webp 1x, images/webp/1-1_1.5x.webp 1.5x, images/webp/1-1_2x.webp 2x">
<source type="image/png" srcset="images/1-1_1x.png 1x, images/1-1_1.5x.png 1.5x, images/1-1_2x.png 2x">
<img src="images/1-1_1x.png">
</picture>
<figcaption>The original image being transformed to parallel the view port</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/1-2_1x.webp 1x, images/webp/1-2_1.5x.webp 1.5x, images/webp/1-2_2x.webp 2x">
<source type="image/png" srcset="images/1-2_1x.png 1x, images/1-2_1.5x.png 1.5x, images/1-2_2x.png 2x">
<img src="images/1-2_1x.png">
</picture>
<figcaption>Vertical lines being analyzed and extracted</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/1-3_1x.webp 1x, images/webp/1-3_1.5x.webp 1.5x, images/webp/1-3_2x.webp 2x">
<source type="image/png" srcset="images/1-3_1x.png 1x, images/1-3_1.5x.png 1.5x, images/1-3_2x.png 2x">
<img src="images/1-3_1x.png">
</picture>
<figcaption>Horizontal lines being analyzed and extracted</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/1-4_1x.webp 1x, images/webp/1-4_1.5x.webp 1.5x, images/webp/1-4_2x.webp 2x">
<source type="image/png" srcset="images/1-4_1x.png 1x, images/1-4_1.5x.png 1.5x, images/1-4_2x.png 2x">
<img src="images/1-4_1x.png">
</picture>
<figcaption>The extracted lines combined and the PDR identified</figcaption>
</figure>
</div>
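<p>A condensed sketch of that rectangle hunt, assuming the thresholded image from earlier and OpenCV 4.x (the output size and corner ordering are simplified for brevity):</p>
<pre><code># Find the largest rectangular contour and warp it upright (sketch).
import cv2
import numpy as np

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
biggest = max(contours, key=cv2.contourArea)

# Approximate the contour to a polygon; a data table should give 4 corners.
peri = cv2.arcLength(biggest, True)
corners = cv2.approxPolyDP(biggest, 0.02 * peri, True)

if len(corners) == 4:
    # Warp the quadrilateral so its edges parallel the view frame.
    # (Consistent corner ordering is assumed here for brevity.)
    src = np.float32(corners.reshape(4, 2))
    w, h = 1200, 1600  # illustrative output size
    dst = np.float32([[0, 0], [0, h], [w, h], [w, 0]])
    matrix = cv2.getPerspectiveTransform(src, dst)
    pdr = cv2.warpPerspective(img, matrix, (w, h))
</code></pre>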
<h3>2) Creating a Grid of Cells</h3>
<p>With the main data grid extracted, the next step was to accurately define a Cartesian grid of month/year boxes to send for OCR. Once configured, the values of each month/year combination would be individually identifiable and empty cells would no longer be an issue.</p>
<p>Thanks to the smart cropping in step one, defining the columns for each year was surprisingly easy. With a perfectly aligned and cropped image, the thirteen vertical column edges were easily discernible, and the primary data region went from a single image to twelve discrete columns in about thirty lines of code.</p>
<div class="figure-group">
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/2-1_1x.webp 1x, images/webp/2-1_1.5x.webp 1.5x">
<source type="image/png" srcset="images/2-1_1x.png 1x, images/2-1_1.5x.png 1.5x">
<img src="images/2-1_1x.png">
</picture>
<figcaption>Isolating the vertical lines in the image</figcaption>
</figure>
<figure class="half">
<picture>
<source type="image/webp" srcset="images/webp/2-2_1x.webp 1x, images/webp/2-2_1.5x.webp 1.5x">
<source type="image/png" srcset="images/2-2_1x.png 1x, images/2-2_1.5x.png 1.5x">
<img src="images/2-2_1x.png">
</picture>
<figcaption>Identifying the column locations in the original image</figcaption>
</figure>
</div>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/2-3_1x.webp 1x, images/webp/2-3_1.5x.webp 1.5x">
<source type="image/png" srcset="images/2-3_1x.png 1x, images/2-3_1.5x.png 1.5x">
<img src="images/2-3_1x.png">
</picture>
<figcaption>The final result - each column as a separate image</figcaption>
</figure>
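<p>The column step boils down to a morphological trick: open the image with a tall, thin kernel so only long vertical strokes survive, then read off their x-coordinates. A sketch of the idea, assuming the cropped region from step one (the kernel size and gap tolerance are illustrative):</p>
<pre><code># Isolate long vertical lines and slice the PDR into columns (sketch).
import cv2

gray = cv2.cvtColor(pdr, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# A 1-pixel-wide, very tall kernel keeps only vertical strokes.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Collect the x-coordinates that contain vertical-line pixels...
xs = [x for x in range(vertical.shape[1]) if vertical[:, x].sum() > 0]

# ...and group adjacent coordinates into line centres.
edges, run = [], [xs[0]]
for x in xs[1:]:
    if x - run[-1] > 5:  # a gap means a new line starts
        edges.append(sum(run) // len(run))
        run = []
    run.append(x)
edges.append(sum(run) // len(run))

# Thirteen edges bound the twelve year columns.
columns = [pdr[:, a:b] for a, b in zip(edges[:-1], edges[1:])]
</code></pre>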
<p>On the flip side, finding the row coordinates was a lot more involved. As there were no lines between the monthly rows and the spacing was slightly variable, using line detection, approximations or other shortcuts was out of the question.</p>
<p>After some brainstorming, my eureka moment came when I realized there are row markers – the months of the year – and by plugging the text into an OCR library, the locations of each row could be ‘read’ from the page!</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/PDR_Marked_1x.webp 1x">
<source type="image/jpeg" srcset="images/PDR_Marked_1x.jpg 1x">
<img src="images/PDR_Marked_1x.jpg">
</picture>
<figcaption>Field names on the left hand side of the 'Primary Data Region' provide row markers</figcaption>
</figure>
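<p>The first pass at that idea went straight through Tesseract's word-level output, something along these lines (a sketch using the pytesseract wrapper; <code>label_column</code>, an image of the left-hand label strip, is an assumption):</p>
<pre><code># 'Read' the row labels and their vertical positions with Tesseract (sketch).
import pytesseract
from pytesseract import Output

MONTHS = {'january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december'}

# image_to_data returns every detected word with its bounding box.
data = pytesseract.image_to_data(label_column, output_type=Output.DICT)
rows = {}
for word, top in zip(data['text'], data['top']):
    if word.strip().lower() in MONTHS:
        rows[word.strip().lower()] = top  # y-coordinate of the row label
</code></pre>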
<p>Unfortunately, 100-year-old typewriter text (the records stretch back to the late 1800s) didn't always play well with the out-of-the-box <a href="https://opensource.google/projects/tesseract">Tesseract OCR library</a>, and trying to make sense of the occasional <code>RT a</code>, <code>Bist</code>, <code>ACen ayindensy</code> or <code>Ppa</code> was hardly a robust solution.</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/BadTesseract_1x.webp 1x">
<source type="image/png" srcset="images/BadTesseract_1x.png 1x">
<img src="images/BadTesseract_1x.png">
</picture>
<figcaption>Incorrectly predicted values from off the shelf OCR of the field names</figcaption>
</figure>
<p>Back at the drawing board, I was researching ways to teach Tesseract to better recognize 100-year-old typewriter text when I realized: I didn't actually need to read the text; I just needed to recognize twelve different words – the months of the year.</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/MonthLabels_1x.webp 1x">
<source type="image/png" srcset="images/MonthLabels_1x.png 1x">
<img src="images/MonthLabels_1x.png">
</picture>
<figcaption>The twelve month labels that can be fed into a neural network to identify row locations</figcaption>
</figure>
<p>And thus, a trivially simple neural network was defined and trained to recognize the 17 words in the primary data region, and a robust process for finding row coordinates was created.</p>
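<p>The network itself needed nothing exotic. A sketch of the kind of model involved, assuming fixed-size grayscale word crops (the layer sizes are illustrative, not the exact ones used):</p>
<pre><code># A deliberately small CNN for classifying word images (sketch).
from tensorflow.keras import layers, models

word_model = models.Sequential([
    layers.Input(shape=(32, 128, 1)),        # fixed-size grayscale word crop
    layers.Conv2D(16, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(17, activation='softmax'),  # one class per word label
])
word_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
</code></pre>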
<figure>
<picture>
<source type="image/webp" srcset="images/webp/2-6_1x.webp 1x">
<source type="image/png" srcset="images/2-6_1x.png 1x">
<img src="images/2-6_1x.png">
</picture>
<figcaption>Object detection and basic machine learning used to 'read' the page and find the row boundaries for each month</figcaption>
</figure>
<p>Et voilà: with the row and column coordinates, a grid of individual cells could finally be extracted for digit detection!</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/2-7_1x.webp 1x">
<source type="image/png" srcset="images/2-7_1x.png 1x">
<img src="images/2-7_1x.png">
</picture>
<figcaption>The resulting Cartesian grid of cells, one for each year-month combination</figcaption>
</figure>
<h3>3) Reading Handwritten Digits into a Spreadsheet</h3>
<p>The final stage of the image processing workflow was to ‘read’ the digits in the cell and report a numeric value. Having already determined that off-the-shelf OCR libraries are poor at processing handwriting, and buoyed by the success of my simple AI for reading the months of the year, I decided to create another neural network to identify the numeric value of each cell.</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/SortedDigits_1x.webp 1x, images/webp/SortedDigits_1.5x.webp 1.5x, images/webp/SortedDigits_2x.webp 2x, images/webp/SortedDigits_2.5x.webp 2.5x">
<source type="image/jpeg" srcset="images/SortedDigits_1x.jpg 1x, images/SortedDigits_1.5x.jpg 1.5x, images/SortedDigits_2x.jpg 2x, images/SortedDigits_2.5x.jpg 2.5x">
<img src="images/SortedDigits_1x.jpg">
</picture>
<figcaption>Sorted images eventually used to train the neural network to identify individual digits</figcaption>
</figure>
<p>Obviously (it took me a day to figure this one out) training a neural network to identify decimal numbers – which are technically an infinite set – is a fruitless task. However, training a neural network to identify the 10 base digits and a decimal point is much, much easier.</p>
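<p>In code terms, the insight is simply the size of the output layer: eleven classes instead of an unbounded vocabulary. A sketch under the same assumptions as the word model:</p>
<pre><code># Digit classifier: ten digits plus a decimal point (sketch).
from tensorflow.keras import layers, models

digit_model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),         # single-character crop
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(11, activation='softmax'),  # classes 0-9 plus '.'
])
</code></pre>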
<p>The digits, selected with basic contour selection in OpenCV2:</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/3-2_1x.webp 1x">
<source type="image/png" srcset="images/3-2_1x.png 1x">
<img src="images/3-2_1x.png">
</picture>
<figcaption>A single cell with each component digit highlighted to be identified individually</figcaption>
</figure>
<p>were then pre-processed to be run through the neural network:</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/3-3_1x.webp 1x, images/webp/3-3_1.5x.webp 1.5x">
<source type="image/png" srcset="images/3-3_1x.png 1x, images/3-3_1.5x.png 1.5x">
<img src="images/3-3_1x.png">
</picture>
<figcaption>The component digits of a cell prepared for a neural network to predict their value</figcaption>
</figure>
<p>And, with all the digit predictions combined, a single numeric value was output for the cell:</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/PredictedValues_1x.webp 1x, images/webp/PredictedValues_1.5x.webp 1.5x, images/webp/PredictedValues_2x.webp 2x, images/webp/PredictedValues_2.5x.webp 2.5x, images/webp/PredictedValues_3x.webp 3x">
<source type="image/png" srcset="images/PredictedValues_1x.png 1x, images/PredictedValues_1.5x.png 1.5x, images/PredictedValues_2x.png 2x, images/PredictedValues_2.5x.png 2.5x, images/PredictedValues_3x.png 3x">
<img src="images/PredictedValues_1x.png">
</picture>
<figcaption>The value of a cell (built from the individually predicted digits)</figcaption>
</figure>
<p>Thus, having trained a second neural network to read the component digits of each cell, the workflow was complete and finally, numeric predictions of each cell could be produced at scale.</p>
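<p>End to end, each cell runs through contour selection, per-digit preparation and prediction, roughly like this (a sketch; the class mapping and 28x28 input size are assumptions carried over from the model above):</p>
<pre><code># From a cell image to a numeric value (sketch).
import cv2
import numpy as np

CLASSES = '0123456789.'

def read_cell(cell, model):
    gray = cv2.cvtColor(cell, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Sort bounding boxes left to right so the digits read in order.
    boxes = sorted(cv2.boundingRect(c) for c in contours)
    digits = []
    for x, y, w, h in boxes:
        crop = binary[y:y + h, x:x + w]
        crop = cv2.resize(crop, (28, 28)) / 255.0  # match the training input
        pred = model.predict(crop.reshape(1, 28, 28, 1), verbose=0)
        digits.append(CLASSES[int(np.argmax(pred))])
    return ''.join(digits)  # e.g. '3.41'
</code></pre>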
<figure>
<picture>
<source type="image/webp" srcset="images/webp/SideBySide_1x.webp 1x, images/webp/SideBySide_1.5x.webp 1.5x, images/webp/SideBySide_2x.webp 2x, images/webp/SideBySide_2.5x.webp 2.5x">
<source type="image/jpeg" srcset="images/SideBySide_1x.jpg 1x, images/SideBySide_1.5x.jpg 1.5x, images/SideBySide_2x.jpg 2x, images/SideBySide_2.5x.jpg 2.5x">
<img src="images/SideBySide_1x.jpg">
</picture>
<figcaption>The output CSV file of predictions next to the original PDF input</figcaption>
</figure>
<h3>4) Manual Verification & Continuous Model Training</h3>
<p>After two weeks of work, and armed with the capability to turn a PDF scan into a CSV of predictions, I logged back onto the Rainfall Rescue page. It was at that moment that I realized 16,000 volunteers, with a COVID-19 lockdown to reckon with, had surpassed the project's volunteering expectations and <i>smashed</i> through the remaining 60,000 pages in record time.</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/ProjectOver_1x.webp 1x">
<source type="image/jpeg" srcset="images/ProjectOver_1x.jpg 1x">
<img src="images/ProjectOver_1x.jpg">
</picture>
<figcaption>The Rainfall Rescue Project, just as my data processing workflow came online.</figcaption>
</figure>
<p>The volunteers had done such a good job that the project owner estimated the records were > 99% accurate on the first attempt and had been cross-verified by at least two people.</p>
<p>In direct contrast to that, my predictions were averaging around 88% and dipping below 80% in some cases!</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/Accuracy_1x.webp 1x">
<source type="image/jpeg" srcset="images/Accuracy_1x.jpg 1x">
<img src="images/Accuracy_1x.jpg">
</picture>
<figcaption>Percentage accuracy of thirty-four PDF scans tested with the prediction workflow</figcaption>
</figure>
<p>While the Rainfall Rescue Project was complete and my work wouldn't have added anything to the data set, I spent a couple more days analyzing the results, as I was interested in improving the original prediction accuracy as much as possible.</p>
<p>To identify the most common prediction issues I created a simple 'Manual Verification' web app.</p>
<figure>
<picture>
<source type="image/webp" srcset="images/webp/ManualVerification_1x.webp 1x, images/webp/ManualVerification_1.5x.webp 1.5x, images/webp/ManualVerification_2x.webp 2x, images/webp/ManualVerification_2.5x.webp 2.5x">
<source type="image/jpeg" srcset="images/ManualVerification_1x.jpg 1x, images/ManualVerification_1.5x.jpg 1.5x, images/ManualVerification_2x.jpg 2x, images/ManualVerification_2.5x.jpg 2.5x">
<img src="images/ManualVerification_1x.jpg">
</picture>
<figcaption>A web app to perform rapid visual verification and error identification</figcaption>
</figure>
<p>Designed to aid rapid visual verification and enable error identification, the web app used a point-and-click interface to highlight problematic predictions.</p>
<p>In the end, the four most common prediction issues – pictured below – were actually fairly easy to tackle.</p>
<div class="figure-group">
<figure class="quarter">
<picture>
<source type="image/webp" srcset="images/webp/Issue-1_1x.webp 1x">
<source type="image/jpeg" srcset="images/Issue-1_1x.jpg 1x">
<img src="images/Issue-1_1x.jpg">
</picture>
<figcaption>Ink smudges</figcaption>
</figure>
<figure class="quarter">
<picture>
<source type="image/webp" srcset="images/webp/Issue-2_1x.webp 1x">
<source type="image/jpeg" srcset="images/Issue-2_1x.jpg 1x">
<img src="images/Issue-2_1x.jpg">
</picture>
<figcaption>Conjoined digits</figcaption>
</figure>
<figure class="quarter">
<picture>
<source type="image/webp" srcset="images/webp/Issue-3_1x.webp 1x">
<source type="image/jpeg" srcset="images/Issue-3_1x.jpg 1x">
<img src="images/Issue-3_1x.jpg">
</picture>
<figcaption>Poor alignment</figcaption>
</figure>
<figure class="quarter">
<picture>
<source type="image/webp" srcset="images/webp/Issue-4_1x.webp 1x">
<source type="image/jpeg" srcset="images/Issue-4_1x.jpg 1x">
<img src="images/Issue-4_1x.jpg">
</picture>
<figcaption>Decimal place confusion</figcaption>
</figure>
</div>
<p>Making small changes to the code, I added rules to 'hide' smudged ink, 'split' joined digits down the middle, 're-align' offset text and 'size check' decimal places.</p>
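<p>Two of those rules as illustrative sketches: splitting a suspiciously wide bounding box down the middle, and 'size checking' a suspiciously small mark as a decimal point (the thresholds are assumptions, not the tuned values):</p>
<pre><code># Rule-of-thumb fixes applied to digit bounding boxes (sketch).
def split_if_conjoined(x, y, w, h, typical_width):
    # A box much wider than a typical digit is probably two digits
    # written together: split it down the middle.
    if w > 1.6 * typical_width:
        half = w // 2
        return [(x, y, half, h), (x + half, y, w - half, h)]
    return [(x, y, w, h)]

def is_decimal_point(w, h, typical_height):
    # A mark much smaller than a digit is treated as a decimal
    # point rather than being sent to the digit classifier.
    return typical_height > h * 2 and typical_height > w * 2
</code></pre>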
<p>The result?</p>
<p>A <b>7% improvement</b> in prediction accuracy, bringing the <b>average up to ~95%.</b></p>
<h2>Wrapping Up</h2>
<p>Having achieved a letter grade of 'A' in prediction accuracy at the time, I was done with the project and didn't really think about it until I started this writeup.</p>
<p>Strangely, I realized that despite two weeks of work never being used, I had really enjoyed myself.</p>
<p>It was satisfying to see a PDF scan get dissected with OpenCV2 and exciting (COVID-19 has really lowered my bar for entertainment) to see the accuracy of the predictions coming out of a neural network that I had built.</p>
<p>At the end of the day, I find putting structure around data fairly relaxing and actually compare it to building Lego without instructions. Although all the blocks <i>can</i> fit together, it is far more interesting when they're put together to create a rocket or a building.</p>
</div>
</div>
<div class="footer">
<div class="container">
<div class="copy">
<b>© 2020</b> — <a href="https://github.com/cianyleow/rainfall">Actually Made By Me</a>
</div>
</div>
</div>
</body>
</html>