Improve preprocessing steps #6

knkski · 2017-10-15T02:03:15Z

Right now our preprocessing is limited to extracting the files and converting them to NumPy arrays:

https://github.com/knkski/atai/blob/master/preprocess.py

There's some room for improvement here. Some options we could try out are:

Remove crazy/nonsensical fonts
Rotate/shear/transform/etc existing fonts
Download online fonts and create our own training images
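As a rough illustration of the rotate/shear option, here is a minimal NumPy-only sketch. It assumes each glyph is a 2-D grayscale array; the function name `affine_augment` and its parameters are made up for this example and are not part of preprocess.py. It uses nearest-neighbour sampling, so a library routine with proper interpolation would likely give smoother results.

```python
import numpy as np

def affine_augment(image, angle_deg=0.0, shear=0.0):
    """Rotate then horizontally shear a 2-D image about its centre.

    Hypothetical helper for illustration only. Pixels that map from
    outside the source image are filled with 0.
    """
    h, w = image.shape
    theta = np.deg2rad(angle_deg)
    # Forward transform in (x, y) coordinates: rotation, then shear.
    # Image y grows downward, so a positive angle looks clockwise on screen.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    shr = np.array([[1.0, shear],
                    [0.0, 1.0]])
    # Inverse mapping: for every output pixel, find its source pixel.
    inverse = np.linalg.inv(shr @ rot)

    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    coords = np.stack([xs - cx, ys - cy]).reshape(2, -1)
    src = inverse @ coords
    src_x = np.rint(src[0] + cx).astype(int)
    src_y = np.rint(src[1] + cy).astype(int)

    # Keep only source coordinates that fall inside the image.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out.reshape(h, w)
```

Calling something like `affine_augment(glyph, angle_deg=10, shear=0.15)` with a few small random values per image would give extra training examples from the existing fonts; the ranges that actually help would need to be found by experiment.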

knkski · 2017-11-04T03:10:01Z

Removing crazy/nonsensical fonts turned out to have a major impact on performance. Here's a TensorBoard chart showing all of the previous runs that I've done with various algorithms, maxing out at just shy of 95% accuracy:

The green line at the top is the first run with cleaned data, using a relatively simple CNN. Here's a closer look at performance on the regular and cleaned datasets, with an identical CNN:

Some of that increase is probably because there is simply less data now (cleaning removed about 5% of the records, leaving 502809 of the original 529119), but some of it is almost certainly because the algorithm no longer has to fit images like this, which are simply the logo of a font website:

There are also a fair number of fonts that have the digits 0-9 or 1-10 in place of actual letters, so the algorithm was trying to fit a 1 or a 2 instead of a B.
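For reference, the size of the reduction from cleaning can be checked from the two record counts quoted above. The variable names are just for this snippet:

```python
original_records = 529119  # records before cleaning
cleaned_records = 502809   # records after cleaning

removed = original_records - cleaned_records
fraction_removed = removed / original_records

print(removed, f"{fraction_removed:.2%}")  # 26310 4.97%
```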

The next steps are to run the trained model on the font images, determine where the learning algorithm is having difficulty, and see whether there's any way of fixing that, such as generating more examples like the ones it struggles with.
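One simple way to see where the model struggles is to count which (true label, predicted label) pairs it gets wrong most often. The sketch below assumes the true and predicted labels are already available as two equal-length sequences; `most_confused` is a hypothetical helper, not something in the repo.

```python
from collections import Counter

def most_confused(true_labels, predicted_labels, top_k=3):
    """Return the top_k most frequent (true, predicted) error pairs.

    Hypothetical helper for illustration; correct predictions are ignored.
    """
    errors = Counter(
        (t, p)
        for t, p in zip(true_labels, predicted_labels)
        if t != p
    )
    return errors.most_common(top_k)

# Toy example: "B" is mistaken for "8" twice.
true_labels = list("BBBAAC")
predicted_labels = list("8B8AAC")
print(most_confused(true_labels, predicted_labels, top_k=1))
# [(('B', '8'), 2)]
```

The pairs at the top of that list would be the natural candidates for generating additional, similar training examples.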

knkski mentioned this issue Nov 5, 2017

Write script to help filter out bad training examples #14

Open
