Improve preprocessing steps #6

Open
knkski opened this issue Oct 15, 2017 · 1 comment

knkski commented Oct 15, 2017

Right now our preprocessing is limited to extracting the files and converting them to NumPy arrays:

https://github.com/knkski/atai/blob/master/preprocess.py

There's some room for improvement here. Some options we could try out are:

  • Remove crazy/nonsensical fonts
  • Rotate/shear/transform/etc existing fonts
  • Download online fonts and create our own training images
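
As a rough sketch of the shear option above: this is not code from the repo, `shear_image` is a hypothetical helper, and it assumes each glyph is a 2-D grayscale NumPy array with a background value of 0.

```python
import numpy as np


def shear_image(image: np.ndarray, shear: float, fill: float = 0.0) -> np.ndarray:
    """Horizontally shear a 2-D grayscale image about its vertical center.

    Row y is shifted by round(shear * (y - center)) pixels; pixels that
    move outside the frame are dropped and gaps are filled with `fill`.
    """
    height, width = image.shape
    center = (height - 1) / 2.0
    sheared = np.full_like(image, fill)
    for y in range(height):
        shift = int(round(shear * (y - center)))
        if abs(shift) >= width:
            continue  # the whole row was pushed out of the frame
        if shift > 0:
            sheared[y, shift:] = image[y, : width - shift]
        elif shift < 0:
            sheared[y, : width + shift] = image[y, -shift:]
        else:
            sheared[y] = image[y]
    return sheared


# A vertical stroke sheared into a diagonal one.
stroke = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
slanted = shear_image(stroke, shear=1.0)
```

Applying a few small shear values per image would give extra training examples with the same label; whether that actually helps here is untested.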

knkski commented Nov 4, 2017

Removing crazy/nonsensical fonts turned out to have a major impact on performance. Here's a TensorBoard chart showing all of the previous runs that I've done with various algorithms, maxing out at just shy of 95% accuracy:

[Image: cleaned]

The green line at the top is the first run with cleaned data, using a relatively simple CNN. Here's a closer look at performance on the regular and cleaned datasets, with an identical CNN:

[Image: cleaned2]

Some of that increase is probably due to the fact that there's just less data now (cleaning removed about 5% of the records, leaving 502809 of the original 529119), but some of it is almost certainly due to the algorithm no longer having to fit images like this one, which is simply the logo of a font website:
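
For reference, the 5% figure follows directly from the two record counts quoted above (the variable names are just for illustration):

```python
original_records = 529_119
cleaned_records = 502_809

removed = original_records - cleaned_records
removed_fraction = removed / original_records

print(f"removed {removed} records ({removed_fraction:.1%})")
# prints: removed 26310 records (5.0%)
```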

[Image: glyph image from a removed font, showing a font website's logo]

There's also a fair number of fonts that have the digits 0-9 or 1-10 in place of actual letters, so the algorithm was trying to fit an image of a 1 or 2 to the label B.
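
A minimal sketch of what dropping such fonts could look like. It assumes the dataset is held as parallel NumPy arrays of images, labels, and font names, and that the bad fonts are listed by hand; `drop_fonts` and the font names are hypothetical, and this is not necessarily how the cleaning was actually done.

```python
import numpy as np


def drop_fonts(images, labels, font_names, bad_fonts):
    """Return the records whose font name is not in `bad_fonts`."""
    keep = ~np.isin(font_names, list(bad_fonts))
    return images[keep], labels[keep], font_names[keep]


# Toy data: four 2x2 "images", two of them from fonts flagged as bad.
images = np.zeros((4, 2, 2))
labels = np.array(["A", "B", "A", "B"])
font_names = np.array(["plain", "site_logo", "plain", "digits_only"])

clean_images, clean_labels, clean_fonts = drop_fonts(
    images, labels, font_names, bad_fonts={"site_logo", "digits_only"}
)
```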

Next steps are to run the trained model on the font images, determine where the learning algorithm is having difficulty, and see if there's any way of fixing that, such as generating more examples similar to the ones it gets wrong.
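
One possible way to find where the model is struggling is to compare predictions against labels per class. This is only a sketch: the helper names are made up, and it assumes the true and predicted labels are available as two equal-length sequences.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly predicted examples for each true label."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


def most_confused(true_labels, predicted_labels, top=5):
    """Most frequent (true, predicted) pairs among the mistakes."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(top)


true_labels = ["A", "A", "B", "B", "B"]
predicted_labels = ["A", "C", "B", "B", "D"]

accuracy = per_class_accuracy(true_labels, predicted_labels)
confusions = most_confused(true_labels, predicted_labels)
```

Letters with low accuracy, or pairs that show up often in `confusions`, would be the first candidates for generating more examples.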
