Generates passphrases from a list of at least 8192 basic English words. (8192 = 2^13, so each word chosen uniformly at random from such a list contributes 13 bits of entropy to the passphrase.)
Shell script to “shuffle” the input word list (the output of mostFrequent) and pick n words at a time.
Once you’ve processed all the input raw data into a list of most-frequently-occurring words, this is the script you’ll use most often to generate password possibilities.
(Note: It might be possible to run this script via Windows Subsystem for Linux, https://docs.microsoft.com/en-us/windows/wsl/about. Untested at this time.)
Sample run:

```
J1231-W10$ ./shuffle-pick.sh
exotic performing flood calm strain membership holding cooper lifting sphere thomas august accelerated ever seeking candy passions void encouraging epic measurement amino constituted tariff wealthy comparison bibliography take supra promise
```
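The script itself is bash, but the underlying idea is just “pick n words at random from the list.” Here is a minimal Haskell sketch of that idea; the file name wordlist.txt and the pick count of 4 are illustrative assumptions, not the script’s actual interface, and this version picks with replacement:

```haskell
-- Minimal sketch of the pick-words-at-random idea. Requires the "random"
-- package. Not the actual shuffle-pick.sh logic, just the core concept.
import Control.Monad (replicateM)
import System.Random (randomRIO)

-- Pick one uniformly random element of a non-empty list.
pickOne :: [a] -> IO a
pickOne xs = (xs !!) <$> randomRIO (0, length xs - 1)

main :: IO ()
main = do
  ws <- lines <$> readFile "wordlist.txt" -- assumed word-list file name
  picked <- replicateM 4 (pickOne ws)     -- 4 words; assumed, not the script's n
  putStrLn (unwords picked)
```

Picking with replacement keeps the entropy arithmetic simple: four picks from a 2^13-word list give 4 × 13 = 52 bits.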
Essentially:

```
cat googlebooks-eng-us-all-1gram-20120701-a | ./mostFrequent
```

There are some other steps you can take to clean things up a bit, though.
- mostFrequent.hs
  - Find the n most frequent words occurring on or after year y in a Google ngrams raw-data file, in the 2012 format. (The core counting step is sketched after this list.)
- PassphraseGenerator.hs
  - The guts of the program, split out so I can write a separate unit-test suite.
- spec.hs
  - Unit test.
- filterSpecialCharacters.awk
  - Awk program to filter out words with non-ASCII characters. We only speak Murrican here. (The same idea is sketched after this list.)
- go
  - A bash script that encapsulates the commands to build this thing for profiling, prepare the data, and actually run it (and get profiling output).
- go2
  - Bash script to run mostFrequent over all the Google raw ngram data files. This is the “Big Run” to produce the final output.
- filterCommonSuffixes.hs
  - Program to read a list of words and filter out those that are just another word in the list with a common suffix added on (e.g., to pluralize it). This was the first program I wrote, but mostFrequent is where I’ve spent most of my effort recently (as of 24 Oct 2015). I’ve been having second thoughts about whether I want to actually run it, since I don’t think “almost duplicate” words really reduce entropy. (The suffix-filtering idea is sketched after this list.)
- raw data
  - The raw data is at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html, under the “American English” section (version 20120701, 1-grams).
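For reference, here is a minimal sketch of the counting step in mostFrequent.hs. It assumes the 2012 1-gram row layout (word, year, match_count, volume_count, tab-separated); the cutoff year 1980 and the list size 8192 are illustrative values, not the program’s defaults, and error handling for malformed rows is omitted:

```haskell
-- Sketch: tally match_count per word for rows at or after a cutoff year,
-- then keep the n heaviest words. Reads raw 1-gram rows on stdin.
import Data.List (foldl', sortBy)
import qualified Data.Map.Strict as M
import Data.Ord (Down (..), comparing)

type Counts = M.Map String Integer

tally :: Int -> Counts -> String -> Counts
tally cutoff m line = case words line of -- tabs count as whitespace
  (w : yr : cnt : _) | read yr >= cutoff -> M.insertWith (+) w (read cnt) m
  _ -> m

topN :: Int -> Counts -> [String]
topN n = map fst . take n . sortBy (comparing (Down . snd)) . M.toList

main :: IO ()
main = do
  rows <- lines <$> getContents
  mapM_ putStrLn (topN 8192 (foldl' (tally 1980) M.empty rows))
```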
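The filterSpecialCharacters.awk step amounts to keeping only pure-ASCII words; in Haskell the same filter is a one-liner:

```haskell
import Data.Char (isAscii)

-- Keep only lines (words) made entirely of ASCII characters.
main :: IO ()
main = interact (unlines . filter (all isAscii) . lines)
```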
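And a sketch of the filterCommonSuffixes.hs idea: drop any word that is just another listed word plus a common suffix. The suffix list here is an assumption, not the program’s actual list:

```haskell
-- Drop words that are another listed word with a common suffix appended.
import qualified Data.Set as S

suffixes :: [String]
suffixes = ["s", "es", "ed", "ing", "ly"] -- assumed, for illustration

-- True if w is some shorter listed word plus one of the suffixes.
isDerived :: S.Set String -> String -> Bool
isDerived ws w =
  or [ take n w `S.member` ws
     | suf <- suffixes
     , let n = length w - length suf
     , n > 0
     , drop n w == suf
     ]

main :: IO ()
main = do
  ws <- lines <$> getContents
  let set = S.fromList ws
  mapM_ putStrLn (filter (not . isDerived set) ws)
```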