A Markov chain is a very simple language model: For any given symbol (usually a word), what symbols come next, and with what probabilities?
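To make that concrete, here is a minimal sketch in Python (not the ObjC code from this project) of building such a model. It assumes symbols are whitespace-separated words and that each symbol's follower counts are scaled to probabilities summing to 1; the name `build_markov_model` is made up for illustration.

```python
from collections import Counter, defaultdict


def build_markov_model(text):
    """Map each word to a dict of {next_word: probability}.

    Assumption: symbols are whitespace-separated words, and each row
    of probabilities is scaled so that it sums to 1.
    """
    words = text.split()

    # Count how often each word is followed by each other word.
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

    # Turn the raw counts into probabilities, one row per word.
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {word: n / total for word, n in followers.items()}
    return model


model = build_markov_model("the cat sat on the mat")
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}
```

The final word of a document never gets a row of its own unless it also appears earlier, which is why "mat" is absent from the model above.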
This project generates a Markov model for each of two documents, then takes a dot product to compare them. As any given symbol might not be in both documents, and two documents that don't share any symbols are obviously orthogonal, this dot product is calculated with:
Then divide by the magnitude of each Markov model to get the cosine of the angle between the vectors that the Markov chains represent. (Hang on: the transition vectors are normalised already, so the magnitudes should be equal to the number of non-terminal symbols. Perhaps this is why the documents are so much more dissimilar than I was expecting?)
(Assuming I've not gotten confused, which is always a possibility, especially for a solo project like this).
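For anyone who wants to check the arithmetic, here is a rough Python sketch of the comparison described above. It is not the project's ObjC code. It assumes each model is a dict of `{symbol: {next_symbol: probability}}`, skips any transition that is missing from either model (those terms contribute zero), and computes the magnitudes explicitly rather than assuming a value for them. The function names are invented for this example.

```python
import math


def dot(a, b):
    """Dot product over the transitions present in both models."""
    total = 0.0
    for symbol, row_a in a.items():
        row_b = b.get(symbol)
        if row_b is None:
            continue  # symbol missing from b: contributes nothing
        for next_symbol, p in row_a.items():
            total += p * row_b.get(next_symbol, 0.0)
    return total


def magnitude(model):
    """Euclidean length of the model treated as one flat vector."""
    return math.sqrt(dot(model, model))


def angle_degrees(a, b):
    """Angle between two models; 0 is parallel, 90 is orthogonal."""
    denominator = magnitude(a) * magnitude(b)
    if denominator == 0:
        return 90.0  # assumption: an empty model counts as orthogonal
    cosine = dot(a, b) / denominator
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine))


# Two tiny hand-written models that share "the" but no transitions.
a = {"the": {"cat": 0.5, "mat": 0.5}, "cat": {"sat": 1.0}}
b = {"the": {"dog": 1.0}, "dog": {"sat": 1.0}}

print(angle_degrees(a, a))  # close to 0
print(angle_degrees(a, b))  # 90, as no transition appears in both
```

Note that sharing a symbol is not enough to move the angle away from 90°: the two models also have to agree on at least one thing that follows it.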
I want to see how good this mechanism is at identifying authorship.
The short answer is: ha ha no
The long answer is: within a language, almost all the documents I tried were in the range of 85-88° from parallel, so at the very least it needs something to widen the dynamic range if I want to get anything resembling a useful insight.
Someone was interested in interviewing me for a Mac desktop job with ObjC code, and I've not done ObjC for 5 years so I thought "this project I've had on the back burner for ages, I'll do that in ObjC and for macOS".
I may redo it in something else later, perhaps JS so anyone can use it without a Mac or iDevice. But only if I can make a version that does something more useful than falsify my hypothesis about this being useful.