Introduce edit distance based measures of text similarity? #17
Comments
The idea of edit distances for token vectors is an interesting one and could have some substantive applications, such as in the EUP paper. I read the paper, the supplemental information, and the replication file, and unfortunately there is no code showing how they compute the dataset - in other words, how they actually implement "DocuToads". There is https://github.com/ajhhermansson/DocuToads but it's basically a pair of Python scripts. The idea that edit distances for individual words mean anything substantive is just flat wrong. The problem in English is just the opposite: too many words that are homographs with totally different meanings. There is already the package textreuse, which unfortunately reimplements the NLP foundations that quanteda is so well developed for, but since it only needs to do one thing, that's probably ok. In short: I'm open-minded but a bit skeptical.
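To make concrete what an edit distance over token vectors (rather than characters) would measure, here is a minimal sketch; the helper token_editdist() is made up for illustration and is not from the Cross and Hermansson paper, the DocuToads scripts, or any package.

# Made-up helper: Levenshtein distance over two token vectors, counting
# insertions, deletions, and substitutions of whole tokens.
token_editdist <- function(a, b) {
  na <- length(a); nb <- length(b)
  d <- matrix(0, nrow = na + 1, ncol = nb + 1)
  d[, 1] <- 0:na
  d[1, ] <- 0:nb
  for (i in seq_len(na)) {
    for (j in seq_len(nb)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # substitution
    }
  }
  d[na + 1, nb + 1]
}
token_editdist(c("the", "answer", "is", "blowin", "in", "the", "wind"),
               c("the", "answer", "was", "blowing", "in", "the", "wind"))
#> [1] 2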
I received several emails and requests in the last couple of weeks asking whether these algorithms are included in quanteda or other R packages, so there does seem to be demand for this (also based on the citations of the Wilkerson et al. (2015) paper). textreuse does include the Smith-Waterman algorithm; for example:
library(textreuse)
align_local("The answer is blowin' in the wind.",
"As the Bob Dylan song says, the answer was blowing in the wind.")
#> TextReuse alignment
#> Alignment score: 8
#> Document A:
#> The answer is ### blowin ####### in the wind
#>
#> Document B:
#> the answer ## was ###### blowing in the wind
OK, what would the function be called, what inputs and arguments would it need, and what would the output look like? I assume it would be a …
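Purely as a discussion aid, one possible interface might look like the sketch below; the name textstat_editdist() and all of its arguments are hypothetical, not an existing or planned quanteda function. Presumably the return value would be a dist-like or data.frame object consistent with the other textstat_*() functions.

# Hypothetical signature for discussion only -- not an existing quanteda function.
# x:         a tokens object (distances computed over token sequences)
# y:         an optional second tokens object; if NULL, compare all pairs in x
# method:    which edit-distance variant to compute
# normalize: if TRUE, divide by the length of the longer document
textstat_editdist <- function(x, y = NULL,
                              method = c("levenshtein", "docutoads", "hamming"),
                              normalize = FALSE) {
  method <- match.arg(method)
  stop("sketch only -- not implemented")
}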
I was also interested in positional distance/similarity measures such as Hamming (which we removed from …).
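For reference, a positional (Hamming-style) comparison is only defined for equal-length sequences; a minimal sketch over token vectors, where hamming_tokens() is a made-up helper:

# Made-up helper: number of positions at which two equal-length token vectors differ.
hamming_tokens <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)
}
hamming_tokens(c("the", "answer", "is", "blowing"),
               c("the", "answer", "was", "blowing"))
#> [1] 1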
This might be interesting:
Requested feature
@stefan-mueller and I were wondering whether we should introduce edit distance based measures of text similarity to quanteda.
Stefan shared with me a paper by James P. Cross and Henrik Hermansson (2017) here; they document the Levenshtein distance, the DocuToads distance, and …
I was wondering whether such measures would be a meaningful addition to quanteda, and how widely such measures are used? @kbenoit
The Levenshtein distance is very well documented, with lots of existing implementations (e.g. stringdist). We can make it quanteda-friendly, and I should not have too much trouble implementing it.
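For the character-level version, stringdist can already be used directly; for example:

library(stringdist)
# classic example: "kitten" -> "sitting" requires three single-character edits
stringdist("kitten", "sitting", method = "lv")
#> [1] 3
# the function is vectorised, so it also works element-wise over string vectors
stringdist(c("quanteda", "text"), c("quantedas", "test"), method = "lv")
#> [1] 1 1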
The DocuToads distance is a bit more challenging, but the online appendix to the Cross and Hermansson paper is very detailed; with enough time, I should be able to work through it.
EDIT: Another paper of interest is by Wilkerson et al. (2015) here