
Introduce edit distance based measures of text similarity? #17

Open
jiongweilua opened this issue Mar 20, 2019 · 5 comments
Comments

@jiongweilua

Requested feature

@stefan-mueller and I were wondering whether we should introduce edit distance based measures of text similarity to quanteda.

Stefan shared with me a paper by James P. Cross and Henrik Hermansson (2017) here: they document the Levenshtein distance and the DocuToads distance.

I was wondering whether such measures would be a meaningful addition to quanteda, and how widely they are used. @kbenoit

The Levenshtein distance is very well documented, with lots of existing implementations (e.g. stringdist). We can make it quanteda-friendly, and I should not have too much trouble implementing it.

The DocuToads distance is a bit more challenging, but the online appendix to the Cross and Hermansson paper is very detailed; with enough time, I should be able to work through it.
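For concreteness, the token-level Levenshtein distance is just the classic dynamic program with tokens in place of characters. A minimal sketch in Python (the function name `token_levenshtein` and the code are illustrative only, not taken from stringdist, DocuToads, or quanteda):

```python
def token_levenshtein(a, b):
    """Levenshtein (edit) distance between two token sequences.

    Standard dynamic programming over a single rolling row:
    prev[j] holds the distance between a[:i-1] and b[:j];
    insertions, deletions, and substitutions each cost 1.
    """
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Two substitutions: "is" -> "was" and "blowin'" -> "blowing"
print(token_levenshtein(
    "the answer is blowin' in the wind".split(),
    "the answer was blowing in the wind".split()))  # -> 2
```

A production version would operate on quanteda tokens objects and push the inner loop down to compiled code, but the recurrence is the same.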

EDIT: Another paper of interest is by Wilkerson et al. (2015) here

@kbenoit
Contributor

kbenoit commented Mar 21, 2019

The idea of edit distances for token vectors is an interesting one and could have some substantive applications such as in the EUP paper. I read the paper, the supplemental information, and the replication file, and unfortunately there is no code showing how they compute the dataset - in other words, how they actually implement "Docutoads". There is https://github.com/ajhhermansson/DocuToads but it's basically a pair of Python scripts.

The idea that edit distances for individual words mean anything substantive is just flat wrong. The problem in English is just the opposite: too many words are homographs with totally different meanings.

There is already the package textreuse, which unfortunately reimplements the NLP foundations that quanteda is so well developed for; but since it only needs to do one thing, that's probably ok.

In short: I'm open-minded but a bit skeptical.

@stefan-mueller

I received several emails and requests in the last couple of weeks as to whether these algorithms are included in quanteda or other R packages. So there seems to be demand for this (also based on the citations for the Wilkerson et al (2015) paper).

textreuse does include the Smith-Waterman algorithm, but align_local() does not give the scores for the entire document pair; it only reports the longest alignment score. Since we have added measures at the token level, it might make sense to add at least the Smith-Waterman algorithm. Adding something like that to quanteda could be useful (and this implementation is not based on edit distances for individual words). What's your opinion?

library(textreuse)
align_local("The answer is blowin' in the wind.",
            "As the Bob Dylan song says, the answer was blowing in the wind.")
#> TextReuse alignment
#> Alignment score: 8 
#> Document A:
#> The answer is ### blowin ####### in the wind
#> 
#> Document B:
#> the answer ## was ###### blowing in the wind
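For reference, Smith-Waterman local alignment over tokens is itself a short dynamic program. The Python sketch below is illustrative only: the match/mismatch/gap weights of +2/-1/-1 are my assumption, not necessarily textreuse's defaults, although with these weights the sketch also produces a score of 8 for the example above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token sequences.

    Classic Smith-Waterman: every cell is floored at zero, and the
    maximum cell anywhere in the matrix is the local alignment score.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + score,  # match / mismatch
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

a = "the answer is blowin in the wind".split()
b = "as the bob dylan song says the answer was blowing in the wind".split()
print(smith_waterman(a, b))  # -> 8
```

Returning the full dynamic-programming matrix instead of only the maximum would let a textstat_*() wrapper report per-pair scores rather than just the single longest alignment.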

@kbenoit
Contributor

kbenoit commented Mar 26, 2019

OK: what would the function be called, what inputs and arguments would it need, and what would the output look like? I assume it would be a textstat_*() function.

@koheiw
Collaborator

koheiw commented Mar 26, 2019

I was also interested in positional distance/similarity measures such as Hamming (which we removed from textstat_dist) and Levenshtein distance. These measures are computationally intensive, so we need to write them using RcppParallel, but we can start prototyping in R.

@kbenoit
Contributor

kbenoit commented Mar 26, 2019

This might be interesting:
https://stackoverflow.com/questions/55352498/how-to-output-in-r-all-possible-deviations-of-a-word-for-a-fixed-distance-value

utils::adist() is very fast. But string distances for tokens are semantically meaningless and generally useless unless the task is something very practical such as matching proper names that might have alternative spellings. Matching edit distances for texts with tokens as the elements should be our focus.

@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020