Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

implemented token_sim_ratio() function with cosine similarity #296

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Exquisition
Copy link

@Exquisition Exquisition commented Dec 17, 2020

Implemented solution to the following issue: #272

token_sim_ratio(s1, s2 ... ) robustly handles any issues associated with lexicographic sorting of tokens for the 2nd string introduced by fuzz.token_sort_ratio(s1, s2...). The similarity is calculated using cosine similarity, other similarity measures could be integrated easily (built-in leveinstein, Jaro-Winkler, etc).

Copy link

@MichaelYingEngineering MichaelYingEngineering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exemplary code. Good unit tests.

@nol13
Copy link
Contributor

nol13 commented Mar 7, 2021

Love the idea! addresses one of the main cases where you would get sub-optimal results from this. Was messing around with porting this PR into fuzzball.js.

Wondering.. would it work if a version of token_set also use the similarity sort?

Like maybe using the similarity sort here could work?

sorted_2to1 = " ".join(sorted(diff2to1))

Also partial is handled in _token_sim but currently it will always be False?

@nol13
Copy link
Contributor

nol13 commented Mar 7, 2021

Also, not sure maintenance status of this anyway, but can add the new functions to process.py line 97 or it will miss some optimization. Probably some other optimizations hidden in there too if you can say avoid recalculating the counters every time.

@nol13
Copy link
Contributor

nol13 commented Mar 7, 2021

Haven't tested but looks order of the arguments might matter though too in some cases? Not sure if ti would matter enough to try running it both ways

@nol13
Copy link
Contributor

nol13 commented Apr 2, 2021

Was getting good results in testing, I added experimental support for this into fuzzball.js 1.4! Referenced this PR in the docs. Sorted the arguments by # of tokens or string length before doing the similarity sort, seemed to make sense to give the shorter one more precedence when sorting, and at least it should be consistent. Also have added an option to use the similarity sort when calculating token_set_ratio.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants