implemented token_sim_ratio() function with cosine similarity #296

Exquisition · 2020-12-17T21:21:24Z

Implemented solution to the following issue: #272

token_sim_ratio(s1, s2, ...) avoids the problems that fuzz.token_sort_ratio(s1, s2, ...) introduces by sorting the tokens of the 2nd string lexicographically. The similarity is calculated using cosine similarity; other similarity measures (built-in Levenshtein, Jaro-Winkler, etc.) could be integrated easily.
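To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes cosine similarity over character counts and a greedy pairing of tokens, and it uses difflib's SequenceMatcher as a stand-in for fuzz.ratio. The names cosine, similarity_sort, token_sort_ratio_sketch and token_sim_ratio_sketch are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def cosine(a, b):
    """Cosine similarity between the character-count vectors of two tokens."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similarity_sort(tokens1, tokens2):
    """Greedily reorder tokens2 so each token lines up with its closest
    counterpart in tokens1 (assumed strategy, not taken from the PR)."""
    remaining = list(tokens2)
    ordered = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: cosine(t1, t2))
        remaining.remove(best)
        ordered.append(best)
    # Tokens without a counterpart are appended in alphabetical order.
    return ordered + sorted(remaining)


def ratio(a, b):
    """Stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(s1, s2):
    """Both token lists sorted lexicographically, then compared."""
    t1 = sorted(s1.lower().split())
    t2 = sorted(s2.lower().split())
    return ratio(" ".join(t1), " ".join(t2))


def token_sim_ratio_sketch(s1, s2):
    """First token list sorted lexicographically, second sorted by similarity."""
    t1 = sorted(s1.lower().split())
    t2 = similarity_sort(t1, s2.lower().split())
    return ratio(" ".join(t1), " ".join(t2))
```

With "apple banana cherry" and "cherry zapple banana", the typo moves "zapple" to the end of the alphabetical order and the token_sort-style score drops, while the similarity sort keeps "zapple" aligned with "apple".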

Implemented solution to: seatgeek#272

MichaelYingEngineering

Exemplary code. Good unit tests.

nol13 · 2021-03-07T19:21:01Z

Love the idea! It addresses one of the main cases where you would get sub-optimal results from this. I was messing around with porting this PR into fuzzball.js.

Wondering: would it work if a version of token_set also used the similarity sort?

Like maybe using the similarity sort here could work?

sorted_2to1 = " ".join(sorted(diff2to1))

Also, partial is handled in _token_sim, but currently it will always be False?
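One possible reading of the token_set suggestion above, as a rough sketch only: the set logic mirrors the usual intersection/difference construction, and the flag use_similarity_sort swaps the alphabetical sort of diff2to1 for a similarity sort against diff1to2. The helper names, the flag, the character-count cosine and the difflib-based ratio are all assumptions for illustration, not part of this PR.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def cosine(a, b):
    """Cosine similarity between character-count vectors (assumed measure)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similarity_sort(tokens1, tokens2):
    """Greedily order tokens2 to follow the order of tokens1."""
    remaining = list(tokens2)
    ordered = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: cosine(t1, t2))
        remaining.remove(best)
        ordered.append(best)
    return ordered + sorted(remaining)


def ratio(a, b):
    """Stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1, s2, use_similarity_sort=False):
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    diff1to2 = sorted(tokens1 - tokens2)
    diff2to1 = sorted(tokens2 - tokens1)
    if use_similarity_sort:
        # The change being discussed: order diff2to1 by similarity to diff1to2.
        diff2to1 = similarity_sort(diff1to2, diff2to1)
    combined_1to2 = (sorted_sect + " " + " ".join(diff1to2)).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(diff2to1)).strip()
    return max(
        ratio(sorted_sect, combined_1to2),
        ratio(sorted_sect, combined_2to1),
        ratio(combined_1to2, combined_2to1),
    )
```

The flag only makes a difference when each side has at least two tokens outside the intersection, for example "apple cherry banana" against "zapple xcherry banana".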

nol13 · 2021-03-07T19:41:26Z

Also, not sure about the maintenance status of this anyway, but the new functions can be added to process.py line 97, or they will miss some optimization. There are probably some other optimizations hidden in there too if you can, say, avoid recalculating the counters every time.

nol13 · 2021-03-07T22:07:58Z

Haven't tested, but it looks like the order of the arguments might matter too in some cases? Not sure if it would matter enough to try running it both ways.

nol13 · 2021-04-02T01:18:52Z

Was getting good results in testing, so I added experimental support for this to fuzzball.js 1.4 and referenced this PR in the docs. I sorted the arguments by number of tokens or string length before doing the similarity sort; it seemed to make sense to give the shorter one more precedence when sorting, and at least it should be consistent. I have also added an option to use the similarity sort when calculating token_set_ratio.
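The argument ordering described above could look roughly like this in Python. This is a sketch, not the fuzzball.js code: order_arguments is a hypothetical helper, and the final lexicographic tie-break is an added assumption so that the result never depends on call order.

```python
def order_arguments(s1, s2):
    """Return the two strings with the 'shorter' one first.

    Shorter means fewer tokens, then fewer characters; the last key
    (the string itself) is an assumed tie-break for full determinism.
    """
    def key(s):
        return (len(s.split()), len(s), s)

    first, second = sorted((s1, s2), key=key)
    return first, second
```

Calling it before the similarity sort means that swapping s1 and s2 at the call site yields the same pair, and so the same score.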

implemented token_sim_ratio() function with cosine similarity

f88f7fc

Implemented solution to: seatgeek#272

MichaelYingEngineering approved these changes Dec 19, 2020
