Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Free soundex out string on deallocation #17

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Free soundex out string on deallocation #17

wants to merge 1 commit into from

Conversation

zhammer
Copy link

@zhammer zhammer commented Nov 6, 2019

context

possible fix for #14 (i'm new to cython and haven't coded in C for a long time)

currently fuzzy.Soundex has occasional undefined behavior on py3, as discussed in #14 and can be reproduced by running this script locally:

import itertools
import fuzzy

soundex = fuzzy.Soundex(6)

def main():
    # keep running the soundex function on the same input until encountering a UnicodeDecodeError
    for attempt in itertools.count():
        try:
            soundex('input')
        except UnicodeDecodeError as e:
            print(f'error on iteration {attempt}, err: {e}')
            break

if __name__ == "__main__":
    main()
# script output
(venv) Zach-Hammer:fuzzy zachhammer$ python3 soundextest.py 
error on iteration 2227560, err: 'ascii' codec can't decode byte 0xc1 in position 4
: ordinal not in range(128)

the UnicodeDecodeError is easiest to replicate as it breaks the code flow, but as noted in #14 the function can return valid (but incorrect) python strings

early deallocation of output string

i have a feeling this is occurring because of a change in python2/python3 cython behavior in this block:

fuzzy/src/fuzzy.pyx

Lines 227 to 230 in e15b195

pout = out
free(out)
return pout

  1. we assign pout = out
    • pout has no declared type (from looking at compiled c code it seems to be a char *)
    • the cython string docs aren't super clear here, but i'd imagine this is just a char * start pointer copy
  2. we free out
    • those bytes can be overwritten at any time, possibly even on free iirc
  3. we return pout
    • from the c compiled code, it seems like a few things happen here, most notably: __pyx_t_5 = __Pyx_PyUnicode_FromString(__pyx_v_pout); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 230, __pyx_L1_error) which seems to be doing some char * -> py string coercion
    • at some point here the free'd memory can be overwritten
    • that'd either lead to a UnicodeDecodeError if attempting to decode garbage bytes, or some random string output if the bytes are valid as a python string

i'm not sure what would cause this to be an issue only on py3, though, unless soundex has been occasionally returning garbage chars without notice on py2 as no UnicodeDecodeError was being thrown?

proposed fix

use try...finally deallocation syntax to free the output string on python garbage collection

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant