[MRG] label segmentation on whitespace #213

Merged 3 commits on Apr 21, 2020


shuttle1987
Member

@shuttle1987 shuttle1987 commented Dec 27, 2018

Related: #214, which deals with Unicode space characters.

@shuttle1987 shuttle1987 added this to the 0.4.0 milestone Dec 27, 2018
@shuttle1987 shuttle1987 changed the title [WIP] label segmentation [WIP] label segmentation interface Dec 27, 2018
@shuttle1987
Member Author

Interesting that the test fails here with this:

    def test_unicode_segmentation():
        """Test that unicode whitespace characters are correctly handled in segmentation"""
        from persephone.preprocess.labels import segment_into_chars
        no_break_space = "hello\u00A0world"
>       assert segment_into_chars(no_break_space) == "h e l l o w o r l d"
E       AssertionError: assert 'h e l l o \xa0 w o r l d' == 'h e l l o w o r l d'
E         - h e l l o   w o r l d
E         ?           --
E         + h e l l o w o r l d
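
The assertion output suggests the no-break space (`\u00A0`) is being kept as an ordinary character instead of being dropped as whitespace. As a minimal sketch of the behaviour the test expects, and not persephone's actual implementation, the hypothetical helper below (the name `segment_into_chars_sketch` is an assumption made for illustration) relies on Python's `str.isspace()`, which returns true for the no-break space as well as for ASCII whitespace:

```python
def segment_into_chars_sketch(utterance: str) -> str:
    """Hypothetical stand-in for segment_into_chars, for illustration only.

    Drops every character that Python classifies as whitespace
    (str.isspace() is Unicode-aware, so U+00A0 is included) and joins
    the remaining characters with single ASCII spaces.
    """
    return " ".join(ch for ch in utterance if not ch.isspace())


# Same input as the failing test above.
no_break_space = "hello\u00A0world"
result = segment_into_chars_sketch(no_break_space)
```

With this approach an input containing a no-break space and one containing a plain ASCII space produce the same segmentation. Whether other invisible characters should also be treated as separators is a separate decision that this sketch does not address.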

@shuttle1987 shuttle1987 changed the title [WIP] label segmentation interface [MRG] label segmentation on whitespace Apr 19, 2020
@shuttle1987
Member Author

I've reduced the scope of this PR, since I think the work on Unicode whitespace is separate from the interface-related considerations.

@oadams
Collaborator

oadams commented Apr 21, 2020

Nice stuff, looks good!

@oadams oadams closed this Apr 21, 2020
@oadams oadams reopened this Apr 21, 2020
@oadams oadams merged commit 826a559 into persephone-tools:master Apr 21, 2020