Enable whitespace-preserving splitting #8

Open
wants to merge 1 commit into
base: develop

@lfurrer lfurrer commented Nov 7, 2022

This PR adds a new flag, strip_whitespace, to the existing split() method and a new method, boundaries().

The new features are best described by example:

>>> from sentence_splitter import SentenceSplitter
>>> sbd = SentenceSplitter('en')
>>>
>>> # Inherited behaviour (unchanged):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n")
['A brief note.', 'Another one.', 'And a final one.']
>>>
>>> # New flag strip_whitespace (default: True):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n", strip_whitespace=False)
['A brief note.   ', 'Another   one.\t', 'And a final one.\n']
>>>
>>> # New method boundaries():
>>> sbd.boundaries("A brief note.   Another   one.\tAnd a final one.\n")
[16, 31]

The changes in prose:

  • If the new flag strip_whitespace is True (the default), leading, trailing and duplicated whitespace is stripped (current behaviour). If the flag is False, all whitespace is preserved, such that
    "".join(sbd.split(text, strip_whitespace=False)) == text
  • The new method SentenceSplitter.boundaries(text: str) -> List[int] returns a list of character offsets into the original string that mark the sentence boundaries. There is always one boundary fewer than there are sentences, except for an empty string, for which both methods return an empty list.
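
To illustrate how the two features relate, here is a minimal sketch that cuts a string at a list of offsets. The helper slice_at_boundaries is hypothetical and not part of this PR; the offsets are copied from the example above rather than computed by the splitter. It assumes that slicing the text at the offsets returned by boundaries() gives the same segments as split(..., strip_whitespace=False), which holds for the example shown.

```python
from typing import List


def slice_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Cut `text` at the given character offsets (hypothetical helper)."""
    if not text:
        # Mirrors the documented special case: empty input, no sentences.
        return []
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


text = "A brief note.   Another   one.\tAnd a final one.\n"
boundaries = [16, 31]  # offsets taken from the boundaries() example above

segments = slice_at_boundaries(text, boundaries)
print(segments)
print("".join(segments) == text)
```

Under that assumption, the segments keep every whitespace character of the input, so joining them restores the original string.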

The reason for the proposed changes is simple: the current implementation mixes two tasks, sentence boundary detection and whitespace normalisation, in a way that cannot be separated. That makes the splitter unusable in contexts where the original whitespace needs to be retained. This PR adds an option to perform non-destructive sentence splitting.

I tried to stay as close to the original code as possible so as not to change the behaviour of the splitter; in particular, the regular expressions have not been changed. The new and old implementations yielded identical results on a small corpus of 35k docs/60k sentences in EN/DE/ES/FR/IT/PL. However, this doesn't guarantee that the new implementation behaves exactly the same in all edge cases. A test suite covering typical and interesting cases in some (or all) of the languages supported by this sentence splitter would help build confidence here.
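
As a starting point for such a test suite, a check along the following lines could be run per test case. This is only a sketch: check_split_invariants is a hypothetical helper, the inputs are hard-coded from the example in this description instead of coming from the splitter, and the last assertion assumes that the stripping mode is equivalent to collapsing every whitespace run into a single space, which may not hold in all edge cases.

```python
from typing import List


def check_split_invariants(
    text: str,
    preserved: List[str],   # expected split(text, strip_whitespace=False)
    stripped: List[str],    # expected split(text)
    boundaries: List[int],  # expected boundaries(text)
) -> None:
    """Assert the properties described in this PR (hypothetical helper)."""
    # Non-destructive splitting: joining the segments restores the input.
    assert "".join(preserved) == text
    if text:
        # One boundary fewer than sentences for non-empty input.
        assert len(boundaries) == len(preserved) - 1
    else:
        assert preserved == [] and boundaries == []
    # Assumption: stripping equals collapsing whitespace runs per segment.
    assert [" ".join(seg.split()) for seg in preserved] == stripped


text = "A brief note.   Another   one.\tAnd a final one.\n"
preserved = ["A brief note.   ", "Another   one.\t", "And a final one.\n"]
stripped = ["A brief note.", "Another one.", "And a final one."]
boundaries = [16, 31]

check_split_invariants(text, preserved, stripped, boundaries)
```

In a real suite, the three lists would come from calling the splitter on each test text, so the same check covers both modes and boundaries() at once.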
