Enable whitespace-preserving splitting #8
This PR introduces a new flag `strip_whitespace` for the existing `split()` method, and a new method `boundaries()`.
The changes in prose:
- If `strip_whitespace` is `True` (the default), leading, trailing and duplicated whitespace is stripped (the current behaviour). If the flag is `False`, all whitespace is preserved, such that `"".join(sbd.split(text, strip_whitespace=False)) == text`.
- `SentenceSplitter.boundaries(text: str) -> List[int]` returns a list of character offsets into the original string, denoting sentence boundaries. The number of boundaries is always one less than the number of sentences (except for an empty string, in which case both methods return an empty list).

The reason for the proposed changes is simple: the current implementation mixes two tasks, sentence boundary detection and whitespace normalisation, in an inseparable way. This makes the sentence splitter unusable in contexts where the original whitespace needs to be retained. This PR adds an option to perform non-destructive sentence splitting.
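To make the two contracts concrete, here is a minimal standalone sketch. This is not the PR's implementation; `naive_split`, `split_at_boundaries` and the sample offsets are illustrative stand-ins that only demonstrate the round-trip property and the boundary/sentence relationship:

```python
import re
from typing import List

def naive_split(text: str, strip_whitespace: bool = True) -> List[str]:
    # Toy splitter: break after sentence-final punctuation (NOT the PR's code).
    # With strip_whitespace=False the split points are zero-width
    # (lookbehind + lookahead), so no character of the input is lost.
    if strip_whitespace:
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [p for p in re.split(r'(?<=[.!?])(?=\s)', text) if p]

def split_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    # Slice the original string at the given character offsets;
    # since nothing is dropped, the slices re-join to `text`.
    offsets = [0] + boundaries + [len(text)]
    return [text[a:b] for a, b in zip(offsets, offsets[1:])]

text = "One.  Two. Three."
print(naive_split(text))                   # ['One.', 'Two.', 'Three.']
assert "".join(naive_split(text, strip_whitespace=False)) == text

# Two boundaries, three sentences: always one fewer boundary than sentences.
print(split_at_boundaries(text, [4, 10]))  # ['One.', '  Two.', ' Three.']
```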
I tried to stick to the original code as much as possible in order not to change the behaviour of the splitter; in particular, the regular expressions have not been changed. The new and old implementations yielded identical results for a small corpus of 35k docs/60k sentences in EN/DE/ES/FR/IT/PL. However, this doesn't guarantee that the new implementation behaves exactly the same in all edge cases. A test suite covering typical and interesting cases in some (or all) languages supported by this sentence splitter could help gain some confidence here.
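Such a suite could phrase the two contracts as properties that any conforming splitter must satisfy. A sketch of what that might look like; the `check_invariants` helper and the toy splitter below are hypothetical, not part of the PR or the library:

```python
import re
from typing import Callable, List

def check_invariants(split_ws: Callable[[str], List[str]],
                     boundaries: Callable[[str], List[int]],
                     text: str) -> None:
    parts = split_ws(text)
    b = boundaries(text)
    # Invariant 1: whitespace-preserving split re-joins to the original text.
    assert "".join(parts) == text
    # Invariant 2: one boundary fewer than sentences; empty input gives [] twice.
    assert len(b) == max(len(parts) - 1, 0)

# Demo with a toy splitter that breaks before the space following a period.
toy_split = lambda t: [p for p in re.split(r'(?<=\.)(?= )', t) if p]
toy_bounds = lambda t: [m.start() + 1 for m in re.finditer(r'\.(?= )', t)]

check_invariants(toy_split, toy_bounds, "A cat. A dog. End.")
check_invariants(toy_split, toy_bounds, "")
```

Parameterising such checks over the supported languages and a list of tricky inputs (abbreviations, ellipses, unusual whitespace) would turn this into the regression suite mentioned above.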