Misplaced display result in the sentence segmentation service #73

Closed
tantikristanti opened this issue Mar 20, 2018 · 2 comments

@tantikristanti
Collaborator

tantikristanti commented Mar 20, 2018

In the sentence segmentation service, sentences are sometimes misplaced in the displayed result even though the returned offsets are correct.
Take a text about WW1 as an example. The service returns the following offsets:

{
    "sentences": [
        {
            "offsetStart": 0,
            "offsetEnd": 111
        },
        {
            "offsetStart": 111,
            "offsetEnd": 275
        },
        {
            "offsetStart": 275,
            "offsetEnd": 431
        },
        {
            "offsetStart": 431,
            "offsetEnd": 599
        },
        {
            "offsetStart": 599,
            "offsetEnd": 737
        },
        {
            "offsetStart": 737,
            "offsetEnd": 1033
        },
        {
            "offsetStart": 1033,
            "offsetEnd": 1318
        },
        {
            "offsetStart": 1318,
            "offsetEnd": 1387
        },
        {
            "offsetStart": 1387,
            "offsetEnd": 1456
        },
        {
            "offsetStart": 1456,
            "offsetEnd": 1603
        },
        {
            "offsetStart": 1603,
            "offsetEnd": 1758
        },
        {
            "offsetStart": 1758,
            "offsetEnd": 2033
        },
        {
            "offsetStart": 2033,
            "offsetEnd": 2199
        },
        {
            "offsetStart": 2199,
            "offsetEnd": 2388
        },
        {
            "offsetStart": 2388,
            "offsetEnd": 2567
        }
    ]
}

However, the sentences are displayed as follows:
[Screenshot: screen shot 2018-03-20 at 14 25 46]
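
For reference, here is a minimal sketch of how a client might apply offsets like the ones listed above to the input text. The class `SentenceOffsetDemo` and the holder `Span` are hypothetical names for illustration only, not code from this project. It assumes, as in the response above, that the offsets are character positions and that each `offsetEnd` equals the next `offsetStart`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the project's code.
final class SentenceOffsetDemo {

    /** Hypothetical holder mirroring the "offsetStart"/"offsetEnd" fields. */
    static final class Span {
        final int offsetStart;
        final int offsetEnd;

        Span(int offsetStart, int offsetEnd) {
            this.offsetStart = offsetStart;
            this.offsetEnd = offsetEnd;
        }
    }

    /** Cuts the text into one string per span; offsetEnd is exclusive. */
    static List<String> slice(String text, List<Span> spans) {
        List<String> sentences = new ArrayList<>();
        for (Span span : spans) {
            sentences.add(text.substring(span.offsetStart, span.offsetEnd));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second one.";
        List<Span> spans = Arrays.asList(new Span(0, 15), new Span(15, 27));
        for (String sentence : slice(text, spans)) {
            // Brackets make leading spaces visible.
            System.out.println("[" + sentence + "]");
        }
    }
}
```

The slices are only correct if the text being displayed is character-for-character identical to the text the offsets were computed on. Any difference, such as dropped spaces or changed line endings, shifts every sentence that follows it.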

@tantikristanti tantikristanti self-assigned this Jun 4, 2018
lfoppiano added a commit that referenced this issue Sep 28, 2018
… formatting (e.g. spaces are missing) #73

Adding some UTF_8 conversion when getting bytes from string
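
As a hedged illustration of what an explicit UTF-8 conversion looks like in Java (the class `Utf8BytesDemo` and its methods are hypothetical, and this is not the content of the referenced commit): calling `getBytes` with an explicit charset avoids depending on the platform default encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the referenced commit.
final class Utf8BytesDemo {

    /** Encodes with an explicit charset instead of the platform default. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with the same explicit charset so the round trip is lossless. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Premi\u00e8re"; // contains one non-ASCII character
        byte[] bytes = toUtf8(text);
        System.out.println("chars: " + text.length() + ", bytes: " + bytes.length);
        System.out.println("round trip equal: " + text.equals(fromUtf8(bytes)));
    }
}
```

Character counts and byte counts differ as soon as the text contains non-ASCII characters, so offsets expressed in characters should not be applied to byte arrays.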
@tantikristanti
Collaborator Author

tantikristanti commented Sep 28, 2018

The wrong positions in the sentence segmentation service had two main causes:

  1. The GET request does not preserve the text formatting (thanks @lfoppiano).
  2. The text contains hidden Windows-style line endings ("\r\n").

The changes for this issue were made in nerd.new.js, NerdRestService.java and ProcessText.java.
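
The effect of the second cause can be sketched as follows. This is an illustration under assumptions, not the actual change made in the files above: `LineEndingDemo` and `normalizeLineEndings` are hypothetical names. It shows that offsets computed on a text with "\n" line endings point to the wrong characters when applied to the same text with "\r\n" line endings.

```java
// Hypothetical illustration, not the project's fix.
final class LineEndingDemo {

    /** Replaces each "\r\n" pair with a single "\n". */
    static String normalizeLineEndings(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String raw = "One.\r\nTwo.\r\nThree.";
        String normalized = normalizeLineEndings(raw);

        // Offsets of the last sentence, computed on the normalized text.
        int start = 10;
        int end = 16;

        // On the normalized text the offsets select the whole sentence.
        System.out.println("[" + normalized.substring(start, end) + "]");

        // On the raw text each "\r\n" adds one extra character,
        // so the same offsets start two characters too early.
        System.out.println("[" + raw.substring(start, end).replace("\r\n", "\\r\\n") + "]");
    }
}
```

Each "\r\n" adds one character compared with "\n", so the drift grows with every line break. Both the segmentation step and the display need to work on the same version of the text.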

The service now displays the result as follows:

[Screenshot: screen shot 2018-09-28 at 16 03 59]

@lfoppiano
Collaborator

I think we can close this one
