Adding Bold Surface Forms #17

abhishekg2389 · 2015-05-11T21:09:11Z

This is regarding pull request dbpedia-spotlight/dbpedia-spotlight#356.
I have made some changes to get the bold mentions out of the dump file. Please review the changes.
As far as test cases are concerned, I have extracted the words surrounded by triple single quotes (''') from the dump.

tgalery · 2015-05-14T12:58:09Z

src/main/java/pignlproc/markup/AnnotatingMarkupParser.java

@@ -385,6 +387,13 @@ public void nodesToText(List<? extends Object> nodes, Appendable buffer,
                                    tagName));
                            countingBuffer.append("\n\n");
                        }
+                        if ("b".equals(tagName)) {
+                            if (tagBegin < 500) {


Just a question: is this so that you only get bold forms that occur in the first 500 chars of the page?

abhishekg2389 · 2015-05-15

Hi @tgalery,
Actually, I want to extract bold forms only from the definition/introduction part of the page, not from the whole page. Including bold forms from the whole page may introduce some false surface forms. Here are some examples:

http://en.wikipedia.org/wiki/Sidewalk: "roadway"

http://en.wikipedia.org/wiki/Analysis_of_variance: "most used" and "most useful"

http://en.wikipedia.org/wiki/Phrase_(music): "antecedent phrase", "consequent phrase", "phrase-group" and "Phrase rhythm"

http://en.wikipedia.org/wiki/Afroasiatic_languages: several bold forms in a table

Extracting all bold forms from a page would also yield some valid surface forms, but as the examples above show, we cannot be sure they are valid in every case. For example:

http://en.wikipedia.org/wiki/Radio_Warwick: "URW312", "University Radio Warwick"
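
The cutoff idea can be illustrated with a minimal sketch. It is not the PR's code: the PR compares `tagBegin` against 500 inside `AnnotatingMarkupParser`, while this sketch scans raw wikitext with a regular expression. The class name `IntroBoldForms` and the `extract` method are hypothetical, and bold-italic markup (five quotes) is deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of pignlproc. */
final class IntroBoldForms {
    // Non-greedy match of the text between two runs of three single quotes.
    private static final Pattern BOLD = Pattern.compile("'''(.+?)'''");

    /** Returns the bold forms whose opening quotes start before {@code cutoff}. */
    static List<String> extract(String wikitext, int cutoff) {
        List<String> forms = new ArrayList<>();
        Matcher m = BOLD.matcher(wikitext);
        while (m.find()) {
            if (m.start() >= cutoff) {
                break; // matches arrive in document order, so nothing later qualifies
            }
            forms.add(m.group(1));
        }
        return forms;
    }

    public static void main(String[] args) {
        String page = "'''University Radio Warwick''' ('''URW312''') is a station. "
                + "Later the article mentions a '''roadway'''.";
        // With a cutoff of 40 only the two bold forms in the opening sentence qualify.
        System.out.println(extract(page, 40));
    }
}
```

A fixed character cutoff is a rough proxy for "the introduction"; it keeps the rule cheap but depends on how long the opening paragraph happens to be.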

tgalery · 2015-05-14T13:02:56Z

I wonder why the tests are failing. Maybe you changed a method name that didn't need to be changed?

abhishekg2389 · 2015-05-15T21:20:13Z

I have also updated the testCase for Boldform Extraction.

tgalery · 2015-05-17T21:36:22Z

Thanks @abhishekg2389! I will take a look at it when I have some time!

abhishekg2389 · 2015-05-18T18:48:55Z

Hi @tgalery,

Can you please confirm that the surface form store is generated from pignlproc only and nothing else, so that I can start working on creating unseen surface forms as I mentioned in my proposal?

tgalery · 2015-05-18T20:23:48Z

@abhishekg2389 pignlproc generates all the stats we use. It would be good to run the pignlproc process locally to see whether the generated model indeed has the bold sfs that you are trying to capture. Running the process on a small wikidump, like the Danish one, would be feasible on a single machine. Not sure when I will have some spare time though.
As for the unseen forms, I would take a step back and implement a loose sf matching function that generates variations based on a match and tries to retrieve all the candidates for all the matches.
There was a very old commit of mine that tried to fix some white space issues here: https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/284/files , but I would go beyond that and do the following: (i) get a spot -> (ii) generate a set of possible sfs on the basis of (i) -> (iii) get all the candidates from the sfs in (ii), but adjust the sf probability according to some function that calculates a distance between the original sf in (i) and the sf generated in (ii).
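
Steps (i) to (iii) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the store is modelled as a plain `Map` from surface form to candidate scores, the variant generator only normalizes whitespace and case, and the discount `1 / (1 + editDistance)` is one arbitrary choice for the distance function, not anything Spotlight defines.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of loose surface-form matching; not Spotlight code. */
final class LooseSfMatcher {
    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Step (ii): a toy generator that only varies whitespace and case. */
    static Set<String> variants(String spot) {
        Set<String> out = new LinkedHashSet<>();
        String collapsed = spot.trim().replaceAll("\\s+", " ");
        out.add(spot);
        out.add(collapsed);
        out.add(collapsed.toLowerCase(Locale.ROOT));
        return out;
    }

    /** Step (iii): look up every variant and discount its candidates by distance. */
    static Map<String, Double> candidates(String spot, Map<String, Map<String, Double>> store) {
        Map<String, Double> scored = new HashMap<>();
        for (String sf : variants(spot)) {
            Map<String, Double> found = store.get(sf);
            if (found == null) {
                continue;
            }
            // Assumed discount: an exact match keeps its score, distant variants lose weight.
            double discount = 1.0 / (1 + editDistance(spot, sf));
            for (Map.Entry<String, Double> e : found.entrySet()) {
                scored.merge(e.getKey(), e.getValue() * discount, Math::max);
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> store = new HashMap<>();
        store.put("brian eno", Map.of("Brian_Eno", 0.9));
        // Step (i): the spot, here with a doubled space and mixed case.
        System.out.println(candidates("Brian  Eno", store));
    }
}
```

Keeping the best discounted score per candidate (`Math::max`) is one option; summing over variants would be another, and which behaves better is an open question here.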

dav009 · 2015-05-19T08:26:23Z

There is a small sample of the wikidump in this repo: https://github.com/dbpedia-spotlight/pignlproc/tree/master/src/test/resources
You could check if there is any markup for Bold SFs for a particular article and create a test-set

abhishekg2389 · 2015-05-19T09:38:09Z

Hi @dav009, @tgalery,

I have committed a couple of files above: Brian_Eno.xml (the test case resource) and TestWikipediaLoader.java, in which I have written the test case for the XML file. I generated the bold forms locally (4 were generated) and used their start and end offsets to verify bold form generation. I'll also be uploading the output for the Danish wikidump soon.

abhishekg2389 · 2015-05-19T09:48:31Z

@tgalery,

In one of the mails I suggested a similar idea: instead of generating all unseen SFs for the SF store, we can generate unseen SFs on the go (during SF/mention and candidate matching), though that might take some time at runtime. I also suggested a probability function in my proposal, so I think we can use that here.
May I work on your suggested idea, if that is what you want?

tgalery · 2015-05-19T13:21:04Z

Hi @abhishekg2389, thanks for your effort. On the test case, it would be nice to include a bold surface form that starts at, say, char 497 and ends at an index greater than 500, so that according to your rule the beginning of the bold word would be captured but not the end. In this case I would expect your functions not to break, and not to output the substring between indices 497 and 500.
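
The boundary case can be sketched independently of the parser. This is an assumption-laden illustration, not the PR's test: `Span` and `keepIntroSpans` are hypothetical names, and the offsets are made up. The point is only that a span is kept or dropped as a whole based on where it begins, so a span straddling the cutoff is never truncated.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the cutoff boundary case; not pignlproc code. */
final class BoldSpanCutoff {
    /** A bold span: [begin, end) character offsets plus the text it covers. */
    static final class Span {
        final int begin;
        final int end;
        final String text;

        Span(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Keeps every span that begins before the cutoff, always with its full text. */
    static List<Span> keepIntroSpans(List<Span> spans, int cutoff) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            if (s.begin < cutoff) {
                kept.add(s); // the span is kept whole, even if s.end > cutoff
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span(10, 19, "Brian Eno"),
                new Span(497, 504, "ambient"),   // straddles the 500-char cutoff
                new Span(510, 517, "roadway"));  // starts after the cutoff
        for (Span s : keepIntroSpans(spans, 500)) {
            System.out.println(s.begin + "-" + s.end + " " + s.text);
        }
    }
}
```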

tgalery · 2015-05-19T13:31:56Z

On the loose sf matching, doing the generation is actually fairly easy. The strategy at the moment is this: first we get a spot; if we manage to find candidates in the uppercase SF store, we retrieve them and move them on to the disambiguation phase. You need to change that. The pipeline would be: get the spot -> generate new spots -> get all the candidates in both the uppercase SF store and the lowercase SF store. I would advise you to do the generation step in stages: first just generate different case combinations and remove plurals, and then we might move to something more complex. You could base your generator on this: https://github.com/dnmilne/wikipediaminer/blob/049b1d4a9568144e9e6704b9090d0579b21b3e2e/wikipedia-miner-core/src/main/java/org/wikipedia/miner/util/text/CaseAccentSimpleTextProcessor.java . Hope it helps.
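
A first-stage generator along these lines could be sketched as follows. The names `SpotGenerator`, `generate` and `lookup` are hypothetical, the two stores are modelled as plain maps from surface form to candidate URIs (an assumption, not Spotlight's actual store API), and plural removal is reduced to stripping a trailing "s".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical first-stage spot generator; not Spotlight code. */
final class SpotGenerator {
    /** Lowercases the spot, then capitalizes the first letter of each word. */
    static String titleCase(String s) {
        StringBuilder sb = new StringBuilder();
        for (String w : s.toLowerCase(Locale.ROOT).split(" ")) {
            if (w.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Character.toUpperCase(w.charAt(0))).append(w.substring(1));
        }
        return sb.toString();
    }

    /** Generates case combinations of the spot and of its naive singular. */
    static Set<String> generate(String spot) {
        Set<String> bases = new LinkedHashSet<>();
        bases.add(spot);
        // Naive plural removal: drop one trailing "s" (but leave "ss" alone).
        if (spot.length() > 3 && spot.endsWith("s") && !spot.endsWith("ss")) {
            bases.add(spot.substring(0, spot.length() - 1));
        }
        Set<String> out = new LinkedHashSet<>();
        for (String b : bases) {
            out.add(b);
            out.add(b.toLowerCase(Locale.ROOT));
            out.add(b.toUpperCase(Locale.ROOT));
            out.add(titleCase(b));
        }
        return out;
    }

    /** Collects candidates for every generated spot from both stores, without duplicates. */
    static List<String> lookup(String spot,
                               Map<String, List<String>> upperStore,
                               Map<String, List<String>> lowerStore) {
        List<String> candidates = new ArrayList<>();
        for (String s : generate(spot)) {
            for (Map<String, List<String>> store : List.of(upperStore, lowerStore)) {
                for (String c : store.getOrDefault(s, List.of())) {
                    if (!candidates.contains(c)) {
                        candidates.add(c);
                    }
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, List<String>> upper = Map.of("Sidewalk", List.of("Sidewalk"));
        Map<String, List<String>> lower = Map.of("sidewalk", List.of("Sidewalk"));
        System.out.println(generate("sidewalks"));
        System.out.println(lookup("sidewalks", upper, lower));
    }
}
```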

abhishekg2389 · 2015-05-19T16:19:23Z

Hi @tgalery

According to my rule, we consider bold forms that start before the 500-character mark. So a bold form that starts before 500 and ends after 500 is still included, and it is the last one we take. Should I change the rule and the test case or not? Also, can you tell me whether I have to upload the output from the Danish wikidump for verification?
I will also start working on generating surface forms and will commit the code soon.

tgalery · 2015-05-19T16:41:08Z

Hi @abhishekg2389, the output of the pignlproc process will be a new Danish model. It would be good to check some Danish Wikipedia pages and see whether the bold forms that you captured are indeed there.

tgalery · 2015-05-19T16:41:55Z

Maybe you can post the zipped model to dropbox or something and post the url here so we can double check.

tgalery · 2015-05-19T17:09:37Z

And I forgot to mention: if you are working on the spotlight code, could you fork from our repo and create a feature branch from dev, please?

abhishekg2389 · 2015-05-19T18:39:38Z

Hi @tgalery,

Please find only bold surface forms here: https://www.dropbox.com/s/3gopxn0csynto8i/bolds.tar.gz?dl=0
My code has a small problem with text that is both bold and italic, but those cases are very rare.
I will create a feature branch from dev and start working on that.

abhishekg2389 · 2015-06-16T14:15:35Z

Hi @tgalery,

Have you checked the output that I uploaded to Dropbox?
I have also started working on the approach and will come up with the code in a few days.

abhishekg2389 · 2015-06-16T19:13:26Z

There is one more thing that I would like to ask. For removing plurals there are two options:

1. Replace suffixes with rules: "s" -> "", "ses" -> "s", "xes" -> "x", "zes" -> "z", "ches" -> "ch", "shes" -> "sh", "men" -> "man", "ies" -> "y"
2. Use the stemmer by John Carroll: https://github.com/knowitall/morpha

With option 1 we might miss some complex cases like men, children. But besides these rules I have a semi-exhaustive list of plural-to-singular conversions.
Option 2 might be more exhaustive than option 1, but it will be more time consuming.
Please suggest which option would be appropriate.
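
For comparison, option 1 could be sketched like this. The suffix rules are the ones listed above; the ordering (longer suffixes first), the tiny irregular-plural table standing in for the "semi-exhaustive list", the "ss" guard and the lowercasing are all assumptions added for the sketch, and `RuleSingularizer` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical rule-based singularizer sketching option 1. */
final class RuleSingularizer {
    // Suffix rules from the list above, ordered so that longer suffixes win.
    private static final String[][] RULES = {
        {"ches", "ch"}, {"shes", "sh"}, {"ses", "s"}, {"xes", "x"},
        {"zes", "z"}, {"ies", "y"}, {"men", "man"}, {"s", ""},
    };

    // Stand-in for the semi-exhaustive irregular list; only two entries here.
    private static final Map<String, String> IRREGULAR =
            Map.of("men", "man", "children", "child");

    /** Returns a lowercased singular guess for {@code word}. */
    static String singular(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        String irregular = IRREGULAR.get(w);
        if (irregular != null) {
            return irregular;
        }
        if (w.endsWith("ss")) {
            return w; // extra guard, not in the listed rules: "glass" is not a plural
        }
        for (String[] rule : RULES) {
            // Require a non-empty stem so that a bare suffix is never rewritten.
            if (w.length() > rule[0].length() && w.endsWith(rule[0])) {
                return w.substring(0, w.length() - rule[0].length()) + rule[1];
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"churches", "cities", "women", "children", "glass"}) {
            System.out.println(w + " -> " + singular(w));
        }
    }
}
```

Rules like these are fast but over-apply: "houses" becomes "hous" under the "ses" rule, which is the kind of error a morphological stemmer such as option 2 is meant to avoid.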

tgalery · 2015-06-17T10:15:21Z

Hi @abhishekg2389 I'm sorry but I still am swamped. I might try to have a look at the dropbox files Sunday, but it's a bit unlikely. Is your plural remover related to normalizing data in the pignlproc or to the possible loose sf matching you are investigating at the moment?

abhishekg2389 · 2015-11-03T19:26:44Z

Hi @tgalery,

Have you had a chance to look at the uploaded files?

tgalery · 2015-11-04T00:21:41Z

Hi @abhishekg2389, sorry for the massive delay, but I haven't had much time to look at things. I might try this week or next. To be honest, we are trying to move away from pignlproc to a newer system based on Spark. That system builds on a repo called json-wikipedia, https://github.com/diegoceccarelli/json-wikipedia , which creates a JSON representation of Wikipedia. It would be great if you could add a field to the page schema that captures the bold surface forms. Page parsing is done by some libs, e.g. jwlpro and bliki, so they might have some helper functions already in place.

abhishekg2389 added 4 commits May 12, 2015 02:18

Update nerd_commons.pig

969a911

Update ParsingWikipediaLoader.java

be1807e

Update AnnotatingMarkupParser.java

6c59c2e

Update TestWikipediaLoader.java

ee06f41

abhishekg2389 mentioned this pull request May 14, 2015

Adding Bold Surface Forms: Issue #353 dbpedia-spotlight/dbpedia-spotlight#356

Closed

tgalery reviewed May 14, 2015

abhishekg2389 added 2 commits May 16, 2015 02:45

Update TestWikipediaLoader.java

7f8ecfe

Create Brian_Eno.xml

d927148

dav009 mentioned this pull request May 18, 2015

Wikipedia articles coverage idio/wiki2vec#4

Open
