Adding Bold Surface Forms #17

Open
wants to merge 6 commits into
base: master

abhishekg2389

Hi @tgalery ,

This is regarding pull request dbpedia-spotlight/dbpedia-spotlight#356.
I have made some changes to get the bold mentions out of the dump file. Please review the changes.
As far as test cases are concerned, I have extracted words surrounded by triple single quotes (''') from the dump.

@@ -385,6 +387,13 @@ public void nodesToText(List<? extends Object> nodes, Appendable buffer,
tagName));
countingBuffer.append("\n\n");
}
if ("b".equals(tagName)) {
if (tagBegin < 500) {
Member

Just a question: is this so that you only get bold forms that occur in the first 500 chars of the page?

Author

Hi @tgalery,
Actually, I want to extract bold forms only from the definition/introduction part of the page, not from the whole page. If I include bold forms from the whole page, it may introduce some false surface forms. Here are some examples:

  1. http://en.wikipedia.org/wiki/Sidewalk: "roadway"
  2. http://en.wikipedia.org/wiki/Analysis_of_variance: "most used" and "most useful"
  3. http://en.wikipedia.org/wiki/Phrase_(music): "antecedent phrase", "consequent phrase", "phrase-group" and "Phrase rhythm"
  4. http://en.wikipedia.org/wiki/Afroasiatic_languages: Several bold forms in table

Extracting all bold forms from a page can also yield some valid surface forms, such as the ones below, but as seen above we cannot be sure they are valid in all cases:

  1. http://en.wikipedia.org/wiki/Radio_Warwick: "URW312", "University Radio Warwick"
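
A minimal sketch of the rule described above, assuming bold forms are marked with ''' in raw wikitext and that offsets are counted over that raw text. The actual patch checks `tagBegin` while walking parsed nodes, so its offsets may differ. `IntroBoldForms`, `INTRO_LIMIT` and `extract` are hypothetical names used only for illustration, not part of pignlproc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: keeps bold forms that START inside the intro window. */
final class IntroBoldForms {
    // Assumed cut-off, mirroring the `tagBegin < 500` check in the diff.
    static final int INTRO_LIMIT = 500;

    // Text wrapped in exactly three single quotes. Bold-italic markup
    // ('''''text''''') is deliberately not matched by this pattern.
    private static final Pattern BOLD =
        Pattern.compile("(?<!')'''(?!')(.+?)(?<!')'''(?!')");

    static List<String> extract(String wikitext) {
        List<String> forms = new ArrayList<>();
        Matcher m = BOLD.matcher(wikitext);
        while (m.find()) {
            // Matches arrive in increasing offset order, so stop at the
            // first bold form that begins at or after the limit.
            if (m.start() >= INTRO_LIMIT) {
                break;
            }
            // The whole form is kept, even if it ends past the limit.
            forms.add(m.group(1));
        }
        return forms;
    }
}
```

With this rule a form that starts at offset 497 and ends after 500 is returned whole, while a form that starts at or after offset 500 is ignored.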

@tgalery
Member

tgalery commented May 14, 2015

I wonder why the tests are failing. Maybe you changed a method name that didn't need to be changed?

@abhishekg2389
Author

I have also updated the testCase for Boldform Extraction.

@tgalery
Member

tgalery commented May 17, 2015

Thanks @abhishekg2389! I will take a look at it when I have some time!

@abhishekg2389
Author

Hi @tgalery,

Can you please confirm that the surface form store is generated from pignlproc only and nothing else, so that I can start working on creating unseen surface forms as I mentioned in my proposal?

@tgalery
Member

tgalery commented May 18, 2015

@abhishekg2389 pignlproc generates all the stats we use. It would be good to run the pignlproc process locally to see whether the generated model indeed has the bold sfs that you are trying to capture. Running the process on a small wikidump, like Danish, would be feasible on a single machine. Not sure when I will have some spare time though.
As for the unseen forms, I would take a step back and implement a loose sf matching function that generates variations based on a match and tries to retrieve all the candidates for all the matches.
There was a very old commit of mine that tried to fix some white space issues here: https://github.com/dbpedia-spotlight/dbpedia-spotlight/pull/284/files , but I would go beyond that and do the following: (i) get a spot -> (ii) generate a set of possible sfs on the basis of (i) -> (iii) get all the candidates from the sfs in (ii), but adjust the sf probability according to some function that calculates a distance between the original sf in (i) and the sf generated in (ii).
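
One possible shape for the adjustment in step (iii), as a sketch only. The thread does not fix a distance function, so Levenshtein edit distance and the linear discount below are assumptions, and `LooseMatchScore`, `editDistance` and `adjust` are hypothetical names rather than Spotlight APIs.

```java
/** Illustrative only: discounts an sf probability by edit distance. */
final class LooseMatchScore {

    /** Levenshtein distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Assumed discount: scale the probability by normalized similarity
     * between the original spot (i) and the generated sf (ii).
     */
    static double adjust(double sfProbability, String spot, String variant) {
        int longest = Math.max(spot.length(), variant.length());
        if (longest == 0) {
            return sfProbability;
        }
        double similarity = 1.0 - (double) editDistance(spot, variant) / longest;
        return sfProbability * similarity;
    }
}
```

An exact match keeps its probability unchanged, and the further a generated sf drifts from the spot, the more its candidates are penalized.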

@dav009
Member

dav009 commented May 19, 2015

There is a small sample of the wikidump in this repo: https://github.com/dbpedia-spotlight/pignlproc/tree/master/src/test/resources
You could check if there is any markup for Bold SFs for a particular article and create a test-set

@abhishekg2389
Author

Hi @dav009, @tgalery,

I have committed a couple of files above: Brian_Eno.xml (a test case resource) and TestWikipediaLoader.java, in which I have written the test case for the XML file. I generated the bold forms locally (4 were generated) and use their start and end offsets to verify bold form generation. Moreover, I'll be uploading the output for the Danish wikidump soon.

@abhishekg2389
Author

@tgalery,

In one of my mails I suggested a similar idea: instead of generating all unseen SFs for the SF store up front, we can generate unseen SFs on the go (during SF/mention and candidate matching), though it might take some time. I also suggested a probability function in my proposal, so I think we can use that here.
May I work on the idea you suggested, if that is fine with you?

@tgalery
Member

tgalery commented May 19, 2015

Hi @abhishekg2389, thanks for your effort. On the test case, it would be nice to add a bold surface form that starts at, say, char 497 and ends at an index greater than 500, so that according to your rule the beginning of the bold word would be captured but not the end. In this case I would expect your functions not to break, nor to output the substring between indices 497-500.

@tgalery
Member

tgalery commented May 19, 2015

On the loose sf matching, doing the generation is actually kind of easy. The strategy at the moment is this: first we get a spot; if we manage to find candidates in the uppercase SF store, we retrieve them and move them to the disambiguation phase. You need to change that: the pipeline would be: get the spot -> generate new spots -> get all the candidates in both the uppercase SF store and the lowercase SF store. I would advise you to do the generation step in stages: first just generate different case combinations and remove plurals, and then we might move to something more complex. You could base your generator on this: https://github.com/dnmilne/wikipediaminer/blob/049b1d4a9568144e9e6704b9090d0579b21b3e2e/wikipedia-miner-core/src/main/java/org/wikipedia/miner/util/text/CaseAccentSimpleTextProcessor.java . Hope it helps.
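
A rough sketch of the first stage of that pipeline (case combinations only), assuming the two SF stores can be modelled as plain maps from surface form to candidate URIs. `SpotVariants`, `caseVariants`, `titleCase` and `candidates` are hypothetical names; the real Spotlight store interfaces are not shown in this thread and will differ.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only: spot -> generated spots -> candidates from both stores. */
final class SpotVariants {

    /** Original spot plus lower-case, upper-case and title-case variants. */
    static Set<String> caseVariants(String spot) {
        Set<String> out = new LinkedHashSet<>();
        out.add(spot);
        out.add(spot.toLowerCase(Locale.ROOT));
        out.add(spot.toUpperCase(Locale.ROOT));
        out.add(titleCase(spot));
        return out;
    }

    /** Upper-cases the first letter of each whitespace-separated word. */
    static String titleCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char c : s.toCharArray()) {
            if (Character.isWhitespace(c)) {
                startOfWord = true;
                sb.append(c);
            } else if (startOfWord) {
                sb.append(Character.toUpperCase(c));
                startOfWord = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    /** Looks every variant up in both (stand-in) stores and merges the hits. */
    static Set<String> candidates(String spot,
                                  Map<String, Set<String>> upperCaseStore,
                                  Map<String, Set<String>> lowerCaseStore) {
        Set<String> found = new LinkedHashSet<>();
        for (String variant : caseVariants(spot)) {
            found.addAll(upperCaseStore.getOrDefault(variant, Collections.emptySet()));
            found.addAll(lowerCaseStore.getOrDefault(variant, Collections.emptySet()));
        }
        return found;
    }
}
```

Plural removal and anything more complex would be added as further variant generators feeding the same lookup.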

@abhishekg2389
Author

Hi @tgalery

According to my rule, we consider bold forms which start before 500 characters. So a bold form which starts before 500 and ends after 500 will still be considered, but it will be the last one that we take. So should I change the rule and the test case or not? And can you tell me whether I have to upload the output from the Danish wikidump for verification?
Moreover I will start working on generating surface forms and will commit the code soon.

@tgalery
Member

tgalery commented May 19, 2015

Hi @abhishekg2389, the output of the pignlproc process will be a new Danish model. It would be good to check some Danish Wikipedia pages and see whether the bold forms that you captured are indeed there.

@tgalery
Member

tgalery commented May 19, 2015

Maybe you can post the zipped model to dropbox or something and post the url here so we can double check.

@tgalery
Member

tgalery commented May 19, 2015

And I forgot to mention: if you are working on the spotlight code, could you fork from our repo and create a feature branch from dev, please?

@abhishekg2389
Author

Hi @tgalery,

Please find only bold surface forms here: https://www.dropbox.com/s/3gopxn0csynto8i/bolds.tar.gz?dl=0
My code has a small problem with cases where the text is both bold and italic, but those cases are very rare.
And I will create a feature branch from dev and start working on that.

@abhishekg2389
Author

Hi @tgalery,

Have you checked the output that I uploaded on DropBox?
Moreover, I have started working on the approach and will come up with the code in a few days.

@abhishekg2389
Author

There is one more thing that I would like to ask. For removing plurals there are two options:

  1. Replacing with rules - "s" -> "", "ses" -> "s", "xes" -> "x", "zes" -> "z", "ches" -> "ch", "shes" -> "sh", "men" -> "man", "ies" -> "y"
  2. Use stemmer by John Carroll: https://github.com/knowitall/morpha

With option 1 we might miss some complex cases like men, children. But besides these rules I have a semi-exhaustive list of plural-to-singular conversions.
Option 2 might be more exhaustive than option 1, but it will be more time-consuming.
Please suggest which option would be appropriate.
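
For reference, option 1 could look roughly like the sketch below. It applies only the suffix rules listed above, tried longest suffix first (an assumption, since the comment does not give an order), and `SuffixSingularizer` and `singularize` are hypothetical names.

```java
/** Illustrative only: rule-based plural stripping as in option 1. */
final class SuffixSingularizer {
    // Rules from the list above, ordered so that longer suffixes win
    // over the catch-all "s" rule.
    private static final String[][] RULES = {
        {"ches", "ch"}, {"shes", "sh"},
        {"ses", "s"}, {"xes", "x"}, {"zes", "z"},
        {"ies", "y"}, {"men", "man"},
        {"s", ""},
    };

    static String singularize(String word) {
        for (String[] rule : RULES) {
            String suffix = rule[0];
            // Require at least one character before the suffix.
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length()) + rule[1];
            }
        }
        return word; // no rule applies, e.g. "children"
    }
}
```

The rules over-apply in places: "houses" becomes "hous" and "bus" becomes "bu", which is where an exception list or a stemmer such as the one in option 2 would be needed.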

@tgalery
Member

tgalery commented Jun 17, 2015

Hi @abhishekg2389 I'm sorry but I still am swamped. I might try to have a look at the dropbox files Sunday, but it's a bit unlikely. Is your plural remover related to normalizing data in the pignlproc or to the possible loose sf matching you are investigating at the moment?

@abhishekg2389
Author

Hi @tgalery,

Have you had a chance to look at the uploaded files?

@tgalery
Member

tgalery commented Nov 4, 2015

Hi @abhishekg2389, sorry for the massive delay, but I haven't had much time to look at things. I might try this week or next. To be honest, we are trying to move away from pignlproc to a newer system based on Spark. That system builds on a repo called json-wikipedia, https://github.com/diegoceccarelli/json-wikipedia , which creates a JSON representation of Wikipedia. It would be great if you could add a field to the page schema that captures the bold surface forms. The parsing of pages is done by some libs, e.g. jwlpro and bliki, so they might have some helper functions already in place.

3 participants