Adding Bold Surface Forms #17
@@ -385,6 +387,13 @@ public void nodesToText(List<? extends Object> nodes, Appendable buffer,
tagName));
countingBuffer.append("\n\n");
}
if ("b".equals(tagName)) {
if (tagBegin < 500) {
Just a question: is this so that you only get bold forms that occur in the first 500 chars of the page?
Hi @tgalery,
Actually, I want to extract bold forms only from the definition/introduction part of the page, not from the whole page. Including bold forms from the whole page may introduce false surface forms. Here are some examples:
- http://en.wikipedia.org/wiki/Sidewalk: "roadway"
- http://en.wikipedia.org/wiki/Analysis_of_variance: "most used" and "most useful"
- http://en.wikipedia.org/wiki/Phrase_(music): "antecedent phrase", "consequent phrase", "phrase-group" and "Phrase rhythm"
- http://en.wikipedia.org/wiki/Afroasiatic_languages: Several bold forms in table
We can also get some valid surface forms if we extract all bold forms from the page, but as seen above we cannot be sure of their validity in all cases.
- http://en.wikipedia.org/wiki/Radio_Warwick: "URW312", "University Radio Warwick"
I wonder why the tests are failing. Maybe you changed a method name that didn't need to be changed?
I have also updated the test case for bold form extraction.
Thanks @abhishekg2389! I will take a look at it when I have some time!
Hi @tgalery, can you please confirm that the surface form store is generated from pignlproc only and nothing else, so that I can start working on creating unseen surface forms as I mentioned in my proposal?
@abhishekg2389 pignlproc generates all the stats we use. It would be good to run the pignlproc process locally to see whether the generated model indeed has the bold SFs that you are trying to capture. Running the process on a small wikidump, like the Danish one, would be feasible on a single machine. Not sure when I will have some spare time though.
There is a small sample of the wikidump in this repo: https://github.com/dbpedia-spotlight/pignlproc/tree/master/src/test/resources
I have committed a couple of files above: Brian_Eno.xml (a test case resource) and TestWikipediaLoader.java, in which I have written the test case for the XML file. I generated the bold forms locally (4 were generated) and use their start and end offsets to verify bold form generation. I'll also be uploading the output for the Danish wikidump soon.
In one of the mails I also suggested a similar idea: instead of generating all unseen SFs for the SF store, we can generate unseen SFs on the go (during SF/mention and candidate matching), but it might take some time. I also suggested a probability function in my proposal, so I think we can use that here.
Hi @abhishekg2389, thanks for your effort. On the test case, it would be nice to add a bold surface form that starts at, say, char 497 and ends at an index greater than 500, so that under your rule the beginning of the bold word would be captured but not the end. In this case I would expect your functions neither to break nor to output the substring between indices 497-500.
On the loose SF matching, doing the generation is actually kind of easy. The strategy at the moment is this: first we get a spot; if we manage to find candidates in the uppercase SF store, we retrieve them and move them to the disambiguation phase. You need to change that. The pipeline would be: get the spot -> generate new spots -> get all the candidates in both the uppercase SF store and the lowercase SF store. I would advise you to do the generation step in stages: first just generating different case combinations and removing plurals, and then we might move to something more complex. You could base your generator on this: https://github.com/dnmilne/wikipediaminer/blob/049b1d4a9568144e9e6704b9090d0579b21b3e2e/wikipedia-miner-core/src/main/java/org/wikipedia/miner/util/text/CaseAccentSimpleTextProcessor.java . Hope it helps.
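A minimal sketch of the first stage of that pipeline (case variants only, no plural handling yet). The class `LooseSpotMatcher`, its method names, and the modelling of both SF stores as plain `Map`s keyed by surface form are assumptions made for illustration; they are not the Spotlight API.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of dbpedia-spotlight.
final class LooseSpotMatcher {

    /** Returns the spot itself plus lowercase, uppercase and title-case variants. */
    static Set<String> caseVariants(String spot) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add(spot);
        String lower = spot.toLowerCase(Locale.ROOT);
        variants.add(lower);
        variants.add(spot.toUpperCase(Locale.ROOT));

        // Title case: capitalize the first letter of each space-separated word.
        StringBuilder title = new StringBuilder();
        for (String word : lower.split(" ")) {
            if (title.length() > 0) {
                title.append(' ');
            }
            if (!word.isEmpty()) {
                title.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
            }
        }
        variants.add(title.toString());
        return variants;
    }

    /**
     * Looks every generated variant up in both stores and merges the candidates.
     * The stores are assumed to map a surface form to candidate resource names.
     */
    static Set<String> candidates(String spot,
                                  Map<String, List<String>> upperStore,
                                  Map<String, List<String>> lowerStore) {
        Set<String> result = new LinkedHashSet<>();
        for (String variant : caseVariants(spot)) {
            result.addAll(upperStore.getOrDefault(variant, Collections.<String>emptyList()));
            result.addAll(lowerStore.getOrDefault(variant, Collections.<String>emptyList()));
        }
        return result;
    }
}
```

Collecting into a set keeps a candidate that is reachable through several variants from being passed to disambiguation more than once.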
Hi @tgalery, according to my rule we consider bold forms that start before 500 characters. So a bold form that starts before 500 and ends after 500 will still be considered, but it will be the last one we take. So should I change the rule and the test case or not? Also, can you tell me whether I should upload the output from the Danish wikidump for verification?
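To make the boundary case concrete, here is a small sketch of the rule as described: a bold form is kept whenever its start offset is below the cutoff, regardless of where it ends. `BoldSpanFilter`, `Span` and `INTRO_LIMIT` are hypothetical names; only the `500` comes from the `tagBegin < 500` check in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the start-offset rule, not the code in this PR.
final class BoldSpanFilter {

    // Cutoff taken from the `tagBegin < 500` check in the diff.
    static final int INTRO_LIMIT = 500;

    /** Start (inclusive) and end (exclusive) offsets of one bold form in the page text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Keeps every span that starts before the cutoff; kept spans are never truncated. */
    static List<Span> keepIntroSpans(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span span : spans) {
            if (span.begin < INTRO_LIMIT) {
                kept.add(span);
            }
        }
        return kept;
    }
}
```

Under this reading, a span covering offsets 497 to 505 is kept whole, while a span starting exactly at 500 is dropped, so no partial substring such as 497-500 is ever emitted.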
Hi @abhishekg2389, the output of the pignlproc process will be a new Danish model. It would be good to check some Danish Wikipedia pages and see whether the bold forms that you captured are indeed there.
Maybe you can post the zipped model to Dropbox or something and post the URL here so we can double check.
And I forgot to mention: if you are working on the Spotlight code, could you fork from our repo and create a feature branch from dev, please?
Hi @tgalery, please find only the bold surface forms here: https://www.dropbox.com/s/3gopxn0csynto8i/bolds.tar.gz?dl=0
Hi @tgalery, have you checked the output that I uploaded to Dropbox?
There is one more thing that I would like to ask. For removing plurals there are two options:
In option 1 we might miss some complex cases like men, children. But besides these rules I have a semi-exhaustive list of plural-to-singular conversions.
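Since the two options are not spelled out in this thread, the following is only a sketch of one plausible combination: consult a lookup table of irregular plurals first, then fall back to a few suffix rules. `NaiveSingularizer` is a hypothetical name, the table holds just the two examples mentioned above rather than the semi-exhaustive list, and the result is returned in lowercase.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: irregular-plural lookup with a rule-based fallback.
final class NaiveSingularizer {

    // Tiny illustrative table; a real list would be much longer.
    private static final Map<String, String> IRREGULAR = new HashMap<>();

    static {
        IRREGULAR.put("men", "man");
        IRREGULAR.put("children", "child");
    }

    /** Returns a lowercased singular guess for the given word. */
    static String singularize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);

        String irregular = IRREGULAR.get(lower);
        if (irregular != null) {
            return irregular;
        }
        // "cities" -> "city"
        if (lower.endsWith("ies") && lower.length() > 4) {
            return lower.substring(0, lower.length() - 3) + "y";
        }
        // "classes" -> "class", "boxes" -> "box", "matches" -> "match"
        if (lower.endsWith("sses") || lower.endsWith("xes")
                || lower.endsWith("ches") || lower.endsWith("shes")) {
            return lower.substring(0, lower.length() - 2);
        }
        // Plain trailing "s", skipping endings that are usually not plural markers.
        if (lower.endsWith("s") && lower.length() > 3
                && !lower.endsWith("ss") && !lower.endsWith("is") && !lower.endsWith("us")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }
}
```

Suffix rules like these remain lossy (for example, "buses" becomes "buse"), which is why a larger exception list helps.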
Hi @abhishekg2389, I'm sorry but I am still swamped. I might try to have a look at the Dropbox files on Sunday, but it's a bit unlikely. Is your plural remover related to normalizing data in pignlproc or to the possible loose SF matching you are investigating at the moment?
Hi @tgalery, have you had a chance to look at the uploaded files?
Hi @abhishekg2389, sorry for the massive delay, but I haven't had much time to look at things. I might try this week or next. To be honest, we are trying to move away from pignlproc to a newer system based on Spark. That system relies on a repo called json-wikipedia (https://github.com/diegoceccarelli/json-wikipedia), which creates a JSON representation of Wikipedia. It would be great if you could add a field to the schema that represents the page, to capture the bold surface forms. The parsing of pages is done by some libs, e.g. jwlpro and bliki, so they might already have some helper functions in place.
Hi @tgalery ,
This is regarding pull request dbpedia-spotlight/dbpedia-spotlight#356.
I have made some changes to get the bold mentions out of the dump file. Please review the changes.
As far as test cases are concerned, I have extracted the words surrounded by triple single quotes (''') from the dump.
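For reference, extracting text between triple single quotes from raw wikitext can be sketched with a regular expression. This is an illustrative stand-in, not the code in the pull request; `WikitextBoldExtractor` is a hypothetical name, and the pattern deliberately matches runs of exactly three quotes, so bold-italic markup (five quotes) and bold text spanning several lines are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for building expected values in a test case.
final class WikitextBoldExtractor {

    // Exactly three quotes, some text on one line, exactly three quotes.
    private static final Pattern BOLD =
            Pattern.compile("(?<!')'{3}(?!')(.+?)(?<!')'{3}(?!')");

    /** Returns the bold forms found in the wikitext, in document order. */
    static List<String> boldForms(String wikitext) {
        List<String> forms = new ArrayList<>();
        Matcher matcher = BOLD.matcher(wikitext);
        while (matcher.find()) {
            forms.add(matcher.group(1).trim());
        }
        return forms;
    }
}
```

`matcher.start(1)` and `matcher.end(1)` give the offsets of each form in the raw wikitext if a test needs positions as well; those offsets differ from offsets in the rendered page text.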