
Expand aspect extraction on sentence context and further improve aspect extraction #92

Closed
ChulioZ opened this issue Sep 28, 2018 · 8 comments

Comments


ChulioZ commented Sep 28, 2018

@alexanderpanchenko

Here's my report for this week:

I've implemented aspect extraction based on sentence context. There are two new options that let you configure:

  1. how big the context shall be, i.e., how many sentences before and after the comparison sentence are used for aspect extraction. A value of -1 means all sentences of the context are used.
  2. how many sentences' contexts shall be used. As you can imagine, the aspect extraction process now takes far longer than before, because every single website has to be requested and analyzed. That means we can't use the context of all sentences (unless we want to wait a few hours for larger searches). You can choose from 0.1 % to 100 % of the sentences, and you'll see for yourself that the time the process needs goes up dramatically if you choose a high value. If, for example, you choose 1 %, only the top 1 % of the sentences (sorted by their ES score) will have their contexts analyzed. My guess is that there should be some kind of threshold for the number of contexts we can analyze in an acceptable timeframe. If you choose a context size of 0, you'll get the same results as before, without any context analysis happening.
  • In addition to that, I've also further improved the aspect extraction process as a whole. There were a few minor fixes and one larger change, a new keyword: "reason" (and also "reasons") is now implemented as another trigger word for aspect extraction. I'm currently looking for various types of sentences involving this keyword:
  1. ... reason object (verb) (comparative adjective or adverb) than object ... [Example: The reason Python is better than Matlab is its performance]
  2. ... reason for ...
  3. ... reason why ...
    Whenever I find a structure like that in a sentence, all nouns that appear before or after it (depending on the rest of the sentence) are treated as aspects.
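The "reason" heuristic above could be sketched roughly as follows. This is an illustrative sketch, not the actual CAM code: it assumes the sentence has already been POS-tagged (e.g. with nltk.pos_tag), so the input is a list of (token, tag) pairs, and for simplicity it only collects the nouns appearing after the trigger word.

```python
# Hypothetical sketch of the "reason"/"reasons" trigger-word heuristic.
# Input: a POS-tagged sentence as (token, tag) pairs, e.g. from nltk.pos_tag.
# Function and variable names are assumptions for illustration only.

TRIGGERS = {"reason", "reasons"}

def reason_aspects(tagged_sentence):
    """Return nouns following a 'reason'/'reason for'/'reason why' trigger."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    aspects = []
    for i, tok in enumerate(tokens):
        if tok in TRIGGERS:
            # Collect all nouns after the trigger word (tags NN, NNS, NNP, NNPS).
            for word, tag in tagged_sentence[i + 1:]:
                if tag.startswith("NN"):
                    aspects.append(word)
            break
    return aspects

# Example: "The reason Python is better than Matlab is its performance"
tagged = [("The", "DT"), ("reason", "NN"), ("Python", "NNP"),
          ("is", "VBZ"), ("better", "JJR"), ("than", "IN"),
          ("Matlab", "NNP"), ("is", "VBZ"), ("its", "PRP$"),
          ("performance", "NN")]
print(reason_aspects(tagged))  # → ['Python', 'Matlab', 'performance']
```

The real implementation would additionally distinguish the three sentence patterns and decide whether to take nouns before or after the structure.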

As always, I'm open to suggestions and ideas.

Have a good weekend!

@mschildw
Collaborator

Interesting way to improve the aspect extraction, I like it!

A few points:

  • I would prefer it if you could use /cam2 for such experimental features, since they can always break something. Please don't change the backend running on /cam-api2, because that one is used by our study system (I switched away from cam-api because changes kept coming in).
  • As far as I can see, the ML classifiers are not working anymore.
  • Those selections are just there for testing purposes, right? From a user's perspective, I don't think I would want to select the context size myself. The feature is nice, no question about that, but maybe the context size should be chosen automatically if too little context is available because too few sentences were found.

Have a nice weekend, too!


alexanderpanchenko commented Sep 28, 2018 via email


ChulioZ commented Oct 7, 2018

@mschildw
Alright, I didn't really know what all the different versions of CAM were about. So if I understand you correctly: previously you would have preferred me to use /cam2 for this, but now you have switched the study to /cam2, which means you now actually want me to use the standard /cam for such things, right? Please let me know if I got that wrong.

As for your other question: that's true, the additional options are only visible on the front-end page for testing purposes (@alexanderpanchenko wanted to be able to try different configurations with it). As soon as I (and whoever else wants to test this) have found out which combination of context size and number of sentences to use, this will be implemented in the back end only and will not be visible to the user. I could imagine giving the user the option to toggle context-based aspect extraction on or off, but maybe that's not even necessary.

As for your point about the ML classifiers: I actually have no idea how they work, as they weren't written by me. If I broke something there with my additions, let me know where and how I could fix it, because I really don't know anything about the ML code.

I'll be reporting back with my testing results from this week a bit later today.


ChulioZ commented Oct 7, 2018

@alexanderpanchenko

Here's my report for this week.

I've done a few smaller things to further improve on what I'm already doing:

  • I've verified that NLTK POS tagging is a good method to use. While there were definitely some false tags in the examples I looked at, NLTK's POS tagger is not just a dictionary lookup and seems to be accurate in most cases. Of course it's still possible that, while this does what we want, there are other packages that do the same thing even better.
  • I've added a list for context aspects for Sentence objects. This list holds all aspects that are found within the context of each sentence and can be used for searching sentences containing specific generated aspects (see Make clicking on a generated aspect find all relevant sentences #93).
  • I've changed the context aspect extraction so that whenever a sentence contains multiple id_pairs, only one of them is used, because oftentimes the contexts were almost the same (forum entries quoting another forum entry; the same article on multiple pages), leading to the same aspects being counted multiple times.
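The deduplication in the last bullet could look roughly like this. A minimal sketch under assumed data structures (each sentence carries a list of id_pairs identifying where it was found); the names are illustrative, not the actual CAM code.

```python
# Illustrative sketch of the deduplication described above: if a sentence
# carries several id_pairs (e.g. the same forum post quoted on multiple
# pages), only the first pair's context is fetched, so near-identical
# contexts are not counted multiple times.

def contexts_to_fetch(sentences):
    """Map each sentence to the single id_pair whose context we analyze."""
    selected = {}
    for sentence in sentences:
        if sentence["id_pairs"]:
            # Keep only the first id_pair; the others would yield
            # (almost) the same context and duplicate the aspect counts.
            selected[sentence["text"]] = sentence["id_pairs"][0]
    return selected

sentences = [
    {"text": "Python is faster than Matlab.", "id_pairs": [("doc1", 3), ("doc7", 12)]},
    {"text": "I prefer Python.", "id_pairs": [("doc2", 1)]},
]
print(contexts_to_fetch(sentences))
# → {'Python is faster than Matlab.': ('doc1', 3), 'I prefer Python.': ('doc2', 1)}
```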

In addition to that, my main goal was to further test the context aspect extraction. I haven't tested the quality of the aspects found within the contexts; instead, I've tested the amount of time it takes. The results are pretty underwhelming: as stated in my last report, even a pretty small number of sentences being used for context aspect extraction makes the whole process take far longer than without any use of contexts. Because of that, the default was set to 0.1 % of all sentences. I've tested multiple configurations regarding context size, number of sentences to use, and objects to compare. After all this testing, my suggestions would be:

  • Set the number of sentences used for context aspect extraction to 10 per object. With this setting, the average time needed for a CAM comparison seems to be around 25 seconds (obviously depending on the machine it's running on), which should be an acceptable value. We could also go up to about 20 sentences per object, which increases the duration to around 55 seconds; that seems like a little too much. For comparison: CAM requests without any context aspect extraction take around 5-20 seconds, depending on the number of sentences found. So even with a very small number of sentences used for context aspect extraction, the average time needed for a CAM request increases significantly. An important note if you want to test this yourself: caching seems to play a pretty big role here. When I tried a specific combination for the first time, it usually took longer than when I tried it again; sometimes the difference was actually huge. This seems to be true both for the Elasticsearch requests and for the URL requests happening during context aspect extraction. I had to alter my search parameters constantly to keep this effect to a minimum.
  • Set the context size to 2 (so for each sentence used for context aspect extraction, look at the two sentences before and the two sentences after it). This seems to be good regarding time (although the impact here is not too big; the number of sentences used has a much larger impact) and, more importantly, regarding quality: it wouldn't make much sense to use a sentence appearing 50 sentences later, as that sentence most likely has nothing to do with the sentence CAM found. A sentence closer to CAM's sentence is much more likely to contain relevant information. It's even possible that this should be reduced to 1 for the same reason. Note, however, that as stated earlier I haven't run any tests regarding quality, so this suggestion should be seen as an idea rather than the result of serious testing.
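The two suggested settings can be sketched together: pick the top-N sentences by Elasticsearch score, then take a symmetric window around each. This is a minimal sketch under assumed names (SENTENCES_PER_OBJECT, CONTEXT_SIZE, the helper functions), not the actual CAM implementation.

```python
# Minimal sketch of the suggested configuration: top 10 sentences per
# object (by ES score), each with a context window of 2 sentences
# before and after. All names are illustrative assumptions.

SENTENCES_PER_OBJECT = 10
CONTEXT_SIZE = 2

def top_sentences(sentences, scores, n=SENTENCES_PER_OBJECT):
    """Indices of the n highest-scoring sentences."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

def context_window(document, index, size=CONTEXT_SIZE):
    """Up to `size` sentences before and after the sentence at `index`."""
    start = max(0, index - size)
    return document[start:index] + document[index + 1:index + 1 + size]

doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(context_window(doc, 3))  # → ['s1', 's2', 's4', 's5']
```

Clamping the window at the document boundaries (the max(0, ...) and the slice end) matters because CAM's sentence may sit at the very start or end of a page.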


mschildw commented Oct 8, 2018

> Alright, I didn't really know what all the different versions of CAM were all about. So if I understand you correctly, previously you would have preferred me using /cam2 for this, but now you switched the study to /cam2 and that means now you actually want me to use standard /cam for such things, right? Please let me know if I didn't get that correctly.

The study used the backend on cam-api, and I switched that to cam-api2.
So you could use cam-api3 + cam3, for example.


mschildw commented Oct 11, 2018

@ChulioZ I moved your changes to another branch to keep the master branch clean and running. (I created another branch with your changes and reset master to the commit before.)
Please use git checkout feature/aspectExtractionUsingContext to work on the created branch.
If you have a well-running, finished feature, please just briefly ask @alexanderpanchenko whether it's a problem to merge it into master; I also do this for my features.

As a next step, I will deploy your new branch on /cam3 to provide access to your features and point /cam to the old one, so we have a running demo.

If you have questions, feel free to ask :)

Edit: deployed now http://ltdemos.informatik.uni-hamburg.de/cam3/#/

@mschildw
Collaborator

I fixed the issue with the ML approaches and redeployed on /cam3/, so that at least BoW can also be used now.

To create the demo version, I created a new branch called "demo" ( https://github.com/uhh-lt/cam/tree/demo ). The basis for demo was taken from master, and I merged your feature branch into it to get your changes. Furthermore, I replaced the front-end adaptations with properties in the config.json file.

For further development on your aspect extraction, I suggest working on the created feature branch (described above). If you've pushed changes and want to redeploy on /cam3, all you have to do is go to srv/docker/pan-cam3, run git pull, and execute docker-compose down && docker-compose build && docker-compose up -d (and delete the old images with docker rmi).
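Putting the redeploy steps from this comment into one sequence might look like the following. The directory and commands are taken from the comment itself; the exact path and image names depend on the server, so treat this as a sketch rather than a tested script.

```shell
# Redeploy the feature branch on /cam3 (paths per the comment above;
# they may differ on another server).
cd srv/docker/pan-cam3                 # deployment directory from the comment
git pull                               # fetch the latest feature-branch commits
docker-compose down                    # stop and remove the running containers
docker-compose build                   # rebuild the images with the new code
docker-compose up -d                   # start the stack again in the background
# Finally, remove the old, now-unused images, e.g.:
#   docker rmi <old-image-id>
```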


ChulioZ commented Oct 18, 2018

I've now committed the latest version of context aspect extraction to the branch you created for it. I think it's a stable version now. I've also deployed it to /cam3.

It features context aspect extraction for 10 sentences per object, using a context size of 2 (meaning the 2 sentences before and the 2 sentences after the actual sentence are used). If you need/want context aspect extraction for the demo/YouTube video/study/whatever, you can merge it into master. I can also do that myself if you want me to, @alexanderpanchenko. The front-end part giving the user the option to choose the context sentence count and context size is now gone, because the testing phase is basically over. If you want to test different configurations for those numbers, you can do so by changing the assigned values in pos_link_extracter.py.

The current configuration of 10 sentences per object and a context size of 2 seems to be the best after all my testing. If we find that it still takes too long, or that we could even afford a bit more, I can change it to 5 or 20 sentences. Just tell me when you think the numbers should be changed.

@ChulioZ ChulioZ closed this as completed Oct 24, 2018

3 participants