
Expand aspect extraction on sentence context and further improve aspect extraction #92

Closed
ChulioZ opened this issue Sep 28, 2018 · 8 comments

Comments


ChulioZ commented Sep 28, 2018

@alexanderpanchenko

Here's my report for this week:

I've implemented aspect extraction based on sentence context. There are two new options that let you configure:

  1. how big the context shall be, i.e., how many sentences before and after the comparison sentence are used for aspect extraction. A value of -1 means all sentences of the context are used.
  2. how many sentences' contexts shall be used. As you can imagine, the aspect extraction process now takes far longer than before, because every single website has to be requested and analyzed. That means we can't use the context of all sentences (unless we want to wait a few hours for larger searches). You can choose from 0.1 % to 100 % of the sentences, and you'll see for yourself that the time the process needs goes up dramatically if you choose a high value. If, for example, you choose 1 %, only the top 1 % of the sentences (sorted by their ES score) will have their contexts analyzed. My guess is that there should be some kind of threshold for the number of contexts we can analyze in an acceptable timeframe. If you choose a context size of 0, you'll get the same results as before, without any context analysis happening.
  • In addition to that, I've also further improved the aspect extraction process as a whole. There were a few minor fixes and one larger change, a new keyword: "reason" (and also "reasons") is now implemented as another trigger word for aspect extraction. I'm currently looking for various types of sentences involving this keyword:
  1. ... reason object (verb) (comparative adjective or adverb) than object ... [Example: The reason Python is better than Matlab is its performance]
  2. ... reason for ...
  3. ... reason why ...
    Whenever I find a structure like that in a sentence, all nouns that appear before or after it (depending on the rest of the sentence) are treated as aspects.
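The "reason" heuristic above could be sketched roughly as follows. This is an illustrative sketch, not the actual CAM code: it assumes the sentence has already been POS-tagged (e.g. with nltk.pos_tag), so the input is a list of (token, tag) pairs, and for simplicity it only collects the nouns appearing after the trigger word.

```python
# Hypothetical sketch of the "reason"/"reasons" trigger-word heuristic.
# Input: a POS-tagged sentence as (token, tag) pairs, e.g. from nltk.pos_tag.
# Function and variable names are assumptions for illustration only.

TRIGGERS = {"reason", "reasons"}

def reason_aspects(tagged_sentence):
    """Return nouns following a 'reason'/'reason for'/'reason why' trigger."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    aspects = []
    for i, tok in enumerate(tokens):
        if tok in TRIGGERS:
            # Collect all nouns after the trigger word (tags NN, NNS, NNP, NNPS).
            for word, tag in tagged_sentence[i + 1:]:
                if tag.startswith("NN"):
                    aspects.append(word)
            break
    return aspects

# Example: "The reason Python is better than Matlab is its performance"
tagged = [("The", "DT"), ("reason", "NN"), ("Python", "NNP"),
          ("is", "VBZ"), ("better", "JJR"), ("than", "IN"),
          ("Matlab", "NNP"), ("is", "VBZ"), ("its", "PRP$"),
          ("performance", "NN")]
print(reason_aspects(tagged))  # → ['Python', 'Matlab', 'performance']
```

The real implementation would additionally distinguish the three sentence patterns and decide whether to take nouns before or after the structure.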

As always, I'm open to suggestions and ideas.

Have a good weekend!

@mschildw
Collaborator

Interesting way to improve the aspect extraction, I like it!

A few points:

  • I would prefer it if you could use /cam2 for such experimental features, since they can always break something. Please don't change the backend running on /cam-api2, because that one is used by our study system (I switched away from cam-api because changes kept coming in).
  • As far as I can see, the ML classifiers are not working anymore.
  • Those selections are just there for testing purposes, right? From a user's perspective, I don't think I would want to select the context size myself. The feature is nice, no question about that, but maybe the context size should be chosen automatically if too little context is available because too few sentences were found.

Have a nice weekend, too!


alexanderpanchenko commented Sep 28, 2018 via email


ChulioZ commented Oct 7, 2018

@mschildw
Alright, I didn't really know what all the different versions of CAM were about. So if I understand you correctly: previously you would have preferred me to use /cam2 for this, but now you have switched the study to /cam2, which means you now actually want me to use the standard /cam for such things, right? Please let me know if I got that wrong.

As for your other question: that's true, the additional options are only visible on the front-end page for testing purposes (@alexanderpanchenko wanted to be able to try different configurations with it). As soon as I (and whoever else wants to test this) have found out which combination of context size and number of sentences to use, this will be implemented in the back end only and will not be visible to the user. I could imagine giving the user the option to toggle context-based aspect extraction on or off, but maybe that's not even necessary.

As for your point about the ML classifiers: I actually have no idea how they work, as they weren't written by me. If I broke something there with my additions, let me know where and how I could fix it, because I really don't know anything about the ML code.

I'll be reporting back with my testing results from this week a bit later today.


ChulioZ commented Oct 7, 2018

@alexanderpanchenko

Here's my report for this week.

I've done a few smaller things to further improve on what I'm already doing:

  • I've verified that NLTK POS tagging is a good method to use. While there were definitely some false tags in the examples I looked at, NLTK's POS tagger is not just a dictionary lookup and seems to be accurate in most cases. Of course it's still possible that, while this does what we want, there are other packages that do the same thing even better.
  • I've added a list for context aspects for Sentence objects. This list holds all aspects that are found within the context of each sentence and can be used for searching sentences containing specific generated aspects (see Make clicking on a generated aspect find all relevant sentences #93).
  • I've changed the context aspect extraction so that whenever a sentence contains multiple id_pairs, only one of them is used, because oftentimes the contexts were almost the same (forum entries quoting another forum entry; the same article on multiple pages), leading to the same aspects being counted multiple times.
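The deduplication in the last bullet could look roughly like this. A minimal sketch under assumed data structures (each sentence carries a list of id_pairs identifying where it was found); the names are illustrative, not the actual CAM code.

```python
# Illustrative sketch of the deduplication described above: if a sentence
# carries several id_pairs (e.g. the same forum post quoted on multiple
# pages), only the first pair's context is fetched, so near-identical
# contexts are not counted multiple times.

def contexts_to_fetch(sentences):
    """Map each sentence to the single id_pair whose context we analyze."""
    selected = {}
    for sentence in sentences:
        if sentence["id_pairs"]:
            # Keep only the first id_pair; the others would yield
            # (almost) the same context and duplicate the aspect counts.
            selected[sentence["text"]] = sentence["id_pairs"][0]
    return selected

sentences = [
    {"text": "Python is faster than Matlab.", "id_pairs": [("doc1", 3), ("doc7", 12)]},
    {"text": "I prefer Python.", "id_pairs": [("doc2", 1)]},
]
print(contexts_to_fetch(sentences))
# → {'Python is faster than Matlab.': ('doc1', 3), 'I prefer Python.': ('doc2', 1)}
```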

In addition to that, my main goal was to further test the context aspect extraction. I haven't tested the quality of the aspects found within the contexts; instead, I've tested the amount of time it takes. The results are pretty underwhelming: as stated in my last report, even a pretty small number of sentences being used for context aspect extraction makes the whole process take far longer than without any use of contexts. Because of that, the default was set to 0.1 % of all sentences. I've tested multiple configurations regarding context size, number of sentences to use, and objects to compare. After all this testing, my suggestions would be:

  • Set the number of sentences used for context aspect extraction to 10 per object. With this setting, the average time needed for a CAM comparison seems to be around 25 seconds (obviously depending on the machine it's running on), which should be an acceptable value. We could also go up to about 20 sentences per object, which increases the duration to around 55 seconds; that seems like a little too much. For comparison: CAM requests without any context aspect extraction take around 5-20 seconds, depending on the number of sentences found. So even with a very small number of sentences used for context aspect extraction, the average time needed for a CAM request increases significantly. An important note if you want to test this yourself: caching seems to play a pretty big role here. When I tried a specific combination for the first time, it usually took longer than when I tried it again; sometimes the difference was actually huge. This seems to be true both for the Elasticsearch requests and for the URL requests happening during context aspect extraction. I had to alter my search parameters constantly to keep this effect to a minimum.
  • Set the context size to 2 (so for each sentence used for context aspect extraction, look at the two sentences before and the two sentences after it). This seems to be good regarding time (although the impact here is not too big; the number of sentences used has a much larger impact) and, more importantly, regarding quality: it wouldn't make much sense to use a sentence appearing 50 sentences later, as that sentence most likely has nothing to do with the sentence CAM found. A sentence closer to CAM's sentence is much more likely to contain relevant information. It's even possible that this should be reduced to 1 for the same reason. Note, however, that as stated earlier I haven't run any tests regarding quality, so this suggestion should be seen as an idea rather than the result of serious testing.
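The two suggested settings can be sketched together: pick the top-N sentences by Elasticsearch score, then take a symmetric window around each. This is a minimal sketch under assumed names (SENTENCES_PER_OBJECT, CONTEXT_SIZE, the helper functions), not the actual CAM implementation.

```python
# Minimal sketch of the suggested configuration: top 10 sentences per
# object (by ES score), each with a context window of 2 sentences
# before and after. All names are illustrative assumptions.

SENTENCES_PER_OBJECT = 10
CONTEXT_SIZE = 2

def top_sentences(sentences, scores, n=SENTENCES_PER_OBJECT):
    """Indices of the n highest-scoring sentences."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

def context_window(document, index, size=CONTEXT_SIZE):
    """Up to `size` sentences before and after the sentence at `index`."""
    start = max(0, index - size)
    return document[start:index] + document[index + 1:index + 1 + size]

doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(context_window(doc, 3))  # → ['s1', 's2', 's4', 's5']
```

Clamping the window at the document boundaries (the max(0, ...) and the slice end) matters because CAM's sentence may sit at the very start or end of a page.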


mschildw commented Oct 8, 2018

> Alright, I didn't really know what all the different versions of CAM were all about. So if I understand you correctly, previously you would have preferred me using /cam2 for this, but now you switched the study to /cam2 and that means now you actually want me to use standard /cam for such things, right? Please let me know if I didn't get that correctly.

The study used the backend on cam-api, and I switched that to cam-api2.
So you could use cam-api3 + cam3, for example.


mschildw commented Oct 11, 2018

@ChulioZ I moved your changes to another branch to keep the master branch clean and running. (I created another branch with your changes and reset master to the commit before.)
Please use git checkout feature/aspectExtractionUsingContext to work on the created branch.
If you have a well-running, finished feature, please just briefly ask @alexanderpanchenko whether it's a problem to merge it into master; I also do this for my features.

As a next step, I will deploy your new branch on /cam3 to provide access to your features and point /cam to the old one, so we have a running demo.

If you have questions, feel free to ask :)

Edit: deployed now http://ltdemos.informatik.uni-hamburg.de/cam3/#/

@mschildw
Collaborator

I fixed the issue with the ML approaches and redeployed on /cam3/, so that at least BoW can also be used now.

To create the demo version, I created a new branch called "demo" ( https://github.com/uhh-lt/cam/tree/demo ). The basis for demo was taken from master, and I merged your feature branch into it to get your changes. Furthermore, I replaced the front-end adaptations with properties in the config.json file.

For further development on your aspect extraction, I suggest working on the created feature branch (described above). If you've pushed changes and want to redeploy on /cam3, all you have to do is go to srv/docker/pan-cam3, run git pull, and execute docker-compose down && docker-compose build && docker-compose up -d (and delete the old images with docker rmi).
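Putting the redeploy steps from this comment into one sequence might look like the following. The directory and commands are taken from the comment itself; the exact path and image names depend on the server, so treat this as a sketch rather than a tested script.

```shell
# Redeploy the feature branch on /cam3 (paths per the comment above;
# they may differ on another server).
cd srv/docker/pan-cam3                 # deployment directory from the comment
git pull                               # fetch the latest feature-branch commits
docker-compose down                    # stop and remove the running containers
docker-compose build                   # rebuild the images with the new code
docker-compose up -d                   # start the stack again in the background
# Finally, remove the old, now-unused images, e.g.:
#   docker rmi <old-image-id>
```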


ChulioZ commented Oct 18, 2018

I've now committed the latest version of context aspect extraction to the branch you created for it. I think it's a stable version now. I've also deployed it to /cam3.

It features context aspect extraction for 10 sentences per object, using a context size of 2 (meaning the 2 sentences before and the 2 sentences after the actual sentence are used). If you need/want context aspect extraction for the demo/YouTube video/study/whatever, you can merge it into master. I can also do that myself if you want me to, @alexanderpanchenko. The front-end part giving the user the option to choose the context sentence count and context size is now gone, because the testing phase is basically over. If you want to test different configurations for those numbers, you can do so by changing the assigned values in pos_link_extracter.py.

The current configuration of 10 sentences per object and a context size of 2 seems to be the best after all my testing. If we find that it still takes too long, or that we could even afford a bit more, I can change it to 5 or 20 sentences. Just tell me when you think the numbers should be changed.

@ChulioZ ChulioZ closed this as completed Oct 24, 2018

3 participants