Capstone Report (Yan Fang)
Yan Fang
This capstone is part of the bio-QA project. The goal is to find ways to improve the usability of the bio-QA system. The capstone used the Configuration Space Exploration (CSE) system to simulate the online-learning process. The results imply that, with the right features, the existing CSE can support a very good online-learning system. The capstone also explored the possibility of using CSE to quickly adopt a new dataset. The results give us confidence that CSE can work very well on a new dataset within a very short time.
This capstone is a subproject of the domain-adaptation feasibility study of the bio-QA project. In this capstone project, I tackle two problems:
- Online-learning problem: how much effort users should contribute to make the question-answering system significantly better. It is well known that users' deliberate interaction with the system helps improve the performance of a QA system, but we do not know how much effort users should put in, nor whether our system actually improves when users interact with it. Solving this problem gives the project manager a good understanding of the potential of the current QA system. In this research, I used CSE to simulate the online-learning process, and the results show that with effective user interaction the performance of the system improves significantly.
- New dataset problem: in practice it is also very common to add a new dataset to the system, so having a system that can be quickly tuned becomes very important. In this research, I used CSE to quickly tune a configuration for the TREC 2007 Genomics track (Hersh, Cohen, Ruslen, & Roberts) to match the state of the art (Demner-Fushman, Humphrey, Ide, Loane, et al.). The goal is to understand how quickly and conveniently the bio-QA system can adopt a new dataset and understand new types of questions.
Online-learning pipeline: Display top 10 passages → (Compare with golden standard; Extract features) → Store features and labels → Train models → Store models → Rerank passages.
In this process, the experiment first extracts the top 10 passages for each question and then compares them with the golden standard. Based on the golden standard, the passages are labeled. This is a key step because it simulates the user interaction in reality: because of cost limitations, real users may also only label the first 10 passages. The experiment extracts features from the passages based on the literature (Surdeanu, Ciaramita, & Zaragoza, 2008) and human heuristics. An SVM model (Chang & Lin, 2011) is then trained on the features and labels, and with this model the top 100 passages are reranked.
The experiment was run on the TREC 2006 dataset. For labeling, we consider a passage relevant as long as it overlaps with the golden standard. For convenience, the top 100 passages for each question were cached before the whole process.
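As a rough illustration of this simulated loop, the Python sketch below trains a single reranking model from the labeled top-10 passages. The functions retrieve_passages, overlaps_gold, and extract_features are hypothetical placeholders for the corresponding bioqa components, and scikit-learn's LIBSVM-backed SVR stands in for the LIBSVM model used in the experiment.

```python
from sklearn.svm import SVR  # scikit-learn wraps LIBSVM's epsilon-SVR

def simulate_online_learning(questions, gold, retrieve_passages,
                             overlaps_gold, extract_features):
    """Collect user-style labels for the top 10 passages of each question
    and train a single reranking model on them."""
    features, labels = [], []
    for q in questions:
        top100 = retrieve_passages(q, limit=100)   # cached once per question
        for passage in top100[:10]:                # only 10 passages are "shown"
            label = 1.0 if overlaps_gold(passage, gold[q]) else 0.0
            features.append(extract_features(q, passage))  # numeric feature vector
            labels.append(label)
    model = SVR(kernel="linear")                   # epsilon-SVR, linear kernel
    model.fit(features, labels)
    return model
```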
Mean Average Precision
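All results below are reported as Mean Average Precision (MAP) over the question set. For reference, here is a minimal sketch of TREC-style MAP, assuming binary relevance labels for each ranked list and a known number of relevant passages per question:

```python
def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: 0/1 labels in rank order for one question;
    num_relevant: total number of relevant passages in the golden standard."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

def mean_average_precision(per_question):
    """per_question: list of (ranked_relevance, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in per_question) / len(per_question)
```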
important keyterm count: the count of important keyterms in the passage. The important keyterms are the concept names, protein names, or disease names in the question. For example, in Q160, "What is the role of PrnP in mad cow disease?", "PrnP" and "mad cow disease" are the important keyterms. Their synonyms are treated the same as the original terms. Therefore, if the passage contains "PrnP", "mad cow disease", or one of their synonyms, the count increases by one. Each keyterm (together with its synonyms) is counted only once; that is, we do not count duplicate occurrences in the passage. For Q160 the count is at most 2.
regular keyterm count: regular keyterms are all the terms in the question except stop words. We count their occurrences in the passage. The strategy is the same as for the important keyterm count: each keyterm and its synonyms are counted only once.
normalized length: the original length of the passage divided by 100. The goal is to put all features on the same scale.
percent match: important keyterm count divided by important keyterm count with duplicates. The intuition is that if a keyterm has many duplicates in the passage (i.e., a high frequency), it is not very important in the passage.
keyterm matches score: important keyterm count divided by the keyterm count in the question.
density of keyterms: important keyterm count divided by the term count in the passage. This feature emphasizes the importance of the keyterms in the passage.
standardized features: standardized length, standardized important keyterm count, standardized important keyterm count with duplicates, and standardized regular keyterm count. We use the same standardization strategy as (Gondek, et al., 2012). A sketch of how these features can be computed follows this list.
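The sketch below shows one way these features could be computed. It assumes the important and regular keyterms (with their synonyms) are already available as sets of surface strings, uses simple case-insensitive substring matching for illustration only, and omits the standardized variants.

```python
def passage_features(passage_text, important_keyterms, regular_keyterms):
    """important_keyterms / regular_keyterms: dicts mapping each keyterm in
    the question to the set of its surface forms (the term plus synonyms)."""
    text = passage_text.lower()
    tokens = text.split()

    def unique_count(keyterms):
        # Each keyterm (with its synonyms) is counted at most once.
        return sum(1 for forms in keyterms.values()
                   if any(f.lower() in text for f in forms))

    def duplicate_count(keyterms):
        # Count every occurrence of every surface form.
        return sum(text.count(f.lower())
                   for forms in keyterms.values() for f in forms)

    important = unique_count(important_keyterms)
    important_dup = duplicate_count(important_keyterms)

    return {
        "important_keyterm_count": important,
        "regular_keyterm_count": unique_count(regular_keyterms),
        # "length" is taken as the token count here; the report leaves the unit open.
        "normalized_length": len(tokens) / 100.0,
        "percent_match": important / important_dup if important_dup else 0.0,
        # "keyterm count in the question" is read as the number of important keyterms.
        "keyterm_matches_score": important / len(important_keyterms)
                                 if important_keyterms else 0.0,
        "keyterm_density": important / len(tokens) if tokens else 0.0,
    }
```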
new score = original score + probability (the distance from the boundary)

Model
svm_type: epsilon-SVR
kernel: linear
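As an illustration only, the reranking step can be sketched as follows. The model is assumed to be a trained epsilon-SVR (as in the loop sketched earlier), passages are assumed to carry their original retrieval scores, and the regression output plays the role of the confidence term added to the original score.

```python
def rerank(model, passages, feature_vectors):
    """passages: list of (passage_id, original_score) from the retrieval stage;
    feature_vectors: matching feature vectors for the top 100 passages."""
    confidences = model.predict(feature_vectors)   # epsilon-SVR regression output
    # new score = original score + confidence term
    scored = [(pid, original + conf)
              for (pid, original), conf in zip(passages, confidences)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored]
```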
The baseline is the performance of the system without the online-learning process; its MAP is 0.1352.
| | Baseline | One model for one question | One model for all questions |
|---|---|---|---|
| MAP | 0.1352 | 0.1482 (+9.6%) | 0.1348 |
Since the performance of "one model for one question" and "one model for all questions" is very different, I ran experiments on question categories.
| | Baseline | Online-learning |
|---|---|---|
| Q160–164 (what is the role of …) | 0.1750 | 0.1919 |
| Q165–180 (how does …) | 0.0848 | 0.0821 |
| Q181–187 (how do mutations …) | 0.2219 | 0.2213 |
From the experiment results, we can see that the features we selected work very well when each question has its own model. However, in reality it is infeasible to train a separate model for each question, so this result only gives us a taste of the power of online-learning. When we used one model for all questions, we got slightly worse performance than the baseline, which is acceptable because our feature selection is not yet very good. When we ran the experiment on three categories based on question type, we got very interesting results. For questions like "what is the role of …", our features improve the MAP a lot, but they do not work well for "how …" questions, which ask about the relation between two ontologies. This implies that we should use different features for different question types; it is difficult to find a one-size-fits-all feature set for training the online-learning system.
Experiment

In this experiment, we used the TREC 2007 questions and corpus. There were 36 questions in total. The idea was to use the same system as for TREC 2006 to see how well it worked on the TREC 2007 questions. We also used CSE to explore the possible configurations, for the sake of reaching the state of the art automatically.
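As a purely illustrative sketch of that exploration, a brute-force search over keyterm-refiner settings might look like the following. The options dictionary, run_pipeline, and evaluate_document_map are hypothetical stand-ins for the actual CSE descriptors and bioqa components; the idea is simply to try each configuration and keep the one with the best document MAP.

```python
from itertools import product

def explore_keyterm_refiner(options, questions, run_pipeline, evaluate_document_map):
    """options: dict mapping a keyterm-refiner parameter name to the list of
    candidate values to try; returns the best configuration and its MAP."""
    names = list(options)
    best_config, best_map = None, -1.0
    for values in product(*(options[name] for name in names)):
        config = dict(zip(names, values))
        results = run_pipeline(questions, keyterm_refiner=config)
        score = evaluate_document_map(results)
        if score > best_map:
            best_config, best_map = config, score
    return best_config, best_map
```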
We used the TREC 2007 median as the baseline (Hersh, Cohen, Ruslen, & Roberts). After quickly exploring the configurations of the keyterm-refiner component (for details, refer to Zi's paper), we got the following performance.
| | Baseline | Quick-tuned |
|---|---|---|
| Document MAP | 0.1862 | 0.26 |
| Passage MAP | 0.0560 | 0.06 |
Then we manually checked the synonyms for all the keyterms because, from our experience, having correct synonyms can improve the performance a lot, while our system is not very good at that. After quickly checking the synonyms and adding a few, we got a new document MAP of 0.31, which matches the state of the art.
The difficulty in TREC 2007 is that some questions, such as Q217, "What [PROTEINS] in rats perform functions different from those of their human homologs?", have very general keyterms. As a result, the system has difficulty extracting the relevant documents. Also, the question asks for a "protein", which requires a well-annotated corpus. Because our corpus is not well annotated, we only construct queries that do not need any annotation information, which limits the performance of our system.
This capstone is only the first step toward building a mature bio-QA system, but it does explore the feasibility of using CSE to build an online-learning bio-QA system and to quickly adopt a new dataset. For future work, we need to find more effective features for online-learning. We also need a well-annotated corpus, with which we can handle questions that ask for a specific biological name. What is more, we need to improve the performance of the system by exploring queries that can deal with questions asking about the relation between two ontologies.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27.
Demner-Fushman, D., Humphrey, S. M., Ide, N. C., Loane, R. F., et al. Combining resources to find answers to biomedical questions. The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings.
Gondek, D. C., Lally, A., Kalyanpur, A., Murdock, J. W., Duboue, P. A., Zhang, L., et al. (2012). A framework for merging and ranking of answers in DeepQA. IBM Journal of Research and Development, 56, 14:1–14:12.
Hersh, W., Cohen, A., Ruslen, L., & Roberts, P. TREC 2007 Genomics Track Overview. The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings.
Surdeanu, M., Ciaramita, M., & Zaragoza, H. (2008). Learning to rank answers on large online QA collections. Proceedings of ACL-08 (pp. 719–727). Columbus, OH.
Online-learning: https://github.com/oaqa/bioqa/tree/online-learning
TREC 2007: https://github.com/oaqa/bioqa/tree/2007trec-experiment