PSCMModel is a small set of Python scripts implementing user click models, based on the Yandex version (https://github.com/varepsilon/clickmodels).
A Click Model is a probabilistic graphical model used to predict search engine click data from past observations.
This project aims to implement recently proposed click models and is intended to be easy to read and easy to modify. If it's not, please let me know how to improve it :)
- Partially Sequential Click Model (PSCM): Chao Wang, Yiqun Liu, Meng Wang, Ke Zhou, Jian-Yun Nie, Shaoping Ma. Incorporating Non-sequential Behavior into Click Models. SIGIR (2015).
- Temporal Hidden Click Model (THCM): Danqing Xu, Yiqun Liu, Min Zhang, Shaoping Ma. Incorporating revisiting behaviors into click models. WSDM (2012).
- Temporal Click Model (TCM): Wanhong Xu, Eren Manavoglu, Erick Cantú-Paz. Temporal Click Model for Sponsored Search. SIGIR (2010).
- Partially Observable Markov Model (POM): Kuansan Wang, Nikolas Gloy, Xiaolong Li. Inferring search behaviors using partially observable Markov (POM) model. WSDM (2010).
- Dynamic Bayesian Network (DBN): Chapelle, O. and Zhang, Y. 2009. A dynamic Bayesian network click model for web search ranking. WWW (2009). (This model is exactly the same implementation as the Yandex version.)
- User Browsing Model (UBM): Dupret, G. and Piwowarski, B. 2008. A user browsing model to predict search engine click data from past observations. SIGIR (2008).
This file.
Directory with the scripts.
Directory with the sample dataset.
Query-result relevance data for the PSCM paper (Section 5.3).
User preference test data for the PSCM paper (Section 5.4).
A small example (tab-separated) can be found under sample/. Five files are included in this directory:
- query_id: encodes each query as a unique id.
- e.g.: "test 1 10 5" means that the query ("test") has the unique id 1, that 10 sessions were found in the search logs, and that 5 of those sessions contain a click.
- query_class: the probability of each search intent for each query.
- e.g.: "test 0.25 0.25 0.25 0.25" means that the query ("test") has four search intents. Set "query_id 1" for each query if this information is not needed.
- url_id: encodes each URL as a unique id.
- train_data, test_data: search logs, in which each line represents one query session. Each line has 10 tab-separated parts. The inner separator within each [] is a space (" "):
- query_id [url_id * 10] [click * 10] [click_time * 10] [mouse_feature_1 * 10] [mouse_feature_2 * 10] [mouse_feature_3 * 10] [mouse_feature_4 * 10] [mouse_feature_5 * 10] [mouse_feature_6 * 10]
- click: 1 represents click, 0 represents no click
- click_time: >0 represents click time in seconds, -1 represents no click
- mouse_feature_1: The leftmost position the mouse ever reached in the result's display area
- mouse_feature_2: The total length of the user's rightward mouse movement in the result's display area
- mouse_feature_3: The total dwell time that the cursor spends in the result's display area, neglecting its horizontal coordinate
- mouse_feature_4: The total dwell time that the cursor spends in the result's display area
- mouse_feature_5: The total time that the cursor hovers over the result's display area
- mouse_feature_6: The number of cursor actions (scroll, text select, move) that occur in the result's display area
- P.S.: Just set the mouse features to 0 if your search logs do not contain mouse movement information.
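As an illustration of the session line layout described above, here is a minimal parsing sketch. It is not part of the scripts in bin/; the names `Session` and `parse_session_line` are hypothetical, and keeping ids as strings while converting clicks to integers and times and mouse features to floats is an assumption about how one might want to consume the data.

```python
from dataclasses import dataclass
from typing import List

RESULTS_PER_SESSION = 10  # each bracketed group holds 10 space-separated values
NUM_MOUSE_FEATURES = 6    # mouse_feature_1 .. mouse_feature_6


@dataclass
class Session:
    """One query session (hypothetical container, not used by the repo scripts)."""
    query_id: str
    url_ids: List[str]
    clicks: List[int]                  # 1 = click, 0 = no click
    click_times: List[float]           # seconds, -1 = no click
    mouse_features: List[List[float]]  # 6 lists of 10 values each


def parse_session_line(line: str) -> Session:
    """Parse one line of train_data / test_data into a Session."""
    parts = line.rstrip("\n").split("\t")
    expected_parts = 4 + NUM_MOUSE_FEATURES  # query_id, urls, clicks, times, 6 features
    if len(parts) != expected_parts:
        raise ValueError(f"expected {expected_parts} tab-separated parts, got {len(parts)}")

    # Every part after query_id is a space-separated group of 10 values.
    groups = [part.split(" ") for part in parts[1:]]
    for group in groups:
        if len(group) != RESULTS_PER_SESSION:
            raise ValueError(f"expected {RESULTS_PER_SESSION} values per group, got {len(group)}")

    return Session(
        query_id=parts[0],
        url_ids=groups[0],
        clicks=[int(value) for value in groups[1]],
        click_times=[float(value) for value in groups[2]],
        mouse_features=[[float(value) for value in group] for group in groups[3:]],
    )
```

A line for logs without mouse data can be built by filling all six mouse feature groups with zeros, for example `" ".join(["0"] * 10)` for each group.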
In bin/config_sample.py: select the click models to run (e.g.: TEST_MODELS = ['PSCM', 'UBM', 'DBN', 'POM', 'TCM', 'THCM'])
In bin/: python wc_click_model_inference_by_id.py ../sample
In the target data directory, an "/output" directory will be generated automatically, in which the model results are logged:
- model_name.model: Parameters of this model generated from train_data
- model_name.model.perplexity: Perplexity metrics of this model tested on test_data
- model_name.model.relevance: Query-URL-Relevance generated from this model
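For readers unfamiliar with the perplexity metric, the sketch below shows the commonly used definition of click perplexity: 2 raised to the negative mean log2-likelihood of the observed clicks under the predicted click probabilities. This is an assumption about the general metric, not a description of how the scripts in bin/ compute or store it; the function name `click_perplexity` and the clipping constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def click_perplexity(
    clicks: Sequence[int],
    predicted_probs: Sequence[float],
    eps: float = 1e-10,
) -> float:
    """Perplexity of binary click observations under predicted click probabilities.

    A value of 1.0 means perfect prediction; 2.0 corresponds to always
    predicting a click probability of 0.5.
    """
    if len(clicks) != len(predicted_probs) or len(clicks) == 0:
        raise ValueError("clicks and predicted_probs must be non-empty and of equal length")

    log_likelihood = 0.0
    for click, prob in zip(clicks, predicted_probs):
        # Clip to avoid log2(0) when a model is overconfident.
        prob = min(max(prob, eps), 1.0 - eps)
        log_likelihood += math.log2(prob) if click else math.log2(1.0 - prob)

    return 2.0 ** (-log_likelihood / len(clicks))
```

Lower values are better. In click model evaluations this quantity is often computed separately for each result position and then averaged over positions.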