alheliou · Jul 23, 2020
diff --git a/README.md b/README.md
diff --git a/dataset/bike/Readme.txt b/dataset/bike/Readme.txt
@@ -20,12 +20,12 @@ The method used is:
 Assume that we have trained a machine learning model to predict the probability of recividism of a given individual. The algorithm is quite effective but it only returns a probability score without any details on how it has made its choice.
 We would like to know how each attribute (characteristic) influences the model output. Furthermore, contributions explain the difference between the individual prediction and the mean prediction for all references. These references are defined by the user (e.g. for classification, interesting references are selected into other predicted classes).
 
-<img alt="Exporting from nbdev" width="700" caption="On this example, the fact that this person has commited 6 priors crime, is African-American and 27 years old, his legal status is Post Sentence, are mainly explained why the model has predicted such probability score. The contributions could also be negative, e.g. his probation custody status influences the model towards a low probability of recividism." src="nbs/images/shap_readme_illustration.png">
+<img alt="Exporting from nbdev" width="1000" caption="In this example, the age (21 years old) and the ethnicity of individual x increase the estimated probability of recidivism by 46 and 15 percentage points respectively, while the fact that he has never committed any crime decreases it by 9 percentage points." src="nbs/images/compas_plot.png">
 
-This picture displays the kind of interpretation associated to a given prediction for individual x. The estimated probability of recidivism is about 0,75 (deep blue arrow). The individual attributes (or characteristics) are showed in the y axis. Based on a set of chosen references (here the references are predicted as non recidivist by the model), we compute contributions (Shapley Values) of each attribute related to their influence on the model output. 
-Those contributions have some interesting properties. Indeed, the sum of all contributions equals the difference between the output of the individual x (0,75) and the mean output of references (0,13).
-
-On this example, the fact that this person has commited 6 priors crimes, is African-American and 27 years old, his legal status is Post Sentence, mainly explain why the model has predicted such probability score. The contributions could also be negatives, e.g. his probation custody status influences the model towards a low probability of recividism.
+This picture shows the kind of interpretation produced for a given prediction. We want to understand the model's decision for an individual x. Here, the model estimates that the individual has a 70% probability of reoffending (the blue tick at the top right).
+Attribute importances are computed with respect to one or several references. In this example, we chose only individuals predicted as non-recidivists as references. The mean predicted probability for that group of references is about 14% (green tick at the bottom left).
+Finally, the gap between the individual's prediction and the mean reference prediction is split among the attributes: the sum of all contributions equals that difference.
+We can now see that the age (21 years old) and the ethnicity of individual x increase the estimated probability of recidivism by 46 and 15 percentage points respectively, while the fact that he has never committed any crime decreases it by 9 percentage points.
 
 ## Install
 
@@ -73,12 +73,12 @@ mc_shap = MonteCarloShapley(x=x, fc=fc, ref=reference, n_iter=1000)
 
 ```python
 sgd_est = SGDshapley(d, C=y.max())
-sgd_shap = sgd_est.sgd(x=x, fc=fc, r=reference, n_iter=5000, step=.1, step_type="sqrt")
+sgd_shap = sgd_est.sgd(x=x, fc=fc, ref=reference, n_iter=5000, step=.1, step_type="sqrt")
 ```
 
 ## Code and description
 
-This library is based on [nbdev](http://nbdev.fast.ai/).
+This library is based on [nbdev](http://nbdev.fast.ai/). If you want to modify the library or run the tests, you will need to install nbdev.
 > nbdev is a library that allows you to fully develop a library in Jupyter Notebooks, putting all your code, tests and documentation in one place. That is:you now have a true literate programming environment, as envisioned by Donald Knuth back in 1983!
 
 Codes, descriptions, small examples and tests are all put together in jupyter notebooks in the folder `nbs`.
@@ -109,3 +109,7 @@ Notebook demos are availables in `tutorials` folder.
 ## License
 
 Shapkit is licensed under the terms of the MIT License (see the file LICENSE).
+
+## Main reference
+
+*A Projected SGD algorithm for estimating Shapley Value applied in attribute importance*, S. Grah, V. Thouvenot, CD-MAKE 2020
@@ -0,0 +1,113 @@
+The dataset comes from the public repository hosted at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
+
+==========================================
+Bike Sharing Dataset
+==========================================
+
+Hadi Fanaee-T
+
+Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
+INESC Porto, Campus da FEUP
+Rua Dr. Roberto Frias, 378
+4200 - 465 Porto, Portugal
+
+
+=========================================
+Background 
+=========================================
+
+Bike sharing systems are a new generation of traditional bike rentals where the whole process, from membership to rental and return,
+has become automatic. Through these systems, a user can easily rent a bike at one location and return it
+at another. Currently, there are over 500 bike-sharing programs around the world, comprising
+over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic,
+environmental and health issues.
+
+Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by
+these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration
+of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into
+a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important
+events in the city could be detected by monitoring these data.
+
+=========================================
+Data Set
+=========================================
+The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions,
+precipitation, day of the week, season, hour of the day, etc. can affect rental behavior. The core data set is
+the two-year historical log for the years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is
+publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on an hourly and a daily basis and then
+extracted and added the corresponding weather and seasonal information. Weather information was extracted from http://www.freemeteo.com.
+
+=========================================
+Associated tasks
+=========================================
+
+	- Regression: 
+		Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
+	
+	- Event and Anomaly Detection:  
+		The count of rented bikes is also correlated with events in the town, which are easily traceable via search engines.
+		For instance, a query like "2012-10-30 washington d.c." in Google returns results related to Hurricane Sandy. Some of the important events are
+		identified in [1]. Therefore the data can also be used to validate anomaly or event detection algorithms.
+
+
+=========================================
+Files
+=========================================
+
+	- Readme.txt
+	- hour.csv : bike sharing counts aggregated on an hourly basis. Records: 17379 hours
+	- day.csv : bike sharing counts aggregated on a daily basis. Records: 731 days
+
+	
+=========================================
+Dataset characteristics
+=========================================	
+Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
+	
+	- instant: record index
+	- dteday : date
+	- season : season (1: spring, 2: summer, 3: fall, 4: winter)
+	- yr : year (0: 2011, 1:2012)
+	- mnth : month ( 1 to 12)
+	- hr : hour (0 to 23)
+	- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
+	- weekday : day of the week
+	- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
+	- weathersit :
+		- 1: Clear, Few clouds, Partly cloudy
+		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
+		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
+		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
+	- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
+	- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
+	- hum: Normalized humidity. The values are divided by 100 (max)
+	- windspeed: Normalized wind speed. The values are divided by 67 (max)
+	- casual: count of casual users
+	- registered: count of registered users
+	- cnt: count of total rental bikes including both casual and registered
+	
+=========================================
+License
+=========================================
+Publications that use this dataset must cite the following publication:
+
+[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
+
+@article{
+	year={2013},
+	issn={2192-6352},
+	journal={Progress in Artificial Intelligence},
+	doi={10.1007/s13748-013-0040-3},
+	title={Event labeling combining ensemble detectors and background knowledge},
+	url={http://dx.doi.org/10.1007/s13748-013-0040-3},
+	publisher={Springer Berlin Heidelberg},
+	keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
+	author={Fanaee-T, Hadi and Gama, Joao},
+	pages={1-15}
+}
+
+=========================================
+Contact
+=========================================
+	
+For further information about this dataset please contact Hadi Fanaee-T (hadi.fanaee@fe.up.pt)
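
The field descriptions in dataset/bike/Readme.txt can be sketched in code as follows. This is a minimal illustration and not part of the commit. It assumes the normalization is a plain division by the stated maximum, as the field list says, and it uses a single made-up row in the `day.csv` column order instead of reading the real file. The names `denormalize`, `load_rows` and `SAMPLE_DAY_CSV` are hypothetical.

```python
import csv
import io

# Scaling constants quoted in the Readme's field list.
TEMP_MAX_C = 41.0
ATEMP_MAX_C = 50.0
HUM_MAX = 100.0
WINDSPEED_MAX = 67.0

# One made-up row in the day.csv column order (day.csv has no "hr" column).
# The numbers are illustrative only and are not taken from the dataset.
SAMPLE_DAY_CSV = """\
instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,6,0,2,0.5,0.5,0.5,0.25,300,700,1000
"""


def denormalize(row):
    """Convert one CSV row (a dict of strings) to typed, de-normalized values."""
    return {
        "dteday": row["dteday"],
        "year": 2011 + int(row["yr"]),  # yr is 0 for 2011 and 1 for 2012
        "temp_c": float(row["temp"]) * TEMP_MAX_C,
        "atemp_c": float(row["atemp"]) * ATEMP_MAX_C,
        "hum": float(row["hum"]) * HUM_MAX,
        "windspeed": float(row["windspeed"]) * WINDSPEED_MAX,
        "casual": int(row["casual"]),
        "registered": int(row["registered"]),
        "cnt": int(row["cnt"]),
    }


def load_rows(text):
    """Parse CSV text with a header line and de-normalize every row."""
    return [denormalize(row) for row in csv.DictReader(io.StringIO(text))]


rows = load_rows(SAMPLE_DAY_CSV)
first = rows[0]

# The Readme documents cnt as the total of casual and registered users.
assert first["cnt"] == first["casual"] + first["registered"]
print(first["dteday"], first["temp_c"], first["cnt"])
```

To run the same check on the real data, one could pass the contents of `day.csv` to `load_rows` in place of the sample string; the consistency assertion on `cnt` is then a quick sanity check of the download.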