Commit a57a38f

Author: simon.grah, committed Jul 23, 2020
sync with gitlab repo
1 parent 07a6758 commit a57a38f

37 files changed, +26247 -6258 lines changed
 

README.md (+11 -7)
@@ -20,12 +20,12 @@ The method used is:
2020
Assume that we have trained a machine learning model to predict the probability of recidivism of a given individual. The algorithm is quite effective but it only returns a probability score without any details on how it made its choice.
2121
We would like to know how each attribute (characteristic) influences the model output. Furthermore, contributions explain the difference between the individual prediction and the mean prediction for all references. These references are defined by the user (e.g. for classification, interesting references are selected from other predicted classes).
2222

23-
<img alt="Exporting from nbdev" width="700" caption="On this example, the fact that this person has commited 6 priors crime, is African-American and 27 years old, his legal status is Post Sentence, are mainly explained why the model has predicted such probability score. The contributions could also be negative, e.g. his probation custody status influences the model towards a low probability of recividism." src="nbs/images/shap_readme_illustration.png">
23+
<img alt="Exporting from nbdev" width="1000" caption="In this example, the age (21 years old) and the ethnicity of individual x increase the estimated probability of recidivism by 46 and 15 percentage points respectively. Meanwhile, the fact that he has never committed any crime decreases the probability by 9 percentage points." src="nbs/images/compas_plot.png">
2424

25-
This picture displays the kind of interpretation associated to a given prediction for individual x. The estimated probability of recidivism is about 0,75 (deep blue arrow). The individual attributes (or characteristics) are showed in the y axis. Based on a set of chosen references (here the references are predicted as non recidivist by the model), we compute contributions (Shapley Values) of each attribute related to their influence on the model output.
26-
Those contributions have some interesting properties. Indeed, the sum of all contributions equals the difference between the output of the individual x (0,75) and the mean output of references (0,13).
27-
28-
On this example, the fact that this person has commited 6 priors crimes, is African-American and 27 years old, his legal status is Post Sentence, mainly explain why the model has predicted such probability score. The contributions could also be negatives, e.g. his probation custody status influences the model towards a low probability of recividism.
25+
This picture displays the kind of interpretation associated with a given prediction. We want to understand the model decision for an individual x. In this example, the individual has an estimated 70% probability of reoffending (the blue tick at the top right).
26+
Attribute importances are computed with respect to one or several references. In this example, we chose as references only individuals that the model does not predict as recidivists. The mean predicted probability for that group of references is about 14% (the green tick at the bottom left).
27+
Finally, the gap between the individual prediction and the mean reference prediction is split among the attributes according to their importance. The sum of all contributions equals that difference.
28+
Now we can see that the age (21 years old) and the ethnicity of individual x increase the estimated probability of recidivism by 46 and 15 percentage points respectively. Meanwhile, the fact that he has never committed any crime decreases the probability by 9 percentage points.
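The additivity property described above can be checked on a toy case. The sketch below is not shapkit code: `exact_shapley` and `toy_model` are hypothetical names, the model is made up, and a single reference is used for simplicity (with several references, one natural option would be to average the result over them). It enumerates every coalition, so it is only practical for a handful of attributes.

```python
from itertools import combinations
from math import factorial


def exact_shapley(x, ref, fc):
    """Exact Shapley values of fc at x relative to a single reference ref."""
    d = len(x)

    def value(coalition):
        # Attributes in the coalition take x's value, the others keep ref's.
        z = [x[j] if j in coalition else ref[j] for j in range(d)]
        return fc(z)

    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        total = 0.0
        for size in range(d):
            # Classical Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term and one interaction term.
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] * z[2]


x = [1.0, 1.0, 1.0]
ref = [0.0, 0.0, 0.0]
phi = exact_shapley(x, ref, toy_model)
gap = toy_model(x) - toy_model(ref)
print(phi, sum(phi), gap)
```

In this toy case the additive attribute receives its full effect, and the interaction between the last two attributes is shared equally between them.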
2929

3030
## Install
3131

@@ -73,12 +73,12 @@ mc_shap = MonteCarloShapley(x=x, fc=fc, ref=reference, n_iter=1000)
7373

7474
```python
7575
sgd_est = SGDshapley(d, C=y.max())
76-
sgd_shap = sgd_est.sgd(x=x, fc=fc, r=reference, n_iter=5000, step=.1, step_type="sqrt")
76+
sgd_shap = sgd_est.sgd(x=x, fc=fc, ref=reference, n_iter=5000, step=.1, step_type="sqrt")
7777
```
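To give an idea of what a Monte Carlo estimate of Shapley values involves, here is a minimal sketch based on sampling random attribute orderings. It is an illustration under stated assumptions, not shapkit's implementation: the function `monte_carlo_shapley_sketch` and the toy model are hypothetical, and only the argument names `x`, `fc`, `ref` and `n_iter` mirror the calls shown above.

```python
import random


def monte_carlo_shapley_sketch(x, fc, ref, n_iter=1000, seed=0):
    """Estimate Shapley values of fc at x relative to ref by sampling orderings."""
    rng = random.Random(seed)
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_iter):
        order = list(range(d))
        rng.shuffle(order)
        z = list(ref)          # start from the reference
        prev = fc(z)
        for j in order:
            z[j] = x[j]        # switch attribute j from ref's value to x's value
            cur = fc(z)
            phi[j] += cur - prev   # marginal contribution of j in this ordering
            prev = cur
    return [p / n_iter for p in phi]


def toy_fc(z):
    # Hypothetical reward function standing in for a trained model.
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] * z[2]


x = [1.0, 1.0, 1.0]
reference = [0.0, 0.0, 0.0]
est = monte_carlo_shapley_sketch(x=x, fc=toy_fc, ref=reference, n_iter=2000)
print(est, sum(est))
```

Because the marginal contributions telescope within each ordering, the estimates sum to `fc(x) - fc(ref)` (up to floating-point error) whatever the number of iterations; only the split between attributes is noisy.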
7878

7979
## Code and description
8080

81-
This library is based on [nbdev](http://nbdev.fast.ai/).
81+
This library is based on [nbdev](http://nbdev.fast.ai/). If you want to modify the lib or run tests, you will have to install it.
8282
> nbdev is a library that allows you to fully develop a library in Jupyter Notebooks, putting all your code, tests and documentation in one place. That is: you now have a true literate programming environment, as envisioned by Donald Knuth back in 1983!
8383
8484
Code, descriptions, small examples and tests are all put together in Jupyter notebooks in the `nbs` folder.
@@ -109,3 +109,7 @@ Notebook demos are availables in `tutorials` folder.
109109
## License
110110

111111
Shapkit is licensed under the terms of the MIT License (see the file LICENSE).
112+
113+
## Main reference
114+
115+
*A Projected SGD algorithm for estimating Shapley Value applied in attribute importance*, S. Grah, V. Thouvenot, CD-MAKE 2020

dataset/bike/Readme.txt (+113)
@@ -0,0 +1,113 @@
1+
The dataset comes from the public repository hosted at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
2+
3+
==========================================
4+
Bike Sharing Dataset
5+
==========================================
6+
7+
Hadi Fanaee-T
8+
9+
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
10+
INESC Porto, Campus da FEUP
11+
Rua Dr. Roberto Frias, 378
12+
4200 - 465 Porto, Portugal
13+
14+
15+
=========================================
16+
Background
17+
=========================================
18+
19+
Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return
20+
has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return
21+
it at another position. Currently, there are over 500 bike-sharing programs around the world, composed of
22+
over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic,
23+
environmental and health issues.
24+
25+
Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by
26+
these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration
27+
of travel, departure and arrival position are explicitly recorded in these systems. This feature turns a bike sharing system into
28+
a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of the important
29+
events in the city could be detected by monitoring these data.
30+
31+
=========================================
32+
Data Set
33+
=========================================
34+
The bike-sharing rental process is highly correlated with the environmental and seasonal settings. For instance, weather conditions,
35+
precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to
36+
the two-year historical log corresponding to years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is
37+
publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on an hourly and a daily basis and then
38+
extracted and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.
39+
40+
=========================================
41+
Associated tasks
42+
=========================================
43+
44+
- Regression:
45+
Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
46+
47+
- Event and Anomaly Detection:
48+
The count of rented bikes is also correlated with some events in the town, which are easily traceable via search engines.
49+
For instance, a query like "2012-10-30 washington d.c." in Google returns results related to Hurricane Sandy. Some of the important events are
50+
identified in [1]. Therefore the data can be used for validation of anomaly or event detection algorithms as well.
51+
52+
53+
=========================================
54+
Files
55+
=========================================
56+
57+
- Readme.txt
58+
- hour.csv : bike sharing counts aggregated on hourly basis. Records: 17379 hours
59+
- day.csv : bike sharing counts aggregated on daily basis. Records: 731 days
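The record counts above can be cross-checked against the calendar. The short sketch below only uses the two counts stated in this Readme and the fact that the log covers the years 2011 and 2012; it shows that day.csv has one record per calendar day, while hour.csv is shorter than a complete hourly grid. This Readme does not say why those hourly slots are absent.

```python
from datetime import date

# Number of calendar days in 2011 and 2012 (2012 is a leap year).
n_days = (date(2013, 1, 1) - date(2011, 1, 1)).days

# Size of a complete hourly grid over the same period.
full_hourly_grid = n_days * 24

hour_records = 17379  # record count stated for hour.csv
day_records = 731     # record count stated for day.csv

missing_hours = full_hourly_grid - hour_records
print(n_days, full_hourly_grid, missing_hours)
```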
60+
61+
62+
=========================================
63+
Dataset characteristics
64+
=========================================
65+
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
66+
67+
- instant: record index
68+
- dteday : date
69+
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
70+
- yr : year (0: 2011, 1:2012)
71+
- mnth : month ( 1 to 12)
72+
- hr : hour (0 to 23)
73+
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
74+
- weekday : day of the week
75+
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
76+
+ weathersit :
77+
- 1: Clear, Few clouds, Partly cloudy
78+
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
79+
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
80+
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
81+
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
82+
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
83+
- hum: Normalized humidity. The values are divided by 100 (max)
84+
- windspeed: Normalized wind speed. The values are divided by 67 (max)
85+
- casual: count of casual users
86+
- registered: count of registered users
87+
- cnt: count of total rental bikes including both casual and registered
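As an illustration of the field descriptions above, the following sketch rescales the normalized weather fields and checks the count fields of one record. It assumes the normalization is exactly the division by the stated maximum, as described in this Readme. The helper names (`denormalize`, `counts_are_consistent`) are hypothetical and the sample row is made up for illustration, not taken from the dataset.

```python
# Scale factors as stated in the field descriptions (value = original / max).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(row):
    """Return a copy of row with the normalized fields multiplied back by their max."""
    out = dict(row)
    for name, factor in SCALE.items():
        out[name] = float(row[name]) * factor
    return out


def counts_are_consistent(row):
    """cnt is described as the total of casual and registered users."""
    return int(row["cnt"]) == int(row["casual"]) + int(row["registered"])


# Made-up record, with values as strings like a CSV reader would return them.
row = {
    "temp": "0.5", "atemp": "0.5", "hum": "0.8", "windspeed": "0.25",
    "casual": "3", "registered": "13", "cnt": "16",
}

restored = denormalize(row)
print(restored["temp"], restored["atemp"], restored["hum"], restored["windspeed"])
print(counts_are_consistent(row))
```

With real data, the same functions could be applied to each row returned by `csv.DictReader` on hour.csv or day.csv.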
88+
89+
=========================================
90+
License
91+
=========================================
92+
Publications that use this dataset must cite the following publication:
93+
94+
[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
95+
96+
@article{
97+
year={2013},
98+
issn={2192-6352},
99+
journal={Progress in Artificial Intelligence},
100+
doi={10.1007/s13748-013-0040-3},
101+
title={Event labeling combining ensemble detectors and background knowledge},
102+
url={http://dx.doi.org/10.1007/s13748-013-0040-3},
103+
publisher={Springer Berlin Heidelberg},
104+
keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
105+
author={Fanaee-T, Hadi and Gama, Joao},
106+
pages={1-15}
107+
}
108+
109+
=========================================
110+
Contact
111+
=========================================
112+
113+
For further information about this dataset please contact Hadi Fanaee-T (hadi.fanaee@fe.up.pt)
