-
Notifications
You must be signed in to change notification settings - Fork 10
/
Clean_Data.rtf
141 lines (140 loc) · 6.2 KB
/
Clean_Data.rtf
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
{\rtf1\ansi\ansicpg1252\cocoartf1404\cocoasubrtf130
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;\red38\green38\blue38;}
\margl1440\margr1440\vieww13420\viewh15680\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 DATA CLEANING AND MUNGING FOR ROSSMANN KAGGLE PROJECT\
___________________________________________\
/Users/2015/Dropbox/Data Science/NYC Data Science Academy/class-projects/Rossmann/KaggleProject/KaggleProject/input\
\
DATA DESCRIPTION\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
{\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/c/rossmann-store-sales/data"}}{\fldrslt \cf0 https://www.kaggle.com/c/rossmann-store-sales/data}}\
\
TRAIN\
\'95 Store Use as a Factor\
* Date Extract Day, Month, Year\
\'95\'a0Sales Y\
\'95\'a0Customers - Why doesn\'92t H2O not use customers?\
\'95 Open Delete rows with open = 0\
\'95\'a0Promo OK\
\'95 State Holiday OK to use in R, need to convert to categorical in Python\
\'95\'a0SchoolHoliday OK\
\
\
CHANGE OPEN TO 0 FOR OBVIOUS CASES\
- Change Open to 0 when sales and/or customers = 0\
- Delete rows from Train (only when we are ready to start model building)\
\
ADD DATE FEATURES\
- Extract Day, Month, Year from Date\
- extract day of the year \
- day_of_year\
- weekofyear\
- quarter\
- is_month_start\
- is_month_end\
- is_quarter_start\
- is_quarter_end\
\
- STATE HOLIDAYS\
- 0 is being used as a string and as an integer - CORRECT\
\
PROMO PERIOD\
- Add first day of Promo period\
\
MEAN STORES AND CUSTOMERS (by PROMO too)\
- mean_sales_week\
- mean_sales_month\
- mean_sales_quarter\
- mean_sales_year\
\
- mean_customers_week\
- mean_customers_month\
- mean_customers_quarter\
- mean_customers_year\
\
\
REFURBISHMENT\
- sort by store, date (necessary to get days after/before refurb AND start of Promo)\
- get PromoFirstDate\
- add closed column (necessary for refurb)\
- first day of refurbishment (open = o, so row will be deleted anyway)\
- days from refurbishment\
- days after refurbishment\
\
\
EXTERNAL DATA\
- state in which store is at (states.csv) \
{\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/c/rossmann-store-sales/forums/t/17048/putting-stores-on-the-map/"}}{\fldrslt https://www.kaggle.com/c/rossmann-store-sales/forums/t/17048/putting-stores-on-the-map/}}\
{\field{\*\fldinst{HYPERLINK "https://gist.github.com/gereleth/781a92c97b58b07a11e2#file-putting_stores_on_map-ipynb"}}{\fldrslt https://gist.github.com/gereleth/781a92c97b58b07a11e2#file-putting_stores_on_map-ipynb}}\
- Weather\
{\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/c/rossmann-store-sales/forums/t/17058/weather-at-berlin-us-airport/97075"}}{\fldrslt https://www.kaggle.com/c/rossmann-store-sales/forums/t/17058/weather-at-berlin-us-airport/97075}}\
- Google Trends\
{\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/c/rossmann-store-sales/forums/t/17130/google-trends/97196"}}{\fldrslt https://www.kaggle.com/c/rossmann-store-sales/forums/t/17130/google-trends/97196}}\
- Macroeconomic data\
{\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/c/rossmann-store-sales/forums/t/17111/macro-indicators"}}{\fldrslt https://www.kaggle.com/c/rossmann-store-sales/forums/t/17111/macro-indicators}}\
++++++++++++++++++++++++++++++\
\
TEST\
\'95\'a0store 622 has Nan for Open feature \
- Change these NaNs to 1\
- Do we need to extract date info?\
\
++++++++++++++++++++++++++++++\
\
STORE\
\'95\'a0\expnd0\expndtw0\kerning0
CompetitionDistance has NaNs\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 \'95\'a0\expnd0\expndtw0\kerning0
CompetitionOpenSinceMonth has NaNs\
\kerning1\expnd0\expndtw0 \'95\'a0\expnd0\expndtw0\kerning0
CompetitionOpenSinceYear has NaNs\
\kerning1\expnd0\expndtw0 \'95\'a0\expnd0\expndtw0\kerning0
Promo2SinceWeek has NaNs\
\kerning1\expnd0\expndtw0 \'95\'a0\expnd0\expndtw0\kerning0
Promo2SinceYear has NaNs\
\kerning1\expnd0\expndtw0 \'95\'a0\expnd0\expndtw0\kerning0
PromoInterval has NaNs\
\
\
- \cf2 CompetitionOpenSinceYear, CompetitionOpenSinceMonth and \'931\'94 were combined to create a \'93CompetitionOpenDate\'94 variable???\
- {\field{\*\fldinst{HYPERLINK "https://www.kaggle.com/amhchiu/rossmann-store-sales/more-exploratory-data-analysis/notebook"}}{\fldrslt https://www.kaggle.com/amhchiu/rossmann-store-sales/more-exploratory-data-analysis/notebook}}\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \
NaNs IN STORE\
- What to do with NaNs?\
- {\field{\*\fldinst{HYPERLINK "http://stackoverflow.com/questions/27824954/how-to-handle-missing-nans-for-machine-learning-in-python"}}{\fldrslt http://stackoverflow.com/questions/27824954/how-to-handle-missing-nans-for-machine-learning-in-python}}\
- {\field{\*\fldinst{HYPERLINK "https://www.quora.com/What-are-the-best-ways-to-account-for-missing-data-in-machine-learning"}}{\fldrslt https://www.quora.com/What-are-the-best-ways-to-account-for-missing-data-in-machine-learning}}\
\
\
STORE TYPE\
\
- Store type "b" seem to (almost) always be open on holidays, unlike the other types\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 ++++++++++++++++++++++++++++++\
\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
\
JOIN\
- join train + store\
- join test + store\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \kerning1\expnd0\expndtw0 \
- join train_store_states\
- join test_store_states\
++++++++++++++++++++++++++++++\
\
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\cf0 \expnd0\expndtw0\kerning0
\
EXPORT CSV\
\
- Use log(Sales) instead of Sales\
- Use Store as factor\
\
}