#!/usr/bin/env python
# coding: utf-8
# Exploratory data analysis of Haberman's Cancer Survival dataset.
# Written in Python as part of an ML project.
# In[6]:
#import all necessary modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# In[2]:
#Reading the given csv dataset.
haberman=pd.read_csv('/home/harshraj/datasets_haberman.csv')
# In[3]:
# Total Data-points and Features
print(haberman.shape)
# In[5]:
#Total Columns in haberman datasets
print(haberman.columns)
# In[20]:
# 2-D scatter plot.
# Always check the axis labels and scales before reading a plot.
sns.set_style("whitegrid")
haberman.plot(kind='scatter', x='Age', y='axil_nodes')
plt.show()
# In[77]:
# Using seaborn (sns): colour each point by Surv_status and plot
# Age against axil_nodes. The legend maps each colour to a survival status.
# Newer seaborn releases take `height` instead of the old `size` argument.
sns.set_style("whitegrid")
(sns.FacetGrid(haberman, hue="Surv_status", height=4)
    .map(plt.scatter, 'axil_nodes', 'Age')
    .add_legend())
plt.show()
# # 3D Scatter Plot
#
# A 3D scatter plot needs a lot of mouse interaction to interpret,
# so a pair plot of every pair of features is drawn here instead.
#
# In[60]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue='Surv_status', height=3)
plt.show()
# # Observations
# 1. Within any region of a plot, the hue separates the points by survival status,
# which gives a rough idea of the probability that a patient there survives or dies.
# 2. We can draw lines and write if-else conditions to build a simple model that classifies patients by survival status.
# 3. The diagonal plots are the PDF (probability density function) of each feature.
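# A minimal sketch of the if-else idea in observation 2. The function name
# predict_surv_status and the cut-off of 3 nodes are hypothetical, picked only
# for illustration and not fitted to the data.
def predict_surv_status(axil_nodes, max_nodes=3):
    """Predict Surv_status from the node count: 1 if at most max_nodes, else 2."""
    if axil_nodes <= max_nodes:
        return 1
    return 2
# Possible usage with the haberman frame loaded above:
# predicted = haberman['axil_nodes'].apply(predict_surv_status)
# print((predicted == haberman['Surv_status']).mean())  # fraction classified correctly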
# # Histogram, CDF, PDF
# Plotting 1D scatter plots
# In[67]:
# Split the data by survival status (1 = Yes, 2 = No).
haberman_Surv_status_Yes = haberman.loc[haberman["Surv_status"] == 1]
haberman_Surv_status_No = haberman.loc[haberman["Surv_status"] == 2]
# The 'o' format draws one marker per patient instead of a connecting line.
plt.plot(haberman_Surv_status_Yes["axil_nodes"], np.zeros_like(haberman_Surv_status_Yes["axil_nodes"]), 'o')
plt.plot(haberman_Surv_status_No["axil_nodes"], np.zeros_like(haberman_Surv_status_No["axil_nodes"]), 'o')
plt.show()
# In[130]:
(sns.FacetGrid(haberman, hue='Surv_status', height=5)
    .map(sns.distplot, 'Op_Year')
    .add_legend())
plt.show()
# # Observations:
#
# Patients with zero or one positive node are more likely to survive. Survival is very unlikely with 25 or more nodes.
#
# # Cumulative Distribution Function(CDF)
#
#
# The Cumulative Distribution Function (CDF) is the probability that the variable takes a value less than or equal to x.
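# The definition above can also be evaluated directly on the raw values,
# without binning. A minimal sketch: empirical_cdf is a hypothetical helper
# added for illustration, not part of the original analysis.
import numpy as np

def empirical_cdf(values, x):
    """Return the fraction of observations in `values` that are <= x."""
    values = np.asarray(values)
    return float(np.mean(values <= x))
# Possible usage with the haberman frame loaded above:
# print(empirical_cdf(haberman['axil_nodes'], 5))  # share of patients with at most 5 nodes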
# In[116]:
# For each class: build a 10-bin histogram of axil_nodes, normalise the bins so
# they sum to 1, then take the running sum of the bin probabilities as the CDF.
counts1, bin_edges1 = np.histogram(haberman_Surv_status_Yes['axil_nodes'], bins=10, density=True)
pdf1 = counts1 / counts1.sum()
print(pdf1)
print(bin_edges1)
cdf1 = np.cumsum(pdf1)
plt.plot(bin_edges1[1:], pdf1, label='PDF (Yes)')
plt.plot(bin_edges1[1:], cdf1, label='CDF (Yes)')
print("***********************************************************")
counts2, bin_edges2 = np.histogram(haberman_Surv_status_No['axil_nodes'], bins=10, density=True)
pdf2 = counts2 / counts2.sum()
print(pdf2)
print(bin_edges2)
cdf2 = np.cumsum(pdf2)
plt.plot(bin_edges2[1:], pdf2, label='PDF (No)')
plt.plot(bin_edges2[1:], cdf2, label='CDF (No)')
plt.xlabel('axil_nodes')
plt.legend()
plt.show()
# # Box Plots and Violin Plots
#
# The box extends from the lower to the upper quartile of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Outliers are the points beyond the ends of the whiskers.
#
# A violin plot combines a box plot with the probability density function (PDF).
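# The quantities a box plot draws can be computed by hand. A minimal sketch:
# box_stats is a hypothetical helper, and it assumes the common 1.5 * IQR
# whisker rule (the default `whis` in matplotlib and seaborn box plots).
import numpy as np

def box_stats(values, whis=1.5):
    """Return quartiles, whisker ends and outliers for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        'q1': float(q1), 'median': float(median), 'q3': float(q3),
        # Whiskers end at the most extreme points that lie within the fences.
        'whisker_low': float(inside.min()), 'whisker_high': float(inside.max()),
        'outliers': outliers.tolist(),
    }
# Possible usage with the haberman frame loaded above:
# print(box_stats(haberman['axil_nodes']))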
# In[123]:
sns.boxplot(x='Surv_status',y='Age',data=haberman)
plt.show()
sns.boxplot(x='Surv_status',y='Op_Year',data=haberman)
plt.show()
sns.boxplot(x='Surv_status',y='axil_nodes',data=haberman)
plt.show()
# In[126]:
# violinplot has no `height` argument, so it is omitted here.
sns.violinplot(x='Surv_status', y='Age', data=haberman)
plt.show()
sns.violinplot(x='Surv_status', y='Op_Year', data=haberman)
plt.show()
sns.violinplot(x='Surv_status', y='axil_nodes', data=haberman)
plt.show()
# # Observations:
#
# Patients with more than one node are less likely to survive. The more nodes, the lower the chance of survival.
#
# A large percentage of patients who survived had 0 nodes. Yet a small percentage of patients with no positive axillary nodes died within 5 years of the operation, so the absence of positive axillary nodes does not guarantee survival.
#
# Comparatively more of the patients operated on in 1965 did not survive for more than 5 years.
#
# Comparatively more of the patients who did not survive were in the 45 to 65 age group. Patient age alone is not an important parameter in determining survival.
#
# The box plots and violin plots for the age and year features give similar results, with substantial overlap between the two classes.
#
# The overlap in the box plot and the violin plot of nodes is smaller than for the other features, but it still exists, so it is difficult to set a threshold that separates the two classes of patients.
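# The link between node count and survival can also be checked numerically.
# A minimal sketch: survival_rate_by_nodes and its bucket edges are
# hypothetical, chosen only for illustration. It assumes the column names
# used above and that Surv_status == 1 means the patient survived.
import pandas as pd

def survival_rate_by_nodes(df, bins=(-1, 0, 3, 10, 100)):
    """Return the share of survivors (Surv_status == 1) per axil_nodes bucket."""
    buckets = pd.cut(df['axil_nodes'], bins=list(bins))
    survived = df['Surv_status'] == 1
    # The mean of a boolean column is the fraction of True values in each bucket.
    return survived.groupby(buckets, observed=False).mean()
# Possible usage with the haberman frame loaded above:
# print(survival_rate_by_nodes(haberman))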
# # Bi-Variate analysis
#
# # Scatter Plots
#
# A scatter plot is a two-dimensional data visualization that uses dots to represent the values of two variables, one plotted along the x-axis and the other along the y-axis.
# In[129]:
sns.set_style('whitegrid')
(sns.FacetGrid(haberman, hue='Surv_status', height=6)
    .map(plt.scatter, 'Age', 'Op_Year')
    .add_legend())
plt.show()
# # Observation:
#
# Patients with 0 nodes are more likely to survive, irrespective of their age.
# Hardly any patients have more than 25 nodes.
# Patients older than 50 with more than 10 nodes are less likely to survive.
# # Pair Plots
#
# By default, this function will create a grid of Axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
# In[131]:
sns.set_style('whitegrid')
sns.pairplot(haberman, hue='Surv_status', height = 5)
plt.show()
# # Observations:
#
# The plot of operation year against nodes separates the two classes comparatively better than the other pairs.
# # Multivariate analysis
# # Contour Plot
#
# A contour line or isoline of a function of two variables is a curve along which the function has a constant value. It is a cross-section of the three-dimensional graph.
#
#
# In[133]:
sns.jointplot(x = 'Op_Year', y = 'Age', data = haberman, kind = 'kde')
plt.show()
# # Observation:
#
# From 1960 to 1964, more operations were performed on patients in the 45 to 55 age group.
# # Conclusions:
#
# A patient's age and operation year alone are not deciding factors for survival. Even so, patients younger than 35 have a better chance of survival.
# The chance of survival decreases as the number of positive axillary nodes increases. We also saw that the absence of positive axillary nodes does not guarantee survival.
# Classifying the survival status of a new patient from the given features is difficult because the data is imbalanced.
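# The imbalance mentioned above can be quantified as the share of each class.
# A minimal sketch: class_proportions is a hypothetical helper added for
# illustration, not part of the original analysis.
from collections import Counter

def class_proportions(labels):
    """Return a dict mapping each class label to its share of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
# Possible usage with the haberman frame loaded above:
# print(class_proportions(haberman['Surv_status']))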