-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject4_4.Rmd
1422 lines (1133 loc) · 57.6 KB
/
project4_4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output:
html_document: default
pdf_document: default
---
# Prosper Loan Dataset Analysis by Gisela Chen
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(GGally)
```
```{r echo=FALSE}
# set up global chunk functions
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
### Goal: explore factors that affect BorrowerAPR.
```{r}
# Load the Data
prosper_loan <- read.csv('prosperLoanData.csv')
```
This data set consists of 113937 loan listings with 81 variables describing
each loan. To limit the initial EDA to only 10-15 variables, I will focus on
analyzing the characteristics of the borrowers and find out what are the factors
that might affect the APR of the loan.
```{r}
# Assemble a working dataset of 15 variables plus loan ids and date created
working_prosper_loan <- prosper_loan[ , c('ListingNumber',
'ListingCreationDate', 'Term',
'LoanStatus', 'BorrowerAPR',
'ListingCategory..numeric.',
'BorrowerState', 'Occupation',
'EmploymentStatus',
'EmploymentStatusDuration',
'CreditScoreRangeLower',
'CreditScoreRangeUpper',
'DebtToIncomeRatio',
'StatedMonthlyIncome',
'LoanOriginalAmount')]
```
# Univariate Plots Section
First, I will look into the dataframe structure.
```{r}
str(working_prosper_loan)
```
To better reflect the nature of the data, I will change the data type of some
columns from int to factor (ListingNumber and Term), convert ListingCreationDate
from factor to date object, then change the ListingCategory from numeric to
factor to make the interpretation more intuitive.
```{r change_data_type}
# change ListingNumber and Term to factor
working_prosper_loan$ListingNumber <- factor(working_prosper_loan$ListingNumber)
working_prosper_loan$Term <- factor(working_prosper_loan$Term)
# convert ListingCreationDate from factor to date object
working_prosper_loan$ListingCreationDate <- as.Date(as.character(working_prosper_loan$ListingCreationDate),
format = "%Y-%m-%d")
# replace ListingCaterogy..numeric. with actual listing category names and
# change the name of the column to ListingCategory
working_prosper_loan$ListingCategory..numeric. <- factor(working_prosper_loan$ListingCategory..numeric.)
levels(working_prosper_loan$ListingCategory..numeric.) <- list(
NotAvailable = '0', DebtConsolidation = '1', HomeImprovement = '2',
Business = '3', PersonalLoan = '4', StudentUse = '5', Auto = '6',
Other = '7', BabyAndAdoption = '8', Boat = '9', CosmeticProcedure = '10',
EngagementRing = '11', GreenLoans = '12', HouseholdExpenses = '13',
LargePurchases = '14', MedicalOrDental = '15', Motorcycle = '16', RV = '17',
Taxes = '18', Vacation = '19', WeddingLoans = '20')
colnames(working_prosper_loan)[6] <- 'ListingCategory'
```
Now let's look at the summary of the working dataset!
```{r}
summary(working_prosper_loan)
```
It appeared odd to me that there were duplications in the listing number. The
listing number should be unique since it represents each individual loan
listings so I will take a closer look at some of them.
```{r}
subset(working_prosper_loan, ListingNumber == '951186')
```
After checking the top three listings (shown above is the first example), it
seems that they are indeed duplications of the same listing so I will keep only
the unique loan listings which add up to 113066 observations.
```{r}
working_prosper_loan <- unique(working_prosper_loan)
```
Now let's look at individual variables.
```{r}
ggplot(working_prosper_loan, aes(x = ListingCreationDate, y = ..count..)) +
geom_histogram() +
scale_x_date(date_breaks = "6 month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```
```{r}
summary(working_prosper_loan$ListingCreationDate)
```
The listings range from November 2005 to March 2014. There is a gap at year 2009
in which there were no listings at all, splitting the data into two parts. There
is a sharp increase in the number of listings in 2013.
```{r}
working_prosper_loan %>%
group_by(Term) %>% # group by Term
summarize(counts = n()) %>% # calculate the counts
arrange(-counts) %>% # sort by counts descendently
mutate(Term = factor(Term, Term)) %>% # reset factor level
ggplot(aes(x = Term, y = counts)) + # plot
geom_bar(stat = "identity")
```
```{r}
summary(working_prosper_loan$Term)
```
The majority of the loans is of 36-month term (87224) followed by 60-month term
(24228) and 12-month term (1614), indicating that the loan dataset is comprised
of short term loans.
```{r}
working_prosper_loan %>%
group_by(LoanStatus) %>%
summarise(counts = n()) %>%
arrange(-counts) %>%
mutate(LoanStatus = factor(LoanStatus, LoanStatus)) %>%
ggplot(aes(x = LoanStatus, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# simplify the levels by grouping all problematic loans into one level for
# future analysis
levels(working_prosper_loan$LoanStatus) <- list("Completed" =
c("Completed",
"FinalPaymentInProgress"),
"Current" = "Current",
"Bad_loan" = c("Cancelled",
"Chargedoff",
"Defaulted",
"Past Due (>120 days)",
"Past Due (1-15 days)",
"Past Due (16-30 days)",
"Past Due (31-60 days)",
"Past Due (61-90 days)",
"Past Due (91-120 days)"))
```
```{r}
working_prosper_loan %>%
group_by(LoanStatus) %>%
summarise(counts = n()) %>%
arrange(-counts) %>%
mutate(LoanStatus = factor(LoanStatus, LoanStatus)) %>%
ggplot(aes(x = LoanStatus, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The majority of the loans are either current or completed with remaining loans
that are mainly charged off, defaulted or past due (Top plot). I decided to
create a simplified column with only three levels: "Current", "Completed" and
"Bad_loan" (Bottom plot).
```{r}
working_prosper_loan %>%
group_by(ListingCategory) %>%
summarise(counts = n()) %>%
arrange(-counts) %>%
mutate(ListingCategory = factor(ListingCategory, ListingCategory)) %>%
ggplot(aes(x = ListingCategory, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The most common reason for taking out a loan in the dataset is debt
consolidation. There is also a fraction of loans that does not provide a
concrete reason (Not available or other). Home improvement and business are also
main listing categories in this dataset.
```{r}
working_prosper_loan %>%
group_by(BorrowerState) %>%
summarise(counts = n()) %>%
arrange(-counts) %>%
mutate(BorrowerState = factor(BorrowerState, BorrowerState)) %>%
ggplot(aes(x = BorrowerState, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```
It looks like California has the most of the loan listings followed by Texas,
Florida and New York.
```{r}
working_prosper_loan %>%
group_by(Occupation) %>%
summarize(counts = n()) %>%
arrange(-counts) %>%
mutate(Occupation = factor(Occupation, Occupation)) %>%
ggplot(aes(x = Occupation, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
axis.text.y = element_text(angle = 60))
```
This is a bit of a busy plot at the x-axis but it nicely demonstrates all
flavors of occupation the borrowers had during loan application. Although the
highest level is "other" and does not provide too much information, we can see
that the top three occupations are professionals, computers programmers and
executives.
```{r}
working_prosper_loan %>%
group_by(EmploymentStatus) %>%
summarize(counts = n()) %>%
arrange(-counts) %>%
mutate(EmploymentStatus = factor(EmploymentStatus, EmploymentStatus)) %>%
ggplot(aes(x = EmploymentStatus, y = counts)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
I am quite confused with the classification of this category. More than half of
the borrowers are employed, although from the choices (levels) of the employment
status it is not clear what is the difference between "employed" vs. "full-time"
or "Part-time' so the proportion might be different depending on the
interpretation. Overall most borrowers are employed.
I suspect that the choices in this category was changed at some point so I
decide to look at the time distribution of the EmploymentStatus categories
(see the chart below).
```{r}
#ggplot(data = working_prosper_loan, aes(x = ListingCreationDate, y = ..count..)) +
# geom_bar() +
# scale_x_date(date_breaks = "6 month")
ggplot(working_prosper_loan, aes(x = ListingCreationDate, y = ..count..)) +
geom_histogram() +
scale_x_date(date_breaks = "6 month") +
facet_wrap(~EmploymentStatus) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```
Unfortunately, the result is still confusing to me. For example, "Employed" and
"Full-time" are two overlapping classifications to me and I was hoping to find
that the distribution of the counts for these two classifications would be
distinct (ie. one is collect at a period that is not overlapped to the other).
However, it seemed that in the dataset both of them are used to classify loans
that are collected at the same time and I have no further distinction on how
the choice was made.
Next I will look at distribution of the numeric columns starting with APR.
```{r}
ggplot(data = working_prosper_loan, aes(BorrowerAPR)) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous(limits = c(0, 0.45), breaks = seq(0, 0.45, 0.05)) +
geom_vline(xintercept = median(working_prosper_loan$BorrowerAPR,
na.rm = TRUE), color = 'red') +
geom_vline(xintercept = mean(working_prosper_loan$BorrowerAPR,
na.rm = TRUE), color = 'green')
```
```{r}
summary(working_prosper_loan$BorrowerAPR)
```
```{r}
# check listings without borrowerAPRs
check <- subset(working_prosper_loan, is.na(BorrowerAPR))
# remove listings without BorrowerAPR from the working dataset
working_prosper_loan <- subset(working_prosper_loan, !is.na(BorrowerAPR))
```
The APR histogram appears bimodal and has a broad, normally distributed first
peak at 0.2%, near the median(red) and mean(green), and a sharp second peak at
0.36%.
There are 25 listings that are from the very beginning of the dataset that don't
have APR information. I will remove those listings since my focus is on
characterizing the APR.
```{r}
# plot the original data
esd1 <- ggplot(data = working_prosper_loan,
aes(x = EmploymentStatusDuration)) +
geom_histogram()
# plot log transformed data
esd2 <- ggplot(data = working_prosper_loan,
aes(x = EmploymentStatusDuration + 1)) +
geom_histogram() + scale_x_log10()
# plot square rooted data
esd3 <- ggplot(data = working_prosper_loan,
aes(x = EmploymentStatusDuration)) +
geom_histogram() + scale_x_sqrt()
grid.arrange(esd1, esd2, esd3, ncol = 1)
```
```{r}
summary(working_prosper_loan$EmploymentStatusDuration)
```
Employment status duration for the borrowers range from 0 month to 755 months.
The original distribution was positively skewed with a long tail to the right
(Top plot). Transformation by taking log10 seems to result in a slightly
negatively skewed distribution (Middle plot). Transformation of the data by
square rooting seems to reduce the longtailness (Bottom plot).
```{r}
smi1 <- ggplot(data = working_prosper_loan, aes(x = StatedMonthlyIncome)) +
geom_histogram() +
scale_x_continuous(limits = c(0, 30000), breaks = seq(0, 30000, 5000))
smi2 <- ggplot(data = working_prosper_loan, aes(x = StatedMonthlyIncome + 1)) +
geom_histogram() + scale_x_log10()
smi3 <- ggplot(data = working_prosper_loan, aes(x = StatedMonthlyIncome)) +
geom_histogram() + scale_x_sqrt()
grid.arrange(smi1, smi2, smi3, ncol = 1)
```
```{r summary_of_StatedMonthlyIncome}
summary(working_prosper_loan$StatedMonthlyIncome)
```
```{r summary_of_log10_StatedMonthlyIncome}
summary(log10(working_prosper_loan$StatedMonthlyIncome + 1))
```
```{r summary_of_sqrt_StatedMonthlyIncome}
summary(sqrt(working_prosper_loan$StatedMonthlyIncome))
```
StatedMonthlyIncome also shows positive skewness (first summary) but can be
tranformed to normal distribution by taking log10 (second summary) or square
root (third summary).
I am puzzled by the large number of maximum StatedMonthlyIncome so decided to
take a closer look.
```{r StatedMonthlyIncome_greater_than_500000}
StatedMonthlyIncome_large <- subset(working_prosper_loan,
StatedMonthlyIncome > 500000)
StatedMonthlyIncome_large
```
It appears that for a StatedMonthlyIncome greater than 500000 there are two
loan lisitngs for businesses. The StatedMonthlyIncome might be large because it
is representing income for the whole business and not individuals. These are
likely true outliers.
```{r compare_CreditScoreRangeLower_vs_CreditScoreRangeUpper}
ggplot(data = working_prosper_loan) +
geom_line(aes(CreditScoreRangeLower), stat = 'count', color = 'blue') +
geom_line(aes(CreditScoreRangeUpper), stat = 'count', color = 'red') +
xlab('Credit Score') +
scale_x_continuous( breaks = seq(0, 900, 100))
```
This plot shows both the upper credit score in red and the lower credit score in
blue. The trends are very similar so I will only keep the upper range scores.
```{r delete_CreditScoreRangeLower}
working_prosper_loan <- subset(working_prosper_loan,
select = -c(CreditScoreRangeLower))
```
```{r histogram_CreditScoreRangeUpper}
ggplot(data = working_prosper_loan, aes(CreditScoreRangeUpper)) + geom_histogram()
```
```{r histogram_CreditScoreRangeUpper_binwidth20}
ggplot(working_prosper_loan, aes(CreditScoreRangeUpper)) +
geom_histogram(binwidth = 20) +
scale_x_continuous(limits = c(400, 900), breaks = seq(400, 900, 50))
```
```{r summary_CreditScoreRangeUpper}
summary(working_prosper_loan$CreditScoreRangeUpper)
```
```{r summary_Credit_scoreRangeUpper_greater_than_400}
with(subset(working_prosper_loan, CreditScoreRangeUpper > 400),
summary(CreditScoreRangeUpper))
```
There are a few outliers with very low credit scores that negatively skewed the
data slightly. After removing those with lower values (less than 400), the
variable seemed normally distributed. I would think that the lower credit score
is consistent with the fact that many of these loans were targeted for debt
consolidation. I would be curious to find out if there is a correlation
between low credit score and bad loans though.
```{r histogram_DebtToIncomeRatio}
ggplot(data = working_prosper_loan, aes(x = DebtToIncomeRatio)) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1))
```
```{r histogram_DebtToIncomeRatio_normal_log10_sqrt}
dtir1 <- ggplot(data = working_prosper_loan, aes(DebtToIncomeRatio)) +
geom_histogram()
dtir2 <- ggplot(data = working_prosper_loan, aes(DebtToIncomeRatio + 1)) +
geom_histogram() + scale_x_log10()
dtir3 <- ggplot(data = working_prosper_loan, aes(DebtToIncomeRatio)) +
geom_histogram() + scale_x_sqrt()
grid.arrange(dtir1, dtir2, dtir3, ncol = 1)
```
```{r summary_DebtToIncomeRatio}
summary(working_prosper_loan$DebtToIncomeRatio)
```
```{r summary_DebtToIncomeRatio_log10}
summary(log10(working_prosper_loan$DebtToIncomeRatio + 1))
```
```{r summary_DebtToIncomeRatio_sqrt}
summary(sqrt(working_prosper_loan$DebtToIncomeRatio))
```
While most of the DTI are at 0.2, we are seeing outliers with big numbers of
DTI. They are likely to be real because of the nature of our dataset so I will
keep them. Data transformation can be used to normalize the data.
```{r histogram_LoanOriginalAmount_normal_log10_sqrt}
loa1 <- ggplot(data = working_prosper_loan, aes(LoanOriginalAmount)) +
geom_histogram()
loa2 <- ggplot(data = working_prosper_loan, aes(LoanOriginalAmount)) +
geom_histogram() + scale_x_log10()
loa3 <- ggplot(data = working_prosper_loan, aes(LoanOriginalAmount)) +
geom_histogram() + scale_x_sqrt()
grid.arrange(loa1, loa2, loa3, ncol = 1)
```
```{r summary_LoanOriginalAmount_normal}
summary(working_prosper_loan$LoanOriginalAmount)
```
```{r summary_LoanOriginalAmount_log10}
summary(log10(working_prosper_loan$LoanOriginalAmount + 1))
```
```{r summary_loanOriginalAmount_sqrt}
summary(sqrt(working_prosper_loan$LoanOriginalAmount))
```
The distribution of loan original amount appear "spiky" with specific peaks
presumably representing the amount for a specific type of loan. However, It
looks to me that the original data appeared positively skewed and log
transformation appear to help normalizing the dataset.
# Univariate Analysis
### What is the structure of your dataset?
#### Size
The original dataset contains 113937 rows representing loan listings with 81
variables describing each listing. I selected 15 variables for the initial EDA.
I found that there were duplicated loan listings and deleted the duplications.
The final working dataset contains 113066 observations of 14 variables (see last
question for reason to delete one variable).
#### Variables
The variables can be grouped by their data types as follows:
Date: ListingCreationDate \
Factor: ListingNumber, Term, LoanStatus, ListingCategory, BorrowerState,
Occupation, EmploymentStatus \
Integer: EmploymentStatusDuration, CreditScoreRangeLower, CreditScoreRangeUpper,
LoanOriginalAmount \
Number: BorrowerAPR, DebtToIncomeRatio, StatedMonthlyIncome
The distribution of some numeric variables appeared to be skewed but can be
transformed to approach normal distribution.
#### Content
The loans in this dataset are short term loans of 12 months, 36 months or 60
months durations, ranging from 2005 to 2014. The amount of original loan range
from 1,000 to 35,000. The top reason for taking out the loan is debt
consolidation. Most of the borrowers are from California. Professionals and
programmers came on top of the types of borrowers' occupations.
The APR distribution appear bimodal with a broad first peak that appear normally
distributed and a second sharp peak at 0.36% (located at the right of the first
peak near the end of the distribution).
### What is/are the main feature(s) of interest in your dataset?
The main feature of interst is the APR. I am interested in finding factors that
affects the APR.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
I think that high credit score and low debt to income ratio would help bring
down the APR. Employment stability and income can also be an important factor.
### Did you create any new variables from existing variables in the dataset?
I converted Listing Category from numeric to factor type by mapping the number
representing a category to the actual name of the listing category so that it
would be more intuitive to interpret the outcome.
I also simplified the loan status variable to group all loans in question into
"bad loans".
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
I initially included two variables, UpperCreditScore and LowerCreditScore,
regarding credit scores. After finding that the pattern are almost identical I
have decided to include just one, resulting in 14 variables total in the working
dataset.
I changed the data type for ListingNumber, ListingCreationDate and Term because
it fits the nature of the data better. I also changed the levels in
ListingCategory and LoanStatus to simplify the analysis.
I found duplicated listings that I have deleted to make sure that all listings
were unique.
Overall the dataset appear to contain some NA or " " levels in categorical
variables and some outliers in numerical levels. At this point I have decided to
keep them because they appear to be true data points.
# Bivariate Plots Section
Here I will focus on the following questions:
I. Which numeric variable has a correlation with APR? \
II. How does APR change over time and other variables? \
III. What are the characteristics of the current, completed and bad loans?
## I. Which numeric variable has a correlation with APR?
To reduce computing time for initial analysis I created a smaller data set with
only numeric variables and randomly selected ten thousand loan listings to
generate correlations plots of those numeric variables against each other using
'''ggpairs'''. I also generated new numeric columns with modified values (
'''log10''' or '''sqrt''') to normalized some of the variables.
```{r create_ggpair_dataset}
# Generate a data set with only numeric columns for ggpair analysis
working_prosper_loan_numeric <- subset(working_prosper_loan,
select = c(BorrowerAPR,
EmploymentStatusDuration,
CreditScoreRangeUpper,
DebtToIncomeRatio,
StatedMonthlyIncome,
LoanOriginalAmount))
# Create additional columns with transformed variables
working_prosper_loan_numeric$sqrt_ESD <- sqrt(working_prosper_loan_numeric$EmploymentStatusDuration)
working_prosper_loan_numeric$log10_SMI <-
log10(working_prosper_loan_numeric$StatedMonthlyIncome + 1)
working_prosper_loan_numeric$sqrt_DTIR <-
sqrt(working_prosper_loan_numeric$DebtToIncomeRatio)
summary(working_prosper_loan_numeric)
```
```{r ggpair_analysis}
set.seed(38)
# randomly select 10000 samples for initial analysis
ggpairs(working_prosper_loan_numeric
[sample.int(nrow(working_prosper_loan_numeric), 10000), ])
```
BorrowerAPR have some small but significant negative correlation with
CreditScoreRangeUpper (-0.442) and LoanOriginalAmount (-0.325) but not with EmploymentStatusDuration, DebtToIncomeRatio or StatedMonthlyIncome.
Additionally, LoanOriginalAmount has small but significant positive correlation
with CreditScoreRangeUpper(0.344) and StatedMonthlyIncome(0.38).
None of the modified variables showed a significant improvement in correlation.
```{r, scatter_BorrowerAPR_CreditScoreRangeUpper}
ggplot(working_prosper_loan, aes(BorrowerAPR, CreditScoreRangeUpper)) +
geom_point(alpha = 1/10, size = 1/10, position = "jitter") +
geom_smooth(method = "lm", color = "orange")
```
```{r, pearson_coef_BorrowerAPR_CreditScoreRangeUpper}
# compute the pearson's coefficient for the entire data set
cor.test(working_prosper_loan$BorrowerAPR,
working_prosper_loan$CreditScoreRangeUpper)
```
Here we can see the negatively trend between high credit score and borrower APR
with a pearson's coefficient of -0.429 for the entire dataset.
```{r, scatter_BorrowerAPR_LoanOriginalAmount}
ggplot(data = working_prosper_loan, aes(BorrowerAPR, LoanOriginalAmount)) +
geom_point(alpha = 1/20, size = 1/10, position = "jitter") +
geom_smooth(method = "lm", color = "orange")
```
```{r pearson_coef_BorrowerAPR_LoanOriginalAmount}
cor.test(working_prosper_loan$BorrowerAPR,
working_prosper_loan$LoanOriginalAmount)
```
We see that higher loan amounts tend to correlate with lower APR. This is
specifically true when the loan amount is large. The discrete horizontal lines
are due to the fact that loan amounts are given out at a specific number. The
pearson's coefficient for the entire dataset is -0.322.
The two major factors that correlate with APR in this dataset are
CreditScoreRangeUpper and LoanOriginalAmount, which makes these two variables
top candidates for building a model. I would also like to point out that
StatedMonthlyIncome might also have a very mild effect on BorrwerAPR but the
coefficient was very low at -0.165. I was surprised to find that
DebtToIncomeRatio was not correlated with BorrowerAPR. I suspect that this
factor is ignored because many of the loans are used for debt consolidation.
## II. How does APR change over time and other variables?
```{r, scatter_ListingCreationDate_BorrowerAPR}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/10, size = 1/10) +
scale_x_date(date_breaks = "6 month") +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Generally a line plot would be the best to visualize trends over time. However,
because we are looking at individual loan listings, a scatter plot turned out
to show the APR trend more clearly. We see nicely the upper bound and lower
bound of APR over time. For example, during the first few months of data
collection, the APR range from 0.03% to slightly over 0.5%. On the other hand,
from 2011 to 2014, the lower bound is around 0.06% and the upper bound is around
0.36%. The extreme high and low rates also appear to be rates prior to 2009.
The number of loan listings at a specific time is also highlighted by the
different shades of grey, with darker dots representing higher number and vice
versa.
The flactuation of rates at different times poses an inconsistency in which a
high APR at one time could be only moderate in another time. The difference in
the upper and lower bound of the APR could explain why the correlation between
APR and other variables were not stronger.
```{r, pearson_coef_BorrowerAPR_subsetListingCreationDate}
# create listings between May 2006 to Nov. 2008
ListingCreationDateSubset <- subset(working_prosper_loan, ListingCreationDate >
"2006-05-01"
& ListingCreationDate < "2008-11-1")
cor.test(ListingCreationDateSubset$BorrowerAPR,
ListingCreationDateSubset$CreditScoreRangeUpper)
```
When I limit the dates to a period of time with similar upper and lower bound
in APR (May 2006 to Nov. 2008), I see an increase in correlation coefficient
to -0.625.
Another trend that I observed was that between Mar. 2011 and Jul. 2012,
BorrowerAPR were offered at very fixed rates, indicated by the blank spaces
flanked by dark dotted lines. This trend is very different than other times
during the data collection.
The magenta dotted line marks the 0.36% APR which represented the second sharp
peak observed in the BorrowerAPR histogram in the previous section. There is a
period from Dec. 2010 to Dec. 2012 where 0.36% was offered extensively that
could account for that peak.
Next I will look into how the APR trend is affected by other categorical
variables.
```{r box_Term_BorrowerAPR}
ggplot(working_prosper_loan, aes(x = Term, y = BorrowerAPR)) +
geom_boxplot() +
stat_summary(fun.y = mean, color = "green", geom = "point")
```
```{r, summary_Term_BorrowerAPR}
working_prosper_loan %>%
filter(!is.na(BorrowerAPR)) %>%
group_by(Term) %>%
summarise(mean = mean(BorrowerAPR), median = median(BorrowerAPR), n = n())
```
Here we see that 12-month term has slightly higher median than the other two
terms but 36-month term has the higest mean (green dots). Overall the stats are
very similar, suggessting that terms don't affect the APR much.
```{r, scatter_ListingCreationDate_BorrowerAPR_Term}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/15, size = 1/2) +
facet_wrap(~Term) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2)
```
12-month and 60 month loans were launched only after end of 2011. 12-month loans
discontinued after 2013.
```{r box_LoanStatus_BorrowerAPR}
ggplot(working_prosper_loan, aes(LoanStatus, BorrowerAPR)) +
geom_boxplot() +
stat_summary(fun.y = mean, color = "green", geom = "point")
```
```{r summary_LoanStatus_BorrowerAPR}
working_prosper_loan %>%
filter(!is.na(BorrowerAPR)) %>%
group_by(LoanStatus) %>%
summarise(mean = mean(BorrowerAPR), median = median(BorrowerAPR), n = n())
```
Here we see that listings in the Bad_loan category has higher APR than the
other two categories. One possible explanation for this difference is that the
bad loans were issued only during the time when the rate was higher.
Alternatively, it could also be that loans with higher APR tend to end up being
bad loans. So I will look into how these loan types were distributed over time.
```{r, scatter_ListinCreationDate_BorrowerAPR_LoanStatus}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/15, size = 1/2) +
facet_wrap(~LoanStatus) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2)
```
Bad loans ranged from the beginning to the end of the data collection time.
There were also more loan listings with lower APR in the completed loans than
bad loans, suggesting a correlation between high APR and bad loans.
```{r box_ListingCategory_BorrowerAPR}
ggplot(subset(working_prosper_loan, !is.na(BorrowerAPR)),
aes(x = reorder(ListingCategory, BorrowerAPR, FUN = median),
y = BorrowerAPR)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
geom_hline(yintercept = with(subset(working_prosper_loan,!is.na(BorrowerAPR)),
median(BorrowerAPR)),
color = "red", linetype = 2)
```
```{r summary_ListingCategory_BorrowerAPR}
working_prosper_loan %>%
filter(!is.na(BorrowerAPR)) %>%
group_by(ListingCategory) %>%
summarise(median = median(BorrowerAPR), n = n()) %>%
arrange(median)
```
BorrowerAPR differed depending on the ListingCategory. Personal loan and Not
available are the two categories that has the lowest APR, much less than the
overall APR median (red dashed line); whereas Household expenses and Cosmetic
procedures are the two categories with the highest APRs, way above the overall
median.
```{r scatter_ListingCreationDate_BorrowerAPR_ListingCategory, fig.height=12}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/20) +
facet_wrap(~ListingCategory) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2)
```
When we split the time analysis data into listing categories, we see the
following:
1. Loans in personal loan and student use category were once offered but
discontinued in the middle of the data collection period.
2. Loans in debt consolidation, home improvement, business, auto and other
category were offered consistently throughout the data collection period
3. There were a series of new loans offered after 2012, among those the more
popular ones include Medical or dental, wedding loans, household expenses, etc.
The medians of these different types of loans vary drastically.
4. Most of the Not available listings were collected before 2008 and all others
were collected only after 2008, suggesting that this variable was not collected
until after 2008.
Combining the two plots together we can also find out that the rates before 2008
have lower top and bottom limits, which might explain why the Not available
listing has lower median. Similarly, there were more personal loans with low APR
given out, resulting in a lower median.
```{r box_BorrowerState_BorrowerAPR}
ggplot(subset(working_prosper_loan, !is.na(BorrowerAPR)),
aes(x = reorder(BorrowerState, BorrowerAPR, FUN = median),
BorrowerAPR)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
geom_hline(yintercept = with(subset(working_prosper_loan,
!is.na(BorrowerAPR)),
median(BorrowerAPR)),
color = "red",
linetype = 2)
```
Maine and Iowa have the lowest APRs (with outliers) and Arkansas and Alabama
have the highest APRs. At first I thought it was due to the state usury limit,
but after checking [this article](http://www.lectlaw.com/files/ban02.htm) I
didn't find a correlation. For example, the legal rate of interest for ME, AR
and AL are all 6%. Overall, the BorrowerAPR varies from state to state.
```{r scatter_LCDate_BorrwerAPR_BorrowerState, fig.height= 15, fig.width=12}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/5, size = 1) +
facet_wrap(~BorrowerState) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2)
```
When we split the time series data into borrower states we can see the
following,
1. Loans offered to IA, ME and ND were discontinued after 2009. In contrast,
loans offered to SD only started after 2009.
2. Early discontinuation of loans in ND and IA could count for the missing peak
at 0.36% in the APR histogram in previous section.
3. There seems to be an increase in issueing many more loans after 2012 in most
of the states.
4. The NA listings seem to exist only prior to 2008, similar to
Listing Category.
5. Before 2008, BorrowerAPR seemed drastically different from state to state
(eg.CA vs CO). However, after 2008 the high and low limits of BorrowerAPR are
very similar.
When we look at this plot we can also see individual reasons for why the median
BorrowerAPR is different for each state. For example, the low rates of Maine and
Iowa are due to the fact that the loans were issued at a time when the rates
were low and discontinued when the rates went up.
```{r box_Occupation_BorrowerAPR}
ggplot(subset(working_prosper_loan, !is.na(BorrowerAPR)),
aes(x = reorder(Occupation, BorrowerAPR, FUN = median),
y = BorrowerAPR)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
geom_hline(yintercept = with(subset(working_prosper_loan,
!is.na(BorrowerAPR)),
median(BorrowerAPR)),
color = "red", linetype = 2)
```
Judge and Doctor receive the lowest APR. Nurse's Aide and Teacher's Aide
receive the highest APR. Again we see that BorrowerAPR varies among different
occupations adn seems to be related to the socialeconomic status of the
borrower.
```{r scatter_LCDate_BorrowerAPR_Occupation, fig.height= 15, fig.width=12}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/5, size = 1) +
facet_wrap(~Occupation) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2)
```
The data collection for Occupation seems quite completed with most of the levels
covering from 2006 to 2014. There are two groups of NAs at the beginning
(2006 - 2008) and end (2014) of the data collection time.
```{r box_EmploymentSatus_BorrowerAPR}
ggplot(data = subset(working_prosper_loan, !is.na(BorrowerAPR)),
aes(x = reorder(EmploymentStatus, BorrowerAPR, FUN = median),
y = BorrowerAPR)) +
geom_boxplot() +
geom_hline(yintercept = with(subset(working_prosper_loan,
!is.na(BorrowerAPR)),
median(BorrowerAPR)),
color = "red", linetype = 2)
```
Part-time and Full-time status have lowest APRs whereas Other and Not employed
have highest APRs.
```{r}
ggplot(working_prosper_loan, aes(ListingCreationDate, BorrowerAPR)) +
geom_point(alpha = 1/5, size = 1) +
facet_wrap(~EmploymentStatus) +
geom_hline(yintercept = 0.36, color = "magenta", linetype = 2) +
geom_hline(yintercept = 0.1, color = "pink", linetype = 2)
```
Here we see again that the definition of levels in this category is confusing.
There is almost no data for "Employed" before 2010, which was probably
classified as "Full-time" based on the density (shades of gray) of the loan
listings. There is also decreased number of loans in "part-time" and "retired"
after 2011 and it's not clear if the company decided not to offer loans to
these categories or were they included in "others", another category that began
to accumulate data only after mid-2010.
One thing that I noticed is that listings after 2009, with borrowers that are
not employed, consistently had higher APRs on the lower range, set arbitrarily
at 0.1% (pink dotted line) when compared to the other levels in this variable.
Additionally, there seems to be an increase of issueing loans to self-
employed borrowers after 2011.
In this section we look into how loan listing were offered regionally, over
time, and to what type of borrowers. We learn that BorrowerAPR varies from level
to level in ListingCategory, BorrowerState, Occupation, LoanStatus and
EmploymentStatus but remains similar in Term. Furthermore, we learn that data
points in Occupation, Term and LoanStatus are collected consistently throughout
the data collection period but not in BorrowerState, ListingCategory and
EmploymentStatus, resulting in more missing values in the later variables.
## III. Characteristics of bad loans
In previous section we see that BorrowerAPR is higher in the Bad_loan level so
I want to look further into other variables that show a difference to futher
characterize the bad loan category.
```{r box_LoanStatus_CreditScoreRangeUpper}
ggplot(working_prosper_loan, aes(x = LoanStatus, y = CreditScoreRangeUpper)) +
geom_boxplot()
```
```{r summary_LoanStatus_CreditScoreRangeUpper}
working_prosper_loan %>%
filter(!is.na(CreditScoreRangeUpper)) %>%
group_by(LoanStatus) %>%
summarise(median = median(CreditScoreRangeUpper), mean = mean(CreditScoreRangeUpper, n = n()))
```
Both mean and median is lower for Bad_loans than for Completed and Current
loans.
```{r scatter_ListingCreationDate_CreditScoreRangeUpper_LoanStatus}
ggplot(working_prosper_loan, aes(ListingCreationDate, CreditScoreRangeUpper)) +
geom_point(size = 0.05, position = "jitter", alpha = 1/10) +
facet_wrap(~LoanStatus)
```
Here we can see that the low credit scores are from between 2006 to 2007.
```{r box_LoanStatus_SubsetCreditScoreRangeUpper}
ggplot(working_prosper_loan, aes(LoanStatus, CreditScoreRangeUpper)) +
geom_boxplot() +
coord_cartesian(ylim = c(500, 800))
```
A zoom-in of CreditScoreRange shows that Current loans has the highest median of
credit scores followed by Completed then Bad_loan.
```{r box_LoanStatus_DebtToIncomeRatio}
ggplot(working_prosper_loan, aes(LoanStatus, DebtToIncomeRatio)) +
geom_boxplot()
```
We know from previous session that the distribution is very positively skewed.
It's hard to see the bulk of the data because of the outliers so I will zoom-in
first.