---
title: "Prosper Loan Data Analysis"
author: "Shelly Sousa"
date: "11/23/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(tidy.opts=list(width.cutoff = 80), tidy=TRUE, echo=TRUE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Import/Install required libraries for the project
library(knitr)
library(ggplot2)
library(dplyr)
library(ggrepel)
library(GGally)
library(scales)
library(car)
library(devtools)
```
```{r echo=FALSE, data_load}
# Load the csv
loanData <- read.csv('prosperLoanData.csv')
```
## Table of Contents
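<ul>
<li>
Introduction
</li>
<li>
Univariate Plots
</li>
<li>
Univariate Analysis
</li>
<li>
Bivariate Plots
</li>
<li>
Bivariate Analysis
</li>
<li>
Multivariate Plots
</li>
<li>
Multivariate Analysis
</li>
<li>
Final Plots and Summary
</li>
<li>
Reflection
</li>
</ul>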
# Introduction
This analysis explores information from the Prosper Loan dataset. Udacity's description states that the provided CSV file contains data for 113,937 loans, with 81 attributes recorded for each loan.
According to their [website](https://www.prosper.com/about "website"): "Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than \$19 billion in loans to more than 1,160,000 people." Prosper provides its partners with API access for the development of loan and investment software clients.
I selected this dataset because it is relevant to my career. Recently, I accepted a position as a software engineer at a financial services organization. Their specialty is lending. Exploring the data in this project may improve my general understanding of lending practices and the borrowers who use my applications.
### Project Description
The goal of the project is to analyze loan attributes and gain insights about the various loan characteristics. For this analysis, I will focus on a subset of the attributes and attempt to answer the following questions:
<ul>
<li>
What borrower characteristics are of interest?
</li>
<li>
How does a borrower's occupation or income affect their Prosper Rating?
</li>
<li>
Is there any correlation between other borrower and loan attributes?
</li>
</ul>
# Univariate Plots
First, let's take a look at the data and evaluate single attributes of interest.
<br>
##### Number of Rows
```{r echo=FALSE, loan_summary}
nrow(loanData)
```
<br>
##### Number of Columns
```{r echo=FALSE}
ncol(loanData)
```
<br>
##### List of Column Names
```{r echo=FALSE}
names(loanData)
```
<br>
```{r echo=FALSE}
# There are two credit score bounds for each loan, lower and upper.
# Create a single average credit score column to reference in this analysis.
loanData$CreditScoreAverage <- (loanData$CreditScoreRangeLower +
  loanData$CreditScoreRangeUpper) / 2
# Keep only the rows that meet the criteria for standard credit scores (see analysis and sources).
# subset() also drops rows where the average is NA.
# This attribute is used in most of the plots, so the rows are removed once here.
loanData <- subset(loanData, CreditScoreAverage >= 300 & CreditScoreAverage <= 850)
# StatedMonthlyIncome is stored with 6 decimal places. Precise income values are not needed for this analysis.
loanData$StatedMonthlyIncome <- round(loanData$StatedMonthlyIncome, digits = 0)
```
```{r echo=FALSE, data_drop}
loanData <- select(loanData,-c(1:4, 7:15, 17, 23:37, 40:48, 51:81))
# Dropping the columns that are not needed
```
<br>
Borrower information such as credit score and stated monthly income is of interest to me. I prepared the data for the analysis with the following actions:
<ul>
<li>
A new variable, CreditScoreAverage, was calculated from the upper and lower credit scores for each loan.
</li>
<li>
CreditScoreAverage was limited to the standard FICO range, 300-850.
</li>
<li>
StatedMonthlyIncome was modified by rounding the values to the nearest whole number.
</li>
</ul>
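
The three preparation steps above can be sketched on a small, made-up data frame. The column names match the Prosper dataset, but the four rows are hypothetical values chosen only to exercise each step, so this is an illustration rather than output from the real data.

```r
# Hypothetical toy rows; only the column names come from the Prosper dataset.
toy <- data.frame(
  CreditScoreRangeLower = c(0, 640, 700, 880),
  CreditScoreRangeUpper = c(19, 659, 719, 899),
  StatedMonthlyIncome   = c(2083.333333, 4500.666667, 3000.25, 6125.5)
)

# Step 1: average the lower and upper bounds of each credit score range.
toy$CreditScoreAverage <- (toy$CreditScoreRangeLower +
  toy$CreditScoreRangeUpper) / 2

# Step 2: keep only averages inside the standard 300-850 range.
toy <- subset(toy, CreditScoreAverage >= 300 & CreditScoreAverage <= 850)

# Step 3: round stated monthly income to the nearest whole number.
toy$StatedMonthlyIncome <- round(toy$StatedMonthlyIncome, digits = 0)

print(toy)
```

Only the two middle rows survive the filter, because the first and last averages (9.5 and 889.5) fall outside the 300-850 range.
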
Let's drop the columns that we do not need for this exploration and review the summary again.
<br>
##### Number of Rows
```{r echo=FALSE}
nrow(loanData)
```
<br>
##### New Number of Columns
```{r echo=FALSE}
ncol(loanData)
```
<br>
##### New List of Column Names
```{r echo=FALSE}
names(loanData)
```
<br> Much better. Now we can observe the structure of the modified dataset and review the data types and values.
<br>
### Data Summary & Structure
<br>
#### Loan Data Summary
```{r echo=FALSE}
summary(loanData)
```
<br>
#### Loan Data Structure
```{r echo=FALSE}
str(loanData)
```
<br> We have enough information to proceed.
Due to the simplicity of individual variable plots, each selected loan attribute will be presented in a stream-of-consciousness exploration. Detailed commentary is provided in the Univariate Analysis section.
Let's plot and explore the data!
<br>
### Individual Variables
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Attributes}
ggplot(data=loanData, aes(ProsperScore)) +
geom_bar() +
labs(title="Prosper Scores", x="Prosper Score", y="Loans")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data=loanData, aes(CreditScoreAverage)) +
geom_histogram(bins=30) +
labs(title="Average Credit Scores", x="Credit Score", y="Loans")
# Credit scores range from 300-850. The data was already subset to that range,
# so the x axis does not need a separate limit here.
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height = 10, fig.width = 6}
loanData %>%
group_by(BorrowerState) %>%
summarise(count = n()) %>%
ggplot(aes(y = reorder(BorrowerState,(count)), x = count)) +
geom_bar(stat = 'identity') +
labs(title = "Number of Loans by Borrower State", x = "Loans", y = "State")
```
<br> <br>
#### Number of Unique Occupations
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Unique (Distinct) Occupations sorted alphabetically
n_distinct(loanData$Occupation)
occUnique <- unique(sort(loanData$Occupation))
print(occUnique)
```
<br>
```{r echo=FALSE, fig.height = 10, fig.width = 8}
loanData %>%
group_by(Occupation) %>%
summarise(count = n()) %>%
ggplot(aes(y = reorder(Occupation,(count)), x = count)) +
geom_bar(stat = 'identity') +
labs(title = "Frequency of Occupations", x = "Loans", y = "Occupation")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
group_by(Occupation) %>%
filter(Occupation != 'Other') %>%
summarise(count = n()) %>%
top_n(5) %>%
ggplot(aes(y = reorder(Occupation,(count)), x = count)) +
geom_bar(stat = 'identity') +
labs(title = "Top 5 Occupations (excluding Other)", x = "Loans", y = "Occupation")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
group_by(EmploymentStatus) %>%
summarise(count = n()) %>%
ggplot(aes(y = reorder(EmploymentStatus,(count)), x = count)) +
geom_bar(stat = 'identity') +
labs(title = "Employment Statuses", y = "Employment Status", x = "Loans")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data=loanData, aes(EmploymentStatusDuration)) +
geom_bar() +
labs(title = "Employment Durations", x = "Duration (Months)", y = "Loans")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data=loanData, aes(DelinquenciesLast7Years)) +
geom_bar() +
labs(title = "Delinquencies Last 7 Years",x = "Delinquencies", y = "Loans") +
xlim(-1, 40)
# After reviewing the initial output and summary, I decided to limit the x axis. The mean for delinquencies is 4.
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanDataPR10 <- subset(loanData, !(PublicRecordsLast10Years > 8))
# Removing a small number of outliers. The mean for public records is 0.3126. Most borrowers do not have public records.
ggplot(data = loanDataPR10, aes(PublicRecordsLast10Years)) +
geom_bar() +
labs(title = "Public Records Last 10 Years", x = "Public Records", y = "Loans") +
xlim(-1, 8)
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(Term)) +
geom_bar() +
labs(title = "Term Lengths", x = "Term", y = "Loans")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data=loanData, aes(IsBorrowerHomeowner)) +
geom_bar() +
labs(title="Borrowers who are Current Homeowners", x="Homeowner", y="Loans")
```
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data=loanData, aes(IncomeVerifiable)) +
geom_bar() +
labs(title="Borrowers with Verifiable Income", x="Verifiable Income", y="Loans")
```
<br> <br>
# Univariate Analysis
<br>
### What is the structure of your dataset?
There are 113,937 loans in the dataset with 81 unique characteristics. Of these, 13 loan attributes are used in this analysis. The following data ranges (low to high) are relevant:
Prosper Scores: 1-11
Average Credit Scores: 9.5-889.5 (limited to 300-850)
<br>
#### Details:
The most common Prosper Scores are 4, 6, and 7.
Most borrower credit scores average somewhere between 690 and 730.
A significant number of borrowers in this dataset live in California.
There are 68 unique borrower occupations. 28,617 loans list the occupation as "Other", and 3,588 have a null occupation.
Most borrowers are currently employed, with a median employment duration of 67 months.
Most borrowers do not have public records that impact their credit.
A small number of borrower delinquencies in the last 7 years is common (the mean is 4).
The median loan term is 36 months.
Approximately half of the borrowers in the dataset are current homeowners.
The majority of loans in the dataset have verifiable sources of income.
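
Occupation counts like the ones above can be computed with a few base R calls. This is a minimal sketch on a hypothetical vector. It assumes that missing occupations are stored as empty strings or `NA`, which may not match how the CSV was actually read.

```r
# Hypothetical occupation values standing in for loanData$Occupation.
occupation <- c("Other", "Professional", "", "Teacher", "Other", NA, "Professional")

# Number of distinct values, counting "" and NA as values of their own.
n_unique <- length(unique(occupation))

# Loans whose occupation is recorded as "Other" (NA entries are skipped).
n_other <- sum(occupation == "Other", na.rm = TRUE)

# Loans with no occupation recorded (assumed here to be "" or NA).
n_missing <- sum(is.na(occupation) | occupation == "")

c(unique = n_unique, other = n_other, missing = n_missing)
```
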
### What is/are the main feature(s) of interest in your dataset?
For my analysis, the main features of interest are those that tell us more about the borrowers.
What is their occupation?
What is their credit score?
Is there a correlation between these characteristics and the Prosper Score?
Here are the primary characteristics:
ProsperScore:
A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
CreditScoreAverage:
The average of the lower and upper values of the borrower's credit score range.
EmploymentStatus:
The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration:
The length in months of the employment status at the time the listing was created.
Occupation:
The Occupation selected by the Borrower at the time they created the listing.
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The following secondary characteristics will be used to further explore the borrower and loan data:
BorrowerState:
The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
DelinquenciesLast7Years:
Number of delinquencies in the past 7 years at the time the credit profile was pulled.
IncomeVerifiable:
The borrower indicated they have the required documentation to support their income.
IsBorrowerHomeowner:
A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
LoanStatus:
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
PublicRecordsLast10Years:
Number of public records in the past 10 years at the time the credit profile was pulled.
StatedMonthlyIncome:
The monthly income the borrower stated at the time the listing was created.
Term:
The length of the loan expressed in months.
### Did you create any new variables from existing variables in the dataset?
Yes. I created an average credit score variable, CreditScoreAverage, by adding the Lower and Upper Credit Score values for each loan and dividing by two.
### Of the features you investigated, were there any unusual distributions?
It is highly unusual for any US citizen to attain a double-digit number of public records. The max number of public records for a borrower in this dataset is 38. That is a questionable outlier.
There is a minimum value of 9.5 for a credit score in the dataset. According to American Express, the two most commonly used credit scoring models, FICO and VantageScore, both rank credit scores on a scale from 300 to 850. I limited the dataset and excluded rows with scores outside of the 300-850 range. This is a small number of rows (less than 1% of the total number of rows), so their removal will not significantly impact the analysis.
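
The share of rows removed by the range filter can be checked directly. This is a minimal sketch on a hypothetical vector of average scores. The values are made up, so the percentage it prints says nothing about the real dataset.

```r
# Hypothetical average credit scores; NA stands for a loan with no score.
credit_avg <- c(9.5, 649.5, 709.5, 729.5, 889.5, NA, 689.5, 769.5)

# TRUE for rows that the 300-850 filter would remove.
# subset() drops rows where the condition is NA, so NA counts as removed.
removed <- is.na(credit_avg) | credit_avg < 300 | credit_avg > 850

# Share of rows removed, as a percentage of all rows.
pct_removed <- 100 * mean(removed)
pct_removed
```

Running the same three lines on `loanData$CreditScoreAverage` before the `subset()` call would give the percentage for the full dataset.
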
I noticed a large number of rows with loan data from California residents. Exploring data from specific borrower states could skew the analysis so I will not continue to explore it.
### Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Yes. In addition to the CreditScoreAverage subset, I dropped the columns that were not needed for my exploration. This reduced the amount of data that R needs to process for each function. I also rounded all values for StatedMonthlyIncome to the nearest whole number.
<br>
# Bivariate Plots
In this section, we will evaluate two characteristics in each plot.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width = 10, Bivariate_Plots}
boxplot(data = loanData, ProsperScore ~ CreditScoreAverage, main = "Credit Score by Prosper Score", xlab="Average Credit Score", ylab="Prosper Score")
# I am using a simple boxplot with R for this comparison
```
<br> We can see that Prosper Scores are better for borrowers with higher average credit scores. I will use the Average Credit Score to explore more characteristics in the dataset.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width = 10}
ggplot(data = loanData, aes(x = CreditScoreAverage, fill = EmploymentStatus)) +
geom_histogram(binwidth = 500, aes(y = ..density..), color = "black") +
facet_wrap(~ EmploymentStatus, scale = "free") +
scale_y_continuous(breaks = c(3,6,9)) +
geom_density(color = "black", lwd = 0.5, alpha = 0.5) +
labs(title = "Credit Score by Employment Status", x = "Average Credit Score") +
scale_fill_hue(l = 40, c = 150) +
xlim(300, 850)
# ggplot gave me the most flexibility for this plot. Facet wrap creates a series of plots that are easy to analyze.
```
<br> Employment does not guarantee that the borrower will have a high credit score. For example, part-time employees are more likely to have low credit scores. Surprisingly, Retirees (who are not likely to receive employment income) appear to have a credit score advantage.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(EmploymentStatus, fill = StatedMonthlyIncome)) +
geom_bar() +
labs(title = "Employment Statuses by Stated Monthly Income", x = "Employment Status", y = "Loans") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Added ggplot theme to rotate text 45 degrees and adjust the text so it does not overlap with the graph
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(ProsperScore, fill = StatedMonthlyIncome)) +
geom_bar()+
labs(title = "Prosper Scores by Stated Monthly Income", x = "Prosper Score", y = "Loans")
# Basic ggplot bar plot with a Monthly Income fill.
```
<br> After reviewing the employment status plots, I became curious about the relationship between stated income and employment statuses.
Another surprise. Retirees do not report a monthly income. They may have other financial resources like retirement funds or pensions but that does not seem to be reflected in the monthly income values.
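
One way to check a statement like this numerically is to summarize income within each employment status. This is a minimal sketch on hypothetical rows. The column names come from the dataset, but the values are made up and do not reflect the real distribution.

```r
# Hypothetical rows; only the column names come from the Prosper dataset.
toy <- data.frame(
  EmploymentStatus    = c("Employed", "Employed", "Retired", "Retired", "Part-time"),
  StatedMonthlyIncome = c(5000, 7000, 0, 2500, 1800)
)

# Median stated monthly income for each employment status.
income_by_status <- aggregate(StatedMonthlyIncome ~ EmploymentStatus,
  data = toy, FUN = median
)
income_by_status
```

The median is used here rather than the mean so that a few very large stated incomes do not dominate the summary.
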
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(x = CreditScoreAverage, y = StatedMonthlyIncome)) +
geom_point(alpha = 0.4, position=position_jitter(height = .5, width = .5)) +
labs(title = "Credit Score by Monthly Income", x = "Average Credit Score", y = "Stated Monthly Income (USD)") +
xlim(300, 850) +
ylim(0, 100000)
# There are very few stated monthly incomes above $100,000 in this dataset. The outliers skew the visualization so I decided to limit the y axis and exclude them.
```
<br> There appears to be a relationship between low incomes and low credit scores. The lowest average credit scores in the dataset are associated with borrowers who report incomes below \$25000 USD annually.
High incomes appear throughout the middle of the credit score range. Incomes are slightly higher in the 800-850 credit score range, but they are not as high as I anticipated. I will explore this further in the Multivariate Analysis.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(CreditScoreAverage, EmploymentStatusDuration, color=CreditScoreAverage)) +
geom_point(alpha = 0.01, position=position_jitter(height=.5, width=.5)) +
labs(title="Credit Score by Employment Duration", x = "Average Credit Score", y = "Employment Duration (Months)") +
xlim(300, 850)
# Using a point plot with jitter to blur the values slightly and increase visibility of potential relationships
```
<br> Most borrowers cluster near the mean average credit score regardless of employment duration.
There is a weak relationship between borrowers with a shorter employment duration and lower credit scores. A slightly stronger relationship exists for borrowers with longer employment durations and higher credit scores.
These findings align with the results in the prior plots.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width = 10}
ggplot(data = loanData, aes(x = CreditScoreAverage, fill = LoanStatus)) +
geom_histogram(binwidth = 500, aes( y=..density..), color = "black") +
facet_wrap(~ LoanStatus, scale = "free") +
scale_y_continuous(breaks = c(3,6,9)) +
geom_density(color = "black", lwd = 0.5, alpha = 0.7) +
labs(title="Credit Score by Loan Status", x="Average Credit Score") +
scale_fill_hue(l = 40, c = 150) +
xlim(300, 850)
# Another ggplot histogram with facets to show the credit scores for each loan status category. I am using scaled breaks on the y axis to expand the fill area. This hue was also selected for improved accessibility.
```
<br> Without more information it is difficult to make an inference. Did past due payments and defaulted loans cause the low credit score? Or did borrowers who already had low credit scores go on to miss payments and default?
We can clearly identify a greater occurrence of lower credit scores for borrowers of loans which were canceled or defaulted. Credit scores for Current loans vary but they are close to the median.
Completed loans and loans in their final payment stage are more likely to have some scores that are above the median. This may indicate that successful completion of a full loan term shows that a borrower is less likely to pose a risk to lenders in the future.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(Term, fill = LoanStatus)) +
geom_bar() +
labs(title = "Loan Statuses by Term", x = "Term (Months)", y = "Loans") +
scale_fill_manual(values = c('Defaulted' = 'orange'))
# Defaulted loans are highlighted with orange for better visibility.
```
<br> Are borrowers more likely to default on loans with longer terms? Nope!
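
The question above is about rates, while a stacked bar chart shows counts. The following is a minimal sketch of how the share of defaulted loans within each term could be computed. The rows are hypothetical, so the shares say nothing about the real data.

```r
# Hypothetical rows; only the column names come from the Prosper dataset.
toy <- data.frame(
  Term       = c(12, 36, 36, 36, 36, 60, 60, 60),
  LoanStatus = c(
    "Completed", "Completed", "Defaulted", "Current",
    "Current", "Current", "Defaulted", "Current"
  )
)

# Row-wise proportions: each row is a term, each column a loan status.
status_share <- prop.table(table(toy$Term, toy$LoanStatus), margin = 1)

# Share of defaulted loans within each term.
default_share <- status_share[, "Defaulted"]
default_share
```

Comparing shares rather than counts keeps a term with many loans from looking riskier only because its bar is taller.
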
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanData, aes(EmploymentStatusDuration, DelinquenciesLast7Years, color = EmploymentStatusDuration)) +
geom_point(alpha = 0.02, position = position_jitter(height = .5, width = .5)) +
labs(title = "Delinquencies by Employment Duration",
x = "Duration (Months)", y = "Delinquencies")
# Showing the full range in this plot
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
boxplot(data = loanData, ProsperScore ~ DelinquenciesLast7Years, main = "Delinquencies by Prosper Score", xlab = "Delinquencies", ylab = "Prosper Score", xlim = c(0, 15))
# Displaying a second graph for comparison purposes and limiting the plot
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(data = loanDataPR10, aes(EmploymentStatusDuration, PublicRecordsLast10Years, color = EmploymentStatusDuration)) +
geom_point(alpha = 0.04, position = position_jitter(height = .5, width = .5)) +
labs(title = "Public Records by Employment Duration",
x = "Duration (Months)", y = "Public Records")
# I increased the alpha setting to better highlight the points.
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
boxplot(data = loanData, ProsperScore ~ PublicRecordsLast10Years, main = "Public Records by Prosper Score", xlab = "Public Records", ylab = "Prosper Score")
# Displaying a second graph for comparison purposes.
```
<br> I decided to display two plots for each to observe the differences between two types of visualizations for the same data.
These plots show that borrowers with high Prosper Scores have few, if any, delinquencies or public records. We can infer that these are two factors which contribute to higher Prosper scores.
It is odd that although 1 or 2 public records decrease the borrower's Prosper score, this is not a consistent trend.
<br> <br>
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The relationships between unique borrower characteristics, Prosper scores, and credit scores can be difficult to identify. It is often easier to determine the factors which contribute to a low credit score than a high credit score.
The creditworthiness of a borrower is calculated using multiple factors. Although they are important, income and employment are not the most significant factors. Borrowers with varying occupations and salaries can achieve a high Prosper score.
### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Yes. I was surprised to see that Retirees are more likely to have better credit scores although they do not receive a regular income from employment.
### What was the strongest relationship you found?
The strongest relationship exists between the Prosper Score and the credit score. Higher credit scores are preferred by Prosper lenders. Borrowers with high credit scores are less likely to default on their loans, which makes them a less risky investment for a lender.
<br>
# Multivariate Plots
In this section, we will evaluate multiple characteristics and analyze complex relationships.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots}
ggcorr(loanData, label = TRUE, label_size = 3, hjust = 0.8, size = 2.5, color = "black", layout.exp = 2) +
labs(title = "Prosper Loan Correlation Matrix")
# This uses the GGally library
```
<br> This plot shows the positive, negative, and neutral correlations between the attributes that I selected. The output confirms the discoveries about Prosper Scores and credit scores that were made in the Bivariate Plots section. There is a very strong correlation between Prosper scores and credit scores.
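
The numbers behind a correlation matrix can also be inspected with base R. The sketch below runs on hypothetical rows. It assumes pairwise-complete Pearson correlations, which to my understanding is what `ggcorr` uses by default, but the GGally documentation is the place to confirm that.

```r
# Hypothetical rows; only the column names come from the Prosper dataset.
toy <- data.frame(
  CreditScoreAverage      = c(649.5, 689.5, 709.5, 749.5, 789.5, 669.5),
  ProsperScore            = c(3, 5, 6, NA, 9, 4),
  DelinquenciesLast7Years = c(6, 2, 1, 0, 0, 4)
)

# Pearson correlation for every pair of columns. Each pair uses only the
# rows where both values are present, so the NA in ProsperScore is skipped
# for pairs that involve ProsperScore and kept for the others.
cor_matrix <- cor(toy, use = "pairwise.complete.obs", method = "pearson")
round(cor_matrix, 2)
```
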
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggpairs(loanData, diag = list(continuous = "density"), columns = c("CreditScoreAverage","DelinquenciesLast7Years","PublicRecordsLast10Years"), columnLabels = c("Credit Score", "Delinquencies", "Public Records"), axisLabels = "show") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Delinquency and Public Record Impact on Credit Score")
# Creating additional Credit Score plots to highlight the results from the correlation matrix
```
<br> <br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggpairs(loanData, diag = list(continuous = "density"), columns = c("ProsperScore","DelinquenciesLast7Years","PublicRecordsLast10Years"), columnLabels = c("Prosper Score", "Delinquencies", "Public Records"), axisLabels = "show") +
labs(title = "Delinquency and Public Record Impact on Prosper Score")
# Creating additional Prosper Score plots to highlight the results from the correlation matrix
```
<br> These plots use the ggpairs function. I wanted to closely examine the Delinquency and Public Record correlations. The plots validate the output of the correlation matrix and provide visualizations of the data.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(loanData, aes(CreditScoreAverage, StatedMonthlyIncome, color = factor(ProsperScore))) +
scale_x_continuous(limits = c(400, 850)) +
scale_y_continuous(limits = c(0, 100000)) +
geom_point(alpha = 0.3, position = "jitter") +
theme_minimal() +
theme(legend.title=element_blank()) +
labs(title = "Credit Scores, Monthly Income, and Prosper Scores", x = "Credit Score", y = "Stated Monthly Income")
# Limiting the y axis to improve the scaling. There are few outliers above 100,000
```
<br> <br>
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(loanData, aes(CreditScoreAverage, EmploymentStatusDuration, color = factor(ProsperScore))) +
scale_x_continuous(limits = c(400, 850)) +
scale_y_continuous(limits = c(0, 600)) +
geom_point(alpha = 0.3, position = "jitter") +
theme(legend.title=element_blank()) +
labs(title = "Credit Scores, Employment Duration, and Prosper Scores", x = "Credit Score", y = "Duration (Months)")
```
<br> These two plots expand upon the Bivariate analysis of monthly income and employment duration. We can see that borrowers with a low credit score will not receive a high Prosper Score. Length of employment and monthly income do not appear to change this.
<br>
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=10}
ggpairs(loanData, columns = c("CreditScoreAverage", "ProsperScore", "LoanStatus"), aes(color = LoanStatus), legend = 1, diag = list(continuous = wrap("densityDiag", alpha = 0.5 ))) +
theme(legend.position = "bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(fill = "LoanStatus") +
labs(title = "Credit Score, Prosper Score, and Loan Status")
```
<br> The last plot combines the Prosper Score, Credit Score, and Loan Status to show a colorful array of visualizations. The correlation summary confirms and further expands upon the prior facet wrap plots. Borrowers whose loans are current or completed have better Prosper Scores and credit scores.
<br>
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The correlation matrix was especially helpful. It highlighted both strong and weak relationships in a simple visualization.
As I noted in the Bivariate Plots analysis, credit scores are better for borrowers who successfully completed their loan terms. The multivariate plots confirmed this.
### Were there any interesting or surprising interactions between features?
The point plots highlighted loans that do not have Prosper Scores (NA). The associated borrowers' credit scores and stated incomes are typically low. This was surprising. I did not explore the NA value closely before generating this plot and had not considered it as a point (pun intended) of interest. The chosen style of the graphic and the colors made the relationship easier to identify.
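
The NA group can also be compared with the scored loans numerically. This is a minimal sketch on hypothetical rows. `HasProsperScore` is a helper column introduced here for illustration and is not part of the dataset.

```r
# Hypothetical rows; only the first three column names come from the dataset.
toy <- data.frame(
  ProsperScore        = c(NA, 7, NA, 4, 9, NA),
  CreditScoreAverage  = c(609.5, 729.5, 629.5, 689.5, 769.5, 649.5),
  StatedMonthlyIncome = c(2500, 6000, 3000, 4200, 8000, 2800)
)

# Helper flag (hypothetical name): TRUE when the loan has a Prosper Score.
toy$HasProsperScore <- !is.na(toy$ProsperScore)

# Median credit score and income for loans without and with a Prosper Score.
na_summary <- aggregate(
  cbind(CreditScoreAverage, StatedMonthlyIncome) ~ HasProsperScore,
  data = toy, FUN = median
)
na_summary
```

According to the variable description quoted earlier, the Prosper Score applies to loans originated after July 2009, so the NA group is likely to be older loans rather than a separate borrower type.
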
------------------------------------------------------------------------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width = 10}
boxplot(data = loanData, ProsperScore ~ CreditScoreAverage, main = "Credit Score by Prosper Score", xlab = "Average Credit Score", ylab = "Prosper Score")
```
### Description One
This simple boxplot is the core of my exploration. When I initially reviewed the Prosper Loans dataset, I noticed the Prosper Score and wondered how it was calculated. A boxplot summarizes the distribution of one variable within each level of another, showing the median, spread, and symmetry of each group. It was a good choice for these attributes.
In this visualization, we can easily observe a strong relationship between the credit score and Prosper Score. If the borrower's average credit score is high, the Prosper Score tends to be high as well. Both scores are crucial characteristics of Prosper's risk scoring process.
Prosper Scores are derived from historical loan data collected after 2009. With additional analysis, we may be able to determine risks for borrowers with existing Prosper history vs. first time Prosper borrowers.
### Plot Two
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width = 10}
ggplot(data = loanData, aes(x = CreditScoreAverage, fill = EmploymentStatus)) +
geom_histogram(binwidth = 500, aes(y = ..density..), color = "black") +
facet_wrap(~ EmploymentStatus, scale = "free") +
scale_y_continuous(breaks = c(3,6,9)) +
geom_density(color = "black", lwd = 0.5, alpha = 0.7) +
labs(title = "Credit Score by Employment Status", x = "Average Credit Score") +
scale_fill_hue(l = 40, c = 150) +
xlim(300, 850)
```
### Description Two
I was excited to discover the facet wrap feature of ggplot. Facet wraps create a sequential series of graphs. Each category can be explored as a unique property or in relation to other categories.
Plots of this kind are easier for a reader to quickly absorb. We can see more information about the borrowers for each unique employment type and decide if the employment status positively or negatively impacts lending risk factors like credit scoring.
Retirees and self-employed borrowers have better credit scores than I expected. Income cannot eliminate the inherent risks of lending to borrowers who frequently miss payments or acquire public records.
### Plot Three
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three}
ggcorr(loanData, label = TRUE, label_size = 3, hjust = 0.8, size = 2.5, color = "black", layout.exp = 2) +
labs(title = "Correlation Matrix")
```
### Description Three
This matrix is simple but it explains so much about the data. This is a quick glance at all of the loan characteristics that I chose to explore and their correlations with one another.
------------------------------------------------------------------------
# Reflection
Loan risk scoring and creditworthiness calculation are complex subjects. The Prosper Loans dataset presents a great opportunity to explore the impact of loan and credit scores from the perspective of a borrower.
It was somewhat disappointing to discover that borrower occupations and employment statuses are poorly categorized. A large number of occupation values are either blank or categorized as "Other" and "Professional". That may indicate that the borrower's occupation is not a key characteristic of a Prosper risk analysis.
Requiring occupation and employment data for each loan would be a useful improvement. That data may highlight additional insights from supporting loan characteristics like long-term credit score stability. It would be fun to explore these attributes if they are available in the future.
The stated income and employment status data was the most surprising to me. Further exploration of the existing data might reveal that the Prosper Scoring model differs for Retirees and unemployed borrowers vs. borrowers with common sources of income.
Overall, this project was interesting and I enjoyed the exploration. This is my first time programming with R. I learned a lot about the language. I also gained new insights about the unique world of lending. That will certainly benefit me in my day to day work.
# Sources
RDocumentation Reference: <https://www.rdocumentation.org/>
ggplot2 Reference: <https://ggplot2.tidyverse.org/reference/>
dplyr Reference: <https://dplyr.tidyverse.org/>
ggcorr Reference: <https://briatte.github.io/ggcorr/>
R Markdown Cookbook: <https://bookdown.org/yihui/rmarkdown-cookbook/>
Advanced R Style Guide: <http://adv-r.had.co.nz/Style.html>
RStudio Cheatsheets: <https://www.rstudio.com/resources/cheatsheets/>
Cookbook for R - Colorblind-friendly Palette: <http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette>
How to Read and Use Histograms in R: <https://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/>
Loading Data and Formatting in R: <https://flowingdata.com/2015/02/18/loading-data-and-basic-formatting-in-r/>
Quick R - Subsetting Data: <https://www.statmethods.net/management/subset.html>
Credit Score Information: <https://www.americanexpress.com/en-us/credit-cards/credit-intel/credit-score-ranges/>
Sample Diamonds Exploration Provided by Udacity: <https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html>
R Markdown Project Template provided by Udacity: <https://video.udacity-data.com/topher/2017/February/58af99ac_projecttemplate/projecttemplate.rmd>