In this section we clean and prepare the dataset for the model. This involves the following steps:

- Download the data from the given link.
- Reformat the categorical columns (`status`, `home`, `marital`, `records`, and `job`) by mapping them to appropriate values.
- Replace the maximum value of the `income`, `assets`, and `debt` columns with NaNs.
- Replace the NaNs in the dataframe with `0` (will be shown in the next lesson).
- Keep only the rows whose `status` value is either `ok` or `default`.
- Split the data in a two-step process that results in 60% train, 20% validation, and 20% test sets, with the random seed set to `11`.
- Prepare the target variable `status` by converting it from categorical to binary, where 0 represents `ok` and 1 represents `default`.
- Finally, delete the target variable from the train/val/test dataframes.
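
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the video. It runs on a small synthetic dataframe instead of the downloaded file, and the numeric codes in the mapping dictionaries, the `unk` label, and the placeholder value `99999999` are assumptions made for the example. Only two of the categorical columns are shown, and filling NaNs with `0` is left out because it is covered in the next lesson.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the downloaded data (all values are made up).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "status": np.tile([1, 2, 1, 1, 0], n // 5),  # assumed coding, see mapping below
    "home": rng.integers(1, 4, size=n),
    "income": rng.integers(50, 300, size=n),
    "assets": rng.integers(0, 10_000, size=n),
    "debt": rng.integers(0, 5_000, size=n),
})
# Plant an assumed "missing value" placeholder as the column maximum.
df.loc[[0, 1], "income"] = 99999999
df.loc[2, "assets"] = 99999999
df.loc[3, "debt"] = 99999999

# Map the encoded categorical columns to readable values (hypothetical codes).
status_values = {1: "ok", 2: "default", 0: "unk"}
home_values = {1: "rent", 2: "owner", 3: "parents"}
df["status"] = df["status"].map(status_values)
df["home"] = df["home"].map(home_values)

# Replace the maximum value of each numeric column with NaN.
for col in ["income", "assets", "debt"]:
    df[col] = df[col].replace(to_replace=df[col].max(), value=np.nan)

# Keep only the rows whose status is "ok" or "default".
df = df[df["status"].isin(["ok", "default"])].reset_index(drop=True)

# Two-step split: first 80/20, then 25% of the 80% becomes validation,
# which gives 60% train, 20% validation, 20% test overall.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Binary target: 0 for "ok", 1 for "default".
y_train = (df_train["status"] == "default").astype(int).values
y_val = (df_val["status"] == "default").astype(int).values
y_test = (df_test["status"] == "default").astype(int).values

# Remove the target from the feature dataframes.
del df_train["status"]
del df_val["status"]
del df_test["status"]
```

The second split uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full dataset. Passing the same `random_state` to both calls keeps the split reproducible.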
Add notes from the video (PRs are welcome)
The notes are written by the community. If you see an error here, please create a PR with a fix.