In general, the dataset is split into three parts: training, validation, and test. For each partition, we need a feature matrix (X) and a target vector (y). First, the size of each partition is calculated. Then the record indices are shuffled so that each partition contains a random, non-sequential sample of the dataset. Finally, the partitions are created by selecting rows with the shuffled indices.
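The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy dataframe, the column names `feature` and `target`, the 60/20/20 proportions, and the seed value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 10 records, one feature column and one target column.
df = pd.DataFrame({"feature": np.arange(10) * 1.5, "target": np.arange(10)})

# 1. Compute partition sizes (assumed 60/20/20 split).
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test  # the remainder goes to training

# 2. Shuffle the positional indices (seed chosen arbitrarily for reproducibility).
idx = np.arange(n)
np.random.seed(2)
np.random.shuffle(idx)  # shuffles idx in place

# 3. Build the partitions from consecutive slices of the shuffled indices.
df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train + n_val]]
df_test = df.iloc[idx[n_train + n_val:]]
```

Computing `n_train` as the remainder avoids losing records to rounding when `n` is not evenly divisible by the chosen proportions.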
Pandas attributes and methods:
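df.iloc[] - returns a subset of the records of a dataframe, selected by integer position
df.reset_index() - resets the index to the default consecutive integers (0, 1, 2, ...), replacing the shuffled one
del df[col] - deletes a column from the dataframe; used here to remove the target variable from the features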
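As a rough illustration of how these pandas operations fit together after a partition has been created, consider the sketch below. The small dataframe, its out-of-order index, and the column name `target` are hypothetical, and `drop=True` is one possible choice rather than something the notes prescribe.

```python
import pandas as pd

# Hypothetical training partition whose index is out of order after shuffling.
df_train = pd.DataFrame(
    {"feature": [0.5, 1.5, 2.5], "target": [10, 20, 30]},
    index=[7, 2, 5],
)

# Reset the index to 0, 1, 2; drop=True discards the old index
# instead of keeping it as an extra column.
df_train = df_train.reset_index(drop=True)

# Save the target as a NumPy array, then delete it from the dataframe
# so it cannot be used as a feature by accident.
y_train = df_train["target"].values
del df_train["target"]

# The remaining columns form the feature matrix.
X_train = df_train.values
```

Extracting `y_train` before the `del` statement matters: once the column is deleted, it is no longer available in `df_train`.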
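NumPy functions:
np.arange() - returns an array of evenly spaced numbers (for example, the integers from 0 to n - 1)
np.random.shuffle() - shuffles an array in place; it modifies its argument and returns None
np.random.seed() - sets the seed of the random number generator so that results are reproducible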
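The behaviour of these NumPy functions can be checked with a short experiment. The array length and the seed value 42 below are arbitrary choices for illustration only.

```python
import numpy as np

idx = np.arange(5)  # array([0, 1, 2, 3, 4])

np.random.seed(42)               # fix the random state (42 is arbitrary)
result = np.random.shuffle(idx)  # reorders idx itself; result is None

first = idx.copy()  # keep the shuffled order for comparison

# Re-seeding with the same value reproduces the same shuffle.
np.random.seed(42)
idx2 = np.arange(5)
np.random.shuffle(idx2)

same_order = bool((first == idx2).all())
```

Because `np.random.shuffle()` works in place, assigning its return value (as in `idx = np.random.shuffle(idx)`) would replace the array with `None`.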
The entire code of this project is available in this Jupyter notebook.
The notes are written by the community. If you see an error here, please create a PR with a fix.