Skip to content

Latest commit

 

History

History
153 lines (101 loc) · 4.57 KB

File metadata and controls

153 lines (101 loc) · 4.57 KB

Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Dataset

In this homework, we will use the Car price dataset. Download it from here.

Or you can do it with wget:

wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

We'll keep working with the MSRP variable, and we'll transform it to a classification task.

Features

For the rest of the homework, you'll need to use only these columns:

  • Make,
  • Model,
  • Year,
  • Engine HP,
  • Engine Cylinders,
  • Transmission Type,
  • Vehicle Style,
  • highway MPG,
  • city mpg

Data preparation

  • Select only the features from above and transform their names using next line:
    data.columns = data.columns.str.replace(' ', '_').str.lower()
    
  • Fill in the missing values of the selected features with 0.
  • Rename MSRP variable to price.

Question 1

What is the most frequent observation (mode) for the column transmission_type?

  • AUTOMATIC
  • MANUAL
  • AUTOMATED_MANUAL
  • DIRECT_DRIVE

Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

  • engine_hp and year
  • engine_hp and engine_cylinders
  • highway_mpg and engine_cylinders
  • highway_mpg and city_mpg

Make price binary

  • Now we need to turn the price variable from numeric into a binary format.
  • Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.

Split the data

  • Split your data in train/val/test sets with 60%/20%/20% distribution.
  • Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
  • Make sure that the target value (price) is not in your dataframe.

Question 3

  • Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
  • Round the scores to 2 decimals using round(score, 2).

Which of these variables has the lowest mutual information score?

  • make
  • model
  • transmission_type
  • vehicle_style

Question 4

  • Now let's train a logistic regression.
  • Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
  • Fit the model on the training dataset.
    • To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    • model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
  • Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

  • 0.60
  • 0.72
  • 0.84
  • 0.95

Question 5

  • Let's find the least useful feature using the feature elimination technique.
  • Train a model with all these features (using the same parameters as in Q4).
  • Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
  • For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of following feature has the smallest difference?

  • year
  • engine_hp
  • transmission_type
  • city_mpg

Note: the difference doesn't have to be positive

Question 6

  • For this question, we'll see how to use a linear regression model from Scikit-Learn.
  • We'll need to use the original column price. Apply the logarithmic transformation to this column.
  • Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
  • This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
  • Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

  • 0
  • 0.01
  • 0.1
  • 1
  • 10

Note: If there are multiple options, select the smallest alpha.

Submit the results

  • Submit your results here: https://forms.gle/FFfNjEP4jU4rxnL26
  • You can submit your solution multiple times. In this case, only the last submission will be used
  • If your answer doesn't match options exactly, select the closest one

Deadline

The deadline for submitting is 2 October (Monday), 23:00 CEST.

After that, the form will be closed.