- The list of questions is based on this post: https://hackernoon.com/160-data-science-interview-questions-415s3y2a
- Legend: 👶 easy ⭐️ medium 🚀 expert
- Do you know how to answer a question that doesn't have an answer yet? Please create a PR
- See an error? Please create a PR with a fix
What is supervised machine learning? 👶
Supervised learning is the case when we have both the features (the matrix X) and the labels (the vector y), and the goal is to learn a mapping from X to y.
What is regression? Which models can you use to solve a regression problem? 👶
Regression is a part of supervised ML. Regression models predict a real number. Examples of models: linear regression, polynomial regression, ridge and lasso regression, decision trees, random forests, gradient boosting, and neural networks.
What is linear regression? When do we use it? 👶
Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).
With a simple equation:
y = B0 + B1*x1 + ... + Bn*xn
B0, ..., Bn are the regression coefficients, the x values are the independent (explanatory) variables, and y is the dependent variable.
The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
Simple linear regression:
y = B0 + B1*x1
Multiple linear regression:
y = B0 + B1*x1 + ... + Bn*xn
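A minimal scikit-learn sketch of fitting a multiple linear regression; the synthetic data and coefficient values below are only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y = 2 + 3*x1 - 1*x2 + noise (values chosen arbitrarily)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # estimate of B0
print(model.coef_)       # estimates of B1 and B2
```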
What’s the normal distribution? Why do we care about it? 👶
Answer here
How do we check if a variable follows the normal distribution? ⭐️
Answer here
What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices? ⭐️
Answer here
Which methods for solving linear regression do you know? ⭐️
Answer here
What is gradient descent? How does it work? ⭐️
Answer here
What is the normal equation? ⭐️
Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.
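In matrix form, for y ≈ X*B the least-squares solution is B = (X^T X)^(-1) X^T y. A minimal NumPy sketch (the toy data below is made up for illustration):

```python
import numpy as np

# toy data: 100 samples, 3 features (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# add a column of ones so that B0 (the intercept) is estimated too
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# normal equation: B = (X^T X)^(-1) X^T y
# solving the linear system is preferred over explicitly inverting X^T X
B = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(B)  # [B0, B1, B2, B3]
```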
What is SGD — stochastic gradient descent? What’s the difference with the usual gradient descent? ⭐️
Answer here
Which metrics for evaluating regression models do you know? 👶
Answer here
What are MSE and RMSE? 👶
Answer here
What is overfitting? 👶
When your model performs very well on the training set but fails to generalize to the test set (or to new, previously unseen data), because it has adjusted too closely to the specifics and noise of the training set.
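A minimal sketch of overfitting with scikit-learn: an unconstrained decision tree on noisy synthetic data memorizes the training set but does worse on the test set (the dataset parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# noisy synthetic data (flip_y adds label noise)
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an unconstrained tree can memorize the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```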
How to validate your models? 👶
Answer here
Why do we need to split our data into three parts: train, validation, and test? 👶
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of the model while fine-tuning hyperparameters, which improves the generalization of the model. Finally, a test set that the model has never "seen" before is used for the final, unbiased evaluation. The evaluation should never be performed on the same data that was used for training; otherwise the measured performance would not be representative.
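A minimal sketch of such a three-way split with scikit-learn; X and y are assumed to be already loaded, and the 60/20/20 ratio is just a common choice, not a rule:

```python
from sklearn.model_selection import train_test_split

# X, y are assumed to be loaded already
# first split off 40% of the data, then split that part half-and-half
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# result: 60% train, 20% validation, 20% test
```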
Can you explain how cross-validation works? 👶
Cross-validation is the process of splitting your training data into training and validation subsets and evaluating the model on the validation part, e.g. to choose hyperparameters. The process is repeated iteratively, each time selecting a different training and validation split, in order to reduce the bias you would get from relying on a single validation set.
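A minimal scikit-learn sketch of 5-fold cross-validation; X, y and the choice of logistic regression are placeholders for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y are assumed to be loaded already
# each of the 5 folds serves as the validation set once,
# while the model is trained on the remaining 4 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```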
What is K-fold cross-validation? 👶
Answer here
How do we choose K in K-fold cross-validation? What’s your favorite K? 👶
Answer here
What is classification? Which models would you use to solve a classification problem? 👶
Answer here
What is logistic regression? When do we need to use it? 👶
Answer here
Is logistic regression a linear model? Why? 👶
Answer here
What is sigmoid? What does it do? 👶
Answer here
How do we evaluate classification models? 👶
Answer here
What is accuracy? 👶
Accuracy is a metric for evaluating classification models. It is calculated by dividing the number of correct predictions by the number of total predictions.
Is accuracy always a good metric? 👶
Accuracy is not a good performance metric when there is an imbalance in the dataset. For example, in binary classification with 95% of class A and 5% of class B, a model that always predicts A already achieves 95% accuracy. In the case of an imbalanced dataset, we need to choose precision, recall, or the F1 score, depending on the problem we are trying to solve.
What is the confusion table? What are the cells in this table? 👶
The confusion table (or confusion matrix) shows how many true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) the model has made.
|                        | Actual Positive (1) | Actual Negative (0) |
|------------------------|---------------------|---------------------|
| Predicted Positive (1) | TP                  | FP                  |
| Predicted Negative (0) | FN                  | TN                  |
- True Positives (TP): When the actual class of the observation is 1 (True) and the prediction is 1 (True)
- True Negative (TN): When the actual class of the observation is 0 (False) and the prediction is 0 (False)
- False Positive (FP): When the actual class of the observation is 0 (False) and the prediction is 1 (True)
- False Negative (FN): When the actual class of the observation is 1 (True) and the prediction is 0 (False)
Most of the performance metrics for classification models are based on the values of the confusion matrix.
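A minimal scikit-learn sketch; the labels below are made up for illustration. Note that scikit-learn's convention puts actual classes in rows and predicted classes in columns, so for binary labels the output is [[TN, FP], [FN, TP]]:

```python
from sklearn.metrics import confusion_matrix

# made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

# rows: actual class, columns: predicted class -> [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```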
What are precision, recall, and F1-score? 👶
- Precision and recall are classification evaluation metrics:
- P = TP / (TP + FP) and R = TP / (TP + FN).
- Where TP is true positives, FP is false positives and FN is false negatives
- In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives.
- F1 is a combination of both precision and recall in one score (their harmonic mean); see the sketch after this list:
- F1 = 2 * P * R / (P + R).
- Max F1 score is 1 and min is 0, with 1 being the best.
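A minimal scikit-learn sketch computing all three metrics on made-up labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# made-up labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 2 * P * R / (P + R)
```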
Precision-recall trade-off ⭐️
Answer here
What is the ROC curve? When to use it? ⭐️
Answer here
What is AUC (AU ROC)? When to use it? ⭐️
Answer here
How to interpret the AU ROC score? ⭐️
Answer here
What is the PR (precision-recall) curve? ⭐️
Answer here
What is the area under the PR curve? Is it a useful metric? ⭐️
Answer here
In which cases AU PR is better than AU ROC? ⭐️
Answer here
What do we do with categorical variables? ⭐️
Answer here
Why do we need one-hot encoding? ⭐️
Answer here
What happens to our linear regression model if we have three columns in our data: x, y, z — and z is a sum of x and y? ⭐️
Answer here
What happens to our linear regression model if the column z in the data is a sum of columns x and y and some random noise? ⭐️
Answer here
What is regularization? Why do we need it? 👶
Answer here
Which regularization techniques do you know? ⭐️
Answer here
What kind of regularization techniques are applicable to linear models? ⭐️
Answer here
What does L2 regularization look like in a linear model? ⭐️
Answer here
How do we select the right regularization parameters? 👶
Answer here
What’s the effect of L2 regularization on the weights of a linear model? ⭐️
Answer here
What does L1 regularization look like in a linear model? ⭐️
Answer here
What’s the difference between L2 and L1 regularization? ⭐️
Answer here
Can we have both L1 and L2 regularization components in a linear model? ⭐️
Answer here
What’s the interpretation of the bias term in linear models? ⭐️
Answer here
How do we interpret weights in linear models? ⭐️
If the variables are normalized (on the same scale), we can interpret the weights of a linear model as a rough measure of how important each variable is for the predicted result.
If a weight for one variable is higher than for another — can we say that this variable is more important? ⭐️
Answer here
When do we need to perform feature normalization for linear models? When it’s okay not to do it? ⭐️
Answer here
What is feature selection? Why do we need it? 👶
Answer here
Is feature selection important for linear models? ⭐️
Answer here
Which feature selection techniques do you know? ⭐️
Answer here
Can we use L1 regularization for feature selection? ⭐️
Answer here
Can we use L2 regularization for feature selection? ⭐️
Answer here
What are decision trees? 👶
Answer here
How do we train decision trees? ⭐️
Answer here
What are the main parameters of the decision tree model? 👶
Answer here
How do we handle categorical variables in decision trees? ⭐️
Answer here
What are the benefits of a single decision tree compared to more complex models? ⭐️
Answer here
How can we know which features are more important for the decision tree model? ⭐️
Answer here
What is random forest? 👶
Answer here
Why do we need randomization in random forest? ⭐️
Answer here
What are the main parameters of the random forest model? ⭐️
Answer here
How do we select the depth of the trees in random forest? ⭐️
Answer here
How do we know how many trees we need in random forest? ⭐️
Answer here
Is it easy to parallelize training of a random forest model? How can we do it? ⭐️
Answer here
What are the potential problems with many large trees? ⭐️
Answer here
What if instead of finding the best split, we randomly select a few splits and just select the best from them. Will it work? 🚀
Answer here
What happens when we have correlated features in our data? ⭐️
Answer here
What is gradient boosting trees? ⭐️
Answer here
What’s the difference between random forest and gradient boosting? ⭐️
Answer here
Is it possible to parallelize training of a gradient boosting model? How to do it? ⭐️
Answer here
Feature importance in gradient boosting trees — what are possible options? ⭐️
Answer here
Are there any differences between continuous and discrete variables when it comes to feature importance of gradient boosting models? 🚀
Answer here
What are the main parameters in the gradient boosting model? ⭐️
Answer here
How do you approach tuning parameters in XGBoost or LightGBM? 🚀
Answer here
How do you select the number of trees in the gradient boosting model? ⭐️
Answer here
Which parameter tuning strategies (in general) do you know? ⭐️
Answer here
What’s the difference between grid search parameter tuning strategy and random search? When to use one or another? ⭐️
Answer here
What kind of problems neural nets can solve? 👶
Answer here
How does a usual fully-connected feed-forward neural network work? ⭐️
Answer here
Why do we need activation functions? 👶
Answer here
What are the problems with sigmoid as an activation function? ⭐️
Answer here
What is ReLU? How is it better than sigmoid or tanh? ⭐️
Answer here
How can we initialize the weights of a neural network? ⭐️
Answer here
What if we set all the weights of a neural network to 0? ⭐️
Answer here
What regularization techniques for neural nets do you know? ⭐️
Answer here
What is dropout? Why is it useful? How does it work? ⭐️
Answer here
What is backpropagation? How does it work? Why do we need it? ⭐️
Answer here
Which optimization techniques for training neural nets do you know? ⭐️
Answer here
How do we use SGD (stochastic gradient descent) for training a neural net? ⭐️
Answer here
What’s the learning rate? 👶
The learning rate is an important hyperparameter that controls how quickly the model adapts to the problem during training. It can be seen as the "step width" of the parameter updates, i.e. how far the weights are moved in the direction of the minimum of our optimization problem.
What happens when the learning rate is too large? Too small? 👶
A large learning rate can accelerate training. However, it is possible that we "shoot" too far and miss the minimum of the function that we want to optimize, so the training may never converge to the best solution. On the other hand, training with a small learning rate takes more time, but it is possible to find a more precise minimum. The downside is that progress is slow and the training can get stuck in a local minimum that is not the best possible global solution.
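A minimal sketch of the plain gradient-descent update w = w - lr * dL/dw on the toy function L(w) = w^2, illustrating how the learning rate controls the step size (the function and values are made up for illustration):

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize L(w) = w^2 with plain gradient descent; dL/dw = 2*w."""
    for _ in range(steps):
        grad = 2 * w
        w = w - lr * grad  # the learning rate scales the step size
    return w

print(gradient_descent(lr=0.01))  # too small: slow progress, w is still far from 0
print(gradient_descent(lr=0.1))   # reasonable: converges close to the minimum at 0
print(gradient_descent(lr=1.1))   # too large: the updates overshoot and diverge
```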
How to set the learning rate? ⭐️
Answer here
What is Adam? What’s the main difference between Adam and SGD? ⭐️
Answer here
When would you use Adam and when SGD? ⭐️
Answer here
Do we want to have a constant learning rate or we better change it throughout training? ⭐️
Answer here
How do we decide when to stop training a neural net? 👶
Answer here
What is model checkpointing? ⭐️
Answer here
Can you tell us how you approach the model training process? ⭐️
Answer here
How can we use neural nets for computer vision? ⭐️
Answer here
What’s a convolutional layer? ⭐️
Answer here
Why do we actually need convolutions? Can’t we use fully-connected layers for that? ⭐️
Answer here
What’s pooling in CNN? Why do we need it? ⭐️
Answer here
How does max pooling work? Are there other pooling techniques? ⭐️
Answer here
Are CNNs resistant to rotations? What happens to the predictions of a CNN if an image is rotated? 🚀
Answer here
What are augmentations? Why do we need them? 👶
Answer here
What kind of augmentations do you know? 👶
Answer here
How to choose which augmentations to use? ⭐️
Answer here
What kind of CNN architectures for classification do you know? 🚀
Answer here
What is transfer learning? How does it work? ⭐️
Answer here
What is object detection? Do you know any architectures for that? 🚀
Answer here
What is object segmentation? Do you know any architectures for that? 🚀
Answer here
How can we use machine learning for text classification? ⭐️
Answer here
What is bag of words? How we can use it for text classification? ⭐️
Answer here
What are the advantages and disadvantages of bag of words? ⭐️
Answer here
What are N-grams? How can we use them? ⭐️
Answer here
How large should N be for our bag of words when using N-grams? ⭐️
Answer here
What is TF-IDF? How is it useful for text classification? ⭐️
Answer here
Which model would you use for text classification with bag of words features? ⭐️
Answer here
Would you prefer gradient boosting trees model or logistic regression when doing text classification with bag of words? ⭐️
Answer here
What are word embeddings? Why are they useful? Do you know Word2Vec? ⭐️
Answer here
Do you know any other ways to get word embeddings? 🚀
Answer here
If you have a sentence with multiple words, you may need to combine multiple word embeddings into one. How would you do it? ⭐️
Answer here
Would you prefer gradient boosting trees model or logistic regression when doing text classification with embeddings? ⭐️
Answer here
How can you use neural nets for text classification? 🚀
Answer here
How can we use CNN for text classification? 🚀
Answer here
What is unsupervised learning? 👶
Unsupervised learning aims to detect patterns in data where no labels are given.
What is clustering? When do we need it? 👶
Clustering algorithms group objects such that similar feature points are put into the same groups (clusters) and dissimilar feature points are put into different clusters.
Do you know how K-means works? ⭐️
- Partition points into k subsets.
- Compute the seed points as the new centroids of the clusters of the current partitioning.
- Assign each point to the cluster with the nearest seed point.
- Go back to step 2 or stop when the assignments no longer change (see the sketch below).
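A minimal scikit-learn sketch of K-means on synthetic data (the dataset and k=3 are made up for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # learned centroids
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points
```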
How to select K for K-means? ⭐️
- Domain knowledge, i.e. an expert knows the value of k
- Elbow method: compute the clusters for different values of k; for each k, calculate the total within-cluster sum of squares; plot this sum against the number of clusters and use the location of the bend (the "elbow") as the number of clusters (see the sketch after this list).
- Average silhouette method: compute the clusters for different values of k; for each k, calculate the average silhouette of observations; plot the silhouette against the number of clusters and select the k that maximizes the average silhouette as the number of clusters.
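A minimal sketch of the elbow method with scikit-learn; the within-cluster sum of squares is exposed as inertia_, and the dataset is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 groups (for illustration)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# total within-cluster sum of squares (inertia_) for each candidate k;
# the k where the curve bends (the "elbow") is a reasonable choice
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, kmeans.inertia_)
```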
Which other clustering algorithms do you know? ⭐️
- k-medoids: Takes the most central point instead of the mean value as the center of the cluster. This makes it more robust to noise.
- Agglomerative Hierarchical Clustering (AHC): hierarchical clusters combining the nearest clusters starting with each point as its own cluster.
- DIvisive ANAlysis Clustering (DIANA): hierarchical clustering starting with one cluster containing all points and splitting the clusters until each point describes its own cluster.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): a cluster is defined as a maximal set of density-connected points.
Do you know how DBScan works? ⭐️
- Two input parameters epsilon (neighborhood radius) and minPts (minimum number of points in an epsilon-neighborhood)
- A cluster is defined as a maximal set of density-connected points (see the code sketch after this list).
- Points p_i and p_j are density-connected w.r.t. epsilon and minPts if there is a point o such that both p_i and p_j are density-reachable from o w.r.t. epsilon and minPts.
- p_j is density-reachable from p_i w.r.t. epsilon and minPts if there is a chain of points p_i -> p_{i+1} -> ... -> p_j such that each point in the chain is directly density-reachable from the previous one.
- p_j is directly density-reachable from p_i if p_j lies in the epsilon-neighborhood of p_i, i.e. dist(p_i, p_j) <= epsilon, and p_i is a core point (its epsilon-neighborhood contains at least minPts points).
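A minimal scikit-learn sketch of DBSCAN; eps and min_samples correspond to epsilon and minPts above, and the dataset is synthetic:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# non-convex clusters where K-means typically struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius, min_samples corresponds to minPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster labels; -1 marks noise points
```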
When would you choose K-means and when DBScan? ⭐️
- DBScan is more robust to noise.
- DBScan is better when the amount of clusters is difficult to guess.
- K-means has a lower complexity, i.e. it will be much faster, especially with a large number of points.
What is the curse of dimensionality? Why do we care about it? ⭐️
Answer here
Do you know any dimensionality reduction techniques? ⭐️
Answer here
What’s singular value decomposition? How is it typically used for machine learning? ⭐️
Answer here
What is the ranking problem? Which models can you use to solve them? ⭐️
Answer here
What are good unsupervised baselines for text information retrieval? ⭐️
Answer here
How would you evaluate your ranking algorithms? Which offline metrics would you use? ⭐️
Answer here
What are precision and recall at k? ⭐️
Answer here
What is mean average precision at k? ⭐️
Answer here
How can we use machine learning for search? ⭐️
Answer here
How can we get training data for our ranking algorithms? ⭐️
Answer here
Can we formulate the search problem as a classification problem? How? ⭐️
Answer here
How can we use clicks data as the training data for ranking algorithms? 🚀
Answer here
Do you know how to use gradient boosting trees for ranking? 🚀
Answer here
How do you do an online evaluation of a new ranking algorithm? ⭐️
Answer here
What is a recommender system? 👶
Answer here
What are good baselines when building a recommender system? ⭐️
Answer here
What is collaborative filtering? ⭐️
Answer here
How can we incorporate implicit feedback (clicks, etc.) into our recommender systems? ⭐️
Answer here
What is the cold start problem? ⭐️
Answer here
Possible approaches to solving the cold start problem? ⭐️🚀
Answer here
What is a time series? 👶
Answer here
How is time series different from the usual regression problem? 👶
Answer here
Which models do you know for solving time series problems? ⭐️
Answer here
If there’s a trend in our series, how can we remove it? And why would we want to do it? ⭐️
Answer here
You have a series with only one variable “y” measured at time t. How do you predict “y” at time t+1? Which approaches would you use? ⭐️
Answer here
You have a series with a variable “y” and a set of features. How do you predict “y” at t+1? Which approaches would you use? ⭐️
Answer here
What are the problems with using trees for solving time series problems? ⭐️
Answer here