Norwegian version:
Program that produces retail/wholesale trade statistics using machine learning + visualisations
The main problem being solved is inaccurate/low-quality responses delivered by respondents to a survey - often with very material consequences for the final production. Correcting these normally takes a team of statisticians an entire year (and sometimes requires recontacting the respondents). I aim to solve this task using machine learning and other statistical methods.
Results: A full production run (normally completed by a team of 5-7 people over an entire year) completed in 600.38 seconds. The results pass several logical tests and, when backtested against former productions, compare very favorably. The R² between what this program produces and what was actually published was approx. 98%, with a mean absolute error of approx. 5,000 NOK - which is low given the characteristics of our data.
This repo will also act as a one-stop shop for method testing and ML solution development for statistics relating to NØKU. Relevant README files will be added in the folders where other code is saved.
Feel free to clone the repo if you have appropriate access. I will also demonstrate what the code does here in this README file:
The visualisations seen here in this README file are for data that has already been published and have had noise added in order to protect confidentiality. The visuals here simply demonstrate how the code functions.
Several visualisations are used to analyse the data at an industry level. The plots are interactive: the user can select years, chart types, focus variables, etc. All of the regular plotly interactive tools are available as well. Some visualisations are animated, and if the user presses play, they will see changes over time. Here is what some of the outputs look like (they would naturally adjust if something different were selected in the dropdown menus).
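As a minimal sketch of how an animated, interactive plot of this kind can be built with plotly express (the data frame and column names below are purely illustrative, not the real production variables):

```python
import pandas as pd
import plotly.express as px

# Illustrative data - the real notebook uses (noise-added) industry-level figures.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2022, 2022],
    "industry": ["47.11", "47.30", "47.11", "47.30", "47.11", "47.30"],
    "turnover": [120, 95, 130, 90, 140, 100],
})

# An animated bar chart: pressing "play" steps through the years,
# and the standard plotly toolbar (zoom, pan, export) is available.
fig = px.bar(
    df,
    x="industry",
    y="turnover",
    color="industry",
    animation_frame="year",
    title="Turnover by industry (illustrative data)",
)
fig.show()
```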
Simple plots:
Bar Charts and Heat Maps:
Maps (one that is animated):
Histogram with cumulative percentage:
Linked Plots:
Bubble Plots:
Parallel Coordinates Plot:
Geographical Plot:
Animated Bar Chart:
3D Plot:
This program aims to solve the problem of low-quality responses to financial data surveys. We evaluate quality by comparing responses to Skatteetaten data and by how many of the fields are filled out. Poor-quality responses are imputed using a chosen machine learning algorithm, which is trained on the full data set (net of poor-quality surveys).
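A rough sketch of this flow is shown below. The quality flag, column names and the RandomForestRegressor stand-in are assumptions for illustration, not the exact production setup:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor  # stand-in for whichever model is chosen


def impute_poor_responses(df, feature_cols, target_col, quality_col="good_quality"):
    """Train on good-quality responses and overwrite the target for poor-quality ones."""
    good = df[df[quality_col]]
    poor = df[~df[quality_col]]

    model = RandomForestRegressor(random_state=0)
    model.fit(good[feature_cols], good[target_col])

    out = df.copy()
    out.loc[~out[quality_col], target_col] = model.predict(poor[feature_cols])
    return out


# Tiny illustrative frame: "good_quality" marks responses that agree with
# Skatteetaten data and have enough fields filled out.
df = pd.DataFrame({
    "tax_turnover": [100, 200, 150, 120],       # hypothetical feature
    "employees": [5, 12, 8, 6],                 # hypothetical feature
    "reported_turnover": [105, 210, 160, 999],  # last response fails the quality checks
    "good_quality": [True, True, True, False],
})
df = impute_poor_responses(df, ["tax_turnover", "employees"], "reported_turnover")
```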
Important tools used:
Feature engineering: I gathered extra data by querying various APIs and cooperating with several other departments within SSB. I also used tools such as KNN imputation to fill NaN values and created new trend variables using linear regression.
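A minimal sketch of these two feature-engineering steps, assuming purely illustrative column names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# Illustrative frame - column names are placeholders, not the real variable names.
df = pd.DataFrame({
    "turnover_2019": [100, 80, 120, np.nan],
    "turnover_2020": [105, np.nan, 118, 95],
    "turnover_2021": [110, 85, np.nan, 97],
})

# 1) Fill NaN values with KNN imputation.
imputer = KNNImputer(n_neighbors=2)
df[df.columns] = imputer.fit_transform(df)

# 2) Trend feature: the slope of a linear regression over each unit's history.
def trend_slope(values: np.ndarray) -> float:
    x = np.arange(len(values)).reshape(-1, 1)
    return float(LinearRegression().fit(x, values).coef_[0])

df["turnover_trend"] = df.apply(lambda row: trend_slope(row.to_numpy()), axis=1)
print(df)
```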
GridSearch: This was used for hyperparameter tuning. It can be switched on and off depending on the needs of the user.
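A hedged sketch of what a switchable GridSearch might look like, using synthetic data and an XGBoost regressor as the example estimator (the flag name and parameter grid are illustrative, not the production configuration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

USE_GRIDSEARCH = True  # switch off to skip tuning and use the default parameters

# Synthetic stand-in for the engineered training data.
X_train, y_train = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

base_model = XGBRegressor(objective="reg:squarederror", random_state=0)

if USE_GRIDSEARCH:
    param_grid = {
        "n_estimators": [200, 500],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1],
    }
    search = GridSearchCV(base_model, param_grid, cv=5, scoring="neg_mean_absolute_error")
    search.fit(X_train, y_train)
    model = search.best_estimator_
else:
    model = base_model.fit(X_train, y_train)
```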
Other key tools and parameters:
Scaler (object): Scalers are used to normalize or standardize numerical features. Common scalers include StandardScaler and RobustScaler. Normalization helps in speeding up the convergence of the training algorithm by ensuring that all features contribute equally to the learning process.
epochs_number (int): The number of epochs determines how many times the learning algorithm will work through the entire training dataset. More epochs can lead to better learning but may also result in overfitting if too high.
batch_size (int): This defines the number of samples that will be propagated through the network at one time. Smaller batch sizes can lead to more reliable updates but are computationally more expensive. I chose a medium-sized value based on the shape of the data and how often certain features appear within the df. Speed was also a consideration.
Early Stopping: I use early stopping techniques in order to prevent overfitting and improve training time.
Learning Curves: I have used learning curves to determine whether models are overfitting. The results indicate that this has not occurred.
All parameters are subject to change based on results and, at times, on the results of a GridSearch (hyperparameter tuning). A sketch showing how these parameters fit together is included after the model-building details below.
learning_rate (float): In the function, the default learning rate is set to 0.001. The learning rate controls how much the model’s weights are adjusted with respect to the loss gradient. A learning rate of 0.001 is a common starting point as it allows the model to converge smoothly without overshooting the optimal solution.
dropout_rate (float): The default dropout rate is set to 0.5. Dropout is a regularization technique used to prevent overfitting by randomly setting a fraction of the input units to zero at each update during training. A dropout rate of 0.5 means that half of the neurons are dropped, which is a standard value for promoting robustness in the network.
neurons_layer1 (int): The first layer of the neural network has 64 neurons by default. Having 64 neurons allows the model to capture complex patterns in the data while maintaining a balance between computational efficiency and model capacity.
neurons_layer2 (int): The second layer has 32 neurons by default. This smaller number of neurons in the subsequent layer helps in reducing the model complexity gradually, which can help in capturing hierarchical patterns in the data.
activation (str): The activation function used in the hidden layers is relu (Rectified Linear Unit). The ReLU function is popular because it introduces non-linearity while being computationally efficient and mitigating the vanishing gradient problem common in deeper networks.
optimizer (str): The optimizer used is adam by default. Adam (Adaptive Moment Estimation) is an adaptive learning rate optimization algorithm that has been widely adopted due to its efficiency and effectiveness in training deep neural networks. It combines the advantages of two other extensions of stochastic gradient descent, namely AdaGrad and RMSProp, to provide faster convergence.
Additional Details on the Model Building Process
Layer Construction:
The first dense layer with 64 neurons uses relu activation, which is ideal for capturing complex non-linear relationships. A dropout layer follows to prevent overfitting by randomly dropping 50% of the neurons during training. The second dense layer with 32 neurons also uses relu activation, helping to refine the features extracted by the first layer. Another dropout layer is added after the second dense layer for additional regularization. The final output layer has a single neuron with a linear activation function, appropriate for regression tasks as it outputs a continuous value.
Regularization:
The kernel_regularizer=tf.keras.regularizers.l2(0.01) is applied to the dense layers. L2 regularization helps in preventing overfitting by penalizing large weights, thereby promoting smaller, more generalizable weights.
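Putting the parameters above together, here is a sketch of what the model-building function could look like in Keras, using the defaults described in this section. Function and variable names are illustrative, and the loss function and scaler shown are assumptions rather than the exact production choices:

```python
import tensorflow as tf
from sklearn.preprocessing import StandardScaler  # RobustScaler is a drop-in alternative


def build_model(
    n_features: int,
    learning_rate: float = 0.001,
    dropout_rate: float = 0.5,
    neurons_layer1: int = 64,
    neurons_layer2: int = 32,
    activation: str = "relu",
    optimizer: str = "adam",
) -> tf.keras.Model:
    """Two dense hidden layers with dropout and L2 regularization; linear output for regression."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(neurons_layer1, activation=activation,
                              kernel_regularizer=tf.keras.regularizers.l2(0.01)),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(neurons_layer2, activation=activation,
                              kernel_regularizer=tf.keras.regularizers.l2(0.01)),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    if optimizer == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss="mean_squared_error", metrics=["mae"])
    return model


# Early stopping halts training once validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

# Typical usage (numerical features scaled first, then fitted with the tuned
# epochs_number and batch_size):
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# model = build_model(n_features=X_train.shape[1])
# history = model.fit(X_train, y_train, validation_split=0.2,
#                     epochs=epochs_number, batch_size=batch_size,
#                     callbacks=[early_stop])
```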
Results:
XGBoost:
**I used visualisation techniques in order to see the importance of several features.**
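As an illustration of this kind of feature-importance visualisation (synthetic data stands in for the real feature set):

```python
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_regression

# Synthetic stand-in for the engineered features.
X, y = make_regression(n_samples=300, n_features=8, noise=5, random_state=0)

model = xgb.XGBRegressor(objective="reg:squarederror", random_state=0)
model.fit(X, y)

# Built-in importance plot (other measures are available via importance_type).
xgb.plot_importance(model, max_num_features=10)
plt.tight_layout()
plt.show()
```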
K-Nearest Neighbors:
I also created a dashboard using Dash to visualise the final product. Here is a quick snapshot (there's more), but essentially it is the visualisations seen in the notebook, in dashboard form, where variables can be selected and used to update all plots at once:
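A minimal sketch of the dropdown-plus-callback pattern the dashboard is built on (illustrative data and component IDs; the real app has many more components and plots):

```python
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

# Illustrative data - the real dashboard runs on the (noise-added) production output.
df = pd.DataFrame({
    "year": [2020, 2021, 2022] * 2,
    "industry": ["47.11"] * 3 + ["47.30"] * 3,
    "turnover": [120, 130, 140, 95, 90, 100],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(sorted(df["industry"].unique()), "47.11", id="industry-dropdown"),
    dcc.Graph(id="turnover-graph"),
])


@app.callback(Output("turnover-graph", "figure"), Input("industry-dropdown", "value"))
def update_graph(industry):
    # Every plot in the dashboard is updated from the same dropdown selection.
    subset = df[df["industry"] == industry]
    return px.line(subset, x="year", y="turnover", title=f"Turnover for industry {industry}")


if __name__ == "__main__":
    app.run(debug=True)
```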
I perform several logical tests and backtest the output of the program against actual publications:
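A sketch of how such a backtest can be computed, with hypothetical numbers standing in for the real predictions and published figures:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical arrays: model output vs. the figures that were actually published.
predicted = np.array([1_050_000, 980_000, 2_300_000, 150_000])
published = np.array([1_045_000, 985_000, 2_295_000, 148_000])

r2 = r2_score(published, predicted)
mae = mean_absolute_error(published, predicted)

print(f"R^2 vs. published figures: {r2:.3f}")
print(f"Mean absolute error: {mae:,.0f} NOK")

# Example of a simple logical test: no negative turnover values.
assert (predicted >= 0).all()
```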
Based on these results, it's likely I will use K-nearest neighbors (KNN) for the 2023 production.
Models can always be improved. With more resources, particularly time, it may be worth investigating several other opportunities, such as:
- Training models for specific industries, especially if those industries are particularly distinctive. For example, for petrol & diesel sales we could try using various road network features (distance to the nearest petrol stations, how often a road is used, etc.).
- Card transaction data may soon be available, which leads to the possibility of better feature engineering - particularly for retail industries.
- There is an opportunity to identify which industry a company is likely to belong to and, as a result, to identify companies that are currently assigned to the wrong industry (the key by which everything is aggregated). Current classification models perform poorly, as seen below, but these only use financial data; I expect that if we use features such as job titles (the number of employees under a given job title), the models will perform better.
Road Network Data:
Classification Performance (So far)