
CC BY-SA 4.0

Predicting Spotify Song Popularity: A Refactoring Journey

In this case study, we show how a machine learning use case implemented as a Jupyter notebook (taken from Kaggle, originally implemented by Saurav Palekar, and licensed under the Apache 2.0 license) can be successively refactored in order to

  • improve the software design in general, achieving a high degree of clarity and maintainability,
  • gain flexibility for experimentation,
  • appropriately track results,
  • arrive at a solution that can straightforwardly be deployed to production.

The use case considers a dataset from Kaggle containing metadata on approximately one million songs (see download instructions below). The goal is to learn a model that predicts a song's popularity from its other attributes, such as the tempo, the release year, the key and the musical mode.
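
To make the task concrete, the following is a minimal sketch of the learning problem in plain scikit-learn. The file name, column names and model choice are illustrative assumptions rather than the actual code of the case study; in particular, whether popularity is best treated as a class or as a numeric target is a question revisited in step 11 below.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # File and column names are assumptions; adjust them to the actual dataset
    df = pd.read_csv("data/spotify_data.csv")
    X = df[["tempo", "year", "key", "mode"]]  # song attributes as mentioned above
    y = df["popularity"]  # the prediction target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # coefficient of determination (R²)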

Preliminaries

Make sure you have created the Python virtual environment, set up a project in your IDE and downloaded the data as described in the root README file.

How to Use This Package

This package is organised as follows:

  • There is one folder per step in the refactoring process with a dedicated README file explaining the key aspects of the respective step.
  • There is an independent Python implementation of the use case in each folder, which you should inspect alongside the README file.

The intended way of exploring this package is to clone the repository and open it in your IDE of choice, such that you can browse it with familiar tools and navigate the code efficiently.

Diffing

To see the concrete changes from one step to the next more clearly, you can make use of a diff tool. To support this, you may run the Python script generate_repository.py, which creates a git repository in the folder refactoring-repo in which the state of each step is referenced by a separate tag. In that folder, you could then run, for example,

    git difftool step04-model-specific-pipelines step05-sensai
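
For illustration only, a script of this kind might drive git via subprocess roughly as follows. This is a simplified sketch, not the actual content of generate_repository.py, and the folder-to-tag mapping is an assumption:

    import shutil
    import subprocess
    from pathlib import Path

    REPO = Path("refactoring-repo")

    def git(*args: str) -> None:
        subprocess.run(["git", *args], cwd=REPO, check=True)

    REPO.mkdir(exist_ok=True)
    git("init")
    # Assumed mapping of step folders to tag names; the real script may differ
    for tag, src in [
        ("step04-model-specific-pipelines", Path("04-model-specific-pipelines")),
        ("step05-sensai", Path("05-sensai")),
    ]:
        # Overlay the step's files onto the working tree, then commit and tag
        for item in src.iterdir():
            dest = REPO / item.name
            if item.is_dir():
                shutil.copytree(item, dest, dirs_exist_ok=True)
            else:
                shutil.copy2(item, dest)
        git("add", "-A")
        git("commit", "-m", f"State of {tag}")
        git("tag", tag)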

Steps in the Journey

These are the steps of the journey:

  1. Monolithic Notebook

    This is the starting point: a largely unstructured Jupyter notebook.

  2. Python Script

    This step extracts the code that is strictly concerned with the training and evaluation of models.

  3. Dataset Representation

    This step introduces an explicit representation for the dataset, making transformations explicit as well as optional (see the first sketch after this list).

  4. Refactoring

    This step improves the code structure by adding function-specific Python modules.

  5. Model-Specific Pipelines

    This step refactors the pipeline, moving all data transformations into the models and thus enabling different models to use entirely different pipelines (see the second sketch after this list).

  6. sensAI

    This step introduces the high-level library sensAI, which will enable more flexible, declarative model specifications down the line.

  7. Feature Representation

    This step separates representations of features and their properties from the models that use them, allowing model input pipelines to be flexibly composed.

  8. Feature Engineering

    This step adds an engineered feature to the mix.

  9. High-Level Evaluation

    This step applies sensAI's high-level abstraction for model evaluation, enabling logging.

  10. Tracking Experiments

    This step adds tracking functionality via sensAI's mlflow integration (and additionally by saving results directly to the file system).

  11. Regression

    This step considers the perhaps more natural formulation of the prediction problem as a regression problem.

  12. Hyperparameter Optimisation

    This step adds hyperparameter optimisation for the XGBoost regression model (see the third sketch after this list).

  13. Cross-Validation

    This step adds the option to use cross-validation.

  14. Deployment

    This step adds a web service for inference, which is packaged in a Docker container (see the final sketch below).
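
To conclude, here are a few simplified sketches that give a flavour of the code the steps introduce. They are purely illustrative: all names and signatures are assumptions, not the actual code found in the step folders.

First, an explicit dataset representation (step 3), which turns previously hard-coded transformations into explicit, optional parameters:

    from dataclasses import dataclass
    from typing import Optional

    import pandas as pd

    @dataclass
    class Dataset:
        """Illustrative explicit representation of the song dataset."""
        num_samples: Optional[int] = None   # optionally subsample for fast experiments
        drop_zero_popularity: bool = False  # transformations are explicit and optional

        def load_data_frame(self, csv_path: str = "data/spotify_data.csv") -> pd.DataFrame:
            df = pd.read_csv(csv_path)
            if self.drop_zero_popularity:
                df = df[df["popularity"] > 0]
            if self.num_samples is not None:
                df = df.sample(self.num_samples, random_state=42)
            return df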
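
Second, model-specific pipelines (step 5): each transformation is owned by the model that requires it, so different models are free to use entirely different preprocessing. Sketched here with scikit-learn pipelines:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from xgboost import XGBClassifier

    def create_logistic_regression_model() -> Pipeline:
        # Scaling matters for linear models, so it lives inside *this* pipeline ...
        return Pipeline([
            ("scale", StandardScaler()),
            ("model", LogisticRegression(max_iter=1000)),
        ])

    def create_xgb_model() -> Pipeline:
        # ... whereas the tree-based model needs no scaling at all
        return Pipeline([
            ("model", XGBClassifier()),
        ])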
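
Third, hyperparameter optimisation for the XGBoost regression model (step 12), reduced here to plain scikit-learn; the parameter ranges and the search strategy are made up:

    import numpy as np
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBRegressor

    param_distributions = {
        "max_depth": [4, 6, 8],
        "n_estimators": [100, 300, 600],
        "learning_rate": [0.01, 0.05, 0.1],
    }
    search = RandomizedSearchCV(XGBRegressor(), param_distributions, n_iter=10,
        cv=3, scoring="neg_mean_absolute_error", random_state=42)

    # Dummy data just to make the sketch runnable end to end
    rng = np.random.default_rng(42)
    X, y = rng.random((200, 4)), rng.random(200)
    search.fit(X, y)
    print(search.best_params_)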
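
Finally, a minimal inference web service in the spirit of step 14, sketched with Flask; the route, the input format and the model artifact are assumptions, and in the Docker container a production-grade server would wrap such an app:

    import pickle

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Assumed artifact: a model/pipeline persisted during training
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects a JSON object with one entry per feature, e.g. {"tempo": 120.0, ...}
        df = pd.DataFrame([request.get_json()])
        return jsonify({"popularity": float(model.predict(df)[0])})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)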