End-to-End Machine Learning Pipeline Creation Using DVC

Machine Learning (ML) is revolutionizing industries across the globe, but the journey from raw data to a deployed ML model can be complex and challenging. To streamline this process, this project builds an end-to-end machine learning pipeline using Data Version Control (DVC).

Step 1: Data Split

In any ML project, the foundation is data. In the first step, we focus on data splitting. This involves dividing our dataset into multiple subsets, typically a training set and a testing set. By using DVC to version control our data, we ensure that the data used for training remains consistent and reproducible. DVC helps us manage large datasets efficiently, keeping track of different data versions over time.
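As a sketch of what this splitting stage might do (the repository's actual split script is not shown here; the ratio, seed, and function name below are illustrative):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows deterministically and split them into train/test subsets.

    A fixed seed keeps the split reproducible, which matters when DVC
    re-runs the stage and compares outputs against tracked versions.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Writing the two subsets to files and running `dvc add` on them lets DVC track each version of the split.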

Step 2: Data Processing

Once we have our data, the next critical step is data preprocessing. This step involves cleaning, transforming, and preparing the data to make it suitable for training ML models. DVC comes in handy here by allowing us to version control our data preprocessing scripts and configurations. This ensures that data transformations are consistent and can be easily reproduced.
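As one illustration of such a transformation, a preprocessing script might rescale numeric features to a common range. The min-max scaler below is a generic sketch, not the repository's actual preprocessing code:

```python
def min_max_scale(values):
    """Rescale a list of numeric values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Because the script itself is version-controlled, rerunning it on the same data version always produces the same transformed output.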

Step 3: Model Training

With our data prepared, we move on to training ML models. This step involves selecting an appropriate algorithm, feeding it the training data, and tuning hyperparameters. DVC can version control the code, model configurations, and model weights, allowing us to track changes made during model training. This ensures reproducibility and makes it easy to collaborate with team members or roll back to previous model versions if needed.
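DVC wires a stage like this into the pipeline through a dvc.yaml file, which dvc repro executes. A minimal sketch of a training stage follows; the stage name, script path, and output path are illustrative, not taken from this repository (only the n_estimators parameter is mentioned in the results below):

```yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/processed     # preprocessed data from the previous stage
      - src/train.py
    params:
      - n_estimators       # hyperparameter read from params.yaml
    outs:
      - model/model.pkl    # trained model weights tracked by DVC
```

When the parameter or any dependency changes, dvc repro re-runs only the affected stages.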

Step 4: Model Evaluation

Once the model is trained, it's crucial to evaluate its performance. We use DVC to manage evaluation metrics and validation datasets. This enables us to track model performance over time and make informed decisions about model deployment. With a clear version history, we can easily compare different model iterations and choose the best-performing one.
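As a sketch of how a metric such as accuracy could be computed and written in a form DVC can track (the metric computation and JSON layout here are illustrative; declaring the file as a metric in dvc.yaml is assumed):

```python
import json

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Writing metrics to a JSON file lets `dvc metrics show` and
# `dvc metrics diff` pick them up across pipeline versions.
metrics = {"accuracy": accuracy([1, 0, 1, 1], [1, 0, 0, 1])}
print(json.dumps(metrics))  # {"accuracy": 0.75}
```

With metrics versioned alongside the code and data, comparing model iterations becomes a single `dvc metrics diff` away.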

With n_estimators = 10:

[metrics screenshot]

Plots for n_estimators = 10:

[three evaluation plots]

After changing n_estimators to 35:

[metrics screenshot]

Plots for n_estimators = 35:

[three evaluation plots]

In summary, this end-to-end pipeline built with DVC simplifies and strengthens the ML development process. By leveraging DVC for data versioning and code management, we ensure reproducibility, collaboration, and efficient tracking of changes throughout the pipeline. This approach makes it easier to develop, deploy, and maintain robust ML models in real-world applications.

Useful Commands

1. Adds files or directories to DVC tracking:
   dvc add ./model ./data
2. Removes a DVC-tracked file or directory:
   dvc remove model.dvc
3. Shows the status of DVC-tracked files, indicating changes:
   dvc status
4. Commits changes made to DVC-tracked files:
   dvc commit
5. Pushes data and metadata to the default DVC remote storage:
   dvc push
6. Pushes data and metadata to a specific remote storage:
   dvc push -r <remote-name>
7. Pulls data and metadata from the default DVC remote storage:
   dvc pull
8. Pulls data and metadata from a specific remote storage:
   dvc pull -r <remote-name>
9. Reproduces the data pipeline by running DVC-managed commands:
   dvc repro
10. Shows metrics from DVC-tracked metrics files:
    dvc metrics show
11. Compares metrics between different pipeline versions:
    dvc metrics diff
12. Shows plots generated for DVC-tracked data files:
    dvc plots show
