Pulling latest updates #6

Merged · 16 commits · Nov 8, 2023
Changes from all commits
1 change: 1 addition & 0 deletions .pylintrc
@@ -3,6 +3,7 @@
# When enabled, pylint would attempt to guess common misconfiguration and emit
# user-friendly hints instead of false-positive error messages.
suggestion-mode=yes
init-hook='import sys; sys.path.append("src")'

[MESSAGES CONTROL]

66 changes: 57 additions & 9 deletions README.md
@@ -1,14 +1,62 @@
[![Pytest](https://github.com/Thomas-George-T/Ecommerce-Data-MLOps/actions/workflows/pytest.yml/badge.svg)](https://github.com/Thomas-George-T/Ecommerce-Data-MLOps/actions/workflows/pytest.yml)
# Ecommerce Customer Segmentation & MLOps

<p align="center">
<br>
<a href="#">
<img src="https://raw.githubusercontent.com/Thomas-George-T/Thomas-George-T/master/assets/python.svg" alt="Python" title="Python" width="120" />
<img height="100" src="https://cdn.svgporn.com/logos/airflow-icon.svg" alt="Airflow" title="Airflow" hspace="20" />
<img height="100" src="https://cdn.svgporn.com/logos/tensorflow.svg" alt="TensorFlow" title="TensorFlow" hspace="20" />
<img height="100" src="https://cdn.svgporn.com/logos/docker-icon.svg" alt="Docker" title="Docker" hspace="20" />
</a>
</p>
<br>

# Introduction
In today's data-driven world, businesses constantly seek ways to better understand their customers, anticipate their needs, and tailor their products and services accordingly. One technique that has become a cornerstone of customer-centric strategy is customer segmentation: dividing a diverse customer base into distinct groups with shared characteristics, which allows organizations to target their marketing efforts, personalize customer experiences, and optimize resource allocation. Clustering, a fundamental method of unsupervised machine learning, plays a pivotal role in customer segmentation. It leverages the richness of customer data, including behaviors, preferences, and purchase history beyond geographic demographics, to recognize hidden patterns and group customers who exhibit similar traits or tendencies. Because many demographic attributes approximately follow a Gaussian distribution, a tendency observed in one individual is likely shared by others in the same cluster, and that cluster can serve as the foundation for tailored marketing campaigns, product recommendations, and service enhancements. By understanding the unique needs and behaviors of each segment, companies can deliver highly personalized experiences, ultimately fostering customer loyalty and driving revenue growth.
In this project on clustering for customer segmentation, we cover essential exploratory data analysis techniques and unsupervised learning methods such as K-means clustering, followed by cluster analysis to create targeted customer profiles. The goals of this project comprise data pipeline preparation, ML model training, ML model updates, exploring the extent of data and concept drift (if any), and demonstrating the CI/CD process. The project thus serves as a simulation of a real-world application in today's competitive business landscape. As future scope, we aim to apply these clustering algorithms to gain further insights into customer behavior and to build a recommendation system, for a lasting impact on customer satisfaction and business success.
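This PR only touches the data pipeline, but as context for the K-means step mentioned above, a minimal segmentation sketch might look as follows. The RFM-style features (recency, frequency, monetary value) and the cluster count are illustrative assumptions, not code from this repository:

```python
"""Minimal K-means customer-segmentation sketch (illustrative only)."""
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: one row per customer,
# columns = (recency_days, frequency, monetary_value)
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal([10, 50, 500], [5, 10, 100], size=(50, 3)),  # loyal, high spend
    rng.normal([90, 5, 40], [20, 2, 15], size=(50, 3)),     # lapsed, low spend
])

# Standardize so no single feature dominates the distance metric
scaled = StandardScaler().fit_transform(features)

# Fit K-means and obtain one cluster id per customer
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels.shape)  # one label per customer
```

Each resulting cluster would then be profiled (average spend, typical products, and so on) to drive targeted campaigns.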

# Dataset Information
This is a transnational data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.
## Data Card
- Size: 541,909 rows × 8 columns
- Data Types

| Variable Name |Role|Type|Description|
|:--------------|:---|:---|:----------|
|InvoiceNo |ID |Categorical |a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation |
|StockCode |ID |Categorical |a 5-digit integral number uniquely assigned to each distinct product |
|Description|Feature |Categorical |product name |
|Quantity |Feature |Integer |the quantities of each product (item) per transaction |
|InvoiceDate |Feature |Date |the day and time when each transaction was generated |
|UnitPrice |Feature |Continuous |product price per unit |
|CustomerID |Feature |Categorical |a 5-digit integral number uniquely assigned to each customer |
|Country |Feature |Categorical |the name of the country where each customer resides |
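The cancellation convention in `InvoiceNo` can be applied with a short pandas snippet. This is not part of the PR; the two sample rows are made up to match the data card schema:

```python
import pandas as pd

# A two-row sample mimicking the data card schema (values are illustrative)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "StockCode": ["85123A", "D"],
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", "Discount"],
    "Quantity": [6, -1],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 09:41"]),
    "UnitPrice": [2.55, 27.50],
    "CustomerID": ["17850", "14527"],
    "Country": ["United Kingdom", "United Kingdom"],
})

# Per the data card, an InvoiceNo starting with 'C' marks a cancellation
df["IsCancelled"] = df["InvoiceNo"].str.startswith("C")
print(df["IsCancelled"].tolist())  # → [False, True]
```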

## Data Sources
The data is taken from the [UCI repository](https://archive.ics.uci.edu/dataset/352/online+retail).

# Installation
This project uses `Python >= 3.10`. Please ensure that the correct version is installed on your machine. The project runs on Windows, Linux, and macOS.

The steps for user installation are as follows:

1. Clone the repository onto the local machine
2. Install the required dependencies
```shell
pip install -r requirements.txt
```

# GitHub Actions

Added GitHub Actions on push for all branches, including the feature and main branches. Pushing a new commit triggers a build that runs pytest and pylint and publishes the test reports as artefacts.
This workflow checks the test cases available under `test` against the corresponding code in `src`. Using `pylint`, it also runs formatting and lint checks, ensuring that the code is readable and well documented for future use.
Feature branches can be merged into main only after a successful build.
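The workflow file itself is not shown in this diff; as a rough sketch, a pytest-plus-pylint action of the kind described above could look like the following (the file name, action versions, and step layout are assumptions, not repository contents):

```yaml
# .github/workflows/pytest.yml (sketch; the actual workflow is not in this diff)
name: Pytest
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: pytest --pylint   # lint checks as pytest cases
      - run: pytest            # unit tests
```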

## Testing
Before pushing code to GitHub, run the following commands locally to ensure the build succeeds. Acting on the suggestions from `Pylint` improves code quality, and making sure the test cases pass under `Pytest` is essential for code reviews and maintaining code quality.

```shell
pytest --pylint
pytest
```
1 change: 1 addition & 0 deletions data/bad.zip
@@ -0,0 +1 @@
This is not a valid zip file
3 changes: 2 additions & 1 deletion requirements.txt
@@ -7,4 +7,5 @@ mlflow
requests
pytest-mock
pytest-pylint
openpyxl
openpyxl
requests-mock
43 changes: 6 additions & 37 deletions src/datapipeline.py
@@ -1,42 +1,11 @@
"""
Functions to ingest and process data
Modularized Data pipeline to form DAGs in the future
"""
import zipfile
import requests

def ingest_data():
    """
    Function to download file from URL
    """
    file_url = "https://archive.ics.uci.edu/static/public/352/online+retail.zip"

    # Send an HTTP GET request to the URL
    response = requests.get(file_url, timeout=30)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Save file to data
        with open("data/data.zip", "wb") as file:
            file.write(response.content)
        print("File downloaded successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")


def unzip_file():
    """
    Function to unzip the downloaded data
    """
    zip_filename = 'data/data.zip'
    extract_to = 'data/'
    try:
        with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"File {zip_filename} successfully unzipped to {extract_to}")
    except zipfile.BadZipFile:
        print(f"Failed to unzip {zip_filename}")
from download_data import ingest_data
from unzip_data import unzip_file


if __name__ == "__main__":
    ingest_data()
    unzip_file()
    ZIPFILE_PATH = ingest_data(
        "https://archive.ics.uci.edu/static/public/352/online+retail.zip")
    UNZIPPED_FILE = unzip_file(ZIPFILE_PATH, 'data')
38 changes: 38 additions & 0 deletions src/download_data.py
@@ -0,0 +1,38 @@
"""
Function to download and ingest the data file
"""
import os
import requests

DEFAULT_FILE_URL = "https://archive.ics.uci.edu/static/public/352/online+retail.zip"

# Set the root directory variable using a relative path
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))

def ingest_data(file_url=DEFAULT_FILE_URL):
    """
    Function to download file from URL
    Args:
        file_url: URL of the file. A default is used if not specified
    Returns:
        zipfile_path: The zipped file path to the data
    """
    # Send an HTTP GET request to the URL
    response = requests.get(file_url, timeout=30)

    # Path to store the zipfile
    zipfile_path = os.path.join(ROOT_DIR, 'data', 'data.zip')
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Save file to data
        with open(zipfile_path, "wb") as file:
            file.write(response.content)
        print(f"File downloaded successfully. Zip file available under {zipfile_path}")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")

    return zipfile_path


if __name__ == "__main__":
    ZIPFILE_PATH = ingest_data("https://archive.ics.uci.edu/static/public/352/online+retail.zip")

33 changes: 33 additions & 0 deletions src/unzip_data.py
@@ -0,0 +1,33 @@
"""
Function to unzip data and make it available
"""
import zipfile
import os

# Set the root directory variable using a relative path
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))

ZIP_FILENAME = os.path.join(ROOT_DIR, 'data','data.zip')
EXTRACT_TO = os.path.join(ROOT_DIR,'data')

def unzip_file(zip_filename=ZIP_FILENAME, extract_to=EXTRACT_TO):
    """
    Function to unzip the downloaded data
    Args:
        zip_filename: zipfile path, a default is used if not specified
        extract_to: Path where the unzipped and extracted data is available
    Returns:
        unzipped_file: filepath where the extracted data file is available
    """
    try:
        with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"File {zip_filename} successfully unzipped to {extract_to}")
    except zipfile.BadZipFile:
        print(f"Failed to unzip {zip_filename}")
    # Return the path of the unzipped file
    unzipped_file = os.path.join(extract_to, 'Online Retail.xlsx')
    return unzipped_file


if __name__ == "__main__":
    UNZIPPED_FILE = unzip_file(ZIP_FILENAME, EXTRACT_TO)
Binary file removed test/__pycache__/__init__.cpython-310.pyc
42 changes: 0 additions & 42 deletions test/test_datapipeline.py
@@ -1,45 +1,3 @@
"""
Tests for datapipeline functions
"""
from src import datapipeline

def test_ingest_data(mocker):
    """
    Test for ingest_data()
    """
    # arrange: mocked dependencies
    mock_print = mocker.MagicMock(name='print')
    mocker.patch('src.datapipeline.print', new=mock_print)

    # act: invoking the tested code
    datapipeline.ingest_data()

    # assert:
    assert 1 == mock_print.call_count


def test_unzip_file(mocker):
    """
    Test for unzip_file()
    """
    # arrange: mocked dependencies
    mock_zipfile = mocker.MagicMock(name='ZipFile')
    mocker.patch('src.datapipeline.zipfile.ZipFile', new=mock_zipfile)

    mock_print = mocker.MagicMock(name='print')
    mocker.patch('src.datapipeline.print', new=mock_print)

    mock_exception = mocker.MagicMock(name='Exception')
    mocker.patch('src.datapipeline.Exception', new=mock_exception)

    # act: invoking the tested code
    datapipeline.unzip_file()

    # assert:
    mock_exception.assert_not_called()
47 changes: 47 additions & 0 deletions test/test_download_data.py
@@ -0,0 +1,47 @@
"""
Tests for download_data.py
"""
import os
import requests
import requests_mock
from src import download_data

DEFAULT_FILE_URL = "https://archive.ics.uci.edu/static/public/352/online+retail.zip"

def test_ingest_data(mocker):
    """
    Test for checking print call
    """
    # arrange: mocked dependencies
    mock_print = mocker.MagicMock(name='print')
    mocker.patch('src.download_data.print', new=mock_print)
    # act: invoking the tested code
    download_data.ingest_data(DEFAULT_FILE_URL)
    # assert:
    assert 1 == mock_print.call_count


def test_ingest_data_successful_download():
    """
    Test for checking successful download of the file
    """
    # Create a session and attach the requests_mock adapter to it
    with requests.Session() as session:
        adapter = requests_mock.Adapter()
        session.mount('https://', adapter)

        # Set the root directory variable using a relative path
        root_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))

        # Path to store the zipfile
        zipfile_path = os.path.join(root_dir, 'data', 'data.zip')

        # Define the mock response
        adapter.register_uri('GET', DEFAULT_FILE_URL, text=zipfile_path)

        # Call the function that makes the HTTP request
        result = download_data.ingest_data(DEFAULT_FILE_URL)

        # Perform assertions
        assert result == zipfile_path
46 changes: 46 additions & 0 deletions test/test_unzip_data.py
@@ -0,0 +1,46 @@
"""
Function to test the unzip_data functions
"""
import os
from src import unzip_data

# Define constants or variables for testing
# Set the root directory variable using a relative path
ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))

ZIP_FILENAME = os.path.join(ROOT_DIR, 'data','data.zip')
EXTRACT_TO = os.path.join(ROOT_DIR,'data')
BAD_ZIP_FILENAME = os.path.join(ROOT_DIR, 'data', 'bad.zip')

# Test for successful unzipping
def test_unzip_file_successful():
    """
    Test for successful unzipping
    """
    # Call the function to unzip a valid file
    unzipped_file = unzip_data.unzip_file(ZIP_FILENAME, EXTRACT_TO)

    # Check if the function returned the expected unzipped file path
    assert unzipped_file == os.path.join(EXTRACT_TO, 'Online Retail.xlsx')

    # Check if the unzipped file exists
    assert os.path.isfile(unzipped_file)


# Test for handling a bad zip file
def test_unzip_file_bad_zip(tmp_path, capsys):
    """
    Test for handling a bad zip file
    """
    # Create a bad zip file
    with open(BAD_ZIP_FILENAME, "wb") as file:
        file.write(b"This is not a valid zip file")

    # Create a temporary directory for testing
    test_dir = tmp_path / "test_dir"
    test_dir.mkdir()
    # Call the function to unzip a bad zip file
    unzip_data.unzip_file(BAD_ZIP_FILENAME, test_dir)

    # Check if the function printed the appropriate error message
    captured = capsys.readouterr()
    assert "Failed to unzip" in captured.out