diff --git a/samples/img/step_1.png b/samples/img/step_1.png
new file mode 100644
index 0000000..9f469b6
Binary files /dev/null and b/samples/img/step_1.png differ
diff --git a/samples/img/step_2.png b/samples/img/step_2.png
new file mode 100644
index 0000000..22ab5de
Binary files /dev/null and b/samples/img/step_2.png differ
diff --git a/samples/img/step_3.1.png b/samples/img/step_3.1.png
new file mode 100644
index 0000000..9bb806b
Binary files /dev/null and b/samples/img/step_3.1.png differ
diff --git a/samples/img/step_3.png b/samples/img/step_3.png
new file mode 100644
index 0000000..4e06900
Binary files /dev/null and b/samples/img/step_3.png differ
diff --git a/samples/img/step_6.png b/samples/img/step_6.png
new file mode 100644
index 0000000..da6cebb
Binary files /dev/null and b/samples/img/step_6.png differ
diff --git a/samples/img/step_7.png b/samples/img/step_7.png
new file mode 100644
index 0000000..42b5dbe
Binary files /dev/null and b/samples/img/step_7.png differ
diff --git a/samples/img/step_8.png b/samples/img/step_8.png
new file mode 100644
index 0000000..ce8b5f7
Binary files /dev/null and b/samples/img/step_8.png differ
diff --git a/samples/onboarding_manual.ipynb b/samples/onboarding_manual.ipynb
new file mode 100644
index 0000000..c47b58e
--- /dev/null
+++ b/samples/onboarding_manual.ipynb
@@ -0,0 +1,1519 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Welcome to Synnax Lab! In this tutorial, I'll guide you through the entire process, from setting up your account to making your first submission. Let's go! 🚀"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Creating an Account"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 1: Navigate to Synnax Lab\n",
+ "\n",
+ "Let's go to [Synnax Lab](https://synnax.app/) and click on the **LOG IN** button.\n",
+ "\n",
+ "\n",
+ "![Synnax](./img/step_1.png)\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 2: Choosing Your Login Method\n",
+ "\n",
+ "Synnax allows you to log in with a wallet or an email. Click **LOG IN** in the top-right corner, choose either option, and finish the sign-up process. This registers you in the Synnax App, which is currently open to everyone and lets registered users view the credit intelligence for the companies Synnax features. To become a contributor, proceed to Step 3.\n",
+ "\n",
+ "![Login Options](img/step_2.png)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 3: Become a Contributor\n",
+ "\n",
+ "Apply for Synnax Lab membership as a Data Scientist. In the Synnax Lab section, click the EARN WITH SYNNAX LAB button.\n",
+ "\n",
+ "![Signup Options](img/step_3.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3.1 Fill out the form and click Join\n",
+ "\n",
+ "We give preference to applicants who agree to participate in a short introductory video call.\n",
+ "\n",
+ "![Signup Form](img/step_3.1.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 5: Get Application Approval\n",
+ "\n",
+ "Applications are processed in the order in which they are received. We approve 3 applications per week, after a short 15-minute introductory video call with each applicant. The typical wait time for interview scheduling is 1-2 weeks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 6: Create a New API Key\n",
+ "\n",
+ "In the **API Keys** section, click on **New API Key** to generate a new key.\n",
+ "\n",
+ "![API Keys](img/step_6.png)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 7: Name Your Key\n",
+ "\n",
+ "Give your API key a name, select each checkbox, and click **Create** to finalize the process.\n",
+ "\n",
+ "![Name and Create API Key](img/step_7.png)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Step 8: Copy Your API Key\n",
+ "\n",
+ "Once your API key is created, **copy** it immediately. This key will not be available after closing the modal.\n",
+ "\n",
+ "![Copy API Key](img/step_8.png)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "🎉 You have successfully created your account and obtained an API key for Synnax Lab.\n",
+ "\n",
+ "💪 Now, roll up your sleeves, get your hands dirty, and make some magic happen!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Data Science Meets Finance: Let's Get Coding!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Installing SDK"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make your journey from fetching data to submitting predictions as smooth as possible, Synnax Lab has created an SDK that's as effortless as a Sunday morning. Let's install it and get started!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install synnax_lab_sdk -q"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Importing necessary libraries\n",
+ "import re  # used later to clean macro column names\n",
+ "import pandas as pd\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from sklearn.multioutput import MultiOutputRegressor\n",
+ "from lightgbm import LGBMRegressor\n",
+ "\n",
+ "# Importing Synnax Lab SDK Client\n",
+ "from synnax_lab_sdk.client import SynnaxLabClient"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Fetching Datasets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The script below will create a synnax-data folder in the current working directory where all the downloaded datasets are stored.\n",
+ "The `files` object it returns is a dictionary mapping file names to their respective paths."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "synnax_lab_client = SynnaxLabClient(api_key=\"your_api_key\")\n",
+ "\n",
+ "files = synnax_lab_client.get_datasets()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'x_train_path': 'synnax-data/datasets/X_train.csv',\n",
+ " 'targets_train_path': 'synnax-data/datasets/targets_train.csv',\n",
+ " 'x_forward_looking_path': 'synnax-data/datasets/X_forward_looking.csv',\n",
+ " 'macro_train_path': 'synnax-data/datasets/macro_train.csv',\n",
+ " 'macro_forward_looking_path': 'synnax-data/datasets/macro_forward_looking.csv',\n",
+ " 'sample_submission_path': 'synnax-data/datasets/sample_submission.csv',\n",
+ " 'data_dictionary_path': 'synnax-data/datasets/data_dictionary.txt',\n",
+ " 'dataset_date': '2024-09-10'}"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "files"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Dataset Structure 📂\n",
+ "\n",
+ "```\n",
+ "📂 synnax-data\n",
+ "└── 📂 datasets\n",
+ "    ├── 📄 data_dictionary.txt\n",
+ "    ├── 📄 macro_forward_looking.csv\n",
+ "    ├── 📄 macro_train.csv\n",
+ "    ├── 📄 sample_submission.csv\n",
+ "    ├── 🎯 targets_train.csv\n",
+ "    ├── 🔮 X_forward_looking.csv\n",
+ "    └── 📊 X_train.csv\n",
+ "```\n",
+ "\n",
+ "### 📝 Description:\n",
+ "The datasets subdirectory includes everything you need:\n",
+ "\n",
+ "- **`X_train.csv`**: Your training data with financial features.\n",
+ "- **`targets_train.csv`**: The targets you're predicting in training.\n",
+ "- **`X_forward_looking.csv`**: The test data where youโll make your predictions.\n",
+ "- **`macro_train.csv`** & **`macro_forward_looking.csv`**: Macroeconomic data to enrich your model.\n",
+ "- **`sample_submission.csv`**: Shows you how to format your predictions for submission.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading Dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 150,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train = pd.read_csv(files['x_train_path']) # Training features\n",
+ "X_forward_looking = pd.read_csv(files['x_forward_looking_path']) # Test features\n",
+ "targets_train = pd.read_csv(files['targets_train_path']) # Training targets\n",
+ "# macro_train = pd.read_csv(files['macro_train_path']) # Historical macroeconomic data\n",
+ "# macro_forward_looking = pd.read_csv(files['macro_forward_looking_path']) # Future macroeconomic data\n",
+ "\n",
+ "# Clean column names in macroeconomic datasets by removing unsupported characters (mainly spaces)\n",
+ "# macro_train = macro_train.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))\n",
+ "# macro_forward_looking = macro_forward_looking.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's take a quick look at the first few rows of our training data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 106,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " companyId metadata_0 industry \\\n",
+ "0 company_21230 TH Pollution & Treatment Controls \n",
+ "1 company_352 US Biotechnology \n",
+ "2 company_3853 CN Airports & Air Services \n",
+ "3 company_20796 TH Resorts & Casinos \n",
+ "4 company_9742 US Electronic Components \n",
+ "\n",
+ " sector metadata_1 metadata_2 metadata_3 metadata_4 \\\n",
+ "0 Industrials 0.50005 NaN 0.50235 1448.31480 \n",
+ "1 Healthcare NaN 0.49990 4.77695 7216.34560 \n",
+ "2 Industrials 0.50015 0.50015 47.32115 88190.89465 \n",
+ "3 Consumer Cyclical 0.50005 NaN 3.15225 10077.52090 \n",
+ "4 Technology 0.50175 0.50120 6.88430 335875.75120 \n",
+ "\n",
+ " lastUpdatedAnnumEndDate lastUpdatedQuarterEndDate ... Y_0_feature_122 \\\n",
+ "0 2023-12-31 2024-06-30 ... 101.58625 \n",
+ "1 2023-12-31 2024-06-30 ... 10.95000 \n",
+ "2 2023-12-31 2024-06-30 ... 20768.33075 \n",
+ "3 2023-12-31 2024-06-30 ... 9796.72115 \n",
+ "4 2023-12-31 2024-06-30 ... 42953.50000 \n",
+ "\n",
+ " Y_0_feature_95 Y_0_feature_40 Y_0_feature_56 Y_0_feature_54 \\\n",
+ "0 1038.36685 655.42900 2.80030 89.59040 \n",
+ "1 5813.10000 1794.73560 0.50000 522.10000 \n",
+ "2 78092.83165 14408.75425 50.11065 17663.72665 \n",
+ "3 23419.37500 5230.67575 0.50000 1747.96490 \n",
+ "4 124008.95000 1246.13905 10099.70000 13353.05000 \n",
+ "\n",
+ " Y_0_feature_101 Y_0_feature_99 Y_0_feature_124 Y_0_feature_128 \\\n",
+ "0 0.50000 0.5000 265.91925 1398.63610 \n",
+ "1 0.50000 0.5000 16951.75000 5813.10000 \n",
+ "2 -375.16215 0.5000 19524.86935 106229.49400 \n",
+ "3 -41.68705 2018.1635 4246.88795 43695.68365 \n",
+ "4 -4234.30000 701.5000 50616.75000 167605.70000 \n",
+ "\n",
+ " Y_0_feature_43 \n",
+ "0 52.52565 \n",
+ "1 -11138.10000 \n",
+ "2 36702.64605 \n",
+ "3 -4591.63235 \n",
+ "4 89133.60000 \n",
+ "\n",
+ "[5 rows x 1147 columns]"
+ ]
+ },
+ "execution_count": 106,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "X_train.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Checking Categorical Columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 151,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index(['companyId', 'metadata_0', 'industry', 'sector',\n",
+ " 'lastUpdatedAnnumEndDate', 'lastUpdatedQuarterEndDate'],\n",
+ " dtype='object')"
+ ]
+ },
+ "execution_count": 151,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cat_cols = X_train.select_dtypes(include='object').columns\n",
+ "cat_cols"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Don't miss the two columns containing dates! 📅"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 153,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "metadata_0 79\n",
+ "industry 143\n",
+ "sector 11\n",
+ "lastUpdatedAnnumEndDate 19\n",
+ "lastUpdatedQuarterEndDate 5\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 153,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Check the number of unique values in the categorical columns\n",
+ "X_train[cat_cols[1:]].nunique()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Processing Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 154,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Combine training and test data for consistent encoding\n",
+ "data = pd.concat([X_train, X_forward_looking], axis=0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Extract features from dates"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 155,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Extract features out of datetime cols\n",
+ "datetime_cols = ['lastUpdatedAnnumEndDate', 'lastUpdatedQuarterEndDate']\n",
+ "\n",
+ "for col in datetime_cols:\n",
+ " data[col] = pd.to_datetime(data[col])\n",
+ " data[f'{col}_day'] = data[col].dt.day\n",
+ " data[f'{col}_day_of_week'] = data[col].dt.day_of_week\n",
+ " data[f'{col}_day_of_year'] = data[col].dt.day_of_year\n",
+ " data[f'{col}_month'] = data[col].dt.month\n",
+ " data[f'{col}_is_month_start'] = data[col].dt.is_month_start\n",
+ " data[f'{col}_is_month_end'] = data[col].dt.is_month_end\n",
+ " data[f'{col}_quarter'] = data[col].dt.quarter\n",
+ " data[f'{col}_is_quarter_start'] = data[col].dt.is_quarter_start\n",
+ " data[f'{col}_is_quarter_end'] = data[col].dt.is_quarter_end\n",
+ " data[f'{col}_year'] = data[col].dt.year\n",
+ "\n",
+ "data.drop(datetime_cols, axis=1, inplace=True)\n",
+ "\n",
+ "cat_cols = [col for col in cat_cols if col not in datetime_cols + ['companyId']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Encode categorical variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 156,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Encode categorical columns using LabelEncoder\n",
+ "for col in cat_cols:\n",
+ " data[col] = LabelEncoder().fit_transform(data[col])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are multiple ways to encode categorical variables, and the choice should depend on the model you will be using and the nature of the data.\n",
+ "\n",
+ "Tree-based models can take advantage of almost any kind of encoding, while for linear models `LabelEncoder` might not be a viable option because it imposes an arbitrary order on the categories.\n",
+ "\n",
+ "Try implementing:\n",
+ "- one-hot-encoding\n",
+ "- label-encoding\n",
+ "- frequency-encoding\n",
+ "- mean-target-encoding\n",
+ "\n",
+ "to find the best option for your model."
+ ]
+ },
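+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick illustration of one option from that list, here is a minimal frequency-encoding sketch on a toy column (the data below is made up for illustration). Frequency encoding keeps a single numeric column per feature, which is convenient when, as here, the dataset already has 1000+ columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Toy frequency-encoding example (illustrative data, not from the Synnax datasets)\n",
+ "toy = pd.DataFrame({'sector': ['Tech', 'Energy', 'Tech', 'Tech', 'Energy', 'Utilities']})\n",
+ "\n",
+ "# Map each category to its relative frequency in the data\n",
+ "freq = toy['sector'].value_counts(normalize=True)\n",
+ "toy['sector_freq'] = toy['sector'].map(freq)\n",
+ "toy"
+ ]
+ },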
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Handling Missing Values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Define Function"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 157,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def fill_missing_with_mean(df):\n",
+ "    \"\"\"\n",
+ "    Fill missing values in each numeric column with that column's mean.\n",
+ "    \"\"\"\n",
+ "    for col in df.select_dtypes(include='number').columns:\n",
+ "        if df[col].isnull().any():\n",
+ "            df[col] = df[col].fillna(df[col].mean())\n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Apply Function to Datasets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 158,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data = fill_missing_with_mean(data)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 159,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "metadata_8\n",
+ "city\n"
+ ]
+ }
+ ],
+ "source": [
+ "# drop columns with all missing (sometimes that happens)\n",
+ "for col in list(data.columns):\n",
+ " if data[col].isnull().all():\n",
+ " print(col)\n",
+ " data.drop(col, axis=1, inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Missing-value imputation is a tricky process. While replacing NaNs with the column mean technically makes your dataset usable for modeling, it might populate your *ground truth* training data with a lot of faulty values. In many cases the column mean will not be the right choice.\n",
+ "\n",
+ "Consider:\n",
+ "- median\n",
+ "- mode\n",
+ "- KNN imputer\n",
+ "- verstack.NaNImputer\n",
+ "- etc."
+ ]
+ },
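+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For instance, a median-based variant of the helper above; the median is more robust to outliers than the mean. The `fill_missing_with_median` name and the toy frame are just for illustration:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def fill_missing_with_median(df):\n",
+ "    # Fill NaNs in each numeric column with that column's median\n",
+ "    for col in df.select_dtypes(include='number').columns:\n",
+ "        if df[col].isnull().any():\n",
+ "            df[col] = df[col].fillna(df[col].median())\n",
+ "    return df\n",
+ "\n",
+ "demo = pd.DataFrame({'a': [1.0, None, 2.0, 100.0]})\n",
+ "demo = fill_missing_with_median(demo)  # the NaN becomes 2.0 (median), not ~34.3 (mean)"
+ ]
+ },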
+ {
+ "cell_type": "code",
+ "execution_count": 160,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Split data back into training and test sets\n",
+ "n_train = X_train.shape[0]\n",
+ "X_train = data[:n_train]\n",
+ "X_forward_looking = data[n_train:]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Dropping `companyId`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 161,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train = X_train.drop('companyId', axis=1)\n",
+ "X_forward_looking = X_forward_looking.drop('companyId', axis=1)\n",
+ "targets_train = targets_train.drop('companyId', axis=1)\n",
+ "# macro_train = macro_train.drop('companyId', axis=1)\n",
+ "# macro_forward_looking = macro_forward_looking.drop('companyId', axis=1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Training"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Example Hyperparameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 162,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# define starter parameters for LGBMRegressor\n",
+ "params = {\n",
+ " 'learning_rate': 0.01,\n",
+ " 'num_leaves': 250,\n",
+ " 'feature_fraction': 0.5,\n",
+ " 'bagging_fraction': 0.9,\n",
+ " 'verbosity': -1,\n",
+ " 'random_state': 42,\n",
+ " 'device_type': 'cpu',\n",
+ " 'objective': 'regression',\n",
+ " 'metric': 'l2',\n",
+ " 'num_threads': 10,\n",
+ " 'lambda_l1': 0.5,\n",
+ " 'n_estimators': 100\n",
+ " }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Train the Model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will use the same type of model with fixed parameters to predict each of the 17 targets. Rather than writing code to train 17 independent models, `sklearn.multioutput.MultiOutputRegressor` trains them for us in a convenient sklearn-style one-liner."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "But remember: each target, even though it represents financial indicators from the same company and time period, may have different dependencies and may even require a different type of model. So try different strategies to improve your score."
+ ]
+ },
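+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make that concrete, here is a sketch on synthetic data, with `Ridge` as an arbitrary stand-in estimator, of training one independent model per target instead of a single `MultiOutputRegressor`; in a real run each slot could hold a different estimator tuned to its target:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from sklearn.linear_model import Ridge\n",
+ "\n",
+ "rng = np.random.default_rng(42)\n",
+ "X_demo = rng.normal(size=(100, 5))  # stand-in for X_train\n",
+ "Y_demo = rng.normal(size=(100, 3))  # stand-in for three target columns\n",
+ "\n",
+ "# One independent model per target column; each entry could be a different estimator\n",
+ "per_target_models = [Ridge().fit(X_demo, Y_demo[:, i]) for i in range(Y_demo.shape[1])]\n",
+ "per_target_preds = np.column_stack([m.predict(X_demo) for m in per_target_models])\n",
+ "per_target_preds.shape"
+ ]
+ },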
+ {
+ "cell_type": "code",
+ "execution_count": 163,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "MultiOutputRegressor(estimator=LGBMRegressor(bagging_fraction=0.9,\n",
+ " device_type='cpu',\n",
+ " feature_fraction=0.5,\n",
+ " lambda_l1=0.5, learning_rate=0.01,\n",
+ " metric='l2', num_leaves=250,\n",
+ " num_threads=10,\n",
+ " objective='regression',\n",
+ " random_state=42, verbosity=-1))"
+ ]
+ },
+ "execution_count": 163,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Initialize the model with parameters\n",
+ "regressor = MultiOutputRegressor(LGBMRegressor(**params))\n",
+ "# from sklearn.linear_model import Ridge\n",
+ "# regressor = MultiOutputRegressor(Ridge())\n",
+ "\n",
+ "# Fit the model on training data\n",
+ "regressor.fit(X_train, targets_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Make Predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 164,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "predictions = regressor.predict(X_forward_looking)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Submitting Predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Load the Sample Submission File"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 165,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sample_submission = pd.read_csv(files['sample_submission_path'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Update the Submission File with Predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 168,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for col in sample_submission.columns[1:]:\n",
+ " sample_submission[col] = sample_submission[col].astype(float)\n",
+ "\n",
+ "sample_submission.iloc[:, 1:] = predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Save Submission File"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 169,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# Create the submissions folder if needed, then save the updated submission file\n",
+ "os.makedirs('synnax-submissions', exist_ok=True)\n",
+ "submission_path = 'synnax-submissions/submission.csv'\n",
+ "sample_submission.to_csv(submission_path, index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Submit the Predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "synnax_lab_client.submit_predictions(files[\"dataset_date\"], submission_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Check Confidence Score (Validation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Scores are calculated on the synnax-lab-sdk backend with a small lag, so if the function below does not return a `confidenceScore` right away, give it a few seconds and rerun. The `confidenceScore` appears once the submission reaches `status: 'Processed'`."
+ ]
+ },
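+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One way to handle that lag is a small polling helper. `wait_for_score` is a hypothetical name built only on the SDK's `get_past_submissions` call, assuming (as in the sample output) that the newest submission comes first:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "def wait_for_score(client, attempts=10, delay=5):\n",
+ "    # Re-check the most recent submission until it is processed\n",
+ "    for _ in range(attempts):\n",
+ "        latest = client.get_past_submissions()[0]\n",
+ "        if latest.get('status') == 'Processed':\n",
+ "            return latest.get('confidenceScore')\n",
+ "        time.sleep(delay)\n",
+ "    return None\n",
+ "\n",
+ "# score = wait_for_score(synnax_lab_client)"
+ ]
+ },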
+ {
+ "cell_type": "code",
+ "execution_count": 147,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'id': '4a809ed2-a34b-409a-895b-85d913817a9d',\n",
+ " 'datasetDate': '2024-09-10',\n",
+ " 'originalFilename': 'submission.csv',\n",
+ " 'status': 'Processed',\n",
+ " 'confidenceScore': -1030.0046503478281,\n",
+ " 'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',\n",
+ " 'uploadedAt': '2024-09-10T11:17:30.540Z'},\n",
+ " {'id': 'ba10218d-4293-49de-9c56-36d7eb0c0657',\n",
+ " 'datasetDate': '2024-09-10',\n",
+ " 'originalFilename': 'submission.csv',\n",
+ " 'status': 'Processed',\n",
+ " 'confidenceScore': -1030.0046503478281,\n",
+ " 'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',\n",
+ " 'uploadedAt': '2024-09-10T11:16:48.016Z'},\n",
+ " {'id': '4140f050-eb0e-4e25-bf99-d787a26fa367',\n",
+ " 'datasetDate': '2024-09-10',\n",
+ " 'originalFilename': 'submission.csv',\n",
+ " 'status': 'Processed',\n",
+ " 'confidenceScore': 0.12233956654794893,\n",
+ " 'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',\n",
+ " 'uploadedAt': '2024-09-10T10:47:02.448Z'},\n",
+ " {'id': 'dbfa0841-31bc-4580-bb4c-c647d1e9f628',\n",
+ " 'datasetDate': '2024-09-10',\n",
+ " 'originalFilename': 'submission.csv',\n",
+ " 'status': 'Processed',\n",
+ " 'confidenceScore': 0.12233956654794893,\n",
+ " 'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',\n",
+ " 'uploadedAt': '2024-09-10T10:36:01.295Z'},\n",
+ " {'id': 'e88376dd-1f78-4fc9-9a88-d471fe1d0d3d',\n",
+ " 'datasetDate': '2024-09-05',\n",
+ " 'originalFilename': 'submission.csv',\n",
+ " 'status': 'Processed',\n",
+ " 'confidenceScore': 0.45768741331569546,\n",
+ " 'ownerId': 'User-aa71d80a-6e0a-417b-a869-e9b2fa671fdc',\n",
+ " 'uploadedAt': '2024-09-05T10:35:00.152Z'}]"
+ ]
+ },
+ "execution_count": 147,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "synnax_lab_client.get_past_submissions()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Congratulations! 🥳 You've Made Your First Submission!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This was a basic pipeline. Now, it's time to level up! 🚀 Use the macroeconomic data to make your model more robust. Experiment with different models, tweak them, and maybe even try some neural networks. 🧠🔥\n",
+ "\n",
+ "Data science is like cooking: there are endless recipes to try. So, spice things up, preprocess like a pro, and get those scores soaring! 🌶️👨‍🍳👩‍🍳\n",
+ "\n",
+ "Good luck, and may the data be ever in your favor! 🍀📊"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# P.S."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The pipeline above is a simple example of how to get started with synnax-lab-sdk and arrive at your first submission.\n",
+ "\n",
+ "To improve your scores, look into:\n",
+ "1. Macroeconomic data\n",
+ "2. Other categorical-variable encoding options (try individual mean-target encoding for each target)\n",
+ "3. Dealing with outliers\n",
+ "4. More advanced missing-value imputation options\n",
+ "5. Different models, individual models for each target, hyperparameter tuning\n",
+ "6. Model ensembling (if using different models, make sure each model gets appropriate preprocessing)\n",
+ "7. Feature selection: not all features in X_train and macro_train are useful; try removing some of them.\n",
+ "8. And of course your creativity. We're confident that your individual approach can beat anything we have laid out in this short tutorial."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}