- When the application first loads in your browser, a large amount of sensor data is loaded into it. To track its progress, navigate to the Data Prep tab of the application, where you will see a message stating "Loading data..." and the estimated time remaining before it is complete.
- This data loading step consists of (1) dropping sensors that have a large number of null values and (2) applying Principal Component Analysis (PCA) to reduce the dimensionality of the data, allowing our model to train faster.
- The resulting display, labeled Feature Choices, shows which sensors were dropped due to a large number of null values and which linear combinations of sensors had the highest variance according to PCA. As pictured below, `sensor_00`, `sensor_15`, `sensor_50`, and `sensor_51` are dropped, and the linear combinations of sensors, denoted with 'pc', are ordered from greatest to least variance. The model will be trained on the first linear combinations of sensors whose variance adds up to 95%, or the first 12 pc's, whichever comes first.
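The same selection logic can be sketched in a few lines of scikit-learn. The file name, null-value threshold, and scaling step below are illustrative assumptions, not the application's actual code:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sensor.csv")  # hypothetical path to the Kaggle sensor data

# (1) Drop sensors with a large share of null values (threshold is assumed).
null_share = df.filter(like="sensor_").isna().mean()
keep_cols = null_share[null_share < 0.3].index
df = df[keep_cols].dropna()

# (2) Reduce dimensionality with PCA, keeping components until 95% of the
# variance is explained (scikit-learn caps the component count automatically).
pca = PCA(n_components=0.95)
components = pca.fit_transform(StandardScaler().fit_transform(df))
print(f"Kept {pca.n_components_} principal components")  # e.g. 12 pc's
```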
- Once the data is fully loaded, the Start Data Prep button will be enabled. Click it to reshape the data, in the background, into a form that the model can ingest.
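The exact reshaping is handled by the application; as a rough idea, time-series models commonly ingest fixed-length windows of consecutive readings, along the lines of this hypothetical sketch (the window size of 30 is an assumption):

```python
import numpy as np

def make_windows(data: np.ndarray, window: int = 30) -> np.ndarray:
    """Stack consecutive rows into an array of shape (n_windows, window, n_features)."""
    return np.stack([data[i : i + window] for i in range(len(data) - window + 1)])

# e.g. 1000 readings of 12 principal components -> (971, 30, 12) training windows
windows = make_windows(np.random.rand(1000, 12))
```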
- While this processing is taking place, you will see the following progress bar:
- Once data preparation is complete, you will receive a message stating, "Data is prepared. Ready to train" and the Train Model tab will be enabled.
- Click on the Train Model tab. There are different choices for training parameters in the drop-down menu; however, for the purposes of this workshop we will proceed with the default values.
- Click on the Train Model button to start training the model.
- When training is complete, you will receive a message stating, "Training is finished. Click on Display Loss Graph Btn". Click on the Display Loss Graph button to observe the loss graph.
- In addition to the graph, you will also notice that the Test Model and Predict tabs are now enabled.
- Click on the Test Model tab.
- Next, click on the Test Model button. The resulting display will look something like:

  Do not be surprised if you get a slightly different result. This is due to the stochastic nature of training the model. Training always involves uncertainty and randomness; as a result, the model is always an approximation.
- Click on the Predict tab.
- Before you can make a prediction, you must first select a data source for the prediction data. There are two options:

  (a) Use a CSV file, which will be streamed one point at a time to simulate real-time generation (see the sketch after this list). The data in the CSV file is taken from the original Kaggle data source as test data.

  (b) Use a stream of synthetic data, produced with the help of Apache Kafka, that also simulates real-time data.

  Make sure to select one of the options before proceeding. If you attempt to click on the Start Prediction Graph before selecting a data source, you will get an error message:

  (a) If you choose the CSV radio button, a select box will list the available CSV file names. Simply select a CSV filename.

  (b) If you choose the Kafka radio button, enter the Group ID that you have been given.
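For reference, option (a)'s replay behaves roughly like the following sketch; the file name and one-second delay are assumptions for illustration:

```python
import time
import pandas as pd

# Replay a test CSV one row at a time to simulate real-time sensor readings.
for _, row in pd.read_csv("test_sensor_data.csv").iterrows():
    point = row.to_dict()
    # ...the application feeds 'point' to the prediction graph here...
    time.sleep(1.0)  # pretend one reading arrives per second
```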
- After you select a data source, click on the Start Prediction Graph button. If you chose the Kafka radio button, follow the additional instructions for streaming sensor data.
- If you have not already, follow the instructions for generating sensor data.
- In the same Jupyter notebook referenced in the above instructions, `12_generate_sensor_data.ipynb`, go down to the section titled Streaming our sensor data. Let's attach a fake timestamp to each instance of synthetic data, making it time-series data, by running the first four cells in this section.
![](/docs/images/streaming_sensor_data.png)
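Those cells do something along these lines; the column names, row count, and one-second frequency are placeholder assumptions, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

# Stand-in for the synthetic readings generated earlier in the notebook.
synthetic = pd.DataFrame(np.random.rand(100, 3), columns=["pc_0", "pc_1", "pc_2"])

# Attach an evenly spaced, fake timestamp to each reading so the set
# becomes time-series data.
synthetic["timestamp"] = pd.date_range("2024-01-01", periods=len(synthetic), freq="s")
```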
- Now that you've transformed your data into time-series data, define the Kafka cluster credentials by running the following cell:
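The credential cell looks roughly like this; every value below is a placeholder, so use the broker address, topic, and login details provided for the workshop:

```python
# Placeholder connection settings for the workshop's Kafka cluster.
kafka_config = {
    "bootstrap_servers": "my-kafka-broker:9092",  # placeholder address
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "<username>",          # provided for the workshop
    "sasl_plain_password": "<password>",
}
topic = "sensor-data"  # placeholder topic name
```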
- Finally, stream your data by running the remaining two cells, which (1) connect to the Kafka cluster using the credentials you defined in the previous step, (2) initialize a KafkaProducer object, and (3) stream your data to the sensor failure prediction model.
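As a sketch of those last two cells, assuming the kafka-python client and the `kafka_config`, `topic`, and `synthetic` objects from the earlier sketches (the notebook's actual cells may differ):

```python
import json
from kafka import KafkaProducer

# (1)-(2) Connect to the cluster and initialize a producer that serializes
# each record as JSON.
producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    **kafka_config,
)

# (3) Stream the time-series rows to the prediction model's topic.
for record in synthetic.to_dict(orient="records"):
    record["timestamp"] = str(record["timestamp"])  # make the timestamp JSON-safe
    producer.send(topic, value=record)

producer.flush()  # ensure everything is delivered before the cell finishes
```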