Let us try to apply the principles of churn to a use-case involving sensors and devices. In this post we will apply some machine learning principles to predict which devices are likely to fail in the future. The scenario is as follows:

Problem Description

A company is in the business of allocating car parking space to visitors. For that, it needs sensors installed at each parking space. These sensors are installed on the ground right under where the car is supposed to park. When a car comes over a sensor, it senses the presence of the car through a set of five proximity sensors that give readings as analog voltages. This set of five readings indicates whether a car is present above it. This data is transmitted periodically over the network to the server, where it is recorded. In the other direction, the servers can send control messages over the cellular network to the sensors to trigger some action, such as calibrating or rebooting the device.

[Figure: Sensor under the car. The sensor senses an object above it by sending ultrasonic waves.]

These sensors need to be really hardy. They can be driven upon by cars. They have to withstand direct sunlight when there is no vehicle over them. They also have to weather the rain, and the battery must last for a long time. Over time, though, the battery becomes weak, or the sensors themselves lose their ability to sense correctly, and then finally die.

The engineers operating these sensors have noticed that every time they send a signal to reboot the devices, some of the devices do not show up any more. This happens because the sensor goes through some stress during a power-cycle reboot and, depending on the network connection in that area, it can take up to 24 hours for a device to reconnect to the network and send its data. Thus sensors that die generally do so after a power-cycle reboot. If no data has been received from a sensor for 24 hours, that device is most likely dead.

We have a record of all sensors that were operational before and after the last reboot operation. A few of the devices stopped transmitting after that reboot event. The problem is to predict which devices are likely to die during the next reboot event.
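The 24-hour rule described above can be sketched in a few lines of Pandas. This is only an illustration: the device ids, the reading times and the reference time `now` are made-up values, not taken from the real data-set.

```python
import pandas as pd

# Hypothetical toy readings: device id and the time each reading arrived.
readings = pd.DataFrame({
    "mac": ["AA-01", "AA-01", "BB-02", "BB-02", "CC-03"],
    "time": pd.to_datetime([
        "2016-09-28 08:00", "2016-09-30 09:00",
        "2016-09-28 10:00", "2016-09-28 22:00",
        "2016-09-29 23:30",
    ]),
})

# Assumed reference point: the moment at which we check for silent devices.
now = pd.Timestamp("2016-09-30 12:00")
silence_limit = pd.Timedelta(hours=24)

# Last time each device was heard from.
last_seen = readings.groupby("mac")["time"].max()

# A device is flagged as likely dead if it has been silent for more than 24 hours.
likely_dead = last_seen[(now - last_seen) > silence_limit].index.tolist()
print(likely_dead)
```

Here only BB-02 is flagged, since its last reading is more than 24 hours older than the reference time.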

Analyzing the data set

Before we attempt to solve this problem, let us take a look at some sample data. Note that in real-life situations the data is going to be noisy, messy and sometimes inaccurate. As a data scientist, one has to make a judicious choice as to which parts of it should be used. This elimination process must be applied to columns as well as rows.

We will analyze the columns one by one and decide whether to keep or discard each of them.

  1. The first column appears to be just a row Id. Thus it has no contribution to the data-model; so we discard it.
  2. The second column (mac) is the device id. This is the unique Id of the device, so we keep it to identify which devices are likely to die during the next power-cycle reboot.
  3. The third column (mj) does not change, so it has no effect on the data-model; thus it is discarded.
  4. The fourth column (fw) is the same for all rows - a candidate for discarding.
  5. The fifth column (time) is important. It tells us when the device data was received - thus we can actually see which devices were alive and dead after the last power-cycle reboot.
  6. The sixth column (uamps) carries no information since it is the same for all rows; discard it.
  7. The seventh column (batt_v) appears to be the most important field of all. It is the battery voltage which has the highest impact on the life of a device stranded in the open space.
  8. The eighth field (cc) is a numeric measured value from the environment. We will keep it and observe if it has any impact on the life of the device.
  9. The ninth field (temp) is the temperature of the device at the time the reading was taken. We'll keep it.
  10. The tenth field (diag) may be discarded since it is zero all across.
  11. The eleventh field (mahrs_consumed) may also be discarded.
  12. The twelfth field (avg_lifetime_uamps) is not coming through correctly, so we discard it.
  13. The thirteenth field (missed_payloads) seems like an interesting field since it would have indicated the signal strength around that area, but unfortunately it is not coming through; so we discard it.
  14. The fourteenth field (total_time_sec) is zero all across; so discard it.
  15. The fifteenth field (l0) does have some significance since it is a sensor parameter. We'll keep it.
  16. The sixteenth field (l1) is also a sensor parameter with changing values. We'll keep that as well.
  17. The seventeenth field (prod_date) is the date the device was manufactured. We probably don't have much use for this at the moment, so it will be ignored.
  18. The eighteenth field (vreg) appears to be important since it has values that change; so we'll keep it.
  19. The nineteenth field (sleep) has a constant value of zero; discard it.
  20. The twentieth field (eco) may be kept even though it seems to have little variation.
  21. The twenty-first field (esr_samples) is interesting since it has four different numeric values embedded inside it. We believe these values can be separated (using ':' as a delimiter) and these can be used individually. Most likely these are the raw voltage readings from the sensors which will have a huge impact on the life of the product. We'll use all four values found in this field.
  22. The last field (esr_timing) is the setting on the device when the readings in column 21 were taken. This is constant all through, so we discard it for our analysis.
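Part of the column review above, namely spotting columns that never change, can be automated. The following is a minimal sketch on a made-up sample frame; the helper name `constant_columns` and the sample values are assumptions for illustration, not part of the original program.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Hypothetical sample with two constant columns (fw, sleep).
sample = pd.DataFrame({
    "mac": ["AA-01", "BB-02", "CC-03"],
    "fw": [7, 7, 7],
    "batt_v": [3.61, 3.42, 3.55],
    "sleep": [0, 0, 0],
})

print(constant_columns(sample))
```

Running the same helper on the real data-frame would give a first list of candidates for discarding, which can then be reviewed by hand.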

The engineering team has also notified us that the last reboot operation happened on September 29th, 2016.

Problem Approach

Before we start our analysis, you need to get the data-set. Get the GZIP file here and save it to a local folder on your computer. Open up a terminal and change directory to the folder where you downloaded the file. Then unzip the file using:

gunzip sensor_data_001.csv.gz

You will see the data in the file sensor_data_001.csv. You can readily observe that this is not a standard supervised learning problem. For any supervised learning problem we need an X-vector and a y-vector for the training set. We do not know yet how to get the output y-variable since it is not given in the problem directly. Thus for training purposes, the output variable must be derived in some way.

Determining the output y-vector

Look at the data-set carefully. There are 69,765 rows in this data-set. To determine which devices have died during the last reboot operation, one has to first sort this list by date. Fortunately, I have already sorted this data for you by date, so you should be able to see the records from the beginning in order. You will notice that this data-set contains records only for a few days. We need to partition this data-set into two parts.

  1. All records prior to Sep 29th, since that was the day when the last reboot operation happened. This will be our training set, which helps us build our prediction model.
  2. All records on or after Sep 29th, to determine which devices survived the reboot operation. This will be our dev set, on which we will apply our model to determine the "weak" devices that are likely to die after the next reboot operation.

Having done that, we may be able to determine which devices have died due to the reboot, and thus be able to re-create an additional column for the y-vector.

Problem Solution

Let us get started with the programming exercise. To solve this problem, we will use Python with Pandas and Scikit-Learn. Our goal is to write the entire solution in one file without multiple passes. Thus it will involve the use of Pandas filtering, the results of which will be held in memory.

Before you attempt it on your own machine, you need to ensure that you have the necessary libraries to run this program. Open up a terminal shell and run the following:

pip install scikit-learn
pip install pandas

After installing these necessary packages, we are ready to begin. I will describe the different stages of the task as different steps.

Step 1: Set up import statements

Let us first import a few necessary libraries:

from __future__ import division
import pandas as pd
import numpy as np
import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as prf
import warnings
warnings.filterwarnings("ignore")

Step 2: Read the CSV file and identify the column names.

We now need to open the data file and read the column names. Pandas has handy functions to do that.

print("Reading CSV data from current directory...")
all_sensor_data_df = pd.read_csv('sensor_data_001.csv')
col_names = all_sensor_data_df.columns.tolist()

Step 3: Build the data-frames

The next step is to build the data frames in Pandas. While doing that we need to do certain transformations since some fields that contain numbers or dates are appearing in the data file as strings.

  1. First, we need to convert the strings found in the column time into timestamps, stored in a new column time_stamp.
  2. Second, we need to split the field esr_samples into four separate values. These are the individual sensor voltage readings that we believe have a significant impact on the life of the sensor.

print("Building dataframes...")
all_sensor_data_df['time_stamp'] = pd.to_datetime(all_sensor_data_df['time'])
X1 = all_sensor_data_df['esr_samples'].str.split(pat=':', expand=True)
X1 = X1.rename(columns={0: 'esr0', 1: 'esr1', 2: 'esr2', 3: 'esr3'})
all_sensor_data_df = pd.concat([all_sensor_data_df, X1], axis=1)

Notice that we are creating four new columns in the data frame with names esr0, esr1, esr2 and esr3. Then we are concatenating the newly generated data-frame with the original data-frame making it wider.
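To see what the split produces, here is a small sketch on two made-up esr_samples strings. Note that str.split returns strings, so this sketch also converts the pieces to numbers with pd.to_numeric; that conversion is an addition for illustration and is not part of the program above.

```python
import pandas as pd

# Hypothetical esr_samples values in the same ':'-delimited layout.
toy = pd.DataFrame({"esr_samples": ["812:790:655:601", "800:781:640:590"]})

# Split each string into four columns, one per reading.
parts = toy["esr_samples"].str.split(pat=":", expand=True)
parts.columns = ["esr0", "esr1", "esr2", "esr3"]

# The pieces are still strings; convert them to numbers.
# Anything that cannot be parsed becomes NaN.
parts = parts.apply(pd.to_numeric, errors="coerce")

# Widen the original frame with the four new columns.
toy = pd.concat([toy, parts], axis=1)
print(toy)
print(toy.dtypes)
```

Working with numeric columns from the start avoids relying on the classifier to convert strings later.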

Step 4: Choosing the training set

Recall that we need to look at the rows prior to the last reboot operation on Sep 29 to determine which devices were alive then, and compare them with the devices found later. This helps us figure out which devices died during the reboot.

cutoff = datetime.datetime(2016, 9, 29)
still_running_sensors_df = all_sensor_data_df.loc[all_sensor_data_df['time_stamp'] >= cutoff]

all_sensor_macs = all_sensor_data_df.mac.unique()
alive_sensor_macs = still_running_sensors_df.mac.unique()

I created a Pandas filter to separate out the rows on or after the cut-off date. Then I used the Pandas unique() function on the full data-set to get all the device Ids that appear anywhere in the data. When I use the same unique() function on the rows on or after the cut-off date, I get all sensors that are still alive after the reboot.

running_before_reboot_df = all_sensor_data_df.loc[all_sensor_data_df['time_stamp'] < cutoff].copy()  # copy, so the new column is added to an independent frame
running_before_reboot_df['isalive'] = np.where(running_before_reboot_df['mac'].isin(alive_sensor_macs), 1, 0)  # alive = 1, dead = 0

My goal is to create another Y-column (isalive) which has values 0 or 1 - where 0 represents dead and 1 represents alive. I do so by using the NumPy np.where() function and adding that information to the data-frame containing rows prior to Sep 29. This will be my training set.
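The labelling logic is easier to follow on a tiny example. The sketch below uses three made-up devices; BB-02 is the only one with no record on or after the cut-off date, so it is the only one labelled dead.

```python
import datetime
import numpy as np
import pandas as pd

# Hypothetical readings for three devices around the reboot date.
toy = pd.DataFrame({
    "mac": ["AA-01", "BB-02", "AA-01", "CC-03", "AA-01", "CC-03"],
    "time_stamp": pd.to_datetime([
        "2016-09-27", "2016-09-27", "2016-09-28",
        "2016-09-28", "2016-09-30", "2016-10-01",
    ]),
})
cutoff = datetime.datetime(2016, 9, 29)

# Devices seen on or after the cut-off survived the reboot.
alive_macs = toy.loc[toy["time_stamp"] >= cutoff, "mac"].unique()

# Rows before the cut-off form the training set; label each row by
# whether its device shows up again later (alive = 1, dead = 0).
before = toy.loc[toy["time_stamp"] < cutoff].copy()
before["isalive"] = np.where(before["mac"].isin(alive_macs), 1, 0)
print(before)
```

Note that the label is assigned per row, so every pre-reboot reading of a device carries the same isalive value.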

For your reference here is the full set of columns in the original dataset:

# all_columns = ['Unnamed: 0', 'mac', 'mj', 'fw', 'time', 'uamps', 'batt_v', 'cc', 'temp', 'diag', 'mahrs_consumed', 'avg_lifetime_uamps',
#                'missed_payloads', 'total_time_sec', 'l0', 'l1', 'prod_date', 'vreg', 'sleep', 'eco', 'esr_samples', 'esr_timing']

Step 5: Setting up the training task

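Now it is time to set up the training task. I first create a list of columns to be used for the output vector, and another list for all the input variables - or features. Note that the feature list used here does not match the column review exactly: it includes missed_payloads, which was marked for discarding, and leaves out cc, temp, l0, l1 and vreg, which were marked to be kept.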

class_variables = ['isalive']
features = ['eco', 'batt_v', 'missed_payloads', 'esr0', 'esr1', 'esr2', 'esr3']

A lot of what follows next is boiler-plate code that you can use for any generic data-science problem. I have been using this boiler-plate as my starting point for any coding problem that I want to solve with Pandas and Scikit-Learn.

Note that we have all our training data in the data-frame named running_before_reboot_df. Typically, we would take a small sample out of this data-frame to verify the accuracy of the prediction. In this situation we could use the entire data-frame as our training set, since we have already separated out the validation set (the records on or after Sep 29). But for demonstration purposes, I am still going to create a test set from the data-frame to produce the confusion matrix and an F-score.

split_at = int(.8 * len(running_before_reboot_df))  # row index where the first 80% ends
train_df, test_df = running_before_reboot_df.iloc[:split_at], running_before_reboot_df.iloc[split_at:]

The code above creates two data-frames, one for training and another for testing. Because the rows are sorted by date, the first 80% of the rows are used for training and the remaining 20% for testing.

y_train = train_df[class_variables]
X_train = train_df[features].fillna(value=0)  # missing values become 0

y_test = test_df[class_variables]
X_test = test_df[features].fillna(value=0)

To avoid any problems with missing values, I am also filling with zeros all fields that do not have a value.

Step 6: Setting up the validation set for prediction

Similarly, one can set up the prediction set by taking the records on or after the cut-off date.

X_prediction_set = still_running_sensors_df[features].fillna(value=0)

Step 7: Training the data-model

Now it is time to train the model with the data-frames created so far. We choose the Random Forest classifier for this purpose. In my experience, ensemble learning often gives good results on problems like this, since it trains many decision trees on random samples of the data and combines their predictions instead of relying on a single model.
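To make the idea of combining trees concrete, the sketch below trains a small forest on synthetic data and checks that the forest's class probabilities are the average of the probabilities from its individual trees. The data comes from make_classification and has nothing to do with the sensor data-set; it is only there so that the example can run on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with 7 features (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)

# Class probabilities from each of the 20 trees, stacked into one array
# of shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities.
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X))
print(matches)
```

The predicted class is then the one with the highest averaged probability.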

print('Building RandomForest Classifier ...')
model = RandomForestClassifier(n_estimators=20, min_samples_leaf=1, max_depth=20, min_samples_split=2, random_state=0)
model.fit(X_train, y_train.values.ravel())
print('... built')

Step 8: Validating the model and finding accuracy

For data scientists most of the time is spent in validating and tuning the model. I am giving below the necessary code to print out the Confusion Matrix along with the Precision, Recall and F-score.

y_pred = model.predict(X_test)
# as_matrix() has been removed from recent Pandas versions; to_numpy() gives the same values
y_test_as_matrix = y_test.to_numpy().ravel()

print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

model_score = model.score(X_test, y_test_as_matrix)

print('Features: ' + str(features))
print('Feature importances: ', model.feature_importances_)
print('Model Score: %f' % model_score)

print("F1 Score with macro averaging:" + str(f1_score(y_test, y_pred, average='macro')))
print("F1 Score with micro averaging:" + str(f1_score(y_test, y_pred, average='micro')))
print("F1 Score with weighted averaging:" + str(f1_score(y_test, y_pred, average='weighted')))

print('Precision, Recall and FScore')
precision, recall, fscore, _ = prf(y_test, y_pred, pos_label=1, average='micro')
print('Precision: ' + str(precision))
print('Recall:' + str(recall))
print('FScore:' + str(fscore))

If you run this, you will see that the model score comes out at about 82% accuracy. This is not perfect, but it is perhaps the best that our data-set allows. Remember that we are dealing with noisy data, and trying to come up with a prediction based on that data-set without doing any further filtering.
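The reported scores can be recomputed by hand from the confusion matrix printed in the program output. The sketch below assumes Scikit-Learn's layout, where rows are the actual classes and columns are the predicted classes, in the order 0 (dead) and 1 (alive).

```python
import numpy as np

# Confusion matrix copied from the program output.
# Rows: actual class (0 = dead, 1 = alive); columns: predicted class.
cm = np.array([[2218, 164],
               [1956, 7501]])

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall (diagonal over row sums) and precision (diagonal over column sums).
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

# Per-class F1 and its unweighted mean (macro averaging).
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()

print("accuracy :", round(accuracy, 6))
print("recall   :", recall.round(4))
print("precision:", precision.round(4))
print("macro F1 :", round(macro_f1, 6))
```

The accuracy works out to 9719 / 11839, which is the 0.820931 model score. With micro averaging on a single-label problem, precision, recall and F-score all equal the accuracy, which is why those three lines of the output show the same number. The per-class figures also show that the model catches most dead devices (high recall for class 0) but mislabels many alive readings as dead (low precision for class 0).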

One of the interesting results coming from this analysis is finding out which parameters have the most impact on a sensor's death. For that, one has to look at the feature importance numbers. Looking at the numbers, one can see that esr3 has the most impact, followed by battery voltage, and then esr2 and esr0.
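The importances are printed in the same order as the feature list, which makes them hard to compare by eye. A short sketch that pairs each feature with its score, using the numbers copied from the program output, and sorts them from most to least important:

```python
# Feature names and importances copied from the program output.
features = ['eco', 'batt_v', 'missed_payloads', 'esr0', 'esr1', 'esr2', 'esr3']
importances = [0.0784046, 0.1871789, 0.09478354, 0.12096443,
               0.08418511, 0.13707296, 0.29741046]

# Pair each feature with its score and sort by score, highest first.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:16s} {score:.4f}")
```

In the real program the same ranking can be produced from `model.feature_importances_` instead of the hard-coded list.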

Step 9: Making a prediction based on this data-model

Now that we have created a data-model, let us try to predict how many devices are likely to die during the next reboot.

# We need to find those that are about to die, i.e. 0 value. Filter only those values that are 0
X_prediction_set.loc[:, 'predicted_life'] = model.predict(X_prediction_set)
final_prediction_df = pd.concat([still_running_sensors_df[['mac']], X_prediction_set], axis=1)

may_possibly_die_df = final_prediction_df.loc[final_prediction_df['predicted_life'] == 0]
sensors_that_may_die = may_possibly_die_df.mac.unique()

All I am doing here is using the predict() method provided by Scikit-learn to apply the model on the prediction data-frame. Out of the values that I get as my predicted value of isalive, I am only choosing those that have a value of zero i.e. likely to die.
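Because the prediction is made per row, the approach above flags a device as soon as any one of its readings is predicted as 0. The sketch below shows this on made-up predictions, next to a stricter variant that flags a device only when more than half of its readings are predicted as 0. The 0.5 threshold and the variant itself are assumptions for illustration, not part of the original program.

```python
import pandas as pd

# Hypothetical row-level predictions for three devices.
toy = pd.DataFrame({
    "mac": ["AA-01", "AA-01", "AA-01", "BB-02", "BB-02", "CC-03"],
    "predicted_life": [1, 1, 0, 0, 0, 1],
})

# Same rule as the program: any row predicted 0 flags the device.
flagged_any = toy.loc[toy["predicted_life"] == 0, "mac"].unique().tolist()

# Stricter variant: share of rows predicted 0 per device, flagged above 50%.
dead_share = (toy["predicted_life"] == 0).groupby(toy["mac"]).mean()
flagged_majority = dead_share[dead_share > 0.5].index.tolist()

print(flagged_any)
print(flagged_majority)
```

In this toy example AA-01 is flagged by the first rule because of a single reading, but not by the stricter one.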

Step 10: All done, print out the device Ids

Finally it is time to print out the device Ids that were identified to be weak and likely to switch off permanently during the next reboot.

print('All sensors in dataset:')
print(all_sensor_macs)

print('Sensors that were still alive after 9/28 reboot event:')
print(alive_sensor_macs)

print('Sensors that may possibly die in future:')
print(sensors_that_may_die)

Full output of the program

Given below is the full output of the program. You can see that it provides the list of device ids that are likely to die during the next reboot.

Reading CSV data from current directory...
Building dataframes...
Building RandomForest Classifier ...
... built
Confusion Matrix
[[2218  164]
 [1956 7501]]
Features: ['eco', 'batt_v', 'missed_payloads', 'esr0', 'esr1', 'esr2', 'esr3']
Feature importances:  [0.0784046  0.1871789  0.09478354 0.12096443 0.08418511 0.13707296
 0.29741046]
Model Score: 0.820931
F1 Score with macro averaging:0.7764073908388418
F1 Score with micro averaging:0.8209308218599543
F1 Score with weighted averaging:0.8360332235994649
Precision, Recall and FScore
Precision: 0.8209308218599544
Recall:0.8209308218599544
FScore:0.8209308218599543
All sensors in dataset:
['48-98-D3' '48-95-80' '48-9B-19' '48-97-92' '48-9A-65' '48-96-F9'
 '48-9D-84' '48-9B-4C' '48-95-B8' '48-9C-B0' '48-96-8F' '48-92-CB'
 '48-95-97' '48-A9-8D' '48-96-1C' '21-BE-36' '48-94-3C' '48-94-3D'
 '49-20-4E' '48-BF-FC' '48-92-42' '48-98-4F' '48-93-41' '22-02-9F'
 '48-97-9A' '48-BE-C1' '48-9C-BF' '48-92-D7' '48-C0-70' '48-9D-47'
 '22-06-96' '21-C5-A9' '48-C0-53' '48-95-A3' '48-96-5A' '48-BE-C2'
 '48-93-50' '22-02-75' '48-9A-92' '48-9C-0C' '22-02-9D' '48-9A-58'
 '21-C9-EE' '48-93-38' '48-96-0B' '48-92-A1' '48-BE-C9' '48-99-2F'
 '22-02-9E' '48-BE-B4' '48-92-AC' '48-96-4C' '48-94-3F' '48-97-D1'
 '48-98-59' '48-97-BA' '48-BE-63' '21-C5-AB' '48-9D-08' '21-BD-F9'
 '48-95-C2' '48-97-12' '21-C5-B0' '21-C4-35' '48-94-23' '48-9D-72'
 '21-C9-D9' '49-3F-F2' '48-9A-7E' '48-94-12' '21-C3-26' '22-02-C7'
 '21-C9-FF' '48-94-47' '48-9C-AF' '48-A9-A2' '48-92-EE' '21-C9-F0'
 '48-9B-9A' '48-94-08' '48-9B-50' '22-06-A7' '48-93-FA' '48-96-6F'
 '48-95-1E' '48-9A-E1' '48-9C-2A' '48-BE-E3' '22-02-A4' '48-98-69'
 '48-94-B1' '22-02-CA' '22-04-BD' '48-92-95' '48-9B-06' '22-02-86'
 '48-9B-1B' '22-02-9C' '21-C4-4C' '21-C2-46' '48-BE-7F']
Sensors that were still alive after 9/28 reboot event:
['22-02-86' '48-BE-63' '48-92-A1' '48-95-1E' '48-BE-C9' '48-9A-7E'
 '48-BF-FC' '21-C5-B0' '48-9D-47' '48-C0-53' '48-9D-84' '22-02-A4'
 '21-C4-4C' '48-95-A3' '21-C3-26' '22-02-C7' '48-BE-C1' '48-92-D7'
 '48-BE-7F' '48-9B-4C' '22-06-96' '22-02-9F' '21-C2-46' '22-02-CA'
 '48-9A-92' '22-02-9D' '22-06-A7' '49-20-4E' '21-BD-F9' '48-BE-B4'
 '48-9B-19' '48-A9-8D' '48-94-47' '48-BE-E3' '48-9C-B0' '22-04-BD'
 '21-C5-AB' '48-92-42' '21-C4-35' '48-97-D1' '21-C9-D9' '21-C9-FF'
 '48-9B-9A' '48-BE-C2' '48-94-12' '22-02-9C' '48-93-38' '48-94-3C'
 '22-02-9E' '48-9C-2A' '48-9A-65' '22-02-75' '48-C0-70' '21-C5-A9'
 '48-9D-08' '48-9D-72' '21-BE-36' '21-C9-EE' '21-C9-F0' '49-3F-F2'
 '48-95-C2' '48-A9-A2' '48-96-F9']
Sensors that may possibly die in future:
['48-92-A1' '48-95-1E' '48-9A-7E' '48-92-D7' '48-9B-4C' '48-9B-19'
 '48-9C-B0' '48-97-D1' '48-9B-9A' '48-94-3C' '48-9C-2A' '48-9A-92'
 '48-96-F9' '48-92-42' '48-9A-65' '48-94-12' '48-95-A3' '48-95-C2'
 '48-93-38' '48-94-47']

Conclusion

I hope you have enjoyed reading this blog post. You can get the entire code as a single file. A lot of the work depends on knowing the quirks of Pandas and handling data using the Pandas functions. Most people, including myself, think in SQL first and then attempt to find out the Pandas way of doing the same job. I need to keep Google handy for doing searches as I progress through the code. Since I have shown you a lot of the common use-cases, you can take hints from this code and proceed with your own work, using some of the boiler-plate code. If you like this article or find an error in the code, or have an alternate opinion about the approach, you are most welcome to leave a comment below.

 

Published in Data Science