Blog articles pertaining to Data Science, Machine Learning, Hardware, Software and Personal musings

In the first and second part of this series I described how to set up the hardware to read data from the OBD port using a Raspberry Pi and then upload it to the cloud. You should read the first part and second part of this series before you read this article.

Having done all the work to capture and transport all the data to the cloud, let us figure out what can be done on the cloud to introduce Machine Learning. To understand the concepts given in this article you will need to be familiar with Javascript and Python. Also, I am using MongoDB as my database - so you will need to know the basics of a document-oriented database to follow the example code here. MongoDB is not only my storage engine here, but also my compute engine. By that, I mean that I am using the database's scripting language (Javascript) to pre-process data that will ultimately be used for machine learning. The Javascript code given herein executes (in a distributed manner) inside the MongoDB database. (Some people get confused when they see Javascript, assuming that it requires a server like NodeJS to run - not here.)

Introduction to the Solution

To set the introduction, I will describe the following three tasks in this article:

  1. Read the raw records and augment it with additional derived information to generate some extra features used for machine learning that is not directly sent by the Raspberry Pi. Derivatives based on time interval between readings can be used to derive instanteous velocity, angular velocity and incline. These are inserted into the derived records when we save it to another MongoDB collection (also known as 'table' in relational database parlance) to be used later for generating the feature sets. I will be calling this collection 'mldataset' in my database.
  2. Read the 'mldataset' and extract features from the records. The feature set is saved into another collection called 'vehicle_signature_records'. This is an involved process since there are so many fields found in the raw records. In my case, the feature sets are basically three statistical averages (minimum, average and maximum) of all values aggregated over a 15 second period. The other research papers on this subject take the same approach, but the time interval over which the aggregates are taken vary based on the frequency of the readings. The recommended frequency is 5 Hz i.e. 1 record-set per 0.2 second. But as I mentioned in article 2 of this series, we are unable to read data that fast on a serial connection over ELM 327. The maximum speed that I have been able to observe (mostly in modern cars) is 1 record-set in 0.7 seconds. Thus a 15 second aggregation makes more sense in our scenario. Due to this, the accuracy of the prediction may be affected - but we will accept that as a constraint. The solution methodology remains the same though.
  3. Apply a learning algorithm on the feature-set to learn the driver behavior. Then the model needs to be deployed on a machine in the cloud. In real-time we need to calculate the same aggregates over the same time interval (15 seconds) and feed it into the model to come up with a prediction. To confirm the driver we will need to take readings over several intervals (5 minutes will give 20 predictions) and then use the value with the maximum count (i.e. modal value).

Augmenting raw data with derivatives

This is a very common scenario in IoT applications. When generated data comes from a human being, it always has useful information at the surface. All you need to do is scan the logs and extract it. An example of this is finding the interests of the user based on user-clicks in a shopping cart scenario - all the items that the user has seen on the web-site are directly found in the logs. However an IoT device is dumb, has no emotion, has no special interests. All data coming from an IoT device is the same consistent boring stream. So where do you dig to find useful information? The answer to this question is in the time-derivatives. The variation in values from one reading and the other provides useful insight into the behavior. Examples of these are velocity (derivative of displacement found from GPS readings), acceleration (derivative of velocity) and jerk (derivative of acceleration). So you see, augmenting raw data to put this extra information is extremely useful for IoT applications.

In this step I am going to write some Javascript code (that runs inside the MongoDB database) to augment raw data with derivatives for each record. You will find all this code in the file 'extract_driver_features_from_car_readings.js' which is located inside the 'machinelearning' folder. If you are wondering where to find the code, it is in Github at this location

Processing raw records

Before diving into the code, let me clarify a few things. The code is written to run as a cronjob on a machine on the same network as the MongoDB database - so that it is accessible. Since it runs as a cron task, we need to know how many records to process from the raw data table. Thus we need to do some book-keeping on the database. We have a special collection called 'book_keeping' for this purpose where we store some book-keeping information. One of them is the last date till when we have processed the records. The data (in JSON format) may look like this:

  1. {
  2. "_id" : "processed_until",
  3. "lastEndTime" : ISODate("2017-12-08T23:56:56.724+0000")
  4. }

To determine where we need to pick up the record processing from, here is one way to do this in a MongoDB script written in Javascript.

  1. // Look up the book-keeping table to figure out the time from where we need to pick up
  2. var processedUntil = db.getCollection('book_keeping').findOne( { _id: "processed_until" } );
  3. endTime = new Date(); // Set the end time to the current time
  4. if (processedUntil != null) {
  5. startTime = processedUntil.lastEndTime;
  6. } else {
  7. db.book_keeping.insert( { _id: "processed_until", lastEndTime: endTime } );
  8. startTime = new Date(endTime.getTime() - (365*86400000)); // Go back 365 days
  9. }

The 'else' part of the logic above is for the initialization phase when we run it for the first time - we just want to pick up all records for the past year.

Keeping track of driver vehicle combination

Another book-keeping task is to keep track of the driver-vehicle combinations. To make the job easier for the machine learning algorithm, this should be converted to indices. Those indices are maintained in the database in another book-keeping collection called 'driver_vehicles'. This collection looks somewhat like this:

  1. {
  2. "_id" : "driver_vehicles",
  3. "drivers" : {
  4. "gmc-denali-2015_jonathan" : 0.0,
  5. "gmc-denali-2015_mindy" : 1.0,
  6. "gmc-denali-2015_charles" : 2.0,
  7. "gmc-denali-2015_chris" : 3.0,
  8. "gmc-denali-2015_elise" : 4.0,
  9. "gmc-denali-2015_thomas" : 5.0,
  10. "toyota-camry-2009_alice" : 6.0,
  11. "gmc-denali-2015_andrew" : 7.0,
  12. "toyota-highlander-2005_arka" : 8.0,
  13. "subaru-outback-2015_john" : 9.0,
  14. "gmc-denali-2015_grant" : 10.0,
  15. "gmc-denali-2015_avni" : 11.0,
  16. "toyota-highlander-2005_anupam" : 12.0,
  17. }
  18. }

These are the names of vehicles with their drivers. Each combination has been assigned a number against it. When a driver-vehicle combination is encountered, the program looks to see if that combination already exists or not. If not, then it adds a new combination. Here is the code to do it.

  1. // Another book-keeping task is to read the driver-vehicle hash-table from the database
  2. // Look up the book-keeping table to figure out the previous driver-vehicle codes (we have
  3. // numbers representing the combination of drivers and vehicles).
  4. var driverVehicles = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles" } );
  5. var drivers;
  6. if (driverVehicles != null)
  7. drivers = driverVehicles.drivers;
  8. else
  9. drivers = {};
  11. var maxDriverVehicleId = 0;
  12. for (var key in drivers) {
  13. if (drivers.hasOwnProperty(key)) {
  14. maxDriverVehicleId = Math.max(maxDriverVehicleId, drivers[key]);
  15. }
  16. }

You can see that a 'find' call to the MongoDb database is being made to read the hash-table in memory.

The next task is to query the raw collection to find out which records are new since the last time it ran.

NOTE:  In the code segments below the dollar symbol will show up as '@'. Please make the appropriate substitutions when you read it. The correct code may be found in the github repository.

  1. // Now do a query of the database to find out what records are new since we ran it last
  2. var allNewCarDrivers = db.getCollection('car_readings').aggregate([
  3. {
  4. "@match": {
  5. "timestamp" : { @gt: startTimeStr, @lte: endTimeStr }
  6. }
  7. },
  8. {
  9. "@unwind": "@data"
  10. },
  11. {
  12. "@group": { _id: "@data.username" }
  13. }
  14. ]);

SQL vs. Document-oriented database

The next part is the crux of the actual process happening in this script. To understand how this works you need to be familiar with the MongoDB aggregation framework. A task like this will take way too much code if you start writing SQL. Most relational databases and also Spark offer SQL as a way to process and aggregate their data. The most common reason I have heard from managers to take that approach is - "it is easy". That works, but it is too verbose. That is why I personally prefer to use the aggregation framework of MongoDB to do my pre-processing since I can operate much faster than the other tools out there. It may not be "easy" as per the common belief, but a bit more effort in studying the aggregation framework pays off - saving a lot of development effort.

What about execution time? These scripts execute on the database nodes - inside the database. Thus you cannot make it any faster - since most of the time spent in dealing with large data is in transporting the data from the storage nodes to the execution nodes. In the case of the aggregation frameworks, you are getting all benefits of BigData for free. You are actually using in-database analytics here for the fastest execution time.

  1. // Look at all the records returned and process each driver one-by-one
  2. // The following query is a pipeline with the following steps:
  3. allNewCarDrivers.forEach(function(driverId) {
  4. var driverName = driverId._id;
  5. print("Processing driver: " + driverName);
  6. var allNewCarReadings = db.getCollection('car_readings').aggregate([
  7. {
  8. "@match": { // 1. Match all records that fall within the time range we have decided to use. Note that this is being
  9. // done on a live database - which means that new data is coming in while we are trying to analyze it.
  10. // Thus we have to pin both the starting time and the ending time. Pinning the endtime to the starting time
  11. // of the application ensures that we will be accurately picking up only the NEW records when the program
  12. // runs again the next time.
  13. "timestamp" : { @gt: startTimeStr, @lte: endTimeStr }
  14. }
  15. },
  16. {
  17. @project: { // We only need to consider a few fields for our analysis. This eliminates the summaries from our analysis.
  18. "timestamp": 1,
  19. "data" : 1,
  20. "account": 1,
  21. "_id": 0
  22. }
  23. },
  24. {
  25. @unwind: "@data" // Flatten out all records into one gigantic array of records
  26. },
  27. {
  28. @match: {
  29. "data.username": driverName // Only consider the records for this specific driver, ignoring all others.
  30. }
  31. },
  32. {
  33. @sort: {
  34. "data.eventtime": 1 // Finally sort the data based on eventtime in ascending order
  35. }
  36. }
  37. ]);

This nifty script above does a lot of things. The first thing to note in this script is that we are operating on a live database that has a constant stream of data coming in. Thus in order to select some records for processing we need to decide the time range first and only select those that fall within that time range. The next time we run this script, the records that could not be picked up this time, will be gathered and processed. This is all being done within the 'match' clause.

The second clause is the 'project' clause - which only selects the four required fields for the next stage of the pipeline. The 'unwind' clause flattens all arrays. The next 'match' clause select the driver name and the final 'sort' clause sorts the data by eventtime in ascending order.

Distance on earth between two points

Before proceeding further, I would like to get one thing out of the way. Since we are dealing with a lot of latitude-longitude pairs and subsequently trying to find displacement, velocity and acceleration, we need a way to calculate the distance between two points on earth. There are several algorithms with varying degree of accuracy, but this is the one I have found to be computationally accurate (if you do not have an algorithm already provided by the database vendor).

  1. function earth_distance_havesine(lat1, lon1, lat2, lon2, unit) {
  2. var radius = 3959; // miles
  3. var phi1 = lat1.toRadians();
  4. var phi2 = lat2.toRadians();
  5. var delphi = (lat2-lat1).toRadians();
  6. var dellambda = (lon2-lon1).toRadians();
  8. var a = Math.sin(delphi/2) * Math.sin(delphi/2) +
  9. Math.cos(phi1) * Math.cos(phi2) *
  10. Math.sin(dellambda/2) * Math.sin(dellambda/2);
  11. var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
  13. var dist = radius * c;
  14. if (unit=="K") { dist = dist * 1.609344 }
  15. if (unit=="N") { dist = dist * 0.8684 }
  16. return dist;
  17. }

We will be using this function in the next analysis. As I said before, our goal is to augment our device records with extra information pertaining to time-derivatives. The following code adds extra fields "interval", "acceleration", "angular_velocity" and "incline" to each device record by comparing it with the preceeding record.

  1. var lastRecord = null; // We create a variable to remember what was the last record processed
  3. var numProcessedRecords = 0;
  4. allNewCarReadings.forEach(function(record) {
  5. // Here we are reading a raw record from the car_readings collection, and then enhancing it with a few more
  6. // variables. These are the (1) id of the driver-vehicle combination and (2) the delta values between current and previous record
  7. numProcessedRecords += 1; // This is just for printing number of processed records when the program is running
  8. var lastTime; // This is the timestamp of the last record
  9. if (lastRecord !== null) {
  10. lastTime =;
  11. } else {
  12. lastTime = "";
  13. }
  14. var eventTime =;
  15. = new Date('Z'); // Creating a real timestamp from an ISO string (without the trailing 'Z')
  16. // print('Eventtime = ' + eventTime);
  17. if (eventTime !== lastTime) { // this must be a new record
  18. var driverVehicle = + "_" +;
  19. if (drivers.hasOwnProperty(driverVehicle))
  20. record.driverVehicleId = drivers[driverVehicle];
  21. else {
  22. drivers[driverVehicle] = maxDriverVehicleId;
  23. record.driverVehicleId = maxDriverVehicleId;
  24. maxDriverVehicleId += 1;
  25. }
  27. = {}; // delta stores the difference between the current record and the previous record
  28. if (lastRecord !== null) {
  29. var timeDifference = -; // in milliseconds
  30.["distance"] = earth_distance_havesine(
  35. "K");
  36. if (timeDifference < 60000) {
  37. // if time difference is less than 60 seconds, only then can we consider it as part of the same session
  38. // print(JSON.stringify(;
  39.["interval"] = timeDifference;
  40.["acceleration"] = 1000 * ( - / timeDifference;
  41.["angular_velocity"] = ( - / timeDifference;
  42.["incline"] = ( - / timeDifference;
  43. } else {
  44. // otherwise this is a new session. So we still store the records, but the delta calculation is all set to zero.
  45.["interval"] = timeDifference;
  46.["acceleration"] = 0.0;
  47.["angular_velocity"] = 0.0;
  48.["incline"] = 0.0;
  49. }
  50. db.getCollection('mldataset').insert(record);
  51. }
  52. }
  53. if (numProcessedRecords % 100 === 0)
  54. print("Processed " + numProcessedRecords + " records");
  55. lastRecord = record;
  56. });
  57. });

Note that in line 50, I am saving the record in another collection called 'mldataset' which is going to be the collection on which I will apply feature-extraction for driver signatures. The final task is to save the book-keeping values in their respective tables.

  1. db.book_keeping.update(
  2. { _id: "driver_vehicles"},
  3. { \(set: { drivers: drivers } },
  4. { upsert: true }
  5. );
  7. // Save the end time to the database
  8. db.book_keeping.update(
  9. { _id: "processed_until" },
  10. { \)set: { lastEndTime: endTime } },
  11. { upsert: true }
  12. );

Creating the feature set for driver signatures

The next step is to create the feature sets for driver signature analysis. I do this by first reading records from the augmented collection 'mldataset' and aggregating values over every 15 minutes. For each field that contains a number (and it happens to change often), I will calculate three statistical values for each field - the minimum over the time window, the maximum and the average. Interestingly, one can also include other statistical values like variance, kertosis - but I have not tried those in my experiment yet - and is an enhancement that you can do easily.

You will find all the code in the file 'extract_features_from_mldataset.js' under the 'machinelearning' directory.

Let us do some book-keeping first.

  1. var processedUntil = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles_processed_until" } );
  2. var currentTime = new Date(); // This time includes the seconds value
  3. // Set the end time (for querying) to the current time till the last whole minute, excluding seconds
  4. var endTimeGlobal = new Date(Date.UTC(currentTime.getFullYear(), currentTime.getMonth(), currentTime.getDate(), currentTime.getHours(), currentTime.getMinutes(), 0, 0))
  6. if (processedUntil === null) {
  7. db.book_keeping.insert( { _id: "driver_vehicles_processed_until", lastEndTimes: [] } ); // initialize to an empty array
  8. }
  10. // Now do a query of the database to find out what records are new since we ran it last
  11. var startTimeForSearchingActiveDevices = new Date(endTimeGlobal.getTime() - (200*86400000)); // Go back 200 days
  13. // Another book-keeping task is to read the driver-vehicle hash-table from the database.
  14. // Look up the book-keeping table to figure out the previous driver-vehicle codes (we have
  15. // numbers representing the combination of drivers and vehicles).
  16. var driverVehicles = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles" } );
  17. var drivers;
  18. if (driverVehicles !== null)
  19. drivers = driverVehicles.drivers;
  20. else
  21. drivers = {};
  23. var maxDriverVehicleId = 0;
  24. for (var key in drivers) {
  25. if (drivers.hasOwnProperty(key)) {
  26. maxDriverVehicleId = Math.max(maxDriverVehicleId, drivers[key]);
  27. }
  28. }

Using the last time stamp stored in the system, we can figure out which records are new.

  1. // Now do a query of the database to find out what records are new since we ran it last
  2. var allNewCarDrivers = db.getCollection('mldataset').aggregate([
  3. {
  4. "@match": {
  5. "data.eventTimestamp" : { @gt: startTimeForSearchingActiveDevices, @lte: endTimeGlobal }
  6. }
  7. },
  8. {
  9. "@group": { _id: "@data.username" }
  10. }
  11. ]);

Extracting features for each driver

Now is the time to do the actual feature extraction from the data-set. Here is the entire loop:

  1. allNewCarDrivers.forEach(function(driverId) {
  2. var driverName = driverId._id;
  3. print("Processing driver: " + driverName);
  4. var startTimeForDriver = startTimeForSearchingActiveDevices; // To begin with we start with the earliest start time we care about
  6. var driverIsNew = true;
  7. // First find out if this device already has some records processed, and has a last end time defined
  8. var lastEndTimeDevice = db.getCollection('book_keeping').find(
  9. {
  10. _id: "driver_vehicles_processed_until",
  11. "lastEndTimes.driver": driverName
  12. },
  13. {
  14. _id: 0,
  15. 'lastEndTimes.@': 1
  16. }
  17. );
  19. lastEndTimeDevice.forEach(function(record) {
  20. startTimeForDriver = record.lastEndTimes[0].endTime;
  21. driverIsNew = false;
  22. });
  24. //print('Starting time for driver is ' + startTimeForDriver.toISOString());
  25. //print('endTimeGlobal = ' + endTimeGlobal.toISOString());
  27. var allNewCarReadings = db.getCollection('mldataset').aggregate([
  28. {
  29. "@match": { // 1. Match all records that fall within the time range we have decided to use. Note that this is being
  30. // done on a live database - which means that new data is coming in while we are trying to analyze it.
  31. // Thus we have to pin both the starting time and the ending time. Pinning the endtime to the starting time
  32. // of the application ensures that we will be accurately picking up only the NEW records when the program
  33. // runs again the next time.
  34. "data.eventTimestamp": {@gt: startTimeForDriver, @lte: endTimeGlobal},
  35. "data.username": driverName // Only consider the records for this specific driver, ignoring all others.
  36. }
  37. },
  38. {
  39. @project: { // We only need to consider a few fields for our analysis. This eliminates the summaries from our analysis.
  40. "data": 1,
  41. "account": 1,
  42. "delta": 1,
  43. "driverVehicleId": 1,
  44. "_id": 0
  45. }
  46. },
  47. {
  48. "@group": {
  49. "_id": {
  50. year: {@year: "@data.eventTimestamp"},
  51. month: {@month: "@data.eventTimestamp"},
  52. day: {@dayOfMonth: "@data.eventTimestamp"},
  53. hour: {@hour: "@data.eventTimestamp"},
  54. minute: {@minute: "@data.eventTimestamp"},
  55. quarter: {@mod: [{@second: "@data.eventTimestamp"}, 4]}
  56. },
  58. "averageGPSLatitude": {@avg: {"@arrayElemAt": ["@data.location.coordinates", 1]}},
  59. "averageGPSLongitude": {@avg: {"@arrayElemAt": ["@data.location.coordinates", 0]}},
  61. "averageLoad": {@avg: "@data.load"},
  62. "minLoad": {@min: "@data.load"},
  63. "maxLoad": {@max: "@data.load"},
  65. "averageThrottlePosB": {@avg: "@data.abs_throttle_pos_b"},
  66. "minThrottlePosB": {@min: "@data.abs_throttle_pos_b"},
  67. "maxThrottlePosB": {@max: "@data.abs_throttle_pos_b"},
  69. "averageRpm": {@avg: "@data.rpm"},
  70. "minRpm": {@min: "@data.rpm"},
  71. "maxRpm": {@max: "@data.rpm"},
  73. "averageThrottlePos": {@avg: "@data.throttle_pos"},
  74. "minThrottlePos": {@min: "@data.throttle_pos"},
  75. "maxThrottlePos": {@max: "@data.throttle_pos"},
  77. "averageIntakeAirTemp": {@avg: "@data.intake_air_temp"},
  78. "minIntakeAirTemp": {@min: "@data.intake_air_temp"},
  79. "maxIntakeAirTemp": {@max: "@data.intake_air_temp"},
  81. "averageSpeed": {@avg: "@data.speed"},
  82. "minSpeed": {@min: "@data.speed"},
  83. "maxSpeed": {@max: "@data.speed"},
  85. "averageAltitude": {@avg: "@data.altitude"},
  86. "minAltitude": {@min: "@data.altitude"},
  87. "maxAltitude": {@max: "@data.altitude"},
  89. "averageCommThrottleAc": {@avg: "@data.comm_throttle_ac"},
  90. "minCommThrottleAc": {@min: "@data.comm_throttle_ac"},
  91. "maxCommThrottleAc": {@max: "@data.comm_throttle_ac"},
  93. "averageEngineTime": {@avg: "@data.engine_time"},
  94. "minEngineTime": {@min: "@data.engine_time"},
  95. "maxEngineTime": {@max: "@data.engine_time"},
  97. "averageAbsLoad": {@avg: "@data.abs_load"},
  98. "minAbsLoad": {@min: "@data.abs_load"},
  99. "maxAbsLoad": {@max: "@data.abs_load"},
  101. "averageGear": {@avg: "@data.gear"},
  102. "minGear": {@min: "@data.gear"},
  103. "maxGear": {@max: "@data.gear"},
  105. "averageRelThrottlePos": {@avg: "@data.rel_throttle_pos"},
  106. "minRelThrottlePos": {@min: "@data.rel_throttle_pos"},
  107. "maxRelThrottlePos": {@max: "@data.rel_throttle_pos"},
  109. "averageAccPedalPosE": {@avg: "@data.acc_pedal_pos_e"},
  110. "minAccPedalPosE": {@min: "@data.acc_pedal_pos_e"},
  111. "maxAccPedalPosE": {@max: "@data.acc_pedal_pos_e"},
  113. "averageAccPedalPosD": {@avg: "@data.acc_pedal_pos_d"},
  114. "minAccPedalPosD": {@min: "@data.acc_pedal_pos_d"},
  115. "maxAccPedalPosD": {@max: "@data.acc_pedal_pos_d"},
  117. "averageGpsSpeed": {@avg: "@data.gps_speed"},
  118. "minGpsSpeed": {@min: "@data.gps_speed"},
  119. "maxGpsSpeed": {@max: "@data.gps_speed"},
  121. "averageShortTermFuelTrim2": {@avg: "@data.short_term_fuel_trim_2"},
  122. "minShortTermFuelTrim2": {@min: "@data.short_term_fuel_trim_2"},
  123. "maxShortTermFuelTrim2": {@max: "@data.short_term_fuel_trim_2"},
  125. "averageO211": {@avg: "@data.o211"},
  126. "minO211": {@min: "@data.o211"},
  127. "maxO211": {@max: "@data.o211"},
  129. "averageO212": {@avg: "@data.o212"},
  130. "minO212": {@min: "@data.o212"},
  131. "maxO212": {@max: "@data.o212"},
  133. "averageShortTermFuelTrim1": {@avg: "@data.short_term_fuel_trim_1"},
  134. "minShortTermFuelTrim1": {@min: "@data.short_term_fuel_trim_1"},
  135. "maxShortTermFuelTrim1": {@max: "@data.short_term_fuel_trim_1"},
  137. "averageMaf": {@avg: "@data.maf"},
  138. "minMaf": {@min: "@data.maf"},
  139. "maxMaf": {@max: "@data.maf"},
  141. "averageTimingAdvance": {@avg: "@data.timing_advance"},
  142. "minTimingAdvance": {@min: "@data.timing_advance"},
  143. "maxTimingAdvance": {@max: "@data.timing_advance"},
  145. "averageClimb": {@avg: "@data.climb"},
  146. "minClimb": {@min: "@data.climb"},
  147. "maxClimb": {@max: "@data.climb"},
  149. "averageFuelPressure": {@avg: "@data.fuel_pressure"},
  150. "minFuelPressure": {@min: "@data.fuel_pressure"},
  151. "maxFuelPressure": {@max: "@data.fuel_pressure"},
  153. "averageTemp": {@avg: "@data.temp"},
  154. "minTemp": {@min: "@data.temp"},
  155. "maxTemp": {@max: "@data.temp"},
  157. "averageAmbientAirTemp": {@avg: "@data.ambient_air_temp"},
  158. "minAmbientAirTemp": {@min: "@data.ambient_air_temp"},
  159. "maxAmbientAirTemp": {@max: "@data.ambient_air_temp"},
  161. "averageManifoldPressure": {@avg: "@data.manifold_pressure"},
  162. "minManifoldPressure": {@min: "@data.manifold_pressure"},
  163. "maxManifoldPressure": {@max: "@data.manifold_pressure"},
  165. "averageLongTermFuelTrim1": {@avg: "@data.long_term_fuel_trim_1"},
  166. "minLongTermFuelTrim1": {@min: "@data.long_term_fuel_trim_1"},
  167. "maxLongTermFuelTrim1": {@max: "@data.long_term_fuel_trim_1"},
  169. "averageLongTermFuelTrim2": {@avg: "@data.long_term_fuel_trim_2"},
  170. "minLongTermFuelTrim2": {@min: "@data.long_term_fuel_trim_2"},
  171. "maxLongTermFuelTrim2": {@max: "@data.long_term_fuel_trim_2"},
  173. "averageGPSAcceleration": {@avg: "@delta.acceleration"},
  174. "minGPSAcceleration": {@min: "@delta.acceleration"},
  175. "maxGPSAcceleration": {@max: "@delta.acceleration"},
  177. "averageHeadingChange": {@avg: {@abs: "@delta.angular_velocity"}},
  178. "minHeadingChange": {@min: {@abs: "@delta.angular_velocity"}},
  179. "maxHeadingChange": {@max: {@abs: "@delta.angular_velocity"}},
  181. "averageIncline": {@avg: "@data.incline"},
  182. "minIncline": {@min: "@data.incline"},
  183. "maxIncline": {@max: "@data.incline"},
  185. "averageAcceleration": {@avg: "@delta.acceleration"},
  186. "minAcceleration": {@min: "@delta.acceleration"},
  187. "maxAcceleration": {@max: "@delta.acceleration"},
  189. // "dtcCodes": {"@push": "@data.dtc_status"},
  190. "accountIdArray": {@addToSet: "@account"},
  192. "vehicleArray": {@addToSet: "@data.vehicle"},
  193. "driverArray": {@addToSet: "@data.username"},
  194. "driverVehicleArray": {@addToSet: "@driverVehicleId"},
  196. "count": {@sum: 1}
  197. }
  198. },
  199. {
  200. @sort: {
  201. "_id": 1 // Finally sort the data based on eventtime in ascending order
  202. }
  203. }
  204. ],
  205. {
  206. allowDiskUse: true
  207. }
  208. );

For each driver (or rather driver-vehicle combination) that is identified, the first task is to figure out the last processing time for that driver and find all new records (lines 6 to 22). The next task of aggregating over 15 second windows is a MongoDB aggregation step starting from line 27. Aggregation tasks in MongoDB are described as pipeline where element element of the flow does a certain task and passes on the result to the next element in the pipe. The first task is to match all records within the time-span that we want to process (lines 29 to 36). Then we only need to consider (i.e. project) few fields that are of interest to us (lines 38 to 44). The element of the pipeline  '\(group') does the actual job of aggregation. The key to this aggregation step is the group-by Id that is created using a 'quarter' (line 55) which is nothing but a number between 0 and 3 created out of the second value of the time-stamp. This effectively creates the time windows needed for aggregation.

The actual aggregation steps are quite repetitive. See for example lines 61 to 63 where the average load, minimum load and maximum load is being calculated based on the aggregate over each time period. This is repeated for all the variables that we want to consider in the feature-set. Before storing it, the values are sorted based on event-time (lines 200 to 202).

Saving the feature-set in a collection

The features thus calculated are saved to a new collection on which I would apply a machine-learning algorithm to create a model. The collection is called 'vehicle_signature_records' - where the feature-set records can be saved as follows:

  1. var lastRecordedTimeForDriver = startTimeForDriver;
  2. var insertCounter = 0;
  3. allNewCarReadings.forEach(function (record) {
  4. var currentRecordEventTime = new Date(Date.UTC(record._id.year, record._id.month - 1,, record._id.hour, record._id.minute, record._id.quarter * 15, 0));
  5. if (currentRecordEventTime >= lastRecordedTimeForDriver)
  6. lastRecordedTimeForDriver = new Date(Date.UTC(record._id.year, record._id.month - 1,, record._id.hour, record._id.minute, 59, 999));
  8. record['eventTime'] = currentRecordEventTime;
  9. record['eventId'] = record._id;
  10. delete record._id;
  11. record['accountId'] = record.accountIdArray[0];
  12. delete record.accountIdArray;
  14. record['vehicle'] = record.vehicleArray[0];
  15. delete record.vehicleArray;
  17. record['driver'] = record.driverArray[0];
  18. delete record.driverArray;
  20. record['driverVehicle'] = record.driverVehicleArray[0];
  21. delete record.driverVehicleArray;
  23. record.averageGPSLatitude = parseInt((record.averageGPSLatitude * 1000).toFixed(3)) / 1000;
  24. record.averageGPSLongitude = parseInt((record.averageGPSLongitude * 1000).toFixed(3)) / 1000;
  26. db.getCollection('vehicle_signature_records').insert(record);
  27. insertCounter += 1;
  28. });

The code above inserts a few more variables to identify the driver, the vehicle and the driver-vehicle combination to the result sent by the aggregation function (lines 8 to 21) and saves it to the database (line 26). However lines 23 and 24 need an explanation since it signifies something very important and significant!

Coding the approximate location of the driver

One of the interesting observations I discovered while working on this problem is that one can dramatically improve accuracy of prediction if you can code the approximate location of the driver. Imagine working on this problem for millions of drivers who are scattered all across the country. One of the important facts to consider is that most drivers generally drive around a certain location most of the time. Thus if their location is somehow encoded into the model, the model can quickly converge based on their location. Lines 23 and 24 attempt to do just that. It encodes two numbers that represent the approximate latitude and longitude of the location. All these lines do is store the latitude and longitude with reduced accuracy.

Some more book-keeping

As a final step the final task is to store the book-keeping values.

  1. if (driverIsNew) { // which means this is a new device with no record
  2. db.book_keeping.update(
  3. {_id: 'driver_vehicles_processed_until'},
  4. {@push: {'lastEndTimes': {driver: driverName, endTime: lastRecordedTimeForDriver}}}
  5. );
  6. } else {
  7. var nowDate = new Date();
  8. db.book_keeping.update(
  9. {_id: 'driver_vehicles_processed_until', 'lastEndTimes.driver': driverName},
  10. {@set: {'lastEndTimes.@.endTime': lastRecordedTimeForDriver, 'lastEndTimes.@.driver': driverName}} // lastRecordedTimeForDriver
  11. );
  12. }

After doing all this work (which by now you may be already exhausted after reading through), we are finally ready to apply some real machine-learning algorithms. Remember, I said before that 95% of the task of a data scientist is in preparing, collecting, consolidating and cleaning the data. You are seeing a live example of that!

In big companies there are people called data-engineers who would do part of this job, but not all people are fortunate enough to have data-engineers working for them. Besides, if you can do all this work, you are more indispensible to the company you work for - and so it makes sense to develop these skills along with your analysis skills as a data-scientist.

Building a Machine Learning Model

Fortunately, the data has been created in a clean way, so there is no further clean-up required on it. Our data is in a MongoDB collection called 'vehicle_signature_records'. If you are a pure Data Scientist the following should be very familar to you. The only difference between what I am going to do now and what you generally find in books and blogs, is the data-source. I am going to read my data-sets directly from the MongoDB database instead of from CSV files. After reading the above, by now you must have become partial experts at understanding MongoDB document structures. If not, don't worry since all the data that we stored in the collection are all flat - i.e. all values are present at the top level of each record. To illlustrate how the data looks, let me show you one record from the collection.

  1. {
  2. "_id" : ObjectId("5a3028db7984b918e715c2a7"),
  3. "averageGPSLatitude" : 37.386,
  4. "averageGPSLongitude" : -121.96,
  5. "averageLoad" : 24.80392156862745,
  6. "minLoad" : 0.0,
  7. "maxLoad" : 68.62745098039215,
  8. "averageThrottlePosB" : 29.11764705882353,
  9. "minThrottlePosB" : 14.901960784313726,
  10. "maxThrottlePosB" : 38.03921568627451,
  11. "averageRpm" : 1216.25,
  12. "minRpm" : 516.0,
  13. "maxRpm" : 1486.0,
  14. "averageThrottlePos" : 20.49019607843137,
  15. "minThrottlePos" : 11.764705882352942,
  16. "maxThrottlePos" : 36.86274509803921,
  17. "averageIntakeAirTemp" : 85.5,
  18. "minIntakeAirTemp" : 84.0,
  19. "maxIntakeAirTemp" : 86.0,
  20. "averageSpeed" : 13.517712865133625,
  21. "minSpeed" : 0.0,
  22. "maxSpeed" : 24.238657551274084,
  23. "averageAltitude" : -1.575,
  24. "minAltitude" : -1.9,
  25. "maxAltitude" : -1.2,
  26. "averageCommThrottleAc" : 25.392156862745097,
  27. "minCommThrottleAc" : 6.2745098039215685,
  28. "maxCommThrottleAc" : 38.431372549019606,
  29. "averageEngineTime" : 32.25,
  30. "minEngineTime" : 32.0,
  31. "maxEngineTime" : 33.0,
  32. "averageAbsLoad" : 40.3921568627451,
  33. "minAbsLoad" : 18.431372549019606,
  34. "maxAbsLoad" : 64.31372549019608,
  35. "averageGear" : 0.0,
  36. "minGear" : 0.0,
  37. "maxGear" : 0.0,
  38. "averageRelThrottlePos" : 19.019607843137255,
  39. "minRelThrottlePos" : 4.705882352941177,
  40. "maxRelThrottlePos" : 27.84313725490196,
  41. "averageAccPedalPosE" : 14.607843137254902,
  42. "minAccPedalPosE" : 9.411764705882353,
  43. "maxAccPedalPosE" : 19.215686274509803,
  44. "averageAccPedalPosD" : 30.19607843137255,
  45. "minAccPedalPosD" : 18.823529411764707,
  46. "maxAccPedalPosD" : 39.21568627450981,
  47. "averageGpsSpeed" : 6.720000000000001,
  48. "minGpsSpeed" : 0.0,
  49. "maxGpsSpeed" : 12.82,
  50. "averageShortTermFuelTrim2" : -0.5,
  51. "minShortTermFuelTrim2" : -1.0,
  52. "maxShortTermFuelTrim2" : 1.0,
  53. "averageO211" : 9698.5,
  54. "minO211" : 1191.0,
  55. "maxO211" : 27000.0,
  56. "averageO212" : 30349.0,
  57. "minO212" : 28299.0,
  58. "maxO212" : 32499.0,
  59. "averageShortTermFuelTrim1" : -0.25,
  60. "minShortTermFuelTrim1" : -2.0,
  61. "maxShortTermFuelTrim1" : 4.0,
  62. "averageMaf" : 2.4332170200000003,
  63. "minMaf" : 0.77513736,
  64. "maxMaf" : 7.0106280000000005,
  65. "averageTimingAdvance" : 28.0,
  66. "minTimingAdvance" : 16.5,
  67. "maxTimingAdvance" : 41.0,
  68. "averageClimb" : -0.025,
  69. "minClimb" : -0.2,
  70. "maxClimb" : 0.1,
  71. "averageFuelPressure" : null,
  72. "minFuelPressure" : null,
  73. "maxFuelPressure" : null,
  74. "averageTemp" : 199.0,
  75. "minTemp" : 199.0,
  76. "maxTemp" : 199.0,
  77. "averageAmbientAirTemp" : 77.75,
  78. "minAmbientAirTemp" : 77.0,
  79. "maxAmbientAirTemp" : 78.0,
  80. "averageManifoldPressure" : 415.4026475455047,
  81. "minManifoldPressure" : 248.2073910645339,
  82. "maxManifoldPressure" : 592.9398786541643,
  83. "averageLongTermFuelTrim1" : 3.25,
  84. "minLongTermFuelTrim1" : -1.0,
  85. "maxLongTermFuelTrim1" : 7.0,
  86. "averageLongTermFuelTrim2" : -23.5,
  87. "minLongTermFuelTrim2" : -100.0,
  88. "maxLongTermFuelTrim2" : 7.0,
  89. "averageGPSAcceleration" : 1.0196509034930195,
  90. "minGPSAcceleration" : 0.0,
  91. "maxGPSAcceleration" : 1.9128551867763974,
  92. "averageHeadingChange" : 0.006215710862578118,
  93. "minHeadingChange" : 0.0,
  94. "maxHeadingChange" : 0.013477895914941244,
  95. "averageIncline" : null,
  96. "minIncline" : null,
  97. "maxIncline" : null,
  98. "averageAcceleration" : 1.0196509034930195,
  99. "minAcceleration" : 0.0,
  100. "maxAcceleration" : 1.9128551867763974,
  101. "count" : 4.0,
  102. "eventTime" : ISODate("2017-07-18T18:11:30.000+0000"),
  103. "eventId" : {
  104. "year" : NumberInt(2017),
  105. "month" : NumberInt(7),
  106. "day" : NumberInt(18),
  107. "hour" : NumberInt(18),
  108. "minute" : NumberInt(11),
  109. "quarter" : NumberInt(2)
  110. },
  111. "accountId" : "17350",
  112. "vehicle" : "toyota-highlander-2005",
  113. "driver" : "anupam",
  114. "driverVehicle" : 12.0
  115. }

That's quite a number of values for analysis! Which is a good sign for us - more values gives us more options to play with it.

As you may have realized by now, I have come to the final stage of building the model which is a traditional machine-learning task that is usually done in Python or R. So the final piece will be written in Python. You will find the entire code at '' in the 'machinelearning' directory.

Feature selection and elimination

As is common in any data-science project, one must first take a look at the data and determine if any features need to be eliminated. If some features do not make sense for the model we are building then those features need to be dropped. One quick observation is that fuel pressure and incline has nothing to do with driver signatures. So I will eliminate those values from any further consideration.

Specifically for this problem, you need do something special, which is a bit unusual, but required in this scenario.

If you look at the features carefully you will notice that some features are driver characteristics while others are vehicle characteristics. Thus it is important to not mix up the two sets. I have used my judgement to separate out the features into two sets as follows.

  1. vehicle_features = [
  2. "averageLoad",
  3. "minLoad",
  4. "maxLoad",
  5. "averageRpm",
  6. "minRpm",
  7. "maxRpm",
  8. "averageEngineTime",
  9. "minEngineTime",
  10. "maxEngineTime",
  11. "averageAbsLoad",
  12. "minAbsLoad",
  13. "maxAbsLoad",
  14. "averageAccPedalPosE",
  15. "minAccPedalPosE",
  16. "maxAccPedalPosE",
  17. "averageAccPedalPosD",
  18. "minAccPedalPosD",
  19. "maxAccPedalPosD",
  20. "averageShortTermFuelTrim2",
  21. "minShortTermFuelTrim2",
  22. "maxShortTermFuelTrim2",
  23. "averageO211",
  24. "minO211",
  25. "maxO211",
  26. "averageO212",
  27. "minO212",
  28. "maxO212",
  29. "averageShortTermFuelTrim1",
  30. "minShortTermFuelTrim1",
  31. "maxShortTermFuelTrim1",
  32. "averageMaf",
  33. "minMaf",
  34. "maxMaf",
  35. "averageTimingAdvance",
  36. "minTimingAdvance",
  37. "maxTimingAdvance",
  38. "averageTemp",
  39. "minTemp",
  40. "maxTemp",
  41. "averageManifoldPressure",
  42. "minManifoldPressure",
  43. "maxManifoldPressure",
  44. "averageLongTermFuelTrim1",
  45. "minLongTermFuelTrim1",
  46. "maxLongTermFuelTrim1",
  47. "averageLongTermFuelTrim2",
  48. "minLongTermFuelTrim2",
  49. "maxLongTermFuelTrim2"
  50. ]
  52. driver_features = [
  53. "averageGPSLatitude",
  54. "averageGPSLongitude",
  55. "averageThrottlePosB",
  56. "minThrottlePosB",
  57. "maxThrottlePosB",
  58. "averageThrottlePos",
  59. "minThrottlePos",
  60. "maxThrottlePos",
  61. "averageIntakeAirTemp",
  62. "minIntakeAirTemp",
  63. "maxIntakeAirTemp",
  64. "averageSpeed",
  65. "minSpeed",
  66. "maxSpeed",
  67. "averageAltitude",
  68. "minAltitude",
  69. "maxAltitude",
  70. "averageCommThrottleAc",
  71. "minCommThrottleAc",
  72. "maxCommThrottleAc",
  73. "averageGear",
  74. "minGear",
  75. "maxGear",
  76. "averageRelThrottlePos",
  77. "minRelThrottlePos",
  78. "maxRelThrottlePos",
  79. "averageGpsSpeed",
  80. "minGpsSpeed",
  81. "maxGpsSpeed",
  82. "averageClimb",
  83. "minClimb",
  84. "maxClimb",
  85. "averageAmbientAirTemp",
  86. "minAmbientAirTemp",
  87. "maxAmbientAirTemp",
  88. "averageGPSAcceleration",
  89. "minGPSAcceleration",
  90. "maxGPSAcceleration",
  91. "averageHeadingChange",
  92. "minHeadingChange",
  93. "maxHeadingChange",
  94. "averageAcceleration",
  95. "minAcceleration",
  96. "maxAcceleration"
  97. ]

Having done this, now we need to build two different models - one to predict the driver and another one to predict the vehicle. It will be an interesting exercise to see which of these two models have better accuracy.

Reading directly from database instead of CSV

For completeness sake let me first give you two utility functions that are used to pull data out of the MongoDB database.

  1. def _connect_mongo(host, port, username, password, db):
  2. """ A utility for making a connection to MongoDB """
  3. if username and password:
  4. mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
  5. conn = MongoClient(mongo_uri)
  6. else:
  7. conn = MongoClient(host, port)
  8. return conn[db]
  10. def read_mongo(db, collection, query={}, projection='', limit=1000, host='localhost', port=27017, username=None, password=None, no_id=False):
  11. """ Read from Mongo and Store into DataFrame """
  12. db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
  13. cursor = db[collection].find(query, projection).limit(limit)
  14. datalist = list(cursor)
  15. sanitized = json.loads(json_util.dumps(datalist))
  16. normalized = json_normalize(sanitized)
  17. df = pd.DataFrame(normalized)
  19. return df

The function above is Pandas-friendly - it reads data from the MongoDB database and returns a Pandas data-frame so that you can get to work immediately with your machine-learning part.

In case you are not comfortable with MongoDB, I am giving you the entire dataset of the aggregated values in CSV format so that you can import it in any database you wish. The file is in GZIP format - so you need to unzip it before reading it. For those of you who are comfortable with MongoDB, here is the entire database dump.

Building a Machine Learning model

Now it is time to build the learning model. At program invocation two parameters are needed - the database host and which feature set to build the model for. This is handled in the code as follows:

  1. DATABASE_HOST = argv[0]
  2. CHOSEN_FEATURE_SET = argv[1]
  4. readFromDatabase = True
  5. read_and_proceed = False

Then I have some logic for setting the appropriate feature set within the application.

  1. if (CHOSEN_FEATURE_SET == 'vehicle'):
  2. features = vehicle_features
  3. feature_name = 'vehicle'
  4. class_variables = ['vehicle'] # Declare the vehicle as a class variable
  5. elif (CHOSEN_FEATURE_SET == 'driver'):
  6. features = driver_features
  7. feature_name = 'driver'
  8. class_variables = ['driver'] # Declare the driver as a class variable
  9. else:
  10. features = all_features
  11. feature_name = 'driverVehicleId'
  12. class_variables = ['driverVehicleId'] # Declare the driver-vehicle combo as a class variable
  14. if readFromDatabase:
  15. if CHOSEN_FEATURE_SET == 'driver': # Choose the records only for one vehicle which has multiple drivers
  16. df = read_mongo('obd2', 'vehicle_signature_records', {"vehicle": {"\)regex" : ".*gmc-denali.*"}, "eventTime": {"\(gte": startTime, "\)lte": endTime} }, {"_id": 0}, 1000000, DATABASE_HOST, 27017, None, None, True )
  17. else:
  18. df = read_mongo('obd2', 'vehicle_signature_records', {"eventTime": {"\(gte": startTime, "\)lte": endTime} }, {"_id": 0}, 1000000, DATABASE_HOST, 27017, None, None, True )

The following part is mostly boiler-plate code to break up the dataset into a training set, test set and validation set. While doing so all null values are set to zero as well.

  1. # First randomize the entire dataset
  2. df = df.sample(frac=1).reset_index(drop=True)
  4. # Then choose only a small subset of the data, frac=1 means choose everything
  5. df = df.sample(frac=1, replace=True)
  7. df.fillna(value=0, inplace=True)
  9. train_df, test_df, validate_df = np.split(df, [int(.8*len(df)), int(.9*len(df))])
  11. df[feature_name] = df[feature_name].astype('category')
  13. y_train = train_df[class_variables]
  14. X_train = train_df.reindex(columns=features)
  15. X_train.replace('NODATA', 0, regex=False, inplace=True)
  16. X_train.fillna(value=0, inplace=True)
  18. y_test = test_df[class_variables]
  19. X_test = test_df.reindex(columns=features)
  20. X_test.replace('NODATA', 0, regex=False, inplace=True)
  21. X_test.fillna(value=0, inplace=True)
  23. y_validate = validate_df[class_variables]
  24. X_validate = validate_df.reindex(columns=features)
  25. X_test.replace('NODATA', 0, regex=False, inplace=True)
  26. X_validate.fillna(value=0, inplace=True)

Building a Random Forest Classifier and saving it

After trying out various different classifiers, with this dataset, it turns out that a Random Forest classifier gives the best accuracy. Here is the graph showing accuracy of the different classifiers used with this data set. The two best algorithms turn out to be Classification & Regression and Random Forest Classifier. I chose the Random Forest Classifier since this is an ensamble techique and will have better resilience.

Raspberry AlgorithmComparison

This is what you need to do to build a Random Forest classifier with this dataset.

  1. dt = RandomForestClassifier(n_estimators=20, min_samples_leaf=1, max_depth=20, min_samples_split=2, random_state=0)
  2., y_train.values.ravel())
  4. joblib.dump(dt, model_file)
  5. print('...done. Your Random Forest classifier has been saved in file: ' + model_file)

After building the model, I am saving it in a file (line 4) so that it can be read easily when doing the prediction. To find out how well the model is doing, we have to use the test set to make a prediction and evaluate the model score.

  1. y_pred = dt.predict(X_test)
  2. y_test_as_matrix = y_test.as_matrix()
  3. print('Completed generating predicted set')
  5. print ('Confusion Matrix')
  6. print(confusion_matrix(y_test, y_pred))
  8. crossValScore = cross_val_score(dt, X_validate, y_validate)
  9. model_score = dt.score(X_test, y_test_as_matrix)
  10. print('Cross validation score = ' + crossValScore)
  11. print('Model score = ' + model_score)
  12. print ('Precision, Recall and FScore')
  13. precision, recall, fscore, _ = prf(y_test, y_pred, pos_label=1, average='micro')
  14. print('Precision: ' + str(precision))
  15. print('Recall:' + str(recall))
  16. print('FScore:' + str(fscore))

 Many kinds of evalution metrics are calculated and printed in the above code segment. The most important one that I tend to look at is the overall model score, but the others will give you a good idea of the bias and variance which indicates how resilient your model is with respect to changing values.

Measure of importance

One interesting analysis is to figure out which of the features is the most impactful on the result. This can be done using the simple code fragment below:

  1. importance_indices = {}
  2. for z in range(0, len(dt.feature_importances_)):
  3. importance_indices[z] = dt.feature_importances_[z]
  5. sorted_importance_indices = sorted(importance_indices.items(), key=operator.itemgetter(1), reverse=True)
  7. for k1 in sorted_importance_indices:
  8. print(features[int(k1[0])] + ' -> ' + str(k1[1]))

 Prediction results and Conclusion

After running the two cases, namely driver prediction and vehicle prediciton, I am typically getting the following scores.

Driver Prediction Using Raspberry Pi results

This is encouraging given that there was always an apprehension about the score not being accurate enough due to the low frequency of data collection. This is an important factor, since we are creating this model out of the instantaneous time derivatives of values, and a low sampling rate will introduce a significant error. The dataset has 13 different driver vehicle combinations. There isn't a whole lot of driving data other than the experiments that were done, but with an accuracy that is 95% or above, there may be some value in this approach.

Another interesting fact is that the vehicle prediction is coming out to be more accurate than the driver. In other words, the parameters being emitted by the car tend to characterize the car more heavily than the driver. Most drivers drive the same way, but the machine characteristics of the car tend to distinguish them more clearly.

Commercial Use Cases

I have showed you an example of many such applications that can be done with an approach like this. It just involves equipping your car with a smart device like a Raspberry Pi and the rest is all backend server-side work. Here are all the use-cases that I can think of. You can take up any of these as your own project and attempt to find a solution.

  1. Parking assistance
  2. Adaptive collision detection
  3. Video evidence recording
  4. Detect abusive driving
  5. Crash detection
  6. Theft detection
  7. Parking meter
  8. Mobile hot-spot
  9. Voice recognition
  10. Connect racing equipment
  11. Head Unit display
  12. Traffic sign warning
  13. Pattern of usage
  14. Reset fault codes
  15. Driver recognition (this is already demonstrated here!)
  16. Emergency braking alert
  17. Animal overheating protection
  18. Remote start
  19. Remote seatbelt notifications
  20. Radio volume regulation
  21. Auto radio off when window down
  22. Eco-driving optimization alerts
  23. Auto lock/unlock

Commercial product

After doing this experiment building a Raspberry Pi kit from scratch, I found out that there is a product called AutoPi that you can buy which will cut short a lot of the hardware setup. I have no affiliation with AutoPi, but I thought it is interesting that this subject is being treated quite seriously by some companies in Europe.



Published in Data Science

The Raspberry Pi is an extremely interesting invention. It is a full-fledged Linux box (literally can be caged inside a plastic box) and it basically allows you to run any program to connect to any other communicating device around it through cables or a Bluetooth adapter. I am going to show you how to build your own system to hook up a Raspberry Pi to your car, then extract diagnostic information from the CAN bus, upload that to the cloud, and then use a streaming API to predict who is driving the vehicle using a learning model created through a periodic backend process. To understand this, you need some basic electrical engineering skills and some exposure to Python programming.

Since all this work is pretty long, I intend to break it up into three parts which I am going to put into three different articles as follows:

  1. Driver Signatures from OBD Data captured using a Raspberry Pi: Part 1 (Building your Raspberry Pi setup)
  2. Driver Signatures from OBD Data captured using a Raspberry Pi: Part 2 (Reading real-time data and uploading to the cloud)
  3. Driver Signatures from OBD Data captured using a Raspberry Pi: Part 3 (Analyzing driver data and making predictions)

You should go through these articles in order. You are currently reading Part 1 of this series.

Building your Raspberry Pi setup

Note that there are several web-sites where people have described how to set up a Raspberry Pi for reading Onboard Diagnostic (OBD) data. So my article here is mostly a repetition. I created my setup by reading those articles and watching YouTube videos. You should do that too. The only difference between my article and their's is that I provide a motive for doing all this work and take it further - i.e. for creating driver signatures. The articles I have seen on other blog sites take you through the process of building a Raspberry Pi setup, but their story ends right there. In my case, that is just chapter one - and the more interesting work of uploading to the cloud and analyzing that data follows after that. For completeness sake, I thus have to describe a few things about the setup that pertains to hardware. Without that you cannot even start the project. So just follow along. If you the handyman type of guy (or gal), roll up your sleeves and build it. It is fun!

To cut the description short, here are a few articles describing the setup with enough pictures to get you started.



I started off by going through the first article given here. My car is slightly old, so I had no way of using the car's monitor for seeing the dashboard. So I decided to hook up an independent monitor with my Raspberry Pi that I can carry along with me as a "kit". Moreover, I needed to try out this setup on different cars and different drivers to build any meaningful model - so it was important for me to make my setup independent of any attachment to a vehicle. What I wanted was to hop on any car, start my system using the car battery, hook up my Raspberry Pi to the OBD's ELM 327 adapter using Bluetooth and run my program. So my setup is a bit different from the other guy's since I also needed to upload all that data on the cloud while the car was in motion. Remember, I said I want to do real-time prediction of the driver, so all data that is being generated has to go to the cloud in real-time (or near real-time) where we will apply a prediction step on a pre-built model to make the prediction.

Bill of Materials (Main Items)

Given below are the accessories you need to purchase for a complete setup. I am giving a picture of the items as well so that you can match it as closely as you can when you purchase them.

Raspberry Pi 3

The first thing you want is a Raspberry Pi 3 computer. Here is how it looks. Search for one on web and purchase it from your country's online retailer.

Raspberry Pi 3

You will also need to buy a plastic case for this board to protect the components.

Notice that the Pi shown above has an HDMI port for the monitor, an ethernet port, 4 USB ports and an SD card slot.


You will need a 7-inch car monitor. Most people hook up the Raspberry Pi to the inbuilt monitor of the car, but my use-case is multi-car. So I wanted to keep my setup independent of any car - thus I went with a separate monitor. Here is what I used for this purpose.

This does not come with an HDMI cable, so I had to purchase a separate cable myself to connect to the Pi. Note that this came with a 110V adapter, so you may choose to either buy a cable to use with your car's cigarette lighter slot, or buy an inverter that converts 12V DC to 110V AC and use that as your universal power source for all situations.

Inverter (Optional)

I decided to go with the DC to AC inverter since that also [1] acts as a spike buster in the car - (note that voltage spikes when you start the car), and [2] you can have just one setup for your power source needs, whether you are inside the car, or sitting in the lab for your post-processing or development needs.

Here is a picture of the inverter I am using:



The next accessory you need is a keyboard to operate your machine. I went with an integrated keyboard/trackpad that goes along with size of the Raspberry Pi. Here is what I liked:

ELM 327 Adapter

To connect to your car, your Raspberry Pi needs a Bluetooth adapter. The ELM 327 standard allows a serial connection your car's CAN bus. This adapter can access your car's CAN bus data via a serial connection. Note that this is a slow connection (being serial), but we will try to manage with this. The connector is pretty cheap (around USD 10 or so). I have seen more advanced connectors from Kvaser that come in the range of $300 or so, and can read data at a much faster rate. Our connector will be able to read data at a reasonable rate, but not fast enough for highly accurate readings. We will live with it.

Here is how it looks.

You need to insert this into your car's OBD2 port which is generally found under the steering wheel of your car. It is generally above your left foot, and the port looks like this.

GSM modem

Since we intend to upload our data to the cloud in real-time, we will need a GSM modem. There are many choices and the software setup is different for different vendors. I went with a TELUS modem that uses a Huawei chip internally.

You will notice that this is where our setup goes beyond the other setups found on the internet.

GPS Antenna

Another important component we need is the GPS antenna. This is not really necessary for the driver signature part, but it is very important for visualization of data and to do other kinds of analysis - like speed and acceleration calculations from GPS data. Here is one that goes with the Raspberry Pi.

You need to follow the manual for connecting this to your Raspberry Pi. I had to do some soldering work in order to make the right connections. The software to drive this also needs to be installed via a 'apt-get' command.

Complete Kit

My complete kit after putting together all the components looks like this:

You can see that I decided to mount all of this on a two-layered particle board separated with plastic pipes. Even though it looks big, it is not clunky. There is enough room for expansion. I kept it in two layers so that all the power cables are hidden underneath the top board and exist out of view. You will need some velcro strips to keep the power supply units from falling away. I also used a USB extender (Inateck) to keep my USB devices from clogging up room around the Raspberry Pi.

This setup is clean and portable. You can use it to work and debug code inside your car or in the lab.


My setup was inspired by many articles on the web and on YouTube. You should also consult other links that show you how to do it. However this is not the main focus of this series - my main objective is to use this setup for some data science purpose. I chose to use this for driver signature analysis. The following two articles will delve deeper into the software setup - both on the client side and the server side, to solve this problem. You will find the links to the next two articles under. Happy reading!

Go to Part 2 of this series


Published in Data Science