Blog

Blog articles pertaining to Data Science, Machine Learning, Hardware, Software and Personal musings

Saturday, 18 August 2018 21:57

Managing effectiveness of Email campaigns

A Behavior Analysis Problem

Despite the growth of alternative communication channels such as chat applications (WhatsApp, Slack, Snapchat), social websites (Facebook, Google Plus, Twitter), and communication through icons (emojis) and images (Instagram), good old email is not dead yet. Almost all companies still rely on email not only to run their business, but also to market to their customers. This has evolved into a field of study in its own right. The main concern facing any marketing manager is controlling how much email to send to clients so that customers stay interested in the company and do not unsubscribe because of too much junk mail. The balance between too much mail and too little is therefore delicate.

We can employ data science to solve this problem. The problem tackled here is based on a real company's data and was presented to me as a test, so I thought it would be a good problem to solve and a good demonstration of data science. Emailing customers too frequently can cause them to unsubscribe from the email list. Bombarding them with too many emails also trains them to stop opening messages, which in turn can cause promotional emails to be routed to their junk mailbox. The goal is to find the optimum number of emails to send to each customer so that they stay interested over time; note that this number is different for every customer. We want to predict a customer's propensity to open an email in the future based on observed historical trends, as well as their propensity to unsubscribe if they do open the email.

The Dataset

Before I start describing the dataset, you may be inclined to see it first. To introduce some variety into my solution approach - and since most of the data is tabular in nature - I have decided to use a relational database (MySQL) to store and query the data, using JOINs where necessary to generate my feature set. You can download the entire dump of the MySQL database and restore it on your local machine to get started.

What data do we have?

We have data in 5 tables as follows:

  1. Campaign Types - a historical list of campaigns with the date and campaign type
  2. User Info - a list of user ids along with the registration date and the source from which each email address was acquired
  3. Sends - a list of dates when an email was sent to a user, with a flag indicating whether the email was opened
  4. Opens - a list of dates when an email was opened by a subscriber, with a flag indicating whether they unsubscribed
  5. Events - a list of events pertaining to each user along with counts; the count represents different things depending on the type of event

Let me describe in detail what each column in the given tables means.

Campaign Type

Column | Data Type | Description
launch_id | Date | Unique identifier of a campaign, by date of send.
camptype | String | Type of campaign (email send), where “Heavy Promo”, “Light Promo” and “Evergreen” (No Promo) refer to the magnitude of the promo offer, and “Primary Messaging”/”Secondary Messaging” refer to the degree to which the promotion is messaged within the email content.

User Info

Column | Data Type | Description
riid | Integer | Unique identifier of a customer
aq_dt | Datetime | Date the customer's email was acquired, as stored in the current email system
sub_source | String | Description of where the customer registered their email

Sends

Column | Data Type | Description
riid | Integer | Unique identifier of a customer
campaign_send_dt | Datetime | Send date of campaign
opened | Integer | 0 or 1, indicator of whether or not a given riid opened a given campaign

Opens

Column | Data Type | Description
riid | Integer | Unique identifier of a customer
campaign_send_dt | Datetime | Send date of campaign
optout | Integer | 0 or 1, indicator of whether or not a given riid opened a given campaign and unsubscribed

Events

Column | Data Type | Description
riid | Integer | Unique identifier of a customer
event | String | Name of the event the record is associated with:
  • Open (opened an email)
  • Sent (received an email)
  • Click (clicked an email)
  • Online Purch (made a purchase online)
event_captured_dt | Datetime | Date the event was recorded
cnts | Integer | Different aggregate measures depending on the associated event:
  • Open: count of all opens recorded for an individual on a given day
  • Sent: count of all sends recorded for an individual on a given day
  • Click: count of all clicks recorded for an individual on a given day
  • Online Purch: sum total of the number of items purchased by an individual on a given day

Understanding the dataset

Before we jump into building a predictive model, let us analyze the data to understand it better.

Aggregate open rate

As a first exercise, let us attempt to determine the aggregate open rate - in other words, what percentage of the emails sent were actually opened. This is a simple ratio of the count of opened emails to the total count of emails sent. We can compute it in one SQL statement as follows:

  1. mysql> select (select count(*) from sends where opened = 1)/count(*) from sends;
  2. +--------------------------------------------------------+
  3. | (select count(*) from sends where opened = 1)/count(*) |
  4. +--------------------------------------------------------+
  5. | 0.1863 |
  6. +--------------------------------------------------------+
  7. 1 row in set (0.13 sec)
  8.  

The answer is about 18%.
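Since "opened" is stored as a 0/1 flag, the same number can also be read off as a simple average. A minimal sketch, equivalent to the query above but more compact (the same trick works for any 0/1 indicator column in this dataset):

  1. -- The mean of a 0/1 flag is exactly the open rate
  2. SELECT AVG(opened) AS open_rate FROM sends;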

Aggregate unsubscribe per open rate

Of the people who opened their email, how many got irritated and unsubscribed? Again, this is a simple SQL query on the "opens" table as follows:

  1. mysql> select (select count(*) from opens where optout = 1)/count(*) from opens;
  2. +--------------------------------------------------------+
  3. | (select count(*) from opens where optout = 1)/count(*) |
  4. +--------------------------------------------------------+
  5. | 0.0137 |
  6. +--------------------------------------------------------+
  7. 1 row in set (0.11 sec)

The answer is about 1.4%.

Aggregate open rate ordered by launch Id

How about calculating the open rate per launch id? Since each campaign launch corresponds to a single send date, grouping the results by send date gives the same breakdown. This is a simple GROUP BY query:

  1. mysql> SELECT t1.campaign_send_dt AS campaign_send_date, t1.total_daily_sends, t2.total_opened,
  2. (t2.total_opened/t1.total_daily_sends) AS ratio FROM (SELECT campaign_send_dt, COUNT(*) AS total_daily_sends FROM sends
  3. GROUP BY campaign_send_dt ORDER BY campaign_send_dt) AS t1
  4. LEFT JOIN (SELECT campaign_send_dt, COUNT(*) AS total_opened FROM sends
  5. WHERE opened = 1 GROUP BY campaign_send_dt ORDER BY campaign_send_dt) AS t2
  6. ON t1.campaign_send_dt = t2.campaign_send_dt;
  7. +---------------------+-------------------+--------------+--------+
  8. | campaign_send_date | total_daily_sends | total_opened | ratio |
  9. +---------------------+-------------------+--------------+--------+
  10. | 2018-01-01 00:00:00 | 1084 | 310 | 0.2860 |
  11. | 2018-01-03 00:00:00 | 4575 | 790 | 0.1727 |
  12. | 2018-01-05 00:00:00 | 2074 | 502 | 0.2420 |
  13. | 2018-01-07 00:00:00 | 2582 | 501 | 0.1940 |
  14. | 2018-01-08 00:00:00 | 1606 | 337 | 0.2098 |
  15. | 2018-01-09 00:00:00 | 4210 | 690 | 0.1639 |
  16. | 2018-01-11 00:00:00 | 3316 | 574 | 0.1731 |
  17. | 2018-01-12 00:00:00 | 1905 | 422 | 0.2215 |
  18. | 2018-01-14 00:00:00 | 1862 | 435 | 0.2336 |
  19. | 2018-01-15 00:00:00 | 3141 | 564 | 0.1796 |
  20. | 2018-01-16 00:00:00 | 762 | 169 | 0.2218 |
  21. | 2018-01-18 00:00:00 | 2709 | 571 | 0.2108 |
  22. | 2018-01-19 00:00:00 | 1916 | 413 | 0.2156 |
  23. | 2018-01-21 00:00:00 | 3189 | 582 | 0.1825 |
  24. | 2018-01-22 00:00:00 | 535 | 63 | 0.1178 |
  25. | 2018-01-23 00:00:00 | 1948 | 474 | 0.2433 |
  26. | 2018-01-25 00:00:00 | 2581 | 479 | 0.1856 |
  27. | 2018-01-26 00:00:00 | 1953 | 413 | 0.2115 |
  28. | 2018-01-28 00:00:00 | 1008 | 244 | 0.2421 |
  29. | 2018-01-30 00:00:00 | 219 | 35 | 0.1598 |
  30. | 2018-01-31 00:00:00 | 6347 | 1057 | 0.1665 |
  31. | 2018-02-01 00:00:00 | 1077 | 274 | 0.2544 |
  32. | 2018-02-02 00:00:00 | 2626 | 436 | 0.1660 |
  33. | 2018-02-04 00:00:00 | 1967 | 396 | 0.2013 |
  34. | 2018-02-06 00:00:00 | 4989 | 883 | 0.1770 |
  35. | 2018-02-08 00:00:00 | 1031 | 238 | 0.2308 |
  36. | 2018-02-09 00:00:00 | 2731 | 461 | 0.1688 |
  37. | 2018-02-11 00:00:00 | 1994 | 424 | 0.2126 |
  38. | 2018-02-12 00:00:00 | 1209 | 348 | 0.2878 |
  39. | 2018-02-13 00:00:00 | 1936 | 384 | 0.1983 |
  40. | 2018-02-14 00:00:00 | 56 | 24 | 0.4286 |
  41. | 2018-02-15 00:00:00 | 4241 | 626 | 0.1476 |
  42. | 2018-02-16 00:00:00 | 1882 | 371 | 0.1971 |
  43. | 2018-02-18 00:00:00 | 2554 | 489 | 0.1915 |
  44. | 2018-02-19 00:00:00 | 5686 | 764 | 0.1344 |
  45. | 2018-02-21 00:00:00 | 467 | 141 | 0.3019 |
  46. | 2018-02-23 00:00:00 | 1888 | 363 | 0.1923 |
  47. | 2018-02-25 00:00:00 | 2938 | 525 | 0.1787 |
  48. | 2018-02-26 00:00:00 | 1128 | 314 | 0.2784 |
  49. | 2018-02-27 00:00:00 | 7569 | 1081 | 0.1428 |
  50. | 2018-02-28 00:00:00 | 2509 | 459 | 0.1829 |
  51. +---------------------+-------------------+--------------+--------+
  52. 41 ROWS IN SET (0.33 sec)

The daily open rate lies between roughly 12% and 43%; the 43% figure occurred only once, on a day when very few emails (just 56) were sent.
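Incidentally, because "opened" is a 0/1 flag, the same per-day breakdown can be produced in a single pass with conditional aggregation instead of joining two subqueries. A sketch that should return the same numbers:

  1. -- One-pass version: SUM counts the opens, AVG gives the open rate per send date
  2. SELECT campaign_send_dt,
  3.        COUNT(*)    AS total_daily_sends,
  4.        SUM(opened) AS total_opened,
  5.        AVG(opened) AS ratio
  6. FROM sends
  7. GROUP BY campaign_send_dt
  8. ORDER BY campaign_send_dt;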

Aggregate open rate based on registration source

We can also group the results by other criteria, such as sub_source. This query is a bit longer and is given below. Note that it uses the WITH clause (common table expressions), a relatively recent addition to MySQL syntax (available since version 8.0).

 

  1. mysql> WITH sends_wider AS (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source FROM SENDS
  2. LEFT JOIN user_info ON sends.riid = user_info.riid)
  3. SELECT t1.sub_source AS sub_source, t1.total_daily_sends, t2.total_opened, (t2.total_opened/t1.total_daily_sends) AS ratio
  4. FROM (SELECT sub_source, COUNT(*) AS total_daily_sends FROM sends_wider
  5. GROUP BY sub_source ORDER BY sub_source) AS t1
  6. LEFT JOIN (SELECT sub_source, COUNT(*) AS total_opened FROM sends_wider WHERE opened = 1 GROUP BY sub_source ORDER BY sub_source) AS t2
  7. ON t1.sub_source = t2.sub_source;
  8. +---------------------+-------------------+--------------+--------+
  9. | sub_source | total_daily_sends | total_opened | ratio |
  10. +---------------------+-------------------+--------------+--------+
  11. | Checkout_Online | 26189 | 5091 | 0.1944 |
  12. | eReceipt_Physical | 8750 | 1722 | 0.1968 |
  13. | Facebook CPL_Online | 830 | 125 | 0.1506 |
  14. | Footer_Desktop | 2537 | 555 | 0.2188 |
  15. | Footer_Mobile | 2221 | 407 | 0.1833 |
  16. | Join_Desktop | 1532 | 283 | 0.1847 |
  17. | Join_Mobile | 343 | 78 | 0.2274 |
  18. | Lightbox_Desktop | 15254 | 2712 | 0.1778 |
  19. | Lightbox_Mobile | 4734 | 1029 | 0.2174 |
  20. | My Account_Desktop | 599 | 119 | 0.1987 |
  21. | My Account_Mobile | 166 | 20 | 0.1205 |
  22. | Other_Other | 17098 | 2759 | 0.1614 |
  23. | POS form_Physical | 17542 | 3241 | 0.1848 |
  24. | Sports/Brand_Online | 670 | 150 | 0.2239 |
  25. | Warehouse_Desktop | 999 | 206 | 0.2062 |
  26. | Warehouse_Mobile | 536 | 129 | 0.2407 |
  27. +---------------------+-------------------+--------------+--------+
  28. 16 ROWS IN SET (0.37 sec)

The ratios are expected to hover around the aggregate open rate of 18% or so, and the results look plausible.

Aggregate open rate based on date of acquisition

Another similar criterion for grouping the results is the customer's acquisition date. Here is the query for that.

 

  1. mysql> WITH sends_wider AS
  2. (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source FROM sends
  3. LEFT JOIN user_info ON sends.riid = user_info.riid) SELECT t1.aq_dt AS aq_dt, t1.total_daily_sends, t2.total_opened,
  4. (t2.total_opened/t1.total_daily_sends) AS ratio
  5. FROM (SELECT aq_dt, COUNT(*) AS total_daily_sends FROM sends_wider GROUP BY aq_dt ORDER BY aq_dt) AS t1
  6. LEFT JOIN (SELECT aq_dt, COUNT(*) AS total_opened FROM sends_wider WHERE opened = 1 GROUP BY aq_dt ORDER BY aq_dt) AS t2
  7. ON t1.aq_dt = t2.aq_dt;
  8. +---------------------+-------------------+--------------+--------+
  9. | aq_dt | total_daily_sends | total_opened | ratio |
  10. +---------------------+-------------------+--------------+--------+
  11. | 2010-01-06 00:00:00 | 1 | NULL | NULL |
  12. | 2010-01-11 00:00:00 | 1 | NULL | NULL |
  13. | 2010-01-13 00:00:00 | 7 | 5 | 0.7143 |
  14. | 2010-01-18 00:00:00 | 5 | NULL | NULL |
  15. | 2010-01-21 00:00:00 | 4 | NULL | NULL |
  16. ...
  17. ...
  18. ...
  19. | 2018-01-17 00:00:00 | 86 | 16 | 0.1860 |
  20. | 2018-01-18 00:00:00 | 38 | 10 | 0.2632 |
  21. | 2018-01-19 00:00:00 | 131 | 26 | 0.1985 |
  22. | 2018-01-20 00:00:00 | 64 | 18 | 0.2813 |
  23. | 2018-01-21 00:00:00 | 44 | 15 | 0.3409 |
  24. | 2018-01-22 00:00:00 | 29 | 3 | 0.1034 |
  25. | 2018-01-23 00:00:00 | 26 | 11 | 0.4231 |
  26. | 2018-01-24 00:00:00 | 25 | 10 | 0.4000 |
  27. | 2018-01-25 00:00:00 | 23 | 12 | 0.5217 |
  28. | 2018-01-26 00:00:00 | 25 | 10 | 0.4000 |
  29. | 2018-01-27 00:00:00 | 24 | 10 | 0.4167 |
  30. | 2018-01-28 00:00:00 | 31 | 6 | 0.1935 |
  31. | 2018-01-29 00:00:00 | 12 | 2 | 0.1667 |
  32. | 2018-01-30 00:00:00 | 25 | 5 | 0.2000 |
  33. | 2018-01-31 00:00:00 | 73 | 18 | 0.2466 |
  34. | 2018-02-01 00:00:00 | 34 | 17 | 0.5000 |
  35. | 2018-02-02 00:00:00 | 29 | 7 | 0.2414 |
  36. | 2018-02-03 00:00:00 | 34 | 11 | 0.3235 |
  37. | 2018-02-04 00:00:00 | 30 | 9 | 0.3000 |
  38. | 2018-02-05 00:00:00 | 25 | 7 | 0.2800 |
  39. | 2018-02-06 00:00:00 | 22 | 11 | 0.5000 |
  40. | 2018-02-07 00:00:00 | 46 | 11 | 0.2391 |
  41. | 2018-02-08 00:00:00 | 64 | 14 | 0.2188 |
  42. | 2018-02-09 00:00:00 | 52 | 16 | 0.3077 |
  43. | 2018-02-10 00:00:00 | 17 | 3 | 0.1765 |
  44. | 2018-02-11 00:00:00 | 10 | 4 | 0.4000 |
  45. | 2018-02-12 00:00:00 | 12 | 5 | 0.4167 |
  46. | 2018-02-13 00:00:00 | 12 | 2 | 0.1667 |
  47. | 2018-02-14 00:00:00 | 11 | 6 | 0.5455 |
  48. | 2018-02-15 00:00:00 | 8 | 4 | 0.5000 |
  49. | 2018-02-16 00:00:00 | 12 | 3 | 0.2500 |
  50. | 2018-02-17 00:00:00 | 10 | 1 | 0.1000 |
  51. | 2018-02-18 00:00:00 | 16 | 2 | 0.1250 |
  52. | 2018-02-19 00:00:00 | 5 | 2 | 0.4000 |
  53. | 2018-02-20 00:00:00 | 11 | 8 | 0.7273 |
  54. | 2018-02-21 00:00:00 | 13 | 3 | 0.2308 |
  55. | 2018-02-22 00:00:00 | 13 | 1 | 0.0769 |
  56. | 2018-02-23 00:00:00 | 13 | 4 | 0.3077 |
  57. | 2018-02-24 00:00:00 | 8 | 1 | 0.1250 |
  58. | 2018-02-25 00:00:00 | 9 | 5 | 0.5556 |
  59. | 2018-02-26 00:00:00 | 1 | NULL | NULL |
  60. +---------------------+-------------------+--------------+--------+
  61. 2815 ROWS IN SET (0.48 sec)

The user data appears to have some problems (NULL values) for the earlier dates. This was confirmed: those records were reported to have been migrated from a legacy email system and were not completely correct.
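Because there are 2,815 distinct acquisition dates, the breakdown above is also quite sparse, and dates with no opens show up as NULL. A hedged variant is to roll the grouping up to the acquisition month and use conditional aggregation (SUM/AVG of the 0/1 flag) in place of the second LEFT JOIN, which also removes the NULLs; a sketch:

  1. -- Open rate by calendar month of acquisition
  2. SELECT DATE_FORMAT(user_info.aq_dt, '%Y-%m') AS aq_month,
  3.        COUNT(*)          AS total_sends,
  4.        SUM(sends.opened) AS total_opened,
  5.        AVG(sends.opened) AS ratio
  6. FROM sends
  7. LEFT JOIN user_info ON sends.riid = user_info.riid
  8. GROUP BY aq_month
  9. ORDER BY aq_month;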

To open or not

So far we have had no need for data-science tools, since most of the analysis was simple enough to handle with SQL. Let us now try a slightly more involved analysis. Before attempting any prediction, a data scientist has to work with some conjectures or hypotheses. These are theories based on intuition, and one must be able to either support or refute them using the data.

One such hypothesis is that a customer's propensity to open an email, as well as their propensity to unsubscribe (after opening the mail), may be related to their tenure. To analyze this theory we need a dataset that carries the tenure (the difference between the mail's send date and the customer's acquisition date) alongside each record in the "sends" table.

How do we get the tenure in the "sends" table? This is done through a simple JOIN statement as follows:

  1. mysql> SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source,
  2. DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info
  3. ON sends.riid = user_info.riid LIMIT 10;
  4. +-----------+---------------------+--------+---------------------+-------------------+--------+
  5. | riid | campaign_send_dt | opened | aq_dt | sub_source | tenure |
  6. +-----------+---------------------+--------+---------------------+-------------------+--------+
  7. | 747246582 | 2018-02-19 00:00:00 | 0 | 2016-12-07 00:00:00 | Lightbox_Desktop | 439 |
  8. | 773178502 | 2018-02-04 00:00:00 | 0 | 2017-02-03 00:00:00 | Join_Desktop | 366 |
  9. | 89069862 | 2018-01-11 00:00:00 | 1 | 2010-10-23 00:00:00 | Other_Other | 2637 |
  10. | 899893722 | 2018-01-15 00:00:00 | 0 | 2017-11-08 00:00:00 | Footer_Mobile | 68 |
  11. | 88695502 | 2018-02-02 00:00:00 | 0 | 2011-03-06 00:00:00 | Other_Other | 2525 |
  12. | 939655902 | 2018-02-27 00:00:00 | 0 | 2017-12-31 00:00:00 | eReceipt_Physical | 58 |
  13. | 587652902 | 2018-02-15 00:00:00 | 0 | 2015-11-12 00:00:00 | Other_Other | 826 |
  14. | 865323662 | 2018-02-11 00:00:00 | 1 | 2017-08-20 00:00:00 | Checkout_Online | 175 |
  15. | 166559382 | 2018-01-19 00:00:00 | 0 | 2014-02-22 00:00:00 | POS form_Physical | 1427 |
  16. | 894975022 | 2018-01-01 00:00:00 | 0 | 2017-10-25 00:00:00 | Lightbox_Desktop | 68 |
  17. +-----------+---------------------+--------+---------------------+-------------------+--------+
  18. 10 ROWS IN SET (0.00 sec)

The statement above prints only 10 rows, but the tenure has been neatly added as an extra column. If our theory is correct, there should be a relationship between the "opened" state and tenure. Since "opened" is a discrete variable that takes only two values (0 and 1), we may be able to reveal a correlation by plotting a histogram of the opened state against tenure ranges.
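The join above gets repeated in several of the queries that follow. One optional convenience - just a sketch, with a view name of my own choosing; the analysis below continues to use the inline subquery - is to wrap the join in a view so that the tenure column can be queried directly:

  1. -- A reusable view with tenure pre-computed for every send
  2. CREATE OR REPLACE VIEW sends_with_tenure AS
  3. SELECT sends.riid,
  4.        sends.campaign_send_dt,
  5.        sends.opened,
  6.        user_info.aq_dt,
  7.        user_info.sub_source,
  8.        DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure
  9. FROM sends
  10. LEFT JOIN user_info ON sends.riid = user_info.riid;

Any of the later tenure queries could then read FROM sends_with_tenure instead of repeating the JOIN.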

Since I have already gotten my hands dirty with SQL, let me be a bit bold and try to plot a sideways histogram using just SQL. Yes, you were expecting me to write a Python or R program to draw a beautiful histogram, but I am feeling a bit lazy now, and if SQL can give me a decent graphic, why not?

  1. mysql> SELECT ROUND(tenure, -2) AS tenure_bucket, COUNT(*) AS COUNT, RPAD('', LN(COUNT(*)), '*') AS bar
  2. FROM (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info
  4. ON sends.riid = user_info.riid WHERE opened = 1) AS histogram
  5. GROUP BY tenure_bucket
  6. ORDER BY tenure_bucket;
  7. +---------------+-------+----------+
  8. | tenure_bucket | COUNT | bar |
  9. +---------------+-------+----------+
  10. | 0 | 932 | ******* |
  11. | 100 | 2237 | ******** |
  12. | 200 | 1507 | ******* |
  13. | 300 | 1529 | ******* |
  14. | 400 | 1510 | ******* |
  15. | 500 | 1295 | ******* |
  16. | 600 | 929 | ******* |
  17. | 700 | 739 | ******* |
  18. | 800 | 830 | ******* |
  19. | 900 | 703 | ******* |
  20. | 1000 | 1486 | ******* |
  21. | 1100 | 623 | ****** |
  22. | 1200 | 409 | ****** |
  23. | 1300 | 378 | ****** |
  24. | 1400 | 349 | ****** |
  25. | 1500 | 512 | ****** |
  26. | 1600 | 374 | ****** |
  27. | 1700 | 309 | ****** |
  28. | 1800 | 254 | ****** |
  29. | 1900 | 369 | ****** |
  30. | 2000 | 275 | ****** |
  31. | 2100 | 160 | ***** |
  32. | 2200 | 191 | ***** |
  33. | 2300 | 220 | ***** |
  34. | 2400 | 116 | ***** |
  35. | 2500 | 102 | ***** |
  36. | 2600 | 110 | ***** |
  37. | 2700 | 80 | **** |
  38. | 2800 | 73 | **** |
  39. | 2900 | 23 | *** |
  40. | 3000 | 2 | * |
  41. +---------------+-------+----------+
  42. 31 ROWS IN SET (0.10 sec)

This SQL query needs some explanation. The inner query ("SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source, DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info ON sends.riid = user_info.riid WHERE opened = 1") creates a joined table containing only the entries that were opened. The outer query then buckets the tenure into ranges of 100 days (ROUND(tenure, -2) rounds to the nearest hundred). The "COUNT" column is the number of entries in each bucket, and "bar" is a poor man's histogram - note that the bar length is the natural logarithm of the count, so differences between buckets are visually compressed.

We see that opens are concentrated among customers with low tenure, while those with higher tenure open far fewer emails. This may be explained by the fact that interest in the emails dwindles with time - which I believe is natural to expect.
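One caveat about the histogram above: it plots raw open counts per tenure bucket, so buckets that simply received more email will naturally look taller. A hedged refinement is to normalize by the number of sends in each bucket, i.e. compute an open rate per tenure bucket; a sketch:

  1. -- Open rate (opens divided by sends) per 100-day tenure bucket
  2. SELECT ROUND(DATEDIFF(sends.campaign_send_dt, user_info.aq_dt), -2) AS tenure_bucket,
  3.        COUNT(*)          AS total_sends,
  4.        SUM(sends.opened) AS total_opened,
  5.        AVG(sends.opened) AS open_rate
  6. FROM sends
  7. LEFT JOIN user_info ON sends.riid = user_info.riid
  8. GROUP BY tenure_bucket
  9. ORDER BY tenure_bucket;

If the conclusion holds up on the rate as well as on the raw counts, that is stronger evidence that interest genuinely declines with tenure.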

To unsubscribe or not

A related problem is to figure out how tenure affects someone's likelihood of unsubscribing. For that we can use the "opens" table and run a similar query on the "optout" column. This will tell us whether people with a long tenure are more or less likely to unsubscribe. To continue the tradition, I am going to write this query in SQL again.

  1. mysql> SELECT ROUND(tenure, -2) AS tenure_bucket, COUNT(*) AS COUNT, RPAD('', LN(COUNT(*)), '*') AS bar
  2. FROM (SELECT opens.riid, opens.campaign_send_dt, opens.optout, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(opens.campaign_send_dt, user_info.aq_dt) AS tenure FROM opens LEFT JOIN user_info
  4. ON opens.riid = user_info.riid WHERE optout = 1) AS histogram
  5. GROUP BY tenure_bucket
  6. ORDER BY tenure_bucket;
  7. +---------------+-------+--------+
  8. | tenure_bucket | COUNT | bar |
  9. +---------------+-------+--------+
  10. | 0 | 239 | ***** |
  11. | 100 | 288 | ****** |
  12. | 200 | 135 | ***** |
  13. | 300 | 113 | ***** |
  14. | 400 | 125 | ***** |
  15. | 500 | 75 | **** |
  16. | 600 | 52 | **** |
  17. | 700 | 44 | **** |
  18. | 800 | 40 | **** |
  19. | 900 | 39 | **** |
  20. | 1000 | 91 | ***** |
  21. | 1100 | 20 | *** |
  22. | 1200 | 15 | *** |
  23. | 1300 | 10 | ** |
  24. | 1400 | 8 | ** |
  25. | 1500 | 16 | *** |
  26. | 1600 | 8 | ** |
  27. | 1700 | 8 | ** |
  28. | 1800 | 10 | ** |
  29. | 1900 | 9 | ** |
  30. | 2000 | 5 | ** |
  31. | 2100 | 9 | ** |
  32. | 2200 | 7 | ** |
  33. | 2300 | 2 | * |
  34. | 2400 | 1 | |
  35. | 2500 | 2 | * |
  36. | 2700 | 1 | |
  37. | 2800 | 2 | * |
  38. +---------------+-------+--------+
  39. 28 ROWS IN SET (0.05 sec)

It is clear from the histogram above that people with low tenure tend to opt out of email campaigns quickly, while those who have been subscribers for a long time hesitate to opt out. The bulk of the opt-outs occur within the first 200 days or so.

Predicting a customer's reaction

I am finally going to solve a real data-science problem with this dataset. For those of you who think I have been dilly-dallying with SQL all this while, this is where it gets interesting.

We would like to predict the likelihood that a customer will open an email sent on a given day in the future for a given campaign type. Is it also possible to predict whether the person who opens the email will unsubscribe? Let us first create the dataset and then fit a model on it.

To solve this problem, I would like to create a dataset that has the following columns.

  1. User id (riid)
  2. Date (just for reference, not to be used in dataset)
  3. Tenure on that day
  4. Type of campaign (camptype)
  5. Source of user's email address (sub_source)
  6. Number of times the user opened the email that day
  7. Whether the user unsubscribed from the list that day
  8. Sum total of items purchased by an individual on a given day

Most of the information is in the "events" table, but it needs to be augmented with information from the other tables. To do so we have to perform a few more joins. Ideally I would like to do everything in one statement, as follows:

 

  1. CREATE TABLE emailcampaign.dataset
  2. SELECT events.riid, events.event, events.event_captured_dt, events.cnts, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(events.event_captured_dt, user_info.aq_dt) AS tenure, campaign_types.camptype, opens.optout
  4. FROM events
  5. LEFT JOIN user_info
  6. ON events.riid = user_info.riid
  7. LEFT JOIN campaign_types ON events.event_captured_dt = campaign_types.launch_id
  8. LEFT JOIN opens ON opens.campaign_send_dt = events.event_captured_dt AND opens.riid = events.riid;

But alas, the data size is too big to do this in one shot - we have about 8.3 million records in the events table! So we have to break the query up into several parts and do some indexing in between to make it run faster. Here is the trick.

We do the joins in three stages, joining one table at a time and saving the result to a new table. (This is similar to the MapReduce paradigm, where each step writes its result to disk before the next stage begins.) We create three tables called dataset1, dataset2 and dataset3, each wider than the previous one and pulling in columns from a different table. The final table, dataset3, is the one that will be used for the machine learning exercise.

Here is how you will create the first dataset.

  1. mysql> CREATE TABLE emailcampaign.dataset1
  2. -> SELECT events.riid, events.event, events.event_captured_dt, events.cnts,
  3. -> user_info.aq_dt, user_info.sub_source,
  4. -> DATEDIFF(events.event_captured_dt, user_info.aq_dt) AS tenure
  5. -> FROM events
  6. -> LEFT JOIN user_info
  7. -> ON events.riid = user_info.riid;
  8. Query OK, 8358444 ROWS affected (2 MIN 58.12 sec)
  9. Records: 8358444 Duplicates: 0 Warnings: 0

Notice that dataset1 has the same number of rows as the events table. The next stage pulls in additional data from the campaign_types table:

  1. mysql> CREATE TABLE emailcampaign.dataset2
  2. -> SELECT dataset1.riid, dataset1.event, dataset1.event_captured_dt, dataset1.cnts,
  3. -> dataset1.aq_dt, dataset1.sub_source,
  4. -> dataset1.tenure, campaign_types.camptype
  5. -> FROM dataset1
  6. -> LEFT JOIN campaign_types
  7. -> ON dataset1.event_captured_dt = campaign_types.launch_id
  8. -> ;
  9. Query OK, 8358444 ROWS affected (2 MIN 59.94 sec)
  10. Records: 8358444 Duplicates: 0 Warnings: 0

Before we do the final join, it is necessary to help the database along, as this join is quite expensive. So let us create some indexes to make the queries efficient. I am going to create four indexes as follows:

  1. mysql> CREATE INDEX idx1 ON opens (campaign_send_dt);
  2. Query OK, 0 ROWS affected (0.37 sec)
  3. Records: 0 Duplicates: 0 Warnings: 0
  4.  
  5. mysql> CREATE INDEX idx2 ON dataset2 (event_captured_dt);
  6. Query OK, 0 ROWS affected (20.83 sec)
  7. Records: 0 Duplicates: 0 Warnings: 0
  8.  
  9. mysql> CREATE INDEX idx3 ON opens (riid);
  10. Query OK, 0 ROWS affected (0.25 sec)
  11. Records: 0 Duplicates: 0 Warnings: 0
  12.  
  13. mysql> CREATE INDEX idx4 ON dataset2 (riid);
  14. Query OK, 0 ROWS affected (20.78 sec)
  15. Records: 0 Duplicates: 0 Warnings: 0

The final join brings in data from the opens table into this composite table.

  1. mysql> CREATE TABLE emailcampaign.dataset3
  2. -> SELECT dataset2.riid, dataset2.event, dataset2.event_captured_dt, dataset2.cnts,
  3. -> dataset2.aq_dt, dataset2.sub_source,
  4. -> dataset2.tenure, dataset2.camptype, opens.optout
  5. -> FROM dataset2
  6. -> LEFT JOIN opens
  7. -> ON opens.campaign_send_dt = dataset2.event_captured_dt AND opens.riid = dataset2.riid;
  8. Query OK, 8358444 ROWS affected (4 MIN 25.61 sec)
  9. Records: 8358444 Duplicates: 0 Warnings: 0

It turns out that the data came out in a slightly different shape than I had expected, but it is workable. Here is a small sample of the resulting table, which will be our starting dataset for the rest of the machine learning exercise:

  1. mysql> SELECT * FROM dataset3 WHERE camptype IS NOT NULL AND optout IS NOT NULL LIMIT 10;
  2. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  3. | riid | event | event_captured_dt | cnts | aq_dt | sub_source | tenure | camptype | optout |
  4. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  5. | 91355862 | SENT | 2018-02-27 00:00:00 | 1 | 2011-02-21 00:00:00 | POS form_Physical | 2563 | Heavy Promo - Dedicated Messaging | 0 |
  6. | 86264082 | SENT | 2018-02-15 00:00:00 | 1 | 2010-04-29 00:00:00 | Other_Other | 2849 | Heavy Promo - Dedicated Messaging | 0 |
  7. | 111704342 | SENT | 2018-01-08 00:00:00 | 1 | 2012-07-28 00:00:00 | POS form_Physical | 1990 | Heavy Promo - Secondary Messaging | 0 |
  8. | 94780402 | SENT | 2018-02-01 00:00:00 | 1 | 2015-03-31 00:00:00 | Checkout_Online | 1038 | Heavy Promo | 0 |
  9. | 91122822 | SENT | 2018-01-31 00:00:00 | 1 | 2011-02-04 00:00:00 | POS form_Physical | 2553 | Heavy Promo - Dedicated Messaging | 0 |
  10. | 115866522 | SENT | 2018-01-25 00:00:00 | 1 | 2012-10-18 00:00:00 | Other_Other | 1925 | Light Promo | 0 |
  11. | 104867102 | SENT | 2018-02-12 00:00:00 | 1 | 2012-05-07 00:00:00 | Other_Other | 2107 | Light Promo - Secondary Messaging | 0 |
  12. | 131624382 | SENT | 2018-02-08 00:00:00 | 1 | 2014-05-31 00:00:00 | Checkout_Online | 1349 | No Promo | 0 |
  13. | 145310062 | SENT | 2018-01-23 00:00:00 | 1 | 2015-03-31 00:00:00 | Checkout_Online | 1029 | Light Promo - Secondary Messaging | 0 |
  14. | 86463002 | SENT | 2018-01-18 00:00:00 | 1 | 2010-05-05 00:00:00 | Other_Other | 2815 | Light Promo - Secondary Messaging | 0 |
  15. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  16. 10 ROWS IN SET (0.00 sec)

I am going to take one more data-processing step to make my life simpler later. Since I am about to use a standard machine learning package, I am going to export the entire dataset to a file format that is friendly to most packages. Yes, it is possible to read the data from the database directly (and I will show that in a different article some day), but at the moment I want to solve this problem with the tools already available in the machine learning package, without writing anything complicated.

Therefore I am going to export the entire table into CSV format so that I can read it from a file. The export is done simply through the following statement:

  1. mysql> SELECT riid, event, event_captured_dt, cnts, aq_dt, sub_source, tenure, camptype, optout
  2. -> FROM dataset3 WHERE camptype IS NOT NULL AND optout IS NOT NULL INTO OUTFILE '/tmp/emailcampaign_dataset.csv'
  3. -> FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
  4. Query OK, 8358444 ROWS affected (32.38 sec)

Note that I am only including the records where the camptype and optout fields are not NULL. The resulting file looks like this:

  1. riid,event,event_captured_dt,cnts,aq_dt,sub_source,tenure,camptype,optout
  2. 91355862,SENT,2018-02-27 00:00:00,1,2011-02-21 00:00:00,POS form_Physical,2563,Heavy Promo - Dedicated Messaging,0
  3. 86264082,SENT,2018-02-15 00:00:00,1,2010-04-29 00:00:00,Other_Other,2849,Heavy Promo - Dedicated Messaging,0
  4. 111704342,SENT,2018-01-08 00:00:00,1,2012-07-28 00:00:00,POS form_Physical,1990,Heavy Promo - Secondary Messaging,0
  5. 94780402,SENT,2018-02-01 00:00:00,1,2015-03-31 00:00:00,Checkout_Online,1038,Heavy Promo,0
  6. 91122822,SENT,2018-01-31 00:00:00,1,2011-02-04 00:00:00,POS form_Physical,2553,Heavy Promo - Dedicated Messaging,0
  7. 115866522,SENT,2018-01-25 00:00:00,1,2012-10-18 00:00:00,Other_Other,1925,Light Promo,0
  8. 104867102,SENT,2018-02-12 00:00:00,1,2012-05-07 00:00:00,Other_Other,2107,Light Promo - Secondary Messaging,0
  9. 131624382,SENT,2018-02-08 00:00:00,1,2014-05-31 00:00:00,Checkout_Online,1349,No Promo,0
  10. 145310062,SENT,2018-01-23 00:00:00,1,2015-03-31 00:00:00,Checkout_Online,1029,Light Promo - Secondary Messaging,0
  11. 86463002,SENT,2018-01-18 00:00:00,1,2010-05-05 00:00:00,Other_Other,2815,Light Promo - Secondary Messaging,0
  12. ...
  13. ...

You can download the CSV file to look at it in its entirety.
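One small detail: SELECT ... INTO OUTFILE does not write a column header row by itself, so if you want the header line shown at the top of the sample above you have to add it separately. A common trick - shown purely as a hedged sketch with a hypothetical output path, and note that without an explicit sort key the position of the header row is not strictly guaranteed - is to UNION a literal row of column names ahead of the data:

  1. -- Prepend a literal header row before exporting (hypothetical output path)
  2. SELECT 'riid','event','event_captured_dt','cnts','aq_dt','sub_source','tenure','camptype','optout'
  3. UNION ALL
  4. SELECT riid, event, event_captured_dt, cnts, aq_dt, sub_source, tenure, camptype, optout
  5. FROM dataset3
  6. WHERE camptype IS NOT NULL AND optout IS NOT NULL
  7. INTO OUTFILE '/tmp/emailcampaign_dataset_with_header.csv'
  8. FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';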

Predicting likelihood that the user will open the email

With the dataset thus created, we are now ready to fit a machine learning model to it. We started with about 8.3 million records, but after eliminating all records that have either a NULL camptype or a NULL optout field, only 186,584 records remain. Our job is to make a prediction based on those. I would like to fit a machine learning model using this entire dataset, and for that I am going to use Spark as my tool-set. Since the output is binary (optout is 0 if the customer stays subscribed and 1 if they unsubscribe), I am going to use logistic regression to fit the model.
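Before training, it is also worth checking the number of usable rows and how the two optout classes are balanced. A quick sanity check directly in MySQL (a sketch against the dataset3 table created above):

  1. -- How many rows survive the NULL filter?
  2. SELECT COUNT(*) FROM dataset3 WHERE camptype IS NOT NULL AND optout IS NOT NULL;
  3. -- How are the two classes (stay vs. unsubscribe) distributed?
  4. SELECT optout, COUNT(*) AS num_records
  5. FROM dataset3
  6. WHERE camptype IS NOT NULL AND optout IS NOT NULL
  7. GROUP BY optout;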

You will find the entire code, written in Java, on GitHub. Clone the repository to follow along.

The prerequisite for running this program is Apache Spark, which you need to download and install on your computer first. I will be using Spark in standalone mode, so there should be no difficulty running it. Make sure that 'spark-submit' is on your $PATH.

I have laid out the repository so that the dataset sits in the 'data' folder under the repository root. After cloning the repository, look at the Java file CustomerOpenResponsePredictor.java. Let us walk through the main() function of this class to see the logic behind it.

  1. SparkSession spark = SparkSession
  2. .builder()
  3. .appName("CustomerOpenResponsePredictor")
  4. .getOrCreate();
  5.  
  6. // Load training data
  7. Dataset allrows = spark.read().format("csv")
  8. .option("header", "true")
  9. .option("inferSchema", "true")
  10. .load("data/emailcampaign_dataset.csv");
  11.  
  12. StringIndexer eventIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("event").setOutputCol("eventIndex");
  13. StringIndexer subsourceIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("sub_source").setOutputCol("sub_sourceIndex");
  14. StringIndexer camptypeIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("camptype").setOutputCol("camptypeIndex");
  15.  
  16. Dataset indexedRows1 = eventIndexer.fit(allrows).transform(allrows);
  17. Dataset indexedRows2 = subsourceIndexer.fit(indexedRows1).transform(indexedRows1);
  18. Dataset indexedRows = camptypeIndexer.fit(indexedRows2).transform(indexedRows2);
  19.  
  20. indexedRows.printSchema();
  21.  
  22. Dataset[] trainingAndTest = indexedRows.randomSplit(new double[] {0.8, 0.2});
  23.  
  24. Dataset trainingSet = trainingAndTest[0];
  25. Dataset testSet = trainingAndTest[1];

The first task is to create the Spark session, which happens in lines 1-4. In lines 7-10 I load the data from the CSV file. (Since Spark is multi-threaded, the data is read in parallel in chunks by different threads.) This dataset has three fields that arrive as strings but are really categorical variables (i.e. enumerations of state). Spark offers a way to handle this through the StringIndexer class. Lines 12-14 define three indexers that convert the columns event, sub_source and camptype into eventIndex, sub_sourceIndex and camptypeIndex respectively. The remaining variables are all integers, so LogisticRegression can use them directly. Lines 16-18 perform the actual conversion of these three columns from strings to numeric indices. The next step is to split the data randomly into two sets - one for training and one for validation. I use the randomSplit() function to do that in lines 22-25, which leaves me with two datasets split 80%-20%.

Feature Set and Output Value

To be clear, here are all the parameters in the feature set (the input X vector):

  1. eventIndex - the enumerated version of the type of event
  2. cnts - the count associated with the event on that day (opens, sends, clicks or items purchased, depending on the event type)
  3. sub_sourceIndex - the enumerated version of the source of email
  4. tenure - the amount of time the user has been a subscriber
  5. camptypeIndex - the enumerated version of the type of campaign

The output (y) is a binary value "optout" which represents whether the user will stay (represented by 0) or unsubscribe (represented by 1).

  1. VectorAssembler assembler = new VectorAssembler()
  2. .setInputCols(new String[]{"eventIndex", "cnts", "sub_sourceIndex", "tenure", "camptypeIndex"})
  3. .setOutputCol("features");
  4.  
  5.  
  6. LogisticRegression lr = new LogisticRegression()
  7. .setMaxIter(10)
  8. .setRegParam(0.3)
  9. .setElasticNetParam(0.8)
  10. .setFeaturesCol("features") // setting features column
  11. .setLabelCol("optout"); // setting label column
  12.  
  13. Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler,lr});
  14.  
  15. PipelineModel lrModel = pipeline.fit(trainingSet);
  16.  
  17. Dataset predictions = lrModel.transform(testSet);
  18.  
  19. // [sub_source, event_captured_dt, aq_dt, sub_sourceIndex, rawPrediction, cnts, probability, camptypeIndex,
  20. // riid, tenure, event, camptype, prediction, eventIndex, optout, features]
  21. int correctCount = 0, allCount = 0;
  22. for (Row r : predictions.select("prediction","optout").collectAsList()) {
  23. if (Math.abs((r.getDouble(0) - (double)r.getInt(1))) < 0.001)
  24. correctCount++;
  25. allCount++;
  26. }
  27.  
  28. System.out.println("Percentage correct = " + 100*((double)correctCount/(double)allCount));
  29.  
  30. spark.stop();

Note that not all the columns present in the dataset should be fed into the LogisticRegression model builder; only the meaningful columns must be selected for model building. I use the VectorAssembler class to create my feature set. Lines 1-3 create a composite column called "features" that comprises the columns eventIndex, cnts, sub_sourceIndex, tenure and camptypeIndex - these are our input variables. The LogisticRegression estimator is configured in lines 6-11 with a maximum of 10 iterations, a regularization parameter of 0.3 and an elastic net parameter of 0.8; its input is the "features" column I just created and its label is the "optout" column. Lines 13-15 assemble a Spark pipeline, and line 15 fits the actual model on the training set. Line 17 uses the test dataset to make predictions with this model. Lines 21-26 compute the accuracy of the model by a simple comparison of the predicted value against the original value of optout.

One point to note about software versions: I am using Spark 2.3.1 and JDK 1.8 to do my work. This combination worked for me. (JDK 10 has some compatibility problems with Spark 2.3, so avoid this combination.)

To compile type the following:

  1. mvn clean package

To run the program type:

  1. bin/run-emailcampaign-learner.sh

Your output is going to look somewhat like this:

  1. bin/run-emailcampaign-learner.sh
  2. Running Email Campaign Learner using SPARK ...
  3. spark-submit --master local[*] --class com.anupambagchi.emailcampaign.CustomerOpenResponsePredictor /Users/anupambagchi/Developer/machine-learning-using-spark/bin/../target/spark.application-1.0-jar-with-dependencies.jar
  4. 18/09/03 20:43:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  5. root
  6. |-- riid: integer (nullable = true)
  7. |-- event: string (nullable = true)
  8. |-- event_captured_dt: timestamp (nullable = true)
  9. |-- cnts: integer (nullable = true)
  10. |-- aq_dt: timestamp (nullable = true)
  11. |-- sub_source: string (nullable = true)
  12. |-- tenure: integer (nullable = true)
  13. |-- camptype: string (nullable = true)
  14. |-- optout: integer (nullable = true)
  15. |-- eventIndex: double (nullable = false)
  16. |-- sub_sourceIndex: double (nullable = false)
  17. |-- camptypeIndex: double (nullable = false)
  18.  
  19. Percentage correct = 98.41830135805131

Since the split between the training and test datasets is random, the results differ slightly from run to run. For most of my runs, the accuracy of the model varies between 98.2 and 98.5 percent. This is pretty good, given how much data we had to crunch to get here.

How to use this information?

Now the big question is: what can we do with this? Since we can predict quite accurately whether a user will unsubscribe, it is best not to send the email to that user. The next step in deploying this model is to keep it running as a web service and, before sending an email, check (through the web service) whether the user is likely to opt out. If so, it is best to suppress that email. Awesome! So now we have found an innovative way to retain customers.

Conclusion

I do not want to disappoint anyone who came to this page expecting practical nuggets of information and instead found a bunch of SQL and mathematics. So here are some widely circulated tips for actually running email campaigns, reproduced below for your reference.

  • Try sending email on weekends: Based on data of over 9 billion emails (yes, billion with a “b”) provided by Mailchimp, Saturdays and Sundays have notably higher Click Thru Rates (CTRs); this is coupled with said data showing unsubscribes peak on Tuesdays. These rates include both B2B and B2C emails (although given the volume, are presumably skewed towards B2C.)

  • Send very early in the morning: Data layered over HubSpot survey results reveal that while recipients report reading email throughout the day, CTRs peak between 6 and 7 am (local time). And while unsubscribes also peak in the morning, they spike again late in the evening (when readership drops off).

  • Optimize your email for mobile: a bit of a no-brainer; surveys cited 80% of the respondents read email on mobile devices, highlighting the importance of making sure your email doesn’t look like scrambled eggs on mobile.

  • Include reference data in your email: Make email searchable. Focus group participants report using email inboxes as an archive of ‘elite’, personal data, referring back to it on an informational basis.

  • Use lots of links in your email: While this may be counterintuitive, Zarrella says there is a strong correlation between a greater number of links and higher CTRs. The data also shows lower unsubscribes as the number of links increases. (This may be, of course, because the unsubscribe link is tougher to find…)

  • Serialize and label your email: Using continuity and content-based words such as “[this] week’s,” “newsletter,” or “digest” in the subject line leads to higher CTRs. Conversely, the traditional “spam” words still hold true as words to avoid. [Bonus hint: monitor your spam box for common “trigger” words to avoid.]

  • Give your subscribers special access: Focus groups find people like getting offers specific to them, offers with exclusivity built into them. Another no-brainer, but worth repeating.

  • Send email from people they’ve heard of: Be it a celebrity name or a guru, people like receiving emails from names they recognize.

  • Do not be afraid to send too much email: Unsubscribes are notably higher for organizations that send only one or two emails per month; as the frequency approaches eight emails per month, the number of unsubscribes declines.

  • Your newest subscribers are your best: While most subscribers opt out shortly after first subscribing to an email, CTRs early on are also at a high – proving the adage “get ‘em while they’re hot.”

  • Ask people to follow you, not forward emails: It’s not just about using social media, but using it wisely. Survey data showed that about 80% of people either never or rarely forward or tweet commercial email, even with the advent of ‘share’ and ‘tweet’ buttons. Instead, get people to follow you on Facebook, Twitter, etc., driving prospects to subscribe to your email.

  • Make them want to get your emails: 70% of people report reading most or all of their email, and 58% have separate “junk” inboxes. Given that, Zarrella stressed incorporating all the best-practice takeaways detailed here to ensure your message gets to people’s “good” email address, gets read, and gets acted upon.

So that's it, folks. You are now an email marketing ninja. Go out and start practicing some of these ideas.

Published in Data Science

In the first and second parts of this series I described how to set up the hardware to read data from the OBD port using a Raspberry Pi and then upload it to the cloud. You should read the first part and the second part of this series before you read this article.

Having done all the work to capture the data and transport it to the cloud, let us figure out what can be done in the cloud to introduce machine learning. To understand the concepts in this article you will need to be familiar with Javascript and Python. I am also using MongoDB as my database, so you will need to know the basics of a document-oriented database to follow the example code here. MongoDB is not only my storage engine here, but also my compute engine. By that I mean that I am using the database's scripting language (Javascript) to pre-process the data that will ultimately be used for machine learning. The Javascript code given here executes (in a distributed manner) inside the MongoDB database. (Some people get confused when they see Javascript and assume that it requires a server like Node.js to run - not here.)

Introduction to the Solution

To set the introduction, I will describe the following three tasks in this article:

  1. Read the raw records and augment them with additional derived information to generate extra features for machine learning that are not sent directly by the Raspberry Pi. Derivatives based on the time interval between readings can be used to compute instantaneous velocity, angular velocity and incline. These are inserted into the derived records when we save them to another MongoDB collection (a collection is the equivalent of a 'table' in relational-database parlance) to be used later for generating the feature sets. I will call this collection 'mldataset' in my database.
  2. Read the 'mldataset' collection and extract features from the records. The feature set is saved into another collection called 'vehicle_signature_records'. This is an involved process, since there are so many fields in the raw records. In my case, the feature set consists of three summary statistics (minimum, average and maximum) of all values, aggregated over a 15-second period. Other research papers on this subject take the same approach, but the time interval over which the aggregates are taken varies with the frequency of the readings. The recommended frequency is 5 Hz, i.e. one record-set per 0.2 seconds, but as I mentioned in article 2 of this series, we are unable to read data that fast over a serial connection to the ELM 327. The maximum speed I have been able to observe (mostly in modern cars) is one record-set every 0.7 seconds, so a 15-second aggregation makes more sense in our scenario. The accuracy of the prediction may suffer as a result - but we will accept that as a constraint. The solution methodology remains the same.
  3. Apply a learning algorithm to the feature set to learn driver behavior. The model then needs to be deployed on a machine in the cloud. In real time we calculate the same aggregates over the same interval (15 seconds) and feed them into the model to obtain a prediction. To confirm the driver we need to take readings over several intervals (5 minutes gives 20 predictions) and then use the value with the maximum count (i.e. the modal value).

Augmenting raw data with derivatives

This is a very common scenario in IoT applications. When data is generated by a human being, it usually carries useful information right at the surface - all you need to do is scan the logs and extract it. An example is inferring a user's interests from their clicks in a shopping-cart scenario: all the items the user has viewed on the website appear directly in the logs. An IoT device, however, is dumb; it has no emotions and no special interests. All data coming from an IoT device is the same consistent, boring stream. So where do you dig to find useful information? The answer lies in the time-derivatives. The variation in values from one reading to the next provides useful insight into behavior. Examples are velocity (the derivative of displacement computed from GPS readings), acceleration (the derivative of velocity) and jerk (the derivative of acceleration). Augmenting the raw data with this extra information is therefore extremely useful for IoT applications.

In this step I am going to write some Javascript code (that runs inside the MongoDB database) to augment the raw data with derivatives for each record. You will find all of this code in the file 'extract_driver_features_from_car_readings.js', located inside the 'machinelearning' folder of the GitHub repository at https://github.com/anupambagchi/driver-signature-raspberry-pi.

Processing raw records

Before diving into the code, let me clarify a few things. The code is written to run as a cron job on a machine on the same network as the MongoDB database, so that the database is reachable. Since it runs as a cron task, we need to know which records from the raw data collection still remain to be processed, so we have to do some book-keeping in the database. We have a special collection called 'book_keeping' for this purpose, where we store, among other things, the timestamp up to which records have already been processed. The data (in JSON format) may look like this:

  1. {
  2. "_id" : "processed_until",
  3. "lastEndTime" : ISODate("2017-12-08T23:56:56.724+0000")
  4. }

To determine where we need to pick up the record processing from, here is one way to do this in a MongoDB script written in Javascript.

  1. // Look up the book-keeping table to figure out the time from where we need to pick up
  2. var processedUntil = db.getCollection('book_keeping').findOne( { _id: "processed_until" } );
  3. endTime = new Date(); // Set the end time to the current time
  4. if (processedUntil != null) {
  5. startTime = processedUntil.lastEndTime;
  6. } else {
  7. db.book_keeping.insert( { _id: "processed_until", lastEndTime: endTime } );
  8. startTime = new Date(endTime.getTime() - (365*86400000)); // Go back 365 days
  9. }

The 'else' part of the logic above is for the initialization phase when we run it for the first time - we just want to pick up all records for the past year.

Keeping track of driver vehicle combination

Another book-keeping task is to keep track of the driver-vehicle combinations. To make the job easier for the machine learning algorithm, each combination is converted to a numeric index. Those indices are maintained in another book-keeping document, with _id 'driver_vehicles', which looks somewhat like this:

  1. {
  2. "_id" : "driver_vehicles",
  3. "drivers" : {
  4. "gmc-denali-2015_jonathan" : 0.0,
  5. "gmc-denali-2015_mindy" : 1.0,
  6. "gmc-denali-2015_charles" : 2.0,
  7. "gmc-denali-2015_chris" : 3.0,
  8. "gmc-denali-2015_elise" : 4.0,
  9. "gmc-denali-2015_thomas" : 5.0,
  10. "toyota-camry-2009_alice" : 6.0,
  11. "gmc-denali-2015_andrew" : 7.0,
  12. "toyota-highlander-2005_arka" : 8.0,
  13. "subaru-outback-2015_john" : 9.0,
  14. "gmc-denali-2015_grant" : 10.0,
  15. "gmc-denali-2015_avni" : 11.0,
  16. "toyota-highlander-2005_anupam" : 12.0
  17. }
  18. }

These are the names of vehicles paired with their drivers, with a number assigned to each combination. When a driver-vehicle combination is encountered, the program checks whether that combination already exists; if not, it adds a new one. Here is the code to do it.

  1. // Another book-keeping task is to read the driver-vehicle hash-table from the database
  2. // Look up the book-keeping table to figure out the previous driver-vehicle codes (we have
  3. // numbers representing the combination of drivers and vehicles).
  4. var driverVehicles = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles" } );
  5. var drivers;
  6. if (driverVehicles != null)
  7. drivers = driverVehicles.drivers;
  8. else
  9. drivers = {};
  10.  
  11. var maxDriverVehicleId = 0;
  12. for (var key in drivers) {
  13. if (drivers.hasOwnProperty(key)) {
  14. maxDriverVehicleId = Math.max(maxDriverVehicleId, drivers[key]);
  15. }
  16. }

You can see that a 'findOne' call to the MongoDB database is being made to read the hash table into memory.

The next task is to query the raw collection to find out which records are new since the last time it ran.

NOTE: In the code segments below the dollar symbol shows up as '@' (a limitation of the blog platform). Please make the appropriate substitution when you read them; the correct code may be found in the GitHub repository.

  1. // Now do a query of the database to find out what records are new since we ran it last
  2. var allNewCarDrivers = db.getCollection('car_readings').aggregate([
  3. {
  4. "@match": {
  5. "timestamp" : { @gt: startTimeStr, @lte: endTimeStr }
  6. }
  7. },
  8. {
  9. "@unwind": "@data"
  10. },
  11. {
  12. "@group": { _id: "@data.username" }
  13. }
  14. ]);

SQL vs. Document-oriented database

The next part is the crux of the actual processing happening in this script. To understand how it works you need to be familiar with the MongoDB aggregation framework. A task like this would take far too much code if you started writing SQL. Most relational databases, and also Spark, offer SQL as a way to process and aggregate data, and the most common reason I have heard from managers for taking that approach is that "it is easy". That works, but it is verbose. This is why I personally prefer to use the MongoDB aggregation framework for my pre-processing - I can get things done much faster than with the other tools out there. It may not be "easy" in the conventional sense, but a little extra effort spent studying the aggregation framework pays off, saving a lot of development effort.

What about execution time? These scripts execute on the database nodes - inside the database. You cannot really make it any faster, since most of the time spent dealing with large data goes into transporting it from the storage nodes to the execution nodes. With the aggregation framework you get the benefits of big-data processing for free: you are effectively using in-database analytics for the fastest execution time.

  1. // Look at all the records returned and process each driver one-by-one
  2. // The following query is a pipeline with the following steps:
  3. allNewCarDrivers.forEach(function(driverId) {
  4. var driverName = driverId._id;
  5. print("Processing driver: " + driverName);
  6. var allNewCarReadings = db.getCollection('car_readings').aggregate([
  7. {
  8. "@match": { // 1. Match all records that fall within the time range we have decided to use. Note that this is being
  9. // done on a live database - which means that new data is coming in while we are trying to analyze it.
  10. // Thus we have to pin both the starting time and the ending time. Pinning the endtime to the starting time
  11. // of the application ensures that we will be accurately picking up only the NEW records when the program
  12. // runs again the next time.
  13. "timestamp" : { @gt: startTimeStr, @lte: endTimeStr }
  14. }
  15. },
  16. {
  17. @project: { // We only need to consider a few fields for our analysis. This eliminates the summaries from our analysis.
  18. "timestamp": 1,
  19. "data" : 1,
  20. "account": 1,
  21. "_id": 0
  22. }
  23. },
  24. {
  25. @unwind: "@data" // Flatten out all records into one gigantic array of records
  26. },
  27. {
  28. @match: {
  29. "data.username": driverName // Only consider the records for this specific driver, ignoring all others.
  30. }
  31. },
  32. {
  33. @sort: {
  34. "data.eventtime": 1 // Finally sort the data based on eventtime in ascending order
  35. }
  36. }
  37. ]);

This nifty script does a lot. The first thing to note is that we are operating on a live database with a constant stream of data coming in. So, in order to select records for processing, we first fix a time range and select only the records that fall within it; records that could not be picked up this time will be gathered and processed the next time the script runs. This is all handled by the first 'match' clause.

The second stage is the 'project' clause, which keeps only the four fields required by the next stage of the pipeline. The 'unwind' clause flattens the data arrays. The next 'match' clause filters down to the specific driver, and the final 'sort' clause sorts the data by eventtime in ascending order.

Distance on earth between two points

Before proceeding further, I would like to get one thing out of the way. Since we are dealing with a lot of latitude-longitude pairs and subsequently trying to derive displacement, velocity and acceleration, we need a way to calculate the distance between two points on the earth's surface. There are several formulas with varying degrees of accuracy; the haversine formula below is the one I have found to be accurate enough while still being cheap to compute (useful if your database vendor does not already provide a distance function).

  1. // The haversine formula for the great-circle distance between two points on the earth.
  2. // Number.prototype.toRadians is not built into JavaScript, so define it if it is missing.
  3. if (Number.prototype.toRadians === undefined) {
  4. Number.prototype.toRadians = function() { return this * Math.PI / 180; };
  5. }
  6.  
  7. function earth_distance_havesine(lat1, lon1, lat2, lon2, unit) {
  8. var radius = 3959; // mean radius of the earth in miles
  9. var phi1 = lat1.toRadians();
  10. var phi2 = lat2.toRadians();
  11. var delphi = (lat2-lat1).toRadians();
  12. var dellambda = (lon2-lon1).toRadians();
  13.  
  14. var a = Math.sin(delphi/2) * Math.sin(delphi/2) +
  15. Math.cos(phi1) * Math.cos(phi2) *
  16. Math.sin(dellambda/2) * Math.sin(dellambda/2);
  17. var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
  18.  
  19. var dist = radius * c;
  20. if (unit=="K") { dist = dist * 1.609344 } // kilometres
  21. if (unit=="N") { dist = dist * 0.8684 } // nautical miles
  22. return dist;
  23. }

We will be using this function in the next analysis. As I said before, our goal is to augment the device records with extra information pertaining to time-derivatives. The following code adds the extra fields "distance", "interval", "acceleration", "angular_velocity" and "incline" to each device record by comparing it with the preceding record.

  1. var lastRecord = null; // We create a variable to remember what was the last record processed
  2.  
  3. var numProcessedRecords = 0;
  4. allNewCarReadings.forEach(function(record) {
  5. // Here we are reading a raw record from the car_readings collection, and then enhancing it with a few more
  6. // variables. These are the (1) id of the driver-vehicle combination and (2) the delta values between current and previous record
  7. numProcessedRecords += 1; // This is just for printing number of processed records when the program is running
  8. var lastTime; // This is the timestamp of the last record
  9. if (lastRecord !== null) {
  10. lastTime = lastRecord.data.eventtime;
  11. } else {
  12. lastTime = "";
  13. }
  14. var eventTime = record.data.eventtime;
  15. record.data.eventTimestamp = new Date(record.data.eventtime+'Z'); // Creating a real timestamp from an ISO string (without the trailing 'Z')
  16. // print('Eventtime = ' + eventTime);
  17. if (eventTime !== lastTime) { // this must be a new record
  18. var driverVehicle = record.data.vehicle + "_" + record.data.username;
  19. if (drivers.hasOwnProperty(driverVehicle))
  20. record.driverVehicleId = drivers[driverVehicle];
  21. else {
  22. drivers[driverVehicle] = maxDriverVehicleId;
  23. record.driverVehicleId = maxDriverVehicleId;
  24. maxDriverVehicleId += 1;
  25. }
  26.  
  27. record.delta = {}; // delta stores the difference between the current record and the previous record
  28. if (lastRecord !== null) {
  29. var timeDifference = record.data.eventTimestamp.getTime() - lastRecord.data.eventTimestamp.getTime(); // in milliseconds
  30. record.delta["distance"] = earth_distance_havesine(
  31. record.data.location.coordinates[1],
  32. record.data.location.coordinates[0],
  33. lastRecord.data.location.coordinates[1],
  34. lastRecord.data.location.coordinates[0],
  35. "K");
  36. if (timeDifference < 60000) {
  37. // if time difference is less than 60 seconds, only then can we consider it as part of the same session
  38. // print(JSON.stringify(lastRecord.data));
  39. record.delta["interval"] = timeDifference;
  40. record.delta["acceleration"] = 1000 * (record.data.speed - lastRecord.data.speed) / timeDifference;
  41. record.delta["angular_velocity"] = (record.data.heading - lastRecord.data.heading) / timeDifference;
  42. record.delta["incline"] = (record.data.altitude - lastRecord.data.altitude) / timeDifference;
  43. } else {
  44. // otherwise this is a new session. So we still store the records, but the delta calculation is all set to zero.
  45. record.delta["interval"] = timeDifference;
  46. record.delta["acceleration"] = 0.0;
  47. record.delta["angular_velocity"] = 0.0;
  48. record.delta["incline"] = 0.0;
  49. }
  50. db.getCollection('mldataset').insert(record);
  51. }
  52. }
  53. if (numProcessedRecords % 100 === 0)
  54. print("Processed " + numProcessedRecords + " records");
  55. lastRecord = record;
  56. });
  57. });
  58.  

Note that in line 50, I am saving the record into another collection called 'mldataset', which is the collection on which I will later run feature-extraction for driver signatures. The final task is to save the book-keeping values in their respective collections.

  1. db.book_keeping.update(
  2. { _id: "driver_vehicles"},
  3. { $set: { drivers: drivers } },
  4. { upsert: true }
  5. );
  6.  
  7. // Save the end time to the database
  8. db.book_keeping.update(
  9. { _id: "processed_until" },
  10. { $set: { lastEndTime: endTime } },
  11. { upsert: true }
  12. );
  13.  

Creating the feature set for driver signatures

The next step is to create the feature sets for driver signature analysis. I do this by first reading records from the augmented collection 'mldataset' and aggregating values over 15-second windows. For each numeric field (and these values happen to change often), I calculate three statistics over the time window: the minimum, the maximum and the average. One could also include other statistics such as variance or kurtosis; I have not tried those in my experiment yet, and that is an enhancement you can add easily.

You will find all the code in the file 'extract_features_from_mldataset.js' under the 'machinelearning' directory.

Let us do some book-keeping first.

  1. var processedUntil = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles_processed_until" } );
  2. var currentTime = new Date(); // This time includes the seconds value
  3. // Set the end time (for querying) to the current time till the last whole minute, excluding seconds
  4. var endTimeGlobal = new Date(Date.UTC(currentTime.getFullYear(), currentTime.getMonth(), currentTime.getDate(), currentTime.getHours(), currentTime.getMinutes(), 0, 0))
  5.  
  6. if (processedUntil === null) {
  7. db.book_keeping.insert( { _id: "driver_vehicles_processed_until", lastEndTimes: [] } ); // initialize to an empty array
  8. }
  9.  
  10. // Now do a query of the database to find out what records are new since we ran it last
  11. var startTimeForSearchingActiveDevices = new Date(endTimeGlobal.getTime() - (200*86400000)); // Go back 200 days
  12.  
  13. // Another book-keeping task is to read the driver-vehicle hash-table from the database.
  14. // Look up the book-keeping table to figure out the previous driver-vehicle codes (we have
  15. // numbers representing the combination of drivers and vehicles).
  16. var driverVehicles = db.getCollection('book_keeping').findOne( { _id: "driver_vehicles" } );
  17. var drivers;
  18. if (driverVehicles !== null)
  19. drivers = driverVehicles.drivers;
  20. else
  21. drivers = {};
  22.  
  23. var maxDriverVehicleId = 0;
  24. for (var key in drivers) {
  25. if (drivers.hasOwnProperty(key)) {
  26. maxDriverVehicleId = Math.max(maxDriverVehicleId, drivers[key]);
  27. }
  28. }
  29.  

Using the last time stamp stored in the system, we can figure out which records are new.

  1. // Now do a query of the database to find out what records are new since we ran it last
  2. var allNewCarDrivers = db.getCollection('mldataset').aggregate([
  3. {
  4. "$match": {
  5. "data.eventTimestamp" : { $gt: startTimeForSearchingActiveDevices, $lte: endTimeGlobal }
  6. }
  7. },
  8. {
  9. "$group": { _id: "$data.username" }
  10. }
  11. ]);
  12.  

Extracting features for each driver

Now is the time to do the actual feature extraction from the data-set. Here is the entire loop:

  1. allNewCarDrivers.forEach(function(driverId) {
  2. var driverName = driverId._id;
  3. print("Processing driver: " + driverName);
  4. var startTimeForDriver = startTimeForSearchingActiveDevices; // To begin with we start with the earliest start time we care about
  5.  
  6. var driverIsNew = true;
  7. // First find out if this device already has some records processed, and has a last end time defined
  8. var lastEndTimeDevice = db.getCollection('book_keeping').find(
  9. {
  10. _id: "driver_vehicles_processed_until",
  11. "lastEndTimes.driver": driverName
  12. },
  13. {
  14. _id: 0,
  15. 'lastEndTimes.$': 1
  16. }
  17. );
  18.  
  19. lastEndTimeDevice.forEach(function(record) {
  20. startTimeForDriver = record.lastEndTimes[0].endTime;
  21. driverIsNew = false;
  22. });
  23.  
  24. //print('Starting time for driver is ' + startTimeForDriver.toISOString());
  25. //print('endTimeGlobal = ' + endTimeGlobal.toISOString());
  26.  
  27. var allNewCarReadings = db.getCollection('mldataset').aggregate([
  28. {
  29. "$match": { // 1. Match all records that fall within the time range we have decided to use. Note that this is being
  30. // done on a live database - which means that new data is coming in while we are trying to analyze it.
  31. // Thus we have to pin both the starting time and the ending time. Pinning the endtime to the starting time
  32. // of the application ensures that we will be accurately picking up only the NEW records when the program
  33. // runs again the next time.
  34. "data.eventTimestamp": {$gt: startTimeForDriver, $lte: endTimeGlobal},
  35. "data.username": driverName // Only consider the records for this specific driver, ignoring all others.
  36. }
  37. },
  38. {
  39. $project: { // We only need to consider a few fields for our analysis. This eliminates the summaries from our analysis.
  40. "data": 1,
  41. "account": 1,
  42. "delta": 1,
  43. "driverVehicleId": 1,
  44. "_id": 0
  45. }
  46. },
  47. {
  48. "$group": {
  49. "_id": {
  50. year: {$year: "$data.eventTimestamp"},
  51. month: {$month: "$data.eventTimestamp"},
  52. day: {$dayOfMonth: "$data.eventTimestamp"},
  53. hour: {$hour: "$data.eventTimestamp"},
  54. minute: {$minute: "$data.eventTimestamp"},
  55. quarter: {$mod: [{$second: "$data.eventTimestamp"}, 4]}
  56. },
  57.  
  58. "averageGPSLatitude": {$avg: {"$arrayElemAt": ["$data.location.coordinates", 1]}},
  59. "averageGPSLongitude": {$avg: {"$arrayElemAt": ["$data.location.coordinates", 0]}},
  60.  
  61. "averageLoad": {$avg: "$data.load"},
  62. "minLoad": {$min: "$data.load"},
  63. "maxLoad": {$max: "$data.load"},
  64.  
  65. "averageThrottlePosB": {$avg: "$data.abs_throttle_pos_b"},
  66. "minThrottlePosB": {$min: "$data.abs_throttle_pos_b"},
  67. "maxThrottlePosB": {$max: "$data.abs_throttle_pos_b"},
  68.  
  69. "averageRpm": {$avg: "$data.rpm"},
  70. "minRpm": {$min: "$data.rpm"},
  71. "maxRpm": {$max: "$data.rpm"},
  72.  
  73. "averageThrottlePos": {$avg: "$data.throttle_pos"},
  74. "minThrottlePos": {$min: "$data.throttle_pos"},
  75. "maxThrottlePos": {$max: "$data.throttle_pos"},
  76.  
  77. "averageIntakeAirTemp": {$avg: "$data.intake_air_temp"},
  78. "minIntakeAirTemp": {$min: "$data.intake_air_temp"},
  79. "maxIntakeAirTemp": {$max: "$data.intake_air_temp"},
  80.  
  81. "averageSpeed": {$avg: "$data.speed"},
  82. "minSpeed": {$min: "$data.speed"},
  83. "maxSpeed": {$max: "$data.speed"},
  84.  
  85. "averageAltitude": {$avg: "$data.altitude"},
  86. "minAltitude": {$min: "$data.altitude"},
  87. "maxAltitude": {$max: "$data.altitude"},
  88.  
  89. "averageCommThrottleAc": {$avg: "$data.comm_throttle_ac"},
  90. "minCommThrottleAc": {$min: "$data.comm_throttle_ac"},
  91. "maxCommThrottleAc": {$max: "$data.comm_throttle_ac"},
  92.  
  93. "averageEngineTime": {$avg: "$data.engine_time"},
  94. "minEngineTime": {$min: "$data.engine_time"},
  95. "maxEngineTime": {$max: "$data.engine_time"},
  96.  
  97. "averageAbsLoad": {$avg: "$data.abs_load"},
  98. "minAbsLoad": {$min: "$data.abs_load"},
  99. "maxAbsLoad": {$max: "$data.abs_load"},
  100.  
  101. "averageGear": {$avg: "$data.gear"},
  102. "minGear": {$min: "$data.gear"},
  103. "maxGear": {$max: "$data.gear"},
  104.  
  105. "averageRelThrottlePos": {$avg: "$data.rel_throttle_pos"},
  106. "minRelThrottlePos": {$min: "$data.rel_throttle_pos"},
  107. "maxRelThrottlePos": {$max: "$data.rel_throttle_pos"},
  108.  
  109. "averageAccPedalPosE": {$avg: "$data.acc_pedal_pos_e"},
  110. "minAccPedalPosE": {$min: "$data.acc_pedal_pos_e"},
  111. "maxAccPedalPosE": {$max: "$data.acc_pedal_pos_e"},
  112.  
  113. "averageAccPedalPosD": {$avg: "$data.acc_pedal_pos_d"},
  114. "minAccPedalPosD": {$min: "$data.acc_pedal_pos_d"},
  115. "maxAccPedalPosD": {$max: "$data.acc_pedal_pos_d"},
  116.  
  117. "averageGpsSpeed": {$avg: "$data.gps_speed"},
  118. "minGpsSpeed": {$min: "$data.gps_speed"},
  119. "maxGpsSpeed": {$max: "$data.gps_speed"},
  120.  
  121. "averageShortTermFuelTrim2": {$avg: "$data.short_term_fuel_trim_2"},
  122. "minShortTermFuelTrim2": {$min: "$data.short_term_fuel_trim_2"},
  123. "maxShortTermFuelTrim2": {$max: "$data.short_term_fuel_trim_2"},
  124.  
  125. "averageO211": {$avg: "$data.o211"},
  126. "minO211": {$min: "$data.o211"},
  127. "maxO211": {$max: "$data.o211"},
  128.  
  129. "averageO212": {$avg: "$data.o212"},
  130. "minO212": {$min: "$data.o212"},
  131. "maxO212": {$max: "$data.o212"},
  132.  
  133. "averageShortTermFuelTrim1": {$avg: "$data.short_term_fuel_trim_1"},
  134. "minShortTermFuelTrim1": {$min: "$data.short_term_fuel_trim_1"},
  135. "maxShortTermFuelTrim1": {$max: "$data.short_term_fuel_trim_1"},
  136.  
  137. "averageMaf": {$avg: "$data.maf"},
  138. "minMaf": {$min: "$data.maf"},
  139. "maxMaf": {$max: "$data.maf"},
  140.  
  141. "averageTimingAdvance": {$avg: "$data.timing_advance"},
  142. "minTimingAdvance": {$min: "$data.timing_advance"},
  143. "maxTimingAdvance": {$max: "$data.timing_advance"},
  144.  
  145. "averageClimb": {$avg: "$data.climb"},
  146. "minClimb": {$min: "$data.climb"},
  147. "maxClimb": {$max: "$data.climb"},
  148.  
  149. "averageFuelPressure": {$avg: "$data.fuel_pressure"},
  150. "minFuelPressure": {$min: "$data.fuel_pressure"},
  151. "maxFuelPressure": {$max: "$data.fuel_pressure"},
  152.  
  153. "averageTemp": {$avg: "$data.temp"},
  154. "minTemp": {$min: "$data.temp"},
  155. "maxTemp": {$max: "$data.temp"},
  156.  
  157. "averageAmbientAirTemp": {$avg: "$data.ambient_air_temp"},
  158. "minAmbientAirTemp": {$min: "$data.ambient_air_temp"},
  159. "maxAmbientAirTemp": {$max: "$data.ambient_air_temp"},
  160.  
  161. "averageManifoldPressure": {$avg: "$data.manifold_pressure"},
  162. "minManifoldPressure": {$min: "$data.manifold_pressure"},
  163. "maxManifoldPressure": {$max: "$data.manifold_pressure"},
  164.  
  165. "averageLongTermFuelTrim1": {$avg: "$data.long_term_fuel_trim_1"},
  166. "minLongTermFuelTrim1": {$min: "$data.long_term_fuel_trim_1"},
  167. "maxLongTermFuelTrim1": {$max: "$data.long_term_fuel_trim_1"},
  168.  
  169. "averageLongTermFuelTrim2": {$avg: "$data.long_term_fuel_trim_2"},
  170. "minLongTermFuelTrim2": {$min: "$data.long_term_fuel_trim_2"},
  171. "maxLongTermFuelTrim2": {$max: "$data.long_term_fuel_trim_2"},
  172.  
  173. "averageGPSAcceleration": {$avg: "$delta.acceleration"},
  174. "minGPSAcceleration": {$min: "$delta.acceleration"},
  175. "maxGPSAcceleration": {$max: "$delta.acceleration"},
  176.  
  177. "averageHeadingChange": {$avg: {$abs: "$delta.angular_velocity"}},
  178. "minHeadingChange": {$min: {$abs: "$delta.angular_velocity"}},
  179. "maxHeadingChange": {$max: {$abs: "$delta.angular_velocity"}},
  180.  
  181. "averageIncline": {$avg: "$data.incline"},
  182. "minIncline": {$min: "$data.incline"},
  183. "maxIncline": {$max: "$data.incline"},
  184.  
  185. "averageAcceleration": {$avg: "$delta.acceleration"},
  186. "minAcceleration": {$min: "$delta.acceleration"},
  187. "maxAcceleration": {$max: "$delta.acceleration"},
  188.  
  189. // "dtcCodes": {"$push": "$data.dtc_status"},
  190. "accountIdArray": {$addToSet: "$account"},
  191.  
  192. "vehicleArray": {$addToSet: "$data.vehicle"},
  193. "driverArray": {$addToSet: "$data.username"},
  194. "driverVehicleArray": {$addToSet: "$driverVehicleId"},
  195.  
  196. "count": {$sum: 1}
  197. }
  198. },
  199. {
  200. $sort: {
  201. "_id": 1 // Finally sort the data based on eventtime in ascending order
  202. }
  203. }
  204. ],
  205. {
  206. allowDiskUse: true
  207. }
  208. );
  209.  

For each driver (or rather driver-vehicle combination) that is identified, the first task is to figure out the last processing time for that driver and find all new records (lines 6 to 22). The next task, aggregating over 15-second windows, is a MongoDB aggregation step starting at line 27. Aggregation tasks in MongoDB are described as a pipeline, where each element of the flow performs a certain task and passes its result on to the next element in the pipe. The first stage matches all records within the time-span that we want to process (lines 29 to 36). Then we only need to consider (i.e. project) the few fields that are of interest to us (lines 38 to 44). The '$group' stage of the pipeline does the actual job of aggregation. The key to this aggregation step is the group-by id that is created using a 'quarter' (line 55), which is simply a number between 0 and 3 derived from the seconds value of the time-stamp. This effectively creates the time windows needed for aggregation.

The actual aggregation steps are quite repetitive. See for example lines 61 to 63, where the average, minimum and maximum load are calculated over each time window. This is repeated for every variable that we want in the feature-set. Before storing them, the results are sorted by event-time (lines 200 to 202).
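As mentioned earlier, other spread statistics can be added to this stage with very little extra work. MongoDB (version 3.2 and later) ships accumulators such as $stdDevSamp for the sample standard deviation, so each block of three accumulators above could simply gain a fourth line. Below is a minimal sketch of the idea, written with pymongo against the same 'mldataset' collection; the group key is shortened here for brevity, and the database name 'obd2' is the one used later in this post.

  1. from pymongo import MongoClient
  2.  
  3. client = MongoClient('localhost', 27017)
  4. db = client['obd2']
  5.  
  6. # Group key shortened to the minute for brevity; the shell script above groups
  7. # on the full (year, month, day, hour, minute, quarter) key.
  8. pipeline = [
  9.     {"$group": {
  10.         "_id": {"minute": {"$minute": "$data.eventTimestamp"}},
  11.         "averageSpeed": {"$avg": "$data.speed"},
  12.         "stdDevSpeed": {"$stdDevSamp": "$data.speed"},  # sample standard deviation
  13.         "stdDevRpm": {"$stdDevSamp": "$data.rpm"}
  14.     }}
  15. ]
  16.  
  17. for doc in db.mldataset.aggregate(pipeline, allowDiskUse=True):
  18.     print(doc)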

Saving the feature-set in a collection

The features thus calculated are saved to a new collection, to which I will later apply a machine-learning algorithm to create a model. The collection is called 'vehicle_signature_records', and the feature-set records are saved to it as follows:

  1. var lastRecordedTimeForDriver = startTimeForDriver;
  2. var insertCounter = 0;
  3. allNewCarReadings.forEach(function (record) {
  4. var currentRecordEventTime = new Date(Date.UTC(record._id.year, record._id.month - 1, record._id.day, record._id.hour, record._id.minute, record._id.quarter * 15, 0));
  5. if (currentRecordEventTime >= lastRecordedTimeForDriver)
  6. lastRecordedTimeForDriver = new Date(Date.UTC(record._id.year, record._id.month - 1, record._id.day, record._id.hour, record._id.minute, 59, 999));
  7.  
  8. record['eventTime'] = currentRecordEventTime;
  9. record['eventId'] = record._id;
  10. delete record._id;
  11. record['accountId'] = record.accountIdArray[0];
  12. delete record.accountIdArray;
  13.  
  14. record['vehicle'] = record.vehicleArray[0];
  15. delete record.vehicleArray;
  16.  
  17. record['driver'] = record.driverArray[0];
  18. delete record.driverArray;
  19.  
  20. record['driverVehicle'] = record.driverVehicleArray[0];
  21. delete record.driverVehicleArray;
  22.  
  23. record.averageGPSLatitude = parseInt((record.averageGPSLatitude * 1000).toFixed(3)) / 1000;
  24. record.averageGPSLongitude = parseInt((record.averageGPSLongitude * 1000).toFixed(3)) / 1000;
  25.  
  26. db.getCollection('vehicle_signature_records').insert(record);
  27. insertCounter += 1;
  28. });

The code above adds a few more fields that identify the driver, the vehicle and the driver-vehicle combination to the result returned by the aggregation (lines 8 to 21) and saves it to the database (line 26). Lines 23 and 24, however, deserve a closer look, since they do something important.

Coding the approximate location of the driver

One of the interesting observations I made while working on this problem is that you can dramatically improve prediction accuracy by encoding the approximate location of the driver. Imagine running this for millions of drivers scattered across the country. Most drivers do most of their driving around one home area, so if that location is somehow encoded into the model, the model can converge quickly based on location. Lines 23 and 24 attempt to do just that: they encode two numbers that represent the approximate latitude and longitude by storing the coordinates with deliberately reduced precision.
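To make the encoding concrete, here is a small Python illustration of the truncation performed by lines 23 and 24 (keep three decimal places of each coordinate, which corresponds to a spatial resolution of roughly a hundred meters):

  1. # Keep three decimal places of a coordinate, essentially what lines 23-24 above do
  2. # (multiply by 1000, truncate, divide back down).
  3. def encode_approximate(coordinate):
  4.     return int(coordinate * 1000) / 1000.0
  5.  
  6. print(encode_approximate(37.386412))    # -> 37.386
  7. print(encode_approximate(-121.960843))  # -> -121.96

The sample record shown further down (averageGPSLatitude 37.386, averageGPSLongitude -121.96) is a direct result of this encoding.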

Some more book-keeping

The final task is to store the book-keeping values.

  1. if (driverIsNew) { // which means this is a new device with no record
  2. db.book_keeping.update(
  3. {_id: 'driver_vehicles_processed_until'},
  4. {$push: {'lastEndTimes': {driver: driverName, endTime: lastRecordedTimeForDriver}}}
  5. );
  6. } else {
  7. var nowDate = new Date();
  8. db.book_keeping.update(
  9. {_id: 'driver_vehicles_processed_until', 'lastEndTimes.driver': driverName},
  10. {$set: {'lastEndTimes.$.endTime': lastRecordedTimeForDriver, 'lastEndTimes.$.driver': driverName}} // lastRecordedTimeForDriver
  11. );
  12. }
  13.  

After doing all this work (by now you may well be exhausted just reading through it), we are finally ready to apply some real machine-learning algorithms. Remember, I said before that 95% of a data scientist's task is preparing, collecting, consolidating and cleaning the data. You have just seen a live example of that!

In big companies there are people called data engineers who would do part of this job, but not everyone is fortunate enough to have data engineers working for them. Besides, if you can do all of this yourself, you are more indispensable to the company you work for, so it makes sense to develop these skills alongside your analysis skills as a data scientist.

Building a Machine Learning Model

Fortunately, the data has been created in a clean way, so no further clean-up is required. Our data lives in a MongoDB collection called 'vehicle_signature_records'. If you are a pure data scientist, the following should be very familiar to you. The only difference between what I am going to do now and what you generally find in books and blogs is the data source: I am going to read my data-sets directly from the MongoDB database instead of from CSV files. Having read this far, you should be reasonably comfortable with MongoDB document structures. If not, don't worry, since all the data we stored in the collection is flat, i.e. all values are present at the top level of each record. To illustrate how the data looks, let me show you one record from the collection.

  1. {
  2. "_id" : ObjectId("5a3028db7984b918e715c2a7"),
  3. "averageGPSLatitude" : 37.386,
  4. "averageGPSLongitude" : -121.96,
  5. "averageLoad" : 24.80392156862745,
  6. "minLoad" : 0.0,
  7. "maxLoad" : 68.62745098039215,
  8. "averageThrottlePosB" : 29.11764705882353,
  9. "minThrottlePosB" : 14.901960784313726,
  10. "maxThrottlePosB" : 38.03921568627451,
  11. "averageRpm" : 1216.25,
  12. "minRpm" : 516.0,
  13. "maxRpm" : 1486.0,
  14. "averageThrottlePos" : 20.49019607843137,
  15. "minThrottlePos" : 11.764705882352942,
  16. "maxThrottlePos" : 36.86274509803921,
  17. "averageIntakeAirTemp" : 85.5,
  18. "minIntakeAirTemp" : 84.0,
  19. "maxIntakeAirTemp" : 86.0,
  20. "averageSpeed" : 13.517712865133625,
  21. "minSpeed" : 0.0,
  22. "maxSpeed" : 24.238657551274084,
  23. "averageAltitude" : -1.575,
  24. "minAltitude" : -1.9,
  25. "maxAltitude" : -1.2,
  26. "averageCommThrottleAc" : 25.392156862745097,
  27. "minCommThrottleAc" : 6.2745098039215685,
  28. "maxCommThrottleAc" : 38.431372549019606,
  29. "averageEngineTime" : 32.25,
  30. "minEngineTime" : 32.0,
  31. "maxEngineTime" : 33.0,
  32. "averageAbsLoad" : 40.3921568627451,
  33. "minAbsLoad" : 18.431372549019606,
  34. "maxAbsLoad" : 64.31372549019608,
  35. "averageGear" : 0.0,
  36. "minGear" : 0.0,
  37. "maxGear" : 0.0,
  38. "averageRelThrottlePos" : 19.019607843137255,
  39. "minRelThrottlePos" : 4.705882352941177,
  40. "maxRelThrottlePos" : 27.84313725490196,
  41. "averageAccPedalPosE" : 14.607843137254902,
  42. "minAccPedalPosE" : 9.411764705882353,
  43. "maxAccPedalPosE" : 19.215686274509803,
  44. "averageAccPedalPosD" : 30.19607843137255,
  45. "minAccPedalPosD" : 18.823529411764707,
  46. "maxAccPedalPosD" : 39.21568627450981,
  47. "averageGpsSpeed" : 6.720000000000001,
  48. "minGpsSpeed" : 0.0,
  49. "maxGpsSpeed" : 12.82,
  50. "averageShortTermFuelTrim2" : -0.5,
  51. "minShortTermFuelTrim2" : -1.0,
  52. "maxShortTermFuelTrim2" : 1.0,
  53. "averageO211" : 9698.5,
  54. "minO211" : 1191.0,
  55. "maxO211" : 27000.0,
  56. "averageO212" : 30349.0,
  57. "minO212" : 28299.0,
  58. "maxO212" : 32499.0,
  59. "averageShortTermFuelTrim1" : -0.25,
  60. "minShortTermFuelTrim1" : -2.0,
  61. "maxShortTermFuelTrim1" : 4.0,
  62. "averageMaf" : 2.4332170200000003,
  63. "minMaf" : 0.77513736,
  64. "maxMaf" : 7.0106280000000005,
  65. "averageTimingAdvance" : 28.0,
  66. "minTimingAdvance" : 16.5,
  67. "maxTimingAdvance" : 41.0,
  68. "averageClimb" : -0.025,
  69. "minClimb" : -0.2,
  70. "maxClimb" : 0.1,
  71. "averageFuelPressure" : null,
  72. "minFuelPressure" : null,
  73. "maxFuelPressure" : null,
  74. "averageTemp" : 199.0,
  75. "minTemp" : 199.0,
  76. "maxTemp" : 199.0,
  77. "averageAmbientAirTemp" : 77.75,
  78. "minAmbientAirTemp" : 77.0,
  79. "maxAmbientAirTemp" : 78.0,
  80. "averageManifoldPressure" : 415.4026475455047,
  81. "minManifoldPressure" : 248.2073910645339,
  82. "maxManifoldPressure" : 592.9398786541643,
  83. "averageLongTermFuelTrim1" : 3.25,
  84. "minLongTermFuelTrim1" : -1.0,
  85. "maxLongTermFuelTrim1" : 7.0,
  86. "averageLongTermFuelTrim2" : -23.5,
  87. "minLongTermFuelTrim2" : -100.0,
  88. "maxLongTermFuelTrim2" : 7.0,
  89. "averageGPSAcceleration" : 1.0196509034930195,
  90. "minGPSAcceleration" : 0.0,
  91. "maxGPSAcceleration" : 1.9128551867763974,
  92. "averageHeadingChange" : 0.006215710862578118,
  93. "minHeadingChange" : 0.0,
  94. "maxHeadingChange" : 0.013477895914941244,
  95. "averageIncline" : null,
  96. "minIncline" : null,
  97. "maxIncline" : null,
  98. "averageAcceleration" : 1.0196509034930195,
  99. "minAcceleration" : 0.0,
  100. "maxAcceleration" : 1.9128551867763974,
  101. "count" : 4.0,
  102. "eventTime" : ISODate("2017-07-18T18:11:30.000+0000"),
  103. "eventId" : {
  104. "year" : NumberInt(2017),
  105. "month" : NumberInt(7),
  106. "day" : NumberInt(18),
  107. "hour" : NumberInt(18),
  108. "minute" : NumberInt(11),
  109. "quarter" : NumberInt(2)
  110. },
  111. "accountId" : "17350",
  112. "vehicle" : "toyota-highlander-2005",
  113. "driver" : "anupam",
  114. "driverVehicle" : 12.0
  115. }

That's quite a number of values for analysis, which is a good sign for us: more values give us more options to play with.

As you may have realized by now, I have come to the final stage of building the model, which is a traditional machine-learning task usually done in Python or R. So the final piece will be written in Python. You will find the entire code in 'driver_signature_build_model_scikit.py' in the 'machinelearning' directory.

Feature selection and elimination

As is common in any data-science project, one must first take a look at the data and determine whether any features need to be eliminated. If some features do not make sense for the model we are building, those features should be dropped. One quick observation is that fuel pressure and incline have nothing to do with driver signatures (notice that they are null in the sample record above), so I will eliminate those values from any further consideration.

Specifically for this problem, you also need to do something special, which is a bit unusual but required in this scenario.

If you look at the features carefully, you will notice that some of them are driver characteristics while others are vehicle characteristics. It is important not to mix up the two sets. I have used my judgement to separate the features into two sets as follows.

  1. vehicle_features = [
  2. "averageLoad",
  3. "minLoad",
  4. "maxLoad",
  5. "averageRpm",
  6. "minRpm",
  7. "maxRpm",
  8. "averageEngineTime",
  9. "minEngineTime",
  10. "maxEngineTime",
  11. "averageAbsLoad",
  12. "minAbsLoad",
  13. "maxAbsLoad",
  14. "averageAccPedalPosE",
  15. "minAccPedalPosE",
  16. "maxAccPedalPosE",
  17. "averageAccPedalPosD",
  18. "minAccPedalPosD",
  19. "maxAccPedalPosD",
  20. "averageShortTermFuelTrim2",
  21. "minShortTermFuelTrim2",
  22. "maxShortTermFuelTrim2",
  23. "averageO211",
  24. "minO211",
  25. "maxO211",
  26. "averageO212",
  27. "minO212",
  28. "maxO212",
  29. "averageShortTermFuelTrim1",
  30. "minShortTermFuelTrim1",
  31. "maxShortTermFuelTrim1",
  32. "averageMaf",
  33. "minMaf",
  34. "maxMaf",
  35. "averageTimingAdvance",
  36. "minTimingAdvance",
  37. "maxTimingAdvance",
  38. "averageTemp",
  39. "minTemp",
  40. "maxTemp",
  41. "averageManifoldPressure",
  42. "minManifoldPressure",
  43. "maxManifoldPressure",
  44. "averageLongTermFuelTrim1",
  45. "minLongTermFuelTrim1",
  46. "maxLongTermFuelTrim1",
  47. "averageLongTermFuelTrim2",
  48. "minLongTermFuelTrim2",
  49. "maxLongTermFuelTrim2"
  50. ]
  51.  
  52. driver_features = [
  53. "averageGPSLatitude",
  54. "averageGPSLongitude",
  55. "averageThrottlePosB",
  56. "minThrottlePosB",
  57. "maxThrottlePosB",
  58. "averageThrottlePos",
  59. "minThrottlePos",
  60. "maxThrottlePos",
  61. "averageIntakeAirTemp",
  62. "minIntakeAirTemp",
  63. "maxIntakeAirTemp",
  64. "averageSpeed",
  65. "minSpeed",
  66. "maxSpeed",
  67. "averageAltitude",
  68. "minAltitude",
  69. "maxAltitude",
  70. "averageCommThrottleAc",
  71. "minCommThrottleAc",
  72. "maxCommThrottleAc",
  73. "averageGear",
  74. "minGear",
  75. "maxGear",
  76. "averageRelThrottlePos",
  77. "minRelThrottlePos",
  78. "maxRelThrottlePos",
  79. "averageGpsSpeed",
  80. "minGpsSpeed",
  81. "maxGpsSpeed",
  82. "averageClimb",
  83. "minClimb",
  84. "maxClimb",
  85. "averageAmbientAirTemp",
  86. "minAmbientAirTemp",
  87. "maxAmbientAirTemp",
  88. "averageGPSAcceleration",
  89. "minGPSAcceleration",
  90. "maxGPSAcceleration",
  91. "averageHeadingChange",
  92. "minHeadingChange",
  93. "maxHeadingChange",
  94. "averageAcceleration",
  95. "minAcceleration",
  96. "maxAcceleration"
  97. ]

Having done this, we now need to build two different models: one to predict the driver and another to predict the vehicle. It will be an interesting exercise to see which of these two models has better accuracy.

Reading directly from database instead of CSV

For completeness' sake, let me first give you the two utility functions that are used to pull data out of the MongoDB database.

  1. # Imports required by the helper functions below
  2. import json
  3. import pandas as pd
  4. from pymongo import MongoClient
  5. from bson import json_util
  6. from pandas.io.json import json_normalize  # in newer pandas: from pandas import json_normalize
  7.  
  8. def _connect_mongo(host, port, username, password, db):
  9.     """ A utility for making a connection to MongoDB """
  10.     if username and password:
  11.         mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
  12.         conn = MongoClient(mongo_uri)
  13.     else:
  14.         conn = MongoClient(host, port)
  15.     return conn[db]
  16.  
  17. def read_mongo(db, collection, query={}, projection='', limit=1000, host='localhost', port=27017, username=None, password=None, no_id=False):
  18.     """ Read from Mongo and Store into DataFrame """
  19.     db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
  20.     cursor = db[collection].find(query, projection).limit(limit)
  21.     datalist = list(cursor)
  22.     sanitized = json.loads(json_util.dumps(datalist))
  23.     normalized = json_normalize(sanitized)
  24.     df = pd.DataFrame(normalized)
  25.  
  26.     return df

The read_mongo function above is Pandas-friendly: it reads data from the MongoDB database and returns a Pandas data-frame, so you can get to work on the machine-learning part immediately.
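For example, a call like the following (illustrative values; the database, collection and field names are the ones used in this post) pulls the aggregated records for one driver into a DataFrame:

  1. # Illustrative call: pull up to 100,000 aggregated records for one driver
  2. # from the 'vehicle_signature_records' collection of the 'obd2' database.
  3. df = read_mongo(
  4.     db='obd2',
  5.     collection='vehicle_signature_records',
  6.     query={'driver': 'anupam'},
  7.     projection={'_id': 0},
  8.     limit=100000,
  9.     host='localhost',
  10.     port=27017
  11. )
  12. print(df.shape)
  13. print(df[['eventTime', 'driver', 'vehicle', 'averageSpeed']].head())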

In case you are not comfortable with MongoDB, I am also giving you the entire dataset of aggregated values in CSV format so that you can import it into any database you wish. The file is in GZIP format, so you may need to unzip it before importing it elsewhere (although pandas can read it directly, as shown below). For those of you who are comfortable with MongoDB, here is the entire database dump.
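If you take the CSV route, note that pandas reads the gzipped file directly, so unzipping is only necessary if your import tool requires a plain CSV. A small sketch (the file name below is hypothetical; use whatever name your download has):

  1. import pandas as pd
  2.  
  3. # Hypothetical file name for the downloaded dump; adjust to match yours.
  4. df = pd.read_csv('vehicle_signature_records.csv.gz', compression='gzip',
  5.                  parse_dates=['eventTime'])  # assuming the export keeps the eventTime column
  6. print(df.shape)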

Building a Machine Learning model

Now it is time to build the learning model. At program invocation, two parameters are needed: the database host and the feature set to build the model for. This is handled in the code as follows:

  1. DATABASE_HOST = argv[0]
  2. CHOSEN_FEATURE_SET = argv[1]
  3.  
  4. readFromDatabase = True
  5. read_and_proceed = False

Then I have some logic for setting the appropriate feature set within the application.

  1. if (CHOSEN_FEATURE_SET == 'vehicle'):
  2.     features = vehicle_features
  3.     feature_name = 'vehicle'
  4.     class_variables = ['vehicle'] # Declare the vehicle as a class variable
  5. elif (CHOSEN_FEATURE_SET == 'driver'):
  6.     features = driver_features
  7.     feature_name = 'driver'
  8.     class_variables = ['driver'] # Declare the driver as a class variable
  9. else:
  10.     features = all_features
  11.     feature_name = 'driverVehicleId'
  12.     class_variables = ['driverVehicleId'] # Declare the driver-vehicle combo as a class variable
  13.  
  14. if readFromDatabase:
  15.     if CHOSEN_FEATURE_SET == 'driver': # Choose the records only for one vehicle which has multiple drivers
  16.         df = read_mongo('obd2', 'vehicle_signature_records', {"vehicle": {"$regex" : ".*gmc-denali.*"}, "eventTime": {"$gte": startTime, "$lte": endTime} }, {"_id": 0}, 1000000, DATABASE_HOST, 27017, None, None, True )
  17.     else:
  18.         df = read_mongo('obd2', 'vehicle_signature_records', {"eventTime": {"$gte": startTime, "$lte": endTime} }, {"_id": 0}, 1000000, DATABASE_HOST, 27017, None, None, True )

The following part is mostly boiler-plate code to break up the dataset into a training set, test set and validation set. While doing so all null values are set to zero as well.

  1. # First randomize the entire dataset
  2. df = df.sample(frac=1).reset_index(drop=True)
  3.  
  4. # Then choose only a small subset of the data, frac=1 means choose everything
  5. df = df.sample(frac=1, replace=True)
  6.  
  7. df.fillna(value=0, inplace=True)
  8.  
  9. train_df, test_df, validate_df = np.split(df, [int(.8*len(df)), int(.9*len(df))])
  10.  
  11. df[feature_name] = df[feature_name].astype('category')
  12.  
  13. y_train = train_df[class_variables]
  14. X_train = train_df.reindex(columns=features)
  15. X_train.replace('NODATA', 0, regex=False, inplace=True)
  16. X_train.fillna(value=0, inplace=True)
  17.  
  18. y_test = test_df[class_variables]
  19. X_test = test_df.reindex(columns=features)
  20. X_test.replace('NODATA', 0, regex=False, inplace=True)
  21. X_test.fillna(value=0, inplace=True)
  22.  
  23. y_validate = validate_df[class_variables]
  24. X_validate = validate_df.reindex(columns=features)
  25. X_validate.replace('NODATA', 0, regex=False, inplace=True)
  26. X_validate.fillna(value=0, inplace=True)

Building a Random Forest Classifier and saving it

After trying out several different classifiers on this dataset, it turns out that a Random Forest classifier gives the best accuracy. Here is a chart showing the accuracy of the different classifiers tried with this data set. The two best algorithms turn out to be Classification and Regression Trees (CART) and the Random Forest classifier. I chose the Random Forest classifier since it is an ensemble technique and should therefore be more resilient.

[Figure: Algorithm comparison - accuracy of the candidate classifiers on this dataset]
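For reference, here is a sketch of how such a comparison can be produced: cross-validate a handful of candidate classifiers on the same training data and compare their mean scores. The exact list of algorithms behind the chart may differ; X_train and y_train are the training split created above.

  1. from sklearn.model_selection import cross_val_score
  2. from sklearn.ensemble import RandomForestClassifier
  3. from sklearn.tree import DecisionTreeClassifier
  4. from sklearn.linear_model import LogisticRegression
  5. from sklearn.neighbors import KNeighborsClassifier
  6. from sklearn.naive_bayes import GaussianNB
  7.  
  8. # Candidate classifiers to compare on the same training data.
  9. candidates = {
  10.     'CART (decision tree)': DecisionTreeClassifier(),
  11.     'Random Forest': RandomForestClassifier(n_estimators=20, random_state=0),
  12.     'Logistic Regression': LogisticRegression(),
  13.     'K-Nearest Neighbours': KNeighborsClassifier(),
  14.     'Naive Bayes': GaussianNB(),
  15. }
  16.  
  17. for name, clf in candidates.items():
  18.     scores = cross_val_score(clf, X_train, y_train.values.ravel(), cv=5)
  19.     print('%s: mean accuracy %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))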

This is what you need to do to build a Random Forest classifier with this dataset.

  1. dt = RandomForestClassifier(n_estimators=20, min_samples_leaf=1, max_depth=20, min_samples_split=2, random_state=0)
  2. dt.fit(X_train, y_train.values.ravel())
  3.  
  4. joblib.dump(dt, model_file)
  5. print('...done. Your Random Forest classifier has been saved in file: ' + model_file)

After building the model, I am saving it in a file (line 4) so that it can be read easily when doing the prediction. To find out how well the model is doing, we have to use the test set to make a prediction and evaluate the model score.
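A quick aside before the evaluation: since the classifier has just been persisted, a later prediction run only needs to load it back. A minimal sketch, reusing the model_file path from above; X_new stands in for a hypothetical feature matrix with the same columns as X_train.

  1. from sklearn.externals import joblib  # on newer scikit-learn versions: import joblib
  2.  
  3. # Reload the persisted Random Forest and score new observations.
  4. loaded_model = joblib.load(model_file)
  5. predictions = loaded_model.predict(X_new)  # X_new is hypothetical here
  6. print(predictions[:10])

With that noted, back to evaluating the model on the test set: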

  1. y_pred = dt.predict(X_test)
  2. y_test_as_matrix = y_test.as_matrix()
  3. print('Completed generating predicted set')
  4.  
  5. print ('Confusion Matrix')
  6. print(confusion_matrix(y_test, y_pred))
  7.  
  8. crossValScore = cross_val_score(dt, X_validate, y_validate)
  9. model_score = dt.score(X_test, y_test_as_matrix)
  10. print('Cross validation score = ' + str(crossValScore))
  11. print('Model score = ' + str(model_score))
  12. print ('Precision, Recall and FScore')
  13. precision, recall, fscore, _ = prf(y_test, y_pred, pos_label=1, average='micro')
  14. print('Precision: ' + str(precision))
  15. print('Recall:' + str(recall))
  16. print('FScore:' + str(fscore))

Many kinds of evaluation metrics are calculated and printed in the code segment above. The most important one that I tend to look at is the overall model score, but the others give you a good idea of the bias and variance, which indicate how resilient your model is to changing values.

Measure of importance

One interesting analysis is to figure out which of the features has the most impact on the result. This can be done with the simple code fragment below:

  1. importance_indices = {}
  2. for z in range(0, len(dt.feature_importances_)):
  3.     importance_indices[z] = dt.feature_importances_[z]
  4.  
  5. sorted_importance_indices = sorted(importance_indices.items(), key=operator.itemgetter(1), reverse=True)
  6.  
  7. for k1 in sorted_importance_indices:
  8.     print(features[int(k1[0])] + ' -> ' + str(k1[1]))

Prediction results and Conclusion

After running the two cases, namely driver prediction and vehicle prediction, I typically get the following scores.

[Figure: Driver prediction results on the Raspberry Pi dataset]

This is encouraging, given that there was always an apprehension that the score would not be accurate enough due to the low frequency of data collection. This matters because the model is built out of instantaneous time derivatives of the readings, and a low sampling rate introduces significant error. The dataset has 13 different driver-vehicle combinations. There isn't a whole lot of driving data beyond the experiments that were done, but with an accuracy of 95% or above, there may be some real value in this approach.

Another interesting finding is that the vehicle prediction comes out more accurate than the driver prediction. In other words, the parameters emitted by the car characterize the car more strongly than the driver. Most drivers drive in broadly the same way, but the mechanical characteristics of each car distinguish it more clearly.

Commercial Use Cases

I have shown you one example of the many applications that can be built with an approach like this. It just involves equipping your car with a smart device like a Raspberry Pi; the rest is all backend, server-side work. Here are all the use-cases I can think of. You can take up any of these as your own project and attempt to find a solution.

  1. Parking assistance
  2. Adaptive collision detection
  3. Video evidence recording
  4. Detect abusive driving
  5. Crash detection
  6. Theft detection
  7. Parking meter
  8. Mobile hot-spot
  9. Voice recognition
  10. Connect racing equipment
  11. Head Unit display
  12. Traffic sign warning
  13. Pattern of usage
  14. Reset fault codes
  15. Driver recognition (this is already demonstrated here!)
  16. Emergency braking alert
  17. Animal overheating protection
  18. Remote start
  19. Remote seatbelt notifications
  20. Radio volume regulation
  21. Auto radio off when window down
  22. Eco-driving optimization alerts
  23. Auto lock/unlock

Commercial product

After doing this experiment, building a Raspberry Pi kit from scratch, I found out that there is a product called AutoPi that you can buy, which cuts short a lot of the hardware setup. I have no affiliation with AutoPi, but I thought it was interesting that this subject is being treated quite seriously by some companies in Europe.

 

 
