Saturday, 18 August 2018 21:57

Managing the effectiveness of Email campaigns

A Behavior Analysis Problem

Despite the growth of alternative communication channels such as chat applications (WhatsApp, Slack, Snapchat), social websites (Facebook, Google Plus, Twitter), and communication via icons (emojis) and images (Instagram), good old email is not dead yet. Almost all companies still rely on email not only to run their business, but also to market to their customers. This has evolved into a field of study in its own right. The main concern facing any marketing manager is to control the amount of email sent to each customer so that the customer stays interested in the company and does not unsubscribe because of too much junk mail. The balance between too much mail and too little mail is therefore delicate.

We can employ data science to solve this problem. The problem solved here comes from a real company's data and was presented to me as a test, so I thought it would be a good problem to solve and a good demonstration of data science in action. Emailing customers too frequently can cause them to unsubscribe from the email list; it can also lead them to stop opening emails, which in turn can cause promotional emails to be routed to their junk mailbox. The goal is to find the optimum amount of email to send to each customer so that they stay interested over time. Note that this number is different for every customer. We want to predict a customer's propensity to open an email in the future based on observed historical trends, as well as their propensity to unsubscribe if they do open the email.

The Dataset

Before I start describing the dataset, you may be inclined to see it first. To introduce some variety in my solution approach, and since most of the data is tabular in nature, I have decided to use a relational database (MySQL) to store the data and query it, using JOINs where necessary to generate my feature set. You can download the entire dump of the MySQL database and restore it on your local machine to get started.

What data do we have?

We have data in 5 tables as follows:

  1. Campaign Types - A historical list of campaigns with date and campaign type
  2. User Info - a list of user ids along with registration date and the source from where their email was acquired
  3. Sends - a list of dates when an email was sent to the user and a flag indicating if the email was opened
  4. Opens - a list of dates when an email was opened by the subscriber, along with a flag indicating whether they unsubscribed
  5. Events - a list of events pertaining to each user along with counts. The count represents different things depending on the type of event.

Let me describe in detail what each column in the given tables means.

Campaign Types

  • launch_id (Date): Unique identifier of a campaign, by date of send.
  • camptype (String): Type of campaign (email send), where “Heavy Promo”, “Light Promo”, and “Evergreen” (No Promo) refer to the magnitude of the promo offer, and “Primary Messaging”/“Secondary Messaging” refer to the degree to which the promotion is messaged within the email content.

User Info

  • riid (Integer): Unique identifier of a customer.
  • aq_dt (Datetime): Date the customer's email was acquired, as stored in the current email system.
  • sub_source (String): Description of where the customer registered their email.

Sends

  • riid (Integer): Unique identifier of a customer.
  • campaign_send_dt (Datetime): Send date of the campaign.
  • opened (Integer): 0 or 1, indicator of whether or not a given riid opened a given campaign.

Opens

  • riid (Integer): Unique identifier of a customer.
  • campaign_send_dt (Datetime): Send date of the campaign.
  • optout (Integer): 0 or 1, indicator of whether or not a given riid opened a given campaign and unsubscribed.

Events

  • riid (Integer): Unique identifier of a customer.
  • event (String): Name of the event the record is associated with:
      • Open (opened an email)
      • Sent (received an email)
      • Click (clicked an email)
      • Online Purch (made a purchase online)
  • event_captured_dt (Datetime): Date the event was recorded.
  • cnts (Integer): Different aggregate measure depending on the associated event:
      • Open: count of all opens recorded for an individual on a given day.
      • Sent: count of all sends recorded for an individual on a given day.
      • Click: count of all clicks recorded for an individual on a given day.
      • Online Purch: sum total of the number of items purchased by an individual on a given day.

Understanding the dataset

Before we jump into building a predictive model, let us analyze the data to understand it better.

Aggregate open rate

As a first exercise, let us attempt to determine the aggregate open rate. In other words, what percentage of emails sent were actually opened? This is a simple ratio of the count of emails opened to the total count of emails sent. We can compute it in one SQL statement as follows:

  1. mysql> select (select count(*) from sends where opened = 1)/count(*) from sends;
  2. +--------------------------------------------------------+
  3. | (select count(*) from sends where opened = 1)/count(*) |
  4. +--------------------------------------------------------+
  5. | 0.1863 |
  6. +--------------------------------------------------------+
  7. 1 row in set (0.13 sec)

The answer is about 18%.

Aggregate unsubscribe per open rate

Of the people who opened their email, how many got irritated and unsubscribed? Again, this is a simple SQL query on the "opens" table as follows:

  1. mysql> select (select count(*) from opens where optout = 1)/count(*) from opens;
  2. +--------------------------------------------------------+
  3. | (select count(*) from opens where optout = 1)/count(*) |
  4. +--------------------------------------------------------+
  5. | 0.0137 |
  6. +--------------------------------------------------------+
  7. 1 row in set (0.11 sec)

The answer is about 1.4%.
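Incidentally, since both the opened and optout columns are stored as 0/1 flags, each of the two rates above can also be computed with a single aggregate instead of a subquery. A minimal equivalent sketch against the same tables (assuming the flags contain no NULLs):

  SELECT AVG(opened) AS open_rate FROM sends;            -- fraction of sends that were opened
  SELECT AVG(optout) AS optout_per_open_rate FROM opens; -- fraction of opens that led to an unsubscribe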

Aggregate open rate ordered by launch Id

How about calculating the open rate by launch id? Since each campaign is identified by its send date, we can group the results by campaign send date and expect one row per campaign. This is a simple GROUP BY query:

  1. mysql> SELECT t1.campaign_send_dt AS campaign_send_date, t1.total_daily_sends, t2.total_opened,
  2. (t2.total_opened/t1.total_daily_sends) AS ratio FROM (SELECT campaign_send_dt, COUNT(*) AS total_daily_sends FROM sends
  3. GROUP BY campaign_send_dt ORDER BY campaign_send_dt) AS t1
  4. LEFT JOIN (SELECT campaign_send_dt, COUNT(*) AS total_opened FROM sends
  5. WHERE opened = 1 GROUP BY campaign_send_dt ORDER BY campaign_send_dt) AS t2
  6. ON t1.campaign_send_dt = t2.campaign_send_dt;
  7. +---------------------+-------------------+--------------+--------+
  8. | campaign_send_date | total_daily_sends | total_opened | ratio |
  9. +---------------------+-------------------+--------------+--------+
  10. | 2018-01-01 00:00:00 | 1084 | 310 | 0.2860 |
  11. | 2018-01-03 00:00:00 | 4575 | 790 | 0.1727 |
  12. | 2018-01-05 00:00:00 | 2074 | 502 | 0.2420 |
  13. | 2018-01-07 00:00:00 | 2582 | 501 | 0.1940 |
  14. | 2018-01-08 00:00:00 | 1606 | 337 | 0.2098 |
  15. | 2018-01-09 00:00:00 | 4210 | 690 | 0.1639 |
  16. | 2018-01-11 00:00:00 | 3316 | 574 | 0.1731 |
  17. | 2018-01-12 00:00:00 | 1905 | 422 | 0.2215 |
  18. | 2018-01-14 00:00:00 | 1862 | 435 | 0.2336 |
  19. | 2018-01-15 00:00:00 | 3141 | 564 | 0.1796 |
  20. | 2018-01-16 00:00:00 | 762 | 169 | 0.2218 |
  21. | 2018-01-18 00:00:00 | 2709 | 571 | 0.2108 |
  22. | 2018-01-19 00:00:00 | 1916 | 413 | 0.2156 |
  23. | 2018-01-21 00:00:00 | 3189 | 582 | 0.1825 |
  24. | 2018-01-22 00:00:00 | 535 | 63 | 0.1178 |
  25. | 2018-01-23 00:00:00 | 1948 | 474 | 0.2433 |
  26. | 2018-01-25 00:00:00 | 2581 | 479 | 0.1856 |
  27. | 2018-01-26 00:00:00 | 1953 | 413 | 0.2115 |
  28. | 2018-01-28 00:00:00 | 1008 | 244 | 0.2421 |
  29. | 2018-01-30 00:00:00 | 219 | 35 | 0.1598 |
  30. | 2018-01-31 00:00:00 | 6347 | 1057 | 0.1665 |
  31. | 2018-02-01 00:00:00 | 1077 | 274 | 0.2544 |
  32. | 2018-02-02 00:00:00 | 2626 | 436 | 0.1660 |
  33. | 2018-02-04 00:00:00 | 1967 | 396 | 0.2013 |
  34. | 2018-02-06 00:00:00 | 4989 | 883 | 0.1770 |
  35. | 2018-02-08 00:00:00 | 1031 | 238 | 0.2308 |
  36. | 2018-02-09 00:00:00 | 2731 | 461 | 0.1688 |
  37. | 2018-02-11 00:00:00 | 1994 | 424 | 0.2126 |
  38. | 2018-02-12 00:00:00 | 1209 | 348 | 0.2878 |
  39. | 2018-02-13 00:00:00 | 1936 | 384 | 0.1983 |
  40. | 2018-02-14 00:00:00 | 56 | 24 | 0.4286 |
  41. | 2018-02-15 00:00:00 | 4241 | 626 | 0.1476 |
  42. | 2018-02-16 00:00:00 | 1882 | 371 | 0.1971 |
  43. | 2018-02-18 00:00:00 | 2554 | 489 | 0.1915 |
  44. | 2018-02-19 00:00:00 | 5686 | 764 | 0.1344 |
  45. | 2018-02-21 00:00:00 | 467 | 141 | 0.3019 |
  46. | 2018-02-23 00:00:00 | 1888 | 363 | 0.1923 |
  47. | 2018-02-25 00:00:00 | 2938 | 525 | 0.1787 |
  48. | 2018-02-26 00:00:00 | 1128 | 314 | 0.2784 |
  49. | 2018-02-27 00:00:00 | 7569 | 1081 | 0.1428 |
  50. | 2018-02-28 00:00:00 | 2509 | 459 | 0.1829 |
  51. +---------------------+-------------------+--------------+--------+
  52. 41 ROWS IN SET (0.33 sec)

The open rate lies between roughly 12% and 43%. The 43% outlier occurred only once, on a day when very few emails (56) were sent.
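As an aside, the same per-campaign open rates can be obtained without the self-join by using conditional aggregation. Here is a sketch of an equivalent query (the same pattern also works for the sub_source and aq_dt groupings in the next two sections, and it avoids the NULL counts that appear when a group has no opens):

  SELECT campaign_send_dt,
         COUNT(*) AS total_daily_sends,
         SUM(opened) AS total_opened,
         SUM(opened) / COUNT(*) AS ratio
  FROM sends
  GROUP BY campaign_send_dt
  ORDER BY campaign_send_dt;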

Aggregate open rate based on registration source

We could also group the results by other criteria, such as sub_source. This query is a bit longer; here it is below. Note that it uses a WITH clause (common table expression), which is a relatively recent addition to MySQL syntax (MySQL 8.0 and later).

 

  1. mysql> WITH sends_wider AS (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source FROM SENDS
  2. LEFT JOIN user_info ON sends.riid = user_info.riid)
  3. SELECT t1.sub_source AS sub_source, t1.total_daily_sends, t2.total_opened, (t2.total_opened/t1.total_daily_sends) AS ratio
  4. FROM (SELECT sub_source, COUNT(*) AS total_daily_sends FROM sends_wider
  5. GROUP BY sub_source ORDER BY sub_source) AS t1
  6. LEFT JOIN (SELECT sub_source, COUNT(*) AS total_opened FROM sends_wider WHERE opened = 1 GROUP BY sub_source ORDER BY sub_source) AS t2
  7. ON t1.sub_source = t2.sub_source;
  8. +---------------------+-------------------+--------------+--------+
  9. | sub_source | total_daily_sends | total_opened | ratio |
  10. +---------------------+-------------------+--------------+--------+
  11. | Checkout_Online | 26189 | 5091 | 0.1944 |
  12. | eReceipt_Physical | 8750 | 1722 | 0.1968 |
  13. | Facebook CPL_Online | 830 | 125 | 0.1506 |
  14. | Footer_Desktop | 2537 | 555 | 0.2188 |
  15. | Footer_Mobile | 2221 | 407 | 0.1833 |
  16. | Join_Desktop | 1532 | 283 | 0.1847 |
  17. | Join_Mobile | 343 | 78 | 0.2274 |
  18. | Lightbox_Desktop | 15254 | 2712 | 0.1778 |
  19. | Lightbox_Mobile | 4734 | 1029 | 0.2174 |
  20. | My Account_Desktop | 599 | 119 | 0.1987 |
  21. | My Account_Mobile | 166 | 20 | 0.1205 |
  22. | Other_Other | 17098 | 2759 | 0.1614 |
  23. | POS form_Physical | 17542 | 3241 | 0.1848 |
  24. | Sports/Brand_Online | 670 | 150 | 0.2239 |
  25. | Warehouse_Desktop | 999 | 206 | 0.2062 |
  26. | Warehouse_Mobile | 536 | 129 | 0.2407 |
  27. +---------------------+-------------------+--------------+--------+
  28. 16 ROWS IN SET (0.37 sec)

The ratios are expected to average around 18%, and the results look plausible.

Aggregate open rate based on date of acquisition

Another similar criterion for grouping the results is the date of acquisition. Here is the query for that.

 

  1. mysql> WITH sends_wider AS
  2. (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source FROM sends
  3. LEFT JOIN user_info ON sends.riid = user_info.riid) SELECT t1.aq_dt AS aq_dt, t1.total_daily_sends, t2.total_opened,
  4. (t2.total_opened/t1.total_daily_sends) AS ratio
  5. FROM (SELECT aq_dt, COUNT(*) AS total_daily_sends FROM sends_wider GROUP BY aq_dt ORDER BY aq_dt) AS t1
  6. LEFT JOIN (SELECT aq_dt, COUNT(*) AS total_opened FROM sends_wider WHERE opened = 1 GROUP BY aq_dt ORDER BY aq_dt) AS t2
  7. ON t1.aq_dt = t2.aq_dt;
  8. +---------------------+-------------------+--------------+--------+
  9. | aq_dt | total_daily_sends | total_opened | ratio |
  10. +---------------------+-------------------+--------------+--------+
  11. | 2010-01-06 00:00:00 | 1 | NULL | NULL |
  12. | 2010-01-11 00:00:00 | 1 | NULL | NULL |
  13. | 2010-01-13 00:00:00 | 7 | 5 | 0.7143 |
  14. | 2010-01-18 00:00:00 | 5 | NULL | NULL |
  15. | 2010-01-21 00:00:00 | 4 | NULL | NULL |
  16. ...
  17. ...
  18. ...
  19. | 2018-01-17 00:00:00 | 86 | 16 | 0.1860 |
  20. | 2018-01-18 00:00:00 | 38 | 10 | 0.2632 |
  21. | 2018-01-19 00:00:00 | 131 | 26 | 0.1985 |
  22. | 2018-01-20 00:00:00 | 64 | 18 | 0.2813 |
  23. | 2018-01-21 00:00:00 | 44 | 15 | 0.3409 |
  24. | 2018-01-22 00:00:00 | 29 | 3 | 0.1034 |
  25. | 2018-01-23 00:00:00 | 26 | 11 | 0.4231 |
  26. | 2018-01-24 00:00:00 | 25 | 10 | 0.4000 |
  27. | 2018-01-25 00:00:00 | 23 | 12 | 0.5217 |
  28. | 2018-01-26 00:00:00 | 25 | 10 | 0.4000 |
  29. | 2018-01-27 00:00:00 | 24 | 10 | 0.4167 |
  30. | 2018-01-28 00:00:00 | 31 | 6 | 0.1935 |
  31. | 2018-01-29 00:00:00 | 12 | 2 | 0.1667 |
  32. | 2018-01-30 00:00:00 | 25 | 5 | 0.2000 |
  33. | 2018-01-31 00:00:00 | 73 | 18 | 0.2466 |
  34. | 2018-02-01 00:00:00 | 34 | 17 | 0.5000 |
  35. | 2018-02-02 00:00:00 | 29 | 7 | 0.2414 |
  36. | 2018-02-03 00:00:00 | 34 | 11 | 0.3235 |
  37. | 2018-02-04 00:00:00 | 30 | 9 | 0.3000 |
  38. | 2018-02-05 00:00:00 | 25 | 7 | 0.2800 |
  39. | 2018-02-06 00:00:00 | 22 | 11 | 0.5000 |
  40. | 2018-02-07 00:00:00 | 46 | 11 | 0.2391 |
  41. | 2018-02-08 00:00:00 | 64 | 14 | 0.2188 |
  42. | 2018-02-09 00:00:00 | 52 | 16 | 0.3077 |
  43. | 2018-02-10 00:00:00 | 17 | 3 | 0.1765 |
  44. | 2018-02-11 00:00:00 | 10 | 4 | 0.4000 |
  45. | 2018-02-12 00:00:00 | 12 | 5 | 0.4167 |
  46. | 2018-02-13 00:00:00 | 12 | 2 | 0.1667 |
  47. | 2018-02-14 00:00:00 | 11 | 6 | 0.5455 |
  48. | 2018-02-15 00:00:00 | 8 | 4 | 0.5000 |
  49. | 2018-02-16 00:00:00 | 12 | 3 | 0.2500 |
  50. | 2018-02-17 00:00:00 | 10 | 1 | 0.1000 |
  51. | 2018-02-18 00:00:00 | 16 | 2 | 0.1250 |
  52. | 2018-02-19 00:00:00 | 5 | 2 | 0.4000 |
  53. | 2018-02-20 00:00:00 | 11 | 8 | 0.7273 |
  54. | 2018-02-21 00:00:00 | 13 | 3 | 0.2308 |
  55. | 2018-02-22 00:00:00 | 13 | 1 | 0.0769 |
  56. | 2018-02-23 00:00:00 | 13 | 4 | 0.3077 |
  57. | 2018-02-24 00:00:00 | 8 | 1 | 0.1250 |
  58. | 2018-02-25 00:00:00 | 9 | 5 | 0.5556 |
  59. | 2018-02-26 00:00:00 | 1 | NULL | NULL |
  60. +---------------------+-------------------+--------------+--------+
  61. 2815 ROWS IN SET (0.48 sec)

The user data appears to have some gaps (NULL values) for the earlier acquisition dates. This was confirmed: those records were reported to have been migrated from a legacy email system and were not completely correct.

To open or not

So far we have had no need for data-science tools, as most of the analysis was simple enough to handle with SQL. Let us now try a slightly more involved analysis. Before a data scientist attempts any prediction, he or she has to work with some conjectures or hypotheses. These are theories based on intuition, and one must be able to either support or refute them using the data.

One hypothesis is that a customer's propensity to open an email, as well as their propensity to unsubscribe after opening one, may be related to their tenure. To test this theory we need a dataset that has the tenure (the difference between the campaign's send date and the customer's acquisition date) alongside each record in the "sends" table.

How do we get the tenure into the "sends" table? Through a simple JOIN, as follows:

  1. mysql> SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source,
  2. DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info
  3. ON sends.riid = user_info.riid LIMIT 10;
  4. +-----------+---------------------+--------+---------------------+-------------------+--------+
  5. | riid | campaign_send_dt | opened | aq_dt | sub_source | tenure |
  6. +-----------+---------------------+--------+---------------------+-------------------+--------+
  7. | 747246582 | 2018-02-19 00:00:00 | 0 | 2016-12-07 00:00:00 | Lightbox_Desktop | 439 |
  8. | 773178502 | 2018-02-04 00:00:00 | 0 | 2017-02-03 00:00:00 | Join_Desktop | 366 |
  9. | 89069862 | 2018-01-11 00:00:00 | 1 | 2010-10-23 00:00:00 | Other_Other | 2637 |
  10. | 899893722 | 2018-01-15 00:00:00 | 0 | 2017-11-08 00:00:00 | Footer_Mobile | 68 |
  11. | 88695502 | 2018-02-02 00:00:00 | 0 | 2011-03-06 00:00:00 | Other_Other | 2525 |
  12. | 939655902 | 2018-02-27 00:00:00 | 0 | 2017-12-31 00:00:00 | eReceipt_Physical | 58 |
  13. | 587652902 | 2018-02-15 00:00:00 | 0 | 2015-11-12 00:00:00 | Other_Other | 826 |
  14. | 865323662 | 2018-02-11 00:00:00 | 1 | 2017-08-20 00:00:00 | Checkout_Online | 175 |
  15. | 166559382 | 2018-01-19 00:00:00 | 0 | 2014-02-22 00:00:00 | POS form_Physical | 1427 |
  16. | 894975022 | 2018-01-01 00:00:00 | 0 | 2017-10-25 00:00:00 | Lightbox_Desktop | 68 |
  17. +-----------+---------------------+--------+---------------------+-------------------+--------+
  18. 10 ROWS IN SET (0.00 sec)

The statement above prints only 10 rows, but the tenure has been neatly added as an extra column. If our theory is correct, there should be a relationship between the "opened" state and tenure. Since the "opened" state is a discrete variable that takes only two values (0 and 1), we may be able to reveal such a relationship by plotting a histogram of opened emails against tenure ranges.

Since I have already gotten my hands dirty with SQL, let me be a little bold and write a query that plots a sideways histogram using SQL alone. Yes, you were probably expecting a Python or R program to draw a beautiful histogram, but I am feeling a bit lazy, and if SQL can give me a decent graphic, why not?

  1. mysql> SELECT ROUND(tenure, -2) AS tenure_bucket, COUNT(*) AS COUNT, RPAD('', LN(COUNT(*)), '*') AS bar
  2. FROM (SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info
  4. ON sends.riid = user_info.riid WHERE opened = 1) AS histogram
  5. GROUP BY tenure_bucket
  6. ORDER BY tenure_bucket;
  7. +---------------+-------+----------+
  8. | tenure_bucket | COUNT | bar |
  9. +---------------+-------+----------+
  10. | 0 | 932 | ******* |
  11. | 100 | 2237 | ******** |
  12. | 200 | 1507 | ******* |
  13. | 300 | 1529 | ******* |
  14. | 400 | 1510 | ******* |
  15. | 500 | 1295 | ******* |
  16. | 600 | 929 | ******* |
  17. | 700 | 739 | ******* |
  18. | 800 | 830 | ******* |
  19. | 900 | 703 | ******* |
  20. | 1000 | 1486 | ******* |
  21. | 1100 | 623 | ****** |
  22. | 1200 | 409 | ****** |
  23. | 1300 | 378 | ****** |
  24. | 1400 | 349 | ****** |
  25. | 1500 | 512 | ****** |
  26. | 1600 | 374 | ****** |
  27. | 1700 | 309 | ****** |
  28. | 1800 | 254 | ****** |
  29. | 1900 | 369 | ****** |
  30. | 2000 | 275 | ****** |
  31. | 2100 | 160 | ***** |
  32. | 2200 | 191 | ***** |
  33. | 2300 | 220 | ***** |
  34. | 2400 | 116 | ***** |
  35. | 2500 | 102 | ***** |
  36. | 2600 | 110 | ***** |
  37. | 2700 | 80 | **** |
  38. | 2800 | 73 | **** |
  39. | 2900 | 23 | *** |
  40. | 3000 | 2 | * |
  41. +---------------+-------+----------+
  42. 31 ROWS IN SET (0.10 sec)

This SQL query needs some explanation. The inner query ("SELECT sends.riid, sends.campaign_send_dt, sends.opened, user_info.aq_dt, user_info.sub_source, DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure FROM sends LEFT JOIN user_info ON sends.riid = user_info.riid WHERE opened = 1") simply builds a joined table containing only the sends that were opened. The outer query then creates buckets 100 days wide using ROUND(tenure, -2). The column "COUNT" is the number of opens in each bucket, and "bar" is a poor man's histogram whose length is the natural logarithm of the count.

We see that opens are concentrated at low tenures: customers with a short tenure are more likely to open an email, while those with a higher tenure tend not to open theirs. This may be explained by interest in the emails dwindling with time, which I believe is natural to expect.
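One caveat: the histogram above counts opens, not open rates, so it also reflects how many emails went to customers in each tenure bucket. As a sanity check on the trend, here is a sketch of a query (same tables as above) that normalizes by the number of sends in each bucket; the same normalization can be applied to the opt-out histogram in the next section:

  SELECT ROUND(tenure, -2) AS tenure_bucket,
         COUNT(*) AS total_sends,
         SUM(opened) AS total_opened,
         SUM(opened) / COUNT(*) AS open_rate
  FROM (SELECT sends.opened,
               DATEDIFF(sends.campaign_send_dt, user_info.aq_dt) AS tenure
        FROM sends
        LEFT JOIN user_info ON sends.riid = user_info.riid) AS joined
  GROUP BY tenure_bucket
  ORDER BY tenure_bucket;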

To unsubscribe or not

A related question is how tenure affects someone's likelihood of unsubscribing. For that we can use the "opens" table and run a similar query on the "optout" column. This will tell us whether people with a long tenure are more or less likely to unsubscribe. To continue the tradition, I am going to write this query in SQL again.

  1. mysql> SELECT ROUND(tenure, -2) AS tenure_bucket, COUNT(*) AS COUNT, RPAD('', LN(COUNT(*)), '*') AS bar
  2. FROM (SELECT opens.riid, opens.campaign_send_dt, opens.optout, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(opens.campaign_send_dt, user_info.aq_dt) AS tenure FROM opens LEFT JOIN user_info
  4. ON opens.riid = user_info.riid WHERE optout = 1) AS histogram
  5. GROUP BY tenure_bucket
  6. ORDER BY tenure_bucket;
  7. +---------------+-------+--------+
  8. | tenure_bucket | COUNT | bar |
  9. +---------------+-------+--------+
  10. | 0 | 239 | ***** |
  11. | 100 | 288 | ****** |
  12. | 200 | 135 | ***** |
  13. | 300 | 113 | ***** |
  14. | 400 | 125 | ***** |
  15. | 500 | 75 | **** |
  16. | 600 | 52 | **** |
  17. | 700 | 44 | **** |
  18. | 800 | 40 | **** |
  19. | 900 | 39 | **** |
  20. | 1000 | 91 | ***** |
  21. | 1100 | 20 | *** |
  22. | 1200 | 15 | *** |
  23. | 1300 | 10 | ** |
  24. | 1400 | 8 | ** |
  25. | 1500 | 16 | *** |
  26. | 1600 | 8 | ** |
  27. | 1700 | 8 | ** |
  28. | 1800 | 10 | ** |
  29. | 1900 | 9 | ** |
  30. | 2000 | 5 | ** |
  31. | 2100 | 9 | ** |
  32. | 2200 | 7 | ** |
  33. | 2300 | 2 | * |
  34. | 2400 | 1 | |
  35. | 2500 | 2 | * |
  36. | 2700 | 1 | |
  37. | 2800 | 2 | * |
  38. +---------------+-------+--------+
  39. 28 ROWS IN SET (0.05 sec)

The histogram above suggests that people with a short tenure tend to opt out of email campaigns quickly, while those who have been subscribers for a long time hesitate to opt out. Opt-outs are most frequent within roughly the first 200 days of tenure.

Predicting a customer's reaction

I am finally going to solve a real data-science problem with this dataset. For those of you who think I have been dilly-dallying with SQL all this while, this part should be more interesting.

We would like to predict the likelihood that a customer will open an email sent on a given day in the future for a given campaign type. Is it also possible to predict whether a person who opens the email will unsubscribe? Let us first create the dataset and then fit a model on it.

To solve this problem, I would like to create a dataset that has the following columns.

  1. User id (riid)
  2. Date (just for reference, not to be used in dataset)
  3. Tenure on that day
  4. Type of campaign (camptype)
  5. Source of user's email address (sub_source)
  6. Number of times the user opened the email that day
  7. Whether the user unsubscribed from list that day
  8. Sum total of items purchased by an individual on a given day

Most of the information is in the "events" table, but it needs to be augmented with additional information from the other tables, which requires a few more joins. Ideally I would like to do everything in one statement as follows:

 

  1. CREATE TABLE emailcampaign.dataset
  2. SELECT events.riid, events.event, events.event_captured_dt, events.cnts, user_info.aq_dt, user_info.sub_source,
  3. DATEDIFF(events.event_captured_dt, user_info.aq_dt) AS tenure, campaign_types.camptype, opens.optout
  4. FROM events
  5. LEFT JOIN user_info
  6. ON events.riid = user_info.riid
  7. LEFT JOIN campaign_types ON events.event_captured_dt = campaign_types.launch_id
  8. LEFT JOIN opens ON opens.campaign_send_dt = events.event_captured_dt AND opens.riid = events.riid;

But alas, the dataset is too big to do this in one shot. We have about 8.3 million records in the events table! So we have to break the query into several parts and do some indexing in between to make it run faster. Here is the trick.

We do the joins in three stages, joining one table at a time and saving the result to a new table. (This is similar to the MapReduce paradigm, where each MapReduce step writes its result to disk before the next stage begins.) We create three tables called dataset1, dataset2 and dataset3, each bigger than the previous one and pulling in columns from an additional table. The final table, dataset3, is what we will use for the machine learning exercise.

Here is how you will create the first dataset.

  1. mysql> CREATE TABLE emailcampaign.dataset1
  2. -> SELECT events.riid, events.event, events.event_captured_dt, events.cnts,
  3. -> user_info.aq_dt, user_info.sub_source,
  4. -> DATEDIFF(events.event_captured_dt, user_info.aq_dt) AS tenure
  5. -> FROM events
  6. -> LEFT JOIN user_info
  7. -> ON events.riid = user_info.riid;
  8. Query OK, 8358444 ROWS affected (2 MIN 58.12 sec)
  9. Records: 8358444 Duplicates: 0 Warnings: 0

Notice that dataset1 has the same number of rows as the events table. The next stage pulls in more data from the campaign_types table:

  1. mysql> CREATE TABLE emailcampaign.dataset2
  2. -> SELECT dataset1.riid, dataset1.event, dataset1.event_captured_dt, dataset1.cnts,
  3. -> dataset1.aq_dt, dataset1.sub_source,
  4. -> dataset1.tenure, campaign_types.camptype
  5. -> FROM dataset1
  6. -> LEFT JOIN campaign_types
  7. -> ON dataset1.event_captured_dt = campaign_types.launch_id
  8. -> ;
  9. Query OK, 8358444 ROWS affected (2 MIN 59.94 sec)
  10. Records: 8358444 Duplicates: 0 Warnings: 0

Before we do the final join, it is worth making the database more efficient, as this join is quite expensive. So let's create some indices to speed up the lookups. I am going to create four indices as follows:

  1. mysql> CREATE INDEX idx1 ON opens (campaign_send_dt);
  2. Query OK, 0 ROWS affected (0.37 sec)
  3. Records: 0 Duplicates: 0 Warnings: 0
  4.  
  5. mysql> CREATE INDEX idx2 ON dataset2 (event_captured_dt);
  6. Query OK, 0 ROWS affected (20.83 sec)
  7. Records: 0 Duplicates: 0 Warnings: 0
  8.  
  9. mysql> CREATE INDEX idx3 ON opens (riid);
  10. Query OK, 0 ROWS affected (0.25 sec)
  11. Records: 0 Duplicates: 0 Warnings: 0
  12.  
  13. mysql> CREATE INDEX idx4 ON dataset2 (riid);
  14. Query OK, 0 ROWS affected (20.78 sec)
  15. Records: 0 Duplicates: 0 Warnings: 0

The final join brings in data from the opens table into this composite table.
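Before running it, you can sanity-check that MySQL will actually use the new indices by prefixing the join with EXPLAIN; a minimal sketch (the exact plan output depends on your MySQL version):

  EXPLAIN
  SELECT dataset2.riid, opens.optout
  FROM dataset2
  LEFT JOIN opens
  ON opens.campaign_send_dt = dataset2.event_captured_dt AND opens.riid = dataset2.riid;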

  1. mysql> CREATE TABLE emailcampaign.dataset3
  2. -> SELECT dataset2.riid, dataset2.event, dataset2.event_captured_dt, dataset2.cnts,
  3. -> dataset2.aq_dt, dataset2.sub_source,
  4. -> dataset2.tenure, dataset2.camptype, opens.optout
  5. -> FROM dataset2
  6. -> LEFT JOIN opens
  7. -> ON opens.campaign_send_dt = dataset2.event_captured_dt AND opens.riid = dataset2.riid;
  8. Query OK, 8358444 ROWS affected (4 MIN 25.61 sec)
  9. Records: 8358444 Duplicates: 0 Warnings: 0

It turns out that the data came out in a slightly different format than what I had expected, but it is workable. Here is a small sample of the resulting table which will be our starting dataset for the rest of the machine learning exercise.

  1. mysql> SELECT * FROM dataset3 WHERE camptype IS NOT NULL AND optout IS NOT NULL LIMIT 10;
  2. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  3. | riid | event | event_captured_dt | cnts | aq_dt | sub_source | tenure | camptype | optout |
  4. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  5. | 91355862 | SENT | 2018-02-27 00:00:00 | 1 | 2011-02-21 00:00:00 | POS form_Physical | 2563 | Heavy Promo - Dedicated Messaging | 0 |
  6. | 86264082 | SENT | 2018-02-15 00:00:00 | 1 | 2010-04-29 00:00:00 | Other_Other | 2849 | Heavy Promo - Dedicated Messaging | 0 |
  7. | 111704342 | SENT | 2018-01-08 00:00:00 | 1 | 2012-07-28 00:00:00 | POS form_Physical | 1990 | Heavy Promo - Secondary Messaging | 0 |
  8. | 94780402 | SENT | 2018-02-01 00:00:00 | 1 | 2015-03-31 00:00:00 | Checkout_Online | 1038 | Heavy Promo | 0 |
  9. | 91122822 | SENT | 2018-01-31 00:00:00 | 1 | 2011-02-04 00:00:00 | POS form_Physical | 2553 | Heavy Promo - Dedicated Messaging | 0 |
  10. | 115866522 | SENT | 2018-01-25 00:00:00 | 1 | 2012-10-18 00:00:00 | Other_Other | 1925 | Light Promo | 0 |
  11. | 104867102 | SENT | 2018-02-12 00:00:00 | 1 | 2012-05-07 00:00:00 | Other_Other | 2107 | Light Promo - Secondary Messaging | 0 |
  12. | 131624382 | SENT | 2018-02-08 00:00:00 | 1 | 2014-05-31 00:00:00 | Checkout_Online | 1349 | No Promo | 0 |
  13. | 145310062 | SENT | 2018-01-23 00:00:00 | 1 | 2015-03-31 00:00:00 | Checkout_Online | 1029 | Light Promo - Secondary Messaging | 0 |
  14. | 86463002 | SENT | 2018-01-18 00:00:00 | 1 | 2010-05-05 00:00:00 | Other_Other | 2815 | Light Promo - Secondary Messaging | 0 |
  15. +-----------+-------+---------------------+------+---------------------+-------------------+--------+-----------------------------------+--------+
  16. 10 ROWS IN SET (0.00 sec)

I am going to take one more data-preparation step to make my life simpler later. I am about to use a standard machine learning package to process the data, so I am going to export the entire dataset to a file format that is friendly to most packages. Yes, it is possible to read the data from the database directly (and I will show that in a different article some day), but for now I want to solve this problem with the tools already available in the machine learning package, without writing anything complicated.

Therefore I am going to export the entire table into CSV format so that I can read it from a file. The export is done simply through the following statement:

  1. mysql> SELECT riid, event, event_captured_dt, cnts, aq_dt, sub_source, tenure, camptype, optout
  2. -> FROM dataset3 WHERE camptype IS NOT NULL AND optout IS NOT NULL INTO OUTFILE '/tmp/emailcampaign_dataset.csv'
  3. -> FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
  4. Query OK, 8358444 ROWS affected (32.38 sec)

Note that I am only including records where both the camptype and optout columns are non-NULL. The resulting file looks like this:

  1. riid,event,event_captured_dt,cnts,aq_dt,sub_source,tenure,camptype,optout
  2. 91355862,SENT,2018-02-27 00:00:00,1,2011-02-21 00:00:00,POS form_Physical,2563,Heavy Promo - Dedicated Messaging,0
  3. 86264082,SENT,2018-02-15 00:00:00,1,2010-04-29 00:00:00,Other_Other,2849,Heavy Promo - Dedicated Messaging,0
  4. 111704342,SENT,2018-01-08 00:00:00,1,2012-07-28 00:00:00,POS form_Physical,1990,Heavy Promo - Secondary Messaging,0
  5. 94780402,SENT,2018-02-01 00:00:00,1,2015-03-31 00:00:00,Checkout_Online,1038,Heavy Promo,0
  6. 91122822,SENT,2018-01-31 00:00:00,1,2011-02-04 00:00:00,POS form_Physical,2553,Heavy Promo - Dedicated Messaging,0
  7. 115866522,SENT,2018-01-25 00:00:00,1,2012-10-18 00:00:00,Other_Other,1925,Light Promo,0
  8. 104867102,SENT,2018-02-12 00:00:00,1,2012-05-07 00:00:00,Other_Other,2107,Light Promo - Secondary Messaging,0
  9. 131624382,SENT,2018-02-08 00:00:00,1,2014-05-31 00:00:00,Checkout_Online,1349,No Promo,0
  10. 145310062,SENT,2018-01-23 00:00:00,1,2015-03-31 00:00:00,Checkout_Online,1029,Light Promo - Secondary Messaging,0
  11. 86463002,SENT,2018-01-18 00:00:00,1,2010-05-05 00:00:00,Other_Other,2815,Light Promo - Secondary Messaging,0
  12. ...
  13. ...

You can download the CSV file to look at it in its entirety.

Predicting the likelihood that the user will unsubscribe

With the dataset thus created, we are ready to fit a machine learning model to it. We started with about 8.3 million records, but after eliminating records with either a NULL camptype or a NULL optout field, only 186,584 records remain. Our job is to make a prediction based on those. I would like to fit the model using this entire dataset, and for that I am going to use Apache Spark as my toolset. Since the output is binary (0 if the user stays subscribed, 1 if they opt out), I am going to use LogisticRegression to fit the model.
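If you want to verify that record count, a quick count against the composite table built earlier should do; a minimal sketch (assuming the dataset3 table from the previous section):

  SELECT COUNT(*) AS usable_records
  FROM dataset3
  WHERE camptype IS NOT NULL AND optout IS NOT NULL;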

You will find the entire code, written in Java, on GitHub. Clone this repository to follow along.

The prerequisite for running this program is Apache Spark, which you need to download and install on your computer first. I will be using Spark in standalone mode, so there should be no difficulty running it. Make sure that 'spark-submit' is on your $PATH.

I have laid out the repository so that the dataset sits in the 'data' folder under the repository root. After cloning the repository, look at the Java file CustomerOpenResponsePredictor.java. Let us walk through the main() function of this class to see the logic behind it.

  1. SparkSession spark = SparkSession
  2. .builder()
  3. .appName("CustomerOpenResponsePredictor")
  4. .getOrCreate();
  5.  
  6. // Load training data
  7. Dataset allrows = spark.read().format("csv")
  8. .option("header", "true")
  9. .option("inferSchema", "true")
  10. .load("data/emailcampaign_dataset.csv");
  11.  
  12. StringIndexer eventIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("event").setOutputCol("eventIndex");
  13. StringIndexer subsourceIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("sub_source").setOutputCol("sub_sourceIndex");
  14. StringIndexer camptypeIndexer = new StringIndexer().setHandleInvalid("keep").setInputCol("camptype").setOutputCol("camptypeIndex");
  15.  
  16. Dataset indexedRows1 = eventIndexer.fit(allrows).transform(allrows);
  17. Dataset indexedRows2 = subsourceIndexer.fit(indexedRows1).transform(indexedRows1);
  18. Dataset indexedRows = camptypeIndexer.fit(indexedRows2).transform(indexedRows2);
  19.  
  20. indexedRows.printSchema();
  21.  
  22. Dataset[] trainingAndTest = indexedRows.randomSplit(new double[] {0.8, 0.2});
  23.  
  24. Dataset trainingSet = trainingAndTest[0];
  25. Dataset testSet = trainingAndTest[1];

The first task is to create the Spark session, which happens in lines 1-4. In lines 7-10 I load the data from the CSV file. (Note that Spark reads the file in parallel, in chunks, across multiple threads.) The dataset has three fields that are given as strings but are really categorical variables (i.e. enumerations of a fixed set of values). Spark offers a way to encode these through the StringIndexer class. Lines 12-14 define three indexers that convert the columns event, sub_source and camptype into eventIndex, sub_sourceIndex and camptypeIndex respectively. The remaining variables are already numeric, so LogisticRegression can work with them directly. Lines 16-18 perform the actual conversion of these three columns from strings to numeric indices. The next step is to split the data randomly into two sets, one for training and one for validation; I use the randomSplit() function to do that in lines 22-25. After this I have two datasets with an 80%-20% split.

Feature Set and Output Value

To be clear, here are all the parameters in the feature set (the input X vector):

  1. eventIndex - the enumerated version of the type of event
  2. cnts - the count recorded for that event on that day (opens, sends, clicks, or items purchased)
  3. sub_sourceIndex - the enumerated version of the source of email
  4. tenure - the amount of time the user has been a subscriber
  5. camptypeIndex - the enumerated version of the type of campaign

The output (y) is the binary value "optout", which represents whether the user stays subscribed (represented by 0) or unsubscribes (represented by 1).

  1. VectorAssembler assembler = new VectorAssembler()
  2. .setInputCols(new String[]{"eventIndex", "cnts", "sub_sourceIndex", "tenure", "camptypeIndex"})
  3. .setOutputCol("features");
  4.  
  5.  
  6. LogisticRegression lr = new LogisticRegression()
  7. .setMaxIter(10)
  8. .setRegParam(0.3)
  9. .setElasticNetParam(0.8)
  10. .setFeaturesCol("features") // setting features column
  11. .setLabelCol("optout"); // setting label column
  12.  
  13. Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler,lr});
  14.  
  15. PipelineModel lrModel = pipeline.fit(trainingSet);
  16.  
  17. Dataset predictions = lrModel.transform(testSet);
  18.  
  19. // [sub_source, event_captured_dt, aq_dt, sub_sourceIndex, rawPrediction, cnts, probability, camptypeIndex,
  20. // riid, tenure, event, camptype, prediction, eventIndex, optout, features]
  21. int correctCount = 0, allCount = 0;
  22. for (Row r : predictions.select("prediction","optout").collectAsList()) {
  23. if (Math.abs((r.getDouble(0) - (double)r.getInt(1))) < 0.001)
  24. correctCount++;
  25. allCount++;
  26. }
  27.  
  28. System.out.println("Percentage correct = " + 100*((double)correctCount/(double)allCount));
  29.  
  30. spark.stop();

Note that not all columns in the dataset should be fed into the LogisticRegression model builder; only the meaningful columns must be selected. I use the VectorAssembler class to create the feature set. Lines 1-3 create a composite column called "features" consisting of eventIndex, cnts, sub_sourceIndex, tenure and camptypeIndex; these are our input variables. The LogisticRegression estimator is configured in lines 6-11 with a maximum of 10 iterations, a regularization parameter of 0.3 and an elastic-net mixing parameter of 0.8. The input is the "features" column I just created and the label is the "optout" column. Lines 13-15 set up a Spark pipeline; line 15 fits the actual model on the training set. Line 17 applies the model to the test set to make predictions. Lines 21-26 compute the accuracy of the model by comparing each predicted value with the actual optout value.

One point to note about software versions: I am using Spark 2.3.1 and JDK 1.8 to do my work. This combination worked for me. (JDK 10 has some compatibility problems with Spark 2.3, so avoid this combination.)

To compile type the following:

  1. mvn clean package

To run the program type:

  1. bin/run-emailcampaign-learner.sh

Your output is going to look somewhat like this:

  1. bin/run-emailcampaign-learner.sh
  2. Running Email Campaign Learner using SPARK ...
  3. spark-submit --master local[*] --class com.anupambagchi.emailcampaign.CustomerOpenResponsePredictor /Users/anupambagchi/Developer/machine-learning-using-spark/bin/../target/spark.application-1.0-jar-with-dependencies.jar
  4. 18/09/03 20:43:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  5. root
  6. |-- riid: integer (nullable = true)
  7. |-- event: string (nullable = true)
  8. |-- event_captured_dt: timestamp (nullable = true)
  9. |-- cnts: integer (nullable = true)
  10. |-- aq_dt: timestamp (nullable = true)
  11. |-- sub_source: string (nullable = true)
  12. |-- tenure: integer (nullable = true)
  13. |-- camptype: string (nullable = true)
  14. |-- optout: integer (nullable = true)
  15. |-- eventIndex: double (nullable = false)
  16. |-- sub_sourceIndex: double (nullable = false)
  17. |-- camptypeIndex: double (nullable = false)
  18.  
  19. Percentage correct = 98.41830135805131

Since the split between the training and test datasets is random, the results differ slightly on each run. For most of my runs the accuracy of the model varies between 98.2 and 98.5 percent, which is encouraging given how much data we had to crunch to get here. One caveat worth keeping in mind: opt-outs are rare (about 1.4% of opens), so raw accuracy can look high even for a naive model that always predicts "no opt-out"; a metric such as the area under the ROC curve would give a fuller picture.

How to use this information?

Now the big question is: what can we do with this? Since we can predict quite accurately whether a user will unsubscribe, it is best not to send the email to that user. The next step in deploying this model is to keep it running as a web service and, before sending an email, check (through the web service) whether the user is likely to opt out. If so, suppress that email. Awesome! We have found an innovative way to retain customers.

Conclusion

I am not going to disappoint anyone who came to this page expecting nuggets of practical advice and instead found a pile of SQL and mathematics. So here are some tips for actually sending email for your campaigns, reproduced below for your reference.

  • Try sending email on weekends: Based on data of over 9 billion emails (yes, billion with a “b”) provided by Mailchimp, Saturdays and Sundays have notably higher Click Thru Rates (CTRs); this is coupled with said data showing unsubscribes peak on Tuesdays. These rates include both B2B and B2C emails (although given the volume, are presumably skewed towards B2C.)

  • Send very early in the morning: Data layered over Hubspot survey results reveal while recipients report reading email throughout the day, CTRs peak between 6 – 7 am (localized). And while unsubscribes also peak in the morning, they also spiked late in the evening (when readership dropped off).

  • Optimize your email for mobile: a bit of a no-brainer; surveys cited 80% of the respondents read email on mobile devices, highlighting the importance of making sure your email doesn’t look like scrambled eggs on mobile.

  • Include reference data in your email: Make email searchable. Focus group participants report using email inboxes as an archive of ‘elite’, personal data, referring back to it on an informational basis.

  • Use lots of links in your email: While this may be counterintuitive, Zarrella says there is strong correlation between a greater number of links and higher CTRs. Data also shows lower unsubscribes as the number of links increase. (This may be, of course, because the unsubscribe link is tougher to find…)

  • Serialize and label your email: Using continuity and content-based words such as “[this] week’s,” “newsletter,” or “digest” in the subject line leads to the higher CTRs. Conversely, the traditional “spam” words continue to hold true. [Bonus hint: monitor your spam box for common “trigger” words to avoid.]

  • Give your subscribers special access: Focus groups find people like getting offers specific to them, offers with exclusivity built into them. Another no-brainer, but worth repeating.

  • Send email from people they’ve heard of: Be it a celebrity name or a guru, people like receiving emails from names they recognize.

  • Do not be afraid to send too much email: Unsubscribes are notably higher for organizations that send one or two emails per month; as the frequency of emails reaches eight the number declines.

  • Your newest subscribers are your best: While most subscribers opt out shortly after first subscribing to an email, CTRs early on are also at a high – proving the adage “get ‘em while they’re hot.”

  • Ask people to follow you, not forward emails: It’s not just using social media, but using it wisely. Survey data showed about 80% people either never or rarely forward or Tweet commercial email, even with the advent of ‘share’ and ‘tweet’ buttons. Instead, get people to follow you through Facebook, Twitter, etc. driving prospects to subscribe to your email.

  • Make them want to get your emails: 70% of people report reading most or all of their email, and 58% have separate “junk” inboxes. Given that, Zarrella stressed incorporating all the best-practice takeaways detailed above to ensure your message gets to people’s “good” email address, gets read, and gets acted upon.

So that's it folks. You are now an Email marketing ninja. Go out and start practicing some of these ideas.
