Santalytics Part 1

With an impossible task looming, poor old Santa is clueless at best. There are 15,000 kids in this route (Recipient Database.xlsx, Address Database.xlsx) and he’ll have to somehow stitch together all the data he’s gotten from his elves - a dozen log files of what the kids have been up to this year (* Action Log.xlsx). Summarizing these and finding a rating for each kid should help! You can do so by subtracting just how naughty they were from how nice they were throughout the year.

Santa has been doing this for centuries and has used trial and error to build out his naughty and nice ratings for grouping the kids (Naughty or Nice Ratings.xlsx). Using an approximate even distribution while assigning to each of the 25 groups, can you use Alteryx to determine which kids fall into each category this year?

We only have so many days until the Holidays and presents also need to be assigned too! We already know the naughty kids will get coal, but what about the other 20 groupings of kids? The elves do good work, but presents aren’t free - we should probably use the price of each gift to make sure the best kids are getting the best classes of gifts! You can use the price of the gifts to also sort these into 20 evenly distributed groups. Let’s hold off until Santa knows his exact routes to pick the gifts specifically - the Reindeer can only hold so much!

Goal of Part 1:

We want a list of recipients ranked with

  • their Naughty or Nice rating and Score
  • the class of present they are entitled to.

Some R packages we’ll need:

library(fs)
library(sf)
library(here)
library(readxl)
library(Hmisc)
library(magrittr)
library(hrbrthemes)
library(tidyverse)

Grabbing the data from the post:

if (!file.exists(here::here("data/santalytics.zip"))) {
  
  download.file(
    url = "https://community.alteryx.com/pvsmt99345/attachments/pvsmt99345/santalytics2016/2/1/Santalytics%20Part%201.yxzp",
    destfile = here::here("data/santalytics.zip")
  )
  
  unzip(
    zipfile = here::here("data/santalytics.zip"),
    exdir = here::here("data/santalytics")
  )
  
}

Quick look at the file structure:

fs::dir_tree(here::here("data/santalytics"))
/Users/hrbrmstr/projects/santalytics/data/santalytics
├── Data
│   ├── Address Database.xlsx
│   ├── Naughty or Nice Ratings.xlsx
│   ├── Presents.xlsx
│   ├── Recipient Database.xlsx
│   ├── Santa's April Action Log.xlsx
│   ├── Santa's August Action Log.xlsx
│   ├── Santa's December Action Log.xlsx
│   ├── Santa's February Action Log.xlsx
│   ├── Santa's January Action Log.xlsx
│   ├── Santa's July Action Log.xlsx
│   ├── Santa's June Action Log.xlsx
│   ├── Santa's March Action Log.xlsx
│   ├── Santa's May Action Log.xlsx
│   ├── Santa's November Action Log.xlsx
│   ├── Santa's October Action Log.xlsx
│   └── Santa's September Action Log.xlsx
└── Santalytics Part 1.yxmd
ratings <- read_excel(here::here("data/santalytics/Data/Naughty or Nice Ratings.xlsx"))

glimpse(ratings)
## Observations: 25
## Variables: 3
## $ Classification <chr> "Almost got a present this year - but still naughty!",…
## $ Score          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Alignment      <chr> "Naughty", "Naughty", "Naughty", "Naughty", "Naughty",…
ratings
## # A tibble: 25 x 3
##    Classification                                                Score Alignment
##    <chr>                                                         <dbl> <chr>    
##  1 Almost got a present this year - but still naughty!               1 Naughty  
##  2 Naughty                                                           2 Naughty  
##  3 There's no sugar coating it, this is a naughty listing for s…     3 Naughty  
##  4 Bad apple.                                                        4 Naughty  
##  5 Bad to the bone.                                                  5 Naughty  
##  6 Recently moved from naughty list over to my nice list.            6 Nice     
##  7 Nice, but several naughty marks.                                  7 Nice     
##  8 "Even after an \"iffy\" October, still has a \"Nice\" rating…     8 Nice     
##  9 Still very much on nice list, but must continue good behavio…     9 Nice     
## 10 Nice, but has been naughty at times.                             10 Nice     
## # … with 15 more rows
if (!file.exists(here::here("data/actions.rds"))) {
  
  list.files(here::here("data/santalytics/Data/"), pattern = "Action", full.names = TRUE) %>% 
    map_df(read_excel) -> actions
  
  saveRDS(actions, here::here("data/actions.rds"))
  
}

actions <- readRDS(here::here("data/actions.rds"))

glimpse(actions)
## Observations: 1,645,324
## Variables: 5
## $ Date      <dttm> 2016-04-01, 2016-04-01, 2016-04-01, 2016-04-01, 2016-04-01…
## $ ID        <chr> "4", "7", "8", "10", "12", "15", "18", "24", "28", "36", "3…
## $ Action    <chr> "Flicked a booger", "Didn't use an inside voice whilst indo…
## $ Degree    <dbl> 7, 7, 3, 9, 1, 10, 9, 7, 4, 7, 4, 5, 4, 10, 10, 8, 10, 7, 8…
## $ Alignment <chr> "Naughty", "Naughty", "Nice", "Naughty", "Nice", "Naughty",…
actions
## # A tibble: 1,645,324 x 5
##    Date                ID    Action                             Degree Alignment
##    <dttm>              <chr> <chr>                               <dbl> <chr>    
##  1 2016-04-01 00:00:00 4     Flicked a booger                        7 Naughty  
##  2 2016-04-01 00:00:00 7     Didn't use an inside voice whilst…      7 Naughty  
##  3 2016-04-01 00:00:00 8     Stopped to give a stranger direct…      3 Nice     
##  4 2016-04-01 00:00:00 10    Cut someone off                         9 Naughty  
##  5 2016-04-01 00:00:00 12    Said please and thank you               1 Nice     
##  6 2016-04-01 00:00:00 15    Forgot to call mom                     10 Naughty  
##  7 2016-04-01 00:00:00 18    Cut someone off                         9 Naughty  
##  8 2016-04-01 00:00:00 24    Rolling stop                            7 Naughty  
##  9 2016-04-01 00:00:00 28    Made someone smile                      4 Nice     
## 10 2016-04-01 00:00:00 36    Flicked a booger                        7 Naughty  
## # … with 1,645,314 more rows
recipients <- read_excel(here::here("data/santalytics/Data/Recipient Database.xlsx"))

glimpse(recipients)
## Observations: 15,000
## Variables: 3
## $ ID    <chr> "1745", "10305", "11179", "3564", "6447", "1161", "5193", "1555…
## $ First <chr> "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", …
## $ Last  <chr> "Anderson", "Baker", "Bell", "Bennett", "Bowman", "Bradley", "B…
recipients
## # A tibble: 15,000 x 3
##    ID    First Last    
##    <chr> <chr> <chr>   
##  1 1745  Aaron Anderson
##  2 10305 Aaron Baker   
##  3 11179 Aaron Bell    
##  4 3564  Aaron Bennett 
##  5 6447  Aaron Bowman  
##  6 1161  Aaron Bradley 
##  7 5193  Aaron Brooks  
##  8 1555  Aaron Bryant  
##  9 12794 Aaron Carter  
## 10 14869 Aaron Coleman 
## # … with 14,990 more rows
mutate(actions, Degree = ifelse(Alignment == "Naughty", -Degree, Degree)) %>% 
  count(ID, wt = Degree, name = "social_score") %>% 
  mutate(find_out = ifelse(social_score < 0, "naughty", "nice"))  -> surveillance_tally

glimpse(surveillance_tally)
## Observations: 15,000
## Variables: 3
## $ ID           <chr> "1", "10", "100", "1000", "10000", "10001", "10002", "10…
## $ social_score <dbl> 127, 64, 47, 64, 91, 33, 50, 145, 81, 216, -1, 86, 21, 1…
## $ find_out     <chr> "nice", "nice", "nice", "nice", "nice", "nice", "nice", …
surveillance_tally
## # A tibble: 15,000 x 3
##    ID    social_score find_out
##    <chr>        <dbl> <chr>   
##  1 1              127 nice    
##  2 10              64 nice    
##  3 100             47 nice    
##  4 1000            64 nice    
##  5 10000           91 nice    
##  6 10001           33 nice    
##  7 10002           50 nice    
##  8 10003          145 nice    
##  9 10004           81 nice    
## 10 10005          216 nice    
## # … with 14,990 more rows
ggplot(surveillance_tally, aes(social_score)) +
  geom_density(fill = alpha(ft_cols$blue, 3/4)) +
  labs(
    x = "Overall Surveillance Behaviour Score", y = "Density",
    title = "Surveillance Behaviour Score Distribution"
  ) +
  theme_ipsum_es(grid="XY")

filter(surveillance_tally, find_out == "naughty") %>% 
  mutate(grp = cut2(social_score, g = 5) %>% fct_relevel(rev) %>% as.integer()) %>% 
  bind_rows(
    filter(surveillance_tally, find_out == "nice") %>% 
      mutate(grp = cut2(social_score, g = 20) %>% as.integer() %>% add(5)) 
  ) %>% 
  mutate(grp = factor(grp, levels = c(5:1, 6:25))) %>% 
  arrange(social_score, grp) %>% 
  left_join(recipients, "ID") -> surveillance_tally

glimpse(surveillance_tally)
## Observations: 15,000
## Variables: 6
## $ ID           <chr> "13027", "13724", "104", "14658", "11092", "14485", "831…
## $ social_score <dbl> -279, -244, -230, -189, -188, -188, -184, -176, -175, -1…
## $ find_out     <chr> "naughty", "naughty", "naughty", "naughty", "naughty", "…
## $ grp          <fct> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ First        <chr> "Willie", "Susan", "Harold", "Larry", "Evelyn", "Fred", …
## $ Last         <chr> "Carr", "Walker", "Bell", "Allen", "Spencer", "James", "…
surveillance_tally
## # A tibble: 15,000 x 6
##    ID    social_score find_out grp   First     Last   
##    <chr>        <dbl> <chr>    <fct> <chr>     <chr>  
##  1 13027         -279 naughty  5     Willie    Carr   
##  2 13724         -244 naughty  5     Susan     Walker 
##  3 104           -230 naughty  5     Harold    Bell   
##  4 14658         -189 naughty  5     Larry     Allen  
##  5 11092         -188 naughty  5     Evelyn    Spencer
##  6 14485         -188 naughty  5     Fred      James  
##  7 8310          -184 naughty  5     Katherine Barnes 
##  8 3365          -176 naughty  5     William   Perez  
##  9 14262         -175 naughty  5     Wanda     Burns  
## 10 13601         -168 naughty  5     Ruby      Woods  
## # … with 14,990 more rows
count(surveillance_tally, grp, find_out) %>% 
  ggplot(aes(grp, n)) +
  geom_col(aes(fill = find_out)) +
  scale_fill_manual(values = c("naughty" = "black", "nice" = "forestgreen")) +
  theme_ipsum_es(grid="Y")

presents <- read_excel(here::here("data/santalytics/Data/Presents.xlsx"))

glimpse(presents)
## Observations: 2,658
## Variables: 6
## $ Type         <chr> "PHOTO/COMMODITIES", "PHOTO/COMMODITIES", "PHOTO/COMMODI…
## $ Subtype      <chr> "BATTERIES", "BATTERIES", "BATTERIES", "BATTERIES", "BAT…
## $ Manufacturer <chr> "Duracell", "Duracell", "Duracell", "Energizer", "Durace…
## $ Present      <chr> "Duracell - AAA Batteries (4-Pack)", "Duracell - AA 1.5V…
## $ Price        <chr> "5.49", "5.49", "7.49", "4.99", "8.99", "9.99", "7.99", …
## $ Weight       <chr> "0.19", "0.32", "0.65", "0.26", "0.46", "1.29", "0.035",…
presents
## # A tibble: 2,658 x 6
##    Type       Subtype    Manufacturer Present                       Price Weight
##    <chr>      <chr>      <chr>        <chr>                         <chr> <chr> 
##  1 PHOTO/COM… BATTERIES  Duracell     Duracell - AAA Batteries (4-… 5.49  0.19  
##  2 PHOTO/COM… BATTERIES  Duracell     Duracell - AA 1.5V CopperTop… 5.49  0.32  
##  3 PHOTO/COM… BATTERIES  Duracell     Duracell - AA Batteries (8-P… 7.49  0.65  
##  4 PHOTO/COM… BATTERIES  Energizer    Energizer - MAX Batteries AA… 4.99  0.26  
##  5 PHOTO/COM… BATTERIES  Duracell     Duracell - C Batteries (4-Pa… 8.99  0.46  
##  6 PHOTO/COM… BATTERIES  Duracell     Duracell - D Batteries (4-Pa… 9.99  1.29  
##  7 PHOTO/COM… BATTERIES  Duracell     Duracell - 9V Batteries (2-P… 7.99  0.035 
##  8 PHOTO/COM… DSLR DIGI… Canon        Canon - EOS Rebel T5 DSLR Ca… 599.… 6.3   
##  9 INTERACTI… TOY DRONES Protocol     Protocol - TigerJet 3-Channe… 39.99 1.42  
## 10 COMPUTERS  MONITORS   BenQ         "BenQ - 27\" LED HD Monitor … 329.… 19.8  
## # … with 2,648 more rows
presents %>% 
  mutate(
    Price = as.numeric(Price),
    grp = cut2(Price, m = nrow(presents)/20, g = 20) %>% as.integer() %>% add(5)
  ) -> presents

count(presents, grp)
## # A tibble: 20 x 2
##      grp     n
##    <dbl> <int>
##  1     6   160
##  2     7   185
##  3     8    74
##  4     9   115
##  5    10   133
##  6    11   166
##  7    12   125
##  8    13   182
##  9    14    58
## 10    15   138
## 11    16   135
## 12    17   170
## 13    18    88
## 14    19   132
## 15    20   166
## 16    21   119
## 17    22   118
## 18    23   130
## 19    24   132
## 20    25   132

Santalytics Part 2

The Elf thanks you all for participating in Part 1. In fact we are so excited over the level of participation, that we are upping the ante. Stay tuned on that. For now we are onto part 2 and it’s going to get tricky.

With nice kids scattered across the globe, Santa can’t be wasting any time this Holiday season! Identify where all our presents need to make it this year. We’ll have to call on the elves to distribute them to each house, but let’s see if we can’t keep Santa from making any extra trips.

Determine the least number of trade areas we can distribute bunches of presents to while making sure that no two points in a distribution hub are more than 500 miles apart - remember, we only need to worry about including the nice kids who will be getting presents delivered this year. Once your distribution hubs are assigned, what’s the minimum weight that we can use for every one of the hubs while making sure each kid gets a present from the classification of present that they earned? Santa will worry about how many reindeer to hook to the sleigh, but we need to let him know the minimum towage to account for!

Goal of Part 2:

  • Find a list of delivery “hubs” that include every nice kid - with no two kids in a hub being more than 500 miles apart or 250 miles from the central recipient (hub) location
  • Identify the minimum weight that be used to deliver presents (with respect to each present class in that hub) to every hub, excluding presents of 0 or null weight
addresses <- read_excel(here::here("data/santalytics/Data/Address Database.xlsx"))

glimpse(addresses)
## Observations: 15,000
## Variables: 6
## $ `Recipient ID` <chr> "5931", "11472", "10096", "9774", "8885", "5812", "902…
## $ Latitude       <dbl> 30.21144, 32.75870, 31.66144, 35.84026, 34.24749, 36.6…
## $ Longitude      <dbl> 66.04765, 61.65397, 65.66098, 65.23090, 70.87218, 66.8…
## $ City           <chr> "Amānzī", "Anār Darah", "Babasakhib", "Bal Chirāgh", "…
## $ State          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Country        <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis…
addresses
## # A tibble: 15,000 x 6
##    `Recipient ID` Latitude Longitude City           State Country    
##    <chr>             <dbl>     <dbl> <chr>          <chr> <chr>      
##  1 5931               30.2      66.0 Amānzī         <NA>  Afghanistan
##  2 11472              32.8      61.7 Anār Darah     <NA>  Afghanistan
##  3 10096              31.7      65.7 Babasakhib     <NA>  Afghanistan
##  4 9774               35.8      65.2 Bal Chirāgh    <NA>  Afghanistan
##  5 8885               34.2      70.9 Bāsawul        <NA>  Afghanistan
##  6 5812               36.7      66.8 Chimtāl        <NA>  Afghanistan
##  7 9028               36.7      66.8 Chimtāl        <NA>  Afghanistan
##  8 1855               33.4      66.3 Chowṉêy        <NA>  Afghanistan
##  9 6887               37.2      69.4 Dasht-e Qal‘ah <NA>  Afghanistan
## 10 13733              36.0      67.3 Dehī           <NA>  Afghanistan
## # … with 14,990 more rows
filter(surveillance_tally, find_out == "nice") %>% 
  left_join(addresses, c("ID" = "Recipient ID")) %>% 
  st_as_sf(coords = c("Longitude", "Latitude")) %>% 
  st_set_crs(4326) -> nice_addresses
ggplot() + 
  geom_sf(data = nice_addresses, size = 0.125) +
  coord_sf(crs = "+proj=eqearth +wktext", datum = NA) +
  theme_ipsum_es(grid="")

nice_addresses <- st_transform(nice_addresses, crs = "+proj=eqearth +wktext")

if (!all(file.exists(here::here("data", c("hub-members.rds", "hubs.rds"))))) {
  
  naddr <- nice_addresses[, "ID"] # make a copy
  nbuf <- st_buffer(naddr, dist = 250*1609.34) # make circle polygons
  
  hubs <- NULL # will store the hub #'s
  hub_members <- vector("list") # all the hub members
  
  repeat {
    
    message("Addresses left: ", sprintf("%5d", nrow(naddr)),
            " • Hub count: ", sprintf("%3d", nrow(hubs) %||% 0))
    
    close <- st_intersects(naddr, nbuf) # find intersecting points in each circle
    
    opt <- close[[which.max(lengths(close))[[1]]]] # pick the biggest one
    
    st_union(naddr[opt,]) %>%
      st_centroid() %>%
      st_nearest_feature(naddr[opt,]) -> hub # get the point closest to the center of the circle
    
    hubs <- rbind(hubs, naddr[opt,][hub,]) # add to the hubs list
    hub_members <- append(hub_members, list(naddr[opt,])) # add to members
    
    naddr  <- naddr[-(opt),] # reduce the population
    
    if (nrow(naddr) == 0) break # done if none left
    
    nbuf <- st_buffer(naddr, dist = 402336) # make new polygons
    
  }
  
  saveRDS(hubs, here::here("data/hubs.rds"))
  saveRDS(hub_members, here::here("data/hub-members.rds"))
  
}

# Let's take a look at the hubs!

hubs <- readRDS(here::here("data/hubs.rds"))
hub_members <- readRDS(here::here("data/hub-members.rds"))

hulls <- map(hub_members, ~st_union(.x) %>% st_convex_hull())
hulls <- do.call(c, hulls)

# take another look

ggplot() +
  geom_sf(data = st_buffer(hubs, dist = 250*1609.34), fill = NA, size = 0.125, color = "#b2b2b2") +
  geom_sf(data = hulls, fill = "black", size = 0.125) +
  geom_sf(data = nice_addresses, size = 0.125, color = "#b2b2b2") +
  geom_sf(data = hubs, color = "red", size = 0.25) +
  coord_sf(datum = NA)

# bring them back together
# prbly cld have just used data frames (which wld take less time to knit back together)
if (!file.exists(here::here("data/hub-groups.rds"))) {
  
  pb <- progress_estimated(length(hub_members))
  
  map(1:length(hub_members), ~{
    pb$tick()$print()
    mutate(hub_members[[.x]], delivery_group = .x)  
  }) %>% 
    do.call(rbind, .) -> hub_groups
 
  saveRDS(hub_groups, here::here("data/hub-groups.rds")) 
  
}

hub_groups <- readRDS(here::here("data/hub-groups.rds")) 

# add the naughty/nice scores
hub_groups %>% 
  left_join(
    as_tibble(nice_addresses) %>% 
      select(ID, social_score, find_out, grp),
    "ID"
  ) -> hub_groups

# find the max weight present per "nice" group
# i might have misunderstood this section
group_by(presents, grp) %>% 
  top_n(1, wt=Weight) %>% 
  select(grp, Weight) %>% 
  ungroup() %>% 
  mutate(
    Weight = as.numeric(Weight),
    grp = factor(grp, levels = 1:25, labels = c(5:1, 6:25))
  ) -> max_wt

hub_groups %>% 
  left_join(max_wt, "grp") %>% 
  count(delivery_group, wt = Weight, name = "min_weight") %>% 
  as_tibble() %>% 
  select(-geometry) -> min_weights

min_weights
## # A tibble: 336 x 2
##    delivery_group min_weight
##             <int>      <dbl>
##  1              1     19440.
##  2              2     12765.
##  3              3     11002.
##  4              4     10127.
##  5              5      9282.
##  6              6      8784.
##  7              7      8800.
##  8              8      6512.
##  9              9      5420.
## 10             10      6581.
## # … with 326 more rows

Santalytics Part 3

In Part 2 we identified the hubs Santa will visit this season and the minimum weight that can deliver presents to every kid in those hubs with respect to their present score.

But what about maximizing the space of the sled so that it’s full, while accounting for how much weight the reindeer can pull?

Can you help the elves revisit the present assignments for each nice kid now that we know how many reindeer Santa is attaching to the sleigh this year? They want to make sure every kid is getting the biggest and best (priciest then heaviest in priority order) present they earned in their present classes. The kids who behaved the best should be the first to get their presents adjusted - they earned it!

Goal of Part 3:

Determine the exact present distribution of the nice kids without exceeding 422 lbs per hub - prioritize price, then weight and assign to the nicest kids first

Santalytics Part 4

Now that we’ve declared our model as the new Santalytics paradigm, we need to break it down for Santa! He’s not a data guy, after all. Can you help make a visualization that will map out Santa’s route for him? You must use Alteryx for at least some of your process.

Goal of Part 4:

  • Visualize Santa’s trip around the globe