With an impossible task looming, poor old Santa is clueless at best. There are 15,000 kids in this route (Recipient Database.xlsx, Address Database.xlsx) and he’ll have to somehow stitch together all the data he’s gotten from his elves - a dozen log files of what the kids have been up to this year (* Action Log.xlsx). Summarizing these and finding a rating for each kid should help! You can do so by subtracting just how naughty they were from how nice they were throughout the year.
Santa has been doing this for centuries and has used trial and error to build out his naughty and nice ratings for grouping the kids (Naughty or Nice Ratings.xlsx). Using an approximately even distribution while assigning to each of the 25 groups, can you use Alteryx to determine which kids fall into each category this year?
We only have so many days until the Holidays, and presents need to be assigned too! We already know the naughty kids will get coal, but what about the other 20 groupings of kids? The elves do good work, but presents aren’t free - we should probably use the price of each gift to make sure the best kids are getting the best classes of gifts! You can also use the price of the gifts to sort these into 20 evenly distributed groups. Let’s hold off on picking specific gifts until Santa knows his exact routes - the Reindeer can only hold so much!
We want a list of recipients ranked with their naughty or nice classification and their assigned present class.
Some R packages we’ll need:
library(fs)
library(sf)
library(here)
library(readxl)
library(Hmisc)
library(magrittr)
library(hrbrthemes)
library(tidyverse)
Grabbing the data from the post:
if (!file.exists(here::here("data/santalytics.zip"))) {
  download.file(
    url = "https://community.alteryx.com/pvsmt99345/attachments/pvsmt99345/santalytics2016/2/1/Santalytics%20Part%201.yxzp",
    destfile = here::here("data/santalytics.zip")
  )
  unzip(
    zipfile = here::here("data/santalytics.zip"),
    exdir = here::here("data/santalytics")
  )
}
Quick look at the file structure:
fs::dir_tree(here::here("data/santalytics"))
/Users/hrbrmstr/projects/santalytics/data/santalytics
├── Data
│ ├── Address Database.xlsx
│ ├── Naughty or Nice Ratings.xlsx
│ ├── Presents.xlsx
│ ├── Recipient Database.xlsx
│ ├── Santa's April Action Log.xlsx
│ ├── Santa's August Action Log.xlsx
│ ├── Santa's December Action Log.xlsx
│ ├── Santa's February Action Log.xlsx
│ ├── Santa's January Action Log.xlsx
│ ├── Santa's July Action Log.xlsx
│ ├── Santa's June Action Log.xlsx
│ ├── Santa's March Action Log.xlsx
│ ├── Santa's May Action Log.xlsx
│ ├── Santa's November Action Log.xlsx
│ ├── Santa's October Action Log.xlsx
│ └── Santa's September Action Log.xlsx
└── Santalytics Part 1.yxmd
ratings <- read_excel(here::here("data/santalytics/Data/Naughty or Nice Ratings.xlsx"))
glimpse(ratings)
## Observations: 25
## Variables: 3
## $ Classification <chr> "Almost got a present this year - but still naughty!",…
## $ Score <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Alignment <chr> "Naughty", "Naughty", "Naughty", "Naughty", "Naughty",…
ratings
## # A tibble: 25 x 3
## Classification Score Alignment
## <chr> <dbl> <chr>
## 1 Almost got a present this year - but still naughty! 1 Naughty
## 2 Naughty 2 Naughty
## 3 There's no sugar coating it, this is a naughty listing for s… 3 Naughty
## 4 Bad apple. 4 Naughty
## 5 Bad to the bone. 5 Naughty
## 6 Recently moved from naughty list over to my nice list. 6 Nice
## 7 Nice, but several naughty marks. 7 Nice
## 8 "Even after an \"iffy\" October, still has a \"Nice\" rating… 8 Nice
## 9 Still very much on nice list, but must continue good behavio… 9 Nice
## 10 Nice, but has been naughty at times. 10 Nice
## # … with 15 more rows
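# read in and row-bind all twelve monthly action logs (cached as an RDS since ~1.6M rows takes a while to pull from Excel)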
if (!file.exists(here::here("data/actions.rds"))) {
  list.files(here::here("data/santalytics/Data/"), pattern = "Action", full.names = TRUE) %>%
    map_df(read_excel) -> actions
  saveRDS(actions, here::here("data/actions.rds"))
}
actions <- readRDS(here::here("data/actions.rds"))
glimpse(actions)
## Observations: 1,645,324
## Variables: 5
## $ Date <dttm> 2016-04-01, 2016-04-01, 2016-04-01, 2016-04-01, 2016-04-01…
## $ ID <chr> "4", "7", "8", "10", "12", "15", "18", "24", "28", "36", "3…
## $ Action <chr> "Flicked a booger", "Didn't use an inside voice whilst indo…
## $ Degree <dbl> 7, 7, 3, 9, 1, 10, 9, 7, 4, 7, 4, 5, 4, 10, 10, 8, 10, 7, 8…
## $ Alignment <chr> "Naughty", "Naughty", "Nice", "Naughty", "Nice", "Naughty",…
actions
## # A tibble: 1,645,324 x 5
## Date ID Action Degree Alignment
## <dttm> <chr> <chr> <dbl> <chr>
## 1 2016-04-01 00:00:00 4 Flicked a booger 7 Naughty
## 2 2016-04-01 00:00:00 7 Didn't use an inside voice whilst… 7 Naughty
## 3 2016-04-01 00:00:00 8 Stopped to give a stranger direct… 3 Nice
## 4 2016-04-01 00:00:00 10 Cut someone off 9 Naughty
## 5 2016-04-01 00:00:00 12 Said please and thank you 1 Nice
## 6 2016-04-01 00:00:00 15 Forgot to call mom 10 Naughty
## 7 2016-04-01 00:00:00 18 Cut someone off 9 Naughty
## 8 2016-04-01 00:00:00 24 Rolling stop 7 Naughty
## 9 2016-04-01 00:00:00 28 Made someone smile 4 Nice
## 10 2016-04-01 00:00:00 36 Flicked a booger 7 Naughty
## # … with 1,645,314 more rows
recipients <- read_excel(here::here("data/santalytics/Data/Recipient Database.xlsx"))
glimpse(recipients)
## Observations: 15,000
## Variables: 3
## $ ID <chr> "1745", "10305", "11179", "3564", "6447", "1161", "5193", "1555…
## $ First <chr> "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", "Aaron", …
## $ Last <chr> "Anderson", "Baker", "Bell", "Bennett", "Bowman", "Bradley", "B…
recipients
## # A tibble: 15,000 x 3
## ID First Last
## <chr> <chr> <chr>
## 1 1745 Aaron Anderson
## 2 10305 Aaron Baker
## 3 11179 Aaron Bell
## 4 3564 Aaron Bennett
## 5 6447 Aaron Bowman
## 6 1161 Aaron Bradley
## 7 5193 Aaron Brooks
## 8 1555 Aaron Bryant
## 9 12794 Aaron Carter
## 10 14869 Aaron Coleman
## # … with 14,990 more rows
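# flip the sign on the Degree for naughty actions so a per-kid sum works out to "nice minus naughty"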
mutate(actions, Degree = ifelse(Alignment == "Naughty", -Degree, Degree)) %>%
  count(ID, wt = Degree, name = "social_score") %>%
  mutate(find_out = ifelse(social_score < 0, "naughty", "nice")) -> surveillance_tally
glimpse(surveillance_tally)
## Observations: 15,000
## Variables: 3
## $ ID <chr> "1", "10", "100", "1000", "10000", "10001", "10002", "10…
## $ social_score <dbl> 127, 64, 47, 64, 91, 33, 50, 145, 81, 216, -1, 86, 21, 1…
## $ find_out <chr> "nice", "nice", "nice", "nice", "nice", "nice", "nice", …
surveillance_tally
## # A tibble: 15,000 x 3
## ID social_score find_out
## <chr> <dbl> <chr>
## 1 1 127 nice
## 2 10 64 nice
## 3 100 47 nice
## 4 1000 64 nice
## 5 10000 91 nice
## 6 10001 33 nice
## 7 10002 50 nice
## 8 10003 145 nice
## 9 10004 81 nice
## 10 10005 216 nice
## # … with 14,990 more rows
ggplot(surveillance_tally, aes(social_score)) +
  geom_density(fill = alpha(ft_cols$blue, 3/4)) +
  labs(
    x = "Overall Surveillance Behaviour Score", y = "Density",
    title = "Surveillance Behaviour Score Distribution"
  ) +
  theme_ipsum_es(grid="XY")
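# split the naughty kids into 5 roughly even score groups (reversed so group 5 lines up with the
# naughtiest ratings entry) and the nice kids into 20 groups (6-25, 25 being the nicest), then
# bring the recipient names back in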
filter(surveillance_tally, find_out == "naughty") %>%
  mutate(grp = cut2(social_score, g = 5) %>% fct_relevel(rev) %>% as.integer()) %>%
  bind_rows(
    filter(surveillance_tally, find_out == "nice") %>%
      mutate(grp = cut2(social_score, g = 20) %>% as.integer() %>% add(5))
  ) %>%
  mutate(grp = factor(grp, levels = c(5:1, 6:25))) %>%
  arrange(social_score, grp) %>%
  left_join(recipients, "ID") -> surveillance_tally
glimpse(surveillance_tally)
## Observations: 15,000
## Variables: 6
## $ ID <chr> "13027", "13724", "104", "14658", "11092", "14485", "831…
## $ social_score <dbl> -279, -244, -230, -189, -188, -188, -184, -176, -175, -1…
## $ find_out <chr> "naughty", "naughty", "naughty", "naughty", "naughty", "…
## $ grp <fct> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ First <chr> "Willie", "Susan", "Harold", "Larry", "Evelyn", "Fred", …
## $ Last <chr> "Carr", "Walker", "Bell", "Allen", "Spencer", "James", "…
surveillance_tally
## # A tibble: 15,000 x 6
## ID social_score find_out grp First Last
## <chr> <dbl> <chr> <fct> <chr> <chr>
## 1 13027 -279 naughty 5 Willie Carr
## 2 13724 -244 naughty 5 Susan Walker
## 3 104 -230 naughty 5 Harold Bell
## 4 14658 -189 naughty 5 Larry Allen
## 5 11092 -188 naughty 5 Evelyn Spencer
## 6 14485 -188 naughty 5 Fred James
## 7 8310 -184 naughty 5 Katherine Barnes
## 8 3365 -176 naughty 5 William Perez
## 9 14262 -175 naughty 5 Wanda Burns
## 10 13601 -168 naughty 5 Ruby Woods
## # … with 14,990 more rows
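# how many kids ended up in each of the 25 groups?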
count(surveillance_tally, grp, find_out) %>%
  ggplot(aes(grp, n)) +
  geom_col(aes(fill = find_out)) +
  scale_fill_manual(values = c("naughty" = "black", "nice" = "forestgreen")) +
  theme_ipsum_es(grid="Y")
presents <- read_excel(here::here("data/santalytics/Data/Presents.xlsx"))
glimpse(presents)
## Observations: 2,658
## Variables: 6
## $ Type <chr> "PHOTO/COMMODITIES", "PHOTO/COMMODITIES", "PHOTO/COMMODI…
## $ Subtype <chr> "BATTERIES", "BATTERIES", "BATTERIES", "BATTERIES", "BAT…
## $ Manufacturer <chr> "Duracell", "Duracell", "Duracell", "Energizer", "Durace…
## $ Present <chr> "Duracell - AAA Batteries (4-Pack)", "Duracell - AA 1.5V…
## $ Price <chr> "5.49", "5.49", "7.49", "4.99", "8.99", "9.99", "7.99", …
## $ Weight <chr> "0.19", "0.32", "0.65", "0.26", "0.46", "1.29", "0.035",…
presents
## # A tibble: 2,658 x 6
## Type Subtype Manufacturer Present Price Weight
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 PHOTO/COM… BATTERIES Duracell Duracell - AAA Batteries (4-… 5.49 0.19
## 2 PHOTO/COM… BATTERIES Duracell Duracell - AA 1.5V CopperTop… 5.49 0.32
## 3 PHOTO/COM… BATTERIES Duracell Duracell - AA Batteries (8-P… 7.49 0.65
## 4 PHOTO/COM… BATTERIES Energizer Energizer - MAX Batteries AA… 4.99 0.26
## 5 PHOTO/COM… BATTERIES Duracell Duracell - C Batteries (4-Pa… 8.99 0.46
## 6 PHOTO/COM… BATTERIES Duracell Duracell - D Batteries (4-Pa… 9.99 1.29
## 7 PHOTO/COM… BATTERIES Duracell Duracell - 9V Batteries (2-P… 7.99 0.035
## 8 PHOTO/COM… DSLR DIGI… Canon Canon - EOS Rebel T5 DSLR Ca… 599.… 6.3
## 9 INTERACTI… TOY DRONES Protocol Protocol - TigerJet 3-Channe… 39.99 1.42
## 10 COMPUTERS MONITORS BenQ "BenQ - 27\" LED HD Monitor … 329.… 19.8
## # … with 2,648 more rows
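# convert Price to numeric and cut the presents into 20 roughly even price classes (6-25) so they line up with the nice kid groups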
presents %>%
  mutate(
    Price = as.numeric(Price),
    grp = cut2(Price, m = nrow(presents)/20, g = 20) %>% as.integer() %>% add(5)
  ) -> presents
count(presents, grp)
## # A tibble: 20 x 2
## grp n
## <dbl> <int>
## 1 6 160
## 2 7 185
## 3 8 74
## 4 9 115
## 5 10 133
## 6 11 166
## 7 12 125
## 8 13 182
## 9 14 58
## 10 15 138
## 11 16 135
## 12 17 170
## 13 18 88
## 14 19 132
## 15 20 166
## 16 21 119
## 17 22 118
## 18 23 130
## 19 24 132
## 20 25 132
The Elf thanks you all for participating in Part 1. In fact, we are so excited over the level of participation that we are upping the ante. Stay tuned on that. For now we are on to Part 2 and it’s going to get tricky.
With nice kids scattered across the globe, Santa can’t be wasting any time this Holiday season! Identify where all our presents need to make it this year. We’ll have to call on the elves to distribute them to each house, but let’s see if we can’t keep Santa from making any extra trips.
Determine the least number of trade areas we can distribute bunches of presents to while making sure that no two points in a distribution hub are more than 500 miles apart - remember, we only need to worry about including the nice kids who will be getting presents delivered this year. Once your distribution hubs are assigned, what’s the minimum weight that we can use for every one of the hubs while making sure each kid gets a present from the classification of present that they earned? Santa will worry about how many reindeer to hook to the sleigh, but we need to let him know the minimum towage to account for!
addresses <- read_excel(here::here("data/santalytics/Data/Address Database.xlsx"))
glimpse(addresses)
## Observations: 15,000
## Variables: 6
## $ `Recipient ID` <chr> "5931", "11472", "10096", "9774", "8885", "5812", "902…
## $ Latitude <dbl> 30.21144, 32.75870, 31.66144, 35.84026, 34.24749, 36.6…
## $ Longitude <dbl> 66.04765, 61.65397, 65.66098, 65.23090, 70.87218, 66.8…
## $ City <chr> "Amānzī", "Anār Darah", "Babasakhib", "Bal Chirāgh", "…
## $ State <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis…
addresses
## # A tibble: 15,000 x 6
## `Recipient ID` Latitude Longitude City State Country
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 5931 30.2 66.0 Amānzī <NA> Afghanistan
## 2 11472 32.8 61.7 Anār Darah <NA> Afghanistan
## 3 10096 31.7 65.7 Babasakhib <NA> Afghanistan
## 4 9774 35.8 65.2 Bal Chirāgh <NA> Afghanistan
## 5 8885 34.2 70.9 Bāsawul <NA> Afghanistan
## 6 5812 36.7 66.8 Chimtāl <NA> Afghanistan
## 7 9028 36.7 66.8 Chimtāl <NA> Afghanistan
## 8 1855 33.4 66.3 Chowṉêy <NA> Afghanistan
## 9 6887 37.2 69.4 Dasht-e Qal‘ah <NA> Afghanistan
## 10 13733 36.0 67.3 Dehī <NA> Afghanistan
## # … with 14,990 more rows
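# keep just the nice kids, attach their addresses, and promote to an sf object (WGS84 lon/lat)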
filter(surveillance_tally, find_out == "nice") %>%
  left_join(addresses, c("ID" = "Recipient ID")) %>%
  st_as_sf(coords = c("Longitude", "Latitude")) %>%
  st_set_crs(4326) -> nice_addresses
ggplot() +
  geom_sf(data = nice_addresses, size = 0.125) +
  coord_sf(crs = "+proj=eqearth +wktext", datum = NA) +
  theme_ipsum_es(grid="")
nice_addresses <- st_transform(nice_addresses, crs = "+proj=eqearth +wktext")
if (!all(file.exists(here::here("data", c("hub-members.rds", "hubs.rds"))))) {
  naddr <- nice_addresses[, "ID"] # make a copy
  nbuf <- st_buffer(naddr, dist = 250*1609.34) # 250-mile-radius circles: any two points inside one are at most 500 miles apart
  hubs <- NULL # will store the hub #'s
  hub_members <- vector("list") # all the hub members
  repeat {
    message("Addresses left: ", sprintf("%5d", nrow(naddr)),
            " • Hub count: ", sprintf("%3d", nrow(hubs) %||% 0))
    close <- st_intersects(naddr, nbuf) # find intersecting points in each circle
    opt <- close[[which.max(lengths(close))[[1]]]] # pick the biggest one
    st_union(naddr[opt,]) %>%
      st_centroid() %>%
      st_nearest_feature(naddr[opt,]) -> hub # get the point closest to the center of the circle
    hubs <- rbind(hubs, naddr[opt,][hub,]) # add to the hubs list
    hub_members <- append(hub_members, list(naddr[opt,])) # add to members
    naddr <- naddr[-(opt),] # reduce the population
    if (nrow(naddr) == 0) break # done if none left
    nbuf <- st_buffer(naddr, dist = 402336) # new 250-mile circles for the remaining addresses
  }
  saveRDS(hubs, here::here("data/hubs.rds"))
  saveRDS(hub_members, here::here("data/hub-members.rds"))
}
# Let's take a look at the hubs!
hubs <- readRDS(here::here("data/hubs.rds"))
hub_members <- readRDS(here::here("data/hub-members.rds"))
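# convex hull around each hub's member points so the coverage shows up on the map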
hulls <- map(hub_members, ~st_union(.x) %>% st_convex_hull())
hulls <- do.call(c, hulls)
# take another look
ggplot() +
  geom_sf(data = st_buffer(hubs, dist = 250*1609.34), fill = NA, size = 0.125, color = "#b2b2b2") +
  geom_sf(data = hulls, fill = "black", size = 0.125) +
  geom_sf(data = nice_addresses, size = 0.125, color = "#b2b2b2") +
  geom_sf(data = hubs, color = "red", size = 0.25) +
  coord_sf(datum = NA)
# bring them back together
# probably could have just used data frames (which would take less time to knit back together)
if (!file.exists(here::here("data/hub-groups.rds"))) {
  pb <- progress_estimated(length(hub_members))
  map(seq_along(hub_members), ~{
    pb$tick()$print()
    mutate(hub_members[[.x]], delivery_group = .x)
  }) %>%
    do.call(rbind, .) -> hub_groups
  saveRDS(hub_groups, here::here("data/hub-groups.rds"))
}
hub_groups <- readRDS(here::here("data/hub-groups.rds"))
# add the naughty/nice scores
hub_groups %>%
  left_join(
    as_tibble(nice_addresses) %>%
      select(ID, social_score, find_out, grp),
    "ID"
  ) -> hub_groups
# find the max weight present per "nice" group
# I might have misunderstood this section
group_by(presents, grp) %>%
  top_n(1, wt = as.numeric(Weight)) %>% # Weight is stored as character, so compare it numerically
  select(grp, Weight) %>%
  ungroup() %>%
  distinct() %>%
  mutate(
    Weight = as.numeric(Weight),
    grp = factor(grp, levels = 1:25, labels = c(5:1, 6:25))
  ) -> max_wt
hub_groups %>%
  left_join(max_wt, "grp") %>%
  count(delivery_group, wt = Weight, name = "min_weight") %>%
  as_tibble() %>%
  select(-geometry) -> min_weights
min_weights
## # A tibble: 336 x 2
## delivery_group min_weight
## <int> <dbl>
## 1 1 19440.
## 2 2 12765.
## 3 3 11002.
## 4 4 10127.
## 5 5 9282.
## 6 6 8784.
## 7 7 8800.
## 8 8 6512.
## 9 9 5420.
## 10 10 6581.
## # … with 326 more rows
In Part 2 we identified the hubs Santa will visit this season and the minimum weight needed to deliver presents to every kid in those hubs with respect to their present score.
But what about maximizing the space of the sled so that it’s full, while accounting for how much weight the reindeer can pull?
Can you help the elves revisit the present assignments for each nice kid now that we know how many reindeer Santa is attaching to the sleigh this year? They want to make sure every kid is getting the biggest and best (priciest then heaviest in priority order) present they earned in their present classes. The kids who behaved the best should be the first to get their presents adjusted - they earned it!
Determine the exact present distribution of the nice kids without exceeding 422 lbs per hub - prioritize price, then weight, and assign to the nicest kids first.
Now that we’ve declared our model as the new Santalytics paradigm, we need to break it down for Santa! He’s not a data guy, after all. Can you help make a visualization that will map out Santa’s route for him? You must use Alteryx for at least some of your process.
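The route map will have to happen in Alteryx, but for the present-distribution piece here’s a minimal greedy sketch of one way to attack it in R with the objects built above. It is not the official solution: it assumes the 422 lbs is a per-hub weight budget, that “biggest and best” means priciest then heaviest within a kid’s earned class, and that kids are served nicest-first; hub_budget, present_menu, nice_kids, and assign_hub are names made up for this sketch, not part of the challenge data.
hub_budget <- 422 # assumed per-hub weight cap (lbs)
# every present in each class, best-first (priciest, then heaviest)
present_menu <- presents %>%
  mutate(Weight = as.numeric(Weight)) %>%
  arrange(grp, desc(Price), desc(Weight)) %>%
  select(grp, Present, Price, Weight)
# nice kids with their hub, ordered nicest-first within each hub
nice_kids <- hub_groups %>%
  as_tibble() %>%
  select(ID, delivery_group, social_score, grp) %>%
  mutate(grp = as.integer(as.character(grp))) %>%
  arrange(delivery_group, desc(social_score))
# greedily hand each kid the best present in their class that still fits the hub budget
assign_hub <- function(kids) {
  remaining <- hub_budget
  picks <- vector("list", nrow(kids))
  for (i in seq_len(nrow(kids))) {
    cand <- filter(present_menu, grp == kids$grp[i], Weight <= remaining) %>%
      select(-grp)
    if (nrow(cand) > 0) {
      picks[[i]] <- cand[1, ]
    } else { # nothing in the class fits; record an empty pick
      picks[[i]] <- tibble(Present = NA_character_, Price = NA_real_, Weight = 0)
    }
    remaining <- remaining - picks[[i]]$Weight
  }
  bind_cols(kids, bind_rows(picks))
}
nice_kids %>%
  group_split(delivery_group) %>%
  map_df(assign_hub) -> present_assignments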