help santa deliver toys with R!
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

325 lines
12 KiB

---
title: "Santalytics"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
## Santalytics Part 1
- Alteryx Original Post: <https://community.alteryx.com/t5/SANTALYTICS-2016/SANTALYTICS-Part-1/m-p/38846#U38846>
- Alteryx Original Solution: <https://community.alteryx.com/t5/Alter-Nation-Blog/SANTALYTICS-Part-1-Solution-and-Behind-the-Data/ba-p/39324>
With an impossible task looming, poor old Santa is clueless at best. There are 15,000 kids in this route (`Recipient Database.xlsx`, `Address Database.xlsx`) and he'll have to somehow stitch together all the data he's gotten from his elves - a dozen log files of what the kids have been up to this year (`* Action Log.xlsx`). Summarizing these and finding a rating for each kid should help! You can do so by subtracting just how naughty they were from how nice they were throughout the year.
Santa has been doing this for centuries and has used trial and error to build out his naughty and nice ratings for grouping the kids (`Naughty or Nice Ratings.xlsx`). Using an approximate even distribution while assigning to each of the 25 groups, can you use Alteryx to determine which kids fall into each category this year?
We only have so many days until the Holidays and presents also need to be assigned too! We already know the naughty kids will get coal, but what about the other 20 groupings of kids? The elves do good work, but presents aren't free - we should probably use the price of each gift to make sure the best kids are getting the best classes of gifts! You can use the price of the gifts to also sort these into 20 evenly distributed groups. Let's hold off until Santa knows his exact routes to pick the gifts specifically - the Reindeer can only hold so much!
### Goal of Part 1:
We want a list of recipients ranked with
- their Naughty or Nice rating and Score
- the class of present they are entitled to.
Some R packages we'll need:
```{r libs, cache = FALSE}
library(fs)
library(sf)
library(here)
library(readxl)
library(Hmisc)
library(magrittr)
library(hrbrthemes)
library(tidyverse)
```
Grabbing the data from the post:
```{r data}
if (!file.exists(here::here("data/santalytics.zip"))) {
download.file(
url = "https://community.alteryx.com/pvsmt99345/attachments/pvsmt99345/santalytics2016/2/1/Santalytics%20Part%201.yxzp",
destfile = here::here("data/santalytics.zip")
)
unzip(
zipfile = here::here("data/santalytics.zip"),
exdir = here::here("data/santalytics")
)
}
```
Quick look at the file structure:
```{r data-files-soverview, comment = ""}
fs::dir_tree(here::here("data/santalytics"))
```
```{r d3}
ratings <- read_excel(here::here("data/santalytics/Data/Naughty or Nice Ratings.xlsx"))
glimpse(ratings)
ratings
```
```{r d1}
if (!file.exists(here::here("data/actions.rds"))) {
list.files(here::here("data/santalytics/Data/"), pattern = "Action", full.names = TRUE) %>%
map_df(read_excel) -> actions
saveRDS(actions, here::here("data/actions.rds"))
}
actions <- readRDS(here::here("data/actions.rds"))
glimpse(actions)
actions
```
```{r d5}
recipients <- read_excel(here::here("data/santalytics/Data/Recipient Database.xlsx"))
glimpse(recipients)
recipients
```
```{r d2}
mutate(actions, Degree = ifelse(Alignment == "Naughty", -Degree, Degree)) %>%
count(ID, wt = Degree, name = "social_score") %>%
mutate(find_out = ifelse(social_score < 0, "naughty", "nice")) -> surveillance_tally
glimpse(surveillance_tally)
surveillance_tally
```
```{r}
ggplot(surveillance_tally, aes(social_score)) +
geom_density(fill = alpha(ft_cols$blue, 3/4)) +
labs(
x = "Overall Surveillance Behaviour Score", y = "Density",
title = "Surveillance Behaviour Score Distribution"
) +
theme_ipsum_es(grid="XY")
```
```{r}
filter(surveillance_tally, find_out == "naughty") %>%
mutate(grp = cut2(social_score, g = 5) %>% fct_relevel(rev) %>% as.integer()) %>%
bind_rows(
filter(surveillance_tally, find_out == "nice") %>%
mutate(grp = cut2(social_score, g = 20) %>% as.integer() %>% add(5))
) %>%
mutate(grp = factor(grp, levels = c(5:1, 6:25))) %>%
arrange(social_score, grp) %>%
left_join(recipients, "ID") -> surveillance_tally
glimpse(surveillance_tally)
surveillance_tally
```
```{r}
count(surveillance_tally, grp, find_out) %>%
ggplot(aes(grp, n)) +
geom_col(aes(fill = find_out)) +
scale_fill_manual(values = c("naughty" = "black", "nice" = "forestgreen")) +
theme_ipsum_es(grid="Y")
```
```{r d4}
presents <- read_excel(here::here("data/santalytics/Data/Presents.xlsx"))
glimpse(presents)
presents
presents %>%
mutate(
Price = as.numeric(Price),
grp = cut2(Price, m = nrow(presents)/20, g = 20) %>% as.integer() %>% add(5)
) -> presents
count(presents, grp)
```
## Santalytics Part 2
- Alteryx Original Post: <https://community.alteryx.com/t5/SANTALYTICS-2016/SANTALYTICS-Part-2/m-p/39424#U39424>
- Alteryx Original Solution: <https://community.alteryx.com/t5/Alter-Nation-Blog/SANTALYTICS-Part-2-Solution-and-Behind-the-Data/ba-p/40030>
The Elf thanks you all for participating in Part 1. In fact we are so excited over the level of participation, that we are upping the ante. Stay tuned on that. For now we are onto part 2 and it's going to get tricky.
With nice kids scattered across the globe, Santa can't be wasting any time this Holiday season! Identify where all our presents need to make it this year. We'll have to call on the elves to distribute them to each house, but let's see if we can't keep Santa from making any extra trips.
Determine the least number of trade areas we can distribute bunches of presents to while making sure that no two points in a distribution hub are more than 500 miles apart - remember, we only need to worry about including the nice kids who will be getting presents delivered this year. Once your distribution hubs are assigned, what's the minimum weight that we can use for every one of the hubs while making sure each kid gets a present from the classification of present that they earned? Santa will worry about how many reindeer to hook to the sleigh, but we need to let him know the minimum towage to account for!
### Goal of Part 2:
- Find a list of delivery "hubs" that include every nice kid - with no two kids in a hub being more than 500 miles apart or 250 miles from the central recipient (hub) location
- Identify the minimum weight that be used to deliver presents (with respect to each present class in that hub) to every hub, excluding presents of 0 or null weight
```{r map}
addresses <- read_excel(here::here("data/santalytics/Data/Address Database.xlsx"))
glimpse(addresses)
addresses
filter(surveillance_tally, find_out == "nice") %>%
left_join(addresses, c("ID" = "Recipient ID")) %>%
st_as_sf(coords = c("Longitude", "Latitude")) %>%
st_set_crs(4326) -> nice_addresses
```
```{r}
ggplot() +
geom_sf(data = nice_addresses, size = 0.125) +
coord_sf(crs = "+proj=eqearth +wktext", datum = NA) +
theme_ipsum_es(grid="")
```
```{r dist}
nice_addresses <- st_transform(nice_addresses, crs = "+proj=eqearth +wktext")
if (!all(file.exists(here::here("data", c("hub-members.rds", "hubs.rds"))))) {
naddr <- nice_addresses[, "ID"] # make a copy
nbuf <- st_buffer(naddr, dist = 250*1609.34) # make circle polygons
hubs <- NULL # will store the hub #'s
hub_members <- vector("list") # all the hub members
repeat {
message("Addresses left: ", sprintf("%5d", nrow(naddr)),
" • Hub count: ", sprintf("%3d", nrow(hubs) %||% 0))
close <- st_intersects(naddr, nbuf) # find intersecting points in each circle
opt <- close[[which.max(lengths(close))[[1]]]] # pick the biggest one
st_union(naddr[opt,]) %>%
st_centroid() %>%
st_nearest_feature(naddr[opt,]) -> hub # get the point closest to the center of the circle
hubs <- rbind(hubs, naddr[opt,][hub,]) # add to the hubs list
hub_members <- append(hub_members, list(naddr[opt,])) # add to members
naddr  <- naddr[-(opt),] # reduce the population
if (nrow(naddr) == 0) break # done if none left
nbuf <- st_buffer(naddr, dist = 402336) # make new polygons
}
saveRDS(hubs, here::here("data/hubs.rds"))
saveRDS(hub_members, here::here("data/hub-members.rds"))
}
# Let's take a look at the hubs!
hubs <- readRDS(here::here("data/hubs.rds"))
hub_members <- readRDS(here::here("data/hub-members.rds"))
hulls <- map(hub_members, ~st_union(.x) %>% st_convex_hull())
hulls <- do.call(c, hulls)
# take another look
ggplot() +
geom_sf(data = st_buffer(hubs, dist = 250*1609.34), fill = NA, size = 0.125, color = "#b2b2b2") +
geom_sf(data = hulls, fill = "black", size = 0.125) +
geom_sf(data = nice_addresses, size = 0.125, color = "#b2b2b2") +
geom_sf(data = hubs, color = "red", size = 0.25) +
coord_sf(datum = NA)
# bring them back together
# prbly cld have just used data frames (which wld take less time to knit back together)
if (!file.exists(here::here("data/hub-groups.rds"))) {
pb <- progress_estimated(length(hub_members))
map(1:length(hub_members), ~{
pb$tick()$print()
mutate(hub_members[[.x]], delivery_group = .x)
}) %>%
do.call(rbind, .) -> hub_groups
saveRDS(hub_groups, here::here("data/hub-groups.rds"))
}
hub_groups <- readRDS(here::here("data/hub-groups.rds"))
# add the naughty/nice scores
hub_groups %>%
left_join(
as_tibble(nice_addresses) %>%
select(ID, social_score, find_out, grp),
"ID"
) -> hub_groups
# find the max weight present per "nice" group
# i might have misunderstood this section
group_by(presents, grp) %>%
top_n(1, wt=Weight) %>%
select(grp, Weight) %>%
ungroup() %>%
mutate(
Weight = as.numeric(Weight),
grp = factor(grp, levels = 1:25, labels = c(5:1, 6:25))
) -> max_wt
hub_groups %>%
left_join(max_wt, "grp") %>%
count(delivery_group, wt = Weight, name = "min_weight") %>%
as_tibble() %>%
select(-geometry) -> min_weights
min_weights
```
## Santalytics Part 3
- Alteryx Original Post: <https://community.alteryx.com/t5/SANTALYTICS-2016/SANTALYTICS-Part-3/m-p/40130#M44>
- Alteryx Original Solution: <https://community.alteryx.com/t5/Alter-Nation-Blog/SANTALYTICS-Part-3-Solution-and-Behind-the-Data/ba-p/40941>
In Part 2 we identified the hubs Santa will visit this season and the minimum weight that can deliver presents to every kid in those hubs with respect to their present score.
But what about maximizing the space of the sled so that it’s full, while accounting for how much weight the reindeer can pull?
Can you help the elves revisit the present assignments for each nice kid now that we know how many reindeer Santa is attaching to the sleigh this year? They want to make sure every kid is getting the biggest and best (priciest then heaviest in priority order) present they earned in their present classes. The kids who behaved the best should be the first to get their presents adjusted - they earned it!
### Goal of Part 3:
Determine the exact present distribution of the nice kids without exceeding 422 lbs per hub - prioritize price, then weight and assign to the nicest kids first
## Santalytics Part 4
- Alteryx Original Post: <https://community.alteryx.com/t5/SANTALYTICS-2016/SANTALYTICS-Part-4/m-p/40944#M53>
- Alteryx Original Solution: <https://community.alteryx.com/t5/Alter-Nation-Blog/SANTALYTICS-Part-4-Solution-and-Series-Finale/ba-p/41631>
Now that we've declared our model as the new Santalytics paradigm, we need to break it down for Santa! He's not a data guy, after all. Can you help make a visualization that will map out Santa's route for him? You must use Alteryx for at least some of your process.
### Goal of Part 4:
- Visualize Santa's trip around the globe