boB Rudis
135af6c1c0
|
4 anni fa | |
---|---|---|
R | 5 anni fa | |
docs | 6 anni fa | |
inst/java | 4 anni fa | |
man | 4 anni fa | |
tests | 6 anni fa | |
.Rbuildignore | 6 anni fa | |
.gitignore | 6 anni fa | |
.travis.yml | 6 anni fa | |
CONDUCT.md | 6 anni fa | |
DESCRIPTION | 4 anni fa | |
LICENSE | 6 anni fa | |
NAMESPACE | 5 anni fa | |
NEWS.md | 4 anni fa | |
README.Rmd | 6 anni fa | |
README.md | 6 anni fa | |
codecov.yml | 6 anni fa | |
cran-comments.md | 6 anni fa | |
sergeant-caffeinated.Rproj | 6 anni fa | |
sergeant.png | 6 anni fa |
README.md
💂☕️ sergeant.caffeinated
Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (JDBC)
NOTE
This is the Java/JDBC-interface to Apache Drill. For non-Java/JDBC, see
the sergeant
package (GitLab;
GitHub).
Description
Drill + sergeant
is (IMO) a streamlined alternative to Spark +
sparklyr
if you don’t need the ML components of Spark (i.e. just need
to query “big data” sources, need to interface with parquet, need to
combine disparate data source types — json, csv, parquet, rdbms - for
aggregation, etc). Drill also has support for spatial queries.
Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streaming workflows that infolve wrangling many tiny JSON files on a daily basis.
Drill can be obtained from https://drill.apache.org/download/ (use
“Direct File Download”). Drill can also be installed via
Docker. For
local installs on Unix-like systems, a common/suggestion location for
the Drill directory is /usr/local/drill
as the install directory.
Drill embedded (started using the $DRILL_BASE_DIR/bin/drill-embedded
script) is a super-easy way to get started playing with Drill on a
single workstation and most of many workflows can “get by” using Drill
this way.
The following functions are implemented:
DBI
(RJDBC)
drill_jdbc
: Connect to Drill using JDBC, enabling use of said idioms. SeeRJDBC
for more info.
NOTE: The DRILL JDBC driver fully-qualified path must be placed in the
DRILL_JDBC_JAR
environment variable. This is best done via
~/.Renviron
for interactive work. i.e.
DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.14.0.jar
dplyr
: (RJDBC)
src_drill_jdbc
: Connect to Drill (using dplyr & RJDBC) + supporting functions
Installation
devtools::install_git("https://gitlab.com/hrbrmstr/sergeant-caffeinated")
# OF
devtools::install_github("hrbrmstr/sergeant-caffeinated")
Usage
library(sergeant.caffeinated)
library(tidyverse)
# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill_jdbc("localhost") #ds
db <- tbl(ds, "cp.`employee.json`")
# without `collect()`:
count(db, gender, marital_status)
## # Source: lazy query [?? x 3]
## # Database: DrillJDBCConnection
## # Groups: gender
## gender marital_status n
## <chr> <chr> <dbl>
## 1 F S 297.
## 2 M M 278.
## 3 M S 276.
## 4 F M 304.
count(db, gender, marital_status) %>% collect()
## # A tibble: 4 x 3
## # Groups: gender [2]
## gender marital_status n
## * <chr> <chr> <dbl>
## 1 F S 297.
## 2 M M 278.
## 3 M S 276.
## 4 F M 304.
group_by(db, position_title) %>%
count(gender) -> tmp2
group_by(db, position_title) %>%
count(gender) %>%
ungroup() %>%
mutate(full_desc=ifelse(gender=="F", "Female", "Male")) %>%
collect() %>%
select(Title=position_title, Gender=full_desc, Count=n)
## # A tibble: 30 x 3
## Title Gender Count
## * <chr> <chr> <dbl>
## 1 President Female 1.
## 2 VP Country Manager Male 3.
## 3 VP Country Manager Female 3.
## 4 VP Information Systems Female 1.
## 5 VP Human Resources Female 1.
## 6 Store Manager Female 13.
## 7 VP Finance Male 1.
## 8 Store Manager Male 11.
## 9 HQ Marketing Female 2.
## 10 HQ Information Systems Female 4.
## # ... with 20 more rows
arrange(db, desc(employee_id)) %>% print(n=20)
## # Source: table<cp.`employee.json`> [?? x 16]
## # Database: DrillJDBCConnection
## # Ordered by: desc(employee_id)
## employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 1156. Kris Stand Kris Stand 18. Store Tempora… 18. 18. 1914-02-02 1998-01-0…
## 2 1155. Vivian Bu… Vivian Burnham 18. Store Tempora… 18. 18. 1914-02-02 1998-01-0…
## 3 1154. Judy Dool… Judy Doolittle 18. Store Tempora… 18. 18. 1914-02-02 1998-01-0…
## 4 1153. Gail Pirn… Gail Pirnie 18. Store Tempora… 18. 18. 1914-02-02 1998-01-0…
## 5 1152. Barbara Y… Barbara Younce 17. Store Permane… 18. 17. 1914-02-02 1998-01-0…
## 6 1151. Burnis Bi… Burnis Biltoft 17. Store Permane… 18. 17. 1914-02-02 1998-01-0…
## 7 1150. Foster De… Foster Detwiler 17. Store Permane… 18. 17. 1914-02-02 1998-01-0…
## 8 1149. Bertha Ci… Bertha Ciruli 17. Store Permane… 18. 17. 1914-02-02 1998-01-0…
## 9 1148. Sharon Bi… Sharon Bishop 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…
## 10 1147. Jacquelin… Jacqueline Cutwright 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…
## 11 1146. Elizabeth… Elizabeth Anderson 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…
## 12 1145. Michael S… Michael Swartwood 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…
## 13 1144. Shirley C… Shirley Curtsinger 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…
## 14 1143. Ana Quick Ana Quick 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…
## 15 1142. Hazel Sou… Hazel Souza 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…
## 16 1141. James Com… James Compagno 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…
## 17 1140. Mona Jara… Mona Jaramillo 13. Store Shift S… 18. 11. 1961-09-24 1998-01-0…
## 18 1139. Jeanette … Jeanette Belsey 12. Store Assista… 18. 11. 1972-05-12 1998-01-0…
## 19 1138. James Eic… James Eichorn 18. Store Tempora… 12. 18. 1914-02-02 1998-01-0…
## 20 1137. Heather G… Heather Geiermann 18. Store Tempora… 12. 18. 1914-02-02 1998-01-0…
## # ... with more rows, and 6 more variables: salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## # marital_status <chr>, gender <chr>, management_role <chr>
mutate(db, position_title=tolower(position_title)) %>%
mutate(salary=as.numeric(salary)) %>%
mutate(gender=ifelse(gender=="F", "Female", "Male")) %>%
mutate(marital_status=ifelse(marital_status=="S", "Single", "Married")) %>%
group_by(supervisor_id) %>%
summarise(underlings_count=n()) %>%
collect()
## # A tibble: 112 x 2
## supervisor_id underlings_count
## * <dbl> <dbl>
## 1 0. 1.
## 2 1. 7.
## 3 5. 9.
## 4 4. 2.
## 5 2. 3.
## 6 20. 2.
## 7 21. 4.
## 8 22. 7.
## 9 6. 4.
## 10 36. 2.
## # ... with 102 more rows
``` ### Test Results
library(sergeant.caffeinated)
library(testthat)
##
## Attaching package: 'testthat'
## The following object is masked from 'package:dplyr':
##
## matches
## The following object is masked from 'package:purrr':
##
## is_null
date()
## [1] "Sun Oct 14 09:31:23 2018"
devtools::test()
## Loading sergeant.caffeinated
## Testing sergeant.caffeinated
## ✔ | OK F W S | Context
##
⠏ | 0 | JDBC
⠋ | 1 | JDBC
⠙ | 2 | JDBC
⠹ | 3 | JDBC
✔ | 3 | JDBC [0.3 s]
##
## ══ Results ════════════════════════════════════════════════════════════════
## Duration: 0.3 s
##
## OK: 3
## Failed: 0
## Warnings: 0
## Skipped: 0
sergeant Metrics
Lang | # Files | (%) | LoC | (%) | Blank lines | (%) | # Lines | (%) |
---|---|---|---|---|---|---|---|---|
R | 7 | 0.88 | 302 | 0.9 | 78 | 0.68 | 164 | 0.77 |
Rmd | 1 | 0.12 | 35 | 0.1 | 37 | 0.32 | 48 | 0.23 |
Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.