選択できるのは25トピックまでです。 トピックは、先頭が英数字で、英数字とダッシュ('-')を使用した35文字以内のものにしてください。
boB Rudis 135af6c1c0
4j
4年前
R Better query error messages 5年前
docs docs 6年前
inst/java 4j 4年前
man 4j 4年前
tests JDBC interface 6年前
.Rbuildignore JDBC interface 6年前
.gitignore JDBC interface 6年前
.travis.yml JDBC interface 6年前
CONDUCT.md JDBC interface 6年前
DESCRIPTION 4j 4年前
LICENSE JDBC interface 6年前
NAMESPACE Better query error messages 5年前
NEWS.md 4j 4年前
README.Rmd JDBC interface 6年前
README.md docs 6年前
codecov.yml JDBC interface 6年前
cran-comments.md JDBC interface 6年前
sergeant-caffeinated.Rproj JDBC interface 6年前
sergeant.png JDBC interface 6年前

README.md

DOI Travis-CI BuildStatus CoverageStatus CRAN\_Status\_Badge

💂️ sergeant.caffeinated

Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (JDBC)

NOTE

This is the Java/JDBC-interface to Apache Drill. For non-Java/JDBC, see the sergeant package (GitLab; GitHub).

Description

Drill + sergeant is (IMO) a streamlined alternative to Spark + sparklyr if you don’t need the ML components of Spark (i.e. just need to query “big data” sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries.

Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streaming workflows that infolve wrangling many tiny JSON files on a daily basis.

Drill can be obtained from https://drill.apache.org/download/ (use “Direct File Download”). Drill can also be installed via Docker. For local installs on Unix-like systems, a common/suggestion location for the Drill directory is /usr/local/drill as the install directory.

Drill embedded (started using the $DRILL_BASE_DIR/bin/drill-embedded script) is a super-easy way to get started playing with Drill on a single workstation and most of many workflows can “get by” using Drill this way.

The following functions are implemented:

DBI (RJDBC)

  • drill_jdbc: Connect to Drill using JDBC, enabling use of said idioms. See RJDBC for more info.

NOTE: The DRILL JDBC driver fully-qualified path must be placed in the DRILL_JDBC_JAR environment variable. This is best done via ~/.Renviron for interactive work. i.e. DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.14.0.jar

dplyr: (RJDBC)

  • src_drill_jdbc: Connect to Drill (using dplyr & RJDBC) + supporting functions

Installation

devtools::install_git("https://gitlab.com/hrbrmstr/sergeant-caffeinated")
# OF
devtools::install_github("hrbrmstr/sergeant-caffeinated")

Usage

library(sergeant.caffeinated)
library(tidyverse)

# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill_jdbc("localhost")  #ds
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:
count(db, gender, marital_status)
## # Source:   lazy query [?? x 3]
## # Database: DrillJDBCConnection
## # Groups:   gender
##   gender marital_status     n
##   <chr>  <chr>          <dbl>
## 1 F      S               297.
## 2 M      M               278.
## 3 M      S               276.
## 4 F      M               304.

count(db, gender, marital_status) %>% collect()
## # A tibble: 4 x 3
## # Groups:   gender [2]
##   gender marital_status     n
## * <chr>  <chr>          <dbl>
## 1 F      S               297.
## 2 M      M               278.
## 3 M      S               276.
## 4 F      M               304.

group_by(db, position_title) %>% 
  count(gender) -> tmp2

group_by(db, position_title) %>% 
  count(gender) %>% 
  ungroup() %>% 
  mutate(full_desc=ifelse(gender=="F", "Female", "Male")) %>% 
  collect() %>% 
  select(Title=position_title, Gender=full_desc, Count=n)
## # A tibble: 30 x 3
##    Title                  Gender Count
##  * <chr>                  <chr>  <dbl>
##  1 President              Female    1.
##  2 VP Country Manager     Male      3.
##  3 VP Country Manager     Female    3.
##  4 VP Information Systems Female    1.
##  5 VP Human Resources     Female    1.
##  6 Store Manager          Female   13.
##  7 VP Finance             Male      1.
##  8 Store Manager          Male     11.
##  9 HQ Marketing           Female    2.
## 10 HQ Information Systems Female    4.
## # ... with 20 more rows

arrange(db, desc(employee_id)) %>% print(n=20)
## # Source:     table<cp.`employee.json`> [?? x 16]
## # Database:   DrillJDBCConnection
## # Ordered by: desc(employee_id)
##    employee_id full_name  first_name last_name  position_id position_title store_id department_id birth_date hire_date 
##          <dbl> <chr>      <chr>      <chr>            <dbl> <chr>             <dbl>         <dbl> <chr>      <chr>     
##  1       1156. Kris Stand Kris       Stand              18. Store Tempora…      18.           18. 1914-02-02 1998-01-0…
##  2       1155. Vivian Bu… Vivian     Burnham            18. Store Tempora…      18.           18. 1914-02-02 1998-01-0…
##  3       1154. Judy Dool… Judy       Doolittle          18. Store Tempora…      18.           18. 1914-02-02 1998-01-0…
##  4       1153. Gail Pirn… Gail       Pirnie             18. Store Tempora…      18.           18. 1914-02-02 1998-01-0…
##  5       1152. Barbara Y… Barbara    Younce             17. Store Permane…      18.           17. 1914-02-02 1998-01-0…
##  6       1151. Burnis Bi… Burnis     Biltoft            17. Store Permane…      18.           17. 1914-02-02 1998-01-0…
##  7       1150. Foster De… Foster     Detwiler           17. Store Permane…      18.           17. 1914-02-02 1998-01-0…
##  8       1149. Bertha Ci… Bertha     Ciruli             17. Store Permane…      18.           17. 1914-02-02 1998-01-0…
##  9       1148. Sharon Bi… Sharon     Bishop             16. Store Tempora…      18.           16. 1914-02-02 1998-01-0…
## 10       1147. Jacquelin… Jacqueline Cutwright          16. Store Tempora…      18.           16. 1914-02-02 1998-01-0…
## 11       1146. Elizabeth… Elizabeth  Anderson           16. Store Tempora…      18.           16. 1914-02-02 1998-01-0…
## 12       1145. Michael S… Michael    Swartwood          16. Store Tempora…      18.           16. 1914-02-02 1998-01-0…
## 13       1144. Shirley C… Shirley    Curtsinger         15. Store Permane…      18.           15. 1914-02-02 1998-01-0…
## 14       1143. Ana Quick  Ana        Quick              15. Store Permane…      18.           15. 1914-02-02 1998-01-0…
## 15       1142. Hazel Sou… Hazel      Souza              15. Store Permane…      18.           15. 1914-02-02 1998-01-0…
## 16       1141. James Com… James      Compagno           15. Store Permane…      18.           15. 1914-02-02 1998-01-0…
## 17       1140. Mona Jara… Mona       Jaramillo          13. Store Shift S…      18.           11. 1961-09-24 1998-01-0…
## 18       1139. Jeanette … Jeanette   Belsey             12. Store Assista…      18.           11. 1972-05-12 1998-01-0…
## 19       1138. James Eic… James      Eichorn            18. Store Tempora…      12.           18. 1914-02-02 1998-01-0…
## 20       1137. Heather G… Heather    Geiermann          18. Store Tempora…      12.           18. 1914-02-02 1998-01-0…
## # ... with more rows, and 6 more variables: salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

mutate(db, position_title=tolower(position_title)) %>%
  mutate(salary=as.numeric(salary)) %>% 
  mutate(gender=ifelse(gender=="F", "Female", "Male")) %>%
  mutate(marital_status=ifelse(marital_status=="S", "Single", "Married")) %>% 
  group_by(supervisor_id) %>% 
  summarise(underlings_count=n()) %>% 
  collect()
## # A tibble: 112 x 2
##    supervisor_id underlings_count
##  *         <dbl>            <dbl>
##  1            0.               1.
##  2            1.               7.
##  3            5.               9.
##  4            4.               2.
##  5            2.               3.
##  6           20.               2.
##  7           21.               4.
##  8           22.               7.
##  9            6.               4.
## 10           36.               2.
## # ... with 102 more rows

``` ### Test Results

library(sergeant.caffeinated)
library(testthat)
## 
## Attaching package: 'testthat'
## The following object is masked from 'package:dplyr':
## 
##     matches
## The following object is masked from 'package:purrr':
## 
##     is_null

date()
## [1] "Sun Oct 14 09:31:23 2018"

devtools::test()
## Loading sergeant.caffeinated
## Testing sergeant.caffeinated
## ✔ | OK F W S | Context
## |  0       | JDBC|  1       | JDBC|  2       | JDBC|  3       | JDBC|  3       | JDBC [0.3 s]
## 
## ══ Results ════════════════════════════════════════════════════════════════
## Duration: 0.3 s
## 
## OK:       3
## Failed:   0
## Warnings: 0
## Skipped:  0

sergeant Metrics

Lang # Files (%) LoC (%) Blank lines (%) # Lines (%)
R 7 0.88 302 0.9 78 0.68 164 0.77
Rmd 1 0.12 35 0.1 37 0.32 48 0.23

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.