[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1248912.svg)](https://doi.org/10.5281/zenodo.1248912) [![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/sergeant.svg?branch=master)](https://travis-ci.org/hrbrmstr/sergeant) [![Coverage Status](https://codecov.io/gh/hrbrmstr/sergeant/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/sergeant) [![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant) # 💂 sergeant Tools to Transform and Query Data with ‘Apache’ ‘Drill’ ## \*\* IMPORTANT \*\* Version 0.7.0+ (a.k.a. the main branch) splits off the JDBC interface into a separate package `sergeant.caffeinated` ([GitLab](https://gitlab.com/hrbrmstr/sergeant-caffeinated); [GitHub](https://github.com/hrbrmstr/sergeant-caffeinated)). I\# Description Drill + `sergeant` is (IMO) a streamlined alternative to Spark + `sparklyr` if you don’t need the ML components of Spark (i.e. just need to query “big data” sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries. Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streamline workflows that involve wrangling many tiny JSON files on a daily basis. Drill can be obtained from (use “Direct File Download”). Drill can also be installed via [Docker](https://drill.apache.org/docs/running-drill-on-docker/). For local installs on Unix-like systems, a common/suggestion location for the Drill directory is `/usr/local/drill` as the install directory. Drill embedded (started using the `$DRILL_BASE_DIR/bin/drill-embedded` script) is a super-easy way to get started playing with Drill on a single workstation and most of many workflows can “get by” using Drill this way. There are a few convenience wrappers for various informational SQL queries (like `drill_version()`). Please file an PR if you add more. Some of the more “controlling vs data ops” REST API functions aren’t implemented. Please file a PR if you need those. The following functions are implemented: **`DBI`** (REST) - A “just enough” feature complete R `DBI` driver has been implemented using the Drill REST API, mostly to facilitate the `dplyr` interface. Use the `RJDBC` driver interface if you need more `DBI` functionality. - This also means that SQL functions unique to Drill have also been “implemented” (i.e. made accessible to the `dplyr` interface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub. Many should work without it, but some may require a custom interface. **`dplyr`**: (REST) - `src_drill`: Connect to Drill (using `dplyr`) + supporting functions Note that a number of Drill SQL functions have been mapped to R functions (e.g. `grepl`) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on `drill_custom_functions` for more info on these helper Drill custom function mappings. **Drill APIs**: - `drill_connection`: Setup parameters for a Drill server/cluster connection - `drill_active`: Test whether Drill HTTP REST API server is up - `drill_cancel`: Cancel the query that has the given queryid - `drill_functions`: Show all the available Drill built-in functions & UDFs (Apache Drill 1.15.0+ required) - `drill_jdbc`: Connect to Drill using JDBC - `drill_metrics`: Get the current memory metrics - `drill_options`: List the name, default, and data type of the system and session options - `drill_popts`: Show all the available Drill options (1.15.0+) - `drill_profile`: Get the profile of the query that has the given query id - `drill_profiles`: Get the profiles of running and completed queries - `drill_query`: Submit a query and return results - `drill_set`: Set Drill SYSTEM or SESSION options - `drill_settings_reset`: Changes (optionally, all) session settings back to system defaults - `drill_show_files`: Show files in a file system schema. - `drill_show_schemas`: Returns a list of available schemas. - `drill_stats`: Get Drillbit information, such as ports numbers - `drill_status`: Get the status of Drill - `drill_storage`: Get the list of storage plugin names and configurations - `drill_system_reset`: Changes (optionally, all) system settings back to system defaults - `drill_threads`: Get information about threads - `drill_uplift`: Turn a columnar query results into a type-converted tbl - `drill_use`: Change to a particular schema. - `drill_version`: Identify the version of Drill running **Helpers** - `ctas_profile`: Generate a Drill CTAS Statement from a Query - `drill_up`: sart a Dockerized Drill Instance \# `sdrill_down`: stop a Dockerized Drill Instance by container id - `howall_drill`: Show all dead and running Drill Docker containers - `stopall_drill`: Prune all dead and running Drill Docker containers # Installation ``` r install.packages("sergeant", repos = "https://cinc.rud.is") # or devtools::install_git("https://git.rud.is/hrbrmstr/sergeant.git") # or devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant") # or devtools::install_gitlab("hrbrmstr/sergeant") # or devtools::install_bitbucket("hrbrmstr/sergeant") # or devtools::install_github("hrbrmstr/sergeant") ``` # Usage ### `dplyr` interface ``` r library(sergeant) library(tidyverse) # use localhost if running standalone on same system otherwise the host or IP of your Drill server ds <- src_drill("localhost") #ds db <- tbl(ds, "cp.`employee.json`") # without `collect()`: count(db, gender, marital_status) ## # Source: lazy query [?? x 3] ## # Database: DrillConnection ## # Groups: gender ## gender marital_status n ## ## 1 F S 297 ## 2 M M 278 ## 3 M S 276 ## 4 F M 304 count(db, gender, marital_status) %>% collect() ## # A tibble: 4 x 3 ## # Groups: gender [2] ## gender marital_status n ## ## 1 F S 297 ## 2 M M 278 ## 3 M S 276 ## 4 F M 304 group_by(db, position_title) %>% count(gender) -> tmp2 group_by(db, position_title) %>% count(gender) %>% ungroup() %>% mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>% collect() %>% select(Title = position_title, Gender = full_desc, Count = n) ## # A tibble: 30 x 3 ## Title Gender Count ## ## 1 President Female 1 ## 2 VP Country Manager Male 3 ## 3 VP Country Manager Female 3 ## 4 VP Information Systems Female 1 ## 5 VP Human Resources Female 1 ## 6 Store Manager Female 13 ## 7 VP Finance Male 1 ## 8 Store Manager Male 11 ## 9 HQ Marketing Female 2 ## 10 HQ Information Systems Female 4 ## # … with 20 more rows arrange(db, desc(employee_id)) %>% print(n = 20) ## # Source: table [?? x 20] ## # Database: DrillConnection ## # Ordered by: desc(employee_id) ## employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date ## ## 1 999 Beverly … Beverly Dittmar 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 2 998 Elizabet… Elizabeth Jantzer 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 3 997 John Swe… John Sweet 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 4 996 William … William Murphy 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 5 995 Carol Li… Carol Lindsay 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 6 994 Richard … Richard Burke 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 7 993 Ethan Bu… Ethan Bunosky 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 8 992 Claudett… Claudette Cabrera 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 9 991 Maria Te… Maria Terry 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 10 990 Stacey C… Stacey Case 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 11 99 Elizabet… Elizabeth Horne 18 Store Tempora… 6 18 1976-10-05 1997-01-… ## 12 989 Dominick… Dominick Nutter 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 13 988 Brian Wi… Brian Willeford 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 14 987 Margaret… Margaret Clendenen 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 15 986 Maeve Wa… Maeve Wall 17 Store Permane… 8 17 1914-02-02 1998-01-… ## 16 985 Mildred … Mildred Morrow 16 Store Tempora… 8 16 1914-02-02 1998-01-… ## 17 984 French W… French Wilson 16 Store Tempora… 8 16 1914-02-02 1998-01-… ## 18 983 Elisabet… Elisabeth Duncan 16 Store Tempora… 8 16 1914-02-02 1998-01-… ## 19 982 Linda An… Linda Anderson 16 Store Tempora… 8 16 1914-02-02 1998-01-… ## 20 981 Selene W… Selene Watson 16 Store Tempora… 8 16 1914-02-02 1998-01-… ## # … with more rows, and 6 more variables: salary , supervisor_id , education_level , ## # marital_status , gender , management_role mutate(db, position_title = tolower(position_title)) %>% mutate(salary = as.numeric(salary)) %>% mutate(gender = ifelse(gender == "F", "Female", "Male")) %>% mutate(marital_status = ifelse(marital_status == "S", "Single", "Married")) %>% group_by(supervisor_id) %>% summarise(underlings_count = n()) %>% collect() ## # A tibble: 112 x 2 ## supervisor_id underlings_count ## ## 1 0 1 ## 2 1 7 ## 3 5 9 ## 4 4 2 ## 5 2 3 ## 6 20 2 ## 7 21 4 ## 8 22 7 ## 9 6 4 ## 10 36 2 ## # … with 102 more rows ``` ### REST API ``` r dc <- drill_connection("localhost") drill_active(dc) ## [1] TRUE drill_version(dc) ## [1] "1.15.0" drill_storage(dc)$name ## [1] "cp" "dfs" "drilldat" "hbase" "hdfs" "hive" "kudu" "mongo" "my" "s3" drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100") ## # A tibble: 100 x 16 ## employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date ## ## 1 1 Sheri No… Sheri Nowmer 1 President 0 1 1961-08-26 1994-12-… ## 2 2 Derrick … Derrick Whelply 2 VP Country Ma… 0 1 1915-07-03 1994-12-… ## 3 4 Michael … Michael Spence 2 VP Country Ma… 0 1 1969-06-20 1998-01-… ## 4 5 Maya Gut… Maya Gutierrez 2 VP Country Ma… 0 1 1951-05-10 1998-01-… ## 5 6 Roberta … Roberta Damstra 3 VP Informatio… 0 2 1942-10-08 1994-12-… ## 6 7 Rebecca … Rebecca Kanagaki 4 VP Human Reso… 0 3 1949-03-27 1994-12-… ## 7 8 Kim Brun… Kim Brunner 11 Store Manager 9 11 1922-08-10 1998-01-… ## 8 9 Brenda B… Brenda Blumberg 11 Store Manager 21 11 1979-06-23 1998-01-… ## 9 10 Darren S… Darren Stanz 5 VP Finance 0 5 1949-08-26 1994-12-… ## 10 11 Jonathan… Jonathan Murraiin 11 Store Manager 1 11 1967-06-20 1998-01-… ## # … with 90 more rows, and 6 more variables: salary , supervisor_id , education_level , ## # marital_status , gender , management_role drill_query(dc, "SELECT COUNT(gender) AS gctFROM cp.`employee.json` GROUP BY gender") drill_options(dc) ## # A tibble: 179 x 6 ## name value defaultValue accessibleScopes kind optionScope ## ## 1 debug.validate_iterators FALSE false ALL BOOLE… BOOT ## 2 debug.validate_vectors FALSE false ALL BOOLE… BOOT ## 3 drill.exec.functions.cast_empty_string_to_null FALSE false ALL BOOLE… BOOT ## 4 drill.exec.hashagg.fallback.enabled FALSE false ALL BOOLE… BOOT ## 5 drill.exec.hashjoin.fallback.enabled FALSE false ALL BOOLE… BOOT ## 6 drill.exec.memory.operator.output_batch_size 16777216 16777216 SYSTEM LONG BOOT ## 7 drill.exec.memory.operator.output_batch_size_avail_mem_fac… 0.1 0.1 SYSTEM DOUBLE BOOT ## 8 drill.exec.storage.file.partition.column.label dir dir ALL STRING BOOT ## 9 drill.exec.storage.implicit.filename.column.label filename filename ALL STRING BOOT ## 10 drill.exec.storage.implicit.filepath.column.label filepath filepath ALL STRING BOOT ## # … with 169 more rows drill_options(dc, "json") ## # A tibble: 10 x 6 ## name value defaultValue accessibleScopes kind optionScope ## ## 1 store.hive.maprdb_json.optimize_scan_with_native_reader FALSE false ALL BOOLEAN BOOT ## 2 store.json.all_text_mode TRUE false ALL BOOLEAN SYSTEM ## 3 store.json.extended_types TRUE false ALL BOOLEAN SYSTEM ## 4 store.json.read_numbers_as_double FALSE false ALL BOOLEAN BOOT ## 5 store.json.reader.allow_nan_inf TRUE true ALL BOOLEAN BOOT ## 6 store.json.reader.print_skipped_invalid_record_number TRUE false ALL BOOLEAN SYSTEM ## 7 store.json.reader.skip_invalid_records TRUE false ALL BOOLEAN SYSTEM ## 8 store.json.writer.allow_nan_inf TRUE true ALL BOOLEAN BOOT ## 9 store.json.writer.skip_null_fields TRUE true ALL BOOLEAN BOOT ## 10 store.json.writer.uglify TRUE false ALL BOOLEAN SYSTEM ``` ## Working with parquet files ``` r drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5") ## # A tibble: 5 x 4 ## N_NATIONKEY N_NAME N_REGIONKEY N_COMMENT ## ## 1 0 ALGERIA 0 haggle. carefully f ## 2 1 ARGENTINA 1 al foxes promise sly ## 3 2 BRAZIL 1 y alongside of the p ## 4 3 CANADA 1 eas hang ironic, sil ## 5 4 EGYPT 4 y above the carefull ``` Including multiple parquet files in different directories (note the wildcard support): ``` r drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5") ## # A tibble: 5 x 5 ## dir0 N_NATIONKEY N_NAME N_REGIONKEY N_COMMENT ## ## 1 nationsSF 0 ALGERIA 0 haggle. carefully f ## 2 nationsSF 1 ARGENTINA 1 al foxes promise sly ## 3 nationsSF 2 BRAZIL 1 y alongside of the p ## 4 nationsSF 3 CANADA 1 eas hang ironic, sil ## 5 nationsSF 4 EGYPT 4 y above the carefull ``` ### Drill has built-in support for spatial ops Via: A common use case is to select data within boundary of given polygon: ``` r drill_query(dc, " select columns[2] as city, columns[4] as lon, columns[3] as lat from cp.`sample-data/CA-cities.csv` where ST_Within( ST_Point(columns[4], columns[3]), ST_GeomFromText( 'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))' ) ) ") ## # A tibble: 7 x 3 ## city lon lat ## ## 1 Burbank -121.9316233 37.3232752 ## 2 San Jose -121.8949555 37.3393857 ## 3 Lick -121.8457863 37.2871647 ## 4 Willow Glen -121.8896771 37.3085532 ## 5 Buena Vista -121.9166227 37.3213308 ## 6 Parkmoor -121.9307898 37.3210531 ## 7 Fruitdale -121.932746 37.31086 ``` ### sergeant Metrics | Lang | \# Files | (%) | LoC | (%) | Blank lines | (%) | \# Lines | (%) | | :--- | -------: | --: | --: | --: | ----------: | --: | -------: | --: | | Rmd | 1 | 1 | 55 | 1 | 54 | 1 | 89 | 1 | ## Code of Conduct Please note that this project is released with a Contributor Code of Conduct By participating in this project you agree to abide by its terms.