sergeant

History
boB Rudis 2531ca8103 README		5 years ago
..
README.Rmd	README	5 years ago
README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "## ",
  fig.path = "README-"
)
options(sergeant.bigint.warnonce = FALSE)
options(width=120)
```

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1248912.svg)](https://doi.org/10.5281/zenodo.1248912) 
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/sergeant.svg?branch=master)](https://travis-ci.org/hrbrmstr/sergeant) 
[![Coverage Status](https://codecov.io/gh/hrbrmstr/sergeant/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/sergeant)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/sergeant)](https://cran.r-project.org/package=sergeant)

# 💂 sergeant

Tools to Transform and Query Data with 'Apache' 'Drill'

## \*\* IMPORTANT \*\*

Version 0.7.0 (a.k.a. the main branch) splits off the JDBC interface into a separate package `sergeant.caffeinated` ([GitLab](https://gitlab.com/hrbrmstr/sergeant-caffeinated); [GitHub](https://github.com/hrbrmstr/sergeant-caffeinated)).

I# Description

Drill + `sergeant` is (IMO) a streamlined alternative to Spark + `sparklyr` if you don't need the ML components of Spark (i.e. just need to query "big data" sources, need to interface with parquet, need to combine disparate data source types — json, csv, parquet, rdbms - for aggregation, etc). Drill also has support for spatial queries.

Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streamline workflows that involve wrangling many tiny JSON files on a daily basis.

Drill can be obtained from <https://drill.apache.org/download/> (use "Direct File Download"). Drill can also be installed via [Docker](https://drill.apache.org/docs/running-drill-on-docker/). For local installs on Unix-like systems, a common/suggestion location for the Drill directory is  `/usr/local/drill` as the install directory. 

Drill embedded (started using the `$DRILL_BASE_DIR/bin/drill-embedded` script) is a super-easy way to get started playing with Drill on a single workstation and most of many workflows can "get by" using Drill this way.

There are a few convenience wrappers for various informational SQL queries (like `drill_version()`). Please file an PR if you add more.

Some of the more "controlling vs data ops" REST API functions aren't implemented. Please file a PR if you need those.

The following functions are implemented:

**`DBI`** (REST)

- A "just enough" feature complete R `DBI` driver has been implemented using the Drill REST API, mostly to facilitate the `dplyr` interface. Use the `RJDBC` driver interface if you need more `DBI` functionality.
- This also means that SQL functions unique to Drill have also been "implemented" (i.e. made accessible to the `dplyr` interface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub. Many should work without it, but some may require a custom interface. 

**`dplyr`**: (REST)

- `src_drill`: Connect to Drill (using `dplyr`) + supporting functions

Note that a number of Drill SQL functions have been mapped to R functions (e.g. `grepl`) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on `drill_custom_functions` for more info on these helper Drill custom function mappings.

**Drill APIs**:

- `drill_connection`: Setup parameters for a Drill server/cluster connection
- `drill_active`: Test whether Drill HTTP REST API server is up
- `drill_cancel`:	Cancel the query that has the given queryid
- `drill_functions`: Show all the available Drill built-in functions & UDFs (Apache Drill 1.15.0+ required)
- `drill_jdbc`:	Connect to Drill using JDBC
- `drill_metrics`:	Get the current memory metrics
- `drill_options`:	List the name, default, and data type of the system and session options
- `drill_popts`: Show all the available Drill options (1.15.0+)
- `drill_rofile`:	Get the profile of the query that has the given query id
- `drill_profiles`:	Get the profiles of running and completed queries
- `drill_query`:	Submit a query and return results
- `drill_set`:	Set Drill SYSTEM or SESSION options
- `drill_settings_reset`:	Changes (optionally, all) session settings back to system defaults
- `drill_show_files`:	Show files in a file system schema.
- `drill_show_schemas`:	Returns a list of available schemas.
- `drill_stats`:	Get Drillbit information, such as ports numbers
- `drill_status`:	Get the status of Drill
- `drill_storage`:	Get the list of storage plugin names and configurations
- `drill_system_reset`:	Changes (optionally, all) system settings back to system defaults
- `drill_threads`:	Get information about threads
- `drill_uplift`:	Turn a columnar query results into a type-converted tbl
- `drill_use`:	Change to a particular schema.
- `drill_version`:	Identify the version of Drill running

**Helpers**

- `ctas_profile`: Generate a Drill CTAS Statement from a Query
- `drill_up`: 
 sart a Dockerized Drill Instance
# `sdrill_down`:  stop a Dockerized Drill Instance by container id
- `howall_drill`: Show all dead and running Drill Docker containers
- `stopall_drill`: Prune all dead and running Drill Docker containers

# Installation

```{r einstall-ex, results='asis', echo = FALSE}
hrbrpkghelpr::install_block()
`````
# Usage

### `dplyr` interface

```{r dplyr-01, message=FALSE}
library(sergeant)
library(tidyverse)

# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill("localhost")  #ds
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:
count(db, gender, marital_status)

count(db, gender, marital_status) %>% collect()

group_by(db, position_title) %>%
  count(gender) -> tmp2

group_by(db, position_title) %>%
  count(gender) %>%
  ungroup() %>%
  mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>%
  collect() %>%
  select(Title = position_title, Gender = full_desc, Count = n)

arrange(db, desc(employee_id)) %>% print(n = 20)

mutate(db, position_title = tolower(position_title)) %>%
  mutate(salary = as.numeric(salary)) %>%
  mutate(gender = ifelse(gender == "F", "Female", "Male")) %>%
  mutate(marital_status = ifelse(marital_status == "S", "Single", "Married")) %>%
  group_by(supervisor_id) %>%
  summarise(underlings_count = n()) %>%
  collect()
```

### REST API

```{r rest-01}
dc <- drill_connection("localhost") 

drill_active(dc)

drill_version(dc)

drill_storage(dc)$name

drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100")

drill_query(dc, "SELECT COUNT(gender) AS gctFROM cp.`employee.json` GROUP BY gender")

drill_options(dc)

drill_options(dc, "json")
```

## Working with parquet files

```{r}
drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5")
```

Including multiple parquet files in different directories (note the wildcard support):

```{r}
drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5")
```

### Drill has built-in support for spatial ops

Via: <https://github.com/k255/drill-gis>

A common use case is to select data within boundary of given polygon:

```{r}
drill_query(dc, "
select columns[2] as city, columns[4] as lon, columns[3] as lat
    from cp.`sample-data/CA-cities.csv`
    where
        ST_Within(
            ST_Point(columns[4], columns[3]),
            ST_GeomFromText(
                'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'
                )
            )
")
```
### sergeant Metrics

```{r echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). 
By participating in this project you agree to abide by its terms.