`sergeant` : Tools to Transform and Query Data with the 'Apache' 'Drill' 'API'
Drill + `sergeant` is (IMO) a nice alternative to Spark + `sparklyr` if you don't need the ML components of Spark (i.e. you just need to query "big data" sources, interface with parquet, or combine disparate data source types (JSON, CSV, parquet, RDBMS) for aggregation, etc.). Drill also has support for spatial queries.
The package doesn't have a `dplyr`-esque interface yet, but creating one is possible since Drill uses pretty standard SQL for queries. Right now, you need to build Drill SQL queries by hand and issue them with `drill_query()`. It's good to get one's hands dirty with some SQL on occasion (it builds character).
I find writing SQL queries to parquet files with Drill on a local 64GB Linux workstation to be more performant than doing the data ingestion work with R (for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.
You can download Drill from <https://drill.apache.org/download/> (use "Direct File Download"). I use `/usr/local/drill` as the install directory. `drill-embedded` is a super-easy way to get started playing with Drill on a single workstation and most of my workflows can get by using Drill this way. If there is sufficient desire for an automated downloader and a way to start the `drill-embedded` server from within R, please file an issue.
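A minimal sketch of starting the embedded server from a terminal (assuming the `/usr/local/drill` install directory above):

```
/usr/local/drill/bin/drill-embedded
```

This drops you into an interactive SQLLine shell and, while it runs, exposes the REST API on port 8047 that `sergeant` talks to.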
There are a few convenience wrappers for various informational SQL queries (like `drill_version()`). Please file a PR if you add more.
The package has been written with retrieval of rectangular data sources in mind. If you need/want a version of `drill_query()` that will enable returning of non-rectangular data (which is possible with Drill) then please file an issue.
Some of the more "controlling vs data ops" REST API functions aren't implemented. Please file a PR if you need those.
Finally, I run most of this locally and at home, so it's all been coded with no authentication or encryption in mind. If you want/need support for that, please file an issue. If there is demand for this, it will change the R API a bit (I've already thought out what to do but have no need for it right now).
The following functions are implemented:
- `drill_active`: Test whether the Drill HTTP REST API server is up
- `drill_cancel`: Cancel the query that has the given queryid
- `drill_metrics`: Get the current memory metrics
- `drill_options`: List the name, default, and data type of the system and session options
- `drill_profile`: Get the profile of the query that has the given queryid
- `drill_profiles`: Get the profiles of running and completed queries
- `drill_query`: Submit a query and return results
- `drill_set`: Set Drill SYSTEM or SESSION options
```{r}
library(sergeant)
# current version
packageVersion("sergeant")
drill_active()
drill_version()
drill_storage()$name
```
Working with the built-in JSON data sets:
```{r}
drill_query("SELECT * FROM cp.`employee.json` limit 100")
drill_query("SELECT COUNT(gender) AS gender FROM cp.`employee.json` GROUP BY gender")
```

```{r}
drill_options()
```
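`drill_set()` is the complement to `drill_options()`; a hedged sketch (these are standard Drill option names, used here only for illustration):

```{r}
# set a couple of SESSION options; pass them as named arguments
drill_set(exec.errors.verbose = TRUE, store.format = "parquet")
```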
## Working with parquet files
```{r}
drill_query("SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5")
```
Including multiple parquet files in different directories (note the wildcard support):
```{r}
drill_query("SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5")
```
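Drill can also *write* parquet via standard `CREATE TABLE AS` statements, which makes a handy conversion path from R; a sketch assuming the writable `dfs.tmp` workspace that embedded Drill configures by default:

```{r}
# materialize the built-in JSON sample data as a parquet-backed table,
# then query it back (store.format defaults to parquet)
drill_query("CREATE TABLE dfs.tmp.`employee_pq` AS SELECT * FROM cp.`employee.json`")
drill_query("SELECT COUNT(*) AS n FROM dfs.tmp.`employee_pq`")
```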
### A preview of the built-in support for spatial ops
Via: <https://github.com/k255/drill-gis>
A common use case is to select data within boundary of given polygon:
```{r}
drill_query("
select columns[2] as city, columns[4] as lon, columns[3] as lat
from cp.`sample-data/CA-cities.csv`
where
    ST_Within(
       ST_Point(columns[4], columns[3]),
       ST_GeomFromText(
          'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'
          )
       )
")
```