<p>As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.</p>
<p>We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.</p>
<p>Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.</p>
<p>Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.</p>
<p>Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.</p>
<p>This Code of Conduct is adapted from the Contributor Covenant (<a href="http://contributor-covenant.org" class="uri">http://contributor-covenant.org</a>), version 1.0.0, available at <a href="http://contributor-covenant.org/version/1/0/0/" class="uri">http://contributor-covenant.org/version/1/0/0/</a>.</p>
</div>
</div>
</div>
<footer>
<div class="copyright">
<p>Developed by Bob Rudis.</p>
</div>
<div class="pkgdown">
<p>Site built with <a href="http://pkgdown.r-lib.org/">pkgdown</a>.</p>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script><!-- Font Awesome icons --><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous">
<p><code>sergeant</code>: Tools to Transform and Query Data with ‘Apache’ ‘Drill’</p>
<p>I find writing SQL queries to parquet files with Drill on a local Linux or macOS workstation to be more performant than doing the data ingestion work with R (for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.</p>
<p>You can download Drill from <a href="https://drill.apache.org/download/" class="uri">https://drill.apache.org/download/</a> (use “Direct File Download”). I use <code>/usr/local/drill</code> as the install directory. <code>drill-embedded</code> is a super-easy way to get started playing with Drill on a single workstation and most of my workflows can get by using Drill this way. If there is sufficient desire for an automated downloader and a way to start the <code>drill-embedded</code> server from within R, please file an issue.</p>
<div id="note" class="section level2">
<h2 class="hasAnchor">
<a href="#note" class="anchor"></a>NOTE</h2>
<p>Version 0.7.0 splits off the JDBC interface into a separate package <code>sergeant.caffeinated</code> (<a href="https://gitlab.com/hrbrmstr/sergeant-caffeinated">GitLab</a>; <a href="https://github.com/hrbrmstr/sergeant-caffeinated">GitHub</a>).</p>
<p>Drill + <code>sergeant</code> is (IMO) a streamlined alternative to Spark + <code>sparklyr</code> if you don’t need the ML components of Spark (i.e. you just need to query “big data” sources, interface with parquet, or combine disparate data source types such as json, csv, parquet, and rdbms for aggregation). Drill also has support for spatial queries.</p>
<p>Using Drill SQL queries that reference parquet files on a local Linux or macOS workstation can often be more performant than doing the same data ingestion and wrangling work with R (especially for large or disparate data sets). Drill can also help streamline workflows that involve wrangling many tiny JSON files on a daily basis.</p>
<p>Drill can be obtained from <a href="https://drill.apache.org/download/" class="uri">https://drill.apache.org/download/</a> (use “Direct File Download”). Drill can also be installed via <a href="https://drill.apache.org/docs/running-drill-on-docker/">Docker</a>. For local installs on Unix-like systems, a common install location is <code>/usr/local/drill</code>.</p>
<p>Drill embedded (started using the <code>$DRILL_BASE_DIR/bin/drill-embedded</code> script) is a super-easy way to get started playing with Drill on a single workstation, and many workflows can “get by” using Drill this way.</p>
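<p>As a quick sanity check once <code>drill-embedded</code> is up, the sketch below connects over the REST API and asks the server to identify itself. It assumes Drill is running on the same machine and listening on its default REST port (8047); adjust the host if your setup differs.</p>
<pre class="sourceCode r"><code class="sourceCode r">library(sergeant)

# Assumption: drill-embedded is already running locally on the default REST port
dc &lt;- drill_connection("localhost")

# Check that the REST API responds, then ask which Drill version is running
drill_active(dc)
drill_version(dc)</code></pre>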
<p>There are a few convenience wrappers for various informational SQL queries (like <code><a href="reference/drill_version.html">drill_version()</a></code>). Please file a PR if you add more.</p>
<p>The package has been written with retrieval of rectangular data sources in mind. If you need/want a version of <code><a href="reference/drill_query.html">drill_query()</a></code> that will enable returning of non-rectangular data (which is possible with Drill) then please file an issue.</p>
<p>Some of the more “controlling vs data ops” REST API functions aren’t implemented. Please file a PR if you need those.</p>
<p>Finally, I run most of this locally and at home, so it’s all been coded with no authentication or encryption in mind. If you want/need support for that, please file an issue. If there is demand for this, it will change the R API a bit (I’ve already thought out what to do but have no need for it right now).</p>
<p>The following functions are implemented:</p>
<p><strong><code>DBI</code></strong> (REST)</p>
<ul>
<li>A “just enough” feature complete R <code>DBI</code> driver has been implemented using the Drill REST API, mostly to facilitate the <code>dplyr</code> interface. Use the <code>RJDBC</code> driver interface if you need more <code>DBI</code> functionality.</li>
<li>This also means that SQL functions unique to Drill have been “implemented” (i.e. made accessible to the <code>dplyr</code> interface). If you have custom Drill SQL functions that need to be implemented, please file an issue on GitHub. Many should work without it, but some may require a custom interface.</li>
<li><code>drill_jdbc</code>: Connect to Drill using JDBC, enabling use of said idioms. See <code>RJDBC</code> for more info.</li>
<li>NOTE: The fully qualified path to the Drill JDBC driver must be placed in the <code>DRILL_JDBC_JAR</code> environment variable. This is best done via <code>~/.Renviron</code> for interactive work, e.g. <code>DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.9.0.jar</code>
</li>
</ul>
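<p>A minimal sketch of the REST-backed <code>DBI</code> driver in use, assuming a local Drill instance with the bundled <code>cp.`employee.json`</code> sample data. The column names are taken from the example output elsewhere in this document, and the <code>host</code> value is an assumption about your setup:</p>
<pre class="sourceCode r"><code class="sourceCode r">library(DBI)
library(sergeant)

# Connect through the Drill REST API (assumes Drill is running locally)
con &lt;- dbConnect(Drill(), host = "localhost")

# Run a query and collect the result as a data frame
dbGetQuery(con, "SELECT full_name, salary FROM cp.`employee.json` LIMIT 5")</code></pre>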
<p><strong><code>dplyr</code></strong>:</p>
<ul>
<li>
<code>src_drill</code>: Connect to Drill (using <code>dplyr</code>) + supporting functions</li>
</ul>
<p>See <code>dplyr</code> for the <code>dplyr</code> operations (light testing shows they work in basic SQL use-cases, but Drill’s SQL engine has issues with more complex queries).</p>
<p>Note that a number of Drill SQL functions have been mapped to R functions (e.g. <code>grepl</code>) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on <code>drill_custom_functions</code> for more info on these helper Drill custom function mappings.</p>
<p><strong>Drill APIs</strong>:</p>
<ul>
<li>
<li>
<code>drill_version</code>: Identify the version of Drill running</li>
ds &lt;-<span class="st"> </span><span class="kw"><a href="reference/src_drill.html">src_drill</a></span>(<span class="st">"localhost"</span>) <span class="co"># use localhost if running standalone on same system otherwise the host or IP of your Drill server</span>
<spanclass="co">#> 1 18 F 18 1914-02-02 1140 Stand Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 2 18 M 18 1914-02-02 1140 Burnham Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 3 18 F 18 1914-02-02 1139 Doolittle Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 4 18 M 18 1914-02-02 1139 Pirnie Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 5 18 M 17 1914-02-02 1140 Younce Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 6 18 F 17 1914-02-02 1140 Biltoft Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 7 18 M 17 1914-02-02 1139 Detwiler Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 8 18 F 17 1914-02-02 1139 Ciruli Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 9 18 F 16 1914-02-02 1140 Bishop Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 10 18 F 16 1914-02-02 1140 Cutwright Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 11 18 F 16 1914-02-02 1139 Anderson Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 12 18 F 16 1914-02-02 1139 Swartwood Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 13 18 M 15 1914-02-02 1140 Curtsinger Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 14 18 F 15 1914-02-02 1140 Quick Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 15 18 M 15 1914-02-02 1139 Souza Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 16 18 M 15 1914-02-02 1139 Compagno Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 17 18 M 11 1961-09-24 1139 Jaramillo Store Shift Supervisor 1998-01-01</span>
<spanclass="co">#> 18 18 M 11 1972-05-12 17 Belsey Store Assistant Manager 1998-01-01</span>
<spanclass="co">#> 19 12 M 18 1914-02-02 1069 Eichorn Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 20 12 F 18 1914-02-02 1069 Geiermann Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> # ... with more rows, and 8 more variables: management_role <chr>, salary <dbl>, marital_status <chr>, full_name <chr>,</span>
<aclass="sourceLine"id="cb2-4"data-line-number="4"><spanclass="co"># use localhost if running standalone on same system otherwise the host or IP of your Drill server</span></a>
<aclass="sourceLine"id="cb2-60"data-line-number="60"><spanclass="co">## 1 18 F 18 1914-02-02 1140 Stand Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-61"data-line-number="61"><spanclass="co">## 2 18 M 18 1914-02-02 1140 Burnham Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-62"data-line-number="62"><spanclass="co">## 3 18 F 18 1914-02-02 1139 Doolittle Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-63"data-line-number="63"><spanclass="co">## 4 18 M 18 1914-02-02 1139 Pirnie Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-64"data-line-number="64"><spanclass="co">## 5 18 M 17 1914-02-02 1140 Younce Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-65"data-line-number="65"><spanclass="co">## 6 18 F 17 1914-02-02 1140 Biltoft Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-66"data-line-number="66"><spanclass="co">## 7 18 M 17 1914-02-02 1139 Detwiler Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-67"data-line-number="67"><spanclass="co">## 8 18 F 17 1914-02-02 1139 Ciruli Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-68"data-line-number="68"><spanclass="co">## 9 18 F 16 1914-02-02 1140 Bishop Store Tempora… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-69"data-line-number="69"><spanclass="co">## 10 18 F 16 1914-02-02 1140 Cutwright Store Tempora… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-70"data-line-number="70"><spanclass="co">## 11 18 F 16 1914-02-02 1139 Anderson Store Tempora… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-71"data-line-number="71"><spanclass="co">## 12 18 F 16 1914-02-02 1139 Swartwood Store Tempora… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-72"data-line-number="72"><spanclass="co">## 13 18 M 15 1914-02-02 1140 Curtsinger Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-73"data-line-number="73"><spanclass="co">## 14 18 F 15 1914-02-02 1140 Quick Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-74"data-line-number="74"><spanclass="co">## 15 18 M 15 1914-02-02 1139 Souza Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-75"data-line-number="75"><spanclass="co">## 16 18 M 15 1914-02-02 1139 Compagno Store Permane… 1998-01-01 00:00:00 Store Full Tim…</span></a>
<aclass="sourceLine"id="cb2-76"data-line-number="76"><spanclass="co">## 17 18 M 11 1961-09-24 1139 Jaramillo Store Shift S… 1998-01-01 00:00:00 Store Manageme…</span></a>
<aclass="sourceLine"id="cb2-77"data-line-number="77"><spanclass="co">## 18 18 M 11 1972-05-12 17 Belsey Store Assista… 1998-01-01 00:00:00 Store Manageme…</span></a>
<aclass="sourceLine"id="cb2-78"data-line-number="78"><spanclass="co">## 19 12 M 18 1914-02-02 1069 Eichorn Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-79"data-line-number="79"><spanclass="co">## 20 12 F 18 1914-02-02 1069 Geiermann Store Tempora… 1998-01-01 00:00:00 Store Temp Sta…</span></a>
<aclass="sourceLine"id="cb2-80"data-line-number="80"><spanclass="co">## # ... with more rows, and 7 more variables: salary <dbl>, marital_status <chr>, full_name <chr>, employee_id <int>,</span></a>
dc &lt;-<span class="st"> </span><span class="kw"><a href="reference/drill_connection.html">drill_connection</a></span>(<span class="st">"localhost"</span>)
<spanclass="co">#> 10 1 M 11 1967-06-20 5 Murraiin Store Manager 1998-01-01 Store Management</span>
<spanclass="co">#> # ... with 90 more rows, and 7 more variables: salary <dbl>, marital_status <chr>, full_name <chr>, employee_id <int>,</span>
<span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT COUNT(gender) AS gender FROM cp.`employee.json` GROUP BY gender"</span>)
<spanclass="co">#> Parsed with column specification:</span>
<aclass="sourceLine"id="cb3-40"data-line-number="40"><spanclass="co">## 6 0 F 3 1949-03-27 1 Kanagaki VP Human Resou… 1994-12-01 00:00:00 Senior Managem…</span></a>
<aclass="sourceLine"id="cb3-41"data-line-number="41"><spanclass="co">## 7 9 F 11 1922-08-10 5 Brunner Store Manager 1998-01-01 00:00:00 Store Manageme…</span></a>
<aclass="sourceLine"id="cb3-42"data-line-number="42"><spanclass="co">## 8 21 F 11 1979-06-23 5 Blumberg Store Manager 1998-01-01 00:00:00 Store Manageme…</span></a>
<aclass="sourceLine"id="cb3-44"data-line-number="44"><spanclass="co">## 10 1 M 11 1967-06-20 5 Murraiin Store Manager 1998-01-01 00:00:00 Store Manageme…</span></a>
<aclass="sourceLine"id="cb3-45"data-line-number="45"><spanclass="co">## # ... with 90 more rows, and 7 more variables: salary <dbl>, marital_status <chr>, full_name <chr>, employee_id <int>,</span></a>
<a class="sourceLine" id="cb3-48" data-line-number="48"><span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT COUNT(gender) AS gender FROM cp.`employee.json` GROUP BY gender"</span>)</a>
<aclass="sourceLine"id="cb3-49"data-line-number="49"><spanclass="co">## Parsed with column specification:</span></a>
<aclass="sourceLine"id="cb3-63"data-line-number="63"><spanclass="co">## 1 debug.validate_iterators FALSE ALL BOOLEAN BOOT </span></a>
<aclass="sourceLine"id="cb3-64"data-line-number="64"><spanclass="co">## 2 debug.validate_vectors FALSE ALL BOOLEAN BOOT </span></a>
<aclass="sourceLine"id="cb3-65"data-line-number="65"><spanclass="co">## 3 drill.exec.functions.cast_empty_string_to_null FALSE ALL BOOLEAN BOOT </span></a>
<aclass="sourceLine"id="cb3-66"data-line-number="66"><spanclass="co">## 4 drill.exec.hashagg.fallback.enabled FALSE ALL BOOLEAN BOOT </span></a>
<aclass="sourceLine"id="cb3-67"data-line-number="67"><spanclass="co">## 5 drill.exec.memory.operator.output_batch_size 16777216 SYSTEM LONG BOOT </span></a>
<aclass="sourceLine"id="cb3-68"data-line-number="68"><spanclass="co">## 6 drill.exec.storage.file.partition.column.label dir ALL STRING BOOT </span></a>
<aclass="sourceLine"id="cb3-69"data-line-number="69"><spanclass="co">## 7 drill.exec.storage.implicit.filename.column.label filename ALL STRING BOOT </span></a>
<aclass="sourceLine"id="cb3-70"data-line-number="70"><spanclass="co">## 8 drill.exec.storage.implicit.filepath.column.label filepath ALL STRING BOOT </span></a>
<aclass="sourceLine"id="cb3-71"data-line-number="71"><spanclass="co">## 9 drill.exec.storage.implicit.fqn.column.label fqn ALL STRING BOOT </span></a>
<aclass="sourceLine"id="cb3-72"data-line-number="72"><spanclass="co">## 10 drill.exec.storage.implicit.suffix.column.label suffix ALL STRING BOOT </span></a>
<aclass="sourceLine"id="cb3-73"data-line-number="73"><spanclass="co">## # ... with 128 more rows</span></a>
<span class="co"># Use this if connecting to a cluster with zookeeper</span>
<span class="co"># con &lt;- drill_jdbc("drill-node:2181", "drillbits1") </span>
<span class="co"># Use the following if running drill-embedded</span>
con &lt;-<span class="st"> </span><span class="kw"><a href="reference/drill_jdbc.html">drill_jdbc</a></span>(<span class="st">"localhost:31010"</span>, <span class="dt">use_zk=</span><span class="ot">FALSE</span>)
<span class="co">#&gt; Using [jdbc:drill:drillbit=localhost:31010]...</span>
<span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(con, <span class="st">"SELECT * FROM cp.`employee.json`"</span>)
<spanclass="co">#> # A tibble: 1,155 x 16</span>
<spanclass="co">#> 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 11</span>
<spanclass="co">#> 8 9 Brenda Blumberg Brenda Blumberg 11 Store Manager 21 11</span>
<spanclass="co">#> 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 5</span>
<spanclass="co">#> 10 11 Jonathan Murraiin Jonathan Murraiin 11 Store Manager 1 11</span>
<spanclass="co">#> # ... with 1,145 more rows, and 8 more variables: birth_date <chr>, hire_date <chr>, salary <dbl>, supervisor_id <dbl>,</span>
<aclass="sourceLine"id="cb7-24"data-line-number="24">✔ <spanclass="op">|</span><spanclass="st"></span><spanclass="dv">3</span><spanclass="op">|</span><spanclass="st"></span>dplyr API [<spanclass="fl">0.1</span> s]</a>
<aclass="sourceLine"id="cb7-43"data-line-number="43">✔ <spanclass="op">|</span><spanclass="st"></span><spanclass="dv">16</span><spanclass="op">|</span><spanclass="st"></span>REST API [<spanclass="fl">1.9</span> s]</a>
<a href="#code-of-conduct" class="anchor"></a>Code of Conduct</h3>
<p>Please note that this project is released with a <a href="CONDUCT.md">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
<a href="#code-of-conduct" class="anchor"></a>Code of Conduct</h2>
<p>Please note that this project is released with a <a href="CONDUCT.html">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
<li>JDBC driver still in github repo but no longer included in pkg builds. See README.md or <code>drill_jdbc()</code> help for more information on using the JDBC driver with sergeant.</li>
<li>implemented a large subset of Drill SQL Functions <a href="https://drill.apache.org/docs/about-sql-function-examples/" class="uri">https://drill.apache.org/docs/about-sql-function-examples/</a>
<li>can pass RJDBC connections made with <code>drill_jdbc()</code> to <code><a href="../reference/drill_query.html">drill_query()</a></code>
</li>
<li>finally enabled the <code>nodes</code> parameter to be a multi-element character vector, as stated in the function description</li>
<li>tweaked <code><a href="../reference/drill_query.html">drill_query()</a></code> and <code><a href="../reference/drill_version.html">drill_version()</a></code>