<p>As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.</p>
<p>We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.</p>
<p>Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.</p>
<p>Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.</p>
<p>Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.</p>
<p>This Code of Conduct is adapted from the Contributor Covenant (<ahref="http://contributor-covenant.org"class="uri">http://contributor-covenant.org</a>), version 1.0.0, available at <ahref="http://contributor-covenant.org/version/1/0/0/"class="uri">http://contributor-covenant.org/version/1/0/0/</a></p>
</div>
</div>
</div>
<footer>
<divclass="copyright">
<p>Developed by Bob Rudis.</p>
</div>
<divclass="pkgdown">
<p>Site built with <ahref="http://pkgdown.r-lib.org/">pkgdown</a>.</p>
<scriptsrc="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa"crossorigin="anonymous"></script><!-- Font Awesome icons --><linkhref="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css"rel="stylesheet"integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1"crossorigin="anonymous">
<p><code>sergeant</code> : Tools to Transform and Query Data with ‘Apache’ ‘Drill’</p>
<p>Drill + <code>sergeant</code> is (IMO) a nice alternative to Spark + <code>sparklyr</code> if you don’t need the ML components of Spark (i.e. you just need to query “big data” sources, need to interface with parquet, or need to combine disparate data source types such as json, csv, parquet, and rdbms for aggregation, etc). Drill also has support for spatial queries.</p>
<p>I find writing SQL queries to parquet files with Drill on a local linux or macOS workstation to be more performant than doing the data ingestion work with R (for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.</p>
<p>You can download Drill from <ahref="https://drill.apache.org/download/"class="uri">https://drill.apache.org/download/</a> (use “Direct File Download”). I use <code>/usr/local/drill</code> as the install directory. <code>drill-embedded</code> is a super-easy way to get started playing with Drill on a single workstation and most of my workflows can get by using Drill this way. If there is sufficient desire for an automated downloader and a way to start the <code>drill-embedded</code> server from within R, please file an issue.</p>
<p>There are a few convenience wrappers for various informational SQL queries (like <code><ahref="reference/drill_version.html">drill_version()</a></code>). Please file a PR if you add more.</p>
<p>The package has been written with retrieval of rectangular data sources in mind. If you need/want a version of <code><ahref="reference/drill_query.html">drill_query()</a></code> that will enable returning of non-rectangular data (which is possible with Drill) then please file an issue.</p>
<p>Some of the more “controlling vs data ops” REST API functions aren’t implemented. Please file a PR if you need those.</p>
<p>Finally, I run most of this locally and at home, so it’s all been coded with no authentication or encryption in mind. If you want/need support for that, please file an issue. If there is demand for this, it will change the R API a bit (I’ve already thought out what to do but have no need for it right now).</p>
<p>Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (JDBC)</p>
<divid="note"class="section level2">
<h2class="hasAnchor">
<ahref="#note"class="anchor"></a>NOTE</h2>
<p>This is the Java/JDBC interface to Apache Drill. For non-Java/JDBC, see the <code>sergeant</code> package (<ahref="https://gitlab.com/hrbrmstr/sergeant/">GitLab</a>; <ahref="https://github.com/hrbrmstr/sergeant/">GitHub</a>).</p>
<p>Drill + <code>sergeant</code> is (IMO) a streamlined alternative to Spark + <code>sparklyr</code> if you don’t need the ML components of Spark (i.e. you just need to query “big data” sources, need to interface with parquet, or need to combine disparate data source types such as json, csv, parquet, and rdbms for aggregation, etc). Drill also has support for spatial queries.</p>
<p>Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion and wrangling work with R (especially for large or disparate data sets). Drill can also often help with streaming workflows that involve wrangling many tiny JSON files on a daily basis.</p>
<p>Drill can be obtained from <ahref="https://drill.apache.org/download/"class="uri">https://drill.apache.org/download/</a> (use “Direct File Download”). Drill can also be installed via <ahref="https://drill.apache.org/docs/running-drill-on-docker/">Docker</a>. For local installs on Unix-like systems, a common/suggested location for the Drill install directory is <code>/usr/local/drill</code>.</p>
<p>Drill embedded (started using the <code>$DRILL_BASE_DIR/bin/drill-embedded</code> script) is a super-easy way to get started playing with Drill on a single workstation, and many workflows can “get by” using Drill this way.</p>
<p>The following functions are implemented:</p>
<p><strong><code>DBI</code></strong></p>
<ul>
<li>An R <code>DBI</code> driver, as complete as practical, has been implemented using the Drill REST API, mostly to facilitate the <code>dplyr</code> interface. Use the <code>RJDBC</code> driver interface if you need more <code>DBI</code> functionality.</li>
<li>This also means that SQL functions unique to Drill have been “implemented” (i.e. made accessible to the <code>dplyr</code> interface). If you have custom Drill SQL functions that need to be implemented, please file an issue on GitHub.</li>
</ul>
<p><strong><code>RJDBC</code></strong></p>
<p><strong><code>DBI</code></strong> (RJDBC)</p>
<ul>
<li>
<code>drill_jdbc</code>: Connect to Drill using JDBC, enabling use of <code>DBI</code> idioms. See <code>RJDBC</code> for more info.</li>
<li>NOTE: The fully-qualified path to the Drill JDBC driver must be placed in the <code>DRILL_JDBC_JAR</code> environment variable. This is best done via <code>~/.Renviron</code> for interactive work, e.g. <code>DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.9.0.jar</code>
</li>
</ul>
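<p>As a minimal sketch of the note above, the following base R snippet checks that <code>DRILL_JDBC_JAR</code> is set and points at an existing file before any JDBC connection is attempted. The helper name <code>check_drill_jdbc_jar()</code> is hypothetical and is not part of the package; it only illustrates the environment-variable convention described here.</p>

```r
# Hypothetical helper (NOT part of sergeant): sanity-check the DRILL_JDBC_JAR
# environment variable before trying to connect over JDBC.
check_drill_jdbc_jar <- function(jar = Sys.getenv("DRILL_JDBC_JAR")) {
  # Sys.getenv() returns "" when the variable is unset
  if (!nzchar(jar)) {
    stop("DRILL_JDBC_JAR is not set; add it to ~/.Renviron and restart R.",
         call. = FALSE)
  }
  # The variable must hold the full path to the driver jar itself
  if (!file.exists(jar)) {
    stop("DRILL_JDBC_JAR points to a missing file: ", jar, call. = FALSE)
  }
  # Return the normalized path invisibly so the call can be used in a pipeline
  invisible(normalizePath(jar))
}
```

<p>Calling <code>check_drill_jdbc_jar()</code> with no arguments reads the variable from the current session; remember that changes to <code>~/.Renviron</code> only take effect after R is restarted.</p>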
<p><strong><code>dplyr</code></strong>:</p>
<p>NOTE: The fully-qualified path to the Drill JDBC driver must be placed in the <code>DRILL_JDBC_JAR</code> environment variable. This is best done via <code>~/.Renviron</code> for interactive work, e.g. <code>DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.14.0.jar</code></p>
<code>src_drill</code>: Connect to Drill (using dplyr) + supporting functions</li>
<code>src_drill_jdbc</code>: Connect to Drill (using dplyr and RJDBC) + supporting functions</li>
</ul>
<p>See <code>dplyr</code> for the <code>dplyr</code> operations (light testing shows they work in basic SQL use-cases but Drill’s SQL engine has issues with more complex queries).</p>
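<p>A minimal sketch of the <code>dplyr</code> interface, assuming the <code>sergeant</code> and <code>dplyr</code> packages are installed and a Drill instance (for example <code>drill-embedded</code>) is listening on <code>localhost</code>. It queries the <code>cp.`employee.json`</code> sample data used elsewhere on this page and is guarded so that it does nothing harmful when no server is reachable.</p>

```r
library(sergeant)
library(dplyr)

# Assumption: Drill is running locally with its REST API on the default port
dc <- drill_connection("localhost")

if (drill_active(dc)) {
  # Connect through the dplyr interface and reference the bundled sample table
  ds <- src_drill("localhost")
  db <- tbl(ds, "cp.`employee.json`")

  # The aggregation is translated to SQL and run by Drill;
  # collect() pulls the result back into R
  db %>%
    count(gender, marital_status) %>%
    collect() %>%
    print()
} else {
  message("No Drill server reachable on localhost; start drill-embedded first.")
}
```

<p>Keeping the pipeline simple is deliberate here, given the caveat above about more complex queries.</p>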
<p><strong>Drill APIs</strong>:</p>
<ul>
<li>
<code>drill_connection</code>: Setup parameters for a Drill server/cluster connection</li>
<li>
<code>drill_active</code>: Test whether Drill HTTP REST API server is up</li>
<li>
<code>drill_cancel</code>: Cancel the query that has the given queryid</li>
<li>
<code>drill_jdbc</code>: Connect to Drill using JDBC</li>
<li>
<code>drill_metrics</code>: Get the current memory metrics</li>
<li>
<code>drill_options</code>: List the name, default, and data type of the system and session options</li>
<li>
<code>drill_profile</code>: Get the profile of the query that has the given query id</li>
<li>
<code>drill_profiles</code>: Get the profiles of running and completed queries</li>
<li>
<code>drill_query</code>: Submit a query and return results</li>
<li>
<code>drill_set</code>: Set Drill SYSTEM or SESSION options</li>
<li>
<code>drill_settings_reset</code>: Changes (optionally, all) session settings back to system defaults</li>
<li>
<code>drill_show_files</code>: Show files in a file system schema.</li>
<li>
<code>drill_show_schemas</code>: Returns a list of available schemas.</li>
<li>
<code>drill_stats</code>: Get Drillbit information, such as port numbers</li>
<li>
<code>drill_status</code>: Get the status of Drill</li>
<li>
<code>drill_storage</code>: Get the list of storage plugin names and configurations</li>
<li>
<code>drill_system_reset</code>: Changes (optionally, all) system settings back to system defaults</li>
<li>
<code>drill_threads</code>: Get information about threads</li>
<li>
<code>drill_uplift</code>: Turn columnar query results into a type-converted tbl</li>
<li>
<code>drill_use</code>: Change to a particular schema.</li>
<li>
<code>drill_version</code>: Identify the version of Drill running</li>
ds <-<spanclass="st"></span><spanclass="kw"><ahref="reference/src_drill.html">src_drill</a></span>(<spanclass="st">"localhost"</span>) <spanclass="co"># use localhost if running standalone on same system otherwise the host or IP of your Drill server</span>
<spanclass="co">#> 1 18 F 18 1914-02-02 1140 Stand Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 2 18 M 18 1914-02-02 1140 Burnham Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 3 18 F 18 1914-02-02 1139 Doolittle Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 4 18 M 18 1914-02-02 1139 Pirnie Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 5 18 M 17 1914-02-02 1140 Younce Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 6 18 F 17 1914-02-02 1140 Biltoft Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 7 18 M 17 1914-02-02 1139 Detwiler Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 8 18 F 17 1914-02-02 1139 Ciruli Store Permanent Stocker 1998-01-01</span>
<spanclass="co">#> 9 18 F 16 1914-02-02 1140 Bishop Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 10 18 F 16 1914-02-02 1140 Cutwright Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 11 18 F 16 1914-02-02 1139 Anderson Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 12 18 F 16 1914-02-02 1139 Swartwood Store Temporary Checker 1998-01-01</span>
<spanclass="co">#> 13 18 M 15 1914-02-02 1140 Curtsinger Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 14 18 F 15 1914-02-02 1140 Quick Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 15 18 M 15 1914-02-02 1139 Souza Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 16 18 M 15 1914-02-02 1139 Compagno Store Permanent Checker 1998-01-01</span>
<spanclass="co">#> 17 18 M 11 1961-09-24 1139 Jaramillo Store Shift Supervisor 1998-01-01</span>
<spanclass="co">#> 18 18 M 11 1972-05-12 17 Belsey Store Assistant Manager 1998-01-01</span>
<spanclass="co">#> 19 12 M 18 1914-02-02 1069 Eichorn Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> 20 12 F 18 1914-02-02 1069 Geiermann Store Temporary Stocker 1998-01-01</span>
<spanclass="co">#> # ... with more rows, and 8 more variables: management_role <chr>, salary <dbl>, marital_status <chr>, full_name <chr>,</span>
dc <-<spanclass="st"></span><spanclass="kw"><ahref="reference/drill_connection.html">drill_connection</a></span>(<spanclass="st">"localhost"</span>)
<spanclass="co">#> 10 1 M 11 1967-06-20 5 Murraiin Store Manager 1998-01-01 Store Management</span>
<spanclass="co">#> # ... with 90 more rows, and 7 more variables: salary <dbl>, marital_status <chr>, full_name <chr>, employee_id <int>,</span>
<spanclass="kw"><ahref="reference/drill_query.html">drill_query</a></span>(dc, <spanclass="st">"SELECT COUNT(gender) AS gender FROM cp.`employee.json` GROUP BY gender"</span>)
<spanclass="co">#> Parsed with column specification:</span>
<spanclass="co"># Use this if connecting to a cluster with zookeeper</span>
<spanclass="co"># con <- drill_jdbc("drill-node:2181", "drillbits1") </span>
<spanclass="co"># Use the following if running drill-embedded</span>
con <-<spanclass="st"></span><spanclass="kw"><ahref="reference/drill_jdbc.html">drill_jdbc</a></span>(<spanclass="st">"localhost:31010"</span>, <spanclass="dt">use_zk=</span><spanclass="ot">FALSE</span>)
<spanclass="co">#> Using [jdbc:drill:drillbit=localhost:31010]...</span>
<spanclass="kw"><ahref="reference/drill_query.html">drill_query</a></span>(con, <spanclass="st">"SELECT * FROM cp.`employee.json`"</span>)
<spanclass="co">#> # A tibble: 1,155 x 16</span>
<spanclass="co">#> 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 11</span>
<spanclass="co">#> 8 9 Brenda Blumberg Brenda Blumberg 11 Store Manager 21 11</span>
<spanclass="co">#> 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 5</span>
<spanclass="co">#> 10 11 Jonathan Murraiin Jonathan Murraiin 11 Store Manager 1 11</span>
<spanclass="co">#> # ... with 1,145 more rows, and 8 more variables: birth_date <chr>, hire_date <chr>, salary <dbl>, supervisor_id <dbl>,</span>
<aclass="sourceLine"id="cb2-4"data-line-number="4"><spanclass="co"># use localhost if running standalone on same system otherwise the host or IP of your Drill server</span></a>
<aclass="sourceLine"id="cb2-60"data-line-number="60"><spanclass="co">## 1 1156. Kris Stand Kris Stand 18. Store Tempora… 18. 18. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-64"data-line-number="64"><spanclass="co">## 5 1152. Barbara Y… Barbara Younce 17. Store Permane… 18. 17. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-70"data-line-number="70"><spanclass="co">## 11 1146. Elizabeth… Elizabeth Anderson 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-71"data-line-number="71"><spanclass="co">## 12 1145. Michael S… Michael Swartwood 16. Store Tempora… 18. 16. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-73"data-line-number="73"><spanclass="co">## 14 1143. Ana Quick Ana Quick 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-75"data-line-number="75"><spanclass="co">## 16 1141. James Com… James Compagno 15. Store Permane… 18. 15. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-78"data-line-number="78"><spanclass="co">## 19 1138. James Eic… James Eichorn 18. Store Tempora… 12. 18. 1914-02-02 1998-01-0…</span></a>
<aclass="sourceLine"id="cb2-80"data-line-number="80"><spanclass="co">## # ... with more rows, and 6 more variables: salary <dbl>, supervisor_id <dbl>, education_level <chr>,</span></a>
<ahref="#code-of-conduct"class="anchor"></a>Code of Conduct</h3>
<p>Please note that this project is released with a <ahref="CONDUCT.md">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
<ahref="#code-of-conduct"class="anchor"></a>Code of Conduct</h2>
<p>Please note that this project is released with a <ahref="CONDUCT.html">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
<li>Browse source code at <br><ahref="http://github.com/hrbrmstr/sergeant">http://github.com/hrbrmstr/sergeant</a>
<li>Browse source code at <br><ahref="https://github.com/hrbrmstr/sergeant-caffeinated">https://github.com/hrbrmstr/sergeant-caffeinated</a>
</li>
<li>Report a bug at <br><ahref="https://github.com/hrbrmstr/sergeant/issues">https://github.com/hrbrmstr/sergeant/issues</a>
<li>Report a bug at <br><ahref="https://github.com/hrbrmstr/sergeant-caffeinated/issues">https://github.com/hrbrmstr/sergeant-caffeinated/issues</a>
<li>JDBC driver still in github repo but no longer included in pkg builds. See README.md or <code><ahref="../reference/drill_jdbc.html">drill_jdbc()</a></code> help for more information on using the JDBC driver with sergeant.</li>
<li>implemented a large subset of Drill SQL Functions <ahref="https://drill.apache.org/docs/about-sql-function-examples/"class="uri">https://drill.apache.org/docs/about-sql-function-examples/</a>
<li>can pass RJDBC connections made with <code><ahref="../reference/drill_jdbc.html">drill_jdbc()</a></code> to <code><ahref="../reference/drill_query.html">drill_query()</a></code>
<li>can pass RJDBC connections made with <code><ahref="../reference/drill_jdbc.html">drill_jdbc()</a></code> to <code>drill_query()</code>
</li>
<li>finally enabled the <code>nodes</code> parameter to be a multi-element character vector, as stated in the function description</li>
<li>tweaked <code><ahref="../reference/drill_query.html">drill_query()</a></code> and <code><ahref="../reference/drill_version.html">drill_version()</a></code>
<li>tweaked <code>drill_query()</code> and <code>drill_version()</code>
<spanclass='fu'><ahref='drill_query.html'>drill_query</a></span>(<spanclass='no'>con</span>, <spanclass='st'>"SELECT * FROM cp.`employee.json`"</span>)
<preclass="examples"><spanclass='co'># NOT RUN {</span>