<!DOCTYPE html>
<!-- Generated by pkgdown: do not edit by hand --><html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Tools to Transform and Query Data with 'Apache' 'Drill' • sergeant</title>
<!-- jquery --><script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha384-nrOSfDHtoPMzJHjVTdCopGqIqeYETSXhZDFyniQ8ZHcVy08QesyHcnOUpMpqnmWq" crossorigin="anonymous"></script><!-- Bootstrap --><link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script><!-- Font Awesome icons --><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous">
<!-- pkgdown --><link href="pkgdown.css" rel="stylesheet">
<script src="jquery.sticky-kit.min.js"></script><script src="pkgdown.js"></script><!-- mathjax --><script src="https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="container template-vignette">
<header><div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">sergeant</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="reference/index.html">Reference</a>
</li>
<li>
<a href="news/index.html">News</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="http://github.com/hrbrmstr/sergeant">
<span class="fa fa-github fa-lg"></span>
</a>
</li>
</ul>
</div>
<!--/.nav-collapse -->
</div>
<!--/.container -->
</div>
<!--/.navbar -->
</header><div class="row">
<div class="col-md-9">
<div class="contents">
<!-- README.md is generated from README.Rmd. Please edit that file -->
<p><img src="sergeant.png" width="33" align="left" style="padding-right:20px"></p>
<p><code>sergeant</code> : Tools to Transform and Query Data with ‘Apache’ ‘Drill’</p>
<p>Drill + <code>sergeant</code> is (IMO) a nice alternative to Spark + <code>sparklyr</code> if you don’t need the ML components of Spark, i.e. you just need to query “big data” sources, interface with parquet, or combine disparate data source types (JSON, CSV, parquet, RDBMS) for aggregation. Drill also has support for spatial queries.</p>
<p>I find writing SQL queries to parquet files with Drill on a local Linux or macOS workstation to be more performant than doing the data ingestion work with R (for large or disparate data sets). I also work with many tiny JSON files on a daily basis and Drill makes it much easier to do so. YMMV.</p>
<p>You can download Drill from <a href="https://drill.apache.org/download/" class="uri">https://drill.apache.org/download/</a> (use “Direct File Download”). I use <code>/usr/local/drill</code> as the install directory. <code>drill-embedded</code> is a super-easy way to get started playing with Drill on a single workstation and most of my workflows can get by using Drill this way. If there is sufficient desire for an automated downloader and a way to start the <code>drill-embedded</code> server from within R, please file an issue.</p>
<p>There are a few convenience wrappers for various informational SQL queries (like <code><a href="reference/drill_version.html">drill_version()</a></code>). Please file a PR if you add more.</p>
<p>The package has been written with retrieval of rectangular data sources in mind. If you need/want a version of <code><a href="reference/drill_query.html">drill_query()</a></code> that will enable returning of non-rectangular data (which is possible with Drill) then please file an issue.</p>
<p>Some of the REST API functions that deal with controlling Drill rather than with data operations aren’t implemented. Please file a PR if you need those.</p>
<p>Finally, I run most of this locally and at home, so it’s all been coded with no authentication or encryption in mind. If you want/need support for that, please file an issue. If there is demand for this, it will change the R API a bit (I’ve already thought out what to do but have no need for it right now).</p>
<p>The following functions are implemented:</p>
<p><strong><code>DBI</code></strong></p>
<ul>
<li>As complete an R <code>DBI</code> driver as possible has been implemented using the Drill REST API, mostly to facilitate the <code>dplyr</code> interface. Use the <code>RJDBC</code> driver interface if you need more <code>DBI</code> functionality.</li>
<li>This also means that SQL functions unique to Drill have been “implemented” (i.e. made accessible to the <code>dplyr</code> interface). If you have custom Drill SQL functions that need to be implemented, please file an issue on GitHub.</li>
</ul>
<p><strong><code>RJDBC</code></strong></p>
<ul>
<li>
<code>drill_jdbc</code>: Connect to Drill using JDBC, enabling use of <code>RJDBC</code> idioms. See <code>RJDBC</code> for more info.</li>
<li>NOTE: The fully-qualified path to the Drill JDBC driver JAR must be placed in the <code>DRILL_JDBC_JAR</code> environment variable. This is best done via <code>~/.Renviron</code> for interactive work, e.g. <code>DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.9.0.jar</code>
</li>
</ul>
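<p>A minimal sketch of checking that setting from R before calling <code>drill_jdbc()</code>. The jar path is just the example path from the note above and is an assumption; it will differ with your Drill version and install directory. R reads <code>~/.Renviron</code> at startup, so restart the session (or call <code>readRenviron("~/.Renviron")</code>) after editing the file:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># ~/.Renviron is assumed to contain a line such as:
# DRILL_JDBC_JAR=/usr/local/drill/jars/drill-jdbc-all-1.9.0.jar

# Sys.getenv() returns "" when the variable is not set
jar &lt;- Sys.getenv("DRILL_JDBC_JAR")

# fail early with a readable message instead of a Java class-loading error
if (!nzchar(jar) || !file.exists(jar)) {
  stop("DRILL_JDBC_JAR is unset or does not point to an existing file")
}</code></pre></div>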
<p><strong><code>dplyr</code></strong>:</p>
<ul>
<li>
<code>src_drill</code>: Connect to Drill (using dplyr) + supporting functions</li>
</ul>
<p>See <code>dplyr</code> for the <code>dplyr</code> operations (light testing shows they work in basic SQL use-cases but Drill’s SQL engine has issues with more complex queries).</p>
<p><strong>Drill APIs</strong>:</p>
<ul>
<li>
<code>drill_connection</code>: Set up parameters for a Drill server/cluster connection</li>
<li>
<code>drill_active</code>: Test whether Drill HTTP REST API server is up</li>
<li>
<code>drill_cancel</code>: Cancel the query that has the given query id</li>
<li>
<code>drill_jdbc</code>: Connect to Drill using JDBC</li>
<li>
<code>drill_metrics</code>: Get the current memory metrics</li>
<li>
<code>drill_options</code>: List the name, default, and data type of the system and session options</li>
<li>
<code>drill_profile</code>: Get the profile of the query that has the given query id</li>
<li>
<code>drill_profiles</code>: Get the profiles of running and completed queries</li>
<li>
<code>drill_query</code>: Submit a query and return results</li>
<li>
<code>drill_set</code>: Set Drill SYSTEM or SESSION options</li>
<li>
<code>drill_settings_reset</code>: Changes (optionally, all) session settings back to system defaults</li>
<li>
<code>drill_show_files</code>: Show files in a file system schema.</li>
<li>
<code>drill_show_schemas</code>: Returns a list of available schemas.</li>
<li>
<code>drill_stats</code>: Get Drillbit information, such as port numbers</li>
<li>
<code>drill_status</code>: Get the status of Drill</li>
<li>
<code>drill_storage</code>: Get the list of storage plugin names and configurations</li>
<li>
<code>drill_system_reset</code>: Changes (optionally, all) system settings back to system defaults</li>
<li>
<code>drill_threads</code>: Get information about threads</li>
<li>
<code>drill_uplift</code>: Turn columnar query results into a type-converted tbl</li>
<li>
<code>drill_use</code>: Change to a particular schema.</li>
<li>
<code>drill_version</code>: Identify the version of Drill running</li>
</ul>
<div id="installation" class="section level3">
<h3 class="hasAnchor">
<a href="#installation" class="anchor"></a>Installation</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">devtools<span class="op">::</span><span class="kw">install_github</span>(<span class="st">"hrbrmstr/sergeant"</span>)</code></pre></div>
</div>
<div id="experimental-dplyr-interface" class="section level3">
<h3 class="hasAnchor">
<a href="#experimental-dplyr-interface" class="anchor"></a>Experimental <code>dplyr</code> interface</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(sergeant)
ds &lt;-<span class="st"> </span><span class="kw"><a href="reference/src_drill.html">src_drill</a></span>(<span class="st">"localhost"</span>) <span class="co"># use localhost if running standalone on same system otherwise the host or IP of your Drill server</span>
ds
<span class="co">#&gt; src: DrillConnection</span>
<span class="co">#&gt; tbls: INFORMATION_SCHEMA, cp.default, dfs.default, dfs.root, dfs.tmp, sys</span>
db &lt;-<span class="st"> </span><span class="kw">tbl</span>(ds, <span class="st">"cp.`employee.json`"</span>)
<span class="co"># without `collect()`:</span>
<span class="kw">count</span>(db, gender, marital_status)
<span class="co">#&gt; # Source: lazy query [?? x 3]</span>
<span class="co">#&gt; # Database: DrillConnection</span>
<span class="co">#&gt; # Groups: gender</span>
<span class="co">#&gt; marital_status gender n</span>
<span class="co">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 S F 297</span>
<span class="co">#&gt; 2 M M 278</span>
<span class="co">#&gt; 3 S M 276</span>
<span class="co">#&gt; 4 M F 304</span>
<span class="co"># ^^ gets translated to:</span>
<span class="co"># </span>
<span class="co"># SELECT *</span>
<span class="co"># FROM (SELECT gender , marital_status , COUNT(*) AS n </span>
<span class="co"># FROM cp.`employee.json` </span>
<span class="co"># GROUP BY gender , marital_status ) govketbhqb </span>
<span class="co"># LIMIT 1000</span>
<span class="kw">count</span>(db, gender, marital_status) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">collect</span>()
<span class="co">#&gt; # A tibble: 4 x 3</span>
<span class="co">#&gt; # Groups: gender [2]</span>
<span class="co">#&gt; marital_status gender n</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;chr&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 S F 297</span>
<span class="co">#&gt; 2 M M 278</span>
<span class="co">#&gt; 3 S M 276</span>
<span class="co">#&gt; 4 M F 304</span>
<span class="co"># ^^ gets translated to:</span>
<span class="co"># </span>
<span class="co"># SELECT gender , marital_status , COUNT(*) AS n </span>
<span class="co"># FROM cp.`employee.json` </span>
<span class="co"># GROUP BY gender , marital_status </span>
<span class="kw">group_by</span>(db, position_title) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">count</span>(gender) -&gt;<span class="st"> </span>tmp2
<span class="kw">group_by</span>(db, position_title) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">count</span>(gender) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">ungroup</span>() <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">mutate</span>(<span class="dt">full_desc=</span><span class="kw">ifelse</span>(gender<span class="op">==</span><span class="st">"F"</span>, <span class="st">"Female"</span>, <span class="st">"Male"</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">collect</span>() <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">select</span>(<span class="dt">Title=</span>position_title, <span class="dt">Gender=</span>full_desc, <span class="dt">Count=</span>n)
<span class="co">#&gt; # A tibble: 30 x 3</span>
<span class="co">#&gt; Title Gender Count</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;chr&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 President Female 1</span>
<span class="co">#&gt; 2 VP Country Manager Male 3</span>
<span class="co">#&gt; 3 VP Country Manager Female 3</span>
<span class="co">#&gt; 4 VP Information Systems Female 1</span>
<span class="co">#&gt; 5 VP Human Resources Female 1</span>
<span class="co">#&gt; 6 Store Manager Female 13</span>
<span class="co">#&gt; 7 VP Finance Male 1</span>
<span class="co">#&gt; 8 Store Manager Male 11</span>
<span class="co">#&gt; 9 HQ Marketing Female 2</span>
<span class="co">#&gt; 10 HQ Information Systems Female 4</span>
<span class="co">#&gt; # ... with 20 more rows</span>
<span class="co"># ^^ gets translated to:</span>
<span class="co"># </span>
<span class="co"># SELECT position_title , gender , n ,</span>
<span class="co"># CASE WHEN ( gender = 'F') THEN ('Female') ELSE ('Male') </span><span class="re">END</span><span class="co"> AS full_desc </span>
<span class="co"># FROM (SELECT position_title , gender , COUNT(*) AS n </span>
<span class="co"># FROM cp.`employee.json` </span>
<span class="co"># GROUP BY position_title , gender ) dcyuypuypb </span>
<span class="kw">arrange</span>(db, <span class="kw">desc</span>(employee_id)) <span class="op">%&gt;%</span><span class="st"> </span><span class="kw">print</span>(<span class="dt">n=</span><span class="dv">20</span>)
<span class="co">#&gt; # Source: table&lt;cp.`employee.json`&gt; [?? x 16]</span>
<span class="co">#&gt; # Database: DrillConnection</span>
<span class="co">#&gt; # Ordered by: desc(employee_id)</span>
<span class="co">#&gt; store_id gender department_id birth_date supervisor_id last_name position_title hire_date</span>
<span class="co">#&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt; &lt;date&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt;</span>
<span class="co">#&gt; 1 18 F 18 1914-02-02 1140 Stand Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; 2 18 M 18 1914-02-02 1140 Burnham Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; 3 18 F 18 1914-02-02 1139 Doolittle Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; 4 18 M 18 1914-02-02 1139 Pirnie Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; 5 18 M 17 1914-02-02 1140 Younce Store Permanent Stocker 1998-01-01</span>
<span class="co">#&gt; 6 18 F 17 1914-02-02 1140 Biltoft Store Permanent Stocker 1998-01-01</span>
<span class="co">#&gt; 7 18 M 17 1914-02-02 1139 Detwiler Store Permanent Stocker 1998-01-01</span>
<span class="co">#&gt; 8 18 F 17 1914-02-02 1139 Ciruli Store Permanent Stocker 1998-01-01</span>
<span class="co">#&gt; 9 18 F 16 1914-02-02 1140 Bishop Store Temporary Checker 1998-01-01</span>
<span class="co">#&gt; 10 18 F 16 1914-02-02 1140 Cutwright Store Temporary Checker 1998-01-01</span>
<span class="co">#&gt; 11 18 F 16 1914-02-02 1139 Anderson Store Temporary Checker 1998-01-01</span>
<span class="co">#&gt; 12 18 F 16 1914-02-02 1139 Swartwood Store Temporary Checker 1998-01-01</span>
<span class="co">#&gt; 13 18 M 15 1914-02-02 1140 Curtsinger Store Permanent Checker 1998-01-01</span>
<span class="co">#&gt; 14 18 F 15 1914-02-02 1140 Quick Store Permanent Checker 1998-01-01</span>
<span class="co">#&gt; 15 18 M 15 1914-02-02 1139 Souza Store Permanent Checker 1998-01-01</span>
<span class="co">#&gt; 16 18 M 15 1914-02-02 1139 Compagno Store Permanent Checker 1998-01-01</span>
<span class="co">#&gt; 17 18 M 11 1961-09-24 1139 Jaramillo Store Shift Supervisor 1998-01-01</span>
<span class="co">#&gt; 18 18 M 11 1972-05-12 17 Belsey Store Assistant Manager 1998-01-01</span>
<span class="co">#&gt; 19 12 M 18 1914-02-02 1069 Eichorn Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; 20 12 F 18 1914-02-02 1069 Geiermann Store Temporary Stocker 1998-01-01</span>
<span class="co">#&gt; # ... with more rows, and 8 more variables: management_role &lt;chr&gt;, salary &lt;dbl&gt;, marital_status &lt;chr&gt;, full_name &lt;chr&gt;,</span>
<span class="co">#&gt; # employee_id &lt;int&gt;, education_level &lt;chr&gt;, first_name &lt;chr&gt;, position_id &lt;int&gt;</span>
<span class="co"># ^^ gets translated to:</span>
<span class="co"># </span>
<span class="co"># SELECT *</span>
<span class="co"># FROM (SELECT *</span>
<span class="co"># FROM cp.`employee.json` </span>
<span class="co"># ORDER BY employee_id DESC) lvpxoaejbc </span>
<span class="co"># LIMIT 5</span>
<span class="kw">mutate</span>(db, <span class="dt">position_title=</span><span class="kw">tolower</span>(position_title)) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">mutate</span>(<span class="dt">salary=</span><span class="kw">as.numeric</span>(salary)) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">mutate</span>(<span class="dt">gender=</span><span class="kw">ifelse</span>(gender<span class="op">==</span><span class="st">"F"</span>, <span class="st">"Female"</span>, <span class="st">"Male"</span>)) <span class="op">%&gt;%</span>
<span class="st"> </span><span class="kw">mutate</span>(<span class="dt">marital_status=</span><span class="kw">ifelse</span>(marital_status<span class="op">==</span><span class="st">"S"</span>, <span class="st">"Single"</span>, <span class="st">"Married"</span>)) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">group_by</span>(supervisor_id) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">summarise</span>(<span class="dt">underlings_count=</span><span class="kw">n</span>()) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span><span class="kw">collect</span>()
<span class="co">#&gt; # A tibble: 112 x 2</span>
<span class="co">#&gt; supervisor_id underlings_count</span>
<span class="co">#&gt; * &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 0 1</span>
<span class="co">#&gt; 2 1 7</span>
<span class="co">#&gt; 3 5 9</span>
<span class="co">#&gt; 4 4 2</span>
<span class="co">#&gt; 5 2 3</span>
<span class="co">#&gt; 6 20 2</span>
<span class="co">#&gt; 7 21 4</span>
<span class="co">#&gt; 8 22 7</span>
<span class="co">#&gt; 9 6 4</span>
<span class="co">#&gt; 10 36 2</span>
<span class="co">#&gt; # ... with 102 more rows</span>
<span class="co"># ^^ gets translated to:</span>
<span class="co"># </span>
<span class="co"># SELECT supervisor_id , COUNT(*) AS underlings_count </span>
<span class="co"># FROM (SELECT employee_id , full_name , first_name , last_name , position_id , position_title , store_id , department_id , birth_date , hire_date , salary , supervisor_id , education_level , gender , management_role , CASE WHEN ( marital_status = 'S') THEN ('Single') ELSE ('Married') </span><span class="re">END</span><span class="co"> AS marital_status </span>
<span class="co"># FROM (SELECT employee_id , full_name , first_name , last_name , position_id , position_title , store_id , department_id , birth_date , hire_date , salary , supervisor_id , education_level , marital_status , management_role , CASE WHEN ( gender = 'F') THEN ('Female') ELSE ('Male') </span><span class="re">END</span><span class="co"> AS gender </span>
<span class="co"># FROM (SELECT employee_id , full_name , first_name , last_name , position_id , position_title , store_id , department_id , birth_date , hire_date , supervisor_id , education_level , marital_status , gender , management_role , CAST( salary AS DOUBLE) AS salary </span>
<span class="co"># FROM (SELECT employee_id , full_name , first_name , last_name , position_id , store_id , department_id , birth_date , hire_date , salary , supervisor_id , education_level , marital_status , gender , management_role , LOWER( position_title ) AS position_title </span>
<span class="co"># FROM cp.`employee.json` ) cnjsqxeick ) bnbnjrubna ) wavfmhkczv ) zaxeyyicxo </span>
<span class="co"># GROUP BY supervisor_id </span></code></pre></div>
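<p>To inspect the generated SQL yourself rather than relying on the comments above, here is a minimal sketch. It assumes the lazy table returned by <code>tbl()</code> works with <code>dplyr</code>’s <code>show_query()</code> generic the way other database-backed tables do, and it needs a running Drill instance on <code>localhost</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(sergeant)
library(dplyr)

# same connection and table as in the examples above
ds &lt;- src_drill("localhost")
db &lt;- tbl(ds, "cp.`employee.json`")

# build the lazy query without collecting any results
q &lt;- count(db, gender, marital_status)

# print the SQL that dplyr generates for this query
show_query(q)</code></pre></div>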
</div>
<div id="usage" class="section level3">
<h3 class="hasAnchor">
<a href="#usage" class="anchor"></a>Usage</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(sergeant)
<span class="co"># current version</span>
<span class="kw">packageVersion</span>(<span class="st">"sergeant"</span>)
<span class="co">#&gt; [1] '0.5.0'</span>
dc &lt;-<span class="st"> </span><span class="kw"><a href="reference/drill_connection.html">drill_connection</a></span>(<span class="st">"localhost"</span>)
<span class="kw"><a href="reference/drill_active.html">drill_active</a></span>(dc)
<span class="co">#&gt; [1] TRUE</span>
<span class="kw"><a href="reference/drill_version.html">drill_version</a></span>(dc)
<span class="co">#&gt; [1] "1.10.0"</span>
<span class="kw"><a href="reference/drill_storage.html">drill_storage</a></span>(dc)<span class="op">$</span>name
<span class="co">#&gt; [1] "cp" "dfs" "hbase" "hive" "kudu" "mongo" "s3"</span></code></pre></div>
<p>Working with the built-in JSON data sets:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT * FROM cp.`employee.json` limit 100"</span>)
<span class="co">#&gt; Parsed with column specification:</span>
<span class="co">#&gt; cols(</span>
<span class="co">#&gt; store_id = col_integer(),</span>
<span class="co">#&gt; gender = col_character(),</span>
<span class="co">#&gt; department_id = col_integer(),</span>
<span class="co">#&gt; birth_date = col_date(format = ""),</span>
<span class="co">#&gt; supervisor_id = col_integer(),</span>
<span class="co">#&gt; last_name = col_character(),</span>
<span class="co">#&gt; position_title = col_character(),</span>
<span class="co">#&gt; hire_date = col_datetime(format = ""),</span>
<span class="co">#&gt; management_role = col_character(),</span>
<span class="co">#&gt; salary = col_double(),</span>
<span class="co">#&gt; marital_status = col_character(),</span>
<span class="co">#&gt; full_name = col_character(),</span>
<span class="co">#&gt; employee_id = col_integer(),</span>
<span class="co">#&gt; education_level = col_character(),</span>
<span class="co">#&gt; first_name = col_character(),</span>
<span class="co">#&gt; position_id = col_integer()</span>
<span class="co">#&gt; )</span>
<span class="co">#&gt; # A tibble: 100 x 16</span>
<span class="co">#&gt; store_id gender department_id birth_date supervisor_id last_name position_title hire_date management_role</span>
<span class="co">#&gt; * &lt;int&gt; &lt;chr&gt; &lt;int&gt; &lt;date&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1 0 F 1 1961-08-26 0 Nowmer President 1994-12-01 Senior Management</span>
<span class="co">#&gt; 2 0 M 1 1915-07-03 1 Whelply VP Country Manager 1994-12-01 Senior Management</span>
<span class="co">#&gt; 3 0 M 1 1969-06-20 1 Spence VP Country Manager 1998-01-01 Senior Management</span>
<span class="co">#&gt; 4 0 F 1 1951-05-10 1 Gutierrez VP Country Manager 1998-01-01 Senior Management</span>
<span class="co">#&gt; 5 0 F 2 1942-10-08 1 Damstra VP Information Systems 1994-12-01 Senior Management</span>
<span class="co">#&gt; 6 0 F 3 1949-03-27 1 Kanagaki VP Human Resources 1994-12-01 Senior Management</span>
<span class="co">#&gt; 7 9 F 11 1922-08-10 5 Brunner Store Manager 1998-01-01 Store Management</span>
<span class="co">#&gt; 8 21 F 11 1979-06-23 5 Blumberg Store Manager 1998-01-01 Store Management</span>
<span class="co">#&gt; 9 0 M 5 1949-08-26 1 Stanz VP Finance 1994-12-01 Senior Management</span>
<span class="co">#&gt; 10 1 M 11 1967-06-20 5 Murraiin Store Manager 1998-01-01 Store Management</span>
<span class="co">#&gt; # ... with 90 more rows, and 7 more variables: salary &lt;dbl&gt;, marital_status &lt;chr&gt;, full_name &lt;chr&gt;, employee_id &lt;int&gt;,</span>
<span class="co">#&gt; # education_level &lt;chr&gt;, first_name &lt;chr&gt;, position_id &lt;int&gt;</span>
<span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT COUNT(gender) AS gender FROM cp.`employee.json` GROUP BY gender"</span>)
<span class="co">#&gt; Parsed with column specification:</span>
<span class="co">#&gt; cols(</span>
<span class="co">#&gt; gender = col_integer()</span>
<span class="co">#&gt; )</span>
<span class="co">#&gt; # A tibble: 2 x 1</span>
<span class="co">#&gt; gender</span>
<span class="co">#&gt; * &lt;int&gt;</span>
<span class="co">#&gt; 1 601</span>
<span class="co">#&gt; 2 554</span>
<span class="kw"><a href="reference/drill_options.html">drill_options</a></span>(dc)
<span class="co">#&gt; # A tibble: 113 x 4</span>
<span class="co">#&gt; name value type kind</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1 planner.enable_hash_single_key TRUE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 2 store.parquet.reader.pagereader.queuesize 2 SYSTEM LONG</span>
<span class="co">#&gt; 3 planner.enable_limit0_optimization FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 4 store.json.read_numbers_as_double FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 5 planner.enable_constant_folding TRUE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 6 store.json.extended_types FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 7 planner.memory.non_blocking_operators_memory 64 SYSTEM LONG</span>
<span class="co">#&gt; 8 planner.enable_multiphase_agg TRUE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 9 exec.query_profile.debug_mode FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 10 planner.filter.max_selectivity_estimate_factor 1 SYSTEM DOUBLE</span>
<span class="co">#&gt; # ... with 103 more rows</span>
<span class="kw"><a href="reference/drill_options.html">drill_options</a></span>(dc, <span class="st">"json"</span>)
<span class="co">#&gt; # A tibble: 7 x 4</span>
<span class="co">#&gt; name value type kind</span>
<span class="co">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1 store.json.read_numbers_as_double FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 2 store.json.extended_types FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 3 store.json.writer.uglify FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 4 store.json.reader.skip_invalid_records FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 5 store.json.reader.print_skipped_invalid_record_number FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 6 store.json.all_text_mode FALSE SYSTEM BOOLEAN</span>
<span class="co">#&gt; 7 store.json.writer.skip_null_fields TRUE SYSTEM BOOLEAN</span></code></pre></div>
</div>
<div id="working-with-parquet-files" class="section level2">
<h2 class="hasAnchor">
<a href="#working-with-parquet-files" class="anchor"></a>Working with parquet files</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5"</span>)
<span class="co">#&gt; Parsed with column specification:</span>
<span class="co">#&gt; cols(</span>
<span class="co">#&gt; N_COMMENT = col_character(),</span>
<span class="co">#&gt; N_NAME = col_character(),</span>
<span class="co">#&gt; N_NATIONKEY = col_integer(),</span>
<span class="co">#&gt; N_REGIONKEY = col_integer()</span>
<span class="co">#&gt; )</span>
<span class="co">#&gt; # A tibble: 5 x 4</span>
<span class="co">#&gt; N_COMMENT N_NAME N_NATIONKEY N_REGIONKEY</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;</span>
<span class="co">#&gt; 1 haggle. carefully f ALGERIA 0 0</span>
<span class="co">#&gt; 2 al foxes promise sly ARGENTINA 1 1</span>
<span class="co">#&gt; 3 y alongside of the p BRAZIL 2 1</span>
<span class="co">#&gt; 4 eas hang ironic, sil CANADA 3 1</span>
<span class="co">#&gt; 5 y above the carefull EGYPT 4 4</span></code></pre></div>
<p>Including multiple parquet files in different directories (note the wildcard support):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5"</span>)
<span class="co">#&gt; Parsed with column specification:</span>
<span class="co">#&gt; cols(</span>
<span class="co">#&gt; N_COMMENT = col_character(),</span>
<span class="co">#&gt; N_NAME = col_character(),</span>
<span class="co">#&gt; N_NATIONKEY = col_integer(),</span>
<span class="co">#&gt; N_REGIONKEY = col_integer(),</span>
<span class="co">#&gt; dir0 = col_character()</span>
<span class="co">#&gt; )</span>
<span class="co">#&gt; # A tibble: 5 x 5</span>
<span class="co">#&gt; N_COMMENT N_NAME N_NATIONKEY N_REGIONKEY dir0</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;</span>
<span class="co">#&gt; 1 haggle. carefully f ALGERIA 0 0 nationsMF</span>
<span class="co">#&gt; 2 al foxes promise sly ARGENTINA 1 1 nationsMF</span>
<span class="co">#&gt; 3 y alongside of the p BRAZIL 2 1 nationsMF</span>
<span class="co">#&gt; 4 eas hang ironic, sil CANADA 3 1 nationsMF</span>
<span class="co">#&gt; 5 y above the carefull EGYPT 4 4 nationsMF</span></code></pre></div>
<div id="a-preview-of-the-built-in-support-for-spatial-ops" class="section level3">
<h3 class="hasAnchor">
<a href="#a-preview-of-the-built-in-support-for-spatial-ops" class="anchor"></a>A preview of the built-in support for spatial ops</h3>
<p>Via: <a href="https://github.com/k255/drill-gis" class="uri">https://github.com/k255/drill-gis</a></p>
<p>A common use case is to select data within the boundary of a given polygon:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(dc, <span class="st">"</span>
<span class="st">select columns[2] as city, columns[4] as lon, columns[3] as lat</span>
<span class="st"> from cp.`sample-data/CA-cities.csv`</span>
<span class="st"> where</span>
<span class="st"> ST_Within(</span>
<span class="st"> ST_Point(columns[4], columns[3]),</span>
<span class="st"> ST_GeomFromText(</span>
<span class="st"> 'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'</span>
<span class="st"> )</span>
<span class="st"> )</span>
<span class="st">"</span>)
<span class="co">#&gt; Parsed with column specification:</span>
<span class="co">#&gt; cols(</span>
<span class="co">#&gt; city = col_character(),</span>
<span class="co">#&gt; lon = col_double(),</span>
<span class="co">#&gt; lat = col_double()</span>
<span class="co">#&gt; )</span>
<span class="co">#&gt; # A tibble: 7 x 3</span>
<span class="co">#&gt; city lon lat</span>
<span class="co">#&gt; * &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1 Burbank -121.9316 37.32328</span>
<span class="co">#&gt; 2 San Jose -121.8950 37.33939</span>
<span class="co">#&gt; 3 Lick -121.8458 37.28716</span>
<span class="co">#&gt; 4 Willow Glen -121.8897 37.30855</span>
<span class="co">#&gt; 5 Buena Vista -121.9166 37.32133</span>
<span class="co">#&gt; 6 Parkmoor -121.9308 37.32105</span>
<span class="co">#&gt; 7 Fruitdale -121.9327 37.31086</span></code></pre></div>
</div>
<div id="jdbc" class="section level3">
<h3 class="hasAnchor">
<a href="#jdbc" class="anchor"></a>JDBC</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(RJDBC)
<span class="co">#&gt; Loading required package: rJava</span>
<span class="co"># Use this if connecting to a cluster coordinated by ZooKeeper</span>
<span class="co"># con &lt;- drill_jdbc("drill-node:2181", "drillbits1") </span>
<span class="co"># Use the following if running drill-embedded</span>
con &lt;-<span class="st"> </span><span class="kw"><a href="reference/drill_jdbc.html">drill_jdbc</a></span>(<span class="st">"localhost:31010"</span>, <span class="dt">use_zk=</span><span class="ot">FALSE</span>)
<span class="co">#&gt; Using [jdbc:drill:drillbit=localhost:31010]...</span>
<span class="kw"><a href="reference/drill_query.html">drill_query</a></span>(con, <span class="st">"SELECT * FROM cp.`employee.json`"</span>)
<span class="co">#&gt; # A tibble: 1,155 x 16</span>
<span class="co">#&gt; employee_id full_name first_name last_name position_id position_title store_id department_id</span>
<span class="co">#&gt; * &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1 1 Sheri Nowmer Sheri Nowmer 1 President 0 1</span>
<span class="co">#&gt; 2 2 Derrick Whelply Derrick Whelply 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 3 4 Michael Spence Michael Spence 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 4 5 Maya Gutierrez Maya Gutierrez 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 5 6 Roberta Damstra Roberta Damstra 3 VP Information Systems 0 2</span>
<span class="co">#&gt; 6 7 Rebecca Kanagaki Rebecca Kanagaki 4 VP Human Resources 0 3</span>
<span class="co">#&gt; 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 11</span>
<span class="co">#&gt; 8 9 Brenda Blumberg Brenda Blumberg 11 Store Manager 21 11</span>
<span class="co">#&gt; 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 5</span>
<span class="co">#&gt; 10 11 Jonathan Murraiin Jonathan Murraiin 11 Store Manager 1 11</span>
<span class="co">#&gt; # ... with 1,145 more rows, and 8 more variables: birth_date &lt;chr&gt;, hire_date &lt;chr&gt;, salary &lt;dbl&gt;, supervisor_id &lt;dbl&gt;,</span>
<span class="co">#&gt; # education_level &lt;chr&gt;, marital_status &lt;chr&gt;, gender &lt;chr&gt;, management_role &lt;chr&gt;</span>
<span class="co"># Standard DBI/RJDBC function calls work on the same connection, too</span>
<span class="kw">dbGetQuery</span>(con, <span class="st">"SELECT * FROM cp.`employee.json`"</span>) <span class="op">%&gt;%</span><span class="st"> </span>
<span class="st"> </span>tibble<span class="op">::</span><span class="kw">as_tibble</span>()
<span class="co">#&gt; # A tibble: 1,155 x 16</span>
<span class="co">#&gt; employee_id full_name first_name last_name position_id position_title store_id department_id</span>
<span class="co">#&gt; * &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span class="co">#&gt; 1 1 Sheri Nowmer Sheri Nowmer 1 President 0 1</span>
<span class="co">#&gt; 2 2 Derrick Whelply Derrick Whelply 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 3 4 Michael Spence Michael Spence 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 4 5 Maya Gutierrez Maya Gutierrez 2 VP Country Manager 0 1</span>
<span class="co">#&gt; 5 6 Roberta Damstra Roberta Damstra 3 VP Information Systems 0 2</span>
<span class="co">#&gt; 6 7 Rebecca Kanagaki Rebecca Kanagaki 4 VP Human Resources 0 3</span>
<span class="co">#&gt; 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 11</span>
<span class="co">#&gt; 8 9 Brenda Blumberg Brenda Blumberg 11 Store Manager 21 11</span>
<span class="co">#&gt; 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 5</span>
<span class="co">#&gt; 10 11 Jonathan Murraiin Jonathan Murraiin 11 Store Manager 1 11</span>
<span class="co">#&gt; # ... with 1,145 more rows, and 8 more variables: birth_date &lt;chr&gt;, hire_date &lt;chr&gt;, salary &lt;dbl&gt;, supervisor_id &lt;dbl&gt;,</span>
<span class="co">#&gt; # education_level &lt;chr&gt;, marital_status &lt;chr&gt;, gender &lt;chr&gt;, management_role &lt;chr&gt;</span></code></pre></div>
</div>
<div id="test-results" class="section level3">
<h3 class="hasAnchor">
<a href="#test-results" class="anchor"></a>Test Results</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(sergeant)
<span class="kw">library</span>(testthat)
<span class="co">#&gt; </span>
<span class="co">#&gt; Attaching package: 'testthat'</span>
<span class="co">#&gt; The following object is masked from 'package:dplyr':</span>
<span class="co">#&gt; </span>
<span class="co">#&gt; matches</span>
<span class="kw">date</span>()
<span class="co">#&gt; [1] "Sat Jun 17 20:47:11 2017"</span>
devtools<span class="op">::</span><span class="kw">test</span>()
<span class="co">#&gt; Loading sergeant</span>
<span class="co">#&gt; Testing sergeant</span>
<span class="co">#&gt; basic functionality: ..</span>
<span class="co">#&gt; </span>
<span class="co">#&gt; DONE ===================================================================================================================</span></code></pre></div>
</div>
<div id="code-of-conduct" class="section level3">
<h3 class="hasAnchor">
<a href="#code-of-conduct" class="anchor"></a>Code of Conduct</h3>
<p>Please note that this project is released with a <a href="CONDUCT.md">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
</div>
</div>
</div>
</div>
<div class="col-md-3 hidden-xs hidden-sm" id="sidebar">
<h2 class="hasAnchor">
<a href="#sidebar" class="anchor"></a>Links</h2>
<ul class="list-unstyled">
<li>Browse source code at <br><a href="http://github.com/hrbrmstr/sergeant">http://​github.com/​hrbrmstr/​sergeant</a>
</li>
<li>Report a bug at <br><a href="https://github.com/hrbrmstr/sergeant/issues">https://​github.com/​hrbrmstr/​sergeant/​issues</a>
</li>
</ul>
<h2>License</h2>
<p><a href="https://opensource.org/licenses/mit-license.php">MIT</a> + file <a href="LICENSE.html">LICENSE</a></p>
<h2>Developers</h2>
<ul class="list-unstyled">
<li>Bob Rudis <br><small class="roles"> Author, maintainer </small> </li>
<li><a href="authors.html">All authors...</a></li>
</ul>
<h2>Dev status</h2>
<ul class="list-unstyled">
<li><a href="https://travis-ci.org/hrbrmstr/sergeant"><img src="https://travis-ci.org/hrbrmstr/sergeant.svg?branch=master" alt="Travis-CI Build Status"></a></li>
</ul>
</div>
</div>
<footer><div class="copyright">
<p>Developed by Bob Rudis.</p>
</div>
<div class="pkgdown">
<p>Site built with <a href="http://hadley.github.io/pkgdown/">pkgdown</a>.</p>
</div>
</footer>
</div>
</body>
</html>