@ -1,35 +1,25 @@
# `metis`
# metis
Helpers for Accessing and Querying Amazon Athena
Including a lightweight RJDBC shim.
In Greek mythology, Metis was Athena’s “helper”.
Access and Query Amazon Athena via DBI/JDBC
## Description
Still fairly beta-quality level but getting there.
The goal will be to get around enough of the “gotchas” that are
preventing raw RJDBC Athena connections from “just working” with `dplyr`
v0.6.0+ and also get around the [`fetchSize`
problem](https://www.reddit.com/r/aws/comments/6aq22b/fetchsize_limit/)
without having to avoid `dbGetQuery()`.
The `AthenaJDBC42_2.0.2.jar` JAR file is included out of convenience but
that will likely move to a separate package as this gets closer to prime
time if this goes on CRAN.
NOTE that the updated driver *REQUIRES JDK 1.8+*.
See the **Usage** section for an example.
In Greek mythology, Metis was Athena’s “helper”, so methods are provided
to help you access and query Amazon Athena via DBI/JDBC and/or
`dplyr`. Methods are provided to connect to ‘Amazon’ ‘Athena’,
look up schemas/tables,
## IMPORTANT
Since R 3.5 (I don't remember this happening in R 3.4.x) signals sent from interrupting Athena JDBC calls crash the R interpreter. You need to set the `-Xrs` option to avoid signals being passed on to the JVM owner. That has to be done _before_ `rJava` is loaded so you either need to remember to put it at the top of all scripts _or_ stick this in your local `~/.Rprofile` and/or sitewide `Rprofile`:
Since R 3.5 (I don’t remember this happening in R 3.4.x) signals sent
from interrupting Athena JDBC calls crash the R interpreter. You need to
set the `-Xrs` option to avoid signals being passed on to the JVM owner.
That has to be done *before* `rJava` is loaded so you either need to
remember to put it at the top of all scripts *or* stick this in your
local `~/.Rprofile` and/or sitewide `Rprofile`:
```r
``` r
if (!grepl("-Xrs", getOption("java.parameters", ""))) {
options(
"java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs")
@ -43,7 +33,7 @@ The following functions are implemented:
Easy-interface connection helper:
- `athena_connect` Make a JDBC connection to Athena
- `athena_connect` Simplified Athena JDBC connection helper
Custom JDBC Classes:
@ -54,13 +44,13 @@ Custom JDBC Classes:
Custom JDBC Class Methods:
- `dbConnect-method` : AthenaJDBC
- `dbExistsTable-method` : AthenaJDBC
- `dbGetQuery-method` : AthenaJDBC
- `dbListFields-method` : AthenaJDBC
- `dbListTables-method` : AthenaJDBC
- `dbReadTable-method` : AthenaJDBC
- `dbSendQuery-method` : AthenaJDBC
- `dbConnect-method`
- `dbExistsTable-method`
- `dbGetQuery-method`
- `dbListFields-method`
- `dbListTables-method`
- `dbReadTable-method`
- `dbSendQuery-method`
Pulled in from other `cloudyr` pkgs:
@ -70,44 +60,53 @@ Pulled in from other `cloudyr` pkgs:
## Installation
``` r
devtools::install_github("hrbrmstr/metis")
devtools::install_git("https://git.sr.ht/~hrbrmstr/metis-lite")
# OR
devtools::install_gitlab("hrbrmstr/metis-lite")
# OR
devtools::install_github("hrbrmstr/metis-lite")
```
## Usage
``` r
library(metis)
library(tidyverse)
library(metis.lite)
# current version
packageVersion("metis")
packageVersion("metis.lite")
```
## [1] '0.3.0'
``` r
use_credentials("default")
athena_connect(
default_schema = "sampledb",
s3_staging_dir = "s3://accessible-bucket",
log_path = "/tmp/athena.log",
log_level = "DEBUG"
) -> ath
dbListTables(ath, schema="sampledb")
library(rJava)
library(RJDBC)
library(metis.lite)
library(magrittr)
library(dbplyr)
library(dplyr)
dbConnect(
drv = metis.lite::Athena(),
schema_name = "sampledb",
provider = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props"),
s3_staging_dir = "s3://aws-athena-query-results-569593279821-us-east-1"
) -> con
dbListTables(con, schema="sampledb")
```
## [1] "elb_logs"
``` r
dbExistsTable(ath, "elb_logs", schema="sampledb")
dbExistsTable(con, "elb_logs", schema="sampledb")
```
## [1] TRUE
``` r
dbListFields(ath, "elb_logs", "sampledb")
dbListFields(con, "elb_logs", "sampledb")
```
## [1] "timestamp" "elbname" "requestip" "requestport"
@ -116,29 +115,109 @@ dbListFields(ath, "elb_logs", "sampledb")
## [13] "sentbytes" "requestverb" "url" "protocol"
``` r
dbGetQuery(ath, "SELECT * FROM sampledb.elb_logs LIMIT 10") %>%
type_convert() %>%
dbGetQuery(con, "SELECT * FROM sampledb.elb_logs LIMIT 10") %>%
glimpse()
```
## Observations: 10
## Variables: 16
## $ timestamp < dttm > 2014-09-30 01:28:17, 2014-09-30 00:01:30, 2014-09-30 00:01:30, 2014-09-30 00:01:30, ...
## $ elbname < chr > "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo...
## $ requestip < chr > "246.140.190.136", "240.109.129.138", "242.251.232.153", "253.227.207.81", "253.227.2...
## $ requestport < dbl > 63777, 22705, 22705, 22705, 23282, 24178, 22916, 23807, 22916, 21443
## $ backendip < chr > "250.193.168.100", "251.103.130.45", "243.140.114.254", "243.82.95.243", "246.129.102...
## $ backendport < dbl > 8888, 8888, 8888, 8888, 8899, 8888, 8888, 8888, 8888, 8888
## $ requestprocessingtime < dbl > 7.2e-05, 6.9e-05, 8.7e-05, 9.7e-05, 8.1e-05, 4.6e-05, 4.3e-05, 5.3e-05, 5.5e-05, 4.4e-05
## $ backendprocessingtime < dbl > 0.379241, 0.007541, 0.187126, 0.413337, 0.037030, 0.050222, 0.043706, 0.045953, 0.015...
## $ clientresponsetime < dbl > 8.0e-05, 4.3e-05, 7.5e-05, 8.7e-05, 4.5e-05, 3.3e-05, 3.3e-05, 6.9e-05, 8.5e-05, 4.9e-05
## $ elbresponsecode < int > 200, 302, 302, 200, 200, 200, 200, 200, 200, 200
## $ backendresponsecode < int > 200, 200, 200, 400, 200, 200, 200, 404, 200, 200
## $ receivedbytes < dbl > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sentbytes < dbl > 58402, 0, 0, 58402, 32370, 20766, 3408, 152213, 84245, 3884
## $ timestamp < chr > "2014-09-29T18:18:51.826955Z", "2014-09-29T18:18:51.920462Z", "2014-09-29T18:18:52.2725…
## $ elbname < chr > "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo",…
## $ requestip < chr > "255.48.150.122", "249.213.227.93", "245.108.120.229", "241.112.203.216", "241.43.107.2…
## $ requestport < int > 62096, 62096, 62096, 62096, 56454, 33254, 18918, 64352, 1651, 56454
## $ backendip < chr > "244.238.214.120", "248.99.214.228", "243.3.190.175", "246.235.181.255", "241.112.203.2…
## $ backendport < int > 8888, 8888, 8888, 8888, 8888, 8888, 8888, 8888, 8888, 8888
## $ requestprocessingtime < dbl > 9.0e-05, 9.7e-05, 8.7e-05, 9.4e-05, 7.6e-05, 8.3e-05, 6.3e-05, 5.4e-05, 8.2e-05, 8.7e-05
## $ backendprocessingtime < dbl > 0.007410, 0.256533, 0.442659, 0.016772, 0.035036, 0.029892, 0.034148, 0.014858, 0.01518…
## $ clientresponsetime < dbl > 0.000055, 0.000075, 0.000131, 0.000078, 0.000057, 0.000043, 0.000033, 0.000043, 0.00007…
## $ elbresponsecode < chr > "302", "302", "200", "200", "200", "200", "200", "200", "200", "200"
## $ backendresponsecode < chr > "200", "200", "200", "200", "200", "200", "200", "200", "200", "200"
## $ receivedbytes < S3: integer64 > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sentbytes < S3: integer64 > 0, 0, 58402, 152213, 20766, 32370, 3408, 3884, 84245, 3831
## $ requestverb < chr > "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET"
## $ url < chr > "http://www.abcxyz.com:80/", "http://www.abcxyz.com:80/", "http://www.abcxyz.com:80/a...
## $ protocol < chr > "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "...
## $ url < chr > "http://www.abcxyz.com:80/", "http://www.abcxyz.com:80/accounts/login/?next=/", "http:/…
## $ protocol < chr > "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HT…
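In the newer output, `timestamp` and the two response-code columns come back as character vectors. If typed columns are wanted, they can be converted after the query returns. Below is a minimal sketch in base R that works on a small hand-made data frame standing in for the query result. The data frame `res` and the helper `convert_elb_types()` are hypothetical names used only for illustration and are not part of the package.

``` r
# Stand-in for a few columns returned by dbGetQuery(); values copied from the output above
res <- data.frame(
  timestamp = c("2014-09-29T18:18:51.826955Z", "2014-09-29T18:18:51.920462Z"),
  elbresponsecode = c("302", "302"),
  backendresponsecode = c("200", "200"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse ISO-8601 UTC timestamps and integer-like status codes
convert_elb_types <- function(df) {
  df$timestamp <- as.POSIXct(df$timestamp, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
  df$elbresponsecode <- as.integer(df$elbresponsecode)
  df$backendresponsecode <- as.integer(df$backendresponsecode)
  df
}

converted <- convert_elb_types(res)
str(converted)
```

The same function could presumably be applied to the real result of `dbGetQuery()`, assuming the column names and timestamp format match what is shown above.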
### Check types
``` r
dbGetQuery(con, "
SELECT
CAST('chr' AS CHAR(4)) achar,
CAST('varchr' AS VARCHAR) avarchr,
CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
CAST(100.1 AS DOUBLE) AS justadbl,
CAST(127 AS TINYINT) AS asmallint,
CAST(100 AS INTEGER) AS justanint,
CAST(100000000000000000 AS BIGINT) AS abigint,
CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
ARRAY[1, 2, 3] AS arr1,
ARRAY['1', '2, 3', '4'] AS arr2,
MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
") %>%
glimpse()
```
## Observations: 1
## Variables: 13
## $ achar < chr > "chr "
## $ avarchr < chr > "varchr"
## $ tsday < date > 2014-09-26
## $ justadbl < dbl > 100.1
## $ asmallint < int > 127
## $ justanint < int > 100
## $ abigint < S3: integer64 > 100000000000000000
## $ is_get < lgl > TRUE
## $ arr1 < chr > "1, 2, 3"
## $ arr2 < chr > "1, 2, 3, 4"
## $ mp < chr > "{bar=2, foo=1}"
## $ rw < chr > "{x=1, y=2.0}"
## $ js < chr > "\"{\\\"a\\\":1}\""
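As the output shows, `ARRAY`, `MAP`, and `ROW` values arrive as flat strings. A rough sketch of unpacking them in base R follows. The helpers `parse_athena_array()` and `parse_athena_map()` are hypothetical names, not package functions, and they assume that no element contains `, ` or `=` itself. The `arr2` case shows why that assumption matters: the query built a three-element array, but the flattened string splits into four pieces.

``` r
# Hypothetical helper: split a flattened ARRAY string on ", "
parse_athena_array <- function(x) strsplit(x, ", ", fixed = TRUE)[[1]]

# Hypothetical helper: turn "{k1=v1, k2=v2}" into a named character vector
parse_athena_map <- function(x) {
  inner <- gsub("^\\{|\\}$", "", x)                   # drop the surrounding braces
  pairs <- strsplit(strsplit(inner, ", ", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys <- vapply(pairs, `[`, character(1), 1)
  values <- vapply(pairs, `[`, character(1), 2)
  stats::setNames(values, keys)
}

parse_athena_array("1, 2, 3")
parse_athena_array("1, 2, 3, 4")    # four pieces; the element '2, 3' cannot be recovered
parse_athena_map("{bar=2, foo=1}")
```

When element boundaries matter, one option is to serialize the value inside the query itself (for example by casting it to JSON, as the `js` column does) and parse it on the R side instead of relying on the flattened form.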
#### dplyr
``` r
tbl(con, sql("
SELECT
CAST('chr' AS CHAR(4)) achar,
CAST('varchr' AS VARCHAR) avarchr,
CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
CAST(100.1 AS DOUBLE) AS justadbl,
CAST(127 AS TINYINT) AS asmallint,
CAST(100 AS INTEGER) AS justanint,
CAST(100000000000000000 AS BIGINT) AS abigint,
CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
ARRAY[1, 2, 3] AS arr,
ARRAY['1', '2, 3', '4'] AS arr,
MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
")) %>%
glimpse()
```
## Observations: ??
## Variables: 13
## Database: AthenaConnection
## $ achar < chr > "chr "
## $ avarchr < chr > "varchr"
## $ tsday < date > 2014-09-27
## $ justadbl < dbl > 100.1
## $ asmallint < int > 127
## $ justanint < int > 100
## $ abigint < S3: integer64 > 100000000000000000
## $ is_get < lgl > TRUE
## $ arr < chr > "1, 2, 3"
## $ arr < chr > "1, 2, 3, 4"
## $ mp < chr > "{bar=2, foo=1}"
## $ rw < chr > "{x=1, y=2.0}"
## $ js < chr > "\"{\\\"a\\\":1}\""
## Code of Conduct