---
output: rmarkdown::github_document
---
# decapitated
Headless 'Chrome' Orchestration
## Description
The 'Chrome' browser <https://www.google.com/chrome/> has a headless mode
which can be instrumented programmatically. Tools are provided to perform headless
'Chrome' instrumentation on the command line, including retrieving the JavaScript-executed web page, PDF output or screenshot of a URL.
### IMPORTANT
You'll need to set the environment variable `HEADLESS_CHROME` to one of the following values, depending on your platform:

- Windows (32-bit): `C:/Program Files/Google/Chrome/Application/chrome.exe`
- Windows (64-bit): `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`
- macOS: `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome`
- Linux: `/usr/bin/google-chrome`

If `HEADLESS_CHROME` is not set, a guess is made at the location (the guess is not yet verified).

It's best to use `~/.Renviron` to store this value.
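
As a minimal sketch, the variable can also be set for the current session with base R. The path below is the Linux value from the list above and is only an assumption about your system; substitute the path that matches your installation. The package lists `set_chrome_env` and `get_chrome_env` for this purpose, but their arguments are not documented here, so the sketch sticks to base R.

```r
# Assumed path: the Linux default from the list above; substitute your own.
chrome_path <- "/usr/bin/google-chrome"

# Give an early hint if nothing exists at that path on this machine.
if (!file.exists(chrome_path)) {
  message("No file at ", chrome_path, "; adjust chrome_path for your system.")
}

# Set the variable for the current R session only.
Sys.setenv(HEADLESS_CHROME = chrome_path)

# Confirm the value that later calls in this session will see.
Sys.getenv("HEADLESS_CHROME")
```

To make the setting persistent, add a line such as `HEADLESS_CHROME=/usr/bin/google-chrome` to `~/.Renviron` and restart R.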
## Working around headless Chrome & OS security restrictions:
Security restrictions on various operating systems and OS configurations can cause
headless Chrome execution to fail. As a result, headless Chrome operations should
use a special directory for `decapitated` package operations. You can pass this
in as `work_dir`. If `work_dir` is `NULL` a `.rdecapdata` directory will be
created in your home directory and used for the data, crash dumps and utility
directories for Chrome operations.

`tempdir()` does not always meet these requirements (after testing on various
macOS 10.13 systems) as Chrome does some interesting attribute setting for
some of its file operations.

If you pass in a `work_dir`, it must be one that does not violate OS security
restrictions or headless Chrome will not function.
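
The following is a rough sketch of preparing a custom `work_dir` before handing it to one of the core functions. The directory name `decapitated-work` is hypothetical, and a writable directory is only a necessary condition: it does not guarantee that the OS security restrictions described above are satisfied.

```r
# Hypothetical location for Chrome's data, crash dumps and utility directories.
work_dir <- file.path(path.expand("~"), "decapitated-work")

# Create the directory if needed; showWarnings = FALSE keeps reruns quiet.
dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)

# Check that it exists and is writable (mode = 2 tests write permission).
usable <- dir.exists(work_dir) && file.access(work_dir, mode = 2) == 0

# Only call the package if it is installed and the directory looks usable.
if (usable && requireNamespace("decapitated", quietly = TRUE)) {
  page <- decapitated::chrome_read_html("http://httpbin.org/", work_dir = work_dir)
}
```

Leaving `work_dir` as `NULL` avoids this setup entirely, since the package then falls back to `.rdecapdata` in your home directory.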
## Helping it "always work"
The three core functions have a `prime` parameter. In testing (again, especially on macOS),
I noticed that the first one or two requests to a URL often resulted in an empty `<body>`
response. I don't use Chrome as my primary browser anymore so I'm not sure if that has something
to do with it, but requests after the first one or two do return content. The `prime`
parameter lets you specify `TRUE`, `FALSE` or a numeric value that will issue the
URL retrieval multiple times before returning a result (or generating a PDF or PNG).
This workaround will be needed until there is more granular control over the
command-line execution of headless Chrome.
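
To illustrate the idea behind priming, here is a small self-contained sketch. It is not the package's implementation: `with_priming` and `fake_fetch` are hypothetical names, and the mapping of `TRUE` to exactly one warm-up request is an assumption made for the example.

```r
# Hypothetical helper, not part of decapitated: call `fetch` a number of
# times as warm-up, discard those results, then return a final call's result.
with_priming <- function(fetch, prime = TRUE) {
  # Assumption: TRUE means one warm-up request, FALSE means none,
  # and a numeric value gives the number of warm-up requests.
  warmups <- if (is.logical(prime)) as.integer(isTRUE(prime)) else as.integer(prime)
  for (i in seq_len(warmups)) fetch()
  fetch()
}

# Stand-in fetcher that returns an empty body for its first two calls,
# mimicking the behaviour described above.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls <= 2) "<body></body>" else "<body>content</body>"
}

# Two warm-up requests, then the request whose result is kept.
with_priming(fake_fetch, prime = 2)
```

With the package itself, the corresponding call would pass the parameter directly, for example `chrome_read_html("http://httpbin.org/", prime = 2)`.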
## What's in the tin?
The following functions are implemented:

- `chrome_dump_pdf`: "Print" to PDF
- `chrome_read_html`: Read a URL via headless Chrome and return the raw or rendered `<body>` `innerHTML` DOM elements
- `chrome_shot`: Capture a screenshot
- `chrome_version`: Get Chrome version
- `get_chrome_env`: Get the `HEADLESS_CHROME` environment variable
- `set_chrome_env`: Set the `HEADLESS_CHROME` environment variable
## Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/decapitated")
```
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(decapitated)
# current version
packageVersion("decapitated")
chrome_version()
chrome_read_html("http://httpbin.org/")
```
```{r eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
chrome_dump_pdf("http://httpbin.org/")
```
```{r message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}
chrome_shot("http://httpbin.org/")
##   format width height colorspace filesize
## 1    PNG  1600   1200       sRGB   215680
```
![screenshot.png](screenshot.png)