# decapitated
Headless ‘Chrome’ Orchestration
## Description
The ‘Chrome’ browser <https://www.google.com/chrome/> has a headless
mode which can be instrumented programmatically. Tools are provided to
perform headless ‘Chrome’ instrumentation on the command-line, including
retrieving the JavaScript-rendered web page, PDF output or a screenshot
of a URL.
### IMPORTANT
You’ll need to set an environment variable `HEADLESS_CHROME` to one of
these values, depending on your platform:

- Windows(32bit): `C:/Program Files/Google/Chrome/Application/chrome.exe`
- Windows(64bit): `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`
- macOS: `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome`
- Linux: `/usr/bin/google-chrome`

A guess is made (but not verified yet) if `HEADLESS_CHROME` is
not set.

It’s best to use `~/.Renviron` to store this value.
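
As a rough illustration of the setup above, the sketch below resolves the Chrome binary using only base R. The helper `resolve_chrome_path()` is hypothetical and not part of `decapitated`; its fallback locations are the paths listed above, and the guessing logic is an assumption about a reasonable default rather than what the package actually does.

``` r
# Hypothetical helper, not part of decapitated: locate the Chrome binary.
resolve_chrome_path <- function() {
  # Prefer the environment variable if it has been set.
  path <- Sys.getenv("HEADLESS_CHROME", unset = "")
  if (nzchar(path)) return(path)

  # Otherwise fall back to the default locations listed in this README.
  # (No backslash escapes are needed for spaces inside an R string.)
  candidates <- switch(
    Sys.info()[["sysname"]],
    Windows = c(
      "C:/Program Files/Google/Chrome/Application/chrome.exe",
      "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
    ),
    Darwin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    Linux = "/usr/bin/google-chrome",
    character(0)
  )
  found <- candidates[file.exists(candidates)]
  if (length(found) > 0) found[[1]] else NA_character_
}

chrome <- resolve_chrome_path()
if (is.na(chrome)) {
  message("Chrome not found; set HEADLESS_CHROME in ~/.Renviron")
} else {
  # Applies to the current R session only.
  Sys.setenv(HEADLESS_CHROME = chrome)
}
```

To make the value persistent, a single `NAME=value` line in `~/.Renviron` is enough, for example `HEADLESS_CHROME=/usr/bin/google-chrome` on Linux.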
## Working around headless Chrome & OS security restrictions:
Security restrictions on various operating systems and OS configurations
can cause headless Chrome execution to fail. As a result, headless
Chrome operations should use a special directory for `decapitated`
package operations. You can pass this in as `work_dir`. If `work_dir` is
`NULL` a `.rdecapdata` directory will be created in your home directory
and used for the data, crash dumps and utility directories for Chrome
operations.

`tempdir()` does not always meet these requirements (after testing on
various macOS 10.13 systems) as Chrome does some interesting attribute
setting for some of its file operations.

If you pass in a `work_dir`, it must be one that does not violate OS
security restrictions or headless Chrome will not function.
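
A minimal sketch of supplying your own working directory might look like the following. The helper `ensure_work_dir()` and the `~/chrome-work` location are hypothetical names chosen for illustration; only the `work_dir` argument itself comes from the description above. Whether a given directory satisfies your OS security restrictions still has to be checked on your own system.

``` r
# Hypothetical helper, not part of decapitated: create (if needed) and
# return a dedicated working directory for headless Chrome.
ensure_work_dir <- function(path = file.path(path.expand("~"), "chrome-work")) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  if (!dir.exists(path)) stop("could not create work_dir: ", path)
  normalizePath(path)
}

# Only runs when the package is installed and Chrome has been configured.
if (requireNamespace("decapitated", quietly = TRUE) &&
    nzchar(Sys.getenv("HEADLESS_CHROME"))) {
  work_dir <- ensure_work_dir()
  page <- decapitated::chrome_read_html("http://httpbin.org/", work_dir = work_dir)
}
```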
## Helping it “always work”
The three core functions have a `prime` parameter. In testing (again,
especially on macOS), I noticed that the first one or two requests to a
URL often resulted in an empty `<body>` response. I don’t use Chrome as
my primary browser anymore so I’m not sure if that has something to do
with it, but requests after the first one or two do return content. The
`prime` parameter lets you specify `TRUE`, `FALSE` or a numeric value
that will issue the URL retrieval multiple times before returning a
result (or generating a PDF or PNG). This is a workaround until there is
more granular control over the command-line execution of headless Chrome.
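
The idea behind priming can be sketched in a few lines of base R. This is only an illustration of the pattern, not the package's implementation: `fetch_with_priming()` and `fake_fetch()` are hypothetical, and the mapping of `TRUE` to exactly one warm-up request is an assumption.

``` r
# Hypothetical illustration of "priming": call a fetch function a few
# times, discard the warm-up results, and return only the final one.
fetch_with_priming <- function(fetch, prime = TRUE) {
  # Assumption: TRUE means one warm-up request, FALSE means none,
  # and a number means that many warm-up requests.
  n_warmups <- max(0L, as.integer(prime))
  for (i in seq_len(n_warmups)) {
    try(fetch(), silent = TRUE)  # warm-up result intentionally discarded
  }
  fetch()
}

# Stand-in fetcher that returns an empty body on its first two calls,
# mimicking the behaviour described above.
calls <- 0L
fake_fetch <- function() {
  calls <<- calls + 1L
  if (calls <= 2L) "<body></body>" else "<body>content</body>"
}

result <- fetch_with_priming(fake_fetch, prime = 2)
```

With the package itself, the equivalent request would presumably be something like `chrome_read_html("http://httpbin.org/", prime = 2)`, based on the parameter description above.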
## What’s in the tin?
The following functions are implemented:
- `chrome_dump_pdf`: “Print” to PDF
- `chrome_read_html`: Read a URL via headless Chrome and return the
  raw or rendered `<body>` `innerHTML` DOM elements
- `chrome_shot`: Capture a screenshot
- `chrome_version`: Get Chrome version
- `get_chrome_env`: get the environment variable `HEADLESS_CHROME`
- `set_chrome_env`: set the environment variable `HEADLESS_CHROME`
## Installation
``` r
devtools::install_github("hrbrmstr/decapitated")
```
## Usage
``` r
library(decapitated)
# current version
packageVersion("decapitated")
```

    ## [1] '0.2.0'

``` r
chrome_version()
chrome_read_html("http://httpbin.org/")
```

    ## {xml_document}
    ## <html>
    ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="content-type" valu ...
    ## [2] <body id="manpage">\n<a href="http://github.com/kennethreitz/httpbin"><img style="position: absolute; top: 0; rig ...

``` r
chrome_dump_pdf("http://httpbin.org/")
```
``` r
chrome_shot("http://httpbin.org/")
##   format width height colorspace filesize
## 1    PNG  1600   1200       sRGB   215680
```
![screenshot.png](screenshot.png)