---
output: rmarkdown::github_document
---

# decapitated

Headless 'Chrome' Orchestration

## Description

The 'Chrome' browser <https://www.google.com/chrome/> has a headless mode
which can be instrumented programmatically. Tools are provided to perform headless
'Chrome' instrumentation on the command-line, including retrieving the JavaScript-executed web page, PDF output, or screenshot of a URL.

## IMPORTANT

You'll need to set the `HEADLESS_CHROME` environment variable to use this package.

If this value is not set, a location heuristic is used on package start which looks
for the following depending on the operating system:

- Windows(32bit): `C:/Program Files/Google/Chrome/Application/chrome.exe`
- Windows(64bit): `C:/Program Files (x86)/Google/Chrome/Application/chrome.exe`
- macOS: `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome`
- Linux: `/usr/bin/google-chrome`

If a verification test fails, you will be notified. 

**It is HIGHLY recommended** that you run `decapitated::download_chromium()` to get
a standalone version of Chromium for your platform and use it with this package.

It's best to use `~/.Renviron` to store this value.
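
The lookup order described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's actual code: `find_chrome_sketch()` is a hypothetical helper, and picking the first default path that exists on disk is an assumption.

```{r eval=FALSE}
# Hypothetical helper: NOT part of decapitated, it only mirrors the description above.
find_chrome_sketch <- function() {
  # 1. An explicit HEADLESS_CHROME value always wins.
  env_path <- Sys.getenv("HEADLESS_CHROME", unset = "")
  if (nzchar(env_path)) return(env_path)

  # 2. Otherwise fall back to the per-OS default locations listed above
  #    (the macOS path is written without shell escapes, as R expects).
  candidates <- switch(
    Sys.info()[["sysname"]],
    Windows = c(
      "C:/Program Files/Google/Chrome/Application/chrome.exe",
      "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
    ),
    Darwin = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    Linux = "/usr/bin/google-chrome",
    character(0)
  )

  # Assumption: take the first candidate that exists, or NA if none do.
  found <- candidates[file.exists(candidates)]
  if (length(found) > 0) found[[1]] else NA_character_
}

find_chrome_sketch()
```

To persist a value across sessions, `~/.Renviron` takes one `NAME=value` pair per line, for example `HEADLESS_CHROME=/usr/bin/google-chrome`; restart R for it to take effect.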

## Working around headless Chrome & OS security restrictions:

Security restrictions on various operating systems and OS configurations can cause
headless Chrome execution to fail. As a result, headless Chrome operations should
use a special directory for `decapitated` package operations. You can pass this
in as `work_dir`. If `work_dir` is `NULL`, a `.rdecapdata` directory will be
created in your home directory and used for the data, crash dump and utility
directories for Chrome operations.

Testing on various macOS 10.13 systems showed that `tempdir()` does not always
meet these requirements, as Chrome does some interesting attribute setting for
some of its file operations.

If you pass in a `work_dir`, it must be one that does not violate OS security
restrictions or headless Chrome will not function.
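
A minimal sketch of passing an explicit working directory, assuming `chrome_read_html()` accepts `work_dir` as described above. The directory name `~/decap-work` is an arbitrary example, and the package call is guarded so the snippet does nothing when `decapitated` is not installed.

```{r eval=FALSE}
# Arbitrary example location under the home directory (an assumption, pick your own).
work_dir <- path.expand("~/decap-work")

# Create it up front so any permission problem surfaces before Chrome is launched.
dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
stopifnot(dir.exists(work_dir))

# Only call the package if it is available in this R installation.
if (requireNamespace("decapitated", quietly = TRUE)) {
  page <- decapitated::chrome_read_html("http://httpbin.org/", work_dir = work_dir)
}
```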

## Helping it "always work"

The three core functions have a `prime` parameter. In testing (again, especially on macOS),
I noticed that the first one or two requests to a URL often resulted in an empty `<body>`
response. I don't use Chrome as my primary browser anymore so I'm not sure if that has something
to do with it, but requests after the first one or two do return content. The `prime`
parameter lets you specify `TRUE`, `FALSE` or a numeric value that will issue the
URL retrieval multiple times before returning a result (or generating a PDF or PNG).
This is a workaround until there is more granular control over the command-line
execution of headless Chrome.
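
The idea behind `prime` can be sketched as a small wrapper that sends throwaway requests first. This is an illustration only, not the package's implementation: `primed_fetch()` and `fake_fetch()` are hypothetical, and treating `TRUE` as a single warm-up request is an assumption.

```{r eval=FALSE}
# Hypothetical wrapper: issue `prime` throwaway requests, then return the real one.
primed_fetch <- function(fetch, url, prime = TRUE) {
  # Assumption: TRUE means one warm-up request, FALSE means none.
  n_warmups <- if (is.logical(prime)) as.integer(isTRUE(prime)) else as.integer(prime)
  for (i in seq_len(n_warmups)) fetch(url)  # results are discarded on purpose
  fetch(url)
}

# Stand-in for a headless Chrome call: empty <body> on the first two requests.
calls <- 0
fake_fetch <- function(url) {
  calls <<- calls + 1
  if (calls <= 2) "<body></body>" else "<body>content</body>"
}

# Two warm-ups absorb the empty responses; the third request carries content.
primed_fetch(fake_fetch, "http://httpbin.org/", prime = 2)
```

With `prime = FALSE` the same stand-in would return the empty `<body>`, which is the failure mode described above.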

## What's in the tin?

The following functions are implemented:

### CLI-based ops

- `download_chromium`:  Download a standalone version of Chromium (recommended)
- `chrome_dump_pdf`:	"Print" to PDF
- `chrome_read_html`:	Read a URL via headless Chrome and return the raw or rendered '<body>' 'innerHTML' DOM elements
- `chrome_shot`:	Capture a screenshot
- `chrome_version`:	Get Chrome version
- `get_chrome_env`:	Get the 'HEADLESS_CHROME' environment variable
- `set_chrome_env`:	Set the 'HEADLESS_CHROME' environment variable

### `gepetto`-based ops

Helpers to get gepetto installed:

- `install_gepetto`:	Install gepetto
- `start_gepetto`:	Start/stop gepetto
- `stop_gepetto`:	Start/stop gepetto

API interface functions:

- `gepetto`:	Create a connection to a Gepetto API server
- `gep_active`:	Test whether the gepetto server is active
- `gep_debug`:	Get "debug-level" information of a running gepetto server
- `gep_render_har`:	Render a page in a JavaScript context and serialize to HAR
- `gep_render_html`:	Render a page in a JavaScript context and serialize to HTML
- `gep_render_magick`:	Render a page in a JavaScript context and take a screenshot
- `gep_render_pdf`:	Render a page in a JavaScript context and render to PDF

More information on `gepetto` is forthcoming but you can take a sneak peek [here](https://gitlab.com/hrbrmstr/gepetto).

## Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/decapitated")
```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(decapitated)

# current version
packageVersion("decapitated")

chrome_version()

chrome_read_html("http://httpbin.org/")
```

```{r eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
chrome_dump_pdf("http://httpbin.org/")
```

```{r message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}
chrome_shot("http://httpbin.org/")

##   format width height colorspace filesize
## 1    PNG  1600   1200       sRGB   215680
```

![screenshot.png](screenshot.png)