htmlunit/README.Rmd

---
output: rmarkdown::github_document
---
```{r include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, collapse=TRUE)
```

[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmlunit.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmlunit)
[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/htmlunit/master.svg)](https://codecov.io/github/hrbrmstr/htmlunit?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/htmlunit)](https://cran.r-project.org/package=htmlunit)

# htmlunit

Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library

## Description

`HtmlUnit` (<http://htmlunit.sourceforge.net/>) is _a "'GUI'-Less 
browser for 'Java' programs". It models 'HTML' documents and provides an 'API' 
that allows one to invoke pages, fill out forms, click links and more just like 
one does in a "normal" browser. The library has fairly good and constantly
improving 'JavaScript' support and is able to work even with quite complex 'AJAX' 
libraries, simulating 'Chrome', 'Firefox' or 'Internet Explorer' depending on
the configuration used. It is typically used for testing purposes or to retrieve 
information from web sites._

Tools are provided to work with this library at a higher level than provided by
the exposed 'Java' libraries in the [`htmlunitjars`](https://gitlab.com/hrbrmstr/htmlunitjars)
package.

## What's Inside The Tin

The following functions are implemented:

### DSL

- `web_client`/`webclient`:	Create a new HtmlUnit WebClient instance<br/><br/>

- `wc_go`:	Visit a URL<br/>

- `wc_html_nodes`:	Select nodes from web client active page html content
- `wc_html_text`:	Extract attributes, text and tag name from webclient page html content<br/><br/>
- `wc_html_attr`:	Extract attributes, text and tag name from webclient page html content
- `wc_html_name`:	Extract attributes, text and tag name from webclient page html content

- `wc_headers`:	Return response headers of the last web request for current page
- `wc_browser_info`:	Retreive information about the browser used to create the 'webclient'
- `wc_content_length`:	Return content length of the last web request for current page
- `wc_content_type`:	Return content type of web request for current page<br/><br/>

- `wc_render`:	Retrieve current page contents<br/><br/>

- `wc_css`:	Enable/Disable CSS support
- `wc_dnt`:	Enable/Disable Do-Not-Track
- `wc_geo`:	Enable/Disable Geolocation
- `wc_img_dl`:	Enable/Disable Image Downloading
- `wc_load_time`:	Return load time of the last web request for current page
- `wc_resize`:	Resize the virtual browser window
- `wc_status`:	Return status code of web request for current page
- `wc_timeout`:	Change default request timeout
- `wc_title`:	Return page title for current page
- `wc_url`:	Return load time of the last web request for current page
- `wc_use_insecure_ssl`:	Enable/Disable Ignoring SSL Validation Issues
- `wc_wait`:	Block HtlUnit final rendering blocks until all background JavaScript tasks have finished executing

### Just the Content (pls)

- `hu_read_html`:	Read HTML from a URL with Browser Emulation & in a JavaScript Context

### Content++

- `wc_inspect`:  Perform a "Developer Tools"-like Network Inspection of a URL

## Installation

```{r eval=FALSE}
install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source")
```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current verison
packageVersion("htmlunit")

```

Something `xml2::read_html()` cannot do, read the table from <https://hrbrmstr.github.io/htmlunitjars/index.html>:

![](man/figures/test-url-table.png)

```{r}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
```

☹️

But, `hu_read_html()` can!

```{r}
pg <- hu_read_html(test_url)

html_table(pg)
```

All without needing a separate Selenium or Splash server instance.

### Content++

We can also get a HAR-like content + metadata dump:

```{r}
(xdf <- wc_inspect("https://rud.is/b"))

group_by(xdf, content_type) %>% 
  summarise(
    total_size = sum(content_length), 
    total_load_time = sum(load_time)/1000
  )
```

### DSL

```{r}
wc <- web_client()

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>% 
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>% 
  head(10)
```

Handy function to get rendered plain text for text mining:

```{r}
wc %>% 
  wc_render("text") %>% 
  substr(1, 300) %>% 
  cat()
```

### htmlunit Metrics

```{r echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). 
By participating in this project you agree to abide by its terms.
R package repo initialization complete 5 years ago			`---`
			`output: rmarkdown::github_document`
			`---`
initial commit 5 years ago			```{r include=FALSE}
workin on the DSL 5 years ago			`knitr::opts_chunk$set(message=FALSE, warning=FALSE, collapse=TRUE)`
initial commit 5 years ago			```
cloc; conduct 5 years ago
			`[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmlunit.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmlunit)`
			`[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/htmlunit/master.svg)](https://codecov.io/github/hrbrmstr/htmlunit?branch=master)`
			`[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/htmlunit)](https://cran.r-project.org/package=htmlunit)`

R package repo initialization complete 5 years ago			`# htmlunit`

initial commit 5 years ago			`Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library`

R package repo initialization complete 5 years ago			`## Description`

initial commit 5 years ago			`HtmlUnit` (<http://htmlunit.sourceforge.net/>) is _a "'GUI'-Less
			`browser for 'Java' programs". It models 'HTML' documents and provides an 'API'`
			`that allows one to invoke pages, fill out forms, click links and more just like`
			`one does in a "normal" browser. The library has fairly good and constantly`
			`improving 'JavaScript' support and is able to work even with quite complex 'AJAX'`
			`libraries, simulating 'Chrome', 'Firefox' or 'Internet Explorer' depending on`
			`the configuration used. It is typically used for testing purposes or to retrieve`
			`information from web sites._`

			`Tools are provided to work with this library at a higher level than provided by`
			the exposed 'Java' libraries in the [`htmlunitjars`](https://gitlab.com/hrbrmstr/htmlunitjars)
			`package.`

R package repo initialization complete 5 years ago			`## What's Inside The Tin`

			`The following functions are implemented:`

more DSL 5 years ago			`### DSL`
workin on the DSL 5 years ago
more DSL 5 years ago			- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance<br/><br/>
initial commit 5 years ago
more DSL 5 years ago			- `wc_go`: Visit a URL<br/>
workin on the DSL 5 years ago
more DSL 5 years ago			- `wc_html_nodes`: Select nodes from web client active page html content
			- `wc_html_text`: Extract attributes, text and tag name from webclient page html content<br/><br/>
			- `wc_html_attr`: Extract attributes, text and tag name from webclient page html content
			- `wc_html_name`: Extract attributes, text and tag name from webclient page html content
workin on the DSL 5 years ago
more DSL 5 years ago			- `wc_headers`: Return response headers of the last web request for current page
workin on the DSL 5 years ago			- `wc_browser_info`: Retreive information about the browser used to create the 'webclient'
workin on the DSL 5 years ago			- `wc_content_length`: Return content length of the last web request for current page
more DSL 5 years ago			- `wc_content_type`: Return content type of web request for current page<br/><br/>

			- `wc_render`: Retrieve current page contents<br/><br/>

workin on the DSL 5 years ago			- `wc_css`: Enable/Disable CSS support
			- `wc_dnt`: Enable/Disable Do-Not-Track
			- `wc_geo`: Enable/Disable Geolocation
			- `wc_img_dl`: Enable/Disable Image Downloading
			- `wc_load_time`: Return load time of the last web request for current page
			- `wc_resize`: Resize the virtual browser window
			- `wc_status`: Return status code of web request for current page
			- `wc_timeout`: Change default request timeout
			- `wc_title`: Return page title for current page
			- `wc_url`: Return load time of the last web request for current page
			- `wc_use_insecure_ssl`: Enable/Disable Ignoring SSL Validation Issues
			- `wc_wait`: Block HtlUnit final rendering blocks until all background JavaScript tasks have finished executing
workin on the DSL 5 years ago
more DSL 5 years ago			`### Just the Content (pls)`

			- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context

added wc_inspect() for dev tools-ish HAR retrieval 5 years ago			`### Content++`

			- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL

R package repo initialization complete 5 years ago			`## Installation`

			```{r eval=FALSE}
added wc_inspect() for dev tools-ish HAR retrieval 5 years ago			`install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source")`
R package repo initialization complete 5 years ago			```

			```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
			`options(width=120)`
			```

			`## Usage`

			```{r message=FALSE, warning=FALSE, error=FALSE}
			`library(htmlunit)`
added wc_inspect() for dev tools-ish HAR retrieval 5 years ago			`library(tidyverse) # for some data ops; not req'd for pkg`
R package repo initialization complete 5 years ago
			`# current verison`
			`packageVersion("htmlunit")`

			```

initial commit 5 years ago			Something `xml2::read_html()` cannot do, read the table from <https://hrbrmstr.github.io/htmlunitjars/index.html>:

			`![](man/figures/test-url-table.png)`

			```{r}
			`test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"`

			`pg <- xml2::read_html(test_url)`

			`html_table(pg)`
			```

			`☹️`

			But, `hu_read_html()` can!

			```{r}
			`pg <- hu_read_html(test_url)`

			`html_table(pg)`
			```

workin on the DSL 5 years ago			`All without needing a separate Selenium or Splash server instance.`

added wc_inspect() for dev tools-ish HAR retrieval 5 years ago			`### Content++`

			`We can also get a HAR-like content + metadata dump:`

			```{r}
			`(xdf <- wc_inspect("https://rud.is/b"))`

			`group_by(xdf, content_type) %>%`
			`summarise(`
			`total_size = sum(content_length),`
			`total_load_time = sum(load_time)/1000`
			`)`
			```

workin on the DSL 5 years ago			`### DSL`

			```{r}
			`wc <- web_client()`

			`wc %>% wc_browser_info()`

more DSL 5 years ago			`wc <- web_client()`

			`wc %>% wc_go("https://usa.gov/")`

			`# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()`
workin on the DSL 5 years ago
more DSL 5 years ago			`wc %>%`
			`wc_html_nodes("a") %>%`
			`sapply(wc_html_text, trim = TRUE) %>%`
			`head(10)`
workin on the DSL 5 years ago
more DSL 5 years ago			`wc %>%`
			`wc_html_nodes(xpath=".//a") %>%`
			`sapply(wc_html_text, trim = TRUE) %>%`
			`head(10)`
workin on the DSL 5 years ago
more DSL 5 years ago			`wc %>%`
			`wc_html_nodes(xpath=".//a") %>%`
			`sapply(wc_html_attr, "href") %>%`
			`head(10)`
			```

			`Handy function to get rendered plain text for text mining:`

			```{r}
			`wc %>%`
			`wc_render("text") %>%`
			`substr(1, 300) %>%`
			`cat()`
cloc; conduct 5 years ago			```

			`### htmlunit Metrics`

			```{r echo=FALSE}
			`cloc::cloc_pkg_md()`
			```

			`## Code of Conduct`

			`Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).`
			`By participating in this project you agree to abide by its terms.`