---
output: rmarkdown::github_document
---
```{r include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, collapse=TRUE)
```
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmlunit.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmlunit)
[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/htmlunit/master.svg)](https://codecov.io/github/hrbrmstr/htmlunit?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/htmlunit)](https://cran.r-project.org/package=htmlunit)
# htmlunit
Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library
## Description
`HtmlUnit` (<http://htmlunit.sourceforge.net/>) is _a "'GUI'-Less
browser for 'Java' programs". It models 'HTML' documents and provides an 'API'
that allows one to invoke pages, fill out forms, click links and more just like
one does in a "normal" browser. The library has fairly good and constantly
improving 'JavaScript' support and is able to work even with quite complex 'AJAX'
libraries, simulating 'Chrome', 'Firefox' or 'Internet Explorer' depending on
the configuration used. It is typically used for testing purposes or to retrieve
information from web sites._
Tools are provided to work with this library at a higher level than provided by
the exposed 'Java' libraries in the [`htmlunitjars`](https://gitlab.com/hrbrmstr/htmlunitjars)
package.
## What's Inside The Tin
The following functions are implemented:
### DSL
- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance<br/><br/>
- `wc_go`: Visit a URL<br/>
- `wc_html_nodes`: Select nodes from web client active page html content
- `wc_html_text`: Extract attributes, text and tag name from webclient page html content<br/><br/>
- `wc_html_attr`: Extract attributes, text and tag name from webclient page html content
- `wc_html_name`: Extract attributes, text and tag name from webclient page html content
- `wc_headers`: Return response headers of the last web request for current page
- `wc_browser_info`: Retrieve information about the browser used to create the 'webclient'
- `wc_content_length`: Return content length of the last web request for current page
- `wc_content_type`: Return content type of web request for current page<br/><br/>
- `wc_render`: Retrieve current page contents<br/><br/>
- `wc_css`: Enable/Disable CSS support
- `wc_dnt`: Enable/Disable Do-Not-Track
- `wc_geo`: Enable/Disable Geolocation
- `wc_img_dl`: Enable/Disable Image Downloading
- `wc_load_time`: Return load time of the last web request for current page
- `wc_resize`: Resize the virtual browser window
- `wc_status`: Return status code of web request for current page
- `wc_timeout`: Change default request timeout
- `wc_title`: Return page title for current page
- `wc_url`: Return the URL of the current page
- `wc_use_insecure_ssl`: Enable/Disable Ignoring SSL Validation Issues
- `wc_wait`: Block HtmlUnit final rendering until all background JavaScript tasks have finished executing
### Just the Content (pls)
- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context
### Content++
- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL
## Installation
```{r eval=FALSE}
install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source")
```
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg
# current version
packageVersion("htmlunit")
```
Something `xml2::read_html()` cannot do is read the JavaScript-rendered table from <https://hrbrmstr.github.io/htmlunitjars/index.html>:
![](man/figures/test-url-table.png)
```{r}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
pg <- xml2::read_html(test_url)
html_table(pg)
```
☹️
But, `hu_read_html()` can!
```{r}
pg <- hu_read_html(test_url)
html_table(pg)
```
All without needing a separate Selenium or Splash server instance.
### Content++
We can also get a HAR-like content + metadata dump:
```{r}
(xdf <- wc_inspect("https://rud.is/b"))
group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time)/1000
  )
```
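To see what the `group_by()`/`summarise()` step above computes without a live request, here is a minimal base R sketch on made-up data. The data frame `xdf_mock` is a hypothetical stand-in for `wc_inspect()` output: only the column names `content_type`, `content_length` and `load_time` come from the code above, the values are invented, and treating `load_time` as milliseconds is an assumption based on the division by 1000.

```r
# Hypothetical stand-in for wc_inspect() output; all values are made up.
xdf_mock <- data.frame(
  content_type     = c("text/html", "text/css", "text/css", "image/png"),
  content_length   = c(12000, 3000, 1500, 20480),
  load_time        = c(350, 120, 80, 450), # assumed to be milliseconds
  stringsAsFactors = FALSE
)

# Base R counterpart of group_by() + summarise():
# sum size and load time within each content type.
totals <- aggregate(
  cbind(total_size = content_length, total_load_time = load_time) ~ content_type,
  data = xdf_mock,
  FUN = sum
)

# Convert the summed load time from (assumed) milliseconds to seconds.
totals$total_load_time <- totals$total_load_time / 1000

totals
```

The result has one row per content type, which is the same shape the dplyr pipeline produces.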
### DSL
```{r}
wc <- web_client()
wc %>% wc_browser_info()
wc <- web_client()
wc %>% wc_go("https://usa.gov/")
# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()
wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
```
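The comment in the chunk above notes that the result of `wc_html_nodes()` has to go through `as.list()` before `purrr::map_` functions can iterate over it. A minimal sketch of that pattern follows. `link_hrefs()` is a hypothetical helper name, not part of the package, and it assumes the web client has already visited a page with `wc_go()`.

```r
# Hypothetical helper (not part of htmlunit): collect the href attribute of
# every <a> node on the page the web client is currently showing.
link_hrefs <- function(wc) {
  nodes <- htmlunit::wc_html_nodes(wc, "a")

  # as.list() makes the node collection iterable by purrr, per the note above.
  purrr::map(as.list(nodes), htmlunit::wc_html_attr, "href")
}
```

Calling `link_hrefs(wc)` after `wc %>% wc_go("https://usa.gov/")` should give a list with one element per link. `purrr::map()` is used rather than a typed variant such as `map_chr()` because the return value for nodes without an `href` is not documented here.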
Handy function to get rendered plain text for text mining:
```{r}
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```
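The accessor functions listed under "DSL" can be combined in the same way. The sketch below wraps a few of them in `page_summary()`, a hypothetical helper that is not part of the package. It assumes each accessor takes the web client as its only required argument, as `wc_browser_info()` does in the example above; check the package help pages before relying on that.

```r
# Hypothetical helper (not part of htmlunit): gather basic facts about the
# page currently loaded in a web client into a named list.
# Assumption: each accessor takes the web client as its only required argument.
page_summary <- function(wc) {
  list(
    url          = htmlunit::wc_url(wc),
    title        = htmlunit::wc_title(wc),
    status       = htmlunit::wc_status(wc),
    content_type = htmlunit::wc_content_type(wc),
    load_time    = htmlunit::wc_load_time(wc)
  )
}
```

After `wc %>% wc_go("https://usa.gov/")`, `str(page_summary(wc))` would print the collected values in one place.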
### htmlunit Metrics
```{r echo=FALSE}
cloc::cloc_pkg_md()
```
## Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
By participating in this project you agree to abide by its terms.