htmlunit

Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library

Description

HtmlUnit (http://htmlunit.sourceforge.net/) is a “‘GUI’-Less browser for ‘Java’ programs”. It models ‘HTML’ documents and provides an ‘API’ that allows one to invoke pages, fill out forms, click links and more just like one does in a “normal” browser. The library has fairly good and constantly improving ‘JavaScript’ support and is able to work even with quite complex ‘AJAX’ libraries, simulating ‘Chrome’, ‘Firefox’ or ‘Internet Explorer’ depending on the configuration used. It is typically used for testing purposes or to retrieve information from web sites.

Tools are provided to work with this library at a higher level than provided by the exposed ‘Java’ libraries in the htmlunitjars package.

What’s Inside The Tin

The following functions are implemented:

hu_read_html: Read HTML from a URL with Browser Emua;tion & in a JavaScript Context

Installation

devtools::install_github("hrbrmstr/htmlunitjars")
devtools::install_github("hrbrmstr/htmlunit")

Usage

library(htmlunit)

# current verison
packageVersion("htmlunit")

## [1] '0.1.0'

Something xml2::read_html() cannot do, read the table from https://hrbrmstr.github.io/htmlunitjars/index.html:

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)

## list()

☹️

But, hu_read_html() can!

pg <- hu_read_html(test_url)

html_table(pg)

## [[1]]
##      X1   X2
## 1   One  Two
## 2 Three Four
## 3  Five  Six

All without needing a separate Selenium or Splash server instance.

1.8 KiB Raw Blame History

htmlunit

Description

What’s Inside The Tin

Installation

Usage

1.8 KiB

Raw Blame History