Java Archive Wrapper Supporting the ‘htmlunit’ Package
[![Project Status: Active – The project has reached a stable, usable
state and is being actively
developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Signed
by](https://img.shields.io/badge/Keybase-Verified-brightgreen.svg)](https://keybase.io/hrbrmstr)
![Signed commit
%](https://img.shields.io/badge/Signed_Commits-100%25-lightgrey.svg)
[![Linux build
Status](https://travis-ci.org/hrbrmstr/htmlunitjars.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmlunitjars)
[![Coverage
Status](https://codecov.io/gh/hrbrmstr/htmlunitjars/branch/master/graph/badge.svg)](https://codecov.io/gh/hrbrmstr/htmlunitjars)
![Minimal R
Version](https://img.shields.io/badge/R%3E%3D-3.2.0-blue.svg)
![License](https://img.shields.io/badge/License-Apache-blue.svg)
# htmlunitjars
Java Archive Wrapper Supporting the ‘htmlunit’ Package
## Description
Contents of the ‘HtmlUnit’ & supporting Java archives
(<https://htmlunit.sourceforge.net/>). Version number reflects the
version number of the included ‘JAR’ file.
> *`HtmlUnit` is a “GUI-Less browser for Java programs”. It models HTML
> documents and provides an API that allows you to invoke pages, fill
> out forms, click links, etc… just like you do in your “normal”
> browser.*
>
> *It has fairly good JavaScript support (which is constantly improving)
> and is able to work even with quite complex AJAX libraries, simulating
> Chrome, Firefox or Internet Explorer depending on the configuration
> used.*
>
> *It is typically used for testing purposes or to retrieve information
> from web sites.*
>
> *`HtmlUnit` is not a generic unit testing framework. It is
> specifically a way to simulate a browser.*
## What’s Inside The Tin
Everything necessary to use the HtmlUnit library directly via `rJava`.
`HtmlUnit` Library JavaDoc:
<https://htmlunit.sourceforge.net/apidocs/index.html>
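To see which archives are actually bundled, you can list the JAR files in the installed package. This is a minimal sketch that assumes the package follows the usual `rJava` layout and keeps its JARs in a `java` directory; that location is an assumption, not something stated above.

``` r
# Assumption: the JAR files live in the installed package's "java" directory,
# which is the conventional location for rJava-based packages.
jar_dir <- system.file("java", package = "htmlunitjars")

# system.file() returns "" when the package or the directory cannot be found
if (nzchar(jar_dir)) {
  print(basename(list.files(jar_dir, pattern = "\\.jar$")))
} else {
  message("htmlunitjars (or its java/ directory) was not found")
}
```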
## Installation
``` r
install.packages("htmlunitjars", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/htmlunitjars.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/htmlunitjars")
# or
remotes::install_gitlab("hrbrmstr/htmlunitjars")
# or
remotes::install_bitbucket("hrbrmstr/htmlunitjars")
# or
remotes::install_github("hrbrmstr/htmlunitjars")
```
NOTE: To use the ‘remotes’ install options you will need to have the
[{remotes} package](https://github.com/r-lib/remotes) installed.
## Usage
``` r
library(htmlunitjars)
# current version
packageVersion("htmlunitjars")
## [1] '2.40.0'
```
### Give It A Go
`xml2::read_html()` cannot execute JavaScript, so the traditional
approach won’t work on a page that builds its table client-side:
``` r
library(rvest)
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
doc <- read_html(test_url)
html_table(doc)
## list()
```
☹️
We *can* do this with the classes from `HtmlUnit` provided by this JAR
wrapper package:
``` r
library(htmlunitjars)
```
Tell `HtmlUnit` to work like Chrome:
``` r
browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")
wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)
```
Tell it to wait for JavaScript to execute and not throw exceptions on
page resource errors:
``` r
invisible(wc$waitForBackgroundJavaScriptStartingBefore(.jlong(2000L)))
wc_opts <- wc$getOptions()
wc_opts$setThrowExceptionOnFailingStatusCode(FALSE)
wc_opts$setThrowExceptionOnScriptError(FALSE)
```
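Other `WebClientOptions` setters can be applied in the same way. The sketch below is only an illustration: `hu_tune_options()` is a hypothetical helper that is not part of this package, and the timeout value is an arbitrary choice. It assumes it is handed a `WebClient` object like the `wc` created above.

``` r
# Hypothetical helper (not part of htmlunitjars): apply a few more options
# to an existing HtmlUnit WebClient and return it invisibly.
hu_tune_options <- function(wc, timeout_ms = 10000L) {
  opts <- wc$getOptions()

  # Skip CSS processing, which is rarely needed when only scraping content
  opts$setCssEnabled(FALSE)

  # Connection and read timeout in milliseconds (the Java method takes an int)
  opts$setTimeout(as.integer(timeout_ms))

  invisible(wc)
}
```

Calling `hu_tune_options(wc)` before `wc$getPage()` would apply both settings to the client.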
Now, access the site again and get the table:
``` r
pg <- wc$getPage(test_url)
doc <- read_html(pg$asXml())
html_table(doc)
## [[1]]
## X1 X2
## 1 One Two
## 2 Three Four
## 3 Five Six
```
No need for Selenium or Splash!
The ultimate goal is to have an `htmlunit` package that provides a nicer
API than needing to know how to work with `rJava` directly.
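Until such a package exists, the steps in this walkthrough can be collected into a small function. This is a rough sketch rather than the planned API: `hu_read_html()` and its `js_wait_ms` argument are hypothetical names, and `rJava` is attached explicitly on the assumption that `J()`, `new()` for Java classes, and `.jlong()` come from it.

``` r
library(rJava)         # J(), new() for Java classes, .jlong()
library(htmlunitjars)  # puts the HtmlUnit JARs on the Java classpath
library(xml2)

# Hypothetical helper (not part of htmlunitjars): load a URL in HtmlUnit
# and return the rendered page as an xml2 document.
hu_read_html <- function(url, js_wait_ms = 2000) {
  browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")
  wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)
  on.exit(wc$close(), add = TRUE)  # release the client when the function exits

  wc_opts <- wc$getOptions()
  wc_opts$setThrowExceptionOnFailingStatusCode(FALSE)
  wc_opts$setThrowExceptionOnScriptError(FALSE)

  pg <- wc$getPage(url)

  # Wait for background JavaScript jobs due to start within js_wait_ms ms
  wc$waitForBackgroundJavaScriptStartingBefore(.jlong(js_wait_ms))

  read_html(pg$asXml())
}

# Usage (needs network access):
# doc <- hu_read_html("https://hrbrmstr.github.io/htmlunitjars/index.html")
# rvest::html_table(doc)
```

The sketch waits for background JavaScript after the page has loaded rather than before. That ordering is a design choice made here, not something taken from the walkthrough, and it may need tuning for pages with slower scripts.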
## htmlunitjars Metrics
| Lang | \# Files | (%) | LoC | (%) | Blank lines | (%) | \# Lines | (%) |
| :---- | -------: | ---: | --: | ---: | ----------: | ---: | -------: | ---: |
| XML | 1 | 0.09 | 69 | 0.41 | 0 | 0.00 | 0 | 0.00 |
| Java | 2 | 0.18 | 28 | 0.17 | 5 | 0.11 | 18 | 0.17 |
| Maven | 1 | 0.09 | 23 | 0.14 | 1 | 0.02 | 2 | 0.02 |
| Rmd | 1 | 0.09 | 21 | 0.12 | 35 | 0.74 | 50 | 0.47 |
| R | 5 | 0.45 | 15 | 0.09 | 1 | 0.02 | 36 | 0.34 |
| make | 1 | 0.09 | 13 | 0.08 | 5 | 0.11 | 0 | 0.00 |