Java Archive Wrapper Supporting the ‘htmlunit’ Package
 
Latest commit ec325fbc57 by boB Rudis: "2.43.0" (4 years ago)

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

htmlunitjars

Description

Contents of the ‘HtmlUnit’ & supporting Java archives (https://htmlunit.sourceforge.net/). The package version number matches the version of the bundled ‘JAR’ file.

HtmlUnit is a “GUI-Less browser for Java programs”. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc… just like you do in your “normal” browser.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.

It is typically used for testing purposes or to retrieve information from web sites.

HtmlUnit is not a generic unit testing framework. It is specifically a way to simulate a browser.

What’s Inside The Tin

Everything necessary to use the HtmlUnit library directly via rJava.

HtmlUnit Library JavaDoc: https://htmlunit.sourceforge.net/apidocs/index.html
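
To see which archives are actually bundled, a minimal sketch follows. It assumes the JARs are installed under the package's `java` directory, which is the conventional location for rJava-based packages; that layout is an assumption here, not something this README states.

```r
# Assumption: the JARs sit in the installed package's "java" directory,
# the conventional location that rJava's .jpackage() adds to the class path.
jar_dir <- system.file("java", package = "htmlunitjars")

# system.file() returns "" when the package or directory cannot be found,
# so fall back to an empty listing in that case.
jars <- if (nzchar(jar_dir)) {
  basename(list.files(jar_dir, pattern = "\\.jar$"))
} else {
  character(0)
}

print(jars)
```

The file names should include the HtmlUnit JAR itself plus its supporting dependencies, and the HtmlUnit version in the file name should line up with the package version.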

Installation

install.packages("htmlunitjars", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/htmlunitjars.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/htmlunitjars")
# or
remotes::install_gitlab("hrbrmstr/htmlunitjars")
# or
remotes::install_bitbucket("hrbrmstr/htmlunitjars")
# or
remotes::install_github("hrbrmstr/htmlunitjars")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(htmlunitjars)

# current version
packageVersion("htmlunitjars")
## [1] '2.40.0'

Give It A Go

xml2::read_html() cannot execute JavaScript, so the traditional approach won’t work:

library(rvest)

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

doc <- read_html(test_url)

html_table(doc)
## list()

☹️

We can do this with the classes from HtmlUnit provided by this JAR wrapper package:

library(htmlunitjars)

Tell HtmlUnit to work like Chrome:

browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")

wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)

Tell it to wait for JavaScript to execute and not throw exceptions on page resource errors:

invisible(wc$waitForBackgroundJavaScriptStartingBefore(.jlong(2000L)))

wc_opts <- wc$getOptions()
wc_opts$setThrowExceptionOnFailingStatusCode(FALSE)
wc_opts$setThrowExceptionOnScriptError(FALSE)
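
The same options object exposes other setters. The sketch below groups the two calls above with `setCssEnabled()` and `setTimeout()`, both of which are `WebClientOptions` methods in HtmlUnit 2.x. The function name `configure_client()` and its defaults are hypothetical, chosen only for illustration.

```r
# Hypothetical helper (not part of htmlunitjars): apply a set of
# WebClientOptions to an existing WebClient and return the client invisibly.
configure_client <- function(wc, timeout_ms = 10000L, css = FALSE) {
  opts <- wc$getOptions()

  # Same two settings as above: do not abort on HTTP errors or script errors
  opts$setThrowExceptionOnFailingStatusCode(FALSE)
  opts$setThrowExceptionOnScriptError(FALSE)

  # Skipping CSS processing is often enough when only the DOM content matters
  opts$setCssEnabled(css)

  # setTimeout() takes a Java int (milliseconds), so pass an R integer
  opts$setTimeout(as.integer(timeout_ms))

  invisible(wc)
}

# Usage, with the `wc` object created earlier:
# configure_client(wc, timeout_ms = 5000L)
```

Passing R integers and logicals matters here because rJava picks the Java method by argument type.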

Now, access the site again and get the table:

pg <- wc$getPage(test_url)

doc <- read_html(pg$asXml())

html_table(doc)
## [[1]]
##      X1   X2
## 1   One  Two
## 2 Three Four
## 3  Five  Six

No need for Selenium or Splash!
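
The steps above can be collected into one function. This is a sketch under stated assumptions, not part of the package: `render_html()` and its `wait_ms` argument are hypothetical names. It differs from the walkthrough in two ways. It calls `waitForBackgroundJavaScriptStartingBefore()` after `getPage()`, since per the HtmlUnit JavaDoc that method waits on background jobs belonging to pages already loaded. It also closes the client when done via `WebClient`'s `close()` method.

```r
# Hypothetical helper (not part of htmlunitjars). It expects rJava and
# htmlunitjars to be attached so that J(), new() and .jlong() are available,
# as in the walkthrough above.
render_html <- function(url, wait_ms = 2000L) {
  browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")
  wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)
  on.exit(wc$close(), add = TRUE)  # release the client even on error

  opts <- wc$getOptions()
  opts$setThrowExceptionOnFailingStatusCode(FALSE)
  opts$setThrowExceptionOnScriptError(FALSE)

  pg <- wc$getPage(url)

  # Give background JavaScript started by the page up to wait_ms to finish
  wc$waitForBackgroundJavaScriptStartingBefore(.jlong(wait_ms))

  # Hand the rendered DOM to xml2 so rvest functions can work on it
  xml2::read_html(pg$asXml())
}

# Guarded usage: does nothing where the packages are not installed.
if (requireNamespace("htmlunitjars", quietly = TRUE) &&
    requireNamespace("rvest", quietly = TRUE)) {
  library(rJava)
  library(htmlunitjars)
  doc <- render_html("https://hrbrmstr.github.io/htmlunitjars/index.html")
  print(rvest::html_table(doc))
}
```

Creating a fresh client per call keeps the function free of shared state, at the cost of start-up time; reusing a single client across many pages would be the obvious alternative.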

The ultimate goal is an htmlunit package with a nicer API, so that users do not need to know how to work with rJava directly.

htmlunitjars Metrics

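| Lang  | # Files | (%)  | LoC | (%)  | Blank lines | (%)  | # Lines | (%)  |
|-------|---------|------|-----|------|-------------|------|---------|------|
| XML   | 1       | 0.09 | 69  | 0.41 | 0           | 0.00 | 0       | 0.00 |
| Java  | 2       | 0.18 | 28  | 0.17 | 5           | 0.11 | 18      | 0.17 |
| Maven | 1       | 0.09 | 23  | 0.14 | 1           | 0.02 | 2       | 0.02 |
| Rmd   | 1       | 0.09 | 21  | 0.12 | 35          | 0.74 | 50      | 0.47 |
| R     | 5       | 0.45 | 15  | 0.09 | 1           | 0.02 | 36      | 0.34 |
| make  | 1       | 0.09 | 13  | 0.08 | 5           | 0.11 | 0       | 0.00 |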