mirror of https://git.sr.ht/~hrbrmstr/htmlunit
boB Rudis
5 years ago
12 changed files with 561 additions and 16 deletions
@ -0,0 +1,201 @@ |
|||
Apache License |
|||
Version 2.0, January 2004 |
|||
http://www.apache.org/licenses/ |
|||
|
|||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION |
|||
|
|||
1. Definitions. |
|||
|
|||
"License" shall mean the terms and conditions for use, reproduction, |
|||
and distribution as defined by Sections 1 through 9 of this document. |
|||
|
|||
"Licensor" shall mean the copyright owner or entity authorized by |
|||
the copyright owner that is granting the License. |
|||
|
|||
"Legal Entity" shall mean the union of the acting entity and all |
|||
other entities that control, are controlled by, or are under common |
|||
control with that entity. For the purposes of this definition, |
|||
"control" means (i) the power, direct or indirect, to cause the |
|||
direction or management of such entity, whether by contract or |
|||
otherwise, or (ii) ownership of fifty percent (50%) or more of the |
|||
outstanding shares, or (iii) beneficial ownership of such entity. |
|||
|
|||
"You" (or "Your") shall mean an individual or Legal Entity |
|||
exercising permissions granted by this License. |
|||
|
|||
"Source" form shall mean the preferred form for making modifications, |
|||
including but not limited to software source code, documentation |
|||
source, and configuration files. |
|||
|
|||
"Object" form shall mean any form resulting from mechanical |
|||
transformation or translation of a Source form, including but |
|||
not limited to compiled object code, generated documentation, |
|||
and conversions to other media types. |
|||
|
|||
"Work" shall mean the work of authorship, whether in Source or |
|||
Object form, made available under the License, as indicated by a |
|||
copyright notice that is included in or attached to the work |
|||
(an example is provided in the Appendix below). |
|||
|
|||
"Derivative Works" shall mean any work, whether in Source or Object |
|||
form, that is based on (or derived from) the Work and for which the |
|||
editorial revisions, annotations, elaborations, or other modifications |
|||
represent, as a whole, an original work of authorship. For the purposes |
|||
of this License, Derivative Works shall not include works that remain |
|||
separable from, or merely link (or bind by name) to the interfaces of, |
|||
the Work and Derivative Works thereof. |
|||
|
|||
"Contribution" shall mean any work of authorship, including |
|||
the original version of the Work and any modifications or additions |
|||
to that Work or Derivative Works thereof, that is intentionally |
|||
submitted to Licensor for inclusion in the Work by the copyright owner |
|||
or by an individual or Legal Entity authorized to submit on behalf of |
|||
the copyright owner. For the purposes of this definition, "submitted" |
|||
means any form of electronic, verbal, or written communication sent |
|||
to the Licensor or its representatives, including but not limited to |
|||
communication on electronic mailing lists, source code control systems, |
|||
and issue tracking systems that are managed by, or on behalf of, the |
|||
Licensor for the purpose of discussing and improving the Work, but |
|||
excluding communication that is conspicuously marked or otherwise |
|||
designated in writing by the copyright owner as "Not a Contribution." |
|||
|
|||
"Contributor" shall mean Licensor and any individual or Legal Entity |
|||
on behalf of whom a Contribution has been received by Licensor and |
|||
subsequently incorporated within the Work. |
|||
|
|||
2. Grant of Copyright License. Subject to the terms and conditions of |
|||
this License, each Contributor hereby grants to You a perpetual, |
|||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable |
|||
copyright license to reproduce, prepare Derivative Works of, |
|||
publicly display, publicly perform, sublicense, and distribute the |
|||
Work and such Derivative Works in Source or Object form. |
|||
|
|||
3. Grant of Patent License. Subject to the terms and conditions of |
|||
this License, each Contributor hereby grants to You a perpetual, |
|||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable |
|||
(except as stated in this section) patent license to make, have made, |
|||
use, offer to sell, sell, import, and otherwise transfer the Work, |
|||
where such license applies only to those patent claims licensable |
|||
by such Contributor that are necessarily infringed by their |
|||
Contribution(s) alone or by combination of their Contribution(s) |
|||
with the Work to which such Contribution(s) was submitted. If You |
|||
institute patent litigation against any entity (including a |
|||
cross-claim or counterclaim in a lawsuit) alleging that the Work |
|||
or a Contribution incorporated within the Work constitutes direct |
|||
or contributory patent infringement, then any patent licenses |
|||
granted to You under this License for that Work shall terminate |
|||
as of the date such litigation is filed. |
|||
|
|||
4. Redistribution. You may reproduce and distribute copies of the |
|||
Work or Derivative Works thereof in any medium, with or without |
|||
modifications, and in Source or Object form, provided that You |
|||
meet the following conditions: |
|||
|
|||
(a) You must give any other recipients of the Work or |
|||
Derivative Works a copy of this License; and |
|||
|
|||
(b) You must cause any modified files to carry prominent notices |
|||
stating that You changed the files; and |
|||
|
|||
(c) You must retain, in the Source form of any Derivative Works |
|||
that You distribute, all copyright, patent, trademark, and |
|||
attribution notices from the Source form of the Work, |
|||
excluding those notices that do not pertain to any part of |
|||
the Derivative Works; and |
|||
|
|||
(d) If the Work includes a "NOTICE" text file as part of its |
|||
distribution, then any Derivative Works that You distribute must |
|||
include a readable copy of the attribution notices contained |
|||
within such NOTICE file, excluding those notices that do not |
|||
pertain to any part of the Derivative Works, in at least one |
|||
of the following places: within a NOTICE text file distributed |
|||
as part of the Derivative Works; within the Source form or |
|||
documentation, if provided along with the Derivative Works; or, |
|||
within a display generated by the Derivative Works, if and |
|||
wherever such third-party notices normally appear. The contents |
|||
of the NOTICE file are for informational purposes only and |
|||
do not modify the License. You may add Your own attribution |
|||
notices within Derivative Works that You distribute, alongside |
|||
or as an addendum to the NOTICE text from the Work, provided |
|||
that such additional attribution notices cannot be construed |
|||
as modifying the License. |
|||
|
|||
You may add Your own copyright statement to Your modifications and |
|||
may provide additional or different license terms and conditions |
|||
for use, reproduction, or distribution of Your modifications, or |
|||
for any such Derivative Works as a whole, provided Your use, |
|||
reproduction, and distribution of the Work otherwise complies with |
|||
the conditions stated in this License. |
|||
|
|||
5. Submission of Contributions. Unless You explicitly state otherwise, |
|||
any Contribution intentionally submitted for inclusion in the Work |
|||
by You to the Licensor shall be under the terms and conditions of |
|||
this License, without any additional terms or conditions. |
|||
Notwithstanding the above, nothing herein shall supersede or modify |
|||
the terms of any separate license agreement you may have executed |
|||
with Licensor regarding such Contributions. |
|||
|
|||
6. Trademarks. This License does not grant permission to use the trade |
|||
names, trademarks, service marks, or product names of the Licensor, |
|||
except as required for reasonable and customary use in describing the |
|||
origin of the Work and reproducing the content of the NOTICE file. |
|||
|
|||
7. Disclaimer of Warranty. Unless required by applicable law or |
|||
agreed to in writing, Licensor provides the Work (and each |
|||
Contributor provides its Contributions) on an "AS IS" BASIS, |
|||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or |
|||
implied, including, without limitation, any warranties or conditions |
|||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A |
|||
PARTICULAR PURPOSE. You are solely responsible for determining the |
|||
appropriateness of using or redistributing the Work and assume any |
|||
risks associated with Your exercise of permissions under this License. |
|||
|
|||
8. Limitation of Liability. In no event and under no legal theory, |
|||
whether in tort (including negligence), contract, or otherwise, |
|||
unless required by applicable law (such as deliberate and grossly |
|||
negligent acts) or agreed to in writing, shall any Contributor be |
|||
liable to You for damages, including any direct, indirect, special, |
|||
incidental, or consequential damages of any character arising as a |
|||
result of this License or out of the use or inability to use the |
|||
Work (including but not limited to damages for loss of goodwill, |
|||
work stoppage, computer failure or malfunction, or any and all |
|||
other commercial damages or losses), even if such Contributor |
|||
has been advised of the possibility of such damages. |
|||
|
|||
9. Accepting Warranty or Additional Liability. While redistributing |
|||
the Work or Derivative Works thereof, You may choose to offer, |
|||
and charge a fee for, acceptance of support, warranty, indemnity, |
|||
or other liability obligations and/or rights consistent with this |
|||
License. However, in accepting such obligations, You may act only |
|||
on Your own behalf and on Your sole responsibility, not on behalf |
|||
of any other Contributor, and only if You agree to indemnify, |
|||
defend, and hold each Contributor harmless for any liability |
|||
incurred by, or claims asserted against, such Contributor by reason |
|||
of your accepting any such warranty or additional liability. |
|||
|
|||
END OF TERMS AND CONDITIONS |
|||
|
|||
APPENDIX: How to apply the Apache License to your work. |
|||
|
|||
To apply the Apache License to your work, attach the following |
|||
boilerplate notice, with the fields enclosed by brackets "[]" |
|||
replaced with your own identifying information. (Don't include |
|||
the brackets!) The text should be enclosed in the appropriate |
|||
comment syntax for the file format. We also recommend that a |
|||
file or class name and description of purpose be included on the |
|||
same "printed page" as the copyright notice for easier |
|||
identification within third-party archives. |
|||
|
|||
Copyright [yyyy] [name of copyright owner] |
|||
|
|||
Licensed under the Apache License, Version 2.0 (the "License"); |
|||
you may not use this file except in compliance with the License. |
|||
You may obtain a copy of the License at |
|||
|
|||
http://www.apache.org/licenses/LICENSE-2.0 |
|||
|
|||
Unless required by applicable law or agreed to in writing, software |
|||
distributed under the License is distributed on an "AS IS" BASIS, |
|||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|||
See the License for the specific language governing permissions and |
|||
limitations under the License. |
@ -1,4 +1,8 @@ |
|||
# Generated by roxygen2: do not edit by hand |
|||
|
|||
import(httr) |
|||
importFrom(jsonlite,fromJSON) |
|||
export("%>%") |
|||
export(hu_read_html) |
|||
import(htmlunitjars) |
|||
import(rJava) |
|||
import(rvest) |
|||
importFrom(magrittr,"%>%") |
|||
|
@ -0,0 +1,90 @@ |
|||
#' Read HTML from a URL with Browser Emulation & in a JavaScript Context |
|||
#' |
|||
#' Use a JavaScript-enabled browser context to read and render HTML from a URL. |
|||
#' |
|||
#' For the code in the examples, this is the site that is being scraped: |
|||
#' |
|||
#' \if{html}{ |
|||
#' \figure{test-url-table.png}{options: width="100\%" alt="Figure: test-url-table.png"} |
|||
#' } |
|||
#' |
|||
#' \if{latex}{ |
|||
#' \figure{test-url-table.png}{options: width=10cm} |
|||
#' } |
|||
#' |
|||
#' Note that it has a table of values but it is rendered via JavaScript. |
|||
#' |
|||
#' @param url URL to retrieve |
|||
#' @param emulate browser to emulate; one of "`best`", "`chrome`", "`firefox`", "`ie`" |
|||
#' @param ret what to return; if `html_document` (the default) then the HTML created |
|||
#' by the `HtmlUnit` emulated browser context is passed to [xml2::read_html()] |
|||
#' and an `xml2` `html_document`/`xml_document` is returned. Note that this causes |
|||
#' further HTML processing by `xml2`/`libxml2` so is not _exactly_ what |
|||
#' `HtmlUnit` generated. If you want the HTML code (text) without any further |
|||
#' processing then use `text` as the value. |
|||
#' @param js_delay time (ms) to let loaded JavaScript execute; default is 2 seconds (2000 ms) |
|||
#' @param timeout overall timeout (ms); `0` == infinite wait (not recommended); note: the |
|||
#' timeout is used twice: first in making the socket connection, |
|||
#' second for data retrieval. If the time is critical you must |
|||
#' allow for twice the time specified here. Default 30s (30000 ms) |
|||
#' @param ignore_ssl_errors Should SSL/TLS errors be ignored? The default (`TRUE`) is |
|||
#' a current hack due to how `HtmlUnit` seems to handle virtual hosted sites |
|||
#' with multiple vhosts and multiple certificates. You can try it with `FALSE` |
|||
#' initially and revert back to `TRUE` if you encounter issues. |
|||
#' @param enable_dnt Enable the "Do Not Track" header. Default: `FALSE`. |
|||
#' @param download_images Download images as the page is loaded? Since this |
|||
#' function is a high-level wrapper designed to do a read of HTML, |
|||
#' it is recommended that you leave this the default `FALSE` to save |
|||
#' time/bandwidth. |
|||
#' @param options options to pass to [xml2::read_html()] if `ret` == `html_document`. |
|||
#' @return an `xml2` `html_document`/`xml_document` if `ret` == `html_document` else |
|||
#' the HTML document text generated by `HtmlUnit`. |
|||
#' @export |
|||
#' @examples \dontrun{ |
|||
#' test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html" |
|||
#' hu_read_html(test_url) |
|||
#' } |
|||
hu_read_html <- function(url, |
|||
emulate = c("best", "chrome", "firefox", "ie"), |
|||
ret = c("html_document", "text"), |
|||
js_delay = 2000L, |
|||
timeout = 30000L, |
|||
ignore_ssl_errors = TRUE, |
|||
enable_dnt = FALSE, |
|||
download_images = FALSE, |
|||
options = c("RECOVER", "NOERROR", "NOBLANKS")) { |
|||
|
|||
emulate <- match.arg(emulate, c("best", "chrome", "firefox", "ie")) |
|||
ret <- match.arg(ret, c("html_document", "text")) |
|||
|
|||
available_browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion") |
|||
|
|||
switch( |
|||
emulate, |
|||
best = available_browsers$BEST_SUPPORTED, |
|||
chrome = available_browsers$CHROME, |
|||
firefox = available_browsers$FIREFOX_60, |
|||
ie = available_browsers$INTERNET_EXPLORER |
|||
) -> use_browser |
|||
|
|||
wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), use_browser) |
|||
|
|||
wc_opts <- wc$getOptions() |
|||
wc_opts$setThrowExceptionOnFailingStatusCode(FALSE) |
|||
wc_opts$setThrowExceptionOnScriptError(FALSE) |
|||
wc_opts$setTimeout(as.integer(timeout)) |
|||
|
|||
if (ignore_ssl_errors) wc_opts$setUseInsecureSSL(TRUE) |
|||
if (enable_dnt) wc_opts$setDoNotTrackEnabled(TRUE) |
|||
if (download_images) wc_opts$setDownloadImages(TRUE) |
|||
|
|||
# fetch the URL passed by the caller (not the `test_url` from the examples) |
|||
pg <- wc$getPage(url) |
|||
|
|||
# wait only after the page has loaded, so there are background JavaScript tasks to wait for |
|||
res <- wc$waitForBackgroundJavaScriptStartingBefore(.jlong(as.integer(js_delay))) |
|||
|
|||
if (ret == "html_document") return(xml2::read_html(pg$asXml(), options = options)) |
|||
|
|||
# asXml() returns the generated markup; asText() would return only the visible text |
|||
return(pg$asXml()) |
|||
|
|||
} |
|||
|
@ -0,0 +1,11 @@ |
|||
#' Pipe operator |
|||
#' |
|||
#' See \code{magrittr::\link[magrittr]{\%>\%}} for details. |
|||
#' |
|||
#' @name %>% |
|||
#' @rdname pipe |
|||
#' @keywords internal |
|||
#' @export |
|||
#' @importFrom magrittr %>% |
|||
#' @usage lhs \%>\% rhs |
|||
NULL |
@ -1,2 +1,78 @@ |
|||
|
|||
# htmlunit |
|||
|
|||
Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library |
|||
|
|||
## Description |
|||
|
|||
`HtmlUnit` (<http://htmlunit.sourceforge.net/>) is *a “‘GUI’-Less |
|||
browser for ‘Java’ programs”. It models ‘HTML’ documents and provides an |
|||
‘API’ that allows one to invoke pages, fill out forms, click links and |
|||
more just like one does in a “normal” browser. The library has fairly |
|||
good and constantly improving ‘JavaScript’ support and is able to work |
|||
even with quite complex ‘AJAX’ libraries, simulating ‘Chrome’, ‘Firefox’ |
|||
or ‘Internet Explorer’ depending on the configuration used. It is |
|||
typically used for testing purposes or to retrieve information from web |
|||
sites.* |
|||
|
|||
Tools are provided to work with this library at a higher level than |
|||
provided by the exposed ‘Java’ libraries in the |
|||
[`htmlunitjars`](https://gitlab.com/hrbrmstr/htmlunitjars) package. |
|||
|
|||
## What’s Inside The Tin |
|||
|
|||
The following functions are implemented: |
|||
|
|||
- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a |
|||
JavaScript Context |
|||
|
|||
## Installation |
|||
|
|||
``` r |
|||
devtools::install_github("hrbrmstr/htmlunitjars") |
|||
devtools::install_github("hrbrmstr/htmlunit") |
|||
``` |
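
`htmlunit` imports `rJava` (see the `import(rJava)` line in `NAMESPACE`), so a working JVM is needed at run time. The following is a minimal sketch of a pre-flight check, assuming `rJava` is already installed; `java_ok` is only an illustrative name and not part of either package:

``` r
# Try to start a JVM through rJava and read the Java version string.
# Any failure (rJava missing, no JVM found) is reported as FALSE.
java_ok <- requireNamespace("rJava", quietly = TRUE) &&
  isTRUE(tryCatch({
    rJava::.jinit()
    nzchar(rJava::.jcall("java/lang/System", "S", "getProperty", "java.version"))
  }, error = function(e) FALSE))

java_ok
```

If `java_ok` is `FALSE`, the Java/`rJava` setup probably needs fixing before `hu_read_html()` will work.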
|||
|
|||
## Usage |
|||
|
|||
``` r |
|||
library(htmlunit) |
|||
|
|||
# current version |
|||
packageVersion("htmlunit") |
|||
``` |
|||
|
|||
## [1] '0.1.0' |
|||
|
|||
Here is something `xml2::read_html()` cannot do: read the JavaScript-rendered table from |
|||
<https://hrbrmstr.github.io/htmlunitjars/index.html>: |
|||
|
|||
![](man/figures/test-url-table.png) |
|||
|
|||
``` r |
|||
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html" |
|||
|
|||
pg <- xml2::read_html(test_url) |
|||
|
|||
rvest::html_table(pg) |
|||
``` |
|||
|
|||
## list() |
|||
|
|||
☹️ |
|||
|
|||
But, `hu_read_html()` can\! |
|||
|
|||
``` r |
|||
pg <- hu_read_html(test_url) |
|||
|
|||
rvest::html_table(pg) |
|||
``` |
|||
|
|||
## [[1]] |
|||
## X1 X2 |
|||
## 1 One Two |
|||
## 2 Three Four |
|||
## 3 Five Six |
|||
|
|||
All without needing a separate Selenium or Splash server instance. |
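
The two steps above (read with `hu_read_html()`, then extract with `rvest::html_table()`) can be wrapped in a small helper. This is only a sketch: `first_js_table()` is a hypothetical name that does not exist in the package, and it assumes `rvest` is installed alongside `htmlunit`.

``` r
# Hypothetical helper, not part of htmlunit: return the first table found on a
# JavaScript-rendered page, or NULL when the page has no tables.
# Extra arguments (e.g. emulate, js_delay, timeout) are passed to hu_read_html().
first_js_table <- function(url, ...) {
  pg <- htmlunit::hu_read_html(url, ...)
  tbls <- rvest::html_table(pg)
  if (length(tbls) == 0) NULL else tbls[[1]]
}

# Usage (needs Java, htmlunit, and network access):
# first_js_table(test_url, emulate = "chrome", js_delay = 5000L)
```

Passing the arguments through `...` keeps the defaults of `hu_read_html()` in one place rather than repeating them in the helper.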
|||
|
Binary image file added (size: 18 KiB)
@ -0,0 +1,71 @@ |
|||
% Generated by roxygen2: do not edit by hand |
|||
% Please edit documentation in R/hu-read-html.R |
|||
\name{hu_read_html} |
|||
\alias{hu_read_html} |
|||
\title{Read HTML from a URL with Browser Emulation & in a JavaScript Context} |
|||
\usage{ |
|||
hu_read_html(url, emulate = c("best", "chrome", "firefox", "ie"), |
|||
ret = c("html_document", "text"), js_delay = 2000L, |
|||
timeout = 30000L, ignore_ssl_errors = TRUE, enable_dnt = FALSE, |
|||
download_images = FALSE, options = c("RECOVER", "NOERROR", |
|||
"NOBLANKS")) |
|||
} |
|||
\arguments{ |
|||
\item{url}{URL to retrieve} |
|||
|
|||
\item{emulate}{browser to emulate; one of "`best`", "`chrome`", "`firefox`", "`ie`"} |
|||
|
|||
\item{ret}{what to return; if `html_document` (the default) then the HTML created |
|||
by the `HtmlUnit` emulated browser context is passed to [xml2::read_html()] |
|||
and an `xml2` `html_document`/`xml_document` is returned. Note that this causes |
|||
further HTML processing by `xml2`/`libxml2` so is not _exactly_ what |
|||
`HtmlUnit` generated. If you want the HTML code (text) without any further |
|||
processing then use `text` as the value.} |
|||
|
|||
\item{js_delay}{time (ms) to let loaded JavaScript execute; default is 2 seconds (2000 ms)} |
|||
|
|||
\item{timeout}{overall timeout (ms); `0` == infinite wait (not recommended); note: the |
|||
timeout is used twice: first in making the socket connection, |
|||
second for data retrieval. If the time is critical you must |
|||
allow for twice the time specified here. Default 30s (30000 ms)} |
|||
|
|||
\item{ignore_ssl_errors}{Should SSL/TLS errors be ignored? The default (`TRUE`) is |
|||
a current hack due to how `HtmlUnit` seems to handle virtual hosted sites |
|||
with multiple vhosts and multiple certificates. You can try it with `FALSE` |
|||
initially and revert back to `TRUE` if you encounter issues.} |
|||
|
|||
\item{enable_dnt}{Enable the "Do Not Track" header. Default: `FALSE`.} |
|||
|
|||
\item{download_images}{Download images as the page is loaded? Since this |
|||
function is a high-level wrapper designed to do a read of HTML, |
|||
it is recommended that you leave this the default `FALSE` to save |
|||
time/bandwidth.} |
|||
|
|||
\item{options}{options to pass to [xml2::read_html()] if `ret` == `html_document`.} |
|||
} |
|||
\value{ |
|||
an `xml2` `html_document`/`xml_document` if `ret` == `html_document` else |
|||
the HTML document text generated by `HtmlUnit`. |
|||
} |
|||
\description{ |
|||
Use a JavaScript-enabled browser context to read and render HTML from a URL. |
|||
} |
|||
\details{ |
|||
For the code in the examples, this is the site that is being scraped: |
|||
|
|||
\if{html}{ |
|||
\figure{test-url-table.png}{options: width="100\%" alt="Figure: test-url-table.png"} |
|||
} |
|||
|
|||
\if{latex}{ |
|||
\figure{test-url-table.png}{options: width=10cm} |
|||
} |
|||
|
|||
Note that it has a table of values but it is rendered via JavaScript. |
|||
} |
|||
\examples{ |
|||
\dontrun{ |
|||
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html" |
|||
hu_read_html(test_url) |
|||
} |
|||
} |
@ -0,0 +1,12 @@ |
|||
% Generated by roxygen2: do not edit by hand |
|||
% Please edit documentation in R/utils-pipe.R |
|||
\name{\%>\%} |
|||
\alias{\%>\%} |
|||
\title{Pipe operator} |
|||
\usage{ |
|||
lhs \%>\% rhs |
|||
} |
|||
\description{ |
|||
See \code{magrittr::\link[magrittr]{\%>\%}} for details. |
|||
} |
|||
\keyword{internal} |