---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

## What's Inside The Tin

The following functions are implemented:

### DSL

- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance

- `wc_go`: Visit a URL
- `wc_html_nodes`: Select nodes from web client active page html content
- `wc_html_text`: Extract attributes, text and tag name from webclient page html content

- `wc_html_attr`: Extract attributes, text and tag name from webclient page html content
- `wc_html_name`: Extract attributes, text and tag name from webclient page html content
- `wc_headers`: Return response headers of the last web request for current page
- `wc_browser_info`: Retrieve information about the browser used to create the 'webclient'
- `wc_content_length`: Return content length of the last web request for current page
- `wc_content_type`: Return content type of web request for current page

- `wc_render`: Retrieve current page contents

- `wc_css`: Enable/Disable CSS support
- `wc_dnt`: Enable/Disable Do-Not-Track
- `wc_geo`: Enable/Disable Geolocation
- `wc_img_dl`: Enable/Disable Image Downloading
- `wc_load_time`: Return load time of the last web request for current page
- `wc_resize`: Resize the virtual browser window
- `wc_status`: Return status code of web request for current page
- `wc_timeout`: Change default request timeout
- `wc_title`: Return page title for current page
- `wc_url`: Return the URL of the current page
- `wc_use_insecure_ssl`: Enable/Disable Ignoring SSL Validation Issues
- `wc_wait`: Block HtmlUnit final rendering until all background JavaScript tasks have finished executing

### Just the Content (pls)

- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context

### Content++

- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL

## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage

```{r cache=FALSE}
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
```

Something `xml2::read_html()` cannot do: read the table from the test page pictured below.

![](man/figures/test-url-table.png)

```{r ex1}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
```

☹️

But, `hu_read_html()` can!

```{r ex2}
pg <- hu_read_html(test_url)

html_table(pg)
```

All without needing a separate Selenium or Splash server instance.
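
Since browser emulation is typically heavier than a plain parse, one possible pattern is to try `xml2::read_html()` first and fall back to `hu_read_html()` only when the static parse finds no tables. The following is a minimal sketch, not part of the package: `read_tables()` and its argument names are hypothetical, and it assumes `html_table()` returns an empty list for a page with no static tables.

```r
# Hypothetical helper (NOT part of htmlunit): try the cheap static parse
# first and only fall back to browser emulation when no tables are found.
# The readers are passed as arguments so they can be swapped or stubbed;
# the defaults are evaluated lazily, only when actually needed.
read_tables <- function(url,
                        static_reader  = xml2::read_html,
                        browser_reader = htmlunit::hu_read_html,
                        extract        = rvest::html_table) {
  # First attempt: plain HTML parse, no JavaScript executed
  tbls <- extract(static_reader(url))
  if (length(tbls) > 0) {
    return(tbls)
  }

  # Assumption: an empty result means the tables are built by JavaScript,
  # so retry with the emulated browser
  extract(browser_reader(url))
}
```

With the packages installed, `read_tables(test_url)` would go through both readers for the test page above, because the static parse finds no tables there.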
### Content++

We can also get a HAR-like content + metadata dump:

```{r ex3}
xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time)/1000
  )
```

### DSL

```{r ex4}
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath=".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
```

Handy function to get rendered plain text for text mining:

```{r ex5}
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```

### htmlunit Metrics

```{r echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.