---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---

```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

Partly inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and partly by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.

## What's Inside The Tin

```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```

## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage

```{r usage}
library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)
library(purrr)
```

This is really "un-tidy" content:

```{r untidy-01}
res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))
```

Let's see what `tidy_html()` does to it.
It can handle the `response` object directly:

```{r tidy-01}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
```

But you'll probably mostly use it on HTML you've already identified as gnarly and whose text content you already have handy:

```{r options-01}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```

NOTE: you could also just have done:

```{r options-02}
cat(tidy_html(url("https://rud.is/test/untidy.html"), list(TidyDocType="html5", TidyWrapLen=200)))
```

You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):

```{r options-03}
pg <- read_html("https://rud.is/test/untidy.html")
cat(toString(pg))
```

It can also deal with "raw" and parsed objects:

```{r raw-01}
tidy_html(content(res, as="raw"))

tidy_html(content(res, as="text", encoding="UTF-8"))

tidy_html(content(res, as="parsed", encoding="UTF-8"))
```

```{r raw-02, eval=FALSE}
tidy_html(suppressWarnings(htmlParse("https://rud.is/test/untidy.html")))
##
##
##
##
##
##
##
##

https://rud.is/test/untidy.html

##
##
```

And, show the markup errors:

```{r errors-01, eval=FALSE}
invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing declaration
## line 1 column 68 - Warning: nested emphasis
## line 1 column 138 - Warning: missing before
## line 1 column 68 - Warning: missing before
## line 1 column 164 - Warning: inserting implicit
## line 1 column 164 - Warning: missing
## line 1 column 159 - Warning: missing
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!
```

## Testing Options

```{r more-options-01}
opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)

txt <- "

Test

"

cat(tidy_html(txt, options=opts))
```

But, you're probably better off running it on plain HTML source.

Since it's C/C++-backed, it's pretty fast:

```{r speed-01}
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.

## htmltidy Metrics

```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.