---
output: rmarkdown::github_document
---

[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmltidy)

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  fig.retina=2,
  fig.path = "README-"
)
```

`htmltidy`: Clean up gnarly HTML/XHTML

Inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

The following functions are implemented:

- `tidy_html` : Tidy or "Pretty Print" HTML/XHTML Documents

### Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/htmltidy")
```

```{r echo=FALSE}
options(width=120)
```

### Usage

```{r message=FALSE, warning=FALSE}
library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)
library(purrr)
```

This is really "un-tidy" content:

```{r message=FALSE, warning=FALSE}
res <- GET("http://rud.is/test/untidy.html")
cat(content(res, as="text"))
```

Let's see what `tidy_html()` does to it:

```{r message=FALSE, warning=FALSE}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```

NOTE: you could also just have done:

```{r message=FALSE, warning=FALSE}
cat(tidy_html(url("http://rud.is/test/untidy.html"), list(TidyDocType="html5", TidyWrapLen=200)))
```

You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):

```{r message=FALSE, warning=FALSE}
pg <- read_html("http://rud.is/test/untidy.html")
cat(toString(pg))
```

It can also deal with "raw" and parsed objects:

```{r message=FALSE, warning=FALSE}
tidy_html(content(res, as="raw"))
tidy_html(content(res, as="text", encoding="UTF-8"))
tidy_html(content(res, as="parsed", encoding="UTF-8"))
tidy_html(htmlParse("http://rud.is/test/untidy.html"))
```

### Testing Options

```{r message=FALSE, warning=FALSE}
opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)

txt <- "

Test

"

cat(tidy_html(txt, option=opts))
```

But, you're probably better off running it on plain HTML source.

Since it's C/C++-backed, it's pretty fast:

```{r message=FALSE, warning=FALSE}
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.

### Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.