[Travis CI](https://travis-ci.org/hrbrmstr/htmltidy) [CRAN](https://cran.r-project.org/package=htmltidy)

`htmltidy`: Clean up gnarly HTML/XHTML

Inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and by the great deal of cruddy HTML out there that needs fixing before it can be scraped properly.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

The following functions are implemented:

- `tidy_html` : Tidy or "Pretty Print" HTML/XHTML Documents

### Installation

``` r
devtools::install_github("hrbrmstr/htmltidy")
```

### Usage

``` r
library(htmltidy)

# current version
packageVersion("htmltidy")
## [1] '0.3.0'

library(XML)
library(xml2)
library(httr)
library(purrr)
```

This is really "un-tidy" content:

``` r
res <- GET("http://rud.is/test/untidy.html")
cat(content(res, as="text"))
## 
## 
## 
## 
## This is some really poorly formatted HTML
## 
## as is this portion

# `txt` is a character string of raw HTML and `opts` a list of tidy options
cat(tidy_html(txt, option=opts))
## 
## 
## 
## 
## 
## Test
## 
## 
```

But you're probably better off running it on plain HTML source. Since it's C/C++-backed, it's pretty fast:

``` r
book <- readLines("http://singlepageappbook.com/single-page.html")

sum(map_int(book, nchar))
## [1] 207501

system.time(tidy_book <- tidy_html(book))
##    user  system elapsed 
##   0.022   0.002   0.024
```

(It usually takes between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.

### Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.