[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmltidy) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/hrbrmstr/htmltidy?branch=master&svg=true)](https://ci.appveyor.com/project/hrbrmstr/htmltidy) [![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/htmltidy)](https://cran.r-project.org/package=htmltidy) ![downloads](http://cranlogs.r-pkg.org/badges/grand-total/htmltidy)

# htmltidy

Tidy Up and Test XPath Queries on HTML and XML Content

## Description

This package was partly inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and partly by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data. It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.

## What’s inside the tin?

The following functions are implemented:

- `tidy_html`: Tidy or “Pretty Print” HTML/XHTML Documents
- `html_view`: HTML/XML pretty printer and viewer
- `xml_view`: HTML/XML pretty printer and viewer
- `html_tree_view`: HTML/XML tree viewer
- `xml_tree_view`: HTML/XML tree viewer

## Installation

``` r
devtools::install_github("hrbrmstr/htmltidy")
```

## Usage

``` r
library(htmltidy)

# current version
packageVersion("htmltidy")
## [1] '0.5.0'

library(XML)
library(xml2)
library(httr)
library(purrr)
```

This is really “un-tidy” content:

``` r
res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))
##
##
##
##
## This is some really poorly formatted HTML
##
## as is this portionhttps://rud.is/test/untidy.html
##
##
##
```

And, show the markup errors:

``` r
invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis
## line 1 column 138 - Warning: missing beforeTest
"

cat(tidy_html(txt, option=opts))
##
##
##
##
##
##Test
##
##
```

But, you’re probably better off running it on plain HTML source.

Since it’s C/C++-backed, it’s pretty fast:

``` r
book <- readLines("http://singlepageappbook.com/single-page.html")

sum(map_int(book, nchar))
## [1] 207501

system.time(tidy_book <- tidy_html(book))
##    user  system elapsed
##   0.026   0.001   0.027
```

(It’s usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.

## htmltidy Metrics

| Lang         | \# Files |  (%) |   LoC |  (%) | Blank lines |  (%) | \# Lines |  (%) |
| :----------- | -------: | ---: | ----: | ---: | ----------: | ---: | -------: | ---: |
| C            |       27 | 0.34 | 28646 | 0.81 |        4696 | 0.77 |     4304 | 0.59 |
| C/C++ Header |       37 | 0.47 |  5799 | 0.16 |        1227 | 0.20 |     2674 | 0.36 |
| C++          |        4 | 0.05 |   647 | 0.02 |         117 | 0.02 |       64 | 0.01 |
| R            |       10 | 0.13 |   151 | 0.00 |          38 | 0.01 |      235 | 0.03 |
| Rmd          |        1 | 0.01 |    53 | 0.00 |          51 | 0.01 |       68 | 0.01 |

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.
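
## Tidying before XPath queries (sketch)

As a supplement to the Usage section, here is a minimal sketch of the scraping workflow the Description alludes to: tidy a malformed snippet, then parse it and run an XPath query. The `untidy` string and the `//b` query are hypothetical examples rather than content from this README, and the sketch assumes `tidy_html()` is called with its default options and that the `xml2` package is installed.

``` r
library(htmltidy)
library(xml2)

# Hypothetical malformed snippet (not from this README): the <p> and <b>
# tags are never closed.
untidy <- "<p>An unclosed paragraph with <b>bold text"

# tidy_html() on a character input returns the repaired markup as text.
tidied <- tidy_html(untidy)

# Parse the repaired markup with the HTML parser and query it with XPath.
doc <- read_html(tidied)
bold_nodes <- xml_find_all(doc, "//b")

# Extract the text of every matched <b> node.
xml_text(bold_nodes)
```

`read_html()` is used here rather than `read_xml()` so that the XPath expression does not need namespace prefixes even if the tidied output is XHTML.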