---
output: rmarkdown::github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  fig.path = "README-"
)
```

`htmltidy` — Tidy Up and Test XPath Queries on HTML and XML Content

Partly inspired by a Stack Overflow question, and partly because there's a great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

It relies on a locally included version of `libtidy` and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content.
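
As a rough, non-interactive illustration of testing an XPath query on tidied content, the sketch below swaps in `xml2` to run the query instead of the widget. The `messy` string and the `//li` query are made up for this example, and it assumes `xml2` is installed alongside `htmltidy`.

```r
library(htmltidy)
library(xml2)

# Made-up input: list items without closing tags.
messy <- "<html><body><ul><li>alpha<li>beta<li>gamma</ul></body></html>"

# Tidy first, then parse the cleaned-up markup with xml2's HTML parser.
doc <- read_html(tidy_html(messy, options = list(TidyDocType = "html5")))

# Run an XPath query against the parsed document and pull out the text.
items <- xml_text(xml_find_all(doc, "//li"), trim = TRUE)
items
```

The widget functions listed in this README are meant for exploring such queries interactively; the `xml2` calls here only stand in for that step.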

The following functions are implemented:

- `tidy_html`: Tidy or "Pretty Print" HTML/XHTML Documents
- `html_view`: HTML/XML pretty printer and viewer
- `xml_view`: HTML/XML pretty printer and viewer
- `html_tree_view`: HTML/XML tree viewer
- `xml_tree_view`: HTML/XML tree viewer
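
For a quick feel of `tidy_html()` without any network access, here is a minimal sketch that runs it on an inline string. The `messy` input is invented for illustration; the option names are the ones used elsewhere in this README.

```r
library(htmltidy)

# Deliberately sloppy markup: no doctype, unclosed <p> and <b> tags.
messy <- "<html><head><title>demo</title></head><body><p>one<p><b>two</body></html>"

# Ask libtidy for HTML5 output, wrapped at 80 columns.
opts <- list(TidyDocType = "html5", TidyWrapLen = 80)

# For character input the result is a character string of tidied markup.
tidied <- tidy_html(messy, options = opts)
cat(tidied)
```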

### Installation

```{r eval=FALSE}
```

```{r echo=FALSE}
```

### Usage

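```{r message=FALSE, warning=FALSE}
library(htmltidy)
library(httr)   # GET(), content()
library(xml2)   # read_html()
library(purrr)  # map_int()

# current version
packageVersion("htmltidy")
```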

This is really "un-tidy" content:

```{r message=FALSE, warning=FALSE}
res <- GET("")
cat(content(res, as="text"))
```

Let's see what `tidy_html()` does to it.

It can handle the `response` object directly:

```{r message=FALSE, warning=FALSE}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
```

But you'll probably mostly use it on HTML you've already identified as gnarly and whose text content you already have handy:

```{r message=FALSE, warning=FALSE}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```

NOTE: you could also just have done:

```{r message=FALSE, warning=FALSE}
cat(tidy_html(url(""),
              list(TidyDocType="html5", TidyWrapLen=200)))
```

You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):

```{r message=FALSE, warning=FALSE}
pg <- read_html("")
```

It can also deal with "raw" and parsed objects:

```{r message=FALSE, warning=FALSE}
tidy_html(content(res, as="raw"))

tidy_html(content(res, as="text", encoding="UTF-8"))

tidy_html(content(res, as="parsed", encoding="UTF-8"))
```


It can also show the markup errors it finds:

```{r message=FALSE, warning=FALSE}
invisible(tidy_html(url(""), verbose=TRUE))
```

### Testing Options

```{r message=FALSE, warning=FALSE}
opts <- list(TidyDocType="html5")

txt <- "<html>
p { color: red; }
<!-- ===== body ====== -->

<!--Default Zone
<!--Default Zone End-->"

cat(tidy_html(txt, options=opts))
```


But, you're probably better off running it on plain HTML source.

Since it's C/C++-backed, it's pretty fast:

```{r message=FALSE, warning=FALSE}
book <- readLines("")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
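
To get a rough timing on your own machine without the book file, the sketch below builds a synthetic page of roughly 200 KB and times several runs. The generated table rows are a hypothetical stand-in for real content, so the numbers will not match the figure quoted above.

```r
library(htmltidy)

# Build a synthetic page of roughly 200 KB (a made-up stand-in for the book).
rows <- sprintf("<tr><td>row %d<td>%s", seq_len(5000), strrep("x", 20))
page <- paste0("<html><body><table>", paste(rows, collapse = ""), "</body></html>")
nchar(page)

# Repeat the call a few times and keep the elapsed seconds of each run.
timings <- vapply(
  1:5,
  function(i) system.time(tidy_html(page))[["elapsed"]],
  numeric(1)
)
median(timings)
```

Taking the median over a handful of runs smooths out one-off delays such as the first call loading the package's compiled code.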

### Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.