You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

4.5 KiB

output: rmarkdown::github_document
chunk_output_type: console
```{r pkg-knitr-opts, include=FALSE}

```{r badges, results='asis', echo=FALSE, cache=FALSE}

```{r description, results='asis', echo=FALSE, cache=FALSE}

Partly inspired by [this SO question]( and because there's a great deal of cruddy HTML out there that needs fixing to use properly when scraping data.

It relies on a locally included version of [`libtidy`]( and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapseable tree view.

## What's Inside The Tin

```{r ingredients, results='asis', echo=FALSE, cache=FALSE}

## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}

## Usage

```{r usage}

# current verison


This is really "un-tidy" content:

```{r untidy-01}
res <- GET("")
cat(content(res, as="text"))

Let's see what `tidy_html()` does to it.

It can handle the `response` object directly:

```{r tidy-01}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))

But, you'll probably mostly use it on HTML you've identified as gnarly and already have that HTML text content handy:

```{r options-01}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))

NOTE: you could also just have done:

```{r options-02}
list(TidyDocType="html5", TidyWrapLen=200)))

You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):

```{r options-03}
pg <- read_html("")

It can also deal with "raw" and parsed objects:

```{r raw-01}
tidy_html(content(res, as="raw"))

tidy_html(content(res, as="text", encoding="UTF-8"))

tidy_html(content(res, as="parsed", encoding="UTF-8"))

```{r raw-02, eval=FALSE}
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
## <html xmlns="">
## <head>
## <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
## <title></title>
## </head>
## <body>
## <p></p>
## </body>
## </html>

And, show the markup errors:

```{r errors-01, eval=FALSE}
invisible(tidy_html(url(""), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis <b>
## line 1 column 138 - Warning: missing </span> before <div>
## line 1 column 68 - Warning: missing </b> before <div>
## line 1 column 164 - Warning: inserting implicit <span>
## line 1 column 164 - Warning: missing </span>
## line 1 column 159 - Warning: missing </div>
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: <span> anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!

## Testing Options

```{r more-options-01}
opts <- list(TidyDocType="html5",

txt <- "<html>
p { color: red; }
<!-- ===== body ====== -->

<!--Default Zone
<!--Default Zone End-->

cat(tidy_html(txt, option=opts))

But, you're probably better off running it on plain HTML source.

Since it's C/C++-backed, it's pretty fast:

```{r speed-01}
book <- readLines("")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.

## htmltidy Metrics

```{r cloc, echo=FALSE}

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.