htmltidy

---
output: rmarkdown::github_document
editor_options: 
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

Partly inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r), and motivated by the great deal of cruddy HTML out there that needs fixing before it can be scraped properly.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.
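
A rough sketch of how the two widgets might be called, assuming the exported viewers are `xml_view()` and `xml_tree_view()` and that `xml_view()` accepts an `add_filter` argument to show the XPath input box (check `?xml_view` before relying on either). The sample document is made up for illustration. The widgets need an HTML-capable viewer such as RStudio or a browser, so they will not render in a `github_document`.

```r
library(htmltidy)
library(xml2)

# A tiny, made-up XML document used only for illustration
doc <- read_xml("<catalog><book id='1'><title>One</title></book><book id='2'><title>Two</title></book></catalog>")

# Syntax-highlighted view with an XPath filter box (add_filter is assumed; see ?xml_view)
xpath_widget <- xml_view(doc, add_filter = TRUE)

# Collapsible tree view of the same document
tree_widget <- xml_tree_view(doc)

# Printing a widget opens it in the viewer during an interactive session
if (interactive()) print(xpath_widget)
```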

## What's Inside The Tin

```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```

## Installation

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage

```{r usage}
library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)
library(purrr)
```

This is really "un-tidy" content:

```{r untidy-01}
res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))
```

Let's see what `tidy_html()` does to it.

It can handle the `response` object directly:

```{r tidy-01}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
```

But you'll mostly use it on HTML you've already identified as gnarly and have on hand as text:

```{r options-01}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```

NOTE: you could also have passed a `url()` connection directly:

```{r options-02}
cat(tidy_html(url("https://rud.is/test/untidy.html"), 
              list(TidyDocType="html5", TidyWrapLen=200)))
```

You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):

```{r options-03}
pg <- read_html("https://rud.is/test/untidy.html")
cat(toString(pg))
```

It can also deal with "raw" and parsed objects:

```{r raw-01}
tidy_html(content(res, as="raw"))

tidy_html(content(res, as="text", encoding="UTF-8"))

tidy_html(content(res, as="parsed", encoding="UTF-8"))
```

```{r raw-02, eval=FALSE}
tidy_html(suppressWarnings(htmlParse("https://rud.is/test/untidy.html")))
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
## <html xmlns="http://www.w3.org/1999/xhtml">
## <head>
## <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
## <title></title>
## </head>
## <body>
## <p>https://rud.is/test/untidy.html</p>
## </body>
## </html>
```

It can also show the markup errors it found (set `verbose=TRUE`):

```{r errors-01, eval=FALSE}
invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis <b>
## line 1 column 138 - Warning: missing </span> before <div>
## line 1 column 68 - Warning: missing </b> before <div>
## line 1 column 164 - Warning: inserting implicit <span>
## line 1 column 164 - Warning: missing </span>
## line 1 column 159 - Warning: missing </div>
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: <span> anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!
```

## Testing Options

```{r more-options-01}
opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)

txt <- "<html>
<head>
      <style>
        p { color: red; }
      </style>
    <body>
          <!-- ===== body ====== -->
         <p>Test</p>

    </body>
        <!--Default Zone
        -->
        <!--Default Zone End-->
</html>"

cat(tidy_html(txt, options=opts))
```

You're probably better off running it on plain HTML source, though.
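
A minimal sketch of that plain-source workflow: tidy a character string first, then hand the cleaned markup to `xml2` for querying. The `messy` fragment is made up for illustration, and the exact tidied output depends on the bundled `libtidy` version.

```r
library(htmltidy)
library(xml2)

# Made-up fragment with unclosed <p> and <b> tags
messy <- "<div><p>First paragraph<p>Second with <b>bold text</div>"

# Tidy the raw text; the character method returns the cleaned markup as text
clean <- tidy_html(messy, list(TidyDocType = "html5", TidyWrapLen = 200))
cat(clean)

# Parse the cleaned markup and pull out the paragraph text
doc <- read_html(clean)
paragraphs <- xml_text(xml_find_all(doc, ".//p"))
paragraphs
```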

Since it's C/C++-backed, it's pretty fast:

```{r speed-01}
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```

(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
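
A single `system.time()` call can be noisy. One way to get a steadier figure without depending on the network is to time several runs on a locally built string and take the median. The synthetic input below is an assumption for illustration and is not the page timed above, so the numbers will differ.

```r
library(htmltidy)

# Synthetic input (illustrative only): 2,000 paragraphs with no closing </p>
html <- paste0(
  "<html><body>",
  paste(rep("<p>Some text with <b>bold</b> words", 2000), collapse = "\n"),
  "</body></html>"
)

# Size of the input in characters
nchar(html)

# Time five runs and report the median elapsed seconds
timings <- replicate(5, system.time(tidy_html(html))[["elapsed"]])
median(timings)
```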

## htmltidy Metrics

```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.