---
output: rmarkdown::github_document
---
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmltidy)

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  fig.retina = 2,
  fig.path = "README-"
)
```

`htmltidy` — Clean up gnarly HTML/XHTML

Inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

The following functions are implemented:

- `tidy_html` : Clean up gnarly HTML/XHTML

### Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/htmltidy")
```

```{r echo=FALSE}
options(width=120)
```

### Usage

```{r message=FALSE, warning=FALSE}
library(htmltidy)

# current version
packageVersion("htmltidy")

library(XML)
library(xml2)
library(httr)

res <- GET("http://rud.is")

head(tidy_html(res$content), 256)

head(tidy_html(content(res, as="raw")), 256)

(class(tidy_html(content(res, as="text", encoding="UTF-8")))) # output is too long to show

tidy_html(content(res, as="parsed", encoding="UTF-8")) # same as tidy_html(read_html("http://rud.is"))

(class(tidy_html(htmlParse("http://rud.is")))) # output is too long to show
```
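
Since the motivation is scraping, the calls above are usually followed by a parsing step. The chunk below is a minimal sketch of that hand-off, assuming character input comes back as a character string (as the `cat()` call in the options example suggests). The `messy` fragment and the XPath query are made up for illustration and are not part of the package.

```{r eval=FALSE}
library(htmltidy)
library(xml2)

# Hypothetical broken fragment: neither <p> nor <b> is ever closed
messy <- "<p>Unclosed paragraph<b>bold text"

# Tidy first; character in, character out (assumption stated above)
clean <- tidy_html(messy)

# Hand the cleaned markup to xml2 and pull out the bold run
doc <- read_html(clean)
bold <- xml_text(xml_find_all(doc, "//b"))
bold
```

`xml2::read_html()` already tolerates a fair amount of broken markup on its own, so tidying first is mostly useful when the cleaned HTML itself needs to be stored, compared, or passed to a stricter parser.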

### Testing Options

```{r message=FALSE, warning=FALSE}

opts <- list(
  TidyDocType = "html5",
  TidyMakeClean = TRUE,
  TidyHideComments = TRUE,
  TidyIndentContent = FALSE,
  TidyWrapLen = 200
)

txt <- "<html>
<head>
<style>
p { color: red; }
</style>
<body>
<!-- ===== body ====== -->
<p>Test</p>

</body>
<!--Default Zone
-->
<!--Default Zone End-->
</html>"

cat(tidy_html(txt, options=opts))

```

### Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
By participating in this project you agree to abide by its terms.