---
output: rmarkdown::github_document
---
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master)](https://travis-ci.org/hrbrmstr/htmltidy)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/NA/NA?branch=master&svg=true)](https://ci.appveyor.com/project/NA/NA)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/htmltidy)](https://cran.r-project.org/package=htmltidy)
![downloads](http://cranlogs.r-pkg.org/badges/grand-total/htmltidy)
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
message = FALSE,
warning = FALSE,
error = FALSE,
fig.retina=2,
fig.path = "README-"
)
```
`htmltidy` — Clean up gnarly HTML/XHTML
This package was inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and by the sheer amount of cruddy HTML out there that needs fixing before you can scrape data from it properly.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.
The following functions are implemented:
- `tidy_html` : Tidy or "Pretty Print" HTML/XHTML Documents
### Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/htmltidy")
```
```{r echo=FALSE}
options(width=120)
```
### Usage
```{r message=FALSE, warning=FALSE}
library(htmltidy)
# current version
packageVersion("htmltidy")
library(XML)
library(xml2)
library(httr)
library(purrr)
```
This is really "un-tidy" content:
```{r message=FALSE, warning=FALSE}
res <- GET("http://rud.is/test/untidy.html")
cat(content(res, as="text"))
```
Let's see what `tidy_html()` does to it.
It can handle the `response` object directly:
```{r message=FALSE, warning=FALSE}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
```
But, you'll probably mostly use it on HTML you've identified as gnarly and already have that HTML text content handy:
```{r message=FALSE, warning=FALSE}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```
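As a rough sketch of that workflow, you can tidy the text first and then hand the result to `xml2` for extraction. The `gnarly` snippet here is made up for illustration (it is not the test page), and the XPath is just one way to pull out the paragraphs:

```r
library(htmltidy)
library(xml2)

# Hypothetical snippet with unclosed <p> and <b> tags, standing in for
# text you have already scraped.
gnarly <- "<html><body><p>First<p>Second <b>bold</body></html>"

# Tidy it with the same options used elsewhere in this README.
clean <- tidy_html(gnarly, list(TidyDocType="html5", TidyWrapLen=200))

# Parse the repaired markup and extract the paragraph text.
doc <- read_html(clean)
paras <- xml_find_all(doc, "//p")
xml_text(paras)
```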
NOTE: you could also just have done:
```{r message=FALSE, warning=FALSE}
cat(tidy_html(url("http://rud.is/test/untidy.html"),
list(TidyDocType="html5", TidyWrapLen=200)))
```
You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):
```{r message=FALSE, warning=FALSE}
pg <- read_html("http://rud.is/test/untidy.html")
cat(toString(pg))
```
It can also deal with "raw" and parsed objects:
```{r message=FALSE, warning=FALSE}
tidy_html(content(res, as="raw"))
tidy_html(content(res, as="text", encoding="UTF-8"))
tidy_html(content(res, as="parsed", encoding="UTF-8"))
tidy_html(htmlParse("http://rud.is/test/untidy.html"))
```
Set `verbose=TRUE` to show the markup errors:
```{r message=FALSE, warning=FALSE}
invisible(tidy_html(url("http://rud.is/test/untidy.html"), verbose=TRUE))
```
### Testing Options
```{r message=FALSE, warning=FALSE}
opts <- list(TidyDocType="html5",
TidyMakeClean=TRUE,
TidyHideComments=TRUE,
TidyIndentContent=FALSE,
TidyWrapLen=200)
txt <- "<html>
<head>
<style>
p { color: red; }
</style>
<body>
<!-- ===== body ====== -->
<p>Test</p>
</body>
<!--Default Zone
-->
<!--Default Zone End-->
</html>"
cat(tidy_html(txt, options=opts))
```
But, you're probably better off running it on plain HTML source.
Since it's C/C++-backed, it's pretty fast:
```{r message=FALSE, warning=FALSE}
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```
(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
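If you want a steadier number than a single `system.time()` call gives, one option is to time a few runs and take the median. This sketch uses a made-up synthetic document so it doesn't depend on the network; the document size and run count are arbitrary, and your timings will differ:

```r
library(htmltidy)

# Build a synthetic, slightly sloppy document (unclosed <p> tags) so the
# timing is repeatable offline. 5000 paragraphs is an arbitrary choice.
html <- paste0(
  "<html><body>",
  paste(rep("<p>Some <b>text</b>", 5000), collapse = ""),
  "</body></html>"
)

# Time ten runs and report the median elapsed time in seconds.
timings <- replicate(10, system.time(tidy_html(html))[["elapsed"]])
median(timings)
```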
### Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
By participating in this project you agree to abide by its terms.