---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```
```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```
```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```
Partly inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r), and partly because there's a great deal of cruddy HTML out there that needs fixing before it can be scraped properly.

It relies on a locally included version of [`libtidy`](http://www.html-tidy.org/) and works on macOS, Linux & Windows.

It also incorporates an `htmlwidget` to view and test XPath queries on HTML/XML content and another widget to view an XML document in a collapsible tree view.
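
As a rough sketch of the two widgets, assuming they are exposed as `xml_view()` and `xml_tree_view()` and accept a parsed `xml2` document (check the package index for the exact names and arguments), something like the following should build both views. The sample XML is made up for illustration.

```r
library(htmltidy)
library(xml2)

# Hypothetical sample document, for illustration only
txt <- "<catalog><book id='b1'><title>One</title></book><book id='b2'><title>Two</title></book></catalog>"
doc <- read_xml(txt)

# Source view of the document (assumed entry point: xml_view())
source_view <- xml_view(doc)

# Collapsible tree view (assumed entry point: xml_tree_view())
tree_view <- xml_tree_view(doc)
```

Printing `source_view` or `tree_view` at the console should render the widget in the viewer pane or browser.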
## What's Inside The Tin
```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```
## Installation
```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```
## Usage
```{r usage}
library(htmltidy)
# current version
packageVersion("htmltidy")
library(XML)
library(xml2)
library(httr)
library(purrr)
```
This is really "un-tidy" content:
```{r untidy-01}
res <- GET("https://rud.is/test/untidy.html")
cat(content(res, as="text"))
```
Let's see what `tidy_html()` does to it.
It can handle the `response` object directly:
```{r tidy-01}
cat(tidy_html(res, list(TidyDocType="html5", TidyWrapLen=200)))
```
But you'll probably mostly use it on HTML you've already identified as gnarly, with the text content already in hand:
```{r options-01}
cat(tidy_html(content(res, as="text"), list(TidyDocType="html5", TidyWrapLen=200)))
```
NOTE: you could also have passed a `url()` connection directly:
```{r options-02}
cat(tidy_html(url("https://rud.is/test/untidy.html"),
              list(TidyDocType="html5", TidyWrapLen=200)))
```
You'll see that this differs substantially from the mangling `libxml2` does (via `read_html()`):
```{r options-03}
pg <- read_html("https://rud.is/test/untidy.html")
cat(toString(pg))
```
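
To connect this back to scraping, here is a minimal sketch of the tidy-then-parse workflow: repair the markup with `tidy_html()`, then hand the cleaned text to `xml2` and query it with XPath. The `messy` fragment and the `//b//a` query are made up for illustration and are not taken from the test page.

```r
library(htmltidy)
library(xml2)

# Hypothetical messy fragment: the <b> and <p> elements are never closed
messy <- "<p>First <b>bold <a href='https://example.com/'>link</a><p>Second"

# Repair the markup first, using the same options as the examples above
clean <- tidy_html(messy, list(TidyDocType="html5", TidyWrapLen=200))

# Parse the repaired text and pull the href of every link inside a <b> element
doc <- read_html(clean)
xml_attr(xml_find_all(doc, "//b//a"), "href")
```

Running the same XPath query against `read_html(messy)` is a quick way to see whether the two parsers disagree about where the `<b>` element ends.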
It can also deal with "raw" and parsed objects:
```{r raw-01}
tidy_html(content(res, as="raw"))
tidy_html(content(res, as="text", encoding="UTF-8"))
tidy_html(content(res, as="parsed", encoding="UTF-8"))
```
```{r raw-02, eval=FALSE}
tidy_html(suppressWarnings(htmlParse("https://rud.is/test/untidy.html")))
## <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
## <html xmlns="http://www.w3.org/1999/xhtml">
## <head>
## <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
## <title></title>
## </head>
## <body>
## <p>https://rud.is/test/untidy.html</p>
## </body>
## </html>
```
To show the markup errors, set `verbose=TRUE`:
```{r errors-01, eval=FALSE}
invisible(tidy_html(url("https://rud.is/test/untidy.html"), verbose=TRUE))
## line 1 column 1 - Warning: missing <!DOCTYPE> declaration
## line 1 column 68 - Warning: nested emphasis <b>
## line 1 column 138 - Warning: missing </span> before <div>
## line 1 column 68 - Warning: missing </b> before <div>
## line 1 column 164 - Warning: inserting implicit <span>
## line 1 column 164 - Warning: missing </span>
## line 1 column 159 - Warning: missing </div>
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: <span> anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!
```
## Testing Options
```{r more-options-01}
opts <- list(TidyDocType="html5",
             TidyMakeClean=TRUE,
             TidyHideComments=TRUE,
             TidyIndentContent=FALSE,
             TidyWrapLen=200)
txt <- "<html>
<head>
<style>
p { color: red; }
</style>
<body>
<!-- ===== body ====== -->
<p>Test</p>
</body>
<!--Default Zone
-->
<!--Default Zone End-->
</html>"
cat(tidy_html(txt, options=opts))
```
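
One way to test a single option is to run the same input twice and change only that option. The sketch below does this for `TidyHideComments`, reusing option names from the chunk above; the input string and the variable names are hypothetical.

```r
library(htmltidy)

# Hypothetical input containing one HTML comment
sample_html <- "<html><head><title>t</title></head><body><!-- note --><p>Test</p></body></html>"

# Options shared by both runs
base_opts <- list(TidyDocType="html5", TidyWrapLen=200)

# Same input, differing only in TidyHideComments
comments_shown  <- tidy_html(sample_html, c(base_opts, list(TidyHideComments=FALSE)))
comments_hidden <- tidy_html(sample_html, c(base_opts, list(TidyHideComments=TRUE)))

# TRUE where a comment marker is still present in the output
c(shown  = grepl("<!--", comments_shown, fixed=TRUE),
  hidden = grepl("<!--", comments_hidden, fixed=TRUE))
```

Changing one option at a time keeps the difference between the two outputs attributable to that option alone.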
But, you're probably better off running it on plain HTML source.
Since it's C/C++-backed, it's pretty fast:
```{r speed-01}
book <- readLines("http://singlepageappbook.com/single-page.html")
sum(map_int(book, nchar))
system.time(tidy_book <- tidy_html(book))
```
(It's usually between 20 & 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.
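
A single `system.time()` call can be noisy, so a slightly more robust check is to repeat the run and take the median. This is only a sketch: the synthetic table below is made up so that no network access is needed, and the timings will vary by machine and will not match the figure quoted above.

```r
library(htmltidy)

# Synthetic document built locally: a table whose cells and rows are never closed
rows <- sprintf("<tr><td>%d<td>row %d", 1:2000, 1:2000)
html <- paste0("<table>", paste(rows, collapse=""), "</table>")

# Time ten runs and keep the elapsed seconds of each
timings <- vapply(1:10, function(i) {
  system.time(tidy_html(html))[["elapsed"]]
}, numeric(1))

nchar(html)      # size of the input in characters
median(timings)  # typical elapsed time in seconds
```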
## htmltidy Metrics
```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```
## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.