---
output: rmarkdown::github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.retina=2,
fig.path = "README-"
)
```
`htmltidy` — Clean up gnarly HTML/XML
Inspired by [this SO question](http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r) and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.
NOTE: Requires [`libtidy`](http://www.html-tidy.org/) and is presently super-basic: there is no way to set options, and it pretty much only handles HTML.
Run `brew install tidy-html5` on OS X to get this to work. You'll have to do a bit more legwork to get it working on Linux: `apt-get install libtidy-dev` on Ubuntu sticks the library in a `tidy` subdir off `/usr/lib`, and I don't have a `configure` script set up yet.
**SEEKING COLLABORATORS**
This works well enough for me to use in a pinch. It should be straightforward (but tedious) to:
- enable passing options in a `list`
- bundle `libtidy` _with the package_ and get it to work on Windows, Linux, and macOS, since the library compiles on all three with the necessary tools.
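
For anyone picking up the first item, here is a rough sketch of what the R side of list-based options might look like. None of this exists in the package yet: `default_tidy_opts` and `merge_tidy_options()` are hypothetical names, and the option names are only illustrative, borrowed from `libtidy`'s configuration options. The sketch covers merging and validating the list only; handing the result to `libtidy` is left out.

```r
# Hypothetical sketch: not part of htmltidy's current API.
# Defaults use illustrative option names modeled on libtidy's config options.
default_tidy_opts <- list(
  indent = "auto",
  wrap = 68L,
  `output-xhtml` = FALSE
)

# Merge user-supplied options over the defaults, rejecting unnamed or
# unknown entries so typos fail early instead of being silently ignored.
merge_tidy_options <- function(user = list(), defaults = default_tidy_opts) {
  stopifnot(is.list(user))
  if (length(user) > 0 && (is.null(names(user)) || any(names(user) == ""))) {
    stop("all options must be named", call. = FALSE)
  }
  unknown <- setdiff(names(user), names(defaults))
  if (length(unknown) > 0) {
    stop("unknown option(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  utils::modifyList(defaults, user)
}

# Override one option; the rest keep their default values.
opts <- merge_tidy_options(list(wrap = 0L))
str(opts)
```

Validating against a known set of names is one possible design choice; a real implementation might instead pass everything through and let `libtidy` report bad options.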
The following functions are implemented:
- `tidy` : Clean up gnarly HTML/XML
### Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/htmltidy")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, error=FALSE}
options(width=120)
```
### Usage
```{r}
library(htmltidy)
# current version
packageVersion("htmltidy")
cat(tidy("<b><p><a href='http://google.com'>google &gt</a></p></b>"))
```
### Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md).
By participating in this project you agree to abide by its terms.