htmltidy: Clean up gnarly HTML/XML
Inspired by this Stack Overflow question, and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.
NOTE: Requires libtidy and is presently super-basic (there is no way to set options, and it pretty much only does HTML).
You'll first need to run brew install tidy-html5 on macOS or apt-get install libtidy-dev on Ubuntu/Debian to get this to work. Note that the Linux libraries may be older and return slightly different (but no less tidy) HTML.
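As a rough sketch of the prerequisite step above, the shell function below prints the matching install command for a platform. It is not part of htmltidy: the name libtidy_install_hint is hypothetical, and the only commands it knows are the two named in this README. It prints the command and installs nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of htmltidy): print the libtidy install
# command named in this README for a given OS. It only prints the command.
libtidy_install_hint() {
  os="${1:-$(uname -s)}"   # default to the current platform's kernel name
  case "$os" in
    Darwin) echo "brew install tidy-html5" ;;        # macOS, via Homebrew
    Linux)  echo "apt-get install libtidy-dev" ;;    # Ubuntu/Debian; usually needs root
    *)      echo "no packaged libtidy command listed for $os" ;;
  esac
}

# Show the hint for the machine this runs on.
libtidy_install_hint
```

Passing a name explicitly, as in libtidy_install_hint Darwin, shows the command for another platform without running anything.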
SEEKING COLLABORATORS
This works enough for me to use in a pinch. It should be straightforward (but tedious) to:
- enable passing options in a list
- bundle libtidy with the package and get it to work on Windows, Linux & macOS, as the library compiles on all three with the necessary tools.
The following functions are implemented:
tidy: Clean up gnarly HTML/XML
Installation
devtools::install_github("hrbrmstr/htmltidy")
Usage
library(htmltidy)
# current version
packageVersion("htmltidy")
#> [1] '0.1.0.9000'
cat(tidy("<b><p><a href='http://google.com'>google ></a></p></b>"))
#> <!DOCTYPE html>
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for Mac OS X version 5.2.0" />
#> <title></title>
#> </head>
#> <body>
#> <p><b><a href='http://google.com'>google ></a></b></p>
#> </body>
#> </html>
Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.