boB Rudis 89b63733ed
update README
8 years ago

htmltidy — Clean up gnarly HTML/XML

Inspired by this SO question, and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

NOTE: Requires libtidy and is presently super-basic (there is no way to set options, and it pretty much only handles HTML).

To get this to work, first install libtidy: brew install tidy-html5 on macOS, or apt-get install libtidy-dev on Ubuntu/Debian. NOTE that the Linux libraries may be older and return slightly different (but no less tidy) HTML.

SEEKING COLLABORATORS

This works enough for me to use in a pinch. It should be straightforward (but tedious) to:

  • enable passing options in a list
  • bundle libtidy with the package and get it to work on Windows, Linux & macOS (the library compiles on all three, given the necessary tools)

The following functions are implemented:

  • tidy : Clean up gnarly HTML/XML

Installation

devtools::install_github("hrbrmstr/htmltidy")

Usage

library(htmltidy)

# current version
packageVersion("htmltidy")
#> [1] '0.1.0.9000'

cat(tidy("<b><p><a href='http://google.com'>google &gt</a></p></b>"))
#> <!DOCTYPE html>
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for Mac OS X version 5.2.0" />
#> <title></title>
#> </head>
#> <body>
#> <p><b><a href='http://google.com'>google &gt;</a></b></p>
#> </body>
#> </html>
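
Since the point of tidying is to make scraped markup usable, here is a minimal sketch of passing the cleaned output to a parser. It assumes the xml2 package is installed (xml2 is not mentioned in this README and is used only for illustration) and that tidy() returns the cleaned document as a character string, as the cat() call above suggests.

```r
library(htmltidy)
library(xml2) # assumption: installed separately; not a documented dependency here

# the same malformed snippet used in the example above
messy <- "<b><p><a href='http://google.com'>google &gt</a></p></b>"

# clean it up with the one function this package exports
clean <- tidy(messy)

# parse the tidied markup and locate the anchor elements
doc <- read_html(clean)
links <- xml_find_all(doc, ".//a")

# extract the link targets and the link text
hrefs <- xml_attr(links, "href")
labels <- xml_text(links)

print(hrefs)
print(labels)
```

Tidying first means the parser sees a complete document with a proper head and body, rather than having to guess at the intended nesting itself.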

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.