htmltidy — Clean up gnarly HTML/XML

Inspired by this SO question, and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

NOTE: Requires libtidy and is presently super-basic (there is no way to set options, and it pretty much only handles HTML).

To get this to work, first run brew install tidy-html5 on macOS or apt-get install libtidy-dev on Ubuntu/Debian.
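As a rough sketch of the step above, the helper below prints (but does not run) the install command for a given platform. The function name libtidy_install_cmd is hypothetical and not part of htmltidy, and the Linux branch assumes a Debian/Ubuntu system; other distributions will need their own package.

```shell
# Sketch: print (not run) the libtidy install command for a platform name.
# libtidy_install_cmd is a hypothetical helper, not part of htmltidy.
libtidy_install_cmd() {
  case "$1" in
    Darwin) echo "brew install tidy-html5" ;;      # macOS, via Homebrew
    Linux)  echo "apt-get install libtidy-dev" ;;  # assumes Debian/Ubuntu
    *)      echo "no known libtidy package for $1" >&2; return 1 ;;
  esac
}

# Show the command for the current machine; review it before running it.
libtidy_install_cmd "$(uname -s)" || echo "install libtidy manually"
```

Printing the command rather than executing it keeps the sketch safe to paste into a terminal. Depending on your setup, the apt-get command may need elevated privileges.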


This works enough for me to use in a pinch. It should be straightforward (but tedious) to:

  • enable passing options in a list
  • bundle libtidy with the package and get it to work on Windows, Linux & macOS, since the library compiles on all three with the necessary tools.

The following functions are implemented:

  • tidy : Clean up gnarly HTML/XML





# current version
#> [1] ''

cat(tidy("<b><p><a href=''>google &gt</a></p></b>"))
#> <!DOCTYPE html>
#> <html xmlns="">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for Mac OS X version 5.2.0" />
#> <title></title>
#> </head>
#> <body>
#> <p><b><a href=''>google &gt;</a></b></p>
#> </body>
#> </html>

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.