htmltidy — Clean up gnarly HTML/XML
Inspired by this SO question, and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.
It relies on a locally included version of libtidy and is presently super-basic (no way to set options, and it pretty much only does HTML).
This works enough for me to use in a pinch. It should be straightforward (but tedious) to:
- enable passing options in a
- get it to work on Windows.
The following functions are implemented:
tidy_html: Clean up gnarly HTML/XML
```r
library(htmltidy)

# current version
packageVersion("htmltidy")
#> [1] '0.2.0.9000'

cat(tidy_html("<b><p><a href='http://google.com'>google ></a></p></b>"))
#> <!DOCTYPE html>
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for R version 5.0.0" />
#> <title></title>
#> </head>
#> <body>
#> <p><b><a href='http://google.com'>google ></a></b></p>
#> </body>
#> </html>
```
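Since the motivation here is scraping, the tidied string can be handed straight to a parser. The following is only a minimal sketch: it assumes the xml2 package is installed (xml2 is not part of htmltidy and is just one possible parser), and `extract_links` is a hypothetical helper name used for illustration.

```r
library(htmltidy)
library(xml2)  # assumption: used only for illustration; any HTML parser would do

# Hypothetical helper: tidy a raw HTML string, then collect every link target.
extract_links <- function(raw_html) {
  cleaned <- tidy_html(raw_html)  # character in, character out, as in the usage example
  doc <- read_html(cleaned)       # parse the cleaned-up markup
  xml_attr(xml_find_all(doc, ".//a"), "href")
}

links <- extract_links("<b><p><a href='http://google.com'>google ></a></p></b>")
print(links)
```

Tidying before parsing means the parser sees properly nested markup, which tends to make XPath queries against messy pages more predictable.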
Code of Conduct
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.