htmltidy — Clean up gnarly HTML/XML

Inspired by this SO question, and by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.

It relies on a locally included version of libtidy and is presently super-basic: there is no way to set options, and it pretty much only does HTML.

This works well enough for me to use in a pinch. It should be straightforward (but tedious) to:

  • enable passing options in a list
  • get it working on Windows

The following functions are implemented:

  • tidy_html : Clean up gnarly HTML/XML

TODO

Fix:

* checking compiled code ... WARNING
File ‘htmltidy/libs/htmltidy.so’:
  Found ‘___stderrp’, possibly from ‘stderr’ (C)
    Objects: ‘alloc.o’, ‘streamio.o’, ‘tidylib.o’
  Found ‘___stdoutp’, possibly from ‘stdout’ (C)
    Objects: ‘sprtf.o’, ‘tidylib.o’
  Found ‘_exit’, possibly from ‘exit’ (C)
    Objects: ‘alloc.o’, ‘sprtf.o’

Installation

devtools::install_github("hrbrmstr/htmltidy")

Usage

library(htmltidy)

# current version
packageVersion("htmltidy")
#> [1] '0.2.0.9000'

cat(tidy_html("<b><p><a href='http://google.com'>google &gt</a></p></b>"))
#> <!DOCTYPE html>
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for R version 5.0.0" />
#> <title></title>
#> </head>
#> <body>
#> <p><b><a href='http://google.com'>google &gt;</a></b></p>
#> </body>
#> </html>
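
Since the motivation is scraping, here is a minimal sketch of feeding the tidied output into a parser. It assumes the xml2 package is installed (it is not part of htmltidy) and that tidy_html() returns the cleaned document as a character string, as the cat() call above suggests. The variable names are illustrative only.

```r
library(htmltidy)
library(xml2)  # assumption: xml2 is installed separately; htmltidy does not provide it

# Same malformed fragment as in the usage example
messy <- "<b><p><a href='http://google.com'>google &gt</a></p></b>"

# Clean the fragment into a complete, well-formed document (character string)
clean <- tidy_html(messy)

# Parse the cleaned document and select every link
doc   <- read_html(clean)
links <- xml_find_all(doc, "//a")

# Extract the link targets and the link text
xml_attr(links, "href")
xml_text(links)
```

read_html() uses an HTML parser, so the plain "//a" XPath should match even though the tidied document declares an XHTML namespace; parsing with read_xml() instead would likely require namespace-aware XPath.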

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.