Pass in HTML content as plain text, raw text, or a parsed object (from either the XML or xml2 package), or as an httr response object, along with an options list that specifies how the content will be tidied. You get back tidied content of the same object type as was passed in to the function.
# S3 method for response
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for default
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for character
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for raw
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for xml_document
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for HTMLInternalDocument
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for connection
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)
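content: HTML content as text, a raw vector, or a parsed object from the xml2 or XML packages.
verbose: logical (default: FALSE)
Tidied HTML/XHTML content. The object type will be the same as that of the input, except when the input is a connection, in which case a character vector is returned.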
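As a rough illustration of the type-preserving behaviour described above, the sketch below passes the same markup once as a character vector and once as a raw vector. The sample markup and variable names are hypothetical, and the call is guarded so that nothing runs when htmltidy is not installed.

```r
# Hypothetical sample input, supplied as both character and raw.
html_chr <- "<html><body><p>Test</p></body></html>"
html_raw <- charToRaw(html_chr)

# Guarded: the block is a no-op when htmltidy is not installed.
if (requireNamespace("htmltidy", quietly = TRUE)) {
  tidied_chr <- htmltidy::tidy_html(html_chr)
  tidied_raw <- htmltidy::tidy_html(html_raw)

  # Per the description above, each result should mirror its input type.
  print(class(tidied_chr))
  print(class(tidied_raw))
}
```

Inspecting `class()` of each result is a quick way to check the behaviour on your own installation.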
The default option TidyXhtmlOut will convert the input content to XHTML.
Currently supported options:
Logical: TidyAltText, TidyBodyOnly, TidyBreakBeforeBR, TidyCoerceEndTags, TidyDropEmptyElems, TidyDropEmptyParas, TidyFixBackslash, TidyFixComments, TidyGDocClean, TidyHideComments, TidyHtmlOut, TidyIndentContent, TidyJoinClasses, TidyJoinStyles, TidyLogicalEmphasis, TidyMakeBare, TidyMakeClean, TidyMark, TidyOmitOptionalTags, TidyReplaceColor, TidyUpperCaseAttrs, TidyUpperCaseTags, TidyWord2000, TidyXhtmlOut
Character: TidyDoctype, TidyInlineTags, TidyBlockTags, TidyEmptyTags, TidyPreTags
Integer: TidyIndentSpaces, TidyTabSize, TidyWrapLen
File an issue at https://github.com/hrbrmstr/htmltidy/issues if there are other libtidy options you'd like supported.
It is likely that the most used options will be: TidyXhtmlOut (logical), TidyHtmlOut (logical) and TidyDocType, which should be one of "omit", "html5", "auto", "strict" or "loose".
You can clean up Microsoft Word (2000) and Google Docs HTML via logical settings for TidyWord2000 and TidyGDocClean, respectively.
It may also be advantageous to remove all comments with TidyHideComments.
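A minimal sketch of the comment-removal option, assuming htmltidy is installed: it tidies a small, hypothetical snippet with TidyHideComments enabled and then checks whether the comment text is still present. The option names come from the list above; the sample markup and variable names are invented for illustration.

```r
# Hypothetical input containing an HTML comment.
html_with_comments <- "<html><body><!-- internal note --><p>Test</p></body></html>"

# Guarded: the block is a no-op when htmltidy is not installed.
if (requireNamespace("htmltidy", quietly = TRUE)) {
  cleaned <- htmltidy::tidy_html(
    html_with_comments,
    options = list(TidyDocType = "html5", TidyHideComments = TRUE)
  )

  # Reports whether the comment text survived tidying.
  print(grepl("internal note", cleaned, fixed = TRUE))
}
```

The same pattern should apply to TidyWord2000 and TidyGDocClean, which are set as logical entries in the same options list.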
If document parsing errors are severe enough, tidy_html() will not be able to clean the document. It will display the errors (this output can be captured with sink() or capture.output()) along with a warning, and return a "best effort" cleaned version of the document.
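One way to keep the diagnostics separate from the result is sketched below with capture.output(). It assumes that any displayed errors go to standard output, as the paragraph above implies, and that `verbose = TRUE` turns on diagnostic output; neither is verified here. The sample input is hypothetical.

```r
# Hypothetical malformed input: unclosed, nested inline tags.
untidy <- "<b>This is <b>some <i>really </i> poorly formatted HTML</b>"

# Guarded: the block is a no-op when htmltidy is not installed.
if (requireNamespace("htmltidy", quietly = TRUE)) {
  # capture.output() collects anything printed while the expression runs;
  # the assignment to `tidied` still takes effect in the calling environment.
  diagnostics <- capture.output(
    tidied <- htmltidy::tidy_html(untidy, verbose = TRUE)
  )

  cat(tidied)                  # the "best effort" cleaned document
  print(length(diagnostics))   # number of captured diagnostic lines, if any
}
```

If nothing is printed for a given input, `diagnostics` is simply an empty character vector.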
See http://api.html-tidy.org/tidy/quickref_5.1.25.html and https://github.com/htacg/tidy-html5/blob/master/include/tidyenum.h for definitions of the options supported above, and https://www.w3.org/People/Raggett/tidy/ for an explanation of what "tidy" HTML is and some canonical examples of what it can do.
library(htmltidy)

opts <- list(
  TidyDocType="html5",
  TidyMakeClean=TRUE,
  TidyHideComments=TRUE,
  TidyIndentContent=TRUE,
  TidyWrapLen=200
)

txt <- paste0(
  c("<html><head><style>p { color: red; }</style><body><!-- ===== body ====== -->",
    "<p>Test</p></body><!--Default Zone --> <!--Default Zone End--></html>"),
  collapse="")

cat(tidy_html(txt, options=opts))
#> <!DOCTYPE html>
#> <html>
#> <head>
#> <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
#> <style>
#> p { color: red; }
#> </style>
#> <title></title>
#> </head>
#> <body>
#> <p>
#> Test
#> </p>
#> </body>
#> </html>
#>

library(httr)
res <- GET("http://rud.is/test/untidy.html")

# look at the original, un-tidy source
cat(content(res, as="text", encoding="UTF-8"))
#> <head>
#> <style>
#> body { font-family: sans-serif; }
#> </style>
#> </head>
#> <body>
#> <b>This is <b>some <i>really </i> poorly formatted HTML</b>
#>
#> as is this <span id="sp">portion<div>
#>

# see the tidied version
cat(tidy_html(content(res, as="text", encoding="UTF-8"),
              list(TidyDocType="html5", TidyWrapLen=200)))
#> <!DOCTYPE html>
#> <html>
#> <head>
#> <meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0">
#> <style>
#> body { font-family: sans-serif; }
#> </style>
#> <title></title>
#> </head>
#> <body>
#> <b>This is some <i>really</i> poorly formatted HTML as is this <span id="sp">portion</span></b>
#> <div><span id="sp"></span></div>
#> </body>
#> </html>
#>

# but, you could also just do:
cat(tidy_html(url("http://rud.is/test/untidy.html")))
#> <!DOCTYPE html>
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <meta name="generator" content=
#> "HTML Tidy for HTML5 for R version 5.0.0" />
#> <style>
#> <![CDATA[
#> body { font-family: sans-serif; }
#> ]]>
#> </style>
#> <title></title>
#> </head>
#> <body>
#> <b>This is some <i>really</i> poorly formatted HTMLas is this
#> <span id="sp">portion</span></b>
#> <div><span id="sp"></span></div>
#> </body>
#> </html>
#>