<!-- Generated by pkgdown: do not edit by hand -->
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Home. htmltidy</title>

<!-- jquery -->
<script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha384-nrOSfDHtoPMzJHjVTdCopGqIqeYETSXhZDFyniQ8ZHcVy08QesyHcnOUpMpqnmWq" crossorigin="anonymous"></script>

<!-- Bootstrap -->

<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>

<!-- Font Awesome icons -->
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous">


<!-- pkgdown -->
<link href="pkgdown.css" rel="stylesheet">
<script src="pkgdown.js"></script>

<!-- mathjax -->
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>

<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>

<body>
<div class="container">
<header>

<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">htmltidy</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="index.html">Home</a>
</li>
<li>
<a href="reference/index.html">Reference</a>
</li>
<li>
<a href="news/index.html">News</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/hrbrmstr/htmltidy">
<span class="fa fa-github fa-lg"></span>

</a>
</li>
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->

</header>

<div class="row">
<div class="col-md-9">

<p><a href="https://travis-ci.org/hrbrmstr/htmltidy"><img src="https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master" alt="Travis-CI Build Status"></a> <a href="https://ci.appveyor.com/project/hrbrmstr/htmltidy"><img src="https://ci.appveyor.com/api/projects/status/github/hrbrmstr/htmltidy?branch=master&amp;svg=true" alt="AppVeyor Build Status"></a> <a href="https://cran.r-project.org/package=htmltidy"><img src="http://www.r-pkg.org/badges/version/htmltidy" alt="CRAN_Status_Badge"></a> <img src="http://cranlogs.r-pkg.org/badges/grand-total/htmltidy" alt="downloads"></p>
<!-- README.md is generated from README.Rmd. Please edit that file -->
<p><code>htmltidy</code>: Tidy Up and Test XPath Queries on HTML and XML Content</p>
<p>Partly inspired by <a href="http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r">this SO question</a>, and partly by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.</p>
<p>It relies on a locally included version of <a href="http://www.html-tidy.org/"><code>libtidy</code></a> and works on macOS, Linux &amp; Windows.</p>
<p>It also incorporates an <code>htmlwidget</code> to view and test XPath queries on HTML/XML content.</p>
<p>The following functions are implemented:</p>
<ul>
<li>
<code>tidy_html</code>: Tidy or “Pretty Print” HTML/XHTML Documents</li>
<li>
<code>html_view</code>: HTML/XML pretty printer and viewer</li>
<li>
<code>xml_view</code>: HTML/XML pretty printer and viewer</li>
<li>
<code>html_tree_view</code>: HTML/XML tree viewer</li>
<li>
<code>xml_tree_view</code>: HTML/XML tree viewer</li>
</ul>
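<p>As a rough sketch of how the two viewer families might be called: the sample document below is made up for illustration, and the calls assume the viewers accept a parsed <code>xml2</code> document and are used with their default arguments.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)
library(xml2)

# hypothetical sample document, not shipped with the package
doc &lt;- read_xml("&lt;catalog&gt;&lt;book id='b1'&gt;&lt;title&gt;One&lt;/title&gt;&lt;/book&gt;&lt;book id='b2'&gt;&lt;title&gt;Two&lt;/title&gt;&lt;/book&gt;&lt;/catalog&gt;")

# the widgets render in a viewer pane or browser, so only open them interactively;
# print() is needed because values are not auto-printed inside if () { }
if (interactive()) {
  print(xml_view(doc))       # pretty-printed source view
  print(xml_tree_view(doc))  # collapsible tree view
}</code></pre></div>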
<h3 id="installation">Installation</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">devtools::<span class="kw">install_github</span>(<span class="st">"hrbrmstr/htmltidy"</span>)</code></pre></div>
<h3 id="usage">Usage</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(htmltidy)

<span class="co"># current version</span>
<span class="kw">packageVersion</span>(<span class="st">"htmltidy"</span>)
## [1] '0.3.0'

<span class="kw">library</span>(XML)
<span class="kw">library</span>(xml2)
<span class="kw">library</span>(httr)
<span class="kw">library</span>(purrr)</code></pre></div>
<p>This is really “un-tidy” content:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">res &lt;-<span class="st"> </span><span class="kw">GET</span>(<span class="st">"http://rud.is/test/untidy.html"</span>)
<span class="kw">cat</span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>))
## &lt;head&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt;
##
## as is this &lt;span id="sp"&gt;portion&lt;div&gt;</code></pre></div>
<p>Let’s see what <code><a href="reference/tidy_html.response.html">tidy_html()</a></code> does to it.</p>
<p>It can handle the <code>response</code> object directly:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(res, <span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>But you’ll probably use it mostly on HTML you’ve already identified as gnarly, when you have the text content handy:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>), <span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
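<p>Since the point of tidying is usually to make scraping easier, here is a minimal sketch of handing the tidied text to <code>xml2</code> and querying it. The inline <code>untidy</code> string is an assumed, shortened stand-in for the test page so the example needs no network access, and the XPath expression is only an illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)
library(xml2)

# assumed stand-in for the untidy test page
untidy &lt;- "&lt;body&gt;&lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt; as is this &lt;span id='sp'&gt;portion&lt;div&gt;"

# tidy first, with the same options used in the README examples
clean &lt;- tidy_html(untidy, list(TidyDocType="html5", TidyWrapLen=200))

# then parse the repaired markup and query it with XPath
doc &lt;- read_html(clean)
xml_text(xml_find_all(doc, "//span[@id='sp']"))</code></pre></div>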
<p>NOTE: you could also just have done:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">url</span>(<span class="st">"http://rud.is/test/untidy.html"</span>),
              <span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTMLas is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>You’ll see that this differs substantially from the mangling <code>libxml2</code> does (via <code>read_html()</code>):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pg &lt;-<span class="st"> </span><span class="kw">read_html</span>(<span class="st">"http://rud.is/test/untidy.html"</span>)
<span class="kw">cat</span>(<span class="kw">toString</span>(pg))
## &lt;?xml version="1.0" standalone="yes"?&gt;
## &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
## &lt;html&gt;&lt;head&gt;&lt;style&gt;&lt;![CDATA[
## body { font-family: sans-serif; }
## ]]&gt;&lt;/style&gt;&lt;/head&gt;&lt;body&gt;
## &lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt;
##
## as is this &lt;span id="sp"&gt;portion&lt;div/&gt;&lt;/span&gt;&lt;/b&gt;&lt;/body&gt;&lt;/html&gt;</code></pre></div>
<p>It can also deal with “raw” and parsed objects:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"raw"</span>))
## [1] 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 20 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77
## [39] 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74 6d 6c 22 3e 0a 3c 68 65 61 64 3e 0a 3c 6d 65 74 61 20 6e 61 6d
## [77] 65 3d 22 67 65 6e 65 72 61 74 6f 72 22 20 63 6f 6e 74 65 6e 74 3d 0a 22 48 54 4d 4c 20 54 69 64 79 20 66 6f 72 20
## [115] 48 54 4d 4c 35 20 66 6f 72 20 52 20 76 65 72 73 69 6f 6e 20 35 2e 30 2e 30 22 20 2f 3e 0a 3c 74 69 74 6c 65 3e 3c
## [153] 2f 74 69 74 6c 65 3e 0a 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 3e 0a 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e
## [191] 0a

<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>, <span class="dt">encoding=</span><span class="st">"UTF-8"</span>))
## [1] "&lt;!DOCTYPE html&gt;\n&lt;html xmlns=\"http://www.w3.org/1999/xhtml\"&gt;\n&lt;head&gt;\n&lt;meta name=\"generator\" content=\n\"HTML Tidy for HTML5 for R version 5.0.0\" /&gt;\n&lt;style&gt;\n&lt;![CDATA[\nbody { font-family: sans-serif; }\n]]&gt;\n&lt;/style&gt;\n&lt;title&gt;&lt;/title&gt;\n&lt;/head&gt;\n&lt;body&gt;\n&lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this\n&lt;span id=\"sp\"&gt;portion&lt;/span&gt;&lt;/b&gt;\n&lt;div&gt;&lt;span id=\"sp\"&gt;&lt;/span&gt;&lt;/div&gt;\n&lt;/body&gt;\n&lt;/html&gt;\n"

<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"parsed"</span>, <span class="dt">encoding=</span><span class="st">"UTF-8"</span>))
## {xml_document}
## &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
## [1] &lt;head&gt;\n &lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /&gt;\n &lt;meta name="generator" content ...
## [2] &lt;body&gt;\n&lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this\n&lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;\n&lt;/body&gt;

<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">htmlParse</span>(<span class="st">"http://rud.is/test/untidy.html"</span>))
## &lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt;
## &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## &lt;![CDATA[
## body { font-family: sans-serif; }
## ]]&gt;
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this
## &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;
## </code></pre></div>
<p>It can also show the markup errors:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">invisible</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">url</span>(<span class="st">"http://rud.is/test/untidy.html"</span>), <span class="dt">verbose=</span><span class="ot">TRUE</span>))
## line 1 column 1 - Warning: missing &lt;!DOCTYPE&gt; declaration
## line 1 column 68 - Warning: nested emphasis &lt;b&gt;
## line 1 column 138 - Warning: missing &lt;/span&gt; before &lt;div&gt;
## line 1 column 68 - Warning: missing &lt;/b&gt; before &lt;div&gt;
## line 1 column 164 - Warning: inserting implicit &lt;span&gt;
## line 1 column 164 - Warning: missing &lt;/span&gt;
## line 1 column 159 - Warning: missing &lt;/div&gt;
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: &lt;span&gt; anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!</code></pre></div>
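<p>The same check works on a literal string, which can be handy when narrowing down which fragment of a page is malformed. A small sketch with a made-up fragment:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)

# hypothetical fragment with an unclosed &lt;b&gt; and no closing &lt;/p&gt;
fragment &lt;- "&lt;p&gt;unclosed &lt;b&gt;bold"

# verbose=TRUE reports the markup problems; invisible() hides the tidied result
invisible(tidy_html(fragment, verbose=TRUE))</code></pre></div>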
<h3 id="testing-options">Testing Options</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">
opts &lt;-<span class="st"> </span><span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>,
             <span class="dt">TidyMakeClean=</span><span class="ot">TRUE</span>,
             <span class="dt">TidyHideComments=</span><span class="ot">TRUE</span>,
             <span class="dt">TidyIndentContent=</span><span class="ot">FALSE</span>,
             <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)

txt &lt;-<span class="st"> "&lt;html&gt;</span>
<span class="st">&lt;head&gt;</span>
<span class="st"> &lt;style&gt;</span>
<span class="st"> p { color: red; }</span>
<span class="st"> &lt;/style&gt;</span>
<span class="st"> &lt;body&gt;</span>
<span class="st"> &lt;!-- ===== body ====== --&gt;</span>
<span class="st"> &lt;p&gt;Test&lt;/p&gt;</span>

<span class="st"> &lt;/body&gt;</span>
<span class="st"> &lt;!--Default Zone</span>
<span class="st"> --&gt;</span>
<span class="st"> &lt;!--Default Zone End--&gt;</span>
<span class="st">&lt;/html&gt;"</span>

<span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(txt, <span class="dt">options=</span>opts))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## p { color: red; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;p&gt;Test&lt;/p&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
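<p>One way to see what a single option does is to flip it and compare the results. A minimal sketch using <code>TidyHideComments</code>, one of the options shown in this README; the input string and variable names are illustrative only:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)

# illustrative input containing one HTML comment
txt &lt;- "&lt;html&gt;&lt;head&gt;&lt;title&gt;t&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;!-- a comment --&gt;&lt;p&gt;Test&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;"

# same document, tidied with the option off and then on
kept &lt;- tidy_html(txt, list(TidyDocType="html5", TidyHideComments=FALSE))
hidden &lt;- tidy_html(txt, list(TidyDocType="html5", TidyHideComments=TRUE))

# check whether the comment text survived in each result
grepl("a comment", kept, fixed=TRUE)
grepl("a comment", hidden, fixed=TRUE)</code></pre></div>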
<p>But you’re probably better off running it on plain HTML source.</p>
<p>Since it’s C/C++-backed, it’s pretty fast:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">book &lt;-<span class="st"> </span><span class="kw">readLines</span>(<span class="st">"http://singlepageappbook.com/single-page.html"</span>)
<span class="kw">sum</span>(<span class="kw">map_int</span>(book, nchar))
## [1] 207501
<span class="kw">system.time</span>(tidy_book &lt;-<span class="st"> </span><span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(book))
## user system elapsed
## 0.021 0.001 0.022</code></pre></div>
<p>It usually takes between 20 and 25 milliseconds to process those 202 kilobytes of HTML. Not too shabby.</p>
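<p>A single <code>system.time()</code> call can be noisy. A rough way to get a range of timings is to repeat the measurement; this sketch uses synthetic input (an assumption, to avoid the download) and base R for the timing, so the numbers will not match the ones reported here:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)

# synthetic, mildly sloppy HTML (no closing &lt;/p&gt; tags), roughly 200 KB
html &lt;- paste(rep("&lt;p&gt;some &lt;b&gt;bold&lt;/b&gt; text", 8000), collapse="\n")
nchar(html)

# time ten runs and summarise the elapsed seconds
elapsed &lt;- replicate(10, system.time(tidy_html(html))[["elapsed"]])
summary(elapsed)</code></pre></div>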
<h3 id="code-of-conduct">Code of Conduct</h3>
<p>Please note that this project is released with a <a href="CONDUCT.md">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>

</div>
</div>

<footer>
<p>Built by <a href="http://hadley.github.io/pkgdown/">pkgdown</a>. Styled with <a href="http://getbootstrap.com">Bootstrap 3</a>.</p>
</footer>
</div>

</body>
</html>