<!-- Generated by pkgdown: do not edit by hand -->
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Home. htmltidy</title>
<!-- jquery -->
<script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha384-nrOSfDHtoPMzJHjVTdCopGqIqeYETSXhZDFyniQ8ZHcVy08QesyHcnOUpMpqnmWq" crossorigin="anonymous"></script>
<!-- Bootstrap -->
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
<!-- Font Awesome icons -->
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" integrity="sha384-T8Gy5hrqNKT+hzMclPo118YTQO6cYprQmhrYwIiQ/3axmI1hQomh7Ud2hPOy8SP1" crossorigin="anonymous">
<!-- pkgdown -->
<link href="pkgdown.css" rel="stylesheet">
<script src="pkgdown.js"></script>
<!-- mathjax -->
<script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="container">
<header>
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">htmltidy</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="index.html">Home</a>
</li>
<li>
<a href="reference/index.html">Reference</a>
</li>
<li>
<a href="news/index.html">News</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li>
<a href="https://github.com/hrbrmstr/htmltidy">
<span class="fa fa-github fa-lg"></span>
</a>
</li>
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
</header>
<div class="row">
<div class="col-md-9">
<p><a href="https://travis-ci.org/hrbrmstr/htmltidy"><img src="https://travis-ci.org/hrbrmstr/htmltidy.svg?branch=master" alt="Travis-CI Build Status"></a> <a href="https://ci.appveyor.com/project/hrbrmstr/htmltidy"><img src="https://ci.appveyor.com/api/projects/status/github/hrbrmstr/htmltidy?branch=master&amp;svg=true" alt="AppVeyor Build Status"></a> <a href="https://cran.r-project.org/package=htmltidy"><img src="http://www.r-pkg.org/badges/version/htmltidy" alt="CRAN_Status_Badge"></a> <img src="http://cranlogs.r-pkg.org/badges/grand-total/htmltidy" alt="downloads"></p>
<!-- README.md is generated from README.Rmd. Please edit that file -->
<p><code>htmltidy</code> &mdash; Tidy Up and Test XPath Queries on HTML and XML Content</p>
<p>Partly inspired by <a href="http://stackoverflow.com/questions/37061873/identify-a-weblink-in-bold-in-r">this SO question</a>, and partly by the great deal of cruddy HTML out there that needs fixing before it can be used properly when scraping data.</p>
<p>It relies on a locally included version of <a href="http://www.html-tidy.org/"><code>libtidy</code></a> and works on macOS, Linux &amp; Windows.</p>
<p>It also incorporates an <code>htmlwidget</code> to view and test XPath queries on HTML/XML content.</p>
<p>The following functions are implemented:</p>
<ul>
<li>
<code>tidy_html</code>: Tidy or &ldquo;Pretty Print&rdquo; HTML/XHTML Documents</li>
<li>
<code>html_view</code>: HTML/XML pretty printer and viewer</li>
<li>
<code>xml_view</code>: HTML/XML pretty printer and viewer</li>
<li>
<code>html_tree_view</code>: HTML/XML tree viewer</li>
<li>
<code>xml_tree_view</code>: HTML/XML tree viewer</li>
</ul>
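<p>The viewer functions listed above can be sketched roughly as follows. This is a minimal illustration, assuming <code>xml_view()</code> and <code>xml_tree_view()</code> accept either a character string or a parsed <code>xml2</code> document, and that the <code>add_filter</code> and <code>apply_xpath</code> arguments behave as described on the reference pages. The sample document and the XPath expression are made up for the example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)
library(xml2)

# hypothetical sample document, not shipped with the package
txt &lt;- paste0(
  "&lt;note&gt;&lt;to&gt;Tove&lt;/to&gt;&lt;from&gt;Jani&lt;/from&gt;",
  "&lt;heading&gt;Reminder&lt;/heading&gt;",
  "&lt;body&gt;Don't forget me this weekend!&lt;/body&gt;&lt;/note&gt;"
)

# pretty-print the raw text in the viewer pane or browser
xml_view(txt)

# same document, parsed first, with the XPath filter box shown and pre-filled
# (add_filter and apply_xpath are assumed argument names)
doc &lt;- read_xml(txt)
w &lt;- xml_view(doc, add_filter = TRUE, apply_xpath = ".//to")
w

# collapsible tree rendering of the same document
xml_tree_view(doc)</code></pre></div>
<p>Each call returns an <code>htmlwidget</code>, so the result only renders in an interactive session, an R Markdown document, or a browser.</p>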
<h3 id="installation">Installation</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">devtools::<span class="kw">install_github</span>(<span class="st">"hrbrmstr/htmltidy"</span>)</code></pre></div>
<h3 id="usage">Usage</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(htmltidy)
<span class="co"># current version</span>
<span class="kw">packageVersion</span>(<span class="st">"htmltidy"</span>)
## [1] '0.3.0'
<span class="kw">library</span>(XML)
<span class="kw">library</span>(xml2)
<span class="kw">library</span>(httr)
<span class="kw">library</span>(purrr)</code></pre></div>
<p>This is really &ldquo;un-tidy&rdquo; content:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">res &lt;-<span class="st"> </span><span class="kw">GET</span>(<span class="st">"http://rud.is/test/untidy.html"</span>)
<span class="kw">cat</span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>))
## &lt;head&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt;
##
## as is this &lt;span id="sp"&gt;portion&lt;div&gt;</code></pre></div>
<p>Let&rsquo;s see what <code><a href="reference/tidy_html.response.html">tidy_html()</a></code> does to it.</p>
<p>It can handle the <code>response</code> object directly:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(res, <span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>But you&rsquo;ll probably mostly use it on HTML you&rsquo;ve already identified as gnarly, when you have the HTML text content handy:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>), <span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>NOTE: you could also just have done:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">url</span>(<span class="st">"http://rud.is/test/untidy.html"</span>),
<span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>, <span class="dt">TidyWrapLen=</span><span class="dv">200</span>)))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## body { font-family: sans-serif; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTMLas is this &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>You&rsquo;ll see that this differs substantially from the mangling <code>libxml2</code> does (via <code>read_html()</code>):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pg &lt;-<span class="st"> </span><span class="kw">read_html</span>(<span class="st">"http://rud.is/test/untidy.html"</span>)
<span class="kw">cat</span>(<span class="kw">toString</span>(pg))
## &lt;?xml version="1.0" standalone="yes"?&gt;
## &lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
## &lt;html&gt;&lt;head&gt;&lt;style&gt;&lt;![CDATA[
## body { font-family: sans-serif; }
## ]]&gt;&lt;/style&gt;&lt;/head&gt;&lt;body&gt;
## &lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt;
##
## as is this &lt;span id="sp"&gt;portion&lt;div/&gt;&lt;/span&gt;&lt;/b&gt;&lt;/body&gt;&lt;/html&gt;</code></pre></div>
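<p>For scraping, one possible workflow is to tidy first and parse second, so that XPath queries run against the repaired markup. The snippet below is only a sketch: the inline string is a shortened stand-in for the untidy page above, and the variable names are illustrative.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)
library(xml2)

# shortened stand-in for the untidy page used above
untidy &lt;- "&lt;b&gt;This is &lt;b&gt;some &lt;i&gt;really &lt;/i&gt; poorly formatted HTML&lt;/b&gt;\n\nas is this &lt;span id=\"sp\"&gt;portion&lt;div&gt;"

# repair the markup, then hand the cleaned text to xml2
tidied &lt;- tidy_html(untidy, list(TidyDocType = "html5", TidyWrapLen = 200))
doc &lt;- read_html(tidied)

# query the repaired document
italic &lt;- xml_find_all(doc, ".//i")
xml_text(italic)</code></pre></div>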
<p>It can also deal with &ldquo;raw&rdquo; and parsed objects:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"raw"</span>))
## [1] 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 20 78 6d 6c 6e 73 3d 22 68 74 74 70 3a 2f 2f 77 77
## [39] 77 2e 77 33 2e 6f 72 67 2f 31 39 39 39 2f 78 68 74 6d 6c 22 3e 0a 3c 68 65 61 64 3e 0a 3c 6d 65 74 61 20 6e 61 6d
## [77] 65 3d 22 67 65 6e 65 72 61 74 6f 72 22 20 63 6f 6e 74 65 6e 74 3d 0a 22 48 54 4d 4c 20 54 69 64 79 20 66 6f 72 20
## [115] 48 54 4d 4c 35 20 66 6f 72 20 52 20 76 65 72 73 69 6f 6e 20 35 2e 30 2e 30 22 20 2f 3e 0a 3c 74 69 74 6c 65 3e 3c
## [153] 2f 74 69 74 6c 65 3e 0a 3c 2f 68 65 61 64 3e 0a 3c 62 6f 64 79 3e 0a 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e
## [191] 0a
<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"text"</span>, <span class="dt">encoding=</span><span class="st">"UTF-8"</span>))
## [1] "&lt;!DOCTYPE html&gt;\n&lt;html xmlns=\"http://www.w3.org/1999/xhtml\"&gt;\n&lt;head&gt;\n&lt;meta name=\"generator\" content=\n\"HTML Tidy for HTML5 for R version 5.0.0\" /&gt;\n&lt;style&gt;\n&lt;![CDATA[\nbody { font-family: sans-serif; }\n]]&gt;\n&lt;/style&gt;\n&lt;title&gt;&lt;/title&gt;\n&lt;/head&gt;\n&lt;body&gt;\n&lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this\n&lt;span id=\"sp\"&gt;portion&lt;/span&gt;&lt;/b&gt;\n&lt;div&gt;&lt;span id=\"sp\"&gt;&lt;/span&gt;&lt;/div&gt;\n&lt;/body&gt;\n&lt;/html&gt;\n"
<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">content</span>(res, <span class="dt">as=</span><span class="st">"parsed"</span>, <span class="dt">encoding=</span><span class="st">"UTF-8"</span>))
## {xml_document}
## &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
## [1] &lt;head&gt;\n &lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /&gt;\n &lt;meta name="generator" content ...
## [2] &lt;body&gt;\n&lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this\n&lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;\n&lt;/body&gt;
<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">htmlParse</span>(<span class="st">"http://rud.is/test/untidy.html"</span>))
## &lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt;
## &lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## &lt;![CDATA[
## body { font-family: sans-serif; }
## ]]&gt;
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;b&gt;This is some &lt;i&gt;really&lt;/i&gt; poorly formatted HTML as is this
## &lt;span id="sp"&gt;portion&lt;/span&gt;&lt;/b&gt;
## &lt;div&gt;&lt;span id="sp"&gt;&lt;/span&gt;&lt;/div&gt;
## &lt;/body&gt;
## &lt;/html&gt;
## </code></pre></div>
<p>It can also show the markup errors:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">invisible</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(<span class="kw">url</span>(<span class="st">"http://rud.is/test/untidy.html"</span>), <span class="dt">verbose=</span><span class="ot">TRUE</span>))
## line 1 column 1 - Warning: missing &lt;!DOCTYPE&gt; declaration
## line 1 column 68 - Warning: nested emphasis &lt;b&gt;
## line 1 column 138 - Warning: missing &lt;/span&gt; before &lt;div&gt;
## line 1 column 68 - Warning: missing &lt;/b&gt; before &lt;div&gt;
## line 1 column 164 - Warning: inserting implicit &lt;span&gt;
## line 1 column 164 - Warning: missing &lt;/span&gt;
## line 1 column 159 - Warning: missing &lt;/div&gt;
## line 1 column 1 - Warning: inserting missing 'title' element
## line 1 column 164 - Warning: &lt;span&gt; anchor "sp" already defined
## Info: Document content looks like XHTML5
## Tidy found 9 warnings and 0 errors!</code></pre></div>
<h3 id="testing-options">Testing Options</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">
opts &lt;-<span class="st"> </span><span class="kw">list</span>(<span class="dt">TidyDocType=</span><span class="st">"html5"</span>,
<span class="dt">TidyMakeClean=</span><span class="ot">TRUE</span>,
<span class="dt">TidyHideComments=</span><span class="ot">TRUE</span>,
<span class="dt">TidyIndentContent=</span><span class="ot">FALSE</span>,
<span class="dt">TidyWrapLen=</span><span class="dv">200</span>)
txt &lt;-<span class="st"> "&lt;html&gt;</span>
<span class="st">&lt;head&gt;</span>
<span class="st"> &lt;style&gt;</span>
<span class="st"> p { color: red; }</span>
<span class="st"> &lt;/style&gt;</span>
<span class="st"> &lt;body&gt;</span>
<span class="st"> &lt;!-- ===== body ====== --&gt;</span>
<span class="st"> &lt;p&gt;Test&lt;/p&gt;</span>
<span class="st"> &lt;/body&gt;</span>
<span class="st"> &lt;!--Default Zone</span>
<span class="st"> --&gt;</span>
<span class="st"> &lt;!--Default Zone End--&gt;</span>
<span class="st">&lt;/html&gt;"</span>
<span class="kw">cat</span>(<span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(txt, <span class="dt">options=</span>opts))
## &lt;!DOCTYPE html&gt;
## &lt;html&gt;
## &lt;head&gt;
## &lt;meta name="generator" content="HTML Tidy for HTML5 for R version 5.0.0"&gt;
## &lt;style&gt;
## p { color: red; }
## &lt;/style&gt;
## &lt;title&gt;&lt;/title&gt;
## &lt;/head&gt;
## &lt;body&gt;
## &lt;p&gt;Test&lt;/p&gt;
## &lt;/body&gt;
## &lt;/html&gt;</code></pre></div>
<p>That said, you&rsquo;re probably better off running <code>tidy_html()</code> on plain HTML source text than on raw or parsed objects.</p>
<p>Since it&rsquo;s C/C++-backed, it&rsquo;s pretty fast:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">book &lt;-<span class="st"> </span><span class="kw">readLines</span>(<span class="st">"http://singlepageappbook.com/single-page.html"</span>)
<span class="kw">sum</span>(<span class="kw">map_int</span>(book, nchar))
## [1] 207501
<span class="kw">system.time</span>(tidy_book &lt;-<span class="st"> </span><span class="kw"><a href="reference/tidy_html.response.html">tidy_html</a></span>(book))
## user system elapsed
## 0.021 0.001 0.022</code></pre></div>
<p>(It usually takes between 20 &amp; 25 milliseconds to process those 202 kilobytes of HTML.) Not too shabby.</p>
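<p>A single <code>system.time()</code> call can be noisy, so a rough way to check the timing on your own machine is to repeat the run and look at the median. The sketch below uses a synthetic document built locally (an assumption made to avoid the network fetch), so its numbers will not match the figures above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(htmltidy)

# synthetic, slightly sloppy document: the paragraphs are never closed
chunk &lt;- "&lt;p&gt;Some &lt;b&gt;bold&lt;/b&gt; and &lt;i&gt;italic&lt;/i&gt; text"
html &lt;- paste0(
  "&lt;html&gt;&lt;body&gt;",
  paste(rep(chunk, 2000), collapse = "\n"),
  "&lt;/body&gt;&lt;/html&gt;"
)
nchar(html)

# time ten runs and summarise the elapsed seconds
timings &lt;- replicate(10, system.time(tidy_html(html))[["elapsed"]])
median(timings)</code></pre></div>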
<h3 id="code-of-conduct">Code of Conduct</h3>
<p>Please note that this project is released with a <a href="CONDUCT.md">Contributor Code of Conduct</a>. By participating in this project you agree to abide by its terms.</p>
</div>
</div>
<footer>
<p>Built by <a href="http://hadley.github.io/pkgdown/">pkgdown</a>. Styled with <a href="http://getbootstrap.com">Bootstrap 3</a>.</p>
</footer>
</div>
</body>
</html>