jericho

Break Down the Walls of 'HTML' Tags into Usable Text

boB Rudis 982b0e50d0 appveyor		7 years ago

jericho : Break Down the Walls of 'HTML' Tags into Usable Text

Structured 'HTML' content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain "just the text" from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the 'Jericho HTML Parser' Java library by Martin Jericho (http://jericho.htmlparser.net/docs/index.html). Martin's library is used in many at-scale projects, including the Internet Archive.

As a result of using a Java library, this package requires rJava.

The following functions are implemented:

html_to_text: Convert HTML to Text
render_html_to_text: Render HTML to Text

Installation

devtools::install_github("hrbrmstr/jericho")
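
Since the package depends on rJava, it can help to confirm that R can start a Java virtual machine before installing. The following is a minimal sketch, not part of this package: it assumes a JDK/JRE is already installed on the system and uses only the rJava functions `.jinit()` and `.jcall()`.

```r
# Assumption: a Java runtime is installed and discoverable by rJava.
if (!requireNamespace("rJava", quietly = TRUE)) {
  install.packages("rJava")
}

library(rJava)

# Start the JVM for this R session
.jinit()

# Ask the JVM which Java version it is running
java_version <- .jcall("java/lang/System", "S", "getProperty", "java.version")
print(java_version)
```

If `.jinit()` fails, the problem is most likely in the local Java or rJava setup rather than in jericho itself.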

Usage

Let's use this NASA blog post as an example.

library(jericho)

# current version
packageVersion("jericho")

## [1] '0.1.0'

URL <- "https://blogs.nasa.gov/spacestation/2017/09/02/touchdown-expedition-52-back-on-earth/"
  
doc <- paste0(readr::read_lines(URL), collapse="\n")

This is pure text extraction:

html_to_text(doc)

This provides a human-readable version of the content, modelled on the way Mozilla Thunderbird and other email clients automatically convert HTML content to text for the alternative MIME encoding of emails:

render_html_to_text(doc)

You should run each to see and compare the output (GitHub markdown documents aren't the best viewing medium).
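
For a quick offline comparison that does not depend on fetching the blog post, the same two functions can be run on a small inline document. This is only a sketch: the HTML string below is a made-up example, and the exact whitespace and layout of each result will depend on the underlying Jericho library, so no expected output is shown.

```r
library(jericho)

# Hypothetical sample document (not from the package or the blog post)
html <- paste0(
  "<html><body>",
  "<h1>Title</h1>",
  "<p>Hello <b>world</b>.</p>",
  "<ul><li>one</li><li>two</li></ul>",
  "</body></html>"
)

# Pure text extraction
plain <- html_to_text(html)

# Email-client-style rendering of the same markup
rendered <- render_html_to_text(html)

cat("--- html_to_text ---\n", plain, "\n", sep = "")
cat("--- render_html_to_text ---\n", rendered, "\n", sep = "")
```

Printing both results with `cat()` rather than `print()` shows the line breaks as they would appear in a plain-text document.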

Test Results

library(jericho)
library(testthat)

date()

## [1] "Mon Sep  4 13:06:20 2017"

test_dir("tests/")

## testthat results ========================================================================================================
## OK: 6 SKIPPED: 0 FAILED: 0
## 
## DONE ===================================================================================================================