Break Down the Walls of 'HTML' Tags into Usable Text
Latest commit: boB Rudis, 693279a385 (README), 5 years ago


jericho : Break Down the Walls of ‘HTML’ Tags into Usable Text

Structured ‘HTML’ content can be useful when you need to parse data tables or other tagged data from within a document. However, it is also useful to obtain “just the text” from a document, free from the walls of tags that surround it. Tools are provided that wrap methods in the ‘Jericho HTML Parser’ Java library by Martin Jericho http://jericho.htmlparser.net/docs/index.html. Martin’s library is used in many at-scale projects, including the Internet Archive.

As a result of using a Java library, this package requires rJava.

The following functions are implemented:

  • html_to_text: Convert HTML to Text
  • render_html_to_text: Render HTML to Text

Installation

If you install with devtools, it should pick up the Remotes: section in DESCRIPTION. Until the package is on CRAN, you may also want to install jerichojars explicitly, as shown below:

install.packages(c("jerichojars", "jericho"), repos = "https://cinc.rud.is/")

Usage

Let’s use this NASA blog post as an example.

library(jericho)

# current version
packageVersion("jericho")
## [1] '0.2.0'
URL <- "https://blogs.nasa.gov/spacestation/2017/09/02/touchdown-expedition-52-back-on-earth/"
  
doc <- paste0(readr::read_lines(URL), collapse = "\n")

This is pure text extraction:

html_to_text(doc)

This renders a human-readable version of the content, modelled on the way Mozilla Thunderbird and other email clients automatically convert HTML to text for the plain-text alternative MIME part of an email.

render_html_to_text(doc)

You should run each to see and compare the output (GitHub markdown documents aren’t the best viewing medium).
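For a quick side-by-side comparison without fetching a web page, here is a minimal sketch that runs both functions on a small inline HTML string. The `sample_html` document is a made-up illustration, not part of the package, and the sketch assumes both functions accept a single character string, as in the calls above. The exact output depends on the underlying Jericho library, so none is shown here.

```r
# Hypothetical inline HTML sample (illustration only, not from the package)
sample_html <- paste0(
  "<html><body><h1>Title</h1>",
  "<p>First <b>bold</b> paragraph.</p>",
  "<ul><li>one</li><li>two</li></ul></body></html>"
)

# Only call into jericho when it (and therefore rJava) can be loaded
if (requireNamespace("jericho", quietly = TRUE)) {
  plain <- jericho::html_to_text(sample_html)            # pure text extraction
  rendered <- jericho::render_html_to_text(sample_html)  # email-style rendering

  cat("--- html_to_text ---\n", plain, "\n", sep = "")
  cat("--- render_html_to_text ---\n", rendered, "\n", sep = "")
} else {
  message("jericho is not installed; skipping the comparison")
}
```

Printing with `cat()` rather than relying on auto-printing keeps any line breaks in the rendered output visible in the console.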

jericho Metrics

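| Lang  | # Files | (%)  | LoC | (%)  | Blank lines | (%)  | # Lines | (%)  |
|:------|--------:|-----:|----:|-----:|------------:|-----:|--------:|-----:|
| Java  |       2 | 0.18 |  49 | 0.38 |           9 | 0.19 |      14 | 0.13 |
| R     |       6 | 0.55 |  40 | 0.31 |          10 | 0.21 |      62 | 0.56 |
| Maven |       1 | 0.09 |  23 | 0.18 |           1 | 0.02 |       1 | 0.01 |
| Rmd   |       1 | 0.09 |   9 | 0.07 |          24 | 0.50 |      33 | 0.30 |
| make  |       1 | 0.09 |   8 | 0.06 |           4 | 0.08 |       0 | 0.00 |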