Tools to Work with the 'Splash' JavaScript Rendering Service in R
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

382 lines
15 KiB

5 years ago
[![Travis-CI Build
Status](https://travis-ci.org/hrbrmstr/splashr.svg?branch=master)](https://travis-ci.org/hrbrmstr/splashr)
[![Coverage
Status](https://img.shields.io/codecov/c/github/hrbrmstr/splashr/master.svg)](https://codecov.io/github/hrbrmstr/splashr?branch=master)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/splashr)](https://cran.r-project.org/package=splashr)
[![](http://cranlogs.r-pkg.org/badges/splashr)](http://cran.rstudio.com/web/packages/splashr/index.html)
# `splashr` : Tools to Work with the ‘Splash’ JavaScript Rendering Service
TL;DR: This package works with Splash rendering servers which are really
just a REST API & `lua` scripting interface to a QT browser. It’s an
alternative to the Selenium ecosystem which was really engineered for
application testing & validation.
Sometimes, all you need is a page scrape after javascript has been
allowed to roam wild and free over meticulously crafted HTML tags. So,
this package does not do *everything* Selenium can in pure R (though,
the Lua interface is equally as powerful and accessible via R), but if
you’re just trying to get a page back that needs javascript rendering,
this is a nice, lightweight, consistent alternative.
It’s also an alternative to the somewhat abandoned `phantomjs` (which
you can use in R within or without a Selenium context as it’s it’s own
webdriver) and it may be useful to compare renderings between this
package & `phantomjs`.
The package uses the
[`stevedore`](https://github.com/richfitz/stevedore) package to
orchestrate Docker on your system (if you have Docker and more on how to
use the `stevedore` integration below) but you can also do get it
running in Docker on the command-line with two commands:
sudo docker pull scrapinghub/splash:latest --disable-browser-caches
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash:latest --disable-browser-caches
Do whatever you Windows ppl do with Docker on your systems to make ^^
work.
Folks super-new to Docker on Unix-ish platforms should [make sure to
do](https://github.com/hrbrmstr/splashr/issues/3#issuecomment-280686494):
sudo groupadd docker
sudo usermod -aG docker $USER
(`$USER` is your username and shld be defined for you in the
environment)
5 years ago
If using the [`stevedore`](https://github.com/richfitz/stevedore)
package you can use the convience wrappers in this pacakge:
install_splash()
splash_container <- start_splash()
and then run:
stop_splash(splash_container)
when done. All of that happens on your `localhost` and you will not need
to specify `splash_obj` to many of the `splashr` functions if you’re
running Splash in this default configuration as long as you use named
parameters. You can also use the pre-defined `splash_local` object if
you want to use positional parameters.
Now, you can run Selenium in Docker, so this is not unique to Splash.
But, a Docker context makes it so that you don’t have to run or maintain
icky Python stuff directly on your system. Leave it in the abandoned
warehouse district where it belongs.
5 years ago
All you need for this package to work is a running Splash instance. You
provide the host/port for it and it’s scrape-tastic fun from there\!
5 years ago
### About Splash
> ‘Splash’ <https://github.com/scrapinghub/splash> is a javascript
> rendering service. It’s a lightweight web browser with an ‘HTTP’ API,
> implemented in Python using ‘Twisted’and ’QT’ \[and provides some of
> the core functionality of the ‘RSelenium’ or ‘seleniumPipes’ R
> packages but with a Java-free footprint\]. The (twisted) ‘QT’ reactor
> is used to make the sever fully asynchronous allowing to take
> advantage of ‘webkit’ concurrency via QT main loop. Some of Splash
> features include the ability to process multiple webpages in parallel;
> retrieving HTML results and/or take screenshots; disabling images or
> use Adblock Plus rules to make rendering faster; executing custom
> JavaScript in page context; getting detailed rendering info in HAR
> format.
5 years ago
The following functions are implemented:
5 years ago
- `render_html`: Return the HTML of the javascript-rendered page.
- `render_har`: Return information about Splash interaction with a
website in [HAR](http://www.softwareishard.com/blog/har-12-spec/)
format.
- `render_jpeg`: Return a image (in JPEG format) of the
javascript-rendered page.
- `render_png`: Return a image (in PNG format) of the
javascript-rendered page.
- `execute_lua`: Execute a custom rendering script and return a
result.
- `splash`: Configure parameters for connecting to a Splash server
- `install_splash`: Retrieve the Docker image for Splash
- `start_splash`: Start a Splash server Docker container
- `stop_splash`: Stop a running a Splash server Docker container
Mini-DSL (domain-specific language). These can be used to create a
“script” without actually scripting in Lua. They are a
less-powerful/configurable set of calls than what you can make with a
full Lua function but the idea is to have it take care of very common
but simple use-cases, like waiting a period of time before capturing a
HAR/HTML/PNG image of a site:
- `splash_plugins`: Enable or disable browser plugins (e.g. Flash).
- `splash_click`: Trigger mouse click event in web page.
- `splash_focus`: Focus on a document element provided by a CSS
selector
- `splash_images`: Enable/disable images
- `splash_response_body`: Enable or disable response content tracking.
- `splash_go`: Go to an URL.
- `splash_wait`: Wait for a period time
- `splash_har`: Return information about Splash interaction with a
website in HAR format.
- `splash_html`: Return a HTML snapshot of a current page.
- `splash_png`: Return a screenshot of a current page in PNG format.
- `splash_press`: Trigger mouse press event in web page.
- `splash_release`: Trigger mouse release event in web page.
- `splash_send_keys`: Send keyboard events to page context.
- `splash_send_text`: Send text as input to page context, literally,
character by character.
- `splash_user_agent`: Overwrite the User-Agent header for all further
requests. NOTE: There are many “helper” user agent strings to go
with `splash_user_agent`. Look for objects in `splashr` starting
with `ua_`.
`httr` helpers. These help turn various bits of `splashr` objects into
`httr`-ish things:
- `as_req`: Turn a HAR response entry into a working `httr` function
you can use to make a request with
- `as_request`: Turn a HAR response entry into an `httr`
`response`-like object (i.e. you can use `httr::content()` on it)
Helpers:
- `tidy_har`: Turn a gnHARly HAR object into a tidy data frame
- `get_body_size`: Retrieve size of content | body | headers
- `get_content_sie`: Retrieve size of content | body | headers
- `get_content_type` Retrieve or test content type of a HAR request
object
- `get_headers_size` Retrieve size of content | body | headers
- `is_binary`: Retrieve or test content type of a HAR request object
- `is_content_type`: Retrieve or test content type of a HAR request
object
- `is_css`: Retrieve or test content type of a HAR request object
- `is_gif`: Retrieve or test content type of a HAR request object
- `is_html`: Retrieve or test content type of a HAR request object
- `is_javascript`: Retrieve or test content type of a HAR request
object
- `is_jpeg`: Retrieve or test content type of a HAR request object
- `is_json`: Retrieve or test content type of a HAR request object
- `is_plain`: Retrieve or test content type of a HAR request object
- `is_png`: Retrieve or test content type of a HAR request object
- `is_svg`: Retrieve or test content type of a HAR request object
- `is_xhr`: Retrieve or test content type of a HAR request object
- `is_xml`: Retrieve or test content type of a HAR request object
Some functions from `HARtools` are imported/exported and `%>%` is
imported/exported.
5 years ago
### TODO
Suggest more in a feature req\!
5 years ago
- <strike>Implement `render.json`</strike>
- <strike>Implement “file rendering”</strike>
- <strike>Implement `execute` (you can script Splash\!)</strike>
- <strike>Add integration with
[`HARtools`](https://github.com/johndharrison/HARtools)</strike>
- <strike>*Possibly* writing R function wrappers to install/start/stop
Splash</strike> which would also support enabling javascript
profiles, request filters and proxy profiles from with R directly,
using [`harbor`](https://github.com/wch/harbor)
- Re-implement `render_file()`
- Testing results with all combinations of parameters
5 years ago
5 years ago
### Installation
``` r
3 years ago
# CRAN
install.packages("splashr")
# DEV
# See DESCRIPTION for non-CINC-provided dependencies
install.packages("splashr", repos = c("https://cinc.rud.is/"))
4 years ago
```
5 years ago
### Usage
NOTE: ALL of these examples assume Splash is running in the default
configuraiton on `localhost` (i.e. started with `start_splash()` or the
docker example commands) unless otherwise noted.
5 years ago
``` r
library(splashr)
library(magick)
library(rvest)
library(anytime)
library(tidyverse)
5 years ago
# current verison
packageVersion("splashr")
```
## [1] '0.7.0'
5 years ago
``` r
splash_active()
5 years ago
```
## [1] TRUE
5 years ago
``` r
splash_debug()
5 years ago
```
## List of 7
## $ active : list()
## $ argcache: int 0
## $ fds : int 59
## $ leaks :List of 5
## ..$ Deferred : int 20
## ..$ DelayedCall: int 3
## ..$ LuaRuntime : int 1
## ..$ QTimer : int 1
## ..$ Request : int 1
## $ maxrss : int 219096
5 years ago
## $ qsize : int 0
## $ url : chr "http://localhost:8050"
5 years ago
## - attr(*, "class")= chr [1:2] "splash_debug" "list"
## NULL
Notice the difference between a rendered HTML scrape and a non-rendered
one:
5 years ago
``` r
render_html(url = "http://marvel.com/universe/Captain_America_(Steve_Rogers)")
5 years ago
```
## {html_document}
## <html lang="en">
## [1] <head class="at-element-marker">\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta char ...
## [2] <body class="terrigen">\n<div id="__next"><div id="terrigen-page" class="page"><div id="page-wrapper" class="page ...
5 years ago
``` r
xml2::read_html("http://marvel.com/universe/Captain_America_(Steve_Rogers)")
5 years ago
```
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="utf-8" class="next-he ...
## [2] <body class="terrigen">\n<div id="__next"><div id="terrigen-page" class="page"><div id="page-wrapper" class="page ...
5 years ago
You can also profile pages:
``` r
render_har(url = "http://www.poynter.org/") -> har
print(har)
```
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 3.4.1
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 602.1
## --------HAR PAGES--------
## Page id: 1 , Page title: Poynter - Poynter
## --------HAR ENTRIES--------
## Number of entries: 270
## REQUESTS:
## Page: 1
## Number of entries: 270
## - http://www.poynter.org/
5 years ago
## - https://www.poynter.org/
## - https://www.googletagservices.com/tag/js/gpt.js
## - https://www.poynter.org/wp-includes/js/wp-emoji-release.min.js?ver=5.3.2
## - https://www.poynter.org/wp-includes/css/dist/block-library/style.min.css?ver=5.3.2
5 years ago
## ........
## - https://www.poynter.org/wp-content/themes/elumine-child/assets/fonts/PoynterOSDisplay/PoynterOSDisplay-BoldItalic...
## - https://securepubads.g.doubleclick.net/gampad/ads?gdfp_req=1&pvsid=3405643238502098&correlator=142496486392385&ou...
## - https://securepubads.g.doubleclick.net/gpt/pubads_impl_rendering_2020021101.js
## - https://tpc.googlesyndication.com/safeframe/1-0-37/html/container.html
## - https://static.chartbeat.com/js/chartbeat.js
``` r
tidy_har(har)
```
## # A tibble: 270 x 20
## status started total_time page_ref timings req_url req_method req_http_version req_hdr_size req_headers req_cookies
## <int> <chr> <int> <chr> <I<lis> <chr> <chr> <chr> <int> <I<list>> <I<list>>
## 1 301 2020-0… 171 1 <tibbl http:/ GET HTTP/1.1 189 <tibble [2 <tibble [0
## 2 200 2020-0… 238 1 <tibbl https: GET HTTP/1.1 189 <tibble [2 <tibble [0
## 3 200 2020-0… 135 1 <tibbl https: GET HTTP/1.1 164 <tibble [3 <tibble [0
## 4 200 2020-0… 93 1 <tibbl https: GET HTTP/1.1 164 <tibble [3 <tibble [0
## 5 200 2020-0… 158 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## 6 200 2020-0… 218 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## 7 200 2020-0… 262 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## 8 200 2020-0… 265 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## 9 200 2020-0… 266 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## 10 200 2020-0… 269 1 <tibbl https: GET HTTP/1.1 179 <tibble [3 <tibble [0
## # … with 260 more rows, and 9 more variables: resp_url <chr>, resp_rdrurl <chr>, resp_type <chr>, resp_size <int>,
## # resp_cookies <I<list>>, resp_headers <I<list>>, resp_encoding <chr>, resp_content_size <dbl>, resp_content <chr>
You can use
[`HARtools::HARviewer`](https://github.com/johndharrison/HARtools/blob/master/R/HARviewer.R)
— which this pkg import/exports — to get view the HAR in an interactive
HTML widget.
Full web page snapshots are easy-peasy too:
5 years ago
``` r
render_png(url = "http://www.marveldirectory.com/individuals/c/captainamerica.htm")
5 years ago
```
![](img/cap.png)
``` r
render_jpeg(url = "http://static2.comicvine.com/uploads/scale_small/3/31666/5052983-capasr2016001-eptingvar-18bdb.jpg")
5 years ago
```
![](img/cap.jpg)
### Executing custom Lua scripts
``` r
lua_ex <- '
function main(splash)
splash:go("http://rud.is/b")
splash:wait(0.5)
local title = splash:evaljs("document.title")
return {title=title}
end
'
splash_local %>% execute_lua(lua_ex) -> res
rawToChar(res) %>%
jsonlite::fromJSON()
```
## $title
## [1] "rud.is | \"In God we trust. All others must bring data\""
### Interacting With Flash sites
``` r
splash_local %>%
splash_plugins(TRUE) %>%
splash_go("https://gis.cdc.gov/GRASP/Fluview/FluHospRates.html") %>%
splash_wait(4) %>%
splash_click(460, 550) %>%
splash_wait(2) %>%
splash_click(230, 85) %>%
splash_wait(2) %>%
splash_png()
```
<img src="img/flash.png" width="50%"/>
5 years ago
``` r
stop_splash(splash_vm)
```
5 years ago
### Code of Conduct
Please note that this project is released with a [Contributor Code of
Conduct](CONDUCT.md). By participating in this project you agree to
abide by its terms.