---
title: "Introduction to splashr"
author: "Bob Rudis"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to splashr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
Capturing information/content from internet resources can be a tricky endeavour. Along with the many legal and ethical issues, there are an increasing number of sites that render content dynamically, either through `XMLHttpRequests` (XHR) or on-page JavaScript (JS) rendering of in-page content. There are also many sites that make it difficult to fill in form data programmatically.
There are ways to capture these types of resources in R. One way is via the [`RSelenium`](https://CRAN.R-project.org/package=RSelenium) ecosystem of packages. Another is with packages such as [`webshot`](https://CRAN.R-project.org/package=webshot). One can also write custom [`phantomjs`](http://phantomjs.org/) scripts and post-process the HTML output.
The `splashr` package provides tooling around another web-scraping ecosystem: [Splash](https://scrapinghub.com/splash). A Splash environment is fundamentally a headless web browser based on the QT WebKit library. Unlike the Selenium ecosystem, Splash is not based on the [WebDriver](https://www.w3.org/TR/webdriver/) protocol, but has a custom HTTP API that provides both similar and different idioms for accessing and manipulating web content.
## Getting Started

Before you can use `splashr` you will need access to a Splash environment. You can either:
- [pay for instances](https://app.scrapinghub.com/account/signup/);
- [get a Splash server running locally by hand](https://github.com/scrapinghub/splash); or,
- use Splash in a [Docker](https://www.docker.com/) container.
The package and this document are going to steer you toward using Docker containers. Docker is free for macOS, Windows and Linux systems, plus most major cloud computing providers have support for Docker containers. If you don't have Docker installed, your first step should be to get Docker going and [verify your setup](https://docs.docker.com/get-started/).
Once you have Docker working, you can follow the [Splash installation guidance](https://splash.readthedocs.io/en/stable/install.html) to manually obtain, start and stop Splash Docker containers. _There must be a running, accessible Splash instance for `splashr` to work_.
If you're comfortable getting a working Python environment set up on your system, you can also use the Splash Docker helper functions that come with this package (a quick sketch follows below):

- `install_splash()` will perform the same operation as `docker pull ...`;
- `start_splash()` will perform the same operation as `docker run ...`; and,
- `stop_splash()` will stop and remove the container object returned by `start_splash()`.

Follow the vignettes in the [`docker`](https://CRAN.R-project.org/package=docker) package to get the `docker` package up and running.
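Putting those helpers together, here's a minimal sketch of a local-container session (it assumes Docker is installed and running, and that the `docker` package is functional):

```
library(splashr)

install_splash()                     # one-time: pulls the Splash Docker image
splash_container <- start_splash()   # runs the container; returns a container object

# ... scraping work happens here ...

stop_splash(splash_container)        # stops & removes the container when done
```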
The remainder of this document assumes that you have a Splash instance up and running on your localhost.
## Scraping Basics --- `render_` functions

Splash (and, hence, `splashr`) has a feature-rich API that ranges from quick-and-easy to complex-detailed-and-powerful. We'll start with some easy basics. First, make sure Splash is running:
```
library(splashr)

splash_active()
## Status of splash instance on [http://localhost:8050]: ok. Max RSS: 74.42578 Mb
##
## [1] TRUE
```
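By default, `splash_active()` (like the other functions shown below) talks to an implicit local connection object. If your Splash container lives elsewhere, you can pass an explicit connection; a quick sketch (the hostname here is hypothetical):

```
# Check a remote Splash instance instead of the local default;
# "splash.example.com" is a made-up host for illustration:
splash_active(splash("splash.example.com", port = 8050))
```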
The first action we'll perform may surprise you. We're going to take a screenshot of the <https://analytics.usa.gov/> site. Why that site? First, the Terms of Service allow for scraping. Second, it has a great deal of dynamic content. And, third, we can validate our scraping findings with a direct data download (which will be an exercise left to the reader).

Enough words. Let's see what this site looks like!
```
library(magick)

render_png(url = "https://analytics.usa.gov/", wait = 5)
##   format width height colorspace filesize
## 1    PNG  1024   2761       sRGB   531597
```
<img style="max-width:100%" width="100%" src="figures/splashr01.png"/>
Let's decompose what we just did:

1. We called the `render_png()` function. The job of this function is to --- by default --- take a "screenshot" of the fully rendered page content at a specified URL.
1. We passed in the `url = ` parameter. The default first parameter is a `splashr` object created by `splash()`. However, since it's highly likely most folks will be running a Splash server locally with the default configuration, most `splashr` functions will use an implicit "`splash_local`" object if you're willing to use named parameters for all other parameter values.
1. We passed in a `wait = ` parameter, asking the Splash server to wait a few seconds to give the content time to render. This is an important consideration which we'll go into later in this document.
1. `splashr` passed on our command to the running Splash instance and the Splash server sent back a PNG file which the `splashr` package read in with the help of the `magick` package. If you're operating in RStudio you'll see the above image in the viewer. Alternatively, you can do:
```
image_browse(render_png(url = "https://analytics.usa.gov/", wait = 5))
```
to see the image if you're in another R environment. NOTE: web page screenshots can be captured in PNG or JPEG format by choosing the appropriate `render_` function.
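A JPEG capture is just a different `render_` call; a small sketch (same URL and wait as above):

```
# render_jpeg() mirrors render_png() but returns a JPEG capture:
img <- render_jpeg(url = "https://analytics.usa.gov/", wait = 5)
```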
Now that we've validated that we're getting the content we want, we can do something a bit more useful, like retrieve the HTML content of the page:
```
pg <- render_html(url = "https://analytics.usa.gov/")

pg
## {xml_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<!--\n\n Hi! Welcome to our source code.\n\n This d ...
## [2] <body>\n <!-- Google Tag Manager (noscript) -->\n<noscript>&lt;iframe src="https://www.googletagmanager.com/ns.html?id=GTM-MQSGZS"\ ...
```
The `render_html()` function behaves a great deal like the `xml2::read_html()` function except that it retrieves the current web page [HTML DOM](https://www.w3schools.com/js/js_htmldom.asp). What do we mean by that? Well, unlike `httr::GET()` or `xml2::read_html()`, the Splash environment is a bona-fide browser environment, just like Chrome, Safari or Firefox. It's always running (until you shut down the Splash server). That means any active JS on the page can be modifying the content (like ticking a time counter or updating stock prices, etc.). We didn't specify a `wait = ` delay this time, but it's generally a good idea to do that for very dynamic sites. This particular site seems to update the various tables and charts every 10 seconds to show "live" stats.
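Given that 10-second refresh, a hedged variant of the call above adds an explicit delay (five seconds is an arbitrary, usually-sufficient choice):

```
# Give the page's JS time to populate the dynamic tables before grabbing the DOM:
pg <- render_html(url = "https://analytics.usa.gov/", wait = 5)
```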
We can work with that `pg` content just like we would with `rvest` / `xml2`. Let's look at the visitor total from the past 90 days:
```
library(rvest)

html_text(html_nodes(pg, "span#total_visitors"))
## [1] "2.37 billion"
```
If we tried to read that value with plain, ol' `read_html()`, here's what we'd get:
```
pg2 <- read_html("https://analytics.usa.gov/")

html_text(html_nodes(pg2, "span#total_visitors"))
## [1] "..."
```
Not exactly helpful.

So, with just a small example, we've seen that it's pretty simple to pull dynamic content out of a web site with just a few more steps than `read_html()` requires.

But, we can do even more with these `render_` functions.
## Your Own Private 'Developer Tools'

Anyone performing scraping operations likely knows about each browser's "developer tools" environment. If you're not familiar with them you can get a quick primer [on their secrets](http://devtoolsecrets.com/) before continuing with this vignette.

The devtools inspector lets you see --- amongst other items --- network resources that were pulled down with the web page. So, while `read_html()` just gets the individual HTML file for a web site, its Splash devtools counterpart --- `render_har()` --- pulls down every image, JS file, CSS sheet, etc. that can be rendered in QT WebKit. We can see what the USA.Gov Analytics site is making us load with it:
```
har <- render_har(url = "https://analytics.usa.gov/")

har
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 3.0
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 602.1
## --------HAR PAGES--------
## Page id: 1 , Page title: analytics.usa.gov | The US government's web traffic.
## --------HAR ENTRIES--------
## Number of entries: 29
## REQUESTS:
## Page: 1
## Number of entries: 29
##   - https://analytics.usa.gov/
##   - https://analytics.usa.gov/css/vendor/css/uswds.v0.9.1.css
##   - https://analytics.usa.gov/css/public_analytics.css
##   - https://analytics.usa.gov/js/vendor/d3.v3.min.js
##   - https://analytics.usa.gov/js/vendor/q.min.js
##   ........
##   - https://analytics.usa.gov/data/live/top-downloads-yesterday.json
##   - https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-bold-webfont.woff2
##   - https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-regular-webfont.woff2
##   - https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-light-webfont.woff2
##   - https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-italic-webfont.woff2
```
  115. A "HAR" is an HTTP Archive and `splashr` works with the R [`hartools`](https://CRAN.R-project.org/package=HARtools) package to provide access to the elements loaded with a Splash QT WebKit page request. We can see all of them if we perform a manual inspection:
```
for (e in har$log$entries) cat(e$request$url, "\n")
## https://analytics.usa.gov/
## https://analytics.usa.gov/css/vendor/css/uswds.v0.9.1.css
## https://analytics.usa.gov/css/public_analytics.css
## https://analytics.usa.gov/js/vendor/d3.v3.min.js
## https://analytics.usa.gov/js/vendor/q.min.js
## https://analytics.usa.gov/css/google-fonts.css
## https://analytics.usa.gov/js/vendor/uswds.v0.9.1.js
## https://analytics.usa.gov/js/index.js
## https://www.googletagmanager.com/gtm.js?id=GTM-MQSGZS
## https://www.google-analytics.com/analytics.js
## https://analytics.usa.gov/css/img/arrow-down.svg
## https://analytics.usa.gov/data/live/realtime.json
## https://analytics.usa.gov/data/live/today.json
## https://analytics.usa.gov/data/live/devices.json
## https://analytics.usa.gov/data/live/browsers.json
## https://analytics.usa.gov/data/live/ie.json
## https://analytics.usa.gov/data/live/os.json
## https://analytics.usa.gov/data/live/windows.json
## https://analytics.usa.gov/data/live/top-cities-realtime.json
## https://analytics.usa.gov/data/live/top-countries-realtime.json
## https://analytics.usa.gov/data/live/top-countries-realtime.json
## https://analytics.usa.gov/data/live/top-pages-realtime.json
## https://analytics.usa.gov/data/live/top-domains-7-days.json
## https://analytics.usa.gov/data/live/top-domains-30-days.json
## https://analytics.usa.gov/data/live/top-downloads-yesterday.json
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-bold-webfont.woff2
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-regular-webfont.woff2
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-light-webfont.woff2
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-italic-webfont.woff2
```
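The same inspection reads a bit more cleanly functional-style; a small sketch assuming the `purrr` package is available:

```
library(purrr)

# Extract the request URL from every HAR entry as a character vector:
map_chr(har$log$entries, ~.x$request$url)
```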
With just a visual inspection, we can see there are JSON files being loaded at some point that likely contain some of the data we're after. The content for each of them can be made available to us in the HAR object if we specify the `response_body = TRUE` parameter:
```
har <- render_har(url = "https://analytics.usa.gov/", wait = 5, response_body = TRUE)

for (e in har$log$entries) {
  cat(sprintf("%s => [%s] is %s bytes\n",
              e$request$url, e$response$content$mimeType,
              scales::comma(e$response$content$size)))
}
## https://analytics.usa.gov/ => [text/html] is 19,718 bytes
## https://analytics.usa.gov/css/vendor/css/uswds.v0.9.1.css => [text/css] is 64,676 bytes
## https://analytics.usa.gov/css/public_analytics.css => [text/css] is 13,932 bytes
## https://analytics.usa.gov/js/vendor/d3.v3.min.js => [application/x-javascript] is 150,760 bytes
## https://analytics.usa.gov/js/vendor/q.min.js => [application/x-javascript] is 41,625 bytes
## https://analytics.usa.gov/css/google-fonts.css => [text/css] is 112,171 bytes
## https://analytics.usa.gov/js/vendor/uswds.v0.9.1.js => [application/x-javascript] is 741,447 bytes
## https://analytics.usa.gov/js/index.js => [application/x-javascript] is 29,868 bytes
## https://www.googletagmanager.com/gtm.js?id=GTM-MQSGZS => [] is 0 bytes
## https://www.google-analytics.com/analytics.js => [] is 0 bytes
## https://analytics.usa.gov/css/img/arrow-down.svg => [image/svg+xml] is 780 bytes
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-bold-webfont.woff2 => [font/woff2] is 23,368 bytes
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-regular-webfont.woff2 => [font/woff2] is 23,684 bytes
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-light-webfont.woff2 => [font/woff2] is 23,608 bytes
## https://analytics.usa.gov/css/vendor/fonts/sourcesanspro-italic-webfont.woff2 => [font/woff2] is 17,472 bytes
## https://analytics.usa.gov/data/live/realtime.json => [application/json] is 357 bytes
## https://analytics.usa.gov/data/live/today.json => [application/json] is 2,467 bytes
## https://analytics.usa.gov/data/live/devices.json => [application/json] is 625 bytes
## https://analytics.usa.gov/data/live/browsers.json => [application/json] is 4,697 bytes
## https://analytics.usa.gov/data/live/ie.json => [application/json] is 944 bytes
## https://analytics.usa.gov/data/live/os.json => [application/json] is 1,378 bytes
## https://analytics.usa.gov/data/live/windows.json => [application/json] is 978 bytes
## https://analytics.usa.gov/data/live/top-cities-realtime.json => [application/json] is 604,096 bytes
## https://analytics.usa.gov/data/live/top-countries-realtime.json => [application/json] is 15,179 bytes
## https://analytics.usa.gov/data/live/top-pages-realtime.json => [application/json] is 3,565 bytes
## https://analytics.usa.gov/data/live/top-domains-7-days.json => [application/json] is 1,979 bytes
## https://analytics.usa.gov/data/live/top-domains-30-days.json => [application/json] is 5,915 bytes
## https://analytics.usa.gov/data/live/top-downloads-yesterday.json => [application/json] is 25,751 bytes
```
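Rather than eyeballing that listing, you can also filter the entries programmatically. A hedged sketch (assuming `purrr`; note that blocked or zero-byte responses report an empty `mimeType`):

```
library(purrr)

# Keep only the entries whose responses came back as JSON:
json_entries <- keep(har_entries(har), ~.x$response$content$mimeType == "application/json")
length(json_entries)
```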
I happen to know that the `devices.json` file has the visitor counts, and we can retrieve it from the HAR object directly with some helpers:
```
har_entries(har)[[18]] %>%
  get_response_body("text") %>%
  jsonlite::fromJSON() %>%
  str()
## List of 5
##  $ name    : chr "devices"
##  $ query   :List of 8
##   ..$ start-date   : chr "90daysAgo"
##   ..$ end-date     : chr "yesterday"
##   ..$ dimensions   : chr "ga:date,ga:deviceCategory"
##   ..$ metrics      : chr "ga:sessions"
##   ..$ sort         : chr "ga:date"
##   ..$ start-index  : int 1
##   ..$ max-results  : int 10000
##   ..$ samplingLevel: chr "HIGHER_PRECISION"
##  $ meta    :List of 2
##   ..$ name       : chr "Devices"
##   ..$ description: chr "90 days of desktop/mobile/tablet visits for all sites."
##  $ totals  :List of 2
##   ..$ visits : num 2.37e+09
##   ..$ devices:List of 3
##   .. ..$ desktop: int 1303660363
##   .. ..$ mobile : int 924913139
##   .. ..$ tablet : int 137183761
##  $ taken_at: chr "2017-08-27T10:00:02.175Z"
```
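Once parsed, it's a plain R list; for example, the device totals from the structure above can be pulled out directly (a quick sketch reusing the same entry index):

```
# Re-parse the 18th entry and extract the 90-day desktop/mobile/tablet counts:
dev <- har_entries(har)[[18]] %>%
  get_response_body("text") %>%
  jsonlite::fromJSON()

unlist(dev$totals$devices)
```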
Now, if we wanted to make that request on our own, we could fiddle with the various `list` element details to build our own `httr` function, or we could make use of another helper to automagically build an `httr` function for us:
```
library(httr)

req <- as_httr_req(har_entries(har)[[18]])

req() %>%
  content(as = "parsed") %>%
  str()
## Output is the same as previous block
```
This is an example of the built `httr` function:

```
httr::VERB(verb = "GET", url = "https://analytics.usa.gov/data/live/devices.json",
           httr::add_headers(`User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1",
                             Accept = "application/json,*/*",
                             Referer = "https://analytics.usa.gov/"))
```
## The Full Monty

One final `render_` function is `render_json()`. Let's see what it does before explaining it:
```
json <- render_json(url = "https://analytics.usa.gov/", wait = 5, png = TRUE, response_body = TRUE)

str(json, 1)
## List of 10
##  $ frameName   : chr ""
##  $ requestedUrl: chr "https://analytics.usa.gov/"
##  $ geometry    :List of 4
##  $ png         : chr "iVBORw0KGgoAAAANSUhEUgAABAAAAAMACAYAAAC6uhUNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAgAElEQVR4AeydBZxUVRvGX7pTEBURBCWkVEQ"| __truncated__
##  $ html        : chr "<!DOCTYPE html><html lang=\"en\"><!-- Initalize title and data source variables --><head>\n <!--\n\n Hi! We"| __truncated__
##  $ title       : chr "analytics.usa.gov | The US government's web traffic."
##  $ history     :List of 1
##  $ url         : chr "https://analytics.usa.gov/"
##  $ childFrames : list()
##  $ har         :List of 1
##   ..- attr(*, "class")= chr [1:2] "har" "list"
##  - attr(*, "class")= chr [1:2] "splash_json" "list"
```
The function name corresponds to the [Splash HTTP API call](https://splash.readthedocs.io/en/stable/api.html). It actually returns JSON --- a JSON object holding pretty much everything associated with the page. Think of it as a one-stop-shop function if you want a screen shot, page content and HAR resources with just one call.
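The pieces of that object can be unpacked individually; for instance, a hedged sketch for decoding the embedded base64 screenshot (it assumes the `openssl` and `magick` packages are installed):

```
library(magick)

# The `png` element is a base64-encoded PNG; decode it into a magick image:
img <- image_read(openssl::base64_decode(json$png))
```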
You've now got plenty of scraping toys to play with to get a feel for how `splashr` works. Other vignettes cover the special domain-specific language (DSL) contained within `splashr` (giving you access to more powerful features of the Splash platform) and other helper functions that make it easier to work with `splashr`-returned objects.