Tools to Work with the 'Splash' JavaScript Rendering Service in R

---
title: "splashr Helper Functions and Data"
author: "Bob Rudis"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{splashr Helper Functions and Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Splash has a ton of features and `splashr` exposes many of them. The `render_` functions and DSL can return everything from simple, tiny JSON data to huge, nested `list` structures of complex objects.

Furthermore, web content mining can be tricky. Modern sites can present information in different ways depending on the type of browser or device you use, and many won't serve pages to "generic" browsers.

Finally, the Dockerized containers of Splash servers make it really easy to get started, but you may prefer an R console over the system command-line.

Let's see what extra goodies `splashr` provides to make our lives easier.

## Handling `splashr` Objects

One of the most powerful functions in `splashr` is `render_har()`. You get every component loaded by a dynamic web page, and some sites have upwards of 100 elements for any given page. How can you get to the bits that you want?

We'll use `render_har()` to demonstrate how to find the resources a site loads and use the data we gather to assess how "safe" these sites are, i.e., how many third-party javascript components they load and how safely they are loaded. Note that the code in this vignette assumes a Splash instance is running locally on your system.

We'll check <https://apple.com/> first since Apple claims to care about our privacy. If that's true, then they'll load little or no third-party content.

```{r eval=FALSE}
(apple <- render_har(url = "https://apple.com/", response_body = TRUE))
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 3.3.1
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 602.1
## --------HAR PAGES--------
## Page id: 1 , Page title: Apple
## --------HAR ENTRIES--------
## Number of entries: 84
## REQUESTS:
## Page: 1
## Number of entries: 84
##   - https://apple.com/
##   - https://www.apple.com/
##   - https://www.apple.com/ac/globalnav/4/en_US/styles/ac-globalnav.built.css
##   - https://www.apple.com/ac/localnav/4/styles/ac-localnav.built.css
##   - https://www.apple.com/ac/globalfooter/4/en_US/styles/ac-globalfooter.built.css
##      ........
##   - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
##   - https://www.apple.com/v/home/ea/images/heroes/iphone-xr/iphone_xr_5e40f_mediumtall.jpg
##   - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
##   - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
##   - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
```

The HAR output shows that when you visit `apple.com` your browser makes at least 84 requests for resources. We can see what types of content are loaded:

```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_chr(get_content_type) %>%
  table(dnn = "content_type") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 9 x 2
##   content_type                 n
##   <chr>                    <int>
## 1 font/woff2                  27
## 2 application/x-javascript    15
## 3 image/svg+xml               10
## 4 text/css                     9
## 5 image/jpeg                   7
## 6 image/png                    6
## 7 application/font-woff        4
## 8 text/html                    3
## 9 application/json             2
```

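
If you'd rather not pull in `broom` and `dplyr` just to count things, the same tally can be sketched in base R. `tally_values()` is a hypothetical helper (it is not part of `splashr`), and the input here is a made-up character vector standing in for what `purrr::map_chr(get_content_type)` returns:

```r
# Hypothetical helper: count how often each value occurs and sort the
# counts from most to least frequent.
tally_values <- function(x, name = "content_type") {
  counts <- as.data.frame(table(x), responseName = "n", stringsAsFactors = FALSE)
  names(counts)[1] <- name
  counts <- counts[order(-counts$n), , drop = FALSE]
  rownames(counts) <- NULL
  counts
}

# Made-up content types, standing in for real HAR data
types <- c("text/css", "font/woff2", "font/woff2", "image/png", "font/woff2")
tally_values(types)
```
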
Lots of calls to fonts, 15 javascript files and even 2 JSON files. Let's see what the domains are for these resources:

```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_chr(get_response_url) %>%
  purrr::map_chr(urltools::domain) %>%
  unique()
## [1] "apple.com" "www.apple.com" "securemetrics.apple.com"
```

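
Eyeballing three domains is easy, but longer lists benefit from an explicit check. Here's a minimal sketch that flags anything that is neither the site's own domain nor one of its subdomains. `is_third_party()` is a hypothetical helper, and plain suffix matching is a simplifying assumption: it knows nothing about CDNs or sister domains owned by the same organization.

```r
# Hypothetical helper: TRUE when `domain` is neither `first_party` itself
# nor one of its subdomains.
is_third_party <- function(domain, first_party) {
  domain <- tolower(domain)
  first_party <- tolower(first_party)
  !(domain == first_party | endsWith(domain, paste0(".", first_party)))
}

# The domains from the output above
domains <- c("apple.com", "www.apple.com", "securemetrics.apple.com")
domains[is_third_party(domains, "apple.com")]
```
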
Wow! Only calls to Apple-controlled resources.

I wonder what's in those JSON files, though:

```{r eval=FALSE}
har_entries(apple) %>%
  purrr::keep(is_json) %>%
  purrr::map(get_response_body, "text") %>%
  purrr::map(jsonlite::fromJSON) %>%
  str(3)
## List of 2
##  $ :List of 2
##   ..$ locale        :List of 3
##   .. ..$ country      : chr "us"
##   .. ..$ attr         : chr "en-US"
##   .. ..$ textDirection: chr "ltr"
##   ..$ localeswitcher:List of 7
##   .. ..$ name        : chr "localeswitcher"
##   .. ..$ metadata    : Named list()
##   .. ..$ displayIndex: int 1
##   .. ..$ copy        :List of 5
##   .. ..$ continue    :List of 5
##   .. ..$ exit        :List of 5
##   .. ..$ select      :List of 5
##  $ :List of 2
##   ..$ id     : chr "ad6ca319-1ef1-20da-c4e0-5185088996cb"
##   ..$ results:'data.frame': 2 obs. of 2 variables:
##   .. ..$ sectionName   : chr [1:2] "quickLinks" "suggestions"
##   .. ..$ sectionResults:List of 2
```

So, locale metadata and something to do with on-page links/suggestions.

As demonstrated, the `har_entries()` function makes it easy to get to the individual elements and we used the `is_json()` helper with `purrr` functions to slice and dice the structure at will. Here are all the `is_` functions you can use with HAR objects:

- `is_binary()`
- `is_content_type()`
- `is_css()`
- `is_gif()`
- `is_html()`
- `is_javascript()`
- `is_jpeg()`
- `is_json()`
- `is_plain()`
- `is_png()`
- `is_svg()`
- `is_xhr()`
- `is_xml()`

You can also use various `get_` helpers to avoid gnarly `$` or `[[]]` constructs:

- `get_body_size()` --- Retrieve size of content | body | headers
- `get_content_size()` --- Retrieve size of content | body | headers
- `get_content_type()` --- Retrieve or test content type of a HAR request object
- `get_headers()` --- Retrieve response headers as a data frame
- `get_headers_size()` --- Retrieve size of content | body | headers
- `get_request_type()` --- Retrieve or test request type
- `get_request_url()` --- Retrieve request URL
- `get_response_url()` --- Retrieve response URL
- `get_response_body()` --- Retrieve the body content of a HAR entry

We've already seen a few of them in action; here's another:

```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_dbl(get_body_size)
##  [1]      0  54521  95644  98069  43183   8689  19035 794210  66487 133730 311054  13850 199928 161859  90322 343189  19035
## [18] 794210  66487 133730    554    802   1002   1160   1694    264   1082   1661    390    416 108468 108828 100064 109728
## [35] 109412  99196 108856 109360 108048   8868  10648  10380  10476    137 311054  13850   3192   3253   4130   2027   1247
## [52]   1748    582 199928 109628 107832 109068 100632 108928  97812 108312 108716 107028  65220  73628  72188  72600  70400
## [69]  73928  72164  73012  71080   1185 161859  90322 343189      0    491  60166  58509  60166  58509  53281  53281
```

So, a visit to Apple's page transfers roughly 7.5 MB of content (the sum of the body sizes above) down to your browser.

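
Getting from that vector of byte counts to a single "how big is this page" number is just a sum and a unit conversion. A small sketch, where `total_mb()` is a hypothetical helper and 1 MB is taken to mean 1e6 bytes (an assumption; use `1024^2` if you prefer mebibytes). With real data you would pass it the result of the `purrr::map_dbl(get_body_size)` call:

```r
# Hypothetical helper: total a vector of byte counts and express it in
# megabytes (1 MB = 1e6 bytes unless told otherwise).
total_mb <- function(sizes, bytes_per_mb = 1e6) {
  sum(sizes, na.rm = TRUE) / bytes_per_mb
}

# The first five body sizes from the output above, for illustration
sizes <- c(0, 54521, 95644, 98069, 43183)
total_mb(sizes)
```
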

California also claims to care about your privacy, but is it _really_ true?

```{r eval=FALSE}
ca <- render_har(url = "https://www.ca.gov/", response_body = TRUE)

har_entries(ca) %>%
  purrr::map_chr(~.x$response$url %>% urltools::domain()) %>%
  unique()
##  [1] "www.ca.gov"                      "fonts.googleapis.com"   "california.azureedge.net"
##  [4] "portal-california.azureedge.net" "az416426.vo.msecnd.net" "fonts.gstatic.com"
##  [7] "ssl.google-analytics.com"        "cse.google.com"         "translate.google.com"
## [10] "api.stateentityprofile.ca.gov"   "translate.googleapis.com" "www.google.com"
## [13] "clients1.google.com"             "www.gstatic.com"        "platform.twitter.com"
## [16] "dc.services.visualstudio.com"
```

Yikes! It _sure_ doesn't look that way given all the folks they let track you when you visit their main page. Are they executing javascript from those sites?

```{r eval=FALSE}
## # A tibble: 8 x 2
##   dom                      type
##   <chr>                    <chr>
## 1 california.azureedge.net application/javascript
## 2 california.azureedge.net application/x-javascript
## 3 az416426.vo.msecnd.net   application/x-javascript
## 4 cse.google.com           text/javascript
## 5 translate.google.com     text/javascript
## 6 translate.googleapis.com text/javascript
## 7 www.google.com           text/javascript
## 8 platform.twitter.com     application/javascript
```

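
The chunk above shows only the result, not the code that made it. Here's one way a table like that could be built; treat it as a sketch, not necessarily the code that produced the output. `host_of()` and `javascript_sources()` are hypothetical helpers. The three accessor arguments default to the `splashr` helpers listed earlier and are parameters only so the logic can be tried on plain lists without a running Splash instance. With real data the call would be `javascript_sources(har_entries(ca))`, assuming the entries behave like an ordinary list.

```r
# Base-R stand-in for urltools::domain(): pull the host out of a URL.
# (Does not handle "user@host" style URLs.)
host_of <- function(url) {
  sub("^[a-z][a-z0-9+.-]*://([^/:?#]+).*$", "\\1", tolower(url))
}

# Sketch: distinct (domain, content type) pairs for javascript responses.
javascript_sources <- function(entries,
                               is_js   = splashr::is_javascript,
                               url_of  = splashr::get_response_url,
                               type_of = splashr::get_content_type) {
  js <- Filter(is_js, entries)  # keep only the javascript entries
  unique(data.frame(
    dom  = vapply(js, function(e) host_of(url_of(e)), character(1)),
    type = vapply(js, type_of, character(1)),
    stringsAsFactors = FALSE
  ))
}
```
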
We can also examine the response headers for signs of safety (i.e., whether there are content security policy headers or other security-oriented headers):

```{r eval=FALSE}
har_entries(ca) %>%
  purrr::map_df(get_headers) %>%
  dplyr::count(name, sort=TRUE) %>%
  print(n=50)
## # A tibble: 42 x 2
##    name                              n
##    <chr>                         <int>
##  1 date                            149
##  2 server                          148
##  3 content-type                    142
##  4 last-modified                   126
##  5 etag                            104
##  6 content-encoding                 83
##  7 access-control-allow-origin      78
##  8 accept-ranges                    74
##  9 vary                             69
## 10 content-length                   66
## 11 x-ms-ref                         57
## 12 x-ms-ref-originshield            57
## 13 access-control-expose-headers    56
## 14 content-md5                      51
## 15 x-ms-blob-type                   51
## 16 x-ms-lease-status                51
## 17 x-ms-request-id                  51
## 18 x-ms-version                     51
## 19 cache-control                    37
## 20 expires                          34
## 21 alt-svc                          30
## 22 x-xss-protection                 29
## 23 x-content-type-options           27
## 24 age                              22
## 25 transfer-encoding                20
## 26 timing-allow-origin              14
## 27 x-powered-by                     14
## 28 access-control-allow-headers      7
## 29 pragma                            6
## 30 request-context                   5
## 31 x-aspnet-version                  5
## 32 x-frame-options                   4
## 33 content-disposition               3
## 34 access-control-max-age            2
## 35 content-language                  2
## 36 p3p                               2
## 37 x-cache                           2
## 38 access-control-allow-methods      1
## 39 location                          1
## 40 set-cookie                        1
## 41 strict-transport-security         1
## 42 x-ms-session-id                   1
```

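
Scanning 42 rows by eye for what is _not_ there is error-prone, so a set difference helps. In this sketch `missing_security_headers()` is a hypothetical helper and the default `wanted` list is just one reasonable choice of security-oriented headers, not an authoritative checklist. Keep in mind that a header appearing once (as `strict-transport-security` does above) says nothing about how many responses carry it.

```r
# Hypothetical helper: which of the `wanted` headers never show up in
# `header_names`? HTTP header names are case-insensitive, so compare in
# lower case.
missing_security_headers <- function(
    header_names,
    wanted = c("content-security-policy", "strict-transport-security",
               "x-content-type-options", "x-frame-options")) {
  setdiff(tolower(wanted), tolower(header_names))
}

# A few of the header names from the tally above
seen <- c("date", "server", "x-xss-protection", "x-content-type-options",
          "x-frame-options", "strict-transport-security")
missing_security_headers(seen)
```
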
Unfortunately, as the content-type table showed, they do let Google and Twitter execute javascript, and none of the responses carries a `content-security-policy` header.

They seem to use quite a bit of Microsoft tech. Let's look at the HTTP servers they directly and indirectly rely on:

```{r eval=FALSE}
har_entries(ca) %>%
  purrr::map_chr(get_header_val, "server") %>%
  table(dnn = "server") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 14 x 2
##    server                                           n
##    <chr>                                        <int>
##  1 Apache                                          55
##  2 Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0    50
##  3 sffe                                            23
##  4 Microsoft-IIS/10.0                               7
##  5 ESF                                              3
##  6 HTTP server (unknown)                            2
##  7 ECAcc (bsa/EAD2)                                 1
##  8 ECD (sjc/16E0)                                   1
##  9 ECD (sjc/16EA)                                   1
## 10 ECD (sjc/16F4)                                   1
## 11 ECD (sjc/4E95)                                   1
## 12 ECD (sjc/4E9F)                                   1
## 13 ECS (bsa/EB1F)                                   1
## 14 gws                                              1
```

## Impersonating Other Browsers

The various `render_` functions present themselves as a modern WebKit Linux browser (which Splash is!). If you want more control, you need to go to the DSL to don a mask of your choosing. You may want to be precise and Bring Your Own User-agent string, but we've defined and exposed a few handy ones for you:

- `ua_splashr`
- `ua_win10_chrome`
- `ua_win10_firefox`
- `ua_win10_ie11`
- `ua_win7_chrome`
- `ua_win7_firefox`
- `ua_win7_ie11`
- `ua_macos_chrome`
- `ua_macos_safari`
- `ua_linux_chrome`
- `ua_linux_firefox`
- `ua_ios_safari`
- `ua_android_samsung`
- `ua_kindle`
- `ua_ps4`
- `ua_apple_tv`
- `ua_chromecast`

NOTE: These can be used with `curl`, `httr`, `rvest` and `RCurl` calls as well.

We can see it in action:

```{r eval=FALSE}
URL <- "https://httpbin.org/user-agent"

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go(URL) %>%
  splash_html() %>%
  xml2::xml_text() %>%  # the page text is the JSON that httpbin sends back
  jsonlite::fromJSON()
## $`user-agent`
## [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"
```

One more NOTE: It's good form to say who you really are when scraping. There are times when you have no choice but to wear a mask, but try to use your own user-agent that identifies who you are and what you're doing.

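
What an "honest" user-agent looks like is up to you. One common shape is a product token plus a contact address. The sketch below assumes that shape; `make_user_agent()` is a hypothetical helper, and the project name and address are placeholders. The resulting string can be handed to `splash_user_agent()` in place of one of the `ua_` constants.

```r
# Hypothetical helper: build a user-agent string that names the project and
# gives site owners a way to reach you.
make_user_agent <- function(project, version, contact) {
  stopifnot(nzchar(project), nzchar(version), nzchar(contact))
  sprintf("%s/%s (+%s)", project, version, contact)
}

# Placeholder values; substitute your own project and contact details
my_ua <- make_user_agent("my-research-crawler", "0.1.0", "mailto:me@example.org")
my_ua
```
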
## The `splashr` Docker Interface

Helping you get Docker and the R `docker` package up and running is beyond the scope of this package. If you do manage to work that out (in my experience, it's most gnarly on Windows), then we've got some helper functions to enable you to manage Splash Docker containers from within R.

- `install_splash()` --- Retrieve the Docker image for Splash
- `start_splash()` --- Start a Splash server Docker container
- `stop_splash()` --- Stop a running Splash server Docker container
- `killall_splash()` --- Prune all dead and running Splash Docker containers

`install_splash()` will pull the image locally for you. It takes a bit (the image size is around half a gigabyte at the time of this writing), and you can specify the `tag` you want if a newer image is produced before the package gets updated.

The best way to use start/stop is:

```{r eval=FALSE}
spi <- start_splash()
# ... scraping tasks ...
stop_splash(spi)
```

Now, if you're like me and totally forget you started Splash Docker containers, you can use the `killall_splash()` function, which will try to find them, stop/kill them, and remove them from your system. It doesn't remove the image, just running or stale containers.

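
If forgetting to stop containers is a recurring problem, the start/stop pair can be wrapped so the stop happens automatically, even when the scraping code errors out. `with_splash()` below is a hypothetical wrapper, not part of `splashr`. Its `start_fn` and `stop_fn` arguments default to the two functions described above and exist only so the pattern can be tried without Docker. Usage would look like `with_splash(function(spi) { ... scraping tasks ... })`.

```r
# Hypothetical wrapper: start a container, run `task` with it, and stop the
# container when `task` finishes or fails.
with_splash <- function(task,
                        start_fn = splashr::start_splash,
                        stop_fn  = splashr::stop_splash) {
  container <- start_fn()
  on.exit(stop_fn(container), add = TRUE)  # runs on normal exit and on error
  task(container)
}
```
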