---
title: "splashr Helper Functions and Data"
author: "Bob Rudis"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{splashr Helper Functions and Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
Splash has a ton of features and `splashr` exposes many of them. The `render_` functions and DSL can return everything from simple, tiny JSON data to huge, nested `list` structures of complex objects.

Furthermore, web content mining can be tricky. Modern sites can present information in different ways depending on the type of browser or device you use and many won't serve pages to "generic" browsers.

Finally, the Dockerized containers of Splash servers make it really easy to get started, but you may prefer an R console over the system command-line.

Let's see what extra goodies `splashr` provides to make our lives easier.

## Handling `splashr` Objects
One of the most powerful functions in `splashr` is `render_har()`. You get back every component loaded by a dynamic web page, and some sites load upwards of 100 elements for any given page. How do you get to the bits you want?

We'll use `render_har()` to demonstrate how to find the resources a site loads and use the data we gather to assess how "safe" these sites are — i.e. how many third-party javascript components they load and how safely they are loaded. Note that the code in this vignette assumes a Splash instance is running locally on your system.

We'll check <https://apple.com/> first since Apple claims to care about our privacy. If that's true, they'll load little or no third-party content.
```{r eval=FALSE}
(apple <- render_har(url = "https://apple.com/", response_body = TRUE))
## --------HAR VERSION--------
## HAR specification version: 1.2
## --------HAR CREATOR--------
## Created by: Splash
## version: 3.3.1
## --------HAR BROWSER--------
## Browser: QWebKit
## version: 602.1
## --------HAR PAGES--------
## Page id: 1 , Page title: Apple
## --------HAR ENTRIES--------
## Number of entries: 84
## REQUESTS:
## Page: 1
## Number of entries: 84
## - https://apple.com/
## - https://www.apple.com/
## - https://www.apple.com/ac/globalnav/4/en_US/styles/ac-globalnav.built.css
## - https://www.apple.com/ac/localnav/4/styles/ac-localnav.built.css
## - https://www.apple.com/ac/globalfooter/4/en_US/styles/ac-globalfooter.built.css
## ........
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xr/iphone_xr_5e40f_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/iphone-xs/iphone_xs_0afef_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
## - https://www.apple.com/v/home/ea/images/heroes/macbook-air/macbook_air_mediumtall.jpg
```
The HAR output shows that when you visit `apple.com` your browser makes at least 84 requests for resources. We can see what types of content are loaded:
```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_chr(get_content_type) %>%
  table(dnn = "content_type") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 9 x 2
## content_type n
## <chr> <int>
## 1 font/woff2 27
## 2 application/x-javascript 15
## 3 image/svg+xml 10
## 4 text/css 9
## 5 image/jpeg 7
## 6 image/png 6
## 7 application/font-woff 4
## 8 text/html 3
## 9 application/json 2
```
Lots of calls to fonts, 15 javascript files and even 2 JSON files. Let's see what the domains are for these resources:
```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_chr(get_response_url) %>%
  purrr::map_chr(urltools::domain) %>%
  unique()
## [1] "apple.com" "www.apple.com" "securemetrics.apple.com"
```
Wow! Only calls to Apple-controlled resources.
I wonder what's in those JSON files, though:
```{r eval=FALSE}
har_entries(apple) %>%
  purrr::keep(is_json) %>%
  purrr::map(get_response_body, "text") %>%
  purrr::map(jsonlite::fromJSON) %>%
  str(3)
## List of 2
## $ :List of 2
## ..$ locale :List of 3
## .. ..$ country : chr "us"
## .. ..$ attr : chr "en-US"
## .. ..$ textDirection: chr "ltr"
## ..$ localeswitcher:List of 7
## .. ..$ name : chr "localeswitcher"
## .. ..$ metadata : Named list()
## .. ..$ displayIndex: int 1
## .. ..$ copy :List of 5
## .. ..$ continue :List of 5
## .. ..$ exit :List of 5
## .. ..$ select :List of 5
## $ :List of 2
## ..$ id : chr "ad6ca319-1ef1-20da-c4e0-5185088996cb"
## ..$ results:'data.frame': 2 obs. of 2 variables:
## .. ..$ sectionName : chr [1:2] "quickLinks" "suggestions"
## .. ..$ sectionResults:List of 2
```
So, locale metadata and something to do with on-page links/suggestions.
As demonstrated, the `har_entries()` function makes it easy to get at the individual elements, and we used the `is_json()` helper with `purrr` functions to slice and dice the structure at will. Here are all the `is_` functions you can use with HAR objects (a quick example follows the list):

- `is_binary()`
- `is_content_type()`
- `is_css()`
- `is_gif()`
- `is_html()`
- `is_javascript()`
- `is_jpeg()`
- `is_json()`
- `is_plain()`
- `is_png()`
- `is_svg()`
- `is_xhr()`
- `is_xml()`
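
For instance, here's a minimal sketch (reusing the `apple` HAR object from above) that keeps just the image entries and tallies them by content type:

```{r eval=FALSE}
# keep only the PNG/JPEG/SVG responses, then count each content type
har_entries(apple) %>%
  purrr::keep(~is_png(.x) || is_jpeg(.x) || is_svg(.x)) %>%
  purrr::map_chr(get_content_type) %>%
  table(dnn = "image_type")
```
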
You can also use various `get_` helpers to avoid gnarly `$` or `[[]]` constructs:

- `get_body_size()` --- Retrieve the size of the response body
- `get_content_size()` --- Retrieve the size of the response content
- `get_content_type()` --- Retrieve or test content type of a HAR request object
- `get_headers()` --- Retrieve response headers as a data frame
- `get_headers_size()` --- Retrieve the size of the response headers
- `get_request_type()` --- Retrieve or test request type
- `get_request_url()` --- Retrieve request URL
- `get_response_url()` --- Retrieve response URL
- `get_response_body()` --- Retrieve the body content of a HAR entry

We've seen one example of them already; here's another:
```{r eval=FALSE}
har_entries(apple) %>%
  purrr::map_dbl(get_body_size)
## [1] 0 54521 95644 98069 43183 8689 19035 794210 66487 133730 311054 13850 199928 161859 90322 343189 19035
## [18] 794210 66487 133730 554 802 1002 1160 1694 264 1082 1661 390 416 108468 108828 100064 109728
## [35] 109412 99196 108856 109360 108048 8868 10648 10380 10476 137 311054 13850 3192 3253 4130 2027 1247
## [52] 1748 582 199928 109628 107832 109068 100632 108928 97812 108312 108716 107028 65220 73628 72188 72600 70400
## [69] 73928 72164 73012 71080 1185 161859 90322 343189 0 491 60166 58509 60166 58509 53281 53281
```
So, a visit to Apple's page transfers nearly 8MB of content down to your browser.
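
That figure is just the sum of the vector above; here's a quick sanity-check sketch:

```{r eval=FALSE}
# total up the response body sizes and convert bytes to megabytes
total_bytes <- har_entries(apple) %>%
  purrr::map_dbl(get_body_size) %>%
  sum()

total_bytes / 1024^2  # works out to just under 8 MB for this crawl
```
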
California also claims to care about your privacy, but is it _really_ true?
```{r eval=FALSE}
ca <- render_har(url = "https://www.ca.gov/", response_body = TRUE)

har_entries(ca) %>%
  purrr::map_chr(~.x$response$url %>% urltools::domain()) %>%
  unique()
## [1] "www.ca.gov" "fonts.googleapis.com" "california.azureedge.net"
## [4] "portal-california.azureedge.net" "az416426.vo.msecnd.net" "fonts.gstatic.com"
## [7] "ssl.google-analytics.com" "cse.google.com" "translate.google.com"
## [10] "api.stateentityprofile.ca.gov" "translate.googleapis.com" "www.google.com"
## [13] "clients1.google.com" "www.gstatic.com" "platform.twitter.com"
## [16] "dc.services.visualstudio.com"
```
Yikes! It _sure_ doesn't look that way given all the folks they let track you when you visit their main page. Are they executing javascript from those sites?
```{r eval=FALSE}
# distinct domain + content-type pairs for the javascript entries
har_entries(ca) %>%
  purrr::keep(is_javascript) %>%
  purrr::map_df(~dplyr::tibble(
    dom  = urltools::domain(get_response_url(.x)),
    type = get_content_type(.x)
  )) %>%
  dplyr::distinct(dom, type)
## # A tibble: 8 x 2
## dom type
## <chr> <chr>
## 1 california.azureedge.net application/javascript
## 2 california.azureedge.net application/x-javascript
## 3 az416426.vo.msecnd.net application/x-javascript
## 4 cse.google.com text/javascript
## 5 translate.google.com text/javascript
## 6 translate.googleapis.com text/javascript
## 7 www.google.com text/javascript
## 8 platform.twitter.com application/javascript
```
We can also examine the response headers for signs of safety (i.e. are there content security policy headers or other security-oriented headers?):
```{r eval=FALSE}
har_entries(ca) %>%
  purrr::map_df(get_headers) %>%
  dplyr::count(name, sort = TRUE) %>%
  print(n = 50)
## # A tibble: 42 x 2
## name n
## <chr> <int>
## 1 date 149
## 2 server 148
## 3 content-type 142
## 4 last-modified 126
## 5 etag 104
## 6 content-encoding 83
## 7 access-control-allow-origin 78
## 8 accept-ranges 74
## 9 vary 69
## 10 content-length 66
## 11 x-ms-ref 57
## 12 x-ms-ref-originshield 57
## 13 access-control-expose-headers 56
## 14 content-md5 51
## 15 x-ms-blob-type 51
## 16 x-ms-lease-status 51
## 17 x-ms-request-id 51
## 18 x-ms-version 51
## 19 cache-control 37
## 20 expires 34
## 21 alt-svc 30
## 22 x-xss-protection 29
## 23 x-content-type-options 27
## 24 age 22
## 25 transfer-encoding 20
## 26 timing-allow-origin 14
## 27 x-powered-by 14
## 28 access-control-allow-headers 7
## 29 pragma 6
## 30 request-context 5
## 31 x-aspnet-version 5
## 32 x-frame-options 4
## 33 content-disposition 3
## 34 access-control-max-age 2
## 35 content-language 2
## 36 p3p 2
## 37 x-cache 2
## 38 access-control-allow-methods 1
## 39 location 1
## 40 set-cookie 1
## 41 strict-transport-security 1
## 42 x-ms-session-id 1
```
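
Only one `strict-transport-security` header in the entire crawl and no `content-security-policy` at all. To zero in on just the security-oriented headers, here's a small sketch (four of the header names below appear in the tally above; `content-security-policy` notably does not):

```{r eval=FALSE}
# tally just the security-oriented response headers
sec_headers <- c(
  "content-security-policy", "strict-transport-security",
  "x-content-type-options", "x-frame-options", "x-xss-protection"
)

har_entries(ca) %>%
  purrr::map_df(get_headers) %>%
  dplyr::filter(tolower(name) %in% sec_headers) %>%
  dplyr::count(name, sort = TRUE)
```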

Unfortunately, as we saw above, they do let Google and Twitter execute javascript.

They seem to use quite a bit of Microsoft tech. Let's look at the HTTP servers they directly and indirectly rely on:
```{r eval=FALSE}
har_entries(ca) %>%
  purrr::map_chr(get_header_val, "server") %>%
  table(dnn = "server") %>%
  broom::tidy() %>%
  dplyr::arrange(desc(n))
## # A tibble: 14 x 2
## server n
## <chr> <int>
## 1 Apache 55
## 2 Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 50
## 3 sffe 23
## 4 Microsoft-IIS/10.0 7
## 5 ESF 3
## 6 HTTP server (unknown) 2
## 7 ECAcc (bsa/EAD2) 1
## 8 ECD (sjc/16E0) 1
## 9 ECD (sjc/16EA) 1
## 10 ECD (sjc/16F4) 1
## 11 ECD (sjc/4E95) 1
## 12 ECD (sjc/4E9F) 1
## 13 ECS (bsa/EB1F) 1
## 14 gws 1
```

## Impersonating Other Browsers
The various `render_` functions present themselves as a modern WebKit Linux browser (which Splash is!). If you want more control, you need to go to the DSL to don a mask of your choosing. You may want to be precise and Bring Your Own User-agent string, but we've defined and exposed a few handy ones for you:

- `ua_splashr`
- `ua_win10_chrome`
- `ua_win10_firefox`
- `ua_win10_ie11`
- `ua_win7_chrome`
- `ua_win7_firefox`
- `ua_win7_ie11`
- `ua_macos_chrome`
- `ua_macos_safari`
- `ua_linux_chrome`
- `ua_linux_firefox`
- `ua_ios_safari`
- `ua_android_samsung`
- `ua_kindle`
- `ua_ps4`
- `ua_apple_tv`
- `ua_chromecast`

NOTE: These can be used with `curl`, `httr`, `rvest` and `RCurl` calls as well.
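
For instance, here's a minimal sketch of one of these strings in a plain `httr` request:

```{r eval=FALSE}
# send the macOS Chrome user-agent string with an ordinary httr request
httr::GET(
  "https://httpbin.org/user-agent",
  httr::user_agent(ua_macos_chrome)
)
```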

We can see the Splash DSL version in action:
```{r eval=FALSE}
URL <- "https://httpbin.org/user-agent"

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go(URL) %>%
  splash_html() %>%
  xml2::xml_find_first("//body") %>%
  xml2::xml_text() %>%
  jsonlite::fromJSON()
## $`user-agent`
## [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"
```
One more NOTE: It's good form to say who you really are when scraping. There are times when you have no choice but to wear a mask, but when you can, use a user-agent string that identifies who you are and what you're doing.
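
For example, a self-identifying string might look like the following (the project name and contact URL are placeholders, not a required convention):

```{r eval=FALSE}
splash_local %>%
  splash_response_body(TRUE) %>%
  # placeholder user-agent: swap in your own project name and contact URL
  splash_user_agent("my-research-bot/1.0 (+https://example.com/my-research-bot)") %>%
  splash_go(URL) %>%
  splash_html()
```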

## The `splashr` Docker Interface
Helping you get Docker and the R `docker` package up and running is beyond the scope of this package. If you do manage to work that out (in my experience, it's most gnarly on Windows), then we've got some helper functions to enable you to manage Splash Docker containers from within R:

- `install_splash()` --- Retrieve the Docker image for Splash
- `start_splash()` --- Start a Splash server Docker container
- `stop_splash()` --- Stop a running a Splash server Docker container
- `killall_splash()` --- Prune all dead and running Splash Docker containers
The `install_splash()` function will pull the image locally for you. It takes a bit (the image is around half a gigabyte at the time of this writing) and you can specify the `tag` you want if there's a newer image produced before the package gets updated.
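
In practice that looks something like this (the tag value shown is illustrative; use whatever tag the Splash image currently publishes):

```{r eval=FALSE}
install_splash()               # pull the default image
install_splash(tag = "latest") # or request a specific tag
```
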
The best way to use start/stop is:
```{r eval=FALSE}
spi <- start_splash()
# ... scraping tasks ...
stop_splash(spi)
```
Now, if you're like me and totally forget you started Splash Docker containers, you can use the `killall_splash()` function, which will try to find them, stop/kill them, and remove them from your system. It doesn't remove the image, just running or stale containers.