
---
title: "Working With the splashr DSL"
author: "Bob Rudis"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Working With the splashr DSL}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The introductory vignette provided a glimpse into the high-level `render_` functions of `splashr`. Those map directly to the high-level HTTP API calls in the Splash API. Underneath the simplified API lies a powerful [Lua](https://www.lua.org/) scripting interface that can control and query the HTML DOM and return objects of varying complexity.

A different vignette will cover the Lua interface, but it's bad enough that we all need to know a bit of JavaScript (JS) and CSS and HTML and XML and XPath (etc.) to get access to some gnarly web site content. Most of us really don't want to delve into the syntax of yet another programming language. To make it easier to work at a more detailed level without learning Lua directly, `splashr` provides a pipe-able domain-specific language (DSL) that lets you use R functions to covertly build Lua scripts.

## Using the `splashr` DSL

When would you need to have this level of control? Well, say you wanted to scrape a page that requires you to go to a start page first to set up a session. That means you want to hit two URLs in succession, likely after some pause. We can pretend that the <https://analytics.usa.gov/> site has this requirement to illustrate how we'd move from one page to another using the `splashr` DSL (remember, there is an inherent assumption that you've got a Splash instance running on your local system for these vignette code samples):
```
library(splashr)

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://analytics.usa.gov/") %>%
  splash_wait(5) %>%
  splash_go("https://analytics.usa.gov/agriculture/") %>%
  splash_wait() %>%
  splash_png() -> agri_png
```
Before showing the page image, let's walk through that chained function call. We:

- started with the built-in object representing a local Splash instance
- told the DSL we want to get content back and not just page resource metadata
- told the Splash browser to impersonate a macOS Chrome browser (that matters for some sites)
- went to our example URL
- paused for a bit
- shifted over to another URL on that site
- paused for a tinier bit
- took a screenshot

Up until `splash_png()`, the chained functions were just collecting instructions that are eventually transcoded into a Lua script. The call to `splash_png()` triggers that transcoding, sends the commands over to the Splash instance and waits for content to come back.

Here's the result:
```
agri_png
##   format width height colorspace filesize
## 1    PNG  1024   2761       sRGB   532615
```

<img width="100%" style="max-width:100%" src="figures/splashr02.png"/>
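The collect-then-transcode behavior described above can be illustrated with a toy version in plain R. This is only a sketch of the general idea and not `splashr`'s actual internals: the `toy_*` helpers are hypothetical names invented for this example, the default pause is arbitrary, nothing is sent to a Splash server, and the native `|>` pipe assumes R 4.1 or later.

```r
# Toy illustration only: the toy_* helpers are hypothetical and are NOT
# how splashr is actually implemented.

# Start an empty "recipe": a list holding the Lua statements collected so far.
toy_splash <- function() list(calls = character())

# Each DSL verb appends one Lua statement and returns the updated recipe,
# which is what makes the verbs pipe-able.
toy_go <- function(obj, url) {
  obj$calls <- c(obj$calls, sprintf('splash:go("%s")', url))
  obj
}

toy_wait <- function(obj, seconds = 2) {  # default chosen arbitrarily
  obj$calls <- c(obj$calls, sprintf("splash:wait(%s)", format(seconds)))
  obj
}

# The terminal verb transcodes everything collected so far into one Lua
# script. A real client would send this script to the Splash server here.
toy_png <- function(obj) {
  body <- paste0("  ", c(obj$calls, "return splash:png()"), collapse = "\n")
  paste("function main(splash)", body, "end", sep = "\n")
}

lua_script <- toy_splash() |>
  toy_go("https://analytics.usa.gov/") |>
  toy_wait(5) |>
  toy_png()

cat(lua_script, "\n")
```

Nothing resembling a Lua script exists until `toy_png()` runs; the earlier verbs only grow a character vector. That is the same shape as the real chain, where the work is deferred until the final call.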
We can even interact a bit with a site using this mid-level DSL. Let's fill in a form field on Wikipedia and see the result.

```
splash_local %>%
  splash_go("https://en.wikipedia.org/wiki/Main_Page") %>%
  splash_focus("#searchInput") %>%
  splash_send_text("maine") %>%
  splash_send_keys("<Return>") %>%
  splash_wait() %>%
  splash_png() -> wiki_png

wiki_png
##   format width height colorspace filesize
## 1    PNG  1024  23042       sRGB  8517828
```
<img width="100%" style="max-width:100%" src="figures/splashr03.png"/>

(I chopped off that page result as it scrolls for 8MB worth of PNG content, and the CRAN folks would not appreciate us taking up that much space for this vignette.)
## With Great Power...

...comes time and effort to learn yet another Shiny New Thing.

However, if you're willing to spend some time to learn some Lua (it's not that bad, really), you can use `splashr` to return actual data from a site vs. HTML you have to parse. Let's pull three specific pieces of data from one of the sub-pages of the analytics site we've been scraping:
```
splash_local %>%
  execute_lua('
function main(splash)
  splash:go("https://analytics.usa.gov/postal-service/")
  splash:wait(5)
  local title = splash:evaljs("document.title")
  local ppl = splash:select("#current_visitors")
  local tot = splash:select("#total_visitors")
  return {
    title = title,
    curr_visits = ppl.text(),
    total_visits = tot.text()
  }
end
') %>%
  readBin("character") %>%
  jsonlite::fromJSON() %>%
  str()
## List of 3
##  $ total_visits: chr "581.5 million"
##  $ title       : chr "analytics.usa.gov | The US government's web traffic."
##  $ curr_visits : chr "14,750"
```
We don't have to do any DOM parsing on the R end to get specific bits of data from the page itself. That's pretty handy, and once you start making some simple Lua scripts, it gets easier. Note, too, that you can use (hopefully) familiar JS constructs to yank info from the DOM as well as Lua-specific methods.
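The values come back as plain strings, so a small amount of post-processing may still be needed before doing arithmetic on them. Here is a minimal base R sketch, using the two strings from the sample output above as stand-ins for a live response. The helper name `parse_count` is hypothetical, and it only understands the magnitude words listed in its lookup table.

```r
# Hypothetical helper (not part of splashr): convert strings such as
# "14,750" or "581.5 million" into plain numbers.
parse_count <- function(x) {
  # Magnitude words this sketch recognises; anything else yields NA
  scale <- c(thousand = 1e3, million = 1e6, billion = 1e9)
  x <- trimws(tolower(x))

  # Numeric part: drop a trailing word, then the thousands separators
  num <- as.numeric(gsub(",", "", sub("\\s*[a-z]+$", "", x)))

  # Whatever follows the digits is treated as the magnitude word
  word <- sub("^[0-9.,]+\\s*", "", x)
  mult <- ifelse(word == "", 1, scale[word])

  unname(num * mult)
}

# Values copied from the sample output; no Splash instance is contacted
counts <- parse_count(c("14,750", "581.5 million"))
print(counts)
```

Keeping the conversion on the R side leaves the Lua script short, though the same cleanup could equally be done in Lua before the table is returned.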
Don't hesitate to [file an issue](https://github.com/hrbrmstr/splashr/issues) if you'd like more of the lower-level Lua interface brought up to the `splashr` DSL level.