Project Status: Active. The project has reached a stable, usable state and is being actively developed.

curlparse

Parse ‘URLs’ with ‘libcurl’

Description

Tools are provided to parse URLs using the modern ‘libcurl’ built-in parser.

NOTE

You need to have libcurl >= 7.62.0 for this to work since that’s when it began to expose the URL parsing API.
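A quick way to see which libcurl you have from within R is sketched below. It uses only base R (`libcurlVersion()` and `compareVersion()`); the variable names are illustrative. Note the assumption: this reports the libcurl that R itself is using, which is not necessarily the copy the package's configure script will find at build time.

```r
# Minimum libcurl release that exposes the URL parsing API (per the note above)
min_libcurl <- "7.62.0"

# libcurlVersion() is base R; as.character() drops the extra attributes it carries.
# Assumption: this is the libcurl R is using, which may differ from the one
# picked up when the package is compiled.
have <- as.character(libcurlVersion())

# compareVersion() returns -1, 0 or 1
meets_minimum <- compareVersion(have, min_libcurl) >= 0

message(
  "libcurl ", have,
  if (meets_minimum) " meets" else " is older than",
  " the ", min_libcurl, " minimum"
)
```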

macOS users can do:

$ brew install curl

(provided you’re using Homebrew).

Windows users are able to just install this package since it uses the same, clever “anticonf” that Jeroen uses in the {curl} package.

The state of the availability of libcurl v7.62.0 across Linux distributions is sketchy at best (as an example, Ubuntu bionic and cosmic are not even remotely at the current version). If your distribution does not have >= 7.62.0 available, you will need to compile and install it manually, ensuring the library and headers are available to R to build the package.

What’s Inside The Tin

The following functions are implemented:

  • is_valid_url: Test if a URL is a valid URL
  • parse_curl: Parse a character vector of URLs into component parts
  • scheme, user, password, host, domain, port, path, query, fragment: Extract individual member components from a URL string
  • url_parse: Parse a character vector of URLs into component parts (urltools compatibility function)
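As a minimal sketch of how these fit together, the snippet below calls each listed function on a tiny character vector. It assumes each function takes the URL vector as its only required argument, which matches how `parse_curl()`, `scheme()` and `url_parse()` are called later in this README; for `is_valid_url()` that calling convention is an assumption. The sample URLs are made up, and no output is shown since results depend on your libcurl build.

```r
# Two illustrative inputs: one well-formed URL and one plain string
urls <- c("https://example.com/path?x=1#top", "not a url")

# Guarded so the sketch is a no-op when the package is not installed
if (requireNamespace("curlparse", quietly = TRUE)) {
  # validity test for each input (calling convention assumed)
  print(curlparse::is_valid_url(urls))

  # all components at once, one row per URL
  print(curlparse::parse_curl(urls))

  # a single component as a vector
  print(curlparse::scheme(urls))

  # urltools-compatible variant returning a regular data frame
  print(curlparse::url_parse(urls))
}
```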

Installation

remotes::install_git("https://git.rud.is/hrbrmstr/curlparse.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/curlparse")
# or
remotes::install_gitlab("hrbrmstr/curlparse")
# or
remotes::install_bitbucket("hrbrmstr/curlparse")
# or
remotes::install_github("hrbrmstr/curlparse")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(curlparse)

# current version
packageVersion("curlparse")
## [1] '0.2.0'

Process Some URLs

library(urltools)
library(rvest)
library(curlparse)
library(tidyverse)
read_html("https://www.r-bloggers.com/blogs-list/") %>% 
  html_nodes(xpath=".//li[contains(., 'Contributing Blogs')]/ul/li/a[contains(@href, 'http')]") %>% 
  html_attr("href") -> blog_urls
(parsed <- parse_curl(blog_urls))
## # A tibble: 977 x 9
##    scheme user  password host                    port  path            options query fragment
##    <chr>  <chr> <chr>    <chr>                   <chr> <chr>           <chr>   <chr> <chr>   
##  1 http   <NA>  <NA>     reichlab.io             80    /               <NA>    <NA>  <NA>    
##  2 http   <NA>  <NA>     dmlc.ml                 80    /               <NA>    <NA>  <NA>    
##  3 https  <NA>  <NA>     lionel-.github.io       443   /               <NA>    <NA>  <NA>    
##  4 https  <NA>  <NA>     jean9208.github.io      443   /rss-R.xml      <NA>    <NA>  <NA>    
##  5 https  <NA>  <NA>     ryouready.wordpress.com 443   /               <NA>    <NA>  <NA>    
##  6 https  <NA>  <NA>     rveryday.wordpress.com  443   /               <NA>    <NA>  <NA>    
##  7 http   <NA>  <NA>     www.talyarkoni.org      80    /blog           <NA>    <NA>  <NA>    
##  8 https  <NA>  <NA>     rtricks.wordpress.com   443   /               <NA>    <NA>  <NA>    
##  9 https  <NA>  <NA>     xcafebabe.blogspot.com  443   /search/label/R <NA>    <NA>  <NA>    
## 10 http   <NA>  <NA>     4dpiecharts.com         80    /               <NA>    <NA>  <NA>    
## # … with 967 more rows

count(parsed, scheme, sort=TRUE)
## # A tibble: 2 x 2
##   scheme     n
##   <chr>  <int>
## 1 https    618
## 2 http     359

filter(parsed, !is.na(query))
## # A tibble: 7 x 9
##   scheme user  password host                   port  path               options query                           fragment
##   <chr>  <chr> <chr>    <chr>                  <chr> <chr>              <chr>   <chr>                           <chr>   
## 1 http   <NA>  <NA>     freakonometrics.blog.… 80    /index.php         <NA>    ""                              <NA>    
## 2 http   <NA>  <NA>     ludvigolsen.dk         80    /                  <NA>    lang=en                         <NA>    
## 3 https  <NA>  <NA>     medium.com             443   /principles-0/tag… <NA>    source=rss----489c2dec8959--r   <NA>    
## 4 https  <NA>  <NA>     medium.com             443   /tim-black/tagged… <NA>    source=rss----d71cf9ecf7ec--r   <NA>    
## 5 https  <NA>  <NA>     kevinkuang.net         443   /tagged/r-program… <NA>    source=rss----a1ff9aea4bf1--r_… <NA>    
## 6 https  <NA>  <NA>     medium.com             443   /@MattOldach_65321 <NA>    source=rss-459e62b88a2a------2  <NA>    
## 7 https  <NA>  <NA>     medium.com             443   /@zappingseb       <NA>    source=rss-dbc9f652035a------2  <NA>

Benchmark

curlparse includes a url_parse() function to make it easier for current users of urltools::url_parse() to switch to this package: it provides the same API and returns the same results (including a regular data frame rather than a tbl).

Spoiler alert: urltools::url_parse() is faster by ~100µs per 100 URLs for “good” URLs (with a mix of gnarly/bad URLs and valid ones, the two get closer to par). The aim was not to try to beat it, though.
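One way to spot-check that compatibility yourself is sketched here. It assumes both packages are installed; the two sample URLs are made up, and the comparison is left for you to run rather than asserted, since the exact values depend on the installed versions.

```r
# Illustrative inputs only
sample_urls <- c("http://example.com/", "https://example.com:8080/path?q=1#frag")

# Guarded so the sketch is a no-op unless both packages are available
if (requireNamespace("curlparse", quietly = TRUE) &&
    requireNamespace("urltools", quietly = TRUE)) {
  from_curlparse <- curlparse::url_parse(sample_urls)
  from_urltools  <- urltools::url_parse(sample_urls)

  # Is the result a plain data frame, and do the column names line up?
  print(class(from_curlparse))
  print(identical(names(from_curlparse), names(from_urltools)))

  # Cell-by-cell comparison; all.equal() lists any differences it finds
  print(all.equal(from_curlparse, from_urltools))
}
```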

Per the blog post introducing this new set of API calls:

Applications that pass in URLs to libcurl would of course still very often need to parse URLs, create URLs or otherwise handle them, but libcurl has not been helping with that.

At the same time, the under-specification of URLs has led to a situation where there’s really no stable document anywhere describing how URLs are supposed to work and basically every implementer is left to handle the WHATWG URL spec, RFC 3986 and the world in between all by themselves. Understanding how their URL parsing libraries, libcurl, other tools and their favorite browsers differ is complicated.

By offering applications access to libcurl’s own URL parser, we hope to tighten a problematic vulnerable area for applications where the URL parser library would believe one thing and libcurl another. This could and has sometimes lead to security problems. (See for example Exploiting URL Parser in Trending Programming Languages! by Orange Tsai)

So, using this library adds consistency with how libcurl sees and handles URLs.

library(microbenchmark)

set.seed(0)
test_urls <- sample(blog_urls, 100) # pick 100 URLs at random

microbenchmark(
  curlparse = curlparse::url_parse(test_urls),
  urltools = urltools::url_parse(test_urls), # we loaded urltools before curlparse at the top so namespace loading wasn't a factor for the benchmarks
  times = 500
) -> mb

mb
## Unit: microseconds
##       expr     min       lq     mean   median       uq      max neval cld
##  curlparse 794.686 836.9005 893.4829 878.3295 919.0675 4492.718   500   b
##   urltools 670.631 705.1925 764.8708 738.3700 782.3535 4834.614   500  a

autoplot(mb)
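For a sense of scale, the medians printed above can be turned into a per-URL cost with a little arithmetic. The numbers are copied from that one benchmark run, so treat the result as indicative only; in this run the median gap comes out somewhat larger than the rounded ~100µs figure quoted earlier. Variable names are illustrative.

```r
# Median timings in microseconds for one call on 100 URLs,
# copied from the benchmark table above
median_us <- c(curlparse = 878.3295, urltools = 738.3700)
n_urls <- 100

# Absolute gap per call, per URL, and relative to urltools
gap_us         <- unname(median_us["curlparse"] - median_us["urltools"])
gap_per_url_us <- gap_us / n_urls
relative_gap   <- gap_us / unname(median_us["urltools"])

round(gap_us, 1)          # 140 microseconds per 100 URLs
round(gap_per_url_us, 2)  # 1.4 microseconds per URL
round(relative_gap, 2)    # 0.19, i.e. roughly 19% slower at the median
```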

The individual handlers are closer to par but mostly still slower (except for fragment()). Note that urltools has no equivalent function to extract just the query string, so that’s not in the test.

bind_rows(
  microbenchmark(curlparse = curlparse::scheme(blog_urls), urltools = urltools::scheme(blog_urls)) %>%
    mutate(test = "scheme"),
  microbenchmark(curlparse = curlparse::domain(blog_urls), urltools = urltools::domain(blog_urls)) %>%
    mutate(test = "domain"),
  microbenchmark(curlparse = curlparse::port(blog_urls), urltools = urltools::port(blog_urls)) %>%
    mutate(test = "port"),
  microbenchmark(curlparse = curlparse::path(blog_urls), urltools = urltools::path(blog_urls)) %>%
    mutate(test = "path"),
  microbenchmark(curlparse = curlparse::fragment(blog_urls), urltools = urltools::fragment(blog_urls)) %>%
    mutate(test = "fragment")
) %>% 
  mutate(test = factor(test, levels=c("scheme", "domain", "port", "path", "fragment"))) %>% 
  mutate(time = time / 1000000) %>% 
  ggplot(aes(expr, time)) +
  geom_violin(aes(fill=expr), show.legend = FALSE) +
  scale_y_continuous(name = "milliseconds", expand = c(0,0), limits=c(0, NA)) +
  hrbrthemes::scale_fill_ft() +
  facet_wrap(~test, ncol = 1) +
  coord_flip() +
  labs(x=NULL) +
  hrbrthemes::theme_ft_rc(grid="XY", strip_text_face = "bold") +
  theme(panel.spacing.y=unit(0, "lines"))

Stress Test

c(
  "", "foo", "foo;params?query#fragment", "http://foo.com/path", "http://foo.com",
  "//foo.com/path", "//user:pass@foo.com/", "http://user:pass@foo.com/", 
  "file:///tmp/junk.txt", "imap://mail.python.org/mbox1",
  "mms://wms.sys.hinet.net/cts/Drama/09006251100.asf", "nfs://server/path/to/file.txt",
  "svn+ssh://svn.zope.org/repos/main/ZConfig/trunk/",
  "git+ssh://git@github.com/user/project.git", "HTTP://WWW.PYTHON.ORG/doc/#frag",
  "http://www.python.org:080/", "http://www.python.org:/", "javascript:console.log('hello')",
  "javascript:console.log('hello');console.log('world')", "http://example.com/?", 
  "http://example.com/;", "tel:0108202201", "unknown:0108202201",
  "http://user@example.com:8080/path;param?query#fragment", 
  "http://www.python.org:65536/", "http://www.python.org:-20/",
  "http://www.python.org:8589934592/", "http://www.python.org:80hello/", 
  "http://:::cnn.com/", "http://./", "http://foo..com/", "http://foo../"
) -> ugly_urls

(u_parsed <- parse_curl(ugly_urls))
## # A tibble: 32 x 9
##    scheme user  password host            port  path          options query fragment
##    <chr>  <chr> <chr>    <chr>           <chr> <chr>         <chr>   <chr> <chr>   
##  1 <NA>   <NA>  <NA>     <NA>            <NA>  <NA>          <NA>    <NA>  <NA>    
##  2 <NA>   <NA>  <NA>     <NA>            <NA>  <NA>          <NA>    <NA>  <NA>    
##  3 <NA>   <NA>  <NA>     <NA>            <NA>  <NA>          <NA>    <NA>  <NA>    
##  4 http   <NA>  <NA>     foo.com         80    /path         <NA>    <NA>  <NA>    
##  5 http   <NA>  <NA>     foo.com         80    /             <NA>    <NA>  <NA>    
##  6 <NA>   <NA>  <NA>     <NA>            <NA>  <NA>          <NA>    <NA>  <NA>    
##  7 <NA>   <NA>  <NA>     <NA>            <NA>  <NA>          <NA>    <NA>  <NA>    
##  8 http   user  pass     foo.com         80    /             <NA>    <NA>  <NA>    
##  9 file   <NA>  <NA>     <NA>            0     /tmp/junk.txt <NA>    <NA>  <NA>    
## 10 imap   <NA>  <NA>     mail.python.org 143   /mbox1        <NA>    <NA>  <NA>    
## # … with 22 more rows

filter(u_parsed, !is.na(scheme))
## # A tibble: 14 x 9
##    scheme user  password host            port  path          options query fragment
##    <chr>  <chr> <chr>    <chr>           <chr> <chr>         <chr>   <chr> <chr>   
##  1 http   <NA>  <NA>     foo.com         80    /path         <NA>    <NA>  <NA>    
##  2 http   <NA>  <NA>     foo.com         80    /             <NA>    <NA>  <NA>    
##  3 http   user  pass     foo.com         80    /             <NA>    <NA>  <NA>    
##  4 file   <NA>  <NA>     <NA>            0     /tmp/junk.txt <NA>    <NA>  <NA>    
##  5 imap   <NA>  <NA>     mail.python.org 143   /mbox1        <NA>    <NA>  <NA>    
##  6 http   <NA>  <NA>     WWW.PYTHON.ORG  80    /doc/         <NA>    <NA>  frag    
##  7 http   <NA>  <NA>     www.python.org  80    /             <NA>    <NA>  <NA>    
##  8 http   <NA>  <NA>     www.python.org  80    /             <NA>    <NA>  <NA>    
##  9 http   <NA>  <NA>     example.com     80    /             <NA>    ""    <NA>    
## 10 http   <NA>  <NA>     example.com     80    /;            <NA>    <NA>  <NA>    
## 11 http   user  <NA>     example.com     8080  /path;param   <NA>    query fragment
## 12 http   <NA>  <NA>     .               80    /             <NA>    <NA>  <NA>    
## 13 http   <NA>  <NA>     foo..com        80    /             <NA>    <NA>  <NA>    
## 14 http   <NA>  <NA>     foo..           80    /             <NA>    <NA>  <NA>

filter(u_parsed, !is.na(user))
## # A tibble: 2 x 9
##   scheme user  password host        port  path        options query fragment
##   <chr>  <chr> <chr>    <chr>       <chr> <chr>       <chr>   <chr> <chr>   
## 1 http   user  pass     foo.com     80    /           <NA>    <NA>  <NA>    
## 2 http   user  <NA>     example.com 8080  /path;param <NA>    query fragment

filter(u_parsed, !is.na(password))
## # A tibble: 1 x 9
##   scheme user  password host    port  path  options query fragment
##   <chr>  <chr> <chr>    <chr>   <chr> <chr> <chr>   <chr> <chr>   
## 1 http   user  pass     foo.com 80    /     <NA>    <NA>  <NA>

filter(u_parsed, !is.na(host))
## # A tibble: 13 x 9
##    scheme user  password host            port  path        options query fragment
##    <chr>  <chr> <chr>    <chr>           <chr> <chr>       <chr>   <chr> <chr>   
##  1 http   <NA>  <NA>     foo.com         80    /path       <NA>    <NA>  <NA>    
##  2 http   <NA>  <NA>     foo.com         80    /           <NA>    <NA>  <NA>    
##  3 http   user  pass     foo.com         80    /           <NA>    <NA>  <NA>    
##  4 imap   <NA>  <NA>     mail.python.org 143   /mbox1      <NA>    <NA>  <NA>    
##  5 http   <NA>  <NA>     WWW.PYTHON.ORG  80    /doc/       <NA>    <NA>  frag    
##  6 http   <NA>  <NA>     www.python.org  80    /           <NA>    <NA>  <NA>    
##  7 http   <NA>  <NA>     www.python.org  80    /           <NA>    <NA>  <NA>    
##  8 http   <NA>  <NA>     example.com     80    /           <NA>    ""    <NA>    
##  9 http   <NA>  <NA>     example.com     80    /;          <NA>    <NA>  <NA>    
## 10 http   user  <NA>     example.com     8080  /path;param <NA>    query fragment
## 11 http   <NA>  <NA>     .               80    /           <NA>    <NA>  <NA>    
## 12 http   <NA>  <NA>     foo..com        80    /           <NA>    <NA>  <NA>    
## 13 http   <NA>  <NA>     foo..           80    /           <NA>    <NA>  <NA>

filter(u_parsed, !is.na(path))
## # A tibble: 14 x 9
##    scheme user  password host            port  path          options query fragment
##    <chr>  <chr> <chr>    <chr>           <chr> <chr>         <chr>   <chr> <chr>   
##  1 http   <NA>  <NA>     foo.com         80    /path         <NA>    <NA>  <NA>    
##  2 http   <NA>  <NA>     foo.com         80    /             <NA>    <NA>  <NA>    
##  3 http   user  pass     foo.com         80    /             <NA>    <NA>  <NA>    
##  4 file   <NA>  <NA>     <NA>            0     /tmp/junk.txt <NA>    <NA>  <NA>    
##  5 imap   <NA>  <NA>     mail.python.org 143   /mbox1        <NA>    <NA>  <NA>    
##  6 http   <NA>  <NA>     WWW.PYTHON.ORG  80    /doc/         <NA>    <NA>  frag    
##  7 http   <NA>  <NA>     www.python.org  80    /             <NA>    <NA>  <NA>    
##  8 http   <NA>  <NA>     www.python.org  80    /             <NA>    <NA>  <NA>    
##  9 http   <NA>  <NA>     example.com     80    /             <NA>    ""    <NA>    
## 10 http   <NA>  <NA>     example.com     80    /;            <NA>    <NA>  <NA>    
## 11 http   user  <NA>     example.com     8080  /path;param   <NA>    query fragment
## 12 http   <NA>  <NA>     .               80    /             <NA>    <NA>  <NA>    
## 13 http   <NA>  <NA>     foo..com        80    /             <NA>    <NA>  <NA>    
## 14 http   <NA>  <NA>     foo..           80    /             <NA>    <NA>  <NA>

filter(u_parsed, !is.na(query))
## # A tibble: 2 x 9
##   scheme user  password host        port  path        options query fragment
##   <chr>  <chr> <chr>    <chr>       <chr> <chr>       <chr>   <chr> <chr>   
## 1 http   <NA>  <NA>     example.com 80    /           <NA>    ""    <NA>    
## 2 http   user  <NA>     example.com 8080  /path;param <NA>    query fragment

filter(u_parsed, !is.na(fragment))
## # A tibble: 2 x 9
##   scheme user  password host           port  path        options query fragment
##   <chr>  <chr> <chr>    <chr>          <chr> <chr>       <chr>   <chr> <chr>   
## 1 http   <NA>  <NA>     WWW.PYTHON.ORG 80    /doc/       <NA>    <NA>  frag    
## 2 http   user  <NA>     example.com    8080  /path;param <NA>    query fragment

Make sure the vector extractors work the same as the data frame converter:

all(
  c(
    identical(u_parsed$scheme, scheme(ugly_urls)),
    identical(u_parsed$user, user(ugly_urls)),
    identical(u_parsed$password, password(ugly_urls)),
    identical(u_parsed$host, host(ugly_urls)),
    identical(u_parsed$path, path(ugly_urls)),
    identical(u_parsed$query, query(ugly_urls)),
    identical(u_parsed$fragment, fragment(ugly_urls))
  )
)
## [1] TRUE

curlparse Metrics

Lang          # Files  (%)  LoC   (%)  Blank lines   (%)  # Lines   (%)
C++                 2  0.2  280  0.68           65  0.48       58  0.30
Rmd                 1  0.1   85  0.21           56  0.41       70  0.36
R                   6  0.6   46  0.11           14  0.10       67  0.34
Bourne Shell        1  0.1    2  0.00            0  0.00        0  0.00

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.