
README

master · boB Rudis, 6 years ago · parent commit 0dd88abd19
14 changed files:

 NAMESPACE | 1
 R/RcppExports.R | 10
 README.Rmd | 70
 README.md | 96
 README_cache/gfm/__packages | 7
 README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.RData | BIN
 README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.rdb | 0
 README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.rdx | 0
 README_cache/gfm/unnamed-chunk-5_b9b66de41ab754116a06480c07795289.RData | BIN
 README_files/figure-gfm/unnamed-chunk-7-1.png | BIN
 README_files/figure-gfm/unnamed-chunk-8-1.png | BIN
 man/url_parse.Rd | 17
 src/RcppExports.cpp | 12
 src/curlparse-main.cpp | 70

NAMESPACE | 1

@@ -11,6 +11,7 @@ export(port)
export(query)
export(scheme)
export(url_options)
+export(url_parse)
export(user)
importFrom(Rcpp,sourceCpp)
importFrom(stringi,stri_detect_regex)

R/RcppExports.R | 10

@@ -11,6 +11,16 @@ parse_curl <- function(urls) {
.Call('_curlparse_parse_curl', PACKAGE = 'curlparse', urls)
}
#' Parse a character vector of URLs into component parts (`urltools` compatibility function)
#'
#' @md
#' @param urls character vector of URLs
#' @return data frame (not a tibble)
#' @export
url_parse <- function(urls) {
.Call('_curlparse_url_parse', PACKAGE = 'curlparse', urls)
}
#' Extract member components from a URL string
#'
#' @md
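The generated wrapper above is a thin `.Call()` shim around the new C++ routine added later in this commit. A minimal usage sketch (hypothetical URL; assumes the package is installed and built against libcurl >= 7.62.0):

``` r
library(curlparse)

# one row per input URL; a plain data frame (not a tibble) with columns
# scheme, domain, port, path, query, fragment
url_parse("https://example.com:8080/a/b?x=1#frag")
```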

README.Rmd | 70

@@ -5,7 +5,7 @@ editor_options:
---
```{r include=FALSE}
knitr::opts_chunk$set(
-message=FALSE, warning=FALSE, collapse = TRUE
+message=FALSE, warning=FALSE, collapse = TRUE, fig.retina=2
)
```
# curlparse
@@ -16,7 +16,7 @@ Parse 'URLs' with 'libcurl'
As of version 7.62.0 'libcurl' has exposed its 'URL' parser. Tools are provided to parse 'URLs' using this new parser feature.
-**UNTIL `curl`/`libcurl` general release at the end of October you _must_ use the development version which can be cloned and built from <https://github.com/curl/curl>.
+**UNTIL `curl`/`libcurl` general release at the end of October you _must_ use the development version which can be cloned and built from <https://github.com/curl/curl>.**
## What's Inside The Tin
@@ -63,6 +63,7 @@ packageVersion("curlparse")
### Process Some URLs
```{r libs}
+library(urltools)
library(rvest)
library(curlparse)
library(tidyverse)
@@ -82,6 +83,71 @@ count(parsed, scheme, sort=TRUE)
filter(parsed, !is.na(query))
```
### Benchmark
`curlparse` includes a `url_parse()` function so that current users of `urltools::url_parse()` can switch to this package more easily: it provides the same API and returns the same results (including a regular data frame rather than a `tbl`).
Spoiler alert: `urltools::url_parse()` is faster, by roughly 100µs per 100 URLs, on "good" URLs (with a mix of gnarly/bad URLs and valid ones the two get closer to par). The aim was not to beat it, though.
>Per the [blog post introducing this new set of API calls](https://daniel.haxx.se/blog/2018/09/09/libcurl-gets-a-url-api/):
>
>Applications that pass in URLs to libcurl would of course still very often need to parse URLs, create URLs or otherwise handle them, but libcurl has not been helping with that.
>
>At the same time, the under-specification of URLs has led to a situation where there's really no stable document anywhere describing how URLs are supposed to work and basically every implementer is left to handle the WHATWG URL spec, RFC 3986 and the world in between all by themselves. Understanding how their URL parsing libraries, libcurl, other tools and their favorite browsers differ is complicated.
>
>By offering applications access to libcurl's own URL parser, we hope to tighten a problematic vulnerable area for applications where the URL parser library would believe one thing and libcurl another. This could and has sometimes led to security problems. (See for example Exploiting URL Parser in Trending Programming Languages! by Orange Tsai)
So, using this library adds consistency with how `libcurl` sees and handles URLs.
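Worth noting, given the consistency point above: URLs that `libcurl` refuses to parse come back from `url_parse()` as an all-`NA` row rather than a best-effort guess (this is the failure path in the new C++ code in this commit). A sketch with hypothetical inputs; exactly which strings fail can depend on the `libcurl` build:

``` r
# the second string should fail libcurl's parser and yield all-NA components
curlparse::url_parse(c("https://example.com/ok", "not a url"))
```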
```{r}
library(microbenchmark)
set.seed(0)
test_urls <- sample(blog_urls, 100) # pick 100 URLs at random
microbenchmark(
curlparse = curlparse::url_parse(test_urls),
urltools = urltools::url_parse(test_urls), # we loaded urltools before curlparse at the top so namespace loading wasn't a factor for the benchmarks
times = 500
) -> mb
mb
autoplot(mb)
```
The individual handlers are closer to par but still mostly slower (except for `fragment()`). Note that `urltools` has no equivalent function for extracting just the query string, so that's not in the test.
```{r fig.width=6, fig.height=6}
bind_rows(
microbenchmark(curlparse = curlparse::scheme(blog_urls), urltools = urltools::scheme(blog_urls)) %>%
mutate(test = "scheme"),
microbenchmark(curlparse = curlparse::domain(blog_urls), urltools = urltools::domain(blog_urls)) %>%
mutate(test = "domain"),
microbenchmark(curlparse = curlparse::port(blog_urls), urltools = urltools::port(blog_urls)) %>%
mutate(test = "port"),
microbenchmark(curlparse = curlparse::path(blog_urls), urltools = urltools::path(blog_urls)) %>%
mutate(test = "path"),
microbenchmark(curlparse = curlparse::fragment(blog_urls), urltools = urltools::fragment(blog_urls)) %>%
mutate(test = "fragment")
) %>%
mutate(test = factor(test, levels=c("scheme", "domain", "port", "path", "fragment"))) %>%
mutate(time = time / 1000000) %>% # microbenchmark times are in nanoseconds; convert to ms
ggplot(aes(expr, time)) +
geom_violin(aes(fill=expr), show.legend = FALSE) +
scale_y_continuous(name = "milliseconds", expand = c(0,0), limits=c(0, NA)) +
hrbrthemes::scale_fill_ft() +
facet_wrap(~test, ncol = 1) +
coord_flip() +
labs(x=NULL) +
hrbrthemes::theme_ft_rc(grid="XY", strip_text_face = "bold") +
theme(panel.spacing.y=unit(0, "lines"))
```
```{r echo=FALSE}
unloadNamespace("urltools")
```
### Stress Test
```{r}

README.md | 96

@@ -8,9 +8,9 @@ Parse ‘URLs’ with ‘libcurl’
As of version 7.62.0 ‘libcurl’ has exposed its ‘URL’ parser. Tools are
provided to parse ‘URLs’ using this new parser feature.
-\*\*UNTIL `curl`/`libcurl` general release at the end of October you
+**UNTIL `curl`/`libcurl` general release at the end of October you
*must* use the development version which can be cloned and built from
-<https://github.com/curl/curl>.
+<https://github.com/curl/curl>.**
## What’s Inside The Tin
@@ -57,6 +57,7 @@ packageVersion("curlparse")
### Process Some URLs
``` r
+library(urltools)
library(rvest)
library(curlparse)
library(tidyverse)
@@ -100,6 +101,97 @@ filter(parsed, !is.na(query))
## 2 https <NA> <NA> kevinkuang.net 443 /tagged/r-programming <NA> source=rss----a1ff9aea4bf1--r_pr… <NA>
```
### Benchmark
`curlparse` includes a `url_parse()` function so that current users of
`urltools::url_parse()` can switch to this package more easily: it
provides the same API and returns the same results (including a regular
data frame rather than a `tbl`).
Spoiler alert: `urltools::url_parse()` is faster, by roughly 100µs per
100 URLs, on “good” URLs (with a mix of gnarly/bad URLs and valid ones
the two get closer to par). The aim was not to beat it, though.
> Per the [blog post introducing this new set of API
> calls](https://daniel.haxx.se/blog/2018/09/09/libcurl-gets-a-url-api/):
>
> Applications that pass in URLs to libcurl would of course still very
> often need to parse URLs, create URLs or otherwise handle them, but
> libcurl has not been helping with that.
>
> At the same time, the under-specification of URLs has led to a
> situation where there’s really no stable document anywhere describing
> how URLs are supposed to work and basically every implementer is left
> to handle the WHATWG URL spec, RFC 3986 and the world in between all
> by themselves. Understanding how their URL parsing libraries, libcurl,
> other tools and their favorite browsers differ is complicated.
>
> By offering applications access to libcurl’s own URL parser, we hope
> to tighten a problematic vulnerable area for applications where the
> URL parser library would believe one thing and libcurl another. This
> could and has sometimes led to security problems. (See for example
> Exploiting URL Parser in Trending Programming Languages\! by Orange
> Tsai)
So, using this library adds consistency with how `libcurl` sees and
handles URLs.
``` r
library(microbenchmark)
set.seed(0)
test_urls <- sample(blog_urls, 100) # pick 100 URLs at random
microbenchmark(
curlparse = curlparse::url_parse(test_urls),
urltools = urltools::url_parse(test_urls), # we loaded urltools before curlparse at the top so namespace loading wasn't a factor for the benchmarks
times = 500
) -> mb
mb
## Unit: microseconds
## expr min lq mean median uq max neval
## curlparse 753.914 831.5750 896.4327 859.1640 896.8245 4597.547 500
## urltools 647.077 710.7115 768.3054 734.9985 766.3750 4163.394 500
autoplot(mb)
```
<img src="README_files/figure-gfm/unnamed-chunk-7-1.png" width="672" />
The individual handlers are closer to par but still mostly slower
(except for `fragment()`). Note that `urltools` has no equivalent
function for extracting just the query string, so that’s not in the test.
``` r
bind_rows(
microbenchmark(curlparse = curlparse::scheme(blog_urls), urltools = urltools::scheme(blog_urls)) %>%
mutate(test = "scheme"),
microbenchmark(curlparse = curlparse::domain(blog_urls), urltools = urltools::domain(blog_urls)) %>%
mutate(test = "domain"),
microbenchmark(curlparse = curlparse::port(blog_urls), urltools = urltools::port(blog_urls)) %>%
mutate(test = "port"),
microbenchmark(curlparse = curlparse::path(blog_urls), urltools = urltools::path(blog_urls)) %>%
mutate(test = "path"),
microbenchmark(curlparse = curlparse::fragment(blog_urls), urltools = urltools::fragment(blog_urls)) %>%
mutate(test = "fragment")
) %>%
mutate(test = factor(test, levels=c("scheme", "domain", "port", "path", "fragment"))) %>%
mutate(time = time / 1000000) %>% # microbenchmark times are in nanoseconds; convert to ms
ggplot(aes(expr, time)) +
geom_violin(aes(fill=expr), show.legend = FALSE) +
scale_y_continuous(name = "milliseconds", expand = c(0,0), limits=c(0, NA)) +
hrbrthemes::scale_fill_ft() +
facet_wrap(~test, ncol = 1) +
coord_flip() +
labs(x=NULL) +
hrbrthemes::theme_ft_rc(grid="XY", strip_text_face = "bold") +
theme(panel.spacing.y=unit(0, "lines"))
```
<img src="README_files/figure-gfm/unnamed-chunk-8-1.png" width="576" />
### Stress Test
``` r

README_cache/gfm/__packages | 7

@@ -1,10 +1,4 @@
base
methods
datasets
utils
grDevices
graphics
stats
curlparse
xml2
rvest
@@ -17,3 +11,4 @@ purrr
dplyr
stringr
forcats
+urltools

README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.RData | BIN (binary file not shown)

README_cache/gfm/unnamed-chunk-5_b9b66de41ab754116a06480c07795289.rdb → README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.rdb | 0 (renamed)

README_cache/gfm/unnamed-chunk-5_b9b66de41ab754116a06480c07795289.rdx → README_cache/gfm/unnamed-chunk-5_b30bb7f1dbf8b0ffd72ec16ab8ab30b4.rdx | 0 (renamed)

README_cache/gfm/unnamed-chunk-5_b9b66de41ab754116a06480c07795289.RData | BIN (binary file not shown)

README_files/figure-gfm/unnamed-chunk-7-1.png | BIN (new image: 68 KiB)

README_files/figure-gfm/unnamed-chunk-8-1.png | BIN (new image: 99 KiB)

man/url_parse.Rd | 17

@@ -0,0 +1,17 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/RcppExports.R
\name{url_parse}
\alias{url_parse}
\title{Parse a character vector of URLs into component parts (\code{urltools} compatibility function)}
\usage{
url_parse(urls)
}
\arguments{
\item{urls}{character vector of URLs}
}
\value{
data frame (not a tibble)
}
\description{
Parse a character vector of URLs into component parts (\code{urltools} compatibility function)
}
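Once the package is installed, the help page generated from this Rd file is available from the R console:

``` r
?curlparse::url_parse
# or equivalently
help("url_parse", package = "curlparse")
```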

src/RcppExports.cpp | 12

@@ -16,6 +16,17 @@ BEGIN_RCPP
return rcpp_result_gen;
END_RCPP
}
// url_parse
DataFrame url_parse(CharacterVector urls);
RcppExport SEXP _curlparse_url_parse(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(url_parse(urls));
return rcpp_result_gen;
END_RCPP
}
// scheme
CharacterVector scheme(CharacterVector urls);
RcppExport SEXP _curlparse_scheme(SEXP urlsSEXP) {
@@ -118,6 +129,7 @@ END_RCPP
static const R_CallMethodDef CallEntries[] = {
{"_curlparse_parse_curl", (DL_FUNC) &_curlparse_parse_curl, 1},
{"_curlparse_url_parse", (DL_FUNC) &_curlparse_url_parse, 1},
{"_curlparse_scheme", (DL_FUNC) &_curlparse_scheme, 1},
{"_curlparse_user", (DL_FUNC) &_curlparse_user, 1},
{"_curlparse_password", (DL_FUNC) &_curlparse_password, 1},

src/curlparse-main.cpp | 70

@@ -91,7 +91,7 @@ DataFrame parse_curl(CharacterVector urls) {
_["query"] = query_vec,
_["fragment"] = fragment_vec,
_["stringsAsFactors"] = false
-);
+);
out.attr("class") = CharacterVector::create("tbl_df", "tbl", "data.frame");
@@ -99,6 +99,74 @@
}
//' Parse a character vector of URLs into component parts (`urltools` compatibility function)
//'
//' @md
//' @param urls character vector of URLs
//' @return data frame (not a tibble)
//' @export
// [[Rcpp::export]]
DataFrame url_parse(CharacterVector urls) {
unsigned int input_size = urls.size();
CharacterVector scheme_vec(input_size);
CharacterVector host_vec(input_size);
CharacterVector port_vec(input_size);
CharacterVector path_vec(input_size);
CharacterVector query_vec(input_size);
CharacterVector fragment_vec(input_size);
CURLUcode rc;
CURLU *url;
// parse each URL with libcurl's URL API; on failure emit an all-NA row
for (unsigned int i = 0; i < input_size; i++) {
url = curl_url();
rc = curl_url_set(
url, CURLUPART_URL, Rcpp::as<std::string>(urls[i]).c_str(), 0
);
if (!rc) { // CURLUE_OK (0) means the URL parsed cleanly
scheme_vec[i] = lc_url_get(url, CURLUPART_SCHEME, CURLU_DEFAULT_SCHEME);
host_vec[i] = lc_url_get(url, CURLUPART_HOST);
port_vec[i] = lc_url_get(url, CURLUPART_PORT, CURLU_DEFAULT_PORT);
path_vec[i] = lc_url_get(url, CURLUPART_PATH, CURLU_URLDECODE);
query_vec[i] = lc_url_get(url, CURLUPART_QUERY, CURLU_URLDECODE);
fragment_vec[i] = lc_url_get(url, CURLUPART_FRAGMENT);
} else {
scheme_vec[i] = NA_STRING;
host_vec[i] = NA_STRING;
port_vec[i] = NA_STRING;
path_vec[i] = NA_STRING;
query_vec[i] = NA_STRING;
fragment_vec[i] = NA_STRING;
}
curl_url_cleanup(url);
}
DataFrame out = DataFrame::create(
_["scheme"] = scheme_vec,
_["domain"] = host_vec,
_["port"] = port_vec,
_["path"] = path_vec,
_["query"] = query_vec,
_["fragment"] = fragment_vec,
_["stringsAsFactors"] = false
);
return(out);
}
CharacterVector lc_part(CharacterVector urls, CURLUPart what, unsigned int flags = 0) {
unsigned int input_size = urls.size();
