---
output: rmarkdown::github_document
---

```{r include=FALSE}
knitr::opts_chunk$set(
  fig.width=10, fig.retina=2, message=FALSE, warning=FALSE, collapse=TRUE
)
```

# psl

Extract Internet Domain Components Using the Public Suffix List

## Description

The 'Public Suffix List' is a collection of top-level domains ('TLDs') which include generic top-level domains ('gTLDs') such as '.com' and '.net'; country code top-level domains ('ccTLDs') such as '.de' and '.cn'; and brand top-level domains such as '.apple' and '.google'. Tools are provided to extract internet domain components using the public suffix list base data.

- `libpsl`
- Public Suffix List

## What's Inside The Tin

The following functions are implemented:

- `apex_domain`: Return the apex/top-private domain from a vector of domains
- `is_public_suffix`: Test whether a domain is a public suffix
- `public_suffix`: Return the public suffix from a vector of domains
- `suffix_extract`: Separate a domain into component parts
- `suffix_extract2`: Separate a domain into component parts (urltools compatible output)

## PRE-Installation

You need a recent `libpsl`.

- macOS: `brew install libpsl`
- Debian/Ubuntu-ish: Many repos ship old versions, so it's _highly_ suggested that you build from source and ensure the library & header files are accessible
- Windows: Just use `urltools::suffix_extract()` until winlibs are available for psl

## Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/psl")
```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(psl)
library(tidyverse)

# current version
packageVersion("psl")

```

```{r message=FALSE, warning=FALSE, error=FALSE}
doms <- c(
  "", "com", "example.com", "www.example.com", ".com", ".example",
  ".example.com", ".example.example", "example", "example.example",
  "b.example.example", "a.b.example.example", "biz", "domain.biz",
  "b.domain.biz", "a.b.domain.biz", "com", "example.com", "b.example.com",
  "a.b.example.com", "uk.com", "example.uk.com", "b.example.uk.com",
  "a.b.example.uk.com", "test.ac", "cy", "c.cy", "b.c.cy", "a.b.c.cy",
  "jp", "test.jp", "www.test.jp", "ac.jp", "test.ac.jp", "www.test.ac.jp",
  "kyoto.jp", "test.kyoto.jp", "ide.kyoto.jp", "b.ide.kyoto.jp",
  "a.b.ide.kyoto.jp", "c.kobe.jp", "b.c.kobe.jp", "a.b.c.kobe.jp",
  "city.kobe.jp", "www.city.kobe.jp", "ck", "test.ck", "b.test.ck",
  "a.b.test.ck", "www.ck", "www.www.ck", "us", "test.us", "www.test.us",
  "ak.us", "test.ak.us", "www.test.ak.us", "k12.ak.us", "test.k12.ak.us",
  "www.test.k12.ak.us"
)

apex_domain(doms)

public_suffix(doms)

is_public_suffix(doms)

suffix_extract(doms)

str(suffix_extract2(doms)) # urltools compatible output
```

```{r bench, message=FALSE, warning=FALSE, error=FALSE, fig.width=10, fig.retina=2}
library(microbenchmark)

microbenchmark(
  urltools = urltools::suffix_extract(doms),
  psl = psl::suffix_extract(doms), # returns more data
  psl2 = psl::suffix_extract2(doms) # returns what urltools does
) -> mb

autoplot(mb)
```
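
## A Toy Sketch Of Suffix Matching

To make the idea behind `public_suffix()` and `apex_domain()` more concrete, here is a minimal base R sketch of longest-suffix matching. It is not how `psl` or `libpsl` is implemented. `toy_rules`, `toy_public_suffix()` and `toy_apex_domain()` are hypothetical names made up for illustration, and the rule set is a tiny hand-picked stand-in for the real list. The sketch also ignores wildcard rules, exception rules, the implicit fallback for unlisted TLDs, and leading dots, so its results for such inputs may differ from what the package returns.

```r
# Hypothetical, hand-picked rules for illustration only; NOT the real Public Suffix List
toy_rules <- c("com", "uk.com", "jp", "ac.jp", "us", "ak.us", "k12.ak.us")

# Return the longest rule that matches the right-hand end of `domain`
toy_public_suffix <- function(domain, rules = toy_rules) {
  labels <- strsplit(tolower(domain), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  if (n == 0) return(NA_character_)
  # Candidates run from the full name down to the last label,
  # so the first hit is the longest matching suffix
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% rules) return(candidate)
  }
  NA_character_ # no rule matched (the real algorithm has a fallback; this toy does not)
}

# The apex domain is the public suffix plus one more label to its left
toy_apex_domain <- function(domain, rules = toy_rules) {
  suffix <- toy_public_suffix(domain, rules)
  if (is.na(suffix)) return(NA_character_)
  labels <- strsplit(tolower(domain), ".", fixed = TRUE)[[1]]
  n_suffix <- length(strsplit(suffix, ".", fixed = TRUE)[[1]])
  # A bare public suffix has no registrable part
  if (length(labels) <= n_suffix) return(NA_character_)
  paste(tail(labels, n_suffix + 1), collapse = ".")
}

toy_public_suffix("www.test.k12.ak.us")
toy_apex_domain("www.test.k12.ak.us")
toy_apex_domain("b.example.uk.com")
```

Checking longer candidates first is what lets `k12.ak.us` win over `ak.us` and `us`. The package delegates this work to `libpsl` and the full list, so prefer `public_suffix()` and `apex_domain()` for real data.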