boB Rudis 4c6a22e21b


spiderbar

Parse and Test Robots Exclusion Protocol Files and Rules

Description

The ‘Robots Exclusion Protocol’ (<https://www.robotstxt.org/orig.html>) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the ‘rep-cpp’ (<https://github.com/seomoz/rep-cpp>) C++ library for processing these ‘robots.txt’ files.

What’s Inside the Tin

The following functions are implemented:

  • can_fetch: Test URL paths against a robxp robots.txt object
  • crawl_delays: Retrieve all agent crawl delay values in a robxp robots.txt object
  • print.robxp: Custom printer for ‘robxp’ objects
  • robxp: Parse a ‘robots.txt’ file & create a ‘robxp’ object
  • sitemaps: Retrieve a character vector of sitemaps from a parsed robots.txt object
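
As a rough illustration of how these functions fit together, here is a minimal offline sketch. The robots.txt content is invented for the example (it does not come from any real site), and the sketch assumes that `robxp()` accepts a character vector of lines in addition to a connection; treat it as a starting point rather than a reference.

```r
library(spiderbar)

# A small, made-up robots.txt held in memory (hypothetical rules)
robots_lines <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Crawl-delay: 5",
  "",
  "Sitemap: https://example.com/sitemap.xml"
)

# Parse the text into a 'robxp' object
rt <- robxp(robots_lines)

# Test individual paths for the wildcard agent
allowed_index   <- can_fetch(rt, "/index.html", "*")
allowed_private <- can_fetch(rt, "/private/data.html", "*")

# Inspect per-agent crawl delays and any listed sitemaps
delays <- crawl_delays(rt)
maps   <- sitemaps(rt)
```

With these rules one would expect `/index.html` to be fetchable and anything under `/private/` to be blocked for the `*` agent.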

Installation

install.packages("spiderbar", repos = "https://cinc.rud.is")
# or
remotes::install_git("https://git.rud.is/hrbrmstr/spiderbar.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/spiderbar")
# or
remotes::install_gitlab("hrbrmstr/spiderbar")
# or
remotes::install_bitbucket("hrbrmstr/spiderbar")
# or
remotes::install_github("hrbrmstr/spiderbar")

NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.

Usage

library(spiderbar)
library(robotstxt)

# current version
packageVersion("spiderbar")
## [1] '0.2.2'

# use helpers from the robotstxt package

rt <- robxp(get_robotstxt("https://cdc.gov"))

print(rt)
## <Robots Exclusion Protocol Object>

# or 

rt <- robxp(url("https://cdc.gov/robots.txt"))

can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
## [1] TRUE

can_fetch(rt, "/_borders", "*")
## [1] FALSE

gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))

can_fetch(gh_rt, "/humans.txt", "*") # TRUE
## [1] TRUE

can_fetch(gh_rt, "/login", "*") # FALSE
## [1] FALSE

can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
## [1] FALSE

can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
## [1]  TRUE FALSE FALSE

crawl_delays(gh_rt)
| agent             | crawl_delay |
| :---------------- | ----------: |
| yandex            |          -1 |
| twitterbot        |          -1 |
| ccbot             |          -1 |
| mail.ru_bot       |          -1 |
| telefonica        |          -1 |
| slurp             |          -1 |
| seznambot         |          -1 |
| sanddollar        |          -1 |
| coccoc            |          -1 |
| ia_archiver       |          -1 |
| swiftbot          |          -1 |
| red-app-gsa-p-one |          -1 |
| naverbot          |          -1 |
| msnbot            |          -1 |
| teoma             |          -1 |
| *                 |          -1 |
| intuitgsacrawler  |          -1 |
| bingbot           |          -1 |
| daumoa            |          -1 |
| googlebot         |          -1 |
| httrack           |          -1 |
| duckduckbot       |          -1 |
| etaospider        |          -1 |
| rogerbot          |          -1 |
| dotbot            |          -1 |

imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))

crawl_delays(imdb_rt)
| agent | crawl_delay |
| :---- | ----------: |
| *     |          -1 |

sitemaps(imdb_rt)
## character(0)
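
The calls above can be combined into a small "polite crawl" check. The following is only a sketch: `crawl_plan()` and its `default_delay` argument are hypothetical names introduced here, the robots.txt text is made up, and the code assumes that a negative `crawl_delay` (such as the -1 values shown above) means no delay was specified for that agent.

```r
library(spiderbar)

# Hypothetical helper (not part of spiderbar): returns the paths an agent
# may fetch and the number of seconds to wait between requests.
crawl_plan <- function(rt, paths, agent = "*", default_delay = 1) {
  # Check each path individually against the parsed rules
  allowed <- vapply(paths, function(p) can_fetch(rt, p, agent), logical(1))

  # Look up the agent's crawl delay; agent names appear lowercased in the output
  delays <- crawl_delays(rt)
  delay <- delays$crawl_delay[delays$agent == tolower(agent)]

  # Assumption: a missing or negative value means "no delay specified"
  if (length(delay) == 0 || delay[1] < 0) delay <- default_delay

  list(paths = unname(paths[allowed]), delay = delay[1])
}

# Made-up rules for illustration only
rt <- robxp(c("User-agent: *", "Disallow: /private/", "Crawl-delay: 5"))

plan <- crawl_plan(rt, c("/index.html", "/private/x"), agent = "*")
```

A crawler could then loop over `plan$paths` and call `Sys.sleep(plan$delay)` between requests.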

spiderbar Metrics

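| Lang         | # Files |  (%) |  LoC |  (%) | Blank lines |  (%) | # Lines |  (%) |
| :----------- | ------: | ---: | ---: | ---: | ----------: | ---: | ------: | ---: |
| C++          |       9 | 0.38 | 1763 | 0.78 |         257 | 0.55 |     258 | 0.38 |
| C/C++ Header |       7 | 0.29 |  395 | 0.18 |         152 | 0.33 |     280 | 0.42 |
| R            |       7 | 0.29 |   68 | 0.03 |          26 | 0.06 |     101 | 0.15 |
| Rmd          |       1 | 0.04 |   23 | 0.01 |          31 | 0.07 |      33 | 0.05 |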

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.