Travis-CI Build Status | AppVeyor Build Status | Coverage Status

rep

Tools to Parse and Test Robots Exclusion Protocol Files and Rules

Description

The 'Robots Exclusion Protocol' (http://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the 'rep-cpp' (https://github.com/seomoz/rep-cpp) C++ library for processing these 'robots.txt' files.

Tools

The following functions are implemented (a minimal end-to-end sketch follows the list):

  • can_fetch: Test URL path against robots.txt
  • crawl_delays: Get all agent crawl delay values
  • print.robxp: Custom printer for 'robxp' objects
  • robxp: Create a robots.txt object
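
A minimal end-to-end sketch of how these fit together (an illustration, not from the package docs; it assumes robxp() accepts the raw robots.txt text as a single string, as in the Usage examples below, and uses a hypothetical local file path):

library(rep)

# read a locally saved robots.txt and collapse it into one string (hypothetical path)
rt_txt <- paste(readLines("robots.txt"), collapse = "\n")
rt <- robxp(rt_txt)                  # parse into a 'robxp' object

print(rt)                            # custom printer for 'robxp' objects
can_fetch(rt, "/some/path", "*")     # may all user agents ("*") crawl this path?
crawl_delays(rt)                     # data frame of per-agent Crawl-delay values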

Installation

devtools::install_github("hrbrmstr/rep")
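
The lighter remotes package provides the same installer, if you prefer it (assuming remotes is installed):

remotes::install_github("hrbrmstr/rep")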

Usage

library(rep)
library(robotstxt)

# current version
packageVersion("rep")
## [1] '0.2.0'
rt <- robxp(get_robotstxt("https://cdc.gov"))

print(rt)
## <Robots Exclusion Protocol Object>
can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
## [1] TRUE
can_fetch(rt, "/_borders", "*")
## [1] FALSE
gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))
can_fetch(gh_rt, "/humans.txt", "*") # TRUE
## [1] TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
## [1] FALSE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
## [1] FALSE
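# a quick way to test several paths at once with base R
# (a sketch, assuming can_fetch() checks one path per call as above)
paths <- c("/humans.txt", "/login", "/oembed")
vapply(paths, function(p) can_fetch(gh_rt, p, "*"), logical(1))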
crawl_delays(gh_rt)
##                agent crawl_delay
## 1             yandex          -1
## 2         twitterbot          -1
## 3              ccbot          -1
## 4        mail.ru_bot          -1
## 5         telefonica          -1
## 6              slurp          -1
## 7          seznambot          -1
## 8         sanddollar          -1
## 9             coccoc          -1
## 10       ia_archiver          -1
## 11          swiftbot          -1
## 12 red-app-gsa-p-one          -1
## 13          naverbot          -1
## 14            msnbot          -1
## 15             teoma          -1
## 16                 *          -1
## 17  intuitgsacrawler          -1
## 18           bingbot          -1
## 19            daumoa          -1
## 20         googlebot          -1
## 21           httrack          -1
## 22       duckduckbot          -1
## 23        etaospider          -1
## 24          rogerbot          -1
## 25            dotbot          -1
imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))
crawl_delays(imdb_rt)
##      agent crawl_delay
## 1    slurp         0.1
## 2 scoutjet         3.0
## 3        *        -1.0
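
A crawl_delay of -1 appears to indicate that no Crawl-delay directive was set for that agent, so one way to keep only the agents with an explicit delay is a simple base-R subset (a sketch under that assumption):

imdb_delays <- crawl_delays(imdb_rt)
subset(imdb_delays, crawl_delay >= 0)   # agents that set an explicit Crawl-delay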

Test Results

library(rep)
library(testthat)

date()
## [1] "Sat Sep 23 09:14:02 2017"
test_dir("tests/")
## testthat results ========================================================================================================
## OK: 5 SKIPPED: 0 FAILED: 0
## 
## DONE ===================================================================================================================

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.