# rep

Tools to Parse and Test Robots Exclusion Protocol Files and Rules
## Description

The [Robots Exclusion Protocol](http://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. This package provides tools that wrap the [`rep-cpp`](https://github.com/seomoz/rep-cpp) C++ library for processing these `robots.txt` files.
## Tools

The following functions are implemented:

- `can_fetch`: test a URL path against a `robots.txt` file
- `crawl_delays`: get all agent crawl delay values
- `print.robxp`: custom printer for `robxp` objects
- `robxp`: create a `robots.txt` object
## Installation

```r
devtools::install_github("hrbrmstr/rep")
```
## Usage

```r
library(rep)
library(robotstxt)

# current version
packageVersion("rep")
## [1] '0.2.0'
```
```r
rt <- robxp(get_robotstxt("https://cdc.gov"))

print(rt)
## <Robots Exclusion Protocol Object>

can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
## [1] TRUE

can_fetch(rt, "/_borders", "*")
## [1] FALSE
```
```r
gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))

can_fetch(gh_rt, "/humans.txt", "*") # TRUE
## [1] TRUE

can_fetch(gh_rt, "/login", "*") # FALSE
## [1] FALSE

can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
## [1] FALSE
```
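The `can_fetch()` calls above test one path at a time. A small helper (hypothetical, not part of the package) can vectorize the check over several paths; it assumes only the `can_fetch(rt, path, agent)` call shape shown above:

```r
# check several paths against one parsed robots.txt object
# `rt` is a robxp object, `paths` a character vector, `agent` a user agent string
check_paths <- function(rt, paths, agent = "*") {
  vapply(paths, function(p) can_fetch(rt, p, agent), logical(1))
}

# e.g. with the GitHub robots.txt parsed above:
# check_paths(gh_rt, c("/humans.txt", "/login"))
```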
```r
crawl_delays(gh_rt)
##                agent crawl_delay
## 1             yandex          -1
## 2         twitterbot          -1
## 3              ccbot          -1
## 4        mail.ru_bot          -1
## 5         telefonica          -1
## 6              slurp          -1
## 7          seznambot          -1
## 8         sanddollar          -1
## 9             coccoc          -1
## 10       ia_archiver          -1
## 11          swiftbot          -1
## 12 red-app-gsa-p-one          -1
## 13          naverbot          -1
## 14            msnbot          -1
## 15             teoma          -1
## 16                 *          -1
## 17  intuitgsacrawler          -1
## 18           bingbot          -1
## 19            daumoa          -1
## 20         googlebot          -1
## 21           httrack          -1
## 22        duckduckbot          -1
## 23        etaospider          -1
## 24          rogerbot          -1
## 25            dotbot          -1
```
```r
imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))

crawl_delays(imdb_rt)
##      agent crawl_delay
## 1    slurp         0.1
## 2 scoutjet         3.0
## 3        *        -1.0
```
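In the outputs above, a `crawl_delay` of `-1` appears to mean the agent has no explicit `Crawl-delay` directive. Filtering down to agents that do set one is plain data-frame work; the sketch below uses a stand-in data frame shaped like the `crawl_delays()` output rather than a live fetch:

```r
# stand-in for crawl_delays(imdb_rt) output; -1 appears to mean "no delay set"
delays <- data.frame(
  agent       = c("slurp", "scoutjet", "*"),
  crawl_delay = c(0.1, 3.0, -1.0),
  stringsAsFactors = FALSE
)

# keep only agents with an explicit crawl delay
explicit <- delays[delays$crawl_delay >= 0, ]
explicit
##      agent crawl_delay
## 1    slurp         0.1
## 2 scoutjet         3.0
```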
## Test Results

```r
library(rep)
library(testthat)

date()
## [1] "Sat Sep 23 09:14:02 2017"

test_dir("tests/")
## testthat results ========================================================================================================
## OK: 5 SKIPPED: 0 FAILED: 0
##
## DONE ===================================================================================================================
```
## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.