---
output: rmarkdown::github_document
---
<!-- [![Build Status](https://travis-ci.org/hrbrmstr/rep.svg?branch=master)](https://travis-ci.org/hrbrmstr/rep) -->
<!-- [![Build status](https://ci.appveyor.com/api/projects/status/dakiw5y0xpq1m3bk?svg=true)](https://ci.appveyor.com/project/hrbrmstr/rep) -->
<!-- ![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/rep/master.svg) -->
# spiderbar
Parse and Test Robots Exclusion Protocol Files and Rules
## Description
The 'Robots Exclusion Protocol' (<http://www.robotstxt.org/orig.html>) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the `rep-cpp` (<https://github.com/seomoz/rep-cpp>) C++ library for processing these `robots.txt` files.
The underlying SEOmoz C++ libraries:

- [`rep-cpp`](https://github.com/seomoz/rep-cpp)
- [`url-cpp`](https://github.com/seomoz/url-cpp)
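For reference, a `robots.txt` file is a set of plain-text records that name a user agent and the paths it may or may not crawl, along with optional crawl-delay and sitemap directives. A small illustrative example (the paths and sitemap URL are made up):

```
User-agent: *
Crawl-delay: 10
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
```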
## Tools
The following functions are implemented (see the short example after this list):

- `robxp`: Parse a `robots.txt` file & create a `robxp` object
- `can_fetch`: Test URL paths against a `robxp` `robots.txt` object
- `crawl_delays`: Retrieve all agent crawl delay values in a `robxp` `robots.txt` object
- `sitemaps`: Retrieve a character vector of sitemaps from a parsed `robots.txt` object
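A minimal, self-contained sketch of these four functions against the illustrative file shown above (no network access needed; the expected results in the comments follow from those made-up rules):

```{r eval=FALSE}
library(spiderbar)

# robxp() also accepts a character vector of robots.txt lines
rules <- c(
  "User-agent: *",
  "Crawl-delay: 10",
  "Disallow: /private/",
  "Sitemap: https://example.com/sitemap.xml"
)

rt <- robxp(rules)

can_fetch(rt, "/private/data.html", "*") # FALSE
can_fetch(rt, "/index.html", "*")        # TRUE
crawl_delays(rt)                         # agent "*" has a crawl delay of 10
sitemaps(rt)                             # "https://example.com/sitemap.xml"
```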
## Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/spiderbar")
```
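If a CRAN release is available (an assumption; check first), the standard route also works:

```{r eval=FALSE}
# assumes spiderbar has a current CRAN release
install.packages("spiderbar")
```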
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(spiderbar)
library(robotstxt)
# current version
packageVersion("spiderbar")
# use helpers from the robotstxt package
rt <- robxp(get_robotstxt("https://cdc.gov"))
print(rt)
# or read it directly from a connection
rt <- robxp(url("https://cdc.gov/robots.txt"))
can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
can_fetch(rt, "/_borders", "*")
gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))
can_fetch(gh_rt, "/humans.txt", "*") # TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed")) # vectorised over paths
crawl_delays(gh_rt) # data frame of per-agent crawl-delay values
imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))
crawl_delays(imdb_rt)
sitemaps(imdb_rt)
```
## Test Results
```{r message=FALSE, warning=FALSE, error=FALSE}
library(spiderbar)
library(testthat)
date()
test_dir("tests/")
```
## Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.