---
output: rmarkdown::github_document
---

<!-- [![Build Status](https://travis-ci.org/hrbrmstr/rep.svg?branch=master)](https://travis-ci.org/hrbrmstr/rep) -->
<!-- [![Build status](https://ci.appveyor.com/api/projects/status/dakiw5y0xpq1m3bk?svg=true)](https://ci.appveyor.com/project/hrbrmstr/rep) -->
<!-- ![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/rep/master.svg) -->

# spiderbar

Parse and Test Robots Exclusion Protocol Files and Rules

## Description

The 'Robots Exclusion Protocol' (<http://www.robotstxt.org/orig.html>) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the `rep-cpp` (<https://github.com/seomoz/rep-cpp>) C++ library for processing these `robots.txt` files.

- [`rep-cpp`](https://github.com/seomoz/rep-cpp)
- [`url-cpp`](https://github.com/seomoz/url-cpp)

## Tools

The following functions are implemented (a quick offline sketch follows the list):

- `robxp`: Parse a 'robots.txt' file & create a 'robxp' object
- `can_fetch`: Test URL paths against a 'robxp' 'robots.txt' object
- `crawl_delays`: Retrieve all agent crawl delay values in a 'robxp' 'robots.txt' object
- `sitemaps`: Retrieve a character vector of sitemaps from a parsed robots.txt object
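
A quick offline sketch of these functions, using made-up `robots.txt` content purely for illustration (no network access required):

```{r eval=FALSE}
library(spiderbar)

# hypothetical robots.txt content, for illustration only
rules <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Crawl-delay: 10",
  "Sitemap: https://example.com/sitemap.xml"
)

rt <- robxp(rules)                  # parse the rules into a 'robxp' object

can_fetch(rt, "/private/page", "*") # FALSE -- the path is disallowed
can_fetch(rt, "/public/page", "*")  # TRUE

crawl_delays(rt)                    # crawl delay values by agent
sitemaps(rt)                        # sitemap URLs declared in the file
```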

## Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/spiderbar")
```
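
If a released version is available on CRAN, the stable build can also be installed the usual way:

```{r eval=FALSE}
install.packages("spiderbar")
```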

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(spiderbar)
library(robotstxt)

# current version
packageVersion("spiderbar")

# use helpers from the robotstxt package

rt <- robxp(get_robotstxt("https://cdc.gov"))

print(rt)

# or parse the file directly from a URL connection

rt <- robxp(url("https://cdc.gov/robots.txt"))

can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")

can_fetch(rt, "/_borders", "*")

gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))

can_fetch(gh_rt, "/humans.txt", "*") # TRUE

can_fetch(gh_rt, "/login", "*") # FALSE

can_fetch(gh_rt, "/oembed", "CCBot") # FALSE

can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))

crawl_delays(gh_rt)

imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))

crawl_delays(imdb_rt)

sitemaps(imdb_rt)
```
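
Since `can_fetch()` accepts a vector of paths (as in the GitHub example above), its logical result can be used to keep only the crawlable paths. This is a small sketch reusing `gh_rt` from the chunk above, not rendered package output:

```{r eval=FALSE}
paths <- c("/humans.txt", "/login", "/oembed")

# keep only the paths the default agent is allowed to crawl
paths[can_fetch(gh_rt, paths)]
```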

## Test Results

```{r message=FALSE, warning=FALSE, error=FALSE}
library(spiderbar)
library(testthat)

date()

test_dir("tests/")
```

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.