---
output: rmarkdown::github_document
editor_options:
  chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```

```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```

```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```

## What's Inside The Tin

The following functions are implemented:

```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```

## Installation

You'll need `libfuzzy` installed and available for linking. See <https://ssdeep-project.github.io/ssdeep/index.html#platforms> for platform support.

On Ubuntu/Debian you can do:

```shell
sudo apt install libfuzzy-dev
```

On macOS you can do:

```shell
brew install ssdeep
```
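
If you want to double-check that the header ended up somewhere the linker can find it, here's a quick (and admittedly crude) check from R, assuming default install prefixes:

```{r libfuzzy-check, eval=FALSE}
# look for the libfuzzy header in common install locations;
# paths vary by platform and package manager
any(file.exists(c(
  "/usr/include/fuzzy.h",          # Debian/Ubuntu (libfuzzy-dev)
  "/usr/local/include/fuzzy.h",    # Homebrew (Intel macOS) / manual builds
  "/opt/homebrew/include/fuzzy.h"  # Homebrew (Apple Silicon)
)))
```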

The library also works on Windows; I just need to do some manual labor to get the package building there.

Package installation:

```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```

## Usage

```{r lib-ex}
library(ssdeepr)

# current version
packageVersion("ssdeepr")
```

The package ships with some test files in `extdat`:

- `index.html` is a static copy of a blog main page with a bunch of `<div>`s containing article snippets
- `index1.html` is the same file as `index.html` with a changed cache timestamp at the end
- `index2.html` is the same file as `index.html` with one article snippet removed
- `RMacOSX-FAQ.html` is the CRAN 'R for Mac OS X FAQ'

```{r u-01}
system.file("extdat", package = "ssdeepr") %>%
  list.files(full.names = TRUE, pattern = "html$", include.dirs = FALSE) %>%
  hash_file() -> hashes

hashes

hash_compare(hashes$hash[1], hashes$hash[1])
hash_compare(hashes$hash[1], hashes$hash[2])
hash_compare(hashes$hash[1], hashes$hash[3])
hash_compare(hashes$hash[1], hashes$hash[4])
```
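
`hash_compare()` returns an ssdeep match score from 0 (no similarity) to 100 (identical input). When comparing more than a handful of files, a pairwise score matrix can be handy; here's a minimal sketch (not run) that assumes the `hashes` data frame from the chunk above:

```{r u-01-pairwise, eval=FALSE}
# pairwise ssdeep match scores (0-100) for every pair of hashes computed above
n <- nrow(hashes)

outer(seq_len(n), seq_len(n), Vectorize(function(i, j) {
  hash_compare(hashes$hash[i], hashes$hash[j])
}))
```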

Works with connections, too. All three hashes below should be identical, provided the Wikipedia page hasn't changed since the local copies bundled with the package were made (note that `gzfile()` decompresses on read, which is why the gzipped copy hashes the same as the plain one).

Using `hash_con()` has the advantage of not requiring the entire contents of the target blob to be read into memory, in exchange for a tiny speed cost for most files (see Benchmarks below).

NOTE that retrieving the URL contents with a different user-agent string and/or with javascript enabled will likely generate different content and, thus, a different hash.
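
To illustrate, a minimal sketch (not run) that fetches the same URL with two arbitrary stand-in user-agent strings and compares the results:

```{r ua-compare, eval=FALSE}
# the same page fetched as "desktop" vs "mobile"; servers that tailor
# content by user agent will hand back different bytes, so the score drops
ua_a <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
ua_b <- "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X)"

h_a <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth",
                    headers = c(`User-Agent` = ua_a)))
h_b <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth",
                    headers = c(`User-Agent` = ua_b)))

hash_compare(h_a, h_b)
```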

```{r u-02}
(k1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth",
                    headers = setNames(splashr::ua_ios_safari, "User-Agent"))))

(k2 <- hash_con(file(system.file("knuth", "local.html", package = "ssdeepr"))))

(k3 <- hash_con(gzfile(system.file("knuth", "local.gz", package = "ssdeepr"))))

hash_compare(k1, k2)

hash_compare(k1, k3)

hash_compare(k2, k3)
```

## Benchmarks

```{r u-03}
microbenchmark::microbenchmark(
con = hash_con(file(system.file("knuth", "local.html", package = "ssdeepr"))),
gzc = hash_con(gzfile(system.file("knuth", "local.gz", package = "ssdeepr"))),
fil = hash_file(system.file("knuth", "local.html", package = "ssdeepr"))
)
```

## ssdeepr Metrics

```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```

## Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.