---
output: rmarkdown::github_document
editor_options:
chunk_output_type: console
---
[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/tlsh.svg?branch=master)](https://travis-ci.org/hrbrmstr/tlsh)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/hrbrmstr/tlsh?branch=master&svg=true)](https://ci.appveyor.com/project/hrbrmstr/tlsh)
[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/tlsh/master.svg)](https://codecov.io/github/hrbrmstr/tlsh?branch=master)
# tlsh
Locality Sensitive Hashing Using the 'Trend Micro' 'TLSH' Implementation
## Description
'Trend Micro' provides an open source library <https://github.com/trendmicro/tlsh/> for locality sensitive hashing. This package provides methods to compute and compare hashes from character/byte streams.
## References
- Jonathan Oliver, Chun Cheng and Yanggui Chen,
"[TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf)"
4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
- Jonathan Oliver, Scott Forman and Chun Cheng,
"[Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf)"
Applications and Techniques in Information Security. Springer Berlin Heidelberg, 2014. 199-210.
- Jonathan Oliver and Jayson Pryde's
[Trend Micro Blog Post](http://blog.trendmicro.com/trendlabs-security-intelligence/smart-whitelisting-using-locality-sensitive-hashing/)
## TODO
- [ ] File input utilities
- [ ] File input DSL verb
- [ ] Docs
- [ ] Tests
- [ ] `toString()` method
- [X] Reference class-backed DSL
## What's Inside The Tin
The following functions are implemented:
"Simple" interface (quick and dirty hashing):
- `tlsh_simple_hash`: Compute TLSH hash for a character or raw vector and return hash fingerprint
- `tlsh_simple_diff`: Compute the difference between two character hashes
DSL: (WIP)
- `tlsh`: Create a new 'tlsh' object
- `tlsh_reset`: Clear content and hash computation from a 'tlsh' object fingerprint
- `tlsh_update`: Update the 'tlsh' object with content
- `tlsh_finalize`: Finalize a 'tlsh' object hash
- `tlsh_is_valid`: Test if a 'tlsh' hash object is valid
- `tlsh_hash`: Retrieve the hex-encoded hash string for a 'tlsh' object
- `tlsh_dist`: Compute distance between two TLSH objects
- `tlsh_stats`: Return a data frame of lvalue and q1/2 ratios from a 'tlsh' object
TODO: Document DSL
## Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/tlsh")
```
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(tlsh)
library(tidyverse)
# current version
packageVersion("tlsh")
```
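Because `tlsh_simple_hash` takes a raw vector as well as a character vector, a file's bytes can be hashed directly. The following is a minimal sketch and not part of the package: `read_bytes` and `hash_file` are hypothetical helper names, and the chunk is not evaluated when this document is knit.

```{r eval=FALSE}
# Hypothetical helpers; not exported by tlsh.

# Read an entire file into a raw vector using base R only.
read_bytes <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

# Hash a file's bytes with the "simple" interface.
# Assumes tlsh_simple_hash() accepts the raw vector as-is.
hash_file <- function(path) {
  tlsh::tlsh_simple_hash(read_bytes(path))
}
```

With these defined, `hash_file(system.file("extdat", "index.html", package="tlsh"))` would hash the bundled sample page without parsing it first. The result may differ from a hash of the `xml2::read_html` output, since parsing re-serializes the document.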
## Example
- `index.html` is a static copy of a blog main page with a bunch of `<div>`s with article snippets
- `index1.html` is the same file as `index.html` with a changed cache timestamp at the end
- `index2.html` is the same file as `index.html` with one article snippet removed
- `RMacOSX-FAQ.html` is the CRAN 'R for Mac OS X FAQ'
```{r}
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
doc2 <- as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh")))
doc3 <- as.character(xml2::read_html(system.file("extdat", "index2.html", package="tlsh")))
doc4 <- as.character(xml2::read_html(system.file("extdat", "RMacOSX-FAQ.html", package="tlsh")))
# generate hashes
(h1 <- tlsh_simple_hash(doc1))
(h2 <- tlsh_simple_hash(doc2))
(h3 <- tlsh_simple_hash(doc3))
(h4 <- tlsh_simple_hash(doc4))
# compute distance
tlsh_simple_diff(h1, h2)
tlsh_simple_diff(h1, h3)
tlsh_simple_diff(h1, h4)
```
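When more than a couple of documents are involved, it can be handy to see every pairwise distance at once. The sketch below is one way to do that and is not part of the package: `pairwise_diff` is a hypothetical helper, and its `diff_fun` argument is an assumption added so the comparison function can be swapped out. It calls the comparison once per pair rather than assuming `tlsh_simple_diff` is vectorized.

```{r eval=FALSE}
# Hypothetical helper; not exported by tlsh.
# Builds a square matrix of distances between every pair of hashes.
# `hashes` is a (preferably named) character vector of hash strings.
pairwise_diff <- function(hashes, diff_fun = tlsh::tlsh_simple_diff) {
  n <- length(hashes)
  m <- matrix(
    0, nrow = n, ncol = n,
    dimnames = list(names(hashes), names(hashes))
  )
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # one scalar comparison per cell
      m[i, j] <- diff_fun(hashes[[i]], hashes[[j]])
    }
  }
  m
}
```

With the hashes from the chunk above, `pairwise_diff(c(index = h1, index1 = h2, index2 = h3, faq = h4))` would return a 4 x 4 matrix with one row and one column per document.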
### DSL
```{r}
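# hash a character vector with the DSL
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
tlsh() %>%
  tlsh_update(doc1) %>%
  tlsh_finalize() -> x
# inspect the finalized hash
tlsh_hash(x)
tlsh_is_valid(x)
tlsh_stats(x)
# raw vectors work as input, too
doc2 <- charToRaw(as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh"))))
tlsh() %>%
  tlsh_update(doc2) %>%
  tlsh_finalize() -> y
# distance between the two hash objects
tlsh_dist(x, y)
# clear content and hash computation from both objects
tlsh_reset(x)
tlsh_reset(y)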
```
## Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.