--- output: rmarkdown::github_document editor_options: chunk_output_type: console --- [![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/tlsh.svg?branch=master)](https://travis-ci.org/hrbrmstr/tlsh) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/hrbrmstr/tlsh?branch=master&svg=true)](https://ci.appveyor.com/project/hrbrmstr/tlsh) [![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/tlsh/master.svg)](https://codecov.io/github/hrbrmstr/tlsh?branch=master) # tlsh Local Sensitivity Hashing Using the 'Trend Micro' 'TLSH' Implementation ## Description 'Trend Micro' provides an open source library for local sensitivity hashing. Methods are provided to compute and compare hashes from character/byte streams. ## References - Jonathan Oliver, Chun Cheng and Yanggui Chen, "[TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf)" 4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013 - Jonathan Oliver, Scott Forman and Chun Cheng, "[Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf)" Applications and Techniques in Information Security. Springer Berlin Heidelberg, 2014. 199-210. - Jonathan Oliver and Jayson Pryde's [Trend Micro Blog Post](http://blog.trendmicro.com/trendlabs-security-intelligence/smart-whitelisting-using-locality-sensitive-hashing/) ## TODO - [ ] File input utilities - [ ] File input DSL verb - [ ] Docs - [ ] Tests - [ ] `toString()` method - [X] Reference class-backed DSL ## What's Inside The Tin The following functions are implemented: "Simple" interface (quick and dirty hashing): - `tlsh_simple_hash`: Compute TLSH hash for a character or raw vector and return hash fingerprint - `tlsh_simple_diff`: Compute the difference between two character hashes DSL: (WIP) - `tlsh`: Create a new 'tlsh' object - `tlsh_reset`: Clear content and hash computation from a 'tlsh' object fingerprint - `tlsh_update`: Update the 'tlsh' object with content - `tlsh_finalize`: Finalize a 'tlsh' object hash - `tlsh_is_valid`: Test if a 'tlsh' hash object is valid - `tlsh_hash`: Retrieve the hex-encoded hash string for a 'tlsh' object - `tlsh_dist`: Compute distance between two TLSH objects - `tlsh_stats`: Return a data frame of lvalue and q1/2 ratios from a 'tlsh' object TODO: Document DSL ## Installation ```{r eval=FALSE} devtools::install_github("hrbrmstr/tlsh") ``` ```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE} options(width=120) ``` ## Usage ```{r message=FALSE, warning=FALSE, error=FALSE} library(tlsh) library(tidyverse) # current verison packageVersion("tlsh") ``` ## Example - `index.html` is a static copy of a blog main page with a bunch of `
`s with article snippets - `index1.html` is the same file as `index.htmnl` with a changed cache timestamp at the end - `index2.html` is the same file as `index.html` with one article snippet removed - `RMacOSX-FAQ.html` is the CRAN 'R for Mac OS X FAQ' ```{r} doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh"))) doc2 <- as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh"))) doc3 <- as.character(xml2::read_html(system.file("extdat", "index2.html", package="tlsh"))) doc4 <- as.character(xml2::read_html(system.file("extdat", "RMacOSX-FAQ.html", package="tlsh"))) # generate hashes (h1 <- tlsh_simple_hash(doc1)) (h2 <- tlsh_simple_hash(doc2)) (h3 <- tlsh_simple_hash(doc3)) (h4 <- tlsh_simple_hash(doc4)) # compute distance tlsh_simple_diff(h1, h2) tlsh_simple_diff(h1, h3) tlsh_simple_diff(h1, h4) ``` ### DSL ```{r} doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh"))) tlsh() %>% tlsh_update(doc1) %>% tlsh_finalize() -> x tlsh_hash(x) tlsh_is_valid(x) tlsh_stats(x) doc2 <- charToRaw(as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh")))) tlsh() %>% tlsh_update(doc2) %>% tlsh_finalize() -> y tlsh_dist(x, y) tlsh_reset(x) tlsh_reset(y) ``` ## Code of Conduct Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.