initial commit

master
boB Rudis 6 years ago
commit
0d4bd96a58
No known key found for this signature in database GPG Key ID: 1D7529BE14E2BBA9
  1. 12
      .Rbuildignore
  2. 1
      .codecov.yml
  3. 8
      .gitignore
  4. 6
      .travis.yml
  5. 25
      CONDUCT.md
  6. 28
      DESCRIPTION
  7. 244
      LICENSE
  8. 6
      NAMESPACE
  9. 2
      NEWS.md
  10. 15
      R/RcppExports.R
  11. 25
      R/tlsh-hash.R
  12. 13
      R/tlsh-package.R
  13. 77
      README.Rmd
  14. 108
      README.md
  15. 1889
      inst/extdat/RMacOSX-FAQ.html
  16. 768
      inst/extdat/index.html
  17. 768
      inst/extdat/index1.html
  18. 727
      inst/extdat/index2.html
  19. 15
      man/tlsh.Rd
  20. 19
      man/tlsh_simple_hash.Rd
  21. 3
      src/.gitignore
  22. 53
      src/RcppExports.cpp
  23. 60
      src/gen_arr2.cpp
  24. 51
      src/tlsh-pkg-main.cpp
  25. 192
      src/tlsh.cpp
  26. 182
      src/tlsh.h
  27. 556
      src/tlsh_impl.cpp
  28. 150
      src/tlsh_impl.h
  29. 4876
      src/tlsh_util.cpp
  30. 70
      src/tlsh_util.h
  31. 10
      src/version.h
  32. 2
      tests/test-all.R
  33. 6
      tests/testthat/test-tlsh.R
  34. 21
      tlsh.Rproj

12
.Rbuildignore

@ -0,0 +1,12 @@
^.*\.Rproj$
^\.Rproj\.user$
^\.travis\.yml$
^README\.*Rmd$
^README\.*html$
^NOTES\.*Rmd$
^NOTES\.*html$
^\.codecov\.yml$
^README_files$
^doc$
^tmp$
^CONDUCT\.md$

1
.codecov.yml

@ -0,0 +1 @@
comment: false

8
.gitignore

@ -0,0 +1,8 @@
.DS_Store
.Rproj.user
.Rhistory
.RData
.Rproj
src/*.o
src/*.so
src/*.dll

6
.travis.yml

@ -0,0 +1,6 @@
language: R
sudo: false
cache: packages
after_success:
- Rscript -e 'covr::codecov()'

25
CONDUCT.md

@ -0,0 +1,25 @@
# Contributor Code of Conduct
As contributors and maintainers of this project, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or
imagery, derogatory comments or personal attacks, trolling, public or private harassment,
insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments,
commits, code, wiki edits, issues, and other contributions that are not aligned to this
Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed
from the project team.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by
opening an issue or contacting one or more of the project maintainers.
This Code of Conduct is adapted from the Contributor Covenant
(http://contributor-covenant.org), version 1.0.0, available at
http://contributor-covenant.org/version/1/0/0/

28
DESCRIPTION

@ -0,0 +1,28 @@
Package: tlsh
Type: Package
Title: Locality-Sensitive Hashing Using the 'Trend Micro' 'TLSH' Implementation
Version: 0.1.0
Date: 2018-04-27
Authors@R: c(
person("Bob", "Rudis", email = "bob@rud.is", role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-5670-2640")),
person("Trend Micro", comment = "https://github.com/trendmicro/tlsh",
role = c("cph"))
)
Maintainer: Bob Rudis <bob@rud.is>
Description: 'Trend Micro' provides an open source library <https://github.com/trendmicro/tlsh/>
for locality-sensitive hashing. Methods are provided to compute and compare
hashes from character/byte streams.
URL: https://github.com/hrbrmstr/tlsh
BugReports: https://github.com/hrbrmstr/tlsh/issues
Encoding: UTF-8
License: Apache License 2.0 | file LICENSE
Suggests:
testthat,
covr
Depends:
R (>= 3.2.0)
Imports:
Rcpp
RoxygenNote: 6.0.1.9000
LinkingTo: Rcpp

244
LICENSE

@ -0,0 +1,244 @@
=====================
LICENSE OPTION NOTICE
=====================
NOTE: The R package is under the same license.
TLSH is provided for use under two licenses: Apache OR BSD.
Users may opt to use either license depending on the license
restrictions of the systems with which they plan to integrate
the TLSH code.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
BSD License Version 3
https://opensource.org/licenses/BSD-3-Clause
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.

6
NAMESPACE

@ -0,0 +1,6 @@
# Generated by roxygen2: do not edit by hand
export(tlsh_simple_diff)
export(tlsh_simple_hash)
importFrom(Rcpp,sourceCpp)
useDynLib(tlsh)

2
NEWS.md

@ -0,0 +1,2 @@
0.1.0
* Initial release

15
R/RcppExports.R

@ -0,0 +1,15 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
tlsh_simple_hash_r <- function(v) {
.Call('_tlsh_tlsh_simple_hash_r', PACKAGE = 'tlsh', v)
}
tlsh_simple_hash_c <- function(v) {
.Call('_tlsh_tlsh_simple_hash_c', PACKAGE = 'tlsh', v)
}
tlsh_diff_fingerprints <- function(hash1, hash2) {
.Call('_tlsh_tlsh_diff_fingerprints', PACKAGE = 'tlsh', hash1, hash2)
}

25
R/tlsh-hash.R

@ -0,0 +1,25 @@
#' Compute TLSH hash for a character or raw vector and return hash fingerprint
#'
#' @md
#' @param x length 1 `character` or `raw` vector
#' @export
tlsh_simple_hash <- function(x) {
if (inherits(x, "character")) {
x <- x[1]
if (nchar(x, type = "bytes") < 50L) stop("Byte stream minimum length is 50 bytes", call.=FALSE)
tlsh_simple_hash_c(x)
} else if (inherits(x, "raw")) {
if (length(x) < 50L) stop("Byte stream minimum length is 50 bytes", call.=FALSE)
tlsh_simple_hash_r(x)
} else {
NULL
}
}
#' @md
#' @rdname tlsh_simple_hash
#' @param x,y two hash fingerprints to compare
#' @export
tlsh_simple_diff <- function(x, y) {
tlsh_diff_fingerprints(x, y)
}

13
R/tlsh-package.R

@ -0,0 +1,13 @@
#' Locality-Sensitive Hashing Using the 'Trend Micro' 'TLSH' Implementation
#'
#' 'Trend Micro' provides an open source library <https://github.com/trendmicro/tlsh/>
#' for locality-sensitive hashing. Methods are provided to compute and compare
#' hashes from character/byte streams.
#'
#' @md
#' @name tlsh
#' @docType package
#' @author Bob Rudis (bob@@rud.is)
#' @useDynLib tlsh
#' @importFrom Rcpp sourceCpp
NULL

77
README.Rmd

@ -0,0 +1,77 @@
---
output: rmarkdown::github_document
---
# tlsh
Locality-Sensitive Hashing Using the 'Trend Micro' 'TLSH' Implementation
## Description
'Trend Micro' provides an open source library <https://github.com/trendmicro/tlsh/> for locality-sensitive hashing. Methods are provided to compute and compare hashes from character/byte streams.
## TODO
- [ ] File input
- [ ] Docs
- [ ] Tests
## What's Inside The Tin
The following functions are implemented:
## Installation
```{r eval=FALSE}
devtools::install_github("hrbrmstr/tlsh")
```
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(tlsh)
# current version
packageVersion("tlsh")
```
## Example
- `index.html` is a static copy of a blog main page with a bunch of `<div>`s with article snippets
- `index1.html` is the same file as `index.html` with a changed cache timestamp at the end
- `index2.html` is the same file as `index.html` with one article snippet removed
- `RMacOSX-FAQ.html` is the CRAN 'R for Mac OS X FAQ'
```{r}
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
doc2 <- as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh")))
doc3 <- as.character(xml2::read_html(system.file("extdat", "index2.html", package="tlsh")))
doc4 <- as.character(xml2::read_html(system.file("extdat", "RMacOSX-FAQ.html", package="tlsh")))
# generate hashes
(h1 <- tlsh_simple_hash(doc1))
(h2 <- tlsh_simple_hash(doc2))
(h3 <- tlsh_simple_hash(doc3))
(h4 <- tlsh_simple_hash(doc4))
# compute distance
tlsh_simple_diff(h1, h2)
tlsh_simple_diff(h1, h3)
tlsh_simple_diff(h1, h4)
```
## Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.

108
README.md

@ -0,0 +1,108 @@
# tlsh
Locality-Sensitive Hashing Using the ‘Trend Micro’ ‘TLSH’ Implementation
## Description
‘Trend Micro’ provides an open source library
<https://github.com/trendmicro/tlsh/> for locality-sensitive hashing.
Methods are provided to compute and compare hashes from character/byte
streams.
## TODO
- \[ \] File input
- \[ \] Docs
- \[ \] Tests
## What’s Inside The Tin
The following functions are implemented:
## Installation
``` r
devtools::install_github("hrbrmstr/tlsh")
```
## Usage
``` r
library(tlsh)
# current version
packageVersion("tlsh")
```
## [1] '0.1.0'
## Example
- `index.html` is a static copy of a blog main page with a bunch of
`<div>`s with article snippets
- `index1.html` is the same file as `index.html` with a changed cache
timestamp at the end
- `index2.html` is the same file as `index.html` with one article
snippet removed
- `RMacOSX-FAQ.html` is the CRAN ‘R for Mac OS X
FAQ’
<!-- end list -->
``` r
doc1 <- as.character(xml2::read_html(system.file("extdat", "index.html", package="tlsh")))
doc2 <- as.character(xml2::read_html(system.file("extdat", "index1.html", package="tlsh")))
doc3 <- as.character(xml2::read_html(system.file("extdat", "index2.html", package="tlsh")))
doc4 <- as.character(xml2::read_html(system.file("extdat", "RMacOSX-FAQ.html", package="tlsh")))
# generate hashes
(h1 <- tlsh_simple_hash(doc1))
```
## [1] "B253F9F3168DC8354B2363E2A585771CD25A803BCEA099C1FBED54ACA790EB5B137346"
``` r
(h2 <- tlsh_simple_hash(doc2))
```
## [1] "6153E8F3168DC8355B2363E2A585771CD26A803BCEA099C1FBED44AC9790EB5B137346"
``` r
(h3 <- tlsh_simple_hash(doc3))
```
## [1] "6443E8F3168DC8355B6262F2A9C5771CD25A802BCEA099C1FBED54AC9780FF4A137346"
``` r
(h4 <- tlsh_simple_hash(doc4))
```
## [1] "B8B3A52F93C0233E0F1216576F192FA812FD5C7EA3802188B557C67F8712D9A47666BB"
``` r
# compute distance
tlsh_simple_diff(h1, h2)
```
## [1] 7
``` r
tlsh_simple_diff(h1, h3)
```
## [1] 18
``` r
tlsh_simple_diff(h1, h4)
```
## [1] 334
## Code of Conduct
Please note that this project is released with a [Contributor Code of
Conduct](CONDUCT.md). By participating in this project you agree to
abide by its terms.

1889
inst/extdat/RMacOSX-FAQ.html

File diff suppressed because it is too large

768
inst/extdat/index.html

File diff suppressed because one or more lines are too long

768
inst/extdat/index1.html

File diff suppressed because one or more lines are too long

727
inst/extdat/index2.html

File diff suppressed because one or more lines are too long

15
man/tlsh.Rd

@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tlsh-package.R
\docType{package}
\name{tlsh}
\alias{tlsh}
\alias{tlsh-package}
\title{Locality-Sensitive Hashing Using the 'Trend Micro' 'TLSH' Implementation}
\description{
'Trend Micro' provides an open source library \url{https://github.com/trendmicro/tlsh/}
for locality-sensitive hashing. Methods are provided to compute and compare
hashes from character/byte streams.
}
\author{
Bob Rudis (bob@rud.is)
}

19
man/tlsh_simple_hash.Rd

@ -0,0 +1,19 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/tlsh-hash.R
\name{tlsh_simple_hash}
\alias{tlsh_simple_hash}
\alias{tlsh_simple_diff}
\title{Compute TLSH hash for a character or raw vector and return hash fingerprint}
\usage{
tlsh_simple_hash(x)
tlsh_simple_diff(x, y)
}
\arguments{
\item{x}{length 1 \code{character} or \code{raw} vector}
\item{x, y}{two hash fingerprints to compare}
}
\description{
Compute TLSH hash for a character or raw vector and return hash fingerprint
}

3
src/.gitignore

@ -0,0 +1,3 @@
*.o
*.so
*.dll

53
src/RcppExports.cpp

@ -0,0 +1,53 @@
// Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
#include <Rcpp.h>
using namespace Rcpp;
// tlsh_simple_hash_r
CharacterVector tlsh_simple_hash_r(std::vector < unsigned char> v);
RcppExport SEXP _tlsh_tlsh_simple_hash_r(SEXP vSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::vector < unsigned char> >::type v(vSEXP);
rcpp_result_gen = Rcpp::wrap(tlsh_simple_hash_r(v));
return rcpp_result_gen;
END_RCPP
}
// tlsh_simple_hash_c
CharacterVector tlsh_simple_hash_c(std::string v);
RcppExport SEXP _tlsh_tlsh_simple_hash_c(SEXP vSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type v(vSEXP);
rcpp_result_gen = Rcpp::wrap(tlsh_simple_hash_c(v));
return rcpp_result_gen;
END_RCPP
}
// tlsh_diff_fingerprints
NumericVector tlsh_diff_fingerprints(std::string hash1, std::string hash2);
RcppExport SEXP _tlsh_tlsh_diff_fingerprints(SEXP hash1SEXP, SEXP hash2SEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< std::string >::type hash1(hash1SEXP);
Rcpp::traits::input_parameter< std::string >::type hash2(hash2SEXP);
rcpp_result_gen = Rcpp::wrap(tlsh_diff_fingerprints(hash1, hash2));
return rcpp_result_gen;
END_RCPP
}
static const R_CallMethodDef CallEntries[] = {
{"_tlsh_tlsh_simple_hash_r", (DL_FUNC) &_tlsh_tlsh_simple_hash_r, 1},
{"_tlsh_tlsh_simple_hash_c", (DL_FUNC) &_tlsh_tlsh_simple_hash_c, 1},
{"_tlsh_tlsh_diff_fingerprints", (DL_FUNC) &_tlsh_tlsh_diff_fingerprints, 2},
{NULL, NULL, 0}
};
RcppExport void R_init_tlsh(DllInfo *dll) {
R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}

60
src/gen_arr2.cpp

@ -0,0 +1,60 @@
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
/////////////////////////////////////////////////////////////////////////////
// Tlsh.java code to generate the bit_pairs_diff_table in tlsh_util.cpp
int result[256][256];
void generateTable()
{
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 256; j++) {
int x = i, y = j, d, diff = 0;
d = abs(x % 4 - y % 4); diff += (d == 3 ? 6 : d);
x /= 4; y /= 4;
d = abs(x % 4 - y % 4); diff += (d == 3 ? 6 : d);
x /= 4; y /= 4;
d = abs(x % 4 - y % 4); diff += (d == 3 ? 6 : d);
x /= 4; y /= 4;
d = abs(x % 4 - y % 4); diff += (d == 3 ? 6 : d);
result[i][j] = diff;
}
}
}
/////////////////////////////////////////////////////////////////////////////
// Jon Oliver's functions to generate bit_pairs_diff_table
static int pairbit_diff(int pairb, int opairb)
{
int diff = abs(pairb - opairb);
if (diff <= 1)
return(diff);
else if (diff == 2)
return(2);
return(6);
}
int byte_diff(unsigned char bv, unsigned char obv)
{
int h1 = (unsigned char) bv / 16;
int oh1 = (unsigned char) obv / 16;
int h2 = (unsigned char) bv % 16;
int oh2 = (unsigned char) obv % 16;
int p1 = h1 / 4;
int op1 = oh1 / 4;
int p2 = h1 % 4;
int op2 = oh1 % 4;
int p3 = h2 / 4;
int op3 = oh2 / 4;
int p4 = h2 % 4;
int op4 = oh2 % 4;
int diff = 0;
diff = diff + pairbit_diff(p1, op1);
diff = diff + pairbit_diff(p2, op2);
diff = diff + pairbit_diff(p3, op3);
diff = diff + pairbit_diff(p4, op4);
return(diff);
}
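The two approaches in `gen_arr2.cpp` should yield the same per-byte distance, and that can be cross-checked. The following is a minimal standalone sketch, not part of the commit: it restates both computations under hypothetical names (`table_style_diff`, `oliver_style_diff`, `pair_cost`, `count_mismatches`) and counts the byte pairs on which they disagree.

```cpp
#include <cstdlib>

// Per-byte distance as computed inside generateTable(): walk the four
// 2-bit pairs from least to most significant; a pair difference of 3 costs 6.
int table_style_diff(int i, int j) {
    int x = i, y = j, diff = 0;
    for (int k = 0; k < 4; k++) {
        int d = std::abs(x % 4 - y % 4);
        diff += (d == 3 ? 6 : d);
        x /= 4;
        y /= 4;
    }
    return diff;
}

// Same rule as pairbit_diff(): differences 0, 1, 2 cost themselves, 3 costs 6.
static int pair_cost(int a, int b) {
    int d = std::abs(a - b);
    return d <= 2 ? d : 6;
}

// Per-byte distance as computed by byte_diff(): split each byte into two
// nibbles, then each nibble into two 2-bit pairs, and sum the pair costs.
int oliver_style_diff(unsigned char bv, unsigned char obv) {
    int h1 = bv / 16, oh1 = obv / 16;
    int h2 = bv % 16, oh2 = obv % 16;
    return pair_cost(h1 / 4, oh1 / 4) + pair_cost(h1 % 4, oh1 % 4) +
           pair_cost(h2 / 4, oh2 / 4) + pair_cost(h2 % 4, oh2 % 4);
}

// Number of (i, j) byte pairs on which the two computations differ.
int count_mismatches() {
    int mismatches = 0;
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 256; j++)
            if (table_style_diff(i, j) !=
                oliver_style_diff((unsigned char)i, (unsigned char)j))
                mismatches++;
    return mismatches;
}
```

Calling `count_mismatches()` from a small driver is enough to compare the two; both versions visit the same four bit pairs, only in opposite order, so the sums should agree.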

51
src/tlsh-pkg-main.cpp

@ -0,0 +1,51 @@
#include <Rcpp.h>
#include "tlsh.h"
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector tlsh_simple_hash_r(std::vector < unsigned char> v) {
unsigned char *p_buf = (unsigned char *)&*v.begin();
Tlsh tlsh;
tlsh.update(p_buf, v.size());
tlsh.final();
return(CharacterVector::create(tlsh.getHash()));
}
// [[Rcpp::export]]
CharacterVector tlsh_simple_hash_c(std::string v) {
unsigned char *p_buf = (unsigned char *)&*v.begin();
Tlsh tlsh;
tlsh.update(p_buf, v.length());
tlsh.final();
return(CharacterVector::create(tlsh.getHash()));
}
// [[Rcpp::export]]
NumericVector tlsh_diff_fingerprints(std::string hash1, std::string hash2) {
Tlsh tlsh1, tlsh2;
if (tlsh1.fromTlshStr(hash1.c_str()) != 0) {
Rcpp::warning("First hash string is not valid.");
return(NA_REAL);
}
if (tlsh2.fromTlshStr(hash2.c_str()) != 0) {
Rcpp::warning("Second hash string is not valid.");
return(NA_REAL);
}
return(NumericVector::create(tlsh1.totalDiff(&tlsh2)));
}
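Both hashing wrappers above pass `update()` a byte count (`v.size()` and `v.length()`), so input length is measured in bytes rather than characters. As a minimal illustration, assuming UTF-8 encoded input (the helper `utf8_code_points` is hypothetical and not part of the package), the byte length of a `std::string` can exceed its character count:

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) n++;
    return n;
}
```

For a string containing one two-byte character, `length()` reports one more than `utf8_code_points()`, which is why a minimum-length check expressed in bytes and one expressed in characters can disagree for non-ASCII text.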

192
src/tlsh.cpp

@ -0,0 +1,192 @@
/*
* TLSH is provided for use under two licenses: Apache OR BSD.
* Users may opt to use either license depending on the license
* restrictions of the systems with which they plan to integrate
* the TLSH code.
*/
/* ==============
* Apache License
* ==============
* Copyright 2013 Trend Micro Incorporated
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ===========
* BSD License
* ===========
* Copyright (c) 2013, Trend Micro Incorporated
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification,
* are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* 3. Neither the name of the copyright holder nor the names of its contributors
* may be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "tlsh.h"
#include "tlsh_impl.h"
#include "stdio.h"
#include "version.h"
#include <errno.h>
#include <string.h>
/////////////////////////////////////////////////////
// C++ Implementation
Tlsh::Tlsh():impl(NULL)
{
impl = new TlshImpl();
}
Tlsh::Tlsh(const Tlsh& other):impl(NULL)
{
impl = new TlshImpl();
*impl = *other.impl;
}
Tlsh::~Tlsh()
{
delete impl;
}
const char *Tlsh::version()
{
static char versionBuf[256];
if (versionBuf[0] == '\0')
snprintf(versionBuf, sizeof(versionBuf), "%d.%d.%d %s %s sliding_window=%d",
VERSION_MAJOR, VERSION_MINOR, VERSION_PATCH, TLSH_HASH, TLSH_CHECKSUM, SLIDING_WND_SIZE);
return versionBuf;
}
void Tlsh::update(const unsigned char* data, unsigned int len)
{
if ( NULL != impl )
impl->update(data, len);
}
void Tlsh::final(const unsigned char* data, unsigned int len, int force_option)
{
if ( NULL != impl ){
if ( NULL != data && len > 0 )
impl->update(data, len);
impl->final(force_option);
}
}
const char* Tlsh::getHash() const
{
if ( NULL != impl )
return impl->hash();
else
return "";
}
const char* Tlsh::getHash (char *buffer, unsigned int bufSize) const
{
if ( NULL != impl )
return impl->hash(buffer, bufSize);
else {
buffer[0] = '\0';
return buffer;
}
}
void Tlsh::reset()
{
if ( NULL != impl )
impl->reset();
}
Tlsh& Tlsh::operator=(const Tlsh& other)
{
if (this == &other)
return *this;
*impl = *other.impl;
return *this;
}
bool Tlsh::operator==(const Tlsh& other) const
{
if( this == &other )
return true;
else if( NULL == impl || NULL == other.impl )
return false;
else
return ( 0 == impl->compare(*other.impl) );
}
bool Tlsh::operator!=(const Tlsh& other) const
{
return !(*this==other);
}
int Tlsh::Lvalue()
{
return( impl->Lvalue() );
}
int Tlsh::Q1ratio()
{
return( impl->Q1ratio() );
}
int Tlsh::Q2ratio()
{
return( impl->Q2ratio() );
}
int Tlsh::totalDiff(const Tlsh *other, bool len_diff) const
{
if( NULL==impl || NULL == other || NULL == other->impl )
return -(EINVAL);
else if ( this == other )
return 0;
else
return (impl->totalDiff(*other->impl, len_diff));
}
int Tlsh::fromTlshStr(const char* str)
{
if ( NULL == impl )
return -(ENOMEM);
else if ( NULL == str )
return -(EINVAL);
else
return impl->fromTlshStr(str);
}
bool Tlsh::isValid() const
{
return (impl ? impl->isValid() : false);
}
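
The copy constructor and assignment operator in src/tlsh.cpp copy the implementation object through `*impl = *other.impl`. Because the implementation class owns heap buffers, that pattern is only safe when the implementation copies those buffers rather than sharing the pointers. The following is a minimal sketch of the idea, not the TLSH code itself: `Impl`, `Handle`, `set`, `data` and `size` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical implementation class that owns a heap buffer.
class Impl {
public:
    Impl() : buf(nullptr), len(0) {}
    ~Impl() { delete [] buf; }
    Impl(const Impl&) = delete;  // copy construction is not needed in this sketch

    // Deep copy: allocate a fresh buffer instead of sharing the pointer.
    Impl& operator=(const Impl& other) {
        if (this == &other) return *this;
        unsigned char* fresh = nullptr;
        if (other.buf != nullptr) {
            fresh = new unsigned char[other.len];
            if (other.len > 0) std::memcpy(fresh, other.buf, other.len);
        }
        delete [] buf;
        buf = fresh;
        len = other.len;
        return *this;
    }

    // Replace the owned buffer with a copy of src[0..n).
    void set(const unsigned char* src, unsigned int n) {
        unsigned char* fresh = new unsigned char[n];
        if (n > 0) std::memcpy(fresh, src, n);
        delete [] buf;
        buf = fresh;
        len = n;
    }
    const unsigned char* data() const { return buf; }
    unsigned int size() const { return len; }
private:
    unsigned char* buf;
    unsigned int len;
};

// Hypothetical wrapper that forwards to Impl, in the same style as the Tlsh class.
class Handle {
public:
    Handle() : impl(new Impl()) {}
    Handle(const Handle& other) : impl(new Impl()) { *impl = *other.impl; }
    Handle& operator=(const Handle& other) {
        if (this != &other) *impl = *other.impl;
        return *this;
    }
    ~Handle() { delete impl; }
    void set(const unsigned char* src, unsigned int n) { impl->set(src, n); }
    const unsigned char* data() const { return impl->data(); }
    unsigned int size() const { return impl->size(); }
private:
    Impl* impl;
};
```

With a deep-copying assignment in the implementation class, two wrappers never delete the same buffer, so copies can be destroyed in any order.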

182
src/tlsh.h

@ -0,0 +1,182 @@
// tlsh.h - TrendLSH Hash Algorithm
/*
* TLSH is provided for use under two licenses: Apache OR BSD.
* Users may opt to use either license depending on the license
* restrictions of the systems with which they plan to integrate
* the TLSH code.
*/
/* ==============
* Apache License
* ==============
* Copyright 2013 Trend Micro Incorporated
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ===========
* BSD License
* ===========
* Copyright (c) 2013, Trend Micro Incorporated
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification,
* are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* 3. Neither the name of the copyright holder nor the names of its contributors
* may be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef HEADER_TLSH_H
#define HEADER_TLSH_H
#include "version.h"
#ifndef NULL
#define NULL 0
#endif
#ifdef __cplusplus
class TlshImpl;
#define BUCKETS_128
// Define TLSH_STRING_LEN, which is the string length of the hex value of the Tlsh hash.
// BUCKETS_256 & CHECKSUM_3B are compiler switches defined in CMakeLists.txt
#if defined BUCKETS_256
#if defined CHECKSUM_3B
#define TLSH_STRING_LEN 138
#else
#define TLSH_STRING_LEN 134
#endif
// changed the minimum data length to 256 for version 3.3
#define MIN_DATA_LENGTH 256
// added the -force option for version 3.5
#define MIN_FORCE_DATA_LENGTH 50
#endif
#if defined BUCKETS_128
#if defined CHECKSUM_3B
#define TLSH_STRING_LEN 74
#else
#define TLSH_STRING_LEN 70
#endif
// changed the minimum data length to 256 for version 3.3
#define MIN_DATA_LENGTH 256
// added the -force option for version 3.5
#define MIN_FORCE_DATA_LENGTH 50
#endif
#if defined BUCKETS_48
// No 3 Byte checksum option for 48 Bucket min hash
#define TLSH_STRING_LEN 30
// the 48 bucket min hash accepts inputs as short as 10 bytes
#define MIN_DATA_LENGTH 10
// the -force option (added in version 3.5) uses the same 10 byte minimum here
#define MIN_FORCE_DATA_LENGTH 10
#endif
#define TLSH_STRING_BUFFER_LEN (TLSH_STRING_LEN+1)
#ifdef WINDOWS
#include <WinFunctions.h>
#else
#if defined(__SPARC) || defined(_AS_MK_OS_RH73)
#define TLSH_API
#else
#define TLSH_API __attribute__ ((visibility("default")))
#endif
#endif
class TLSH_API Tlsh{
public:
Tlsh();
Tlsh(const Tlsh& other);
/* allow the user to add data in multiple iterations */
void update(const unsigned char* data, unsigned int len);
/* to signal the class there is no more data to be added */
void final(const unsigned char* data = NULL, unsigned int len = 0, int force_option = 0);
/* to get the hex-encoded hash code */
const char* getHash() const ;
/* to get the hex-encoded hash code without allocating buffer in TlshImpl - bufSize should be TLSH_STRING_BUFFER_LEN */
const char* getHash(char *buffer, unsigned int bufSize) const;
/* to bring the object back to the initial state */
void reset();
// access functions
int Lvalue();
int Q1ratio();
int Q2ratio();
/* calculate difference */
/* The len_diff parameter specifies if the file length is to be included in the difference calculation (len_diff=true) or if it */
/* is to be excluded (len_diff=false). In general, the length should be considered in the difference calculation, but there */
/* could be applications where a part of the adversarial activity might be to add a lot of content. For example to add 1 million */
/* zero bytes at the end of a file. In that case, the caller would want to exclude the length from the calculation. */
int totalDiff(const Tlsh *, bool len_diff=true) const;
/* validate TrendLSH string and reset the hash according to it */
int fromTlshStr(const char* str);
/* check if Tlsh object is valid to operate */
bool isValid() const;
/* Return the version information used to build this library */
static const char *version();
// operators
Tlsh& operator=(const Tlsh& other);
bool operator==(const Tlsh& other) const;
bool operator!=(const Tlsh& other) const;
~Tlsh();
private:
TlshImpl* impl;
};
#ifdef TLSH_DISTANCE_PARAMETERS
void set_tlsh_distance_parameters(int length_mult_value, int qratio_mult_value, int hist_diff1_add_value, int hist_diff2_add_value, int hist_diff3_add_value);
#endif
#endif
#endif
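
The TLSH_STRING_LEN values in src/tlsh.h follow from the digest layout: the checksum bytes, one Lvalue byte, one byte holding the two 4-bit Q ratios, and two bits per bucket, with every byte printed as two hex characters. The arithmetic can be sketched as below; `tlsh_hex_len` is a hypothetical helper name that does not exist in the library.

```cpp
#include <cassert>

// Hypothetical helper: number of hex characters in a digest with the given layout.
constexpr unsigned int tlsh_hex_len(unsigned int buckets, unsigned int checksum_bytes)
{
    // body: 2 bits per bucket, so buckets / 4 bytes
    // header: checksum bytes + 1 byte Lvalue + 1 byte for both Q ratios
    // every byte is written as 2 hex characters
    return 2 * (checksum_bytes + 1 + 1 + buckets / 4);
}

// The results match the constants defined in the header.
static_assert(tlsh_hex_len(256, 3) == 138, "256 buckets, 3 byte checksum");
static_assert(tlsh_hex_len(256, 1) == 134, "256 buckets, 1 byte checksum");
static_assert(tlsh_hex_len(128, 3) == 74,  "128 buckets, 3 byte checksum");
static_assert(tlsh_hex_len(128, 1) == 70,  "128 buckets, 1 byte checksum");
static_assert(tlsh_hex_len(48, 1)  == 30,  "48 buckets, 1 byte checksum");
```

A caller-supplied buffer needs one extra byte for the terminating NUL, which is what TLSH_STRING_BUFFER_LEN adds.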

556
src/tlsh_impl.cpp

@ -0,0 +1,556 @@
/*
* TLSH is provided for use under two licenses: Apache OR BSD.
* Users may opt to use either license depending on the license
* restrictions of the systems with which they plan to integrate
* the TLSH code.
*/
/* ==============
* Apache License
* ==============
* Copyright 2013 Trend Micro Incorporated
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ===========
* BSD License
* ===========
* Copyright (c) 2013, Trend Micro Incorporated
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification,
* are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* 3. Neither the name of the copyright holder nor the names of its contributors
* may be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "tlsh.h"
#include "tlsh_impl.h"
#include "tlsh_util.h"
#include <string>
#include <cassert>
#include <cstdio>
#include <cmath>
#include <algorithm>
#include <string.h>
#include <errno.h>
#define RANGE_LVALUE 256
#define RANGE_QRATIO 16
static void find_quartile(unsigned int *q1, unsigned int *q2, unsigned int *q3, const unsigned int * a_bucket);
static unsigned int partition(unsigned int * buf, unsigned int left, unsigned int right);
////////////////////////////////////////////////////////////////////////////////////////////////
TlshImpl::TlshImpl() : a_bucket(NULL), data_len(0), lsh_code(NULL), lsh_code_valid(false)
{
memset(this->slide_window, 0, sizeof this->slide_window);
memset(&this->lsh_bin, 0, sizeof this->lsh_bin);
assert (sizeof (this->lsh_bin.Q.QR) == sizeof (this->lsh_bin.Q.QB));
}
TlshImpl::~TlshImpl()
{
delete [] this->a_bucket;
delete [] this->lsh_code;
}
void TlshImpl::reset()
{
delete [] this->a_bucket; this->a_bucket = NULL;
memset(this->slide_window, 0, sizeof this->slide_window);
delete [] this->lsh_code; this->lsh_code = NULL;
memset(&this->lsh_bin, 0, sizeof this->lsh_bin);
this->data_len = 0;
this->lsh_code_valid = false;
}
#if SLIDING_WND_SIZE==5
#define SLIDING_WND_SIZE_M1 4
#elif SLIDING_WND_SIZE==4
#define SLIDING_WND_SIZE_M1 3
#elif SLIDING_WND_SIZE==6
#define SLIDING_WND_SIZE_M1 5
#elif SLIDING_WND_SIZE==7
#define SLIDING_WND_SIZE_M1 6
#elif SLIDING_WND_SIZE==8
#define SLIDING_WND_SIZE_M1 7
#endif
void TlshImpl::update(const unsigned char* data, unsigned int len)
{
#define RNG_SIZE SLIDING_WND_SIZE
#define RNG_IDX(i) (((i)+RNG_SIZE)%RNG_SIZE)
int j = (int)(this->data_len % RNG_SIZE);
unsigned int fed_len = this->data_len;
if (this->a_bucket == NULL) {
this->a_bucket = new unsigned int [BUCKETS];
memset(this->a_bucket, 0, sizeof(int)*BUCKETS);
}
for( unsigned int i=0; i<len; i++, fed_len++, j=RNG_IDX(j+1) ) {
this->slide_window[j] = data[i];
if ( fed_len >= SLIDING_WND_SIZE_M1 ) {
// only calculate once a full sliding window has been fed (5 bytes when SLIDING_WND_SIZE is 5)
int j_1 = RNG_IDX(j-1);
int j_2 = RNG_IDX(j-2);
int j_3 = RNG_IDX(j-3);
#if SLIDING_WND_SIZE>=5
int j_4 = RNG_IDX(j-4);
#endif
#if SLIDING_WND_SIZE>=6
int j_5 = RNG_IDX(j-5);
#endif
#if SLIDING_WND_SIZE>=7
int j_6 = RNG_IDX(j-6);
#endif
#if SLIDING_WND_SIZE>=8
int j_7 = RNG_IDX(j-7);
#endif
for (int k = 0; k < TLSH_CHECKSUM_LEN; k++) {
if (k == 0) {
this->lsh_bin.checksum[k] = b_mapping(0, this->slide_window[j], this->slide_window[j_1], this->lsh_bin.checksum[k]);
}
else {
// use calculated 1 byte checksums to expand the total checksum to 3 bytes
this->lsh_bin.checksum[k] = b_mapping(this->lsh_bin.checksum[k-1], this->slide_window[j], this->slide_window[j_1], this->lsh_bin.checksum[k]);
}
}
unsigned char r;
r = b_mapping(2, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_2]);
this->a_bucket[r]++;
r = b_mapping(3, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_3]);
this->a_bucket[r]++;
r = b_mapping(5, this->slide_window[j], this->slide_window[j_2], this->slide_window[j_3]);
this->a_bucket[r]++;
#if SLIDING_WND_SIZE>=5
r = b_mapping(7, this->slide_window[j], this->slide_window[j_2], this->slide_window[j_4]);
this->a_bucket[r]++;
r = b_mapping(11, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_4]);
this->a_bucket[r]++;
r = b_mapping(13, this->slide_window[j], this->slide_window[j_3], this->slide_window[j_4]);
this->a_bucket[r]++;
#endif
#if SLIDING_WND_SIZE>=6
r = b_mapping(17, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_5]);
this->a_bucket[r]++;
r = b_mapping(19, this->slide_window[j], this->slide_window[j_2], this->slide_window[j_5]);
this->a_bucket[r]++;
r = b_mapping(23, this->slide_window[j], this->slide_window[j_3], this->slide_window[j_5]);
this->a_bucket[r]++;
r = b_mapping(29, this->slide_window[j], this->slide_window[j_4], this->slide_window[j_5]);
this->a_bucket[r]++;
#endif
#if SLIDING_WND_SIZE>=7
r = b_mapping(31, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_6]);
this->a_bucket[r]++;
r = b_mapping(37, this->slide_window[j], this->slide_window[j_2], this->slide_window[j_6]);
this->a_bucket[r]++;
r = b_mapping(41, this->slide_window[j], this->slide_window[j_3], this->slide_window[j_6]);
this->a_bucket[r]++;
r = b_mapping(43, this->slide_window[j], this->slide_window[j_4], this->slide_window[j_6]);
this->a_bucket[r]++;
r = b_mapping(47, this->slide_window[j], this->slide_window[j_5], this->slide_window[j_6]);
this->a_bucket[r]++;
#endif
#if SLIDING_WND_SIZE>=8
r = b_mapping(53, this->slide_window[j], this->slide_window[j_1], this->slide_window[j_7]);
this->a_bucket[r]++;
r = b_mapping(59, this->slide_window[j], this->slide_window[j_2], this->slide_window[j_7]);
this->a_bucket[r]++;
r = b_mapping(61, this->slide_window[j], this->slide_window[j_3], this->slide_window[j_7]);
this->a_bucket[r]++;
r = b_mapping(67, this->slide_window[j], this->slide_window[j_4], this->slide_window[j_7]);
this->a_bucket[r]++;
r = b_mapping(71, this->slide_window[j], this->slide_window[j_5], this->slide_window[j_7]);
this->a_bucket[r]++;
r = b_mapping(73, this->slide_window[j], this->slide_window[j_6], this->slide_window[j_7]);
this->a_bucket[r]++;
#endif
}
}
this->data_len += len;
}
/* to signal the class there is no more data to be added */
void TlshImpl::final(int force_option)
{
// incoming data must be at least MIN_DATA_LENGTH bytes long
if ((force_option == 0) && (this->data_len < MIN_DATA_LENGTH)) {
// too little data: leave this->lsh_code empty
delete [] this->a_bucket; this->a_bucket = NULL;
return;
}
if ((force_option) && (this->data_len < MIN_FORCE_DATA_LENGTH)) {
// too little data even with the force option: leave this->lsh_code empty
delete [] this->a_bucket; this->a_bucket = NULL;
return;
}
unsigned int q1, q2, q3;
find_quartile(&q1, &q2, &q3, this->a_bucket);
// buckets must be more than 50% non-zero
int nonzero = 0;
for(unsigned int i=0; i<CODE_SIZE; i++) {
for(unsigned int j=0; j<4; j++) {
if (this->a_bucket[4*i + j] > 0) {
nonzero++;
}
}
}
#if defined BUCKETS_48
if (nonzero < 18) {
// printf("nonzero=%d\n", nonzero);
delete [] this->a_bucket; this->a_bucket = NULL;
return;
}
#else
if (nonzero <= 4*CODE_SIZE/2) {
delete [] this->a_bucket; this->a_bucket = NULL;
return;
}
#endif
for(unsigned int i=0; i<CODE_SIZE; i++) {
unsigned char h=0;
for(unsigned int j=0; j<4; j++) {
unsigned int k = this->a_bucket[4*i + j];
if( q3 < k ) {
h += 3 << (j*2); // leave the optimization j*2 = j<<1 or j*2 = j+j for compiler
} else if( q2 < k ) {
h += 2 << (j*2);
} else if( q1 < k ) {
h += 1 << (j*2);
}
}
this->lsh_bin.tmp_code[i] = h;
}
//Done with a_bucket so deallocate
delete [] this->a_bucket; this->a_bucket = NULL;
this->lsh_bin.Lvalue = l_capturing(this->data_len);
this->lsh_bin.Q.QR.Q1ratio = (unsigned int) ((float)(q1*100)/(float) q3) % 16;
this->lsh_bin.Q.QR.Q2ratio = (unsigned int) ((float)(q2*100)/(float) q3) % 16;
this->lsh_code_valid = true;
}
int TlshImpl::fromTlshStr(const char* str)
{
// Validate input string
for( int i=0; i < TLSH_STRING_LEN; i++ )
if (!(
(str[i] >= '0' && str[i] <= '9') ||
(str[i] >= 'A' && str[i] <= 'F') ||
(str[i] >= 'a' && str[i] <= 'f') ))
{
return 1;
}
this->reset();
lsh_bin_struct tmp;
from_hex( str, TLSH_STRING_LEN, (unsigned char*)&tmp );
// Reconstruct checksum, Q ratios & Lvalue
for (int k = 0; k < TLSH_CHECKSUM_LEN; k++) {
this->lsh_bin.checksum[k] = swap_byte(tmp.checksum[k]);
}
this->lsh_bin.Lvalue = swap_byte( tmp.Lvalue );
this->lsh_bin.Q.QB = swap_byte(tmp.Q.QB);
for( int i=0; i < CODE_SIZE; i++ ){
this->lsh_bin.tmp_code[i] = (tmp.tmp_code[CODE_SIZE-1-i]);
}
this->lsh_code_valid = true;
return 0;
}
const char* TlshImpl::hash(char *buffer, unsigned int bufSize) const
{
if (bufSize < TLSH_STRING_LEN + 1) {
strncpy(buffer, "", bufSize);
return buffer;
}
if (this->lsh_code_valid == false) {
strncpy(buffer, "", bufSize);
return buffer;
}
lsh_bin_struct tmp;
for (int k = 0; k < TLSH_CHECKSUM_LEN; k++) {
tmp.checksum[k] = swap_byte( this->lsh_bin.checksum[k] );
}
tmp.Lvalue = swap_byte( this->lsh_bin.Lvalue );
tmp.Q.QB = swap_byte( this->lsh_bin.Q.QB );
for( int i=0; i < CODE_SIZE; i++ ){
tmp.tmp_code[i] = (this->lsh_bin.tmp_code[CODE_SIZE-1-i]);
}
to_hex( (unsigned char*)&tmp, sizeof(tmp), buffer);
return buffer;
}
/* to get the hex-encoded hash code */
const char* TlshImpl::hash() const
{
if (this->lsh_code != NULL) {
// lsh_code has been previously calculated, so just return it
return this->lsh_code;
}
this->lsh_code = new char [TLSH_STRING_LEN+1];
memset(this->lsh_code, 0, TLSH_STRING_LEN+1);
return hash(this->lsh_code, TLSH_STRING_LEN+1);
}
// compare
int TlshImpl::compare(const TlshImpl& other) const
{
return (memcmp( &(this->lsh_bin), &(other.lsh_bin), sizeof(this->lsh_bin)));
}
////////////////////////////////////////////
// the default for these parameters is 12
////////////////////////////////////////////
static int length_mult = 12;
static int qratio_mult = 12;
#ifdef TLSH_DISTANCE_PARAMETERS
int hist_diff1_add = 1;
int hist_diff2_add = 2;
int hist_diff3_add = 6;
void set_tlsh_distance_parameters(int length_mult_value, int qratio_mult_value, int hist_diff1_add_value, int hist_diff2_add_value, int hist_diff3_add_value)
{
if (length_mult_value != -1) {
length_mult = length_mult_value;
}
if (qratio_mult_value != -1) {
qratio_mult = qratio_mult_value;
}
if (hist_diff1_add_value != -1) {
hist_diff1_add = hist_diff1_add_value;
}
if (hist_diff2_add_value != -1) {
hist_diff2_add = hist_diff2_add_value;
}
if (hist_diff3_add_value != -1) {
hist_diff3_add = hist_diff3_add_value;
}
}
#endif
int TlshImpl::Lvalue()
{
return(this->lsh_bin.Lvalue);
}
int TlshImpl::Q1ratio()
{
return(this->lsh_bin.Q.QR.Q1ratio);
}
int TlshImpl::Q2ratio()
{
return(this->lsh_bin.Q.QR.Q2ratio);
}
int TlshImpl::totalDiff(const TlshImpl& other, bool len_diff) const
{
int diff = 0;
if (len_diff) {
int ldiff = mod_diff( this->lsh_bin.Lvalue, other.lsh_bin.Lvalue, RANGE_LVALUE);
if ( ldiff == 0 )
diff = 0;
else if ( ldiff == 1 )
diff = 1;
else
diff += ldiff*length_mult;
}
int q1diff = mod_diff( this->lsh_bin.Q.QR.Q1ratio, other.lsh_bin.Q.QR.Q1ratio, RANGE_QRATIO);
if ( q1diff <= 1 )
diff += q1diff;
else
diff += (q1diff-1)*qratio_mult;
int q2diff = mod_diff( this->lsh_bin.Q.QR.Q2ratio, other.lsh_bin.Q.QR.Q2ratio, RANGE_QRATIO);
if ( q2diff <= 1)
diff += q2diff;
else
diff += (q2diff-1)*qratio_mult;
for (int k = 0; k < TLSH_CHECKSUM_LEN; k++) {
if (this->lsh_bin.checksum[k] != other.lsh_bin.checksum[k] ) {
diff ++;
break;
}
}
diff += h_distance( CODE_SIZE, this->lsh_bin.tmp_code, other.lsh_bin.tmp_code );
return (diff);
}
#define SWAP_UINT(x,y) do {\
unsigned int int_tmp = (x); \
(x) = (y); \
(y) = int_tmp; } while(0)
void find_quartile(unsigned int *q1, unsigned int *q2, unsigned int *q3, const unsigned int * a_bucket)
{
unsigned int bucket_copy[EFF_BUCKETS], short_cut_left[EFF_BUCKETS], short_cut_right[EFF_BUCKETS], spl=0, spr=0;
unsigned int p1 = EFF_BUCKETS/4-1;
unsigned int p2 = EFF_BUCKETS/2-1;
unsigned int p3 = EFF_BUCKETS-EFF_BUCKETS/4-1;
unsigned int end = EFF_BUCKETS-1;
for(unsigned int i=0; i<=end; i++) {
bucket_copy[i] = a_bucket[i];
}
for( unsigned int l=0, r=end; ; ) {
unsigned int ret = partition( bucket_copy, l, r );
if( ret > p2 ) {
r = ret - 1;
short_cut_right[spr] = ret;
spr++;
} else if( ret < p2 ){
l = ret + 1;
short_cut_left[spl] = ret;
spl++;
} else {
*q2 = bucket_copy[p2];
break;
}
}
short_cut_left[spl] = p2-1;
short_cut_right[spr] = p2+1;
for( unsigned int i=0, l=0; i<=spl; i++ ) {
unsigned int r = short_cut_left[i];
if( r > p1 ) {
for( ; ; ) {
unsigned int ret = partition( bucket_copy, l, r );
if( ret > p1 ) {
r = ret-1;
} else if( ret < p1 ) {
l = ret+1;
} else {
*q1 = bucket_copy[p1];
break;
}
}
break;
} else if( r < p1 ) {
l = r;
} else {
*q1 = bucket_copy[p1];
break;
}
}
for( unsigned int i=0, r=end; i<=spr; i++ ) {
unsigned int l = short_cut_right[i];
if( l < p3 ) {
for( ; ; ) {
unsigned int ret = partition( bucket_copy, l, r );
if( ret > p3 ) {
r = ret-1;
} else if( ret < p3 ) {
l = ret+1;
} else {
*q3 = bucket_copy[p3];
break;
}
}
break;
} else if( l > p3 ) {
r = l;
} else {
*q3 = bucket_copy[p3];
break;
}
}
}
unsigned int partition(unsigned int * buf, unsigned int left, unsigned int right)
{
if( left == right ) {
return left;
}
if( left+1 == right ) {
if( buf[left] > buf[right] ) {
SWAP_UINT( buf[left], buf[right] );
}
return left;
}
unsigned int ret = left, pivot = (left + right)>>1;
unsigned int val = buf[pivot];
buf[pivot] = buf[right];
buf[right] = val;
for( unsigned int i = left; i < right; i++ ) {
if( buf[i] < val ) {
SWAP_UINT( buf[ret], buf[i] );
ret++;
}
}
buf[right] = buf[ret];
buf[ret] = val;
return ret;
}
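
The body of the digest is produced in `TlshImpl::final()` by comparing each bucket count with the three quartile boundaries and storing the result as a 2-bit value, four buckets per byte. A minimal standalone sketch of that step follows; `encode_buckets` is a hypothetical name, and the quartiles are passed in rather than computed by `find_quartile`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the encoding loop in final():
// each bucket becomes a 2-bit level (0..3), four buckets per output byte.
std::vector<unsigned char> encode_buckets(const std::vector<unsigned int>& buckets,
                                          unsigned int q1, unsigned int q2, unsigned int q3)
{
    std::vector<unsigned char> code(buckets.size() / 4, 0);
    for (std::size_t i = 0; i < code.size(); i++) {
        unsigned char h = 0;
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int k = buckets[4 * i + j];
            unsigned int level = 0;          // k <= q1
            if (q3 < k)      level = 3;      // above the third quartile
            else if (q2 < k) level = 2;      // above the median
            else if (q1 < k) level = 1;      // above the first quartile
            // bucket j occupies bits 2j and 2j+1 of the byte
            h |= static_cast<unsigned char>(level << (j * 2));
        }
        code[i] = h;
    }
    return code;
}
```

For example, the buckets {0, 5, 10, 20} with boundaries 2, 7 and 15 map to the levels 0, 1, 2 and 3, which pack into the single byte 0xE4.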

150
src/tlsh_impl.h

@ -0,0 +1,150 @@
/*
* TLSH is provided for use under two licenses: Apache OR BSD.
* Users may opt to use either license depending on the license
* restrictions of the systems with which they plan to integrate
* the TLSH code.
*/
/* ==============
* Apache License
* ==============
* Copyright 2013 Trend Micro Incorporated
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ===========
* BSD License
* ===========
* Copyright (c) 2013, Trend Micro Incorporated
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification,
* are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* 3. Neither the name of the copyright holder nor the names of its contributors
* may be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef HEADER_TLSH_IMPL_H
#define HEADER_TLSH_IMPL_H
#define SLIDING_WND_SIZE 5
#define BUCKETS 256
#define Q_BITS 2 // 2 bits; quartile value 0, 1, 2, 3
// BUCKETS_256 & CHECKSUM_3B are compiler switches defined in CMakeLists.txt
#if defined BUCKETS_256
#define EFF_BUCKETS 256
#define CODE_SIZE 64 // 256 * 2 bits = 64 bytes
#if defined CHECKSUM_3B
#define TLSH_CHECKSUM_LEN 3
// defined in tlsh.h #define TLSH_STRING_LEN 138 // 2 + 3 + 64 bytes = 138 hexadecimal chars
#else
#define TLSH_CHECKSUM_LEN 1
// defined in tlsh.h #define TLSH_STRING_LEN 134 // 2 + 1 + 64 bytes = 134 hexadecimal chars
#endif
#endif
#if defined BUCKETS_128
#define EFF_BUCKETS 128
#define CODE_SIZE 32 // 128 * 2 bits = 32 bytes
#if defined CHECKSUM_3B
#define TLSH_CHECKSUM_LEN 3
// defined in tlsh.h #define TLSH_STRING_LEN 74 // 2 + 3 + 32 bytes = 74 hexadecimal chars
#else
#define TLSH_CHECKSUM_LEN 1
// defined in tlsh.h #define TLSH_STRING_LEN 70 // 2 + 1 + 32 bytes = 70 hexadecimal chars
#endif
#endif
#if defined BUCKETS_48
#define EFF_BUCKETS 48
#define CODE_SIZE 12 // 48 * 2 bits = 12 bytes
#define TLSH_CHECKSUM_LEN 1
// defined in tlsh.h #define TLSH_STRING_LEN 30 // 2 + 1 + 12 bytes = 30 hexadecimal chars
#endif
class TlshImpl
{
public:
TlshImpl();
~TlshImpl();
public:
void update(const unsigned char* data, unsigned int len);
void final(int force_option = 0);
void reset();
const char* hash() const;
const char* hash(char *buffer, unsigned int bufSize) const; // saves allocating hash string in TLSH instance - bufSize should be TLSH_STRING_LEN + 1
int compare(const TlshImpl& other) const;
int totalDiff(const TlshImpl& other, bool len_diff=true) const;
int Lvalue();
int Q1ratio();
int Q2ratio();
int fromTlshStr(const char* str);
bool isValid() const { return lsh_code_valid; }
private:
unsigned int *a_bucket;
unsigned char slide_window[SLIDING_WND_SIZE];
unsigned int data_len;
struct lsh_bin_struct {
unsigned char checksum[TLSH_CHECKSUM_LEN]; // 1 to 3 bytes
unsigned char Lvalue; // 1 byte
union {
#if defined(__SPARC) || defined(_AIX)
#pragma pack(1)
#endif
unsigned char QB;
struct{
#if defined(__SPARC) || defined(_AIX)
unsigned char Q2ratio : 4;
unsigned char Q1ratio : 4;
#else
unsigned char Q1ratio : 4;
unsigned char Q2ratio : 4;
#endif
} QR;
} Q; // 1 byte
unsigned char tmp_code[CODE_SIZE]; // 32/64 bytes
} lsh_bin;
mutable char *lsh_code; // allocated when hash() function without buffer is called - 70/134 bytes or 74/138 bytes
bool lsh_code_valid; // true iff final() or fromTlshStr complete successfully
};
#endif
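
The `Q` union in `lsh_bin_struct` stores the two 4-bit quartile ratios in one byte. The allocation order of bit-fields is implementation-defined, which is presumably why the header swaps the field order on SPARC and AIX. The sketch below shows the same idea with explicit shifts instead of bit-fields. The helper names are hypothetical, and placing Q1 in the low nibble is an assumption made for this example, not something the header guarantees.

```cpp
#include <cassert>

// Hypothetical helper: the ratio computation used in final(), reduced to 4 bits.
// q3 must be non-zero.
unsigned char q_ratio(unsigned int q, unsigned int q3)
{
    unsigned int pct = static_cast<unsigned int>(
        static_cast<float>(q * 100) / static_cast<float>(q3));
    return static_cast<unsigned char>(pct % 16);
}

// Pack both ratios into one byte: Q1 in the low nibble, Q2 in the high nibble
// (assumed order for this sketch).
unsigned char pack_q(unsigned char q1ratio, unsigned char q2ratio)
{
    return static_cast<unsigned char>((q1ratio & 0x0F) | ((q2ratio & 0x0F) << 4));
}

unsigned char unpack_q1(unsigned char qb) { return qb & 0x0F; }
unsigned char unpack_q2(unsigned char qb) { return (qb >> 4) & 0x0F; }
```

With quartiles 10, 20 and 40, the ratios are 25 % 16 = 9 and 50 % 16 = 2, and they pack into the byte 0x29.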

4876
src/tlsh_util.cpp

File diff suppressed because it is too large

70
src/tlsh_util.h

@ -0,0 +1,70 @@
/*
* TLSH is provided for use under two licenses: Apache OR BSD.
* Users may opt to use either license depending on the license
* restrictions of the systems with which they plan to integrate
* the TLSH code.
*/
/* ==============
* Apache License
* ==============
* Copyright 2013 Trend Micro Incorporated
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* ===========
* BSD License
* ===========
* Copyright (c) 2013, Trend Micro Incorporated
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification,
* are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
* 3. Neither the name of the copyright holder nor the names of its contributors
* may be used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
* INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
* OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef HEADER_TLSH_UTIL_H
#define HEADER_TLSH_UTIL_H
unsigned char b_mapping(unsigned char salt, unsigned char i, unsigned char j, unsigned char k);
unsigned char l_capturing(unsigned int len);
int mod_diff(unsigned int x, unsigned int y, unsigned int R);
int h_distance( int len, const unsigned char x[], const unsigned char y[]);
void to_hex( unsigned char * psrc, int len, char* pdest);
void from_hex( const char* psrc, int len, unsigned char* pdest);
unsigned char swap_byte( const unsigned char in );
#endif
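
The diff for src/tlsh_util.cpp is suppressed, so the body of `mod_diff` is not visible on this page. Judging from how `totalDiff()` calls it with a range argument (256 for the Lvalue, 16 for the Q ratios), it is assumed here to return the shortest distance between two values on a ring of size R. The sketch below implements that assumption under the hypothetical name `ring_diff`, together with the header scoring rules that are visible in `totalDiff()`; `q_score` and `length_score` are also hypothetical names.

```cpp
#include <cassert>

// Assumed behavior of mod_diff: shortest distance between x and y on a ring
// of size R. Both x and y must be smaller than R.
int ring_diff(unsigned int x, unsigned int y, unsigned int R)
{
    unsigned int direct = (x > y) ? (x - y) : (y - x);
    unsigned int wrapped = R - direct;
    return static_cast<int>(direct < wrapped ? direct : wrapped);
}

// Q ratio rule from totalDiff(): a difference of 0 or 1 counts as-is,
// larger differences are scaled by the multiplier (12 by default).
int q_score(unsigned int a, unsigned int b, int qratio_mult = 12)
{
    int d = ring_diff(a, b, 16);
    return (d <= 1) ? d : (d - 1) * qratio_mult;
}

// Length rule from totalDiff(): 0 or 1 counts as-is, otherwise d * length_mult.
int length_score(unsigned int a, unsigned int b, int length_mult = 12)
{
    int d = ring_diff(a, b, 256);
    return (d <= 1) ? d : d * length_mult;
}
```

Under this assumption, Lvalues 0 and 255 are neighbors on the ring and contribute only 1 to the distance, while Lvalues 10 and 13 contribute 3 * 12 = 36.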

10
src/version.h

@ -0,0 +1,10 @@
/****************************************************
* This file is generated by cmake. Modify the top
* level CMakeLists.txt to change the VERSION numbers
****************************************************/
#define VERSION_MAJOR 3
#define VERSION_MINOR 9
#define VERSION_PATCH 1
#define TLSH_HASH "compact hash"
#define TLSH_CHECKSUM "1 byte checksum"

2
tests/test-all.R

@ -0,0 +1,2 @@
library(testthat)
test_check("tlsh")

6
tests/testthat/test-tlsh.R

@ -0,0 +1,6 @@
context("minimal package functionality")
test_that("we can do something", {
#expect_that(some_function(), is_a("data.frame"))
})

21
tlsh.Rproj

@ -0,0 +1,21 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
StripTrailingWhitespace: Yes
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageBuildArgs: --resave-data
PackageRoxygenize: rd,collate,namespace