boB Rudis 438497c71f

tdigest

Wicked Fast, Accurate Quantiles Using ‘t-Digests’

Description

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, to compute other rank statistics, or even to estimate related measures such as trimmed means. The advantage of the t-digest over previous digests for this purpose is that the t-digest handles data with full floating point resolution. With small changes, the t-digest can handle values from any ordered set for which we can compute something akin to a mean. Quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by previous digest algorithms.
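
To make the idea of summarising data as weighted centroids more concrete, here is a minimal base R sketch. It is not the t-digest algorithm: it uses equal-sized clusters, whereas a real t-digest keeps smaller clusters near the tails. The names `toy_digest` and `toy_quantile` are hypothetical and exist only for this illustration.

```r
# Toy centroid summary: NOT the real t-digest algorithm, only the general idea.
# Assumption: clusters are contiguous and roughly equal in size.
toy_digest <- function(x, n_clusters = 100) {
  x <- sort(x)
  # Assign the sorted values to contiguous, roughly equal-sized groups.
  grp <- ceiling(seq_along(x) * n_clusters / length(x))
  data.frame(
    mean  = as.numeric(tapply(x, grp, mean)),   # centroid of each cluster
    count = as.numeric(tapply(x, grp, length))  # number of points it holds
  )
}

toy_quantile <- function(d, probs) {
  # Place each centroid at the midpoint of the cumulative mass it covers.
  mid <- (cumsum(d$count) - d$count / 2) / sum(d$count)
  # Interpolate linearly between centroids; clamp to the end centroids outside.
  approx(mid, d$mean, xout = probs, rule = 2)$y
}

set.seed(1)
x <- rnorm(100000)
d <- toy_digest(x, 100)

# 100,000 values are reduced to 100 (mean, count) pairs.
nrow(d)

# Compare the summary-based estimates with the exact sample quantiles.
toy_quantile(d, c(0.25, 0.5, 0.75))
quantile(x, c(0.25, 0.5, 0.75))
```

Because only the centroids are kept, memory use depends on the number of clusters rather than on the number of observations, which is the property the package exploits.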

See the original paper by Ted Dunning for more details on t-Digests.

What’s Inside The Tin

The following functions are implemented:

  • is_tdigest: Test to see if an object is classed as tdigest
  • tdigest: Create a new t-digest histogram from a vector
  • td_add: Add a value to the t-digest with the specified count
  • td_create: Allocate a new histogram
  • td_merge: Merge one t-digest into another
  • td_quantile_of: Return the quantile of the value
  • td_total_count: Total items contained in the t-digest
  • td_value_at: Return the value at the specified quantile
  • tquantile: Calculate sample quantiles from a t-digest
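
Two of the functions listed above, td_merge and td_quantile_of, can be sketched as follows. This assumes the argument order `td_merge(from, into)` and `td_quantile_of(td, val)`, and that `td_merge` returns the digest it merged into; check the package help pages before relying on either assumption.

```r
library(tdigest)

set.seed(1492)

# Build two digests from separate chunks of data, e.g. from two workers.
a <- tdigest(runif(10000), 100)
b <- tdigest(runif(10000), 100)

# Assumption: the first argument is merged into the second.
merged <- td_merge(a, b)

# The merged digest should account for the items of both inputs.
td_total_count(merged)

# td_quantile_of() maps a value to its estimated quantile;
# td_value_at() goes in the other direction.
q <- td_quantile_of(merged, 0.5)
td_value_at(merged, q)
```

Merging is what makes digests convenient for chunked or distributed data: each chunk is summarised on its own and only the compact digests need to be combined.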

Installation

install.packages("tdigest", repos="https://cinc.rud.is/")
# or 
devtools::install_git("https://git.sr.ht/~hrbrmstr/tdigest")
# or
devtools::install_gitlab("hrbrmstr/tdigest")
# or (if you must)
devtools::install_github("hrbrmstr/tdigest")

Usage

library(tdigest)

# current version
packageVersion("tdigest")
## [1] '0.2.0'

Basic (Low-level interface)

td <- td_create(10)

td
## <tdigest; size=0>

td_total_count(td)
## [1] 0

td_add(td, 0, 1) %>% 
  td_add(10, 1)
## <tdigest; size=2>

td_total_count(td)
## [1] 2

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

quantile(td)
## [1]  0  0  5 10 10

Bigger (and Vectorised)

td <- tdigest(c(0, 10), 10)

is_tdigest(td)
## [1] TRUE

td_value_at(td, 0.1) == 0
## [1] TRUE
td_value_at(td, 0.5) == 5
## [1] TRUE

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

td_total_count(td)
## [1] 1e+06

tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
##  [1]   0.0000000   0.8632378   9.6763281  19.7028368  29.7718982  39.9706864  50.0032181  60.0859360  70.1951621
## [10]  80.2785864  90.3290326  99.5151872 100.0000000

quantile(td)
## [1]   0.00000  24.81839  50.00322  75.23076 100.00000
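
As a rough accuracy check, the estimates can be placed next to the exact sample quantiles from base R. This sketch reuses only the functions shown above; the choice of `probs` is arbitrary, and no output is shown because it was not recorded for this README.

```r
library(tdigest)

set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)

probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)

estimate <- tquantile(td, probs)            # t-digest estimates
exact <- quantile(x, probs, names = FALSE)  # exact sample quantiles

# Absolute error of each t-digest estimate against the exact sample quantile.
data.frame(probs, estimate, exact, abs_error = abs(estimate - exact))
```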

Proof it’s faster

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr       min        lq        mean     median         uq       max neval
##     tdigest    26.334    31.162    61.02889    67.3605    71.0985   177.618   100
##  r_quantile 61909.704 64146.167 66500.42677 65329.2830 68093.9355 78102.683   100

tdigest Metrics

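| Lang         | # Files |  (%) | LoC |  (%) | Blank lines |  (%) | # Lines |  (%) |
|:-------------|--------:|-----:|----:|-----:|------------:|-----:|--------:|-----:|
| C            |       3 | 0.27 | 337 | 0.68 |          45 | 0.38 |      26 | 0.11 |
| R            |       6 | 0.55 | 118 | 0.24 |          25 | 0.21 |     133 | 0.56 |
| Rmd          |       1 | 0.09 |  32 | 0.06 |          37 | 0.32 |      51 | 0.21 |
| C/C++ Header |       1 | 0.09 |  10 | 0.02 |          10 | 0.09 |      28 | 0.12 |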

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.