mirror of https://gitlab.com/hrbrmstr/tdigest.git
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
3.4 KiB
3.4 KiB
---
output: rmarkdown::github_document
editor_options:
chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```
```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges(zenodo = "10.5281/3357770")
```
# tdigest
Wicked Fast, Accurate Quantiles Using 't-Digests'
## Description
The t-Digest construction algorithm uses a variant of 1-dimensional
k-means clustering to produce a very compact data structure that allows
accurate estimation of quantiles. This t-Digest data structure can be used
to estimate quantiles, compute other rank statistics or even to estimate
related measures like trimmed means. The advantage of the t-Digest over
previous digests for this purpose is that the t-Digest handles data with
full floating point resolution. The accuracy of quantile estimates produced
by t-Digests can be orders of magnitude more accurate than those produced
by previous digest algorithms. Methods are provided to create and update
t-Digests and retreive quantiles from the accumulated distributions.
See [the original paper by Ted Dunning & Otmar Ertl](https://arxiv.org/abs/1902.04023) for more details on t-Digests.
## What's Inside The Tin
The following functions are implemented:
```{r ingredients, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::describe_ingredients()
```
## Installation
```{r install-ex, results='asis', echo = FALSE}
hrbrpkghelpr::install_block()
```
## Usage
```{r lib-ex}
library(tdigest)
# current version
packageVersion("tdigest")
```
### Basic (Low-level interface)
```{r basic}
td <- td_create(10)
td
td_total_count(td)
td_add(td, 0, 1) %>%
td_add(10, 1)
td_total_count(td)
td_value_at(td, 0.1) == 0
td_value_at(td, 0.5) == 5
quantile(td)
```
#### Bigger (and Vectorised)
```{r bigger}
td <- tdigest(c(0, 10), 10)
is_tdigest(td)
td_value_at(td, 0.1) == 0
td_value_at(td, 0.5) == 5
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
td_total_count(td)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
quantile(td)
```
#### Serialization
These [de]serialization functions make it possible to create & populate a tdigest,
serialize it out, read it in at a later time and continue populating it enabling
compact distribution accumulation & storage for large, "continuous" datasets.
```{r serialize}
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
str(in_r <- as.list(td), 1)
td2 <- as_tdigest(in_r)
tquantile(td2, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
identical(in_r, as.list(td2))
```
#### ALTREP-aware
```{r altrep}
N <- 1000000
x.altrep <- seq_len(N) # this is an ALTREP in R version >= 3.5.0
td <- tdigest(x.altrep)
td[0.1]
td[0.5]
length(td)
```
#### Proof it's faster
```{r benchmark, cache=TRUE}
microbenchmark::microbenchmark(
tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
```
## tdigest Metrics
```{r cloc, echo=FALSE}
cloc::cloc_pkg_md()
```
## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.