Rcpp/C++11 wrapper for <https://github.com/nlohmann/json>
@ -26,7 +30,7 @@ The least painful way to do this is to install gcc >= 4.9 (and you should instal
FC=ccache gfortran
F77=ccache gfortran
### Why `ndjson` + Examples
## Why `ndjson` + Examples
One example of such files is the output from Rapid7 internet-wide scans, such as their [HTTPS study](https://scans.io/study/sonar.https). A gzip'd extract of 100,000 records from one of those scans weighs in at about 171MB. The records sometimes contain heavily nested JSON elements, depending on how comprehensive the certificate data and other fields were. A typical record looks like this:
@ -93,6 +97,8 @@ All of the certificate sub-field data elents have been expanded and we have a hi
However, if you do end up trying to work with that scan data, it's highly recommended that you use `jq` to filter out the fields or records you want into a more compact ndjson file.
## What's inside the tin?
The following functions are implemented:
- `stream_in`: Stream in ndjson from a file (handles `.gz` files)
@ -101,7 +107,7 @@ The following functions are implemented:
There are no current plans for a `stream_out()` function since `jsonlite::stream_out()` does a great job tossing `data.frame`-like structures out to an ndjson file.
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.
Rcpp/C++11 wrapper for <https://github.com/nlohmann/json>
The goal is to create a completely "flat" `data.frame`-like structure from ndjson records in plain text ndjson files or gzip'd ndjson files.
The goal is to create a completely “flat” `data.frame`-like structure
from ndjson records in plain text ndjson files or gzip’d ndjson files.
### Installation guidance for Linux/BSD-ish systems
CRAN has binaries for Windows and macOS. To build this on UNIX-like systems, you need at least g++4.9 or clang++. This is a forced requirement by the ndjson library.
CRAN has binaries for Windows and macOS. To build this on UNIX-like
systems, you need at least g++4.9 or clang++. This is a forced
requirement by the ndjson library.
The least painful way to do this is to install gcc >= 4.9 (and you should install ccache while you're at it) and modify `~/.R/Makevars` thusly:
The least painful way to do this is to install gcc \>= 4.9 (and you
should install ccache while you’re at it) and modify `~/.R/Makevars`
thusly:
# Use whatever version of (g++ >=4.9 or clang++) that you downloaded
VER=-4.9
@ -21,9 +33,14 @@ The least painful way to do this is to install gcc >= 4.9 (and you should ins
FC=ccache gfortran
F77=ccache gfortran
### Why `ndjson` + Examples
## Why `ndjson` + Examples
One example of such files is the output from Rapid7 internet-wide scans, such as their [HTTPS study](https://scans.io/study/sonar.https). A gzip'd extract of 100,000 records from one of those scans weighs in at about 171MB. The records sometimes contain heavily nested JSON elements, depending on how comprehensive the certificate data and other fields were. A typical record looks like this:
One example of such files is the output from Rapid7 internet-wide scans,
such as their [HTTPS study](https://scans.io/study/sonar.https). A
gzip’d extract of 100,000 records from one of those scans weighs in at
about 171MB. The records sometimes contain heavily nested JSON elements,
depending on how comprehensive the certificate data and other fields
were. A typical record looks like this:
{
"vhost": "teamchat.buzzpoints.com",
@ -38,10 +55,13 @@ An example of such files are the output from Rapid7 internet-wide scans, such as
A `system.time(df <- stream_in("https-extract.json.gz"))` results in:
user system elapsed
14.822 0.224 15.189
```
user system elapsed
14.822 0.224 15.189
```
on a 13" MacBook Pro and produces:
on a 13" MacBook Pro and
produces:
Classes ‘tbl_dt’, ‘tbl’, ‘data.table’ and 'data.frame': 100000 obs. of 36 variables:
@ -82,27 +102,40 @@ on a 13" MacBook Pro and produces:
$ certsubject.Mail : chr NA NA NA NA ...
- attr(*, ".internal.selfref")=<externalptr>
All of the certificate sub-field data elements have been expanded and we have a highly performant `tbl_dt` to work with now, either in `dplyr` syntax or `data.table` hieroglyphic syntax. Just go see what you have to do in `jsonlite` to get a similar output (and how long it will take).
All of the certificate sub-field data elements have been expanded and we
have a highly performant `tbl_dt` to work with now, either in `dplyr`
syntax or `data.table` hieroglyphic syntax. Just go see what you have to
do in `jsonlite` to get a similar output (and how long it will take).
`pryr::object_size(df)` for that shows it's consuming `394 MB`, which means we can read in many more extracts comfortably on a reasonably configured system and most (if not all) of it on a well-configured AWS box.
`pryr::object_size(df)` for that shows it’s consuming `394 MB`, which
means we can read in many more extracts comfortably on a reasonably
configured system and most (if not all) of it on a well-configured AWS
box.
However, if you do end up trying to work with that scan data, it's highly recommended that you use `jq` to filter out the fields or records you want into a more compact ndjson file.
However, if you do end up trying to work with that scan data, it’s
highly recommended that you use `jq` to filter out the fields or records
you want into a more compact ndjson file.
## What’s inside the tin?
The following functions are implemented:
- `stream_in`: Stream in ndjson from a file (handles `.gz` files)
- `validate`: Validate JSON records in an ndjson file (handles `.gz` files)
- `flatten`: Flatten a character vector of individual JSON lines
- `stream_in`: Stream in ndjson from a file (handles `.gz` files)
- `validate`: Validate JSON records in an ndjson file (handles `.gz`
files)
- `flatten`: Flatten a character vector of individual JSON lines
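
As a rough illustration of how the three functions fit together, here is a minimal sketch. The two JSON records and the temporary file are made up for the example, and it assumes the `ndjson` package is installed; nested fields are expected to come back as dot-separated column names, much like the `certsubject.*` columns shown earlier.

```r
library(ndjson)

# Two hypothetical ndjson records, one JSON document per element
json_lines <- c(
  '{"host": "a.example.com", "cert": {"issuer": "X", "bits": 2048}}',
  '{"host": "b.example.com", "cert": {"issuer": "Y"}}'
)

# flatten() expands nested fields into a flat, data.frame-like structure
flat <- flatten(json_lines)
print(names(flat))

# Write the same records to a temporary file so that validate() and
# stream_in() have something to read
path <- tempfile(fileext = ".json")
writeLines(json_lines, path)

ok <- validate(path)   # TRUE when every line is a legal JSON record
df <- stream_in(path)  # same flat structure, read from the file
print(df)
```

Fields that are missing from a record (here, `cert.bits` in the second one) should show up as `NA` in the flattened result.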
There are no current plans for a `stream_out()` function since `jsonlite::stream_out()` does a great job tossing `data.frame`-like structures out to an ndjson file.
There are no current plans for a `stream_out()` function since
`jsonlite::stream_out()` does a great job tossing `data.frame`-like
structures out to an ndjson file.
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.
Please note that this project is released with a [Contributor Code of
Conduct](CONDUCT.md). By participating in this project you agree to
abide by its terms.