jaq

Build status Crates.io Documentation Rust 1.64+

jaq (pronounced like Jacques [1]) is a clone of the JSON data processing tool jq. jaq aims to support a large subset of jq's syntax and operations.

jaq focuses on three goals:

  • Correctness: jaq aims to provide a more correct and predictable implementation of jq, while preserving compatibility with jq in most cases.

    Examples of surprising jq behaviour
    • nan > nan is false, while nan < nan is true.
    • [[]] | implode crashes jq, and this had not been fixed at the time of writing despite having been known for five years.
    • The jq manual claims that limit(n; exp) "extracts up to n outputs from exp". This holds for values of n > 1, e.g. jq -n '[limit(2; 1, 2, 3)]' yields [1, 2], but when n == 0, jq -n '[limit(0; 1, 2, 3)]' yields [1] instead of []. And perhaps even worse, when n < 0, then limit yields all outputs from exp, which is not documented.
  • Performance: I originally created jaq because I was bothered by jq's long start-up time, which amounts to about 50ms on my machine. This is particularly noticeable when processing a large number of small files. jaq starts up about 30 times faster than jq 1.6 and also outperforms jq on many other benchmarks.

  • Simplicity: jaq aims to have a simple and small implementation, in order to reduce the potential for bugs and to facilitate contributions.

I drew inspiration from another Rust program, namely jql. However, unlike jql, jaq aims to closely imitate jq's syntax and semantics. This should allow users proficient in jq to easily use jaq.

Installation

From Source

To compile jaq, you need a Rust toolchain. See https://rustup.rs/ for instructions. (Note that Rust compilers shipped with Linux distributions may be too outdated to compile jaq.)

Any of the following commands install jaq:

$ cargo install --locked jaq
$ cargo install --locked --git https://github.com/01mf02/jaq # latest development version

On my system, both commands place the executable at ~/.cargo/bin/jaq.

If you have cloned this repository, you can also build jaq by executing one of the commands in the cloned repository:

$ cargo build --release # places binary into target/release/jaq
$ cargo install --locked --path jaq # installs binary

jaq should work on any system supported by Rust. If it does not, please file an issue.

Binaries

You may also install jaq using homebrew on macOS or Linux:

$ brew install jaq
$ brew install --HEAD jaq # latest development version

Or using scoop on Windows:

> scoop install main/jaq

Examples

The following examples should give an impression of what jaq can currently do. You should obtain the same outputs by replacing jaq with jq. If not, your filing an issue would be appreciated. :) The syntax is documented in the jq manual.

Access a field:

$ echo '{"a": 1, "b": 2}' | jaq '.a'
1

Add values:

$ echo '{"a": 1, "b": 2}' | jaq 'add'
3

Construct an array from an object in two ways and show that they are equal:

$ echo '{"a": 1, "b": 2}' | jaq '[.a, .b] == [.[]]'
true

Apply a filter to all elements of an array and filter the results:

$ echo '[0, 1, 2, 3]' | jaq 'map(.*2) | [.[] | select(. < 5)]'
[0, 2, 4]

Read (slurp) input values into an array and get the average of its elements:

$ echo '1 2 3 4' | jaq -s 'add / length'
2.5

Repeatedly apply a filter to itself and output the intermediate results:

$ echo '0' | jaq '[recurse(.+1; . < 3)]'
[0, 1, 2]

Lazily fold over inputs and output intermediate results:

$ seq 1000 | jaq -n 'foreach inputs as $x (0; . + $x)'
1 3 6 10 15 [...]

Performance

The following evaluation consists of several benchmarks that allow comparing the performance of jaq, jq, and gojq. The empty benchmark runs the filter empty n times with null input, serving to measure the startup time. The bf-fib benchmark runs a Brainfuck interpreter written in jq, interpreting a Brainfuck script that produces n Fibonacci numbers. The other benchmarks evaluate various filters with n as input; see bench.sh for details.

I generated the benchmark data with bench.sh target/release/jaq jq-1.7 gojq-0.12.13 jq-1.6 | tee bench.json on a Linux system with an AMD Ryzen 5 5500U [2]. I then processed the results with a "one-liner" (stretching the term and the line a bit):

jq -rs '.[] | "|`\(.name)`|\(.n)|" + ([.time[] | min | (.*1000|round)? // "N/A"] | min as $total_min | map(if . == $total_min then "**\(.)**" else "\(.)" end) | join("|"))' bench.json

(Of course, you can also use jaq here instead of jq.) Finally, I concatenated the table header with the output and piped it through pandoc -t gfm.

Table: Evaluation results in milliseconds ("N/A" if more than 10 seconds).

Benchmark           n  jaq-1.2  jq-1.7  gojq-0.12.13  jq-1.6
empty             512      650     790           740    8340
bf-fib             13      410    1280           820    1420
reverse       1048576       60     680           310     630
sort          1048576      140     530           600     670
group-by      1048576      420    1850          1680    2830
min-max       1048576      220     320           290     310
add           1048576      480     650          1540     750
kv             131072      160     150           250     200
kv-update      131072      190     530           570     N/A
kv-entries     131072      580    1170           820    1110
ex-implode    1048576      460    1110           740    1080
reduce        1048576      740     880           N/A     850
try-catch     1048576      180     330           480     650
tree-flatten       17      650     360             0     480
tree-update        17      450     980          1850    1180
tree-paths         17      450     380           920     470
to-fromjson     65536       40     370           100     380
ack                 7      570     680          1090     610
range-prop        128      260     310           320     580

The results show that jaq-1.2 is fastest on 16 benchmarks, whereas jq-1.7 is fastest on 2 benchmarks and gojq-0.12.13 is fastest on 1 benchmark. gojq is much faster on tree-flatten because it implements the filter flatten natively instead of by definition.
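The tally in the preceding paragraph can be re-derived from the table. The following Python sketch copies the rows verbatim, treats N/A as slower than any measured time (an assumption made only for this count), and counts which column holds the smallest value in each row. The names TOOLS, ROWS, and fastest_counts are made up for this illustration.

```python
from collections import Counter

# Column order as in the table above.
TOOLS = ["jaq-1.2", "jq-1.7", "gojq-0.12.13", "jq-1.6"]

# Rows copied from the table: name, n, then one time in ms per tool.
ROWS = """\
empty 512 650 790 740 8340
bf-fib 13 410 1280 820 1420
reverse 1048576 60 680 310 630
sort 1048576 140 530 600 670
group-by 1048576 420 1850 1680 2830
min-max 1048576 220 320 290 310
add 1048576 480 650 1540 750
kv 131072 160 150 250 200
kv-update 131072 190 530 570 N/A
kv-entries 131072 580 1170 820 1110
ex-implode 1048576 460 1110 740 1080
reduce 1048576 740 880 N/A 850
try-catch 1048576 180 330 480 650
tree-flatten 17 650 360 0 480
tree-update 17 450 980 1850 1180
tree-paths 17 450 380 920 470
to-fromjson 65536 40 370 100 380
ack 7 570 680 1090 610
range-prop 128 260 310 320 580
""".splitlines()


def fastest_counts(rows):
    """Count, per tool, the rows in which that tool has the smallest time."""
    wins = Counter()
    for row in rows:
        _name, _n, *times = row.split()
        # ASSUMPTION: "N/A" (more than 10 seconds) never counts as fastest.
        ms = [float("inf") if t == "N/A" else int(t) for t in times]
        wins[TOOLS[ms.index(min(ms))]] += 1
    return wins


print(dict(fastest_counts(ROWS)))
```

No row in the table has a tie for the smallest value, so picking the first minimum is unambiguous here.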

Features

Here is an overview that summarises:

  • features already implemented, and
  • features not yet implemented.

Contributions to extend jaq are highly welcome.

Basics

  • Identity (.)
  • Recursion (..)
  • Basic data types (null, boolean, number, string, array, object)
  • if-then-else (if .a < .b then .a else .b end)
  • Folding (reduce .[] as $x (0; . + $x), foreach .[] as $x (0; . + $x; . + .))
  • Error handling (try ... catch ...) (see the differences from jq)
  • String interpolation ("The successor of \(.) is \(.+1).")
  • Format strings (@json, @text, @csv, @tsv, @html, @sh, @base64, @base64d)

Paths

  • Indexing of arrays/objects (.[0], .a, .["a"])
  • Iterating over arrays/objects (.[])
  • Optional indexing/iteration (.a?, .[]?)
  • Array slices (.[3:7], .[0:-1])
  • String slices

Operators

  • Composition (|)
  • Binding (. as $x | $x)
  • Concatenation (,)
  • Plain assignment (=)
  • Update assignment (|=, +=, -=)
  • Alternation (//)
  • Logic (or, and)
  • Equality and comparison (.a == .b, .a < .b)
  • Arithmetic (+, -, *, /, %)
  • Negation (-)
  • Error suppression (?)

Definitions

  • Basic definitions (def map(f): [.[] | f];)
  • Recursive definitions (def r: r; r)

Core filters

  • Empty (empty)
  • Errors (error)
  • Input (inputs)
  • Length (length, utf8bytelength)
  • Rounding (floor, round, ceil)
  • String <-> JSON (fromjson, tojson)
  • String <-> integers (explode, implode)
  • String normalisation (ascii_downcase, ascii_upcase)
  • String prefix/postfix (startswith, endswith, ltrimstr, rtrimstr)
  • String splitting (split("foo"))
  • Array filters (reverse, sort, sort_by(-.), group_by, min_by, max_by)
  • Stream consumers (first, last, range, fold)
  • Stream generators (range, recurse)
  • Time (now, fromdateiso8601, todateiso8601)
  • More numeric filters (sqrt, sin, log, pow, ...) (list of numeric filters)
  • More time filters (strptime, strftime, strflocaltime, mktime, gmtime, and localtime)

Standard filters

These filters are defined via more basic filters. Their definitions are at std.jq.

  • Undefined (null)
  • Booleans (true, false, not)
  • Special numbers (nan, infinite, isnan, isinfinite, isfinite, isnormal)
  • Type (type)
  • Filtering (select(. >= 0))
  • Selection (values, nulls, booleans, numbers, strings, arrays, objects, iterables, scalars)
  • Conversion (tostring, tonumber)
  • Iterable filters (map(.+1), map_values(.+1), add, join("a"))
  • Array filters (transpose, first, last, nth(10), flatten, min, max)
  • Object-array conversion (to_entries, from_entries, with_entries)
  • Universal/existential (all, any)
  • Recursion (walk)
  • I/O (input)
  • Regular expressions (test, scan, match, capture, splits, sub, gsub)
  • Time (fromdate, todate)

Numeric filters

jaq imports many filters from libm and follows their type signature.

Full list of numeric filters defined in jaq

Zero-argument filters:

  • acos
  • acosh
  • asin
  • asinh
  • atan
  • atanh
  • cbrt
  • cos
  • cosh
  • erf
  • erfc
  • exp
  • exp10
  • exp2
  • expm1
  • fabs
  • frexp, which returns pairs of (float, integer).
  • ilogb, which returns integers.
  • j0
  • j1
  • lgamma
  • log
  • log10
  • log1p
  • log2
  • logb
  • modf, which returns pairs of (float, float).
  • nearbyint
  • pow10
  • rint
  • significand
  • sin
  • sinh
  • sqrt
  • tan
  • tanh
  • tgamma
  • trunc
  • y0
  • y1

Two-argument filters that ignore .:

  • atan2
  • copysign
  • drem
  • fdim
  • fmax
  • fmin
  • fmod
  • hypot
  • jn, which takes an integer as first argument.
  • ldexp, which takes an integer as second argument.
  • nextafter
  • nexttoward
  • pow
  • remainder
  • scalb
  • scalbln, which takes an integer as second argument.
  • yn, which takes an integer as first argument.

Three-argument filters that ignore .:

  • fma

Advanced features

jaq currently does not aim to support several features of jq, such as:

  • Modules
  • SQL-style operators
  • Streaming

Differences between jq and jaq

Numbers

jq uses 64-bit floating-point numbers (floats) for any number. By contrast, jaq interprets numbers such as 0 or -42 as machine-sized integers and numbers such as 0.0 or 3e8 as 64-bit floats. Many operations in jaq, such as array indexing, check whether the passed numbers are indeed integers. The motivation behind this is to avoid rounding errors that may silently lead to wrong results. For example:

$ jq  -n '[0, 1, 2] | .[1.0000000000000001]'
1
$ jaq -n '[0, 1, 2] | .[1.0000000000000001]'
Error: cannot use 1.0 as integer
$ jaq -n '[0, 1, 2] | .[1]'
1

The rules of jaq are:

  • The sum, difference, product, and remainder of two integers are integers.
  • Any other operation between two numbers yields a float.

Examples:

$ jaq -n '1 + 2'
3
$ jaq -n '10 / 2'
5.0
$ jaq -n '1.0 + 2'
3.0

You can convert an integer to a floating-point number e.g. by adding 0.0, by multiplying by 1.0, or by dividing by 1. You can convert a floating-point number to an integer with round, floor, or ceil:

$ jaq -n '1.2 | [floor, round, ceil]'
[1, 1, 2]
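The two rules above can be modelled in a few lines of Python. This is only a sketch of the typing rules as stated in this section, not jaq's implementation: jaq_arith and jaq_index are hypothetical names, Python's int stands in for jaq's machine-sized integers, and division by zero is not handled here.

```python
import operator

# Operations that keep two integers an integer (rule 1).
INT_OPS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "%": operator.mod}
# Everything else is carried out on floats (rule 2).
FLOAT_OPS = {**INT_OPS, "/": operator.truediv}


def jaq_arith(op, a, b):
    """Apply `op` following the two rules stated above.

    Note: Python's % differs from jq's for negative operands;
    only non-negative operands are modelled faithfully.
    """
    if isinstance(a, int) and isinstance(b, int) and op in INT_OPS:
        return INT_OPS[op](a, b)
    return FLOAT_OPS[op](float(a), float(b))


def jaq_index(xs, i):
    """Index a list, rejecting floats even if they look integral."""
    if not isinstance(i, int):
        raise TypeError(f"cannot use {i!r} as integer")
    return xs[i]


print(jaq_arith("+", 1, 2))    # integer result
print(jaq_arith("/", 10, 2))   # float result
print(jaq_arith("+", 1.0, 2))  # float result
```

Calling jaq_index([0, 1, 2], 1.0000000000000001) raises an error in this model, because the literal is read as the float 1.0, mirroring the jaq session shown earlier.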

NaN and infinity

In jq, division by 0 has some surprising properties; for example, 0 / 0 yields nan, whereas 0 as $n | $n / 0 yields an error. In jaq, n / 0 yields nan if n == 0, infinite if n > 0, and -infinite if n < 0. jaq's behaviour is closer to the IEEE standard for floating-point arithmetic (IEEE 754).
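As a minimal sketch, the division rule described for jaq can be written out in Python. The function name jaq_div is made up for this illustration, and Python itself raises ZeroDivisionError on division by zero, which is why the zero case is spelled out explicitly.

```python
import math


def jaq_div(n, d):
    """Model of n / d with the zero-divisor behaviour described above."""
    if d != 0:
        return n / d
    if n > 0:
        return math.inf    # infinite
    if n < 0:
        return -math.inf   # -infinite
    return math.nan        # 0 / 0 (a nan dividend also lands here)


print(jaq_div(0, 0), jaq_div(1, 0), jaq_div(-1, 0))
```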

jaq implements a total ordering on floating-point numbers to allow sorting values. Therefore, it unfortunately has to enforce that nan == nan. (jq gets around this by enforcing nan < nan, which breaks basic laws about total orders.)
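To illustrate why sorting needs nan to compare equal to itself, here is a Python sketch of one possible total order on floats. It is not jaq's comparison function: total_cmp and sort_numbers are hypothetical names, and placing nan before all other numbers is an assumption made for this example only.

```python
import math
from functools import cmp_to_key


def total_cmp(a: float, b: float) -> int:
    """Three-way comparison in which nan == nan holds.

    ASSUMPTION: nan is ordered before every other number.
    """
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # 0 if both are nan, -1 if only a is nan, 1 if only b is nan.
        return int(b_nan) - int(a_nan)
    return (a > b) - (a < b)


def sort_numbers(xs):
    """Sort using the total order; plain IEEE comparison gives no such guarantee."""
    return sorted(xs, key=cmp_to_key(total_cmp))


print(sort_numbers([2.0, float("nan"), 1.0, float("nan")]))
```

A rule such as nan < nan cannot be part of any strict order, because a strict order never relates a value to itself.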

Like jq, jaq prints nan and infinite as null in JSON, because JSON does not support encoding these values as numbers.

Preservation of fractional numbers

jaq preserves fractional numbers coming from JSON data perfectly (as long as they are not used in some arithmetic operation), whereas jq 1.6 may silently convert to 64-bit floating-point numbers:

$ echo '1e500' | jq '.'
1.7976931348623157e+308
$ echo '1e500' | jaq '.'
1e500

Therefore, unlike jq 1.6, jaq satisfies the following paragraph in the jq manual:

An important point about the identity filter is that it guarantees to preserve the literal decimal representation of values. This is particularly important when dealing with numbers which can't be losslessly converted to an IEEE754 double precision representation.

Please note that newer versions of jq, e.g. 1.7, seem to preserve the literal decimal representation as well.
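The number 1e500 from the example lies outside the range of a 64-bit float, which is why a float-only implementation cannot echo it back. The following Python sketch shows this with the standard json module. Keeping the literal as text via parse_float=str is only an illustration of the idea of preserving the representation, not how jaq stores numbers.

```python
import json
import sys

literal = "1e500"

# Converting to a 64-bit float loses the value: it overflows to infinity.
as_double = float(literal)

# The largest finite 64-bit float, which is what jq 1.6 printed above.
largest = sys.float_info.max

# Keeping the source text untouched preserves the literal exactly.
preserved = json.loads(literal, parse_float=str)

print(as_double, largest, preserved)
```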

Assignments

Like jq, jaq allows for assignments of the form p |= f. However, jaq interprets these assignments differently. Fortunately, in most cases, the result is the same.

In jq, an assignment p |= f first constructs paths to all values that match p. Only then does it apply the filter f to these values.

In jaq, an assignment p |= f applies f immediately to any value matching p. Unlike in jq, assignment does not explicitly construct paths.

jaq's implementation of assignment likely yields higher performance, because it does not construct paths. Furthermore, this also prevents several bugs in jq "by design". For example, given the filter [0, 1, 2, 3] | .[] |= empty, jq yields [1, 3], whereas jaq yields []. What happens here?

jq first constructs the paths corresponding to .[], which are .[0], .[1], .[2], .[3]. Then, it removes the element at each of these paths in turn. However, each of these removals changes the values that the remaining paths refer to. That is, after removing .[0] (value 0), .[1] no longer refers to value 1, but to value 2! That is also why value 1 (and in consequence also value 3) is not removed.
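The effect can be reproduced with a small Python model of both strategies applied to .[] |= f on an array. Both functions are hypothetical illustrations written from the description in this section. A filter is represented as a function returning a list of outputs, and only filters with zero or one output are modelled.

```python
def jq_style_update(xs, f):
    """Collect all paths first, then update them one by one."""
    xs = list(xs)
    paths = list(range(len(xs)))  # paths of .[] are fixed up front
    for i in paths:
        if i >= len(xs):
            continue  # stale path: nothing left at this index
        outputs = f(xs[i])
        if outputs:
            xs[i] = outputs[0]
        else:
            del xs[i]  # shifts all later elements one index down
    return xs


def jaq_style_update(xs, f):
    """Apply f to each element directly, without constructing paths."""
    result = []
    for x in xs:
        result.extend(f(x)[:1])  # keep the output, or drop the element
    return result


def empty(_value):
    """Model of the filter `empty`: no outputs."""
    return []


print(jq_style_update([0, 1, 2, 3], empty))   # path-based strategy
print(jaq_style_update([0, 1, 2, 3], empty))  # direct strategy
```

For a filter that always yields exactly one output, such as .+1, both models agree, which matches the remark that the results coincide in most cases.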

There is more weirdness ahead in jq; for example, 0 | 0 |= .+1 yields 1 in jq, although 0 is not a valid path expression. However, 1 | 0 |= .+1 yields an error. In jaq, any such assignment yields an error.

jaq attempts to use multiple outputs of the right-hand side, whereas jq uses only the first. For example, 0 | (., .) |= (., .+1) yields 0 1 1 2 in jaq, whereas it yields only 0 in jq. However, {a: 1} | .a |= (2, 3) yields {"a": 2} in both jaq and jq, because an object can only associate a single value with any given key, so we cannot use multiple outputs in a meaningful way here.
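The output 0 1 1 2 can be traced with Python generators. The sketch assumes that updating with (., .) on the left-hand side means updating the first path and then updating the second path in every result. That reading is an assumption made for this model, as are all function names.

```python
def update_identity(value, f):
    """Model of `. |= f` keeping every output of f."""
    yield from f(value)


def update_both_all_outputs(value, f):
    """Model of `(., .) |= f` when all outputs of f are used."""
    for once in update_identity(value, f):
        yield from update_identity(once, f)


def update_both_first_output(value, f):
    """Model of `(., .) |= f` when only the first output of f is used."""
    for _ in range(2):
        value = next(iter(f(value)))
    yield value


def f(v):
    """Model of the right-hand side `(., .+1)`."""
    return [v, v + 1]


print(list(update_both_all_outputs(0, f)))
print(list(update_both_first_output(0, f)))
```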

Because jaq does not construct paths, it does not allow some filters on the left-hand side of assignments, such as first, last, and limit. For example, [1, 2, 3] | first(.[]) |= .-1 yields [0, 2, 3] in jq, but is invalid in jaq. Similarly, [1, 2, 3] | limit(2; .[]) |= .-1 yields [0, 1, 3] in jq, but is invalid in jaq. (Inconsistently, jq also does not allow last on the left-hand side.)

Definitions

Like jq, jaq allows for the definition of filters, such as:

def map(f): [.[] | f];

Arguments can also be passed by value, such as:

def cartesian($f; $g): [$f, $g];

Filter definitions can be nested and recursive, i.e. refer to themselves. That is, a filter such as recurse can be defined in jaq:

def recurse(f): def r: ., (f | r); r;

Since jaq 1.2, jaq optimises tail calls, like jq. Since jaq 1.1, recursive filters can also have non-variable arguments, like in jq. For example:

def f(a): a, f(1+a);

Recursive filters with non-variable arguments can yield surprising effects; for example, a call f(0) builds up calls of the shape f(1+(..(1+0)...)), which leads to exponential execution times.

Recursive filters with non-variable arguments can very frequently be alternatively implemented by either:

  • A nested filter: for example, instead of def walk(f): (.[]? |= walk(f)) | f;, you can use def walk(f): def rec: (.[]? |= rec) | f; rec;.
  • A filter with variable arguments: for example, instead of def f(a): a, f(1+a);, you can equally well write def f($a): $a, f(1+$a);.
  • A filter with recurse: for example, you may write def f(a): a | recurse(1+.);. If you expect your filter to recurse deeply, it is advised to implement it using recurse, because jaq has an optimised implementation of recurse.

All of these options are supported by jaq.

Arguments

Like jq, jaq allows defining arguments via the command line, in particular with the options --arg, --rawfile, and --slurpfile. This binds variables to values, and for every variable $x bound to v this way, $ARGS.named contains an entry with key x and value v. For example:

$ jaq -n --arg x 1 --arg y 2 '$x, $y, $ARGS.named'
"1"
"2"
{
  "x": "1",
  "y": "2"
}

Folding

jq and jaq provide filters reduce xs as $x (init; f) and foreach xs as $x (init; f).

In jaq, the output of these filters is defined very simply: Assuming that xs evaluates to x0, x1, ..., xn, reduce xs as $x (init; f) evaluates to

init
| x0 as $x | f
| ...
| xn as $x | f

and foreach xs as $x (init; f) evaluates to

init
| x0 as $x | f | (.,
| ...
| xn as $x | f | (.,
empty)...)

Additionally, jaq provides the filter for xs as $x (init; f) that evaluates to

init
| ., (x0 as $x | f
| ...
| ., (xn as $x | f
)...)

The difference between foreach and for is that for yields the output of init, whereas foreach omits it. For example, foreach (1, 2, 3) as $x (0; .+$x) yields 1, 3, 6, whereas for (1, 2, 3) as $x (0; .+$x) yields 0, 1, 3, 6.
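The three expansions above translate directly into recursive Python generators. This is a sketch of the definitions as written in this section, not jaq's code. xs is a list of values, init is a single value, and f is modelled as a function f(state, x) that returns an iterable of outputs. All names are chosen for this illustration.

```python
def jaq_reduce(xs, init, f):
    """init | x0 as $x | f | ... | xn as $x | f"""
    if not xs:
        yield init
        return
    for y in f(init, xs[0]):
        yield from jaq_reduce(xs[1:], y, f)


def jaq_foreach(xs, init, f):
    """Like reduce, but yields every intermediate result except init."""
    if not xs:
        return
    for y in f(init, xs[0]):
        yield y
        yield from jaq_foreach(xs[1:], y, f)


def jaq_for(xs, init, f):
    """Like foreach, but also yields init first."""
    yield init
    if xs:
        for y in f(init, xs[0]):
            yield from jaq_for(xs[1:], y, f)


def add(state, x):
    """Model of the update filter `. + $x`."""
    return [state + x]


print(list(jaq_foreach([1, 2, 3], 0, add)))
print(list(jaq_for([1, 2, 3], 0, add)))
print(list(jaq_reduce([1, 2, 3], 0, add)))
```

With an empty xs, this model of foreach yields nothing, whereas for and reduce both yield init.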

The interpretation of reduce/foreach in jaq has the following advantages over jq:

  • It deals very naturally with filters that yield multiple outputs. In contrast, jq discriminates outputs of f, because it recurses only on the last of them, although it outputs all of them.

    Example

    foreach (5, 10) as $x (1; .+$x, -.) yields 6, -1, 9, 1 in jq, whereas it yields 6, 16, -6, -1, 9, 1 in jaq. We can see that both jq and jaq yield the values 6 and -1 resulting from the first iteration (where $x is 5), namely 1 | 5 as $x | (.+$x, -.). However, jq performs the second iteration (where $x is 10) only on the last value returned from the first iteration, namely -1, yielding the values 9 and 1 resulting from -1 | 10 as $x | (.+$x, -.). jaq yields these values too, but it also performs the second iteration on all other values returned from the first iteration, namely 6, yielding the values 16 and -6 that result from 6 | 10 as $x | (.+$x, -.).

  • It makes the implementation of reduce and foreach special cases of the same code, reducing the potential for bugs.
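The example foreach (5, 10) as $x (1; .+$x, -.) can be replayed in Python to see where the two interpretations diverge. Both functions are models written from the description above: the first continues only from the last output of each iteration, the second continues from every output. The case where f yields no output is not modelled.

```python
def foreach_last_only(xs, init, f):
    """Outputs everything, but carries only the last output forward."""
    state = init
    for x in xs:
        for y in list(f(state, x)):
            yield y
            state = y  # after the loop, state is the last output


def foreach_all_outputs(xs, init, f):
    """Carries every output forward into the next iteration."""
    if not xs:
        return
    for y in f(init, xs[0]):
        yield y
        yield from foreach_all_outputs(xs[1:], y, f)


def step(state, x):
    """Model of the update filter `.+$x, -.`."""
    return [state + x, -state]


print(list(foreach_last_only([5, 10], 1, step)))
print(list(foreach_all_outputs([5, 10], 1, step)))
```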

Compared to foreach ..., the filter for ... (where ... refers to xs as $x (init; f)) has a stronger relationship with reduce. In particular, the values yielded by reduce ... are a subset of the values yielded by for .... This does not hold if you replace for by foreach.

Example

As an example, if we set ... to empty as $x (0; .+$x), then foreach ... yields no value, whereas for ... and reduce ... yield 0.

Furthermore, jq provides the filter foreach xs as $x (init; f; proj) (foreach/3) and interprets foreach xs as $x (init; f) (foreach/2) as foreach xs as $x (init; f; .), whereas jaq does not provide foreach/3 because it requires completely separate logic from foreach/2 and reduce in both the parser and the interpreter.

Error handling

In jq, the try f catch g expression breaks out of the f stream as soon as an error occurs, ceding control to g after that. This is mentioned in its manual as a possible mechanism for breaking out of loops. jaq, however, does not interrupt the f stream, but instead sends each error value emitted to the g filter; the result is a stream of values emitted from f with values emitted from g interspersed where errors occurred.

Consider the following example: this expression is true in jq, because the first error(2) interrupts the stream:

[try (1, error(2), 3, error(4)) catch .] == [1, 2]

In jaq however, this holds:

[try (1, error(2), 3, error(4)) catch .] == [1, 2, 3, 4]
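The two behaviours can be modelled in Python by representing the stream of f as a list in which errors are wrapped in a marker class. Err, try_stop_at_first_error, and try_continue_after_error are hypothetical names for this sketch, and catch is a function from an error payload to a list of outputs.

```python
class Err:
    """Marks a stream item as an error carrying a payload."""

    def __init__(self, payload):
        self.payload = payload


def try_stop_at_first_error(stream, catch):
    """Behaviour described for jq: the first error ends the stream."""
    for item in stream:
        if isinstance(item, Err):
            yield from catch(item.payload)
            return
        yield item


def try_continue_after_error(stream, catch):
    """Behaviour described for jaq: every error is passed to catch."""
    for item in stream:
        if isinstance(item, Err):
            yield from catch(item.payload)
        else:
            yield item


# Model of the stream `1, error(2), 3, error(4)` and of `catch .`.
stream = [1, Err(2), 3, Err(4)]
identity = lambda payload: [payload]

print(list(try_stop_at_first_error(stream, identity)))
print(list(try_continue_after_error(stream, identity)))
```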

Miscellaneous

  • Slurping: When files are slurped in (via the -s / --slurp option), jq combines the inputs of all files into one single array, whereas jaq yields an array for every file. The behaviour of jq can be approximated in jaq; for example, to achieve the output of jq -s . a b, you may use jaq -s . <(cat a b).
  • Cartesian products: In jq, [(1,2) * (3,4)] yields [3, 6, 4, 8], whereas [{a: (1,2), b: (3,4)} | .a * .b] yields [3, 4, 6, 8]. jaq yields [3, 4, 6, 8] in both cases.
  • Indexing null: In jq, when given null input, .["a"] and .[0] yield null, but .[] yields an error. jaq yields an error in all cases to prevent accidental indexing of null values. To obtain the same behaviour in jq and jaq, you can use .["a"]? // null or .[0]? // null instead.
  • List updating: In jq, [0, 1] | .[3] = 3 yields [0, 1, null, 3]; that is, jq fills up the list with nulls if we update beyond its size. In contrast, jaq fails with an out-of-bounds error in such a case.
  • Input reading: When there is no more input value left, in jq, input yields an error, whereas in jaq, it yields no output value.
  • Joining: When given an array [x0, x1, ..., xn], in jq, join(x) converts all elements of the input array to strings and intersperses them with x, whereas in jaq, join(x) simply calculates x0 + x + x1 + x + ... + xn. When all elements of the input array and x are strings, jq and jaq yield the same output.
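The two output orders in the Cartesian products item correspond to swapping the nesting of two loops. The following Python sketch reproduces both orders; it models only the order of outputs reported above and makes no claim about how either tool evaluates operators internally.

```python
def right_operand_outermost(ls, rs):
    """Order reported for jq on (1,2) * (3,4)."""
    return [l * r for r in rs for l in ls]


def left_operand_outermost(ls, rs):
    """Order reported for jaq, and for the object construction in both tools."""
    return [l * r for l in ls for r in rs]


print(right_operand_outermost([1, 2], [3, 4]))
print(left_operand_outermost([1, 2], [3, 4]))
```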

Contributing

Contributions to jaq are welcome. Please make sure that after your change, cargo test runs successfully.

Acknowledgements

jaq has profited tremendously from:

  • serde_json to read and colored_json to output JSON,
  • chumsky to parse and ariadne to pretty-print parse errors,
  • mimalloc to boost the performance of memory allocation, and
  • the Rust standard library, in particular its awesome Iterator, which builds the rock-solid base of jaq's filter execution.

Footnotes

  1. I wanted to create a tool that should be discreet and obliging, like a good waiter. And when I think of a typical name for a (French) waiter, to my mind comes "Jacques". Later, I found out about the old French word jacquet, meaning "squirrel", which makes for a nice ex post inspiration for the name.

  2. The binaries for jq-1.7 and gojq-0.12.13 were retrieved from their GitHub release pages; the binary for jq-1.6 was installed from the standard Ubuntu repository.