Add "format string" support for tensors #655

AngelEzquerra · 2024-06-04T22:00:18Z

This comes in 2 forms:

1. Add a version of the pretty function that takes a format string as its input instead of a precision value. Also add a new "showHeader" argument to the original pretty function. This new version of pretty lets you specify the format string that must be used to display each element. It also adds some new tensor-specific format string "tokens" that control how tensors are displayed (beyond the format of each element).
2. Add a formatValue procedure that takes a tensor as its first input. This makes it possible to control how tensors are displayed in format strings, in the exact same way as if you were using the new pretty(specifier) procedure.

The special, tensor-specific tokens added by this change are:

- "[:]": Display the tensor as if it were a Nim "array". This makes it easy to use the representation of a tensor in your own code. No header is shown.
- "[]": Same as "[:]" but displays the tensor in a single line. No header is shown.
- "<>": Combined with the 2 above (i.e. "<>[:]" or "<>[]"), adds a header with basic tensor info (type and shape). "<:>" can be used as a shortcut for "<>[:]", while "<>" on its own is equivalent to "<>[]". Can also be combined with "||" (i.e. "<>||", see below).
- "||": "Pretty-print" the tensor without a header. This can also be combined with "<>" (i.e. "<>||") to explicitly enable the default mode, which is pretty printing with a header.
- 'j': Formats complex values as A+Bj like in mathematics. Ignored for non-Complex tensors.

Note that these are only used to control how the tensors themselves are displayed as a whole and are removed before displaying the individual elements.
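The token rules above can be illustrated with a minimal sketch of how a specifier might be split into a layout, a header flag and the remaining per-element format. This is not the PR's implementation: the names `Layout`, `ParsedSpec` and `parseTensorSpec` are hypothetical, the 'j' token is not handled, and the sketch assumes the element format itself never contains the token sequences.

```nim
import std/strutils

type
  Layout = enum              # hypothetical names, not Arraymancer's
    layPretty, layMultiLine, laySingleLine
  ParsedSpec = object
    layout: Layout
    showHeader: bool
    elemSpec: string         # what is left for the individual elements

proc parseTensorSpec(spec: string): ParsedSpec =
  var rest = spec
  var sawHeader = false
  var sawLayout = false
  result.layout = layPretty

  if "<:>" in rest:                  # shortcut for "<>[:]"
    rest = rest.replace("<:>", "")
    sawHeader = true
    sawLayout = true
    result.layout = layMultiLine
  if "<>" in rest:
    rest = rest.replace("<>", "")
    sawHeader = true

  if "[:]" in rest:
    rest = rest.replace("[:]", "")
    sawLayout = true
    result.layout = layMultiLine
  elif "[]" in rest:
    rest = rest.replace("[]", "")
    sawLayout = true
    result.layout = laySingleLine
  elif "||" in rest:
    rest = rest.replace("||", "")
    sawLayout = true
  elif sawHeader and not sawLayout:
    result.layout = laySingleLine    # "<>" on its own means "<>[]"

  # With no tokens at all, fall back to pretty printing with a header.
  result.showHeader = sawHeader or not sawLayout
  result.elemSpec = rest

echo parseTensorSpec("[]+06.2f")
```

The tokens are stripped before the remainder is handed to the element formatter, which matches the note that they only control how the tensor is displayed as a whole.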

Vindaar · 2024-06-07T13:39:18Z

I think this PR could use some other eyes / opinions. Not so much because of the added code (you're welcome to review of course), but more so because of the added syntax. @mratsim @HugoGranstrom @Clonkk (feel free to ping anyone else!)

If we decide to add it to arraymancer directly (and not e.g. as a submodule that can be imported import arraymancer / pretty_printing), we might want to be more certain that we like the syntax. Any input would be appreciated.

Clonkk · 2024-06-07T14:01:07Z

src/arraymancer/tensor/display.nim

+  ##
+  ## Inputs:
+  ##   - Input Tensor
+  ##   - specifier: A format specifier similar to those used in format strings,


Shouldn't those be an enum instead of static[string]?
In my experience, custom string format specifiers are rather confusing.
The different output formats could even be separate procs, so the code is easier to navigate and change. We could also introduce a compile-time switch -d:... to choose which prettyPrint version is used by $.

Using an enum is the first thing I tried but it was not very flexible. Using a string is how I let the user configure the individual element display in the same flexible manner that nim's strformat allows. For example, you can display integers as hex or as binary values, you can force the display of the sign for positive values, you can specify the number of digits in total and on the decimal part or (with nim 2.2) you can ask to use the "math-like" "j" complex display format (which displays complex numbers as 2.3-7.1j) which is particularly useful for a tensor library in the context of signal processing.
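For readers less familiar with std/strformat, here is a small sketch of the per-element flexibility referred to above, applied to plain scalars. It only uses the standard library; the variable names are illustrative.

```nim
import std/strformat

let n = 255
let x = 3.14159

echo &"{n:x}"      # hexadecimal
echo &"{n:b}"      # binary
echo &"{n:+}"      # force the sign on positive values
echo &"{x:8.3f}"   # total width 8, 3 digits after the decimal point
```

The idea in the PR is that the same specifiers are applied to every element of a tensor.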

Clonkk · 2024-06-07T14:16:08Z

Good stuff overall. To me the biggest discussion to be had is about using a string as the specifier.

For non-mutually-exclusive options we could use a set (something done in the std lib).
Another possibility might be to have different procs and pass an enum to do the dispatch in prettyPrint?

HugoGranstrom · 2024-06-07T16:08:52Z

Thanks for working on this @AngelEzquerra 😄

I agree with Clonkk that string formats are quite confusing.
I don't see what the <> format would add that a separate showHeader parameter couldn't bring. And the different modes ([], [:], ||) could be replaced with a printMode parameter that takes an enum as Clonkk suggested. So my suggestion would be a function with this signature:

type
  PrintMode = enum
    arrayMode
    singleLineMode
    prettyMode

proc pretty*[T](t: Tensor[T], formatString: string, showHeader: bool = true, printMode: PrintMode = prettyMode): string =
    ...

Or is there something I'm missing here, some combination of tokens that can't be handled this way?

Clonkk · 2024-06-07T16:18:46Z

The only downside of an enum I can see is that if you have 2 non-mutually-exclusive formatting options, you need 3 enum values (optA, optB, optA_B). And the more non-mutually-exclusive options you add, the worse it gets.

That is why I think we should also consider a set instead of an enum, since it still works well even if we want or need different formatting options.

On the other hand, if we want to keep the possibilities limited, then an enum is simpler.

HugoGranstrom · 2024-06-07T16:21:52Z

> The only downside of an enum I can see is that if you have 2 non-mutually-exclusive formatting options, you need 3 enum values (optA, optB, optA_B). And the more non-mutually-exclusive options you add, the worse it gets.

I agree with you that it is more flexible. Do we have any plans to introduce any mutually inclusive options, though? If we do, it is easy enough to add an overload that just calls the set-version with the single argument in a set. :)
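A minimal sketch of that idea, a set-based version plus a single-option overload that forwards to it. Everything here is hypothetical: `DisplayOpt`, the option names and `render` are made up for illustration, and a plain `seq[int]` stands in for a tensor.

```nim
import std/strutils

type
  DisplayOpt = enum        # hypothetical option names for illustration
    optNoHeader, optSingleLine, optHex

proc render(data: seq[int], opts: set[DisplayOpt]): string =
  ## Set-based version: independent options combine freely.
  if optNoHeader notin opts:
    result.add "seq[int] of length " & $data.len & "\n"
  let sep = if optSingleLine in opts: ", " else: ",\n "
  var items: seq[string]
  for x in data:
    let s = if optHex in opts: toHex(x, 2) else: $x
    items.add s
  result.add "[" & items.join(sep) & "]"

proc render(data: seq[int], opt: DisplayOpt): string =
  ## Convenience overload: wrap the single option in a one-element set.
  render(data, {opt})

echo render(@[10, 255], {optNoHeader, optSingleLine, optHex})
echo render(@[10, 255], optSingleLine)
```

With a set, adding a new independent option does not multiply the number of enum values, which is the concern raised about combined values such as optA_B.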

AngelEzquerra · 2024-06-07T16:51:22Z

Thank you @Vindaar for asking others to comment and thank you for those comments guys! :-)

Just to give more context to other people, the main reason I made this PR is that while the current pretty printing of tensors is nice (and pretty! :-) ), it makes it hard to compare arraymancer's results with numpy's, which uses a pretty different format, and, above all, it makes it really hard to use arraymancer's own output in your arraymancer code (e.g. when creating a test or using arraymancer to build a simulator). Another reason is to be able to control the format of the elements themselves (e.g. to use hex format, etc). Finally, I wanted to be able to do all of those while using nim's strformat, which I use extensively and find very useful and nice.

Another comment I'd make is that I tried several options before landing on the current proposal. The first thing I tried was to add an enum argument to pretty. The problem with that was that it was not usable with format strings, nor did it let me configure the display of the individual elements. That is why I decided to simply add support for format strings. At first, I tried to use simple letters to indicate that the format should be array-like ("a"), single line array-like ("s") or that it should not have a header ("n"). However, after some thought I got worried that nim's own strformat specifiers (which are needed to control the format of the individual elements) might evolve and potentially add a use for those letters in the future. That's why I decided to use something that seems much less likely to collide with nim's strformat. In particular, I decided to use symbols (and in fact more than one of them). This has the nice side-effect of making the chosen tokens kind of look like the desired output (<> for headers, [] for single-line arrays, [:] for arrays with more than one line and || for the existing table format).

One small tweak that could be done to the current proposal is to force these "tokens" to surround the "element level format". That is, instead of (or perhaps in addition to) adding "[]" to "+06.2f" (i.e. "[]+06.2f"), we could do "[+06.2f]".

HugoGranstrom · 2024-06-07T16:56:30Z

Yes, it is great. I have also been a bit annoyed at times that the string representation is a bit odd. 😅

> The first thing I tried is to add an enum argument to pretty. The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.

One doesn't exclude the other. You can specify the array format with an enum and the element format with a string (like in my example). Or was there something that didn't work out with that approach? :)

AngelEzquerra · 2024-06-07T17:03:20Z

> Thanks for working on this @AngelEzquerra 😄
>
> I agree with Clonkk that string formats are quite confusing. I don't see what the <> format would add that a separate showHeader parameter couldn't bring. And the different modes ([], [:], ||) could be replaced with a printMode parameter that takes an enum as Clonkk suggested. So my suggestion would be a function with this signature:
>
> type
>   PrintMode = enum
>     arrayMode
>     singleLineMode
>     prettyMode
>
> proc pretty*[T](t: Tensor[T], formatString: string, showHeader: bool = true, printMode: PrintMode = prettyMode): string =
>     ...
>
> Or is there something I'm missing here, some combination of tokens that can't be handled this way?

That is almost identical to the first thing I tried to do :-)
What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements. It also does not let you use nim's format strings (which I use very often) to print your tensors. With the current proposal you can do something like this:

echo &"Error the input tensor ({t:[]}) does not have the right rank ({t.rank})"

Or something like this:

echo &"{t=:+06.8f[]}"

And get a nice representation of my tensor in which every element has the "+06.8f" format and which I can copy into my code because it is valid nim syntax.
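The mechanism behind this is that std/strformat forwards everything after the colon to a `formatValue(result, value, specifier)` overload for the value's type. A minimal sketch with a toy type follows; `Vec` is a hypothetical stand-in for a tensor, and only the "[]" token is handled, in a simplified way that is not the PR's implementation.

```nim
import std/[strformat, strutils]

type Vec = object            # hypothetical stand-in for a tensor
  data: seq[float]

proc formatValue(result: var string; v: Vec; specifier: string) =
  ## Picked up by strformat for `{v:spec}`.
  # Drop the layout token and keep the rest as the per-element format.
  let elemSpec = specifier.replace("[]", "")
  result.add "["
  for i, x in v.data:
    if i > 0: result.add ", "
    result.formatValue(x, elemSpec)   # std/strformat's float overload
  result.add "]"

let v = Vec(data: @[1.5, -2.25])
echo &"{v:[].2f}"
echo &"Unexpected input ({v:[]}) of length {v.data.len}"
```

Because the output uses array-literal syntax, it can be pasted back into source code, which is the use case described above.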

Clonkk · 2024-06-07T17:07:53Z

> The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.

Yeah I understand, enum felt limited too when I thought about it.

In your opinion do you think a set of options would work ?

{HexaFmt, ArrayDisplay, NoHeader}

for example would display hexadecimal in array format without the header.

This seems flexible enough while allowing for high cardinality of options and at the same time doesn't require sanitizing input and is resilient to change in the specs (if an option disappears or is split into 2, it will instantly be a compile time error).

Since enums can have a string representation, we could define

type DisplayOpts = FormatNumbersEnum|HeadersEnum|...

Since enum can have a string value it shouldn't be too difficult to use the enum in a strformat.

EDIT :

> With the current proposal you can do something like this:

This is a great argument, I am on my way to the airport and will look a bit more into your proposal.

HugoGranstrom · 2024-06-07T17:09:36Z

> What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements.

That is what the formatString parameter is for, the format string for the individual elements. I could have been clearer about what I meant :D.

> It also does not let you use nim's format strings (which I use very often) to print your tensors.

Okay, I think I'm starting to see the problem, it is specifically the &"{t:[]}" syntax you are after. I didn't read well enough and thought the updated pretty function was the important part here. 😅

AngelEzquerra · 2024-06-07T17:11:14Z

> > The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.
>
> Yeah I understand, enum felt limited too when I thought about it.
>
> In your opinion do you think a set of options would work ?
>
> `{HexaFmt, ArrayDisplay, NoHeader}` for example would display hexadecimal in array format without the header.
>
> This seems flexible enough while allowing for high cardinality of options and at the same time doesn't require sanitizing input and is resilient to change in the specs (if an option disappears or is split into 2, it will instantly be a compile time error).

FYI, strformat strings also give compile time errors because they are all static strings :-) This helped me more than once when I was working on this feature.

> Since enums can have a string representation, we could define
>
> `type DisplayOpts = FormatNumbersEnum|HeadersEnum|...`
>
> Since enum can have a string value it shouldn't be too difficult to use the enum in a strformat.

Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.

AngelEzquerra · 2024-06-07T17:17:02Z

> > What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements.
>
> That is what the formatString parameter is for, the format string for the individual elements. I could have been clearer about what I meant :D.
>
> > It also does not let you use nim's format strings (which I use very often) to print your tensors.
>
> Okay, I think I'm starting to see the problem, it is specifically the &"{t:[]}" syntax you are after. I didn't read well enough and thought the updated pretty function was the important part here. 😅

Yes, I guess I should have been clear about that too 😅. I wanted an easy way to use the non-pretty syntax in format strings. I could have done that with a new function (e.g. toString, and then do echo &"{t.toString}"). But I also wanted to control the individual element format, which can only be done very flexibly by using nim's format strings. So once I added format string support it felt very natural to go all the way, because using a format string as an argument to a function call inside a format string would be weird! That is, echo &"{t.toString(\"6.2f\")}" (which I don't know if it works?).

Plus echo &"{t:[]}" is pretty fast to write (and natural once you get used to it) for quick debug printing :)

Clonkk · 2024-06-07T19:41:37Z

> Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.

My idea was to use set to pass options to pretty print :

Typically something like :

var a: Tensor[float] = randomTensor(3, 5, 6, 1.0)
echo a.prettyPrint({optNoHeader, optHexFormat, ... })

This makes having inclusive options easy, and from a doc / API point of view it makes a lot of sense: it is easy to know what's possible and what isn't. It's nice to have but won't solve the std/strformat problem.

We can always go for the string specifier and build a way to generate the string specifier from a set of enums (like have {optNoHeader, optHexFormat, ... } be converted to a specific string specifier at compile time) so we could have the best of both worlds.
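As a rough sketch of that last idea, a set of options can be turned into a specifier string by an ordinary `func` that is evaluated at compile time when assigned to a `const`. The option names and `toSpecifier` are hypothetical, and the token order in the generated string is an assumption.

```nim
type
  DisplayOpt = enum          # hypothetical option names
    optHeader, optSingleLine, optMultiLine, optHex

func toSpecifier(opts: set[DisplayOpt]): string =
  ## Maps options to the tokens proposed in this PR.
  if optHeader in opts: result.add "<>"
  if optMultiLine in opts: result.add "[:]"
  elif optSingleLine in opts: result.add "[]"
  else: result.add "||"
  if optHex in opts: result.add "x"   # element format, integers only

# Assigning to a const forces evaluation at compile time.
const spec = toSpecifier({optSingleLine, optHex})
echo spec
```

An unknown or removed option is then a compile-time error at the call site, while the string form remains available for use inside format strings.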

AngelEzquerra · 2024-06-08T09:42:58Z

> > Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.
>
> My idea was to use set to pass options to pretty print :
>
> Typically something like :
>
> var a: Tensor[float] = randomTensor(3, 5, 6, 1.0)
> echo a.prettyPrint({optNoHeader, optHexFormat, ... })
>
> This makes having inclusive options easy, and from a doc / API point of view it makes a lot of sense: it is easy to know what's possible and what isn't. It's nice to have but won't solve the std/strformat problem.
>
> We can always go for the string specifier and build a way to generate the string specifier from a set of enums (like have {optNoHeader, optHexFormat, ... } be converted to a specific string specifier at compile time) so we could have the best of both worlds.

What if we extended the version of pretty that doesn't take a format string (the one that takes a precision value) with these new set-based options, but kept the version that takes the format string as is? That way people who don't want to learn about this custom syntax or who don't care about format strings can use the more functional style while those that like format strings can use the new syntax?

Clonkk · 2024-06-08T12:55:22Z

> What if we extended the version of pretty that doesn't take a format string (the one that takes a precision value) with these new set-based options, but kept the version that takes the format string as is? That way people who don't want to learn about this custom syntax or who don't care about format strings can use the more functional style while those that like format strings can use the new syntax?

Yes, this is what I had in mind in my last paragraph. But it can be done after this PR, once the string specifiers are merged.


Clonkk · 2024-06-10T10:26:26Z

src/arraymancer/tensor/display.nim

+  ##   - Input Tensor.
+  ##   - precision: The number of decimals printed (for float tensors),
+  ##                _including_ the decimal point.
+  ##   - showHeader: If true (the default) show a dscription header


Could we add a row-major or col-major indication to the header? That would be very helpful when passing arraymancer's buffer to other interfaces (like Numpy or Julia).

Fair point, but #660 raises a good point about our colMajor support.

Clonkk · 2024-06-10T10:28:14Z

src/arraymancer/tensor/display.nim

+  ##                - "||": "Pretty-print" the tensor _without_ a header. This
+  ##                        can also be combined with "<>" (i.e. "<>||") to
+  ##                        explicitly enable the default mode, which is pretty
+  ##                        printing with a header.


@AngelEzquerra Could we have a specifier to choose between showing the content 'in-memory' order OR in 'index' order ?

Clonkk · 2024-08-24T19:32:31Z

This seems good enough for me. I think we should merge this whenever CI is green to avoid PR rot.

cc @Vindaar cc @mratsim cc @HugoGranstrom

Vindaar · 2024-09-20T16:32:50Z

src/arraymancer/tensor/display.nim

+  ##   - Input Tensor.
+  ##   - precision: The number of decimals printed (for float tensors),
+  ##                _including_ the decimal point.
+  ##   - showHeader: If true (the default) show a dscription header


Suggested change

## - showHeader: If true (the default) show a dscription header

## - showHeader: If true (the default) show a description header

Will fix this typo in a commit taking care of all your comments.

Vindaar · 2024-09-20T16:46:12Z