Make arrays heterogeneous #665
Comments
While I do get and can support the use of mixed numeric values inside a single array, the whole example with contributors gives me a slight headache. Here's my reasoning for the difference:
Mixed type array of numbers
For the numeric types, I can see my parser recognizing that there are mixed type numbers inside the array. It can apply some logic described in the TOML standard to cast all numbers to the same type and end up with a single-typed array of, for example, floats. You mention it yourself too: "even strongly typed languages such as C or Java" have logic like this on board and it can work for TOML as well.
Other mixed type arrays
For the contributors and URLs from the examples, things are a lot different, which is probably why you only named dynamically typed languages at the start. If I'm in a strongly typed language such as C, Java, Go, etc., then the TOML parser can for sure take care of turning the TOML document into a syntax tree that describes the TOML document. However, the second step of translating that syntax tree into a strongly typed data structure is no longer directly possible, since there is no concept of an array that mixes, in this case, strings and objects. Unlike the numbers example, there cannot be automatic casting rules to normalize the data to something that is compatible with statically typed languages.
Would it then be impossible to handle these? Nope. One sensible way that I can see to handle things in such a case would be to make the TOML parser more complex, by providing extension points that let a programmer provide casting logic to mangle the syntax tree into a syntax tree that can be translated into native data structures. So custom application-provided code would be needed, for example, to turn
My end verdict:
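The numeric casting rule discussed in this comment could look roughly like the sketch below. It is not taken from any TOML parser: the function name `normalize_numeric_array` is hypothetical, and it assumes the array has already been parsed into Python `int` and `float` values.

```python
from typing import List, Union

Number = Union[int, float]


def normalize_numeric_array(values: List[Number]) -> List[Number]:
    """Promote a mixed int/float array to floats; leave uniform arrays untouched."""
    for v in values:
        # bool is a subclass of int in Python, but a TOML boolean is not a number.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"not a TOML number: {v!r}")
    has_float = any(isinstance(v, float) for v in values)
    has_int = any(isinstance(v, int) for v in values)
    if has_float and has_int:
        # Mixed array: cast everything to float, as C or Java would promote int.
        return [float(v) for v in values]
    return list(values)


mixed = normalize_numeric_array([1, 2.5, 3])
```

Note that casting very large integers to float can lose precision, which is one reason a parser might prefer to keep the original values and leave the decision to the application.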
|
@mmakaay: Thanks for your perspective. Regarding number casting, I would agree that allowing mixed numeric arrays – with an explanation that parsers in strongly typed languages are allowed to convert integers in such arrays into floats – is a better-than-nothing fallback in case that generally heterogeneous arrays are rejected. I would be against the auto-casting of date/time values because it cannot really be performed in a safe way. When converting Offset Date-Time into Local Date-Time, you're throwing away potentially relevant information; doing it the other way around, you have to invent information which is potentially wrong. The latter is also true when converting a Local Date into an Offset or Local Date-Time; and I fail to see how a Local Time can be meaningfully converted into any other type.
It's true that right now, an array might be converted into Also I'd say that a proper "strongly typed data structure" cannot automatically be read from a TOML document at all, since for that you likely want to use some custom data types which are unknown to TOML. So, if you have a list of contributors, you don't want to deal with a |
Date/Time casting
Sorry, I wasn't fully clear there. I wasn't really aiming at Date/Time values being all interchangeable. That would indeed lead to too much guesswork. But I can imagine that "2019-10-10 11:22:33" and "2019-10-10" can be used together, the latter one being defined as "2019-10-10 00:00:00". Definitely not a fan of it, but that seemed the other type for which some rule-based typecasting could be feasible. Personal preference is: keep them separate types, also in arrays, please.
How to handle things in statically typed languages
Things aren't as rigid as you describe for all statically typed languages. Languages like Go and C# do have the goodness of reflection to work with, and it's perfectly possible to give the user of my parser the joy of just having to define the target type that the TOML contents must be translated into (so no need for that converter function that you talk about). Error handling for types in the TOML document that don't match the output data type is possible without writing extra code. @BurntSushi does the same thing in his TOML parser. Put differently: currently, all the code that you need to write is the target data type definition, nothing else. In the current situation, there is no need for extra handling logic to make reading TOML into a data structure possible. With mixed type arrays, this changes. The simplest way of implementing this in the statically typed world would be to simply produce an error when the syntax tree (part of which is what you mentioned as "TOMLValue") contains a mixed type array. IMO, it would be bad design, if
All this is possible, however not as straightforward as it is now.
But.. let's not forget the final users of the TOML files
I think the most important thing here is, of course, the final user of the TOML documents in play. The whole of TOML is not designed to make it as simple as possible for the parser guys. It is designed to make it easy to read and write a configuration file. 
In that light, I would be very interested to see more actual examples of use cases for which people want to use mixed types. I think that strong examples would be the best way to defend the cause. |
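To illustrate the "define only the target type" workflow described in the comment above, here is a minimal Python sketch that uses type hints in place of Go or C# reflection. The `Server` type and the `decode` helper are hypothetical and not part of any real TOML library; the sketch only handles plain fields and lists of a single element type, and it rejects mixed arrays with an error, which is the simplest option the comment mentions.

```python
from dataclasses import dataclass, fields
from typing import Any, List, get_args, get_origin, get_type_hints


@dataclass
class Server:  # hypothetical target type supplied by the application
    host: str
    ports: List[int]


def decode(target: type, data: dict) -> Any:
    """Build `target` from parsed data, raising TypeError on any type mismatch."""
    hints = get_type_hints(target)
    kwargs = {}
    for f in fields(target):
        value = data[f.name]  # a missing key raises KeyError
        expected = hints[f.name]
        if get_origin(expected) is list:
            (item_type,) = get_args(expected)
            if not isinstance(value, list):
                raise TypeError(f"{f.name}: expected an array")
            bad = [v for v in value if type(v) is not item_type]
            if bad:
                # A mixed-type array cannot be mapped onto List[item_type].
                raise TypeError(f"{f.name}: unexpected element(s) {bad!r}")
        elif type(value) is not expected:
            raise TypeError(f"{f.name}: expected {expected.__name__}")
        kwargs[f.name] = value
    return target(**kwargs)


server = decode(Server, {"host": "127.0.0.1", "ports": [6000, 7777]})
```

With heterogeneous arrays allowed, the application would either keep this strict behavior or declare a wider element type and add its own conversion step.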
Definitions
I am using each of the following phrases with a given understanding:
Types are Valuable, but When and Why?
A large part of the argument for maintaining homogenously typed arrays was in favor of making things easier for statically typed languages. Types are valuable in statically typed languages because they allow us to expect certain types of data and decide how to handle them. For a configuration format, it's obvious that typing is incredibly valuable, since if the configuration is incorrect, I can decide that the configuration is lies and ask for correct data. This saves a lot of time for everyone by making them do something to fix things rather than releasing an incorrect configuration. However, this doesn't mean all type expectations are sane.
Statically Typed Languages Already Have Options
As already mentioned, and I will not linger on it much, statically-typed languages have the option to wrap values in a type abstraction, and trying to avoid asking strongly-typed languages to make decisions about how to structure values from data, when data is inherently "raw" when parsed and has to be thoughtfully handled, seems fruitless. TOML at least makes it easy to make decisions, which I think is good enough.
No Guarantee of an Exact Match in Strongly Typed Languages
Across languages that use strong typing in general, for some languages, such as Erlang, "Array" is not a meaningful construct. You have Lists, Tuples, and Maps. Yet parsers exist for such languages, thus, we can expect an "Array" to actually be composed as a linked list, a tuple, a vector, or some other meaningfully equivalent but non-exactly-matching language construct. This is because numerous languages have different views of data! So we have already lost the battle on an exact match. And it hardly matters, because it's not like our arrays have resizing or append rules. But arguing that we should forbid Integers and Floats together elides that languages get to decide how to represent even tiny things like Integers in the array. 
You will hit issues where a language has native facility for representing a given value as an unsigned integer, but not a signed one, and that value and also a negative (thus, signed) integer shows up in the array. Translating that gets a lot more exciting than it was a moment ago! To the extent that TOML matches to anything, TOML bears the marks of convenience for scripting languages, so its arrays already wink at many "arrays". JavaScript's dynamically resizing map of pointers in in
It's Data, Not Code
TOML does not describe transformations on data. Here we make an interesting structural decision that makes it much harder to include a heterogenous array by encouraging nesting as a hack, introducing pain, while not having language facilities to make interacting with it easier. If I need a heterogenous array, because I'm using a scripting language, and that was the easiest way to set up my configuration, then the array-of-arrays must be inspected and flatmapped at the end to get the desired data structure... or I simply ignore TOML rules. This distorts things more. If they could be simple values, it makes it easier to reach for the abstraction and move on, rather than trying to decide whether [ [3, 2], ["a", "bread"], [1.2] ] is intended to "actually" represent a heterogenous 5x1 array or "actually" is a 3x2 array. And it puts an unwise expectation on languages which do not have a care whether their Vec-alike has multiple types in it to enforce type restrictions, so the parsing difficulty is shifted around.
What Makes Heterogenous Arrays Desirable?
While I just specified it IS data, the heterogenous array does make a convenient mapping to the "tuple" data structure that is available in many languages and is often used in constructing configurations, that I've noticed. In fact, when homogenous arrays were included, tuples were going to be added, in #154. 
These were discarded in favor of inline tables, which are handy, but more cumbersome to write out as a "pseudo-tuple", since now I'm declaring a set of numbers as keys e.g.
It also unencumbers possible grid structures that might be converted into raw data... we can take an Excel sheet, for instance, as a hyper literal case. The fairly naive cast for that is into an array of arrays in many languages, but these would, at the final end, still be typed arrays in TOML.
Consumers Always Win
While the heterogenous array is inconvenient to parse, a consumer can impose additional constraints upon the data it is handed. You write TOML configurations for programs that exist and have a need for a configuring key<->value structure. The program can reject data that doesn't match its preferred data structures, but it doesn't really need TOML to provide that facility: the program just decides the TOML file doesn't have an acceptable structure. The types are there to make parsing, matching, and casting decisions easier. Not as a straightjacket.
The Long View
There was murmuring about doing this anyways, and pausing to choose this is not going to slow things down notably. It has been "1.0 time" for a bit, and it feels like avoiding making decisions when they seem obvious to make, "obvious candidates for 1.1.0", et cetera, seems to inspire people to wait rather than get on top of things. I realize there was a promise made, a long time ago, about near-100% backwards compatibility, and that's why decisions like this have been deferred. I must note, however, that was made when TOML was on version 0.4.0. It seems best to file this in the "near" part of the 100% backwards compatibility, with appropriate dread, and simply redouble efforts to untangle the rest quickly and affirmatively.
A Final Concession
I must note, I feel that much of my argument does in fact advance the argument to ban heterogeneity in array-of-arrays situations, enforcing deep array typing. 
To this, I can only say "you're right", and I would still much prefer this alternative to the current situation! |
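The "inspect and flatmap" workaround mentioned in this comment can be sketched as follows, assuming the nested array from the example has already been parsed into Python lists. The helper name `flatten_groups` is hypothetical.

```python
from typing import Any, List


def flatten_groups(groups: List[List[Any]]) -> List[Any]:
    """Flatten one level of nesting: each inner array is homogeneous, the result is not."""
    return [item for group in groups for item in group]


# Parsed form of [ [3, 2], ["a", "bread"], [1.2] ] from the comment above.
nested = [[3, 2], ["a", "bread"], [1.2]]
flat = flatten_groups(nested)
```

The flattening itself is trivial; the ambiguity the comment points at is that nothing in the file tells the reader whether the nesting is meaningful or only a way around the typing rule.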
Good drill down, I updated the terminology in my post to match the correct one ("statically typed"). Python is so many things, I'd like to add duck-typed to the list :-)
TL;DR
Using heterogeneous arrays to store tabular data in a config feels like a strong use case to me. I have never felt the need to use that in a config file myself, but I can see how this can be useful. When there are a lot of rows in the tabular data, then being forced to use TOML tables can make for a quite unwieldy and hard-to-read TOML file.
My main concern about tuples
One example that I've seen come up a few times is ip + port definition, like
[connection]
host = "127.0.0.1"
port = 6000
data_timeout = 1000
conn_timeout = 5000
lazy_connect = true
only_udp = false
An inherent problem with tuple-style data is that things can become quite unreadable when there's too much data that does not trigger associations with the reader (this goes for code as well, even in languages without tuple data types in the form of function arguments).
connection = ["127.0.0.1", 6000, 1000, 5000, true, false]
The statement "Consumers always Win" applies here. They are allowed to shoot themselves in the foot with unreadable configuration file formatting. Mixed arrays just provide the gun.
However, tuples as a tabular data construct ...
Tuples aren't a convincing use-case to me, since they still don't feel like a good format for configuration data in general, but when there's a need to store tabular data in a TOML file, then heterogeneous arrays would be actually nice.
connections = [
# Host Port Data Conn Lazy Only
# timeout timeout connect UDP
# -------------------------------------------------------
[ "127.0.0.1", 6000, 1000, 5000, true, false ],
[ "1.2.3.4", 6000, 500, 5000, false, false ],
[ "5.5.5.5", 6000, 100, 2000, true, true ],
[ "90.10.133.17", 6000, 1000, 5000, true, false ],
[ "18.20.18.20", 7777, 5000, 1000, false, true ],
# -------------------------------------------------------
]
I think that most people would agree that the above format is quite readable and provides a quick overview of the configuration to the unsuspecting reader. More so than by using an array of tables.
A brain fart about an alternative syntax
One idea I just had, triggered by the example that @workingjubilee provided, showing the current work-around:
Wouldn't it be a feasible idea to modify the inline table syntax for this, and keep the arrays as-is? IMO auto-numbering of unkeyed values in an inline table would do the trick here, and it would allow for a nice mixed configuration style. Here's what I mean by that:
This kind of matches the principle of having positional and named function parameters, like for example Python supports. |
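For the tabular `connections` layout shown in this comment, an application could attach names to the positional columns after parsing. The sketch below assumes the rows arrive as plain Python lists; the `Connection` type and `rows_to_connections` helper are hypothetical, with field names borrowed from the `[connection]` table earlier in the comment.

```python
from typing import List, NamedTuple, Sequence


class Connection(NamedTuple):
    host: str
    port: int
    data_timeout: int
    conn_timeout: int
    lazy_connect: bool
    only_udp: bool


def rows_to_connections(rows: Sequence[Sequence[object]]) -> List[Connection]:
    """Turn positional rows into named records, rejecting rows of the wrong width."""
    result = []
    for index, row in enumerate(rows):
        if len(row) != len(Connection._fields):
            raise ValueError(f"row {index}: expected {len(Connection._fields)} values")
        result.append(Connection(*row))
    return result


rows = [
    ["127.0.0.1", 6000, 1000, 5000, True, False],
    ["1.2.3.4", 6000, 500, 5000, False, False],
]
connections = rows_to_connections(rows)
```

This keeps the config file compact while the program still works with named fields; per-column type checks are left out of the sketch for brevity.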
Ah! I think everyone understands roughly what we mean by typing terminology, I just wanted to be relatively formal about it, but yes, everyone proceeding on the page probably is a good idea. Pythonic typing is definitely... wild, sometimes, thereby duck-typed most of the time, in spite of using a stronger construction than many weaker dynamic type systems. Thank you for drafting that table up! It was more or less exactly what I had in mind, yes. A relatively complicated table that gets awkward when it is an inline table. In general I feel like an unstated goal of data presentation formats in general should be "make it easy to transfer data to/from Excel sheets", both because of Excel being the elephant in the room (and hopefully we want to encourage people to use better things than Excel) and also because of SQL, CSV files, and other implicitly tabular data we wrangle all the time, with the matrices as well, and so on. We can technically represent it, right now, but it is definitely more aesthetic and also concise in the And offering the ability to combine both "ragged" and also tabular data, while being readable, helps even for configuration files, I think... because some configuration files are going to be generated by computers, probably based on some ETL process, but then reviewed (and if necessary, edited) by humans. Part of the point of tuples in general is a bit more abstract. Because we shouldn't necessarily automatically expect arrays to map to "actually, an Array, for real", but rather trust the language on the other end has some kind of data structure to hold the data in and provide a logical sequencing, many languages that refuse heterogenous arrays do support heterogenous tuples, which is why I brought it up, as a way of extending on my argument that I don't know that the homotypic array concern is as big a deal as it is made out to be. This was a more loosely-connected argument, unfortunately, in my original explanation, now that I reread things. 
And I agree generically that tuples are not the best format sometimes for things that we use them for, and they're "language internal" for a lot of things, but I do prefer structs and tuples in general for a lot of things people would use a string for because that logic clicks easier in my head. I realize this might mark me as insane. Nonetheless! I will try to make one small case for their preference even in the example you showed: Tuples for IP:port pairing. Which... frankly IS a bit weird, but even this might become more persuasive in a case like IPv6:port, where the choices are...
"Why did you put two string representations there?" Well, as it turns out, it's not true that all devices or programs that support IPv6 support using the double colon delimiter as the line between the IP and the port, which of course makes everything a lot more interesting than it has to be. So, both of those could get copied out. And of course, whether you put two colons in the last bit, or one, is a lot less obvious and a lot more finicky. Sometimes, leaking an implementation detail here or there is the way to remove ambiguity while maintaining concision. I feel like inlined tables supporting both positional and keyword parameters is... yes, it feels very natural, but I feel that way because I'm a Python programmer as well, and I'm trying to avoid advocating simply for my own brain patterns to win, even if I cannot deny my aesthetic tastes. For TOML, it feels like a wrong path for a few reasons, some of which feel hard to articulate. Most expressibly, it opens up more ambiguity in the overall TOML hash table and requires learning new patterns for the inline table, from the user perspective, when it had been stated that it was an inline table, i.e. like every other TOML table, just inline, whereas simply permitting heterogenous arrays does not do the same thing, and in fact permits the combination of keyword and positional parameters without introducing new patterns, e.g.
This requires some unpacking once we reach the struct... er, inline table... but we can interpret it relatively smoothly (it has an obvious meaning at-a-glance, both to humans and machines) and we do not face potential implicit namespace conflicts like...
|
Insane is good! With IPv6, things are definitely harder to write when using colon for ports. In fact, Messing things up is a good way to convince me that my brain fart was actually smelly ;-) |
@mmakaay: I agree that storing tabular data in a config is an additional use case of heterogeneous arrays that might occasionally be useful (that is, using arrays as tuples). It's not the one I advocated above and, in general, I would still consider inline tables (which might be considered "named tuples") or regular arrays of tables more useful for such purposes. But I'm pretty sure that there are a few cases where compact unnamed tuples are a reasonable data representation – such as in the example you gave. Generally, my feeling is that we, as language designers, shouldn't attempt to restrain and "educate" our users too much. If we give them freedom to model their data as they wish, they might shoot themselves in the foot, but they might also find useful, readable, and domain-appropriate representations which are made impossible by overly restrictive rules. Ultimately, we should trust our users and give them the freedom to do as they think right. I also agree with @workingjubilee that having some kind of "automatic" numbered keys in inline tables is an interesting, but ultimately very bad idea. |
@mojombo, @pradyunsg: What do you think? |
As an illustration, here is how the counter-examples get done with the current syntax:
Proposition:
Current syntax (1):
Current syntax (2):
Current syntax (3):
Proposition:
Current syntax:
Proposition:
Current syntax:
Proposition:
Current syntax:
Proposition:
Current syntax (1):
Current syntax (2):
|
Thanks to all for looking for edge cases. Does any of the propositions have an acceptable alternative in the current syntax? |
@josuah Some remarks about your first counter example.
This is not valid TOML. Maybe it was not completely edited? But more importantly, I think the semantics of the versions that you have provided are different from the original meaning. The original is:
So the example defines two URLs, where you seem to interpret the string as some sort of key and the object as the properties for that key (together forming only a single definition for a URL). |
Woops, I just noticed that. And yes, different semantics, only the same "vague goal". P.S.: A note that I am not arguing in either direction (I'm fine with both ways for different reasons). |
Ah, another note: Is this currently allowed?
|
@josuah: Yes. That is an array of inline tables. All inline tables are considered the same type. |
Hi, I agree with @ChristianSi's original post recapitulating my arguments from #553, arguing why heterogeneous arrays should be allowed.
This is the main reason why I want heterogeneous arrays: data representation should be distinct from semantic validation. I also agree with @workingjubilee's comment. I feel that it echoes my comments in #553 that the primitive "array of TOML values" is more useful than the primitive "array of same-type TOML values". Where TOML tables are counterparts to structs and maps, TOML arrays should be counterparts to tuples and vectors. Regarding the claims that homogeneity may simplify some implementations, I strongly believe the opposite: allowing heterogeneous arrays would get rid of an artificial requirement on the parser. Requiring arrays to be homogeneous requires a definition of "same type". I don't think that TOML is able to (or even should try to) meaningfully define the "is same type" relation. This relation is contextual. TOML either differentiates types when users want to treat them the same (decimals and integers), or treats the values as having the same types when users may want to differentiate them (inline tables are treated as the same type, disregarding their fields). The types used to specify the TOML format are not the same as the user types; they often coincide, but not always, and conflating them for homogeneous arrays is an error. Edit: Fixed referenced issue, thanks @ChristianSi. |
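The point about the "is same type" relation can be made concrete with a small sketch of a shallow homogeneity check, as a parser might implement it over already-parsed Python values. Both function names are hypothetical, and the date/time types are left out to keep the example short.

```python
from typing import Any, List


def toml_type_name(value: Any) -> str:
    """Map a parsed Python value to a shallow TOML type name."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "table"
    raise TypeError(f"unsupported value: {value!r}")


def is_shallowly_homogeneous(values: List[Any]) -> bool:
    """True when all elements share one shallow TOML type."""
    return len({toml_type_name(v) for v in values}) <= 1


numbers_ok = is_shallowly_homogeneous([1, 2.0])
tables_ok = is_shallowly_homogeneous([{"name": "a"}, {"url": "b", "port": 1}])
```

Under this relation the two numbers count as different types while the two structurally unrelated tables count as the same type, which is the mismatch with user expectations that the comment describes.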
I realize that I never noted this in #553 or here -- I like this, and am on board for this. This simplifies the conceptual overhead that might be introduced as part of keeping the "strong but shallow" model of typing that we have in TOML. Perhaps the most painful bit here, is that However, this does enable cases which are better served by the potential simplification this brings on board for conceptually-similar values. |
@pradyunsg Great, but how do we proceed from here? Do we have to wait for @mojombo to make a decision or can you or anyone else decide on this issue? |
Wild presumption: it is time to draft a PR for a rollback of the strict typing of arrays throughout the text. |
Let's draft a PR. I think I'll poke @mojombo to check if he has concerns with this but broadly, I think this should be fine to do. |
I've pushed a PR. Please review. |
This is a proposal to remove the constraint that "Data types may not be mixed" from TOML's array syntax, making arrays heterogeneous – just like they are in JSON, JavaScript, Python etc. It is a spin-off of #553 which originally went in the opposite direction (proposing to make array typing "deep" instead of "shallow", as it currently is) but then received various comments making the case for heterogeneous arrays. I'm collecting these arguments here (and adding some new ones) so that the case for heterogeneous arrays can be discussed independently of the "deep vs. shallow" question (which in #553 has already been labeled as "post-1.0").
First, I agree that it's a good practice to reserve arrays for items of the same type. But, as @demurgos pointed out, items that conceptually are of the same type might be reasonably expressed in ways that correspond to different types in TOML. For example, contributors to a code package:
Or uplink servers in a private package manager:
All these examples might be rewritten using arrays of inline tables with one or more fields. But if many of these tables essentially require just a single field, that may be considered needlessly cumbersome. If app writers want to allow their contributors (or whatever else an array is used for) to specify their credentials either compactly as
"Name <email>"
or more verbosely as
{ name = "...", email = "..." }
(with the second syntax also allowing other, optional fields which are not supported by the first syntax), why shouldn't they? Note that they don't have to make that choice – if they want to standardize on a single syntax, that's fine. But if they want to allow alternative syntaxes, I'd consider that a legitimate choice which should not be prohibited by an arbitrary decision of the TOML spec authors.
Another problem of the typing restriction is that TOML has two numeric types, while, mathematically speaking, all numbers are numbers which might be written in different ways (e.g. 2, 2.0, 4/2 are three ways of writing the same number). But since TOML has no "array of numbers", users have to decide whether they want an "array of integers" or an "array of floats" and then write all numbers in the same fashion. Hence the following perfectly logical array is currently illegal:
This is more restrictive than even strongly typed languages such as C or Java, which will autoconvert int to float (or similar) as needed.
And forcing people who care about numbers, but not about what their internal representation in computer memory might look like, to add ".0" to every plain number just because some other numbers require a fractional part is just plain odd.
So, let's drop the problematic typing restriction and make array values free. Heterogeneous arrays haven't hurt the popularity of JSON or Python, nor will they hurt TOML's.
Also, note that arrays of tables and of inline tables are already free – TOML neither forces all of them to have the same keys, nor does it force every key to map to the same type of value. Since tables are untyped, let's be consistent and un-type arrays as well.
Schema validation (which keys are in a table? which type does each of them have? which type or types are allowed in an array? how many values are required or allowed in an array? etc.) is best left to an external schema validator (see #116) or application logic. TOML can at most make a very bad job of it, so there is no point in trying.
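As an example of leaving such validation to application logic, the sketch below normalizes a contributors array that mixes the compact "Name <email>" strings and the verbose tables described earlier in this post. It assumes the array has already been parsed into Python strings and dicts; `normalize_contributor` and the sample entries are hypothetical, and the regular expression is a deliberately simple approximation.

```python
import re
from typing import Any, Dict

# Simple "Name <email>" pattern; real address validation is out of scope here.
_NAME_EMAIL = re.compile(r"^\s*(?P<name>[^<>]+?)\s*<(?P<email>[^<>]+)>\s*$")


def normalize_contributor(entry: Any) -> Dict[str, Any]:
    """Accept either a "Name <email>" string or a table with at least a name."""
    if isinstance(entry, dict):
        if "name" not in entry:
            raise ValueError("contributor table needs a 'name' key")
        return dict(entry)
    if isinstance(entry, str):
        match = _NAME_EMAIL.match(entry)
        if match is None:
            raise ValueError(f"cannot parse contributor: {entry!r}")
        return {"name": match["name"], "email": match["email"]}
    raise TypeError(f"unsupported contributor entry: {entry!r}")


# Hypothetical parsed array mixing both syntaxes.
parsed = [
    "Foo Bar <foo@example.com>",
    {"name": "Baz Qux", "email": "baz@example.com", "url": "https://example.com/baz"},
]
contributors = [normalize_contributor(entry) for entry in parsed]
```

The application decides which shapes it accepts and rejects everything else, so the TOML parser itself needs no notion of "same type" for the array.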
I know this is a non-breaking change so it could be delayed until TOML 1.1. However, 1.0 might well be the "gold standard" for a long time, so I would urge you to seriously consider this for inclusion in 1.0 (ping @mojombo @pradyunsg @eksortso @demurgos).