-
Notifications
You must be signed in to change notification settings - Fork 5
Add a dialect.type
property (with table dialect reorganized)
#82
Conversation
(Tagging @peterdesmet @ezwelty for review!) |
dialect.type
property (with table dialect reorganized)
Looks great @khusmann! BTW I think we need to list I agree that |
@roll excellent points One more thing I just realized -- Update summary:
|
@khusmann |
No worries! This would mean it would be for v2.1, right? Unfortunately, I'm not sure It's not a super big deal to miss out on this, it just means that we can't do as much up-front validation of table dialect descriptors (and make the data producer's dialect type choice explicit) -- implementations will have to postpone the final validation of the dialect properties until the data file format is determined. Moving forward, I think it'd be better to focus on expanding / clarifying which data formats can/should be associated with which dialect types (because the data format now has effectively become the "tag" in this type union) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khusmann Thanks for the great work here. I made several suggestions with respect to wording, and some with respect to logic.
I think the structure works, but wonder whether it was a good idea to separate the types from their properties. Sure, it avoids some repetition, but leads to what I see are some bigger issues:
- It is more complicated to assemble a descriptor, since one must first look up the type, then read the definition of each property associated with it, then back to the type definition for the relationships between the properties and their defaults.
- The defaults for each property are defined in multiple places: once where the property is defined, and once where each associated type is defined. So what happens when the default depends on the type, and/or on the value of another property?
- We say "A Table Dialect descriptor
MAY
have theX
property" but this is misleading for properties that are required for some types, or may be required depending on the value or presence/absence of another property (again, type-specific). - Most importantly, the definition of each property is best written in the context of the type. For example
header
has a different meaning fordelimited
(line with field names) than forstructured
withitemType: array
(array with field names), andcommentChar
has (implicitly) a different meaning fordelimited
(first character of line) than forspreadsheet
(first character of cell A in a row).
This isn't the direction the documentation has been evolving, but I would have suggested a structure similar to Table Schema's field type
and "additional properties" (e.g. an integer field's decimalChar
), so that each type is described in full in one place, and repetition can be avoided when appropriate by referring to the property's definition in a previously mentioned type. Each type section would include a full list of properties (as now), their defaults and relationships (somewhat as now), short but type-specific descriptions (omitting the "A Table Dialect descriptor MAY
have the X
property that ..."), and type-specific example(s).
@@ -35,44 +35,31 @@ Table Dialect supersedes [CSV Dialect](https://specs.frictionlessdata.io/csv-dia | |||
|
|||
## Descriptor | |||
|
|||
Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. A list of standard properties that can be included into a descriptor is defined in the [Properties](#properties) section. | |||
Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. The descriptor `MAY` include a `type` property to indicate which of optional dialect properties `SHOULD` be considered when reading the target data format. Some [properties](#properties) are generic and can be used for multiple formats, while others are specific to one format. A list of standard dialect types are defined in the [Table Dialect Types](#table-dialect-types) section. The properties that can be used with these types are defined in the [Properties](#properties) section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The first sentence is a bit wonky with regards to articles. How about: "A Table Dialect
descriptorMUST
be a descriptor as per the Descriptor definition". - Not all properties are optional, so I would simplify "which of optional dialect properties" to "which properties".
- "generic" can have special meanings, and we should clarify that properties are assigned to
type
, not "target data format". So I would suggest "Some properties apply to multiple types, while others are specific to one type".
|
||
For the sake of simplicity, most of examples are written in the CSV data format. For example, this data file without providing any Table Dialect properties: | ||
Table Dialect can be used for different data formats, such as delimited text files, semi-structured formats and spreadsheets. The list of supported dialect types with associated data formats and related properties is as follows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table Dialect can be used for different data formats
Table Dialect can be used for different tabular data formats
I know it is in the name "Table Dialect", but I feel that it doesn't hurt to repeat it.
|
||
`SHOULD` output this data: | ||
Delimited formats are textual formats such as CSV and TSV. Their charactistics can be expressed the following properties: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This definition feels rather circular to me (uses the term "delimited" and undefined acronyms). How about something like "Text-based formats where each line represents a row and row values are separated by specific delimiter characters, such as CSV (comma-separated values)."
- The second sentence needs fixing: "Their charactistics [→ characteristics] can be expressed [by] the following properties". However, I think it is better simpler: "They can be described with the following properties:"
- Should we provide a brief example?
id,name
1,apple
2,orange
|
||
Structured formats is a group of structured or semi-structured formats such as JSON and YAML. Their charactistics can be expressed the following properties: | ||
Structured formats are structured or semi-structured formats such as JSON and YAML. Their charactistics can be expressed the following properties: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Again, this definition feels rather circular to me, and clearly we don't mean to support any random JSON or YAML. So would this be accurate? "Tabular data represented using structured formats such as JSON and YAML as either an array of objects or an array of arrays." Is there an existing place in the documentation we can link to that defines these? Ah, I see that it is buried in the
itemType
definition. This doesn't feel ideal. - Again, I suggest "They can be described with the following properties:"
- Examples for each type?
[
{"id": 1, "name": "apple"},
{"id": 2, "name": "orange"}
]
[
["id", "name"],
[1, "apple"],
[2, "orange"]
]
|
||
- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default | ||
- [type](#type): `structured` | ||
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default | ||
- [header](#header): `true` by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect header
default for type structured
to be true
if itemType
is array and false
or undefined if itemType
is object.
@@ -576,6 +623,8 @@ With this dialect definition: | |||
|
|||
### `sheetName` | |||
|
|||
**Dialect Types:** [spreadsheet](#spreadsheet) | |||
|
|||
A Table Dialect descriptor `MAY` have the `sheetName` property that `MUST` be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If sheetName
is specified, does it take precedence over sheetNumber
?
A Table Dialect descriptor `MAY` have the `table` property that `MUST` be a string; undefined by default. This property specifies a name of the table in the database. | ||
**Dialect Types:** [database](#database) | ||
|
||
A Table Dialect of type `database` `MUST` have a `table` property of type string. This property specifies a name of the table in the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PostgreSQL has the concept of "schemas" (namespaces), such that a table's name may not be sufficient to identify it; the schema name may also be needed. Is this not supported, or can the name be a qualified name {schema}.{table}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently is not supported but there is a issue (frictionlessdata/datapackage#947) for adding it to version 2.1.
I think using {schema}.{table}
in dialect.table
would be problematic because .
is valid in table names in some databases.
|
||
- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default | ||
- [type](#type): `delimited` | ||
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lists of properties should be by order of importance, not alphabetical. So delimiter first, then header, ...
|
||
- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default | ||
- [type](#type): `structured` | ||
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default | ||
- [header](#header): `true` by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lists of properties should be by order of importance, not alphabetical. So property
first (where), then itemType
, then itemKeys
and header
(which only apply to one value of itemType
).
@@ -123,23 +113,34 @@ Spreadsheet formats is a group of sheet-based formats such as Excel or ODS. Thei | |||
- [sheetNumber](#sheetnumber): `1` by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the lists of properties should be by order of importance, not alphabetical. So sheetName
/ sheetNumber
first (where), followed by the rest.
@ezwelty Thanks for these comments! But as I mentioned in my previous comment, I'm second-guessing the That said I think many of your suggestions you made here are relevant to the organization of table dialect spec even without |
Hi @khusmann, I think thanks to our new immutable profiles concept the change still makes sense even after |
@roll good to hear! I've had my hands full the last couple weeks as I have been transitioning jobs -- if you can help with migrating this PR I'd super appreciate it! |
Migrated to frictionlessdata/datapackage#1055 🎉 |
Here's another attempt at
dialect.type
, rebased on Peter's new structure (see #74).dialect.type
also makes it possible to make different fields required for different dialect types. For example, for adatabase
dialect type, does it make sense to havetable
be an optional property? With spreadsheets, there's at least the concept of the "first" spreadsheet that can be loaded by default -- but with databases, is there such thing as a "first" table? If not, I thinkdialect.table
should be required for database dialect types.