I'd like to suggest a standardised way to specify data endpoint URLs. With those in place, it becomes straightforward to locate the data a contract specifies, and possibly also to write SDK clients that know both where to find the data and how to read it.
In many ways, a standard for data endpoint URLs could be independent of the contract specification, but it feels natural for them to live in the same standard, since the two should be mutually beneficial.
My suggestion is to use standardized rules for URLs, which I've seen work well at e.g. Spotify and have brought over, stolen with pride, to my current employer. The URL could be generated from various fields in the existing contract specification. However, I think it would be useful to have a URL that can be constructed just by knowing where the data is stored (which makes the contract easy to discover) and that can be used independently of contracts.
Below is an example:
bq://table.[PROJECT_CODE]/[DATASET_NAME]/[TABLE_NAME]_YYYYMMDD
Suggestion 0:
Leverage pre-existing URL concepts, like PostgreSQL URLs:
postgres://postgres:[email protected]:5432/dummy
or, for Google Cloud Storage:
gs://my-bucket-name/path/to/object
Alternatively, avoid collisions with existing schemes by adding a prefix to the protocol.
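To illustrate why reusing existing URL shapes is convenient, here is a minimal sketch that reads both styles with Python's standard `urllib.parse`. The user name, password and host in the PostgreSQL URL are hypothetical values chosen for illustration, and `describe` is a made-up helper, not part of any proposal.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a data endpoint URL into its generic components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # e.g. "postgres" or "gs"
        "host": parts.hostname,   # database host, or the bucket name for gs://
        "port": parts.port,       # None when the URL has no port
        "user": parts.username,   # None when the URL has no credentials
        "path": parts.path,       # database name or object path, with leading "/"
    }


# Hypothetical credentials and host, for illustration only.
pg = describe("postgres://app_user:secret@db.example.com:5432/dummy")
gs = describe("gs://my-bucket-name/path/to/object")
```

A generic parser already recovers the scheme, location and path for both, so a client only needs scheme-specific logic for what the path means.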
Suggestion 1:
Let the protocol, e.g.
bq://
define how the rest of the URL is parsed. BQ in this example is, of course, Google BigQuery. I use a "sub-domain" to indicate that a table can be found at the location; that could also be a URL parameter.
Suggestion 2:
Reserve YYYY MM DD as well as HH MM SS as URL "placeholders". The example above indicates that the table name has a date suffix. This means that a URL could point at a single shard of data by using digits, or at the collection of "all shards" by using a pattern like the one above.
For natively (date) partitioned tables in BigQuery, the
_YYYYMMDD
suffix in the example could be represented as an additional / path segment instead, so
/YYYYMMDD
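Suggestions 1 and 2 can be sketched together as follows. This is only an illustration under stated assumptions: the project, dataset and table names are hypothetical, `parse_bq_url` and `resolve_shard` are made-up helpers, and only the contiguous `YYYYMMDD` placeholder is handled (not `HH MM SS`).

```python
from datetime import date
from urllib.parse import urlsplit

DATE_PLACEHOLDER = "YYYYMMDD"


def resolve_shard(url_pattern, day):
    """Turn an "all shards" pattern into the URL of one concrete shard."""
    return url_pattern.replace(DATE_PLACEHOLDER, day.strftime("%Y%m%d"))


def parse_bq_url(url):
    """Parse bq://<kind>.<project>/<dataset>/<name> into its parts."""
    parts = urlsplit(url)
    if parts.scheme != "bq":
        raise ValueError(f"not a bq:// URL: {url!r}")
    # The "sub-domain" says what kind of object lives at the location.
    kind, _, project = parts.netloc.partition(".")
    dataset, name = parts.path.strip("/").split("/", 1)
    return {
        "kind": kind,
        "project": project,
        "dataset": dataset,
        "name": name,
        # True when the URL still refers to the whole collection of shards.
        "is_pattern": DATE_PLACEHOLDER in name,
    }


# Hypothetical names, for illustration only.
pattern = "bq://table.my-project/my_dataset/events_YYYYMMDD"
shard = resolve_shard(pattern, date(2024, 3, 5))
info = parse_bq_url(shard)
```

Replacing only the full `YYYYMMDD` token avoids touching table names that happen to contain `MM` or `DD`.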
Suggestion 3:
Allow wildcards, as in the following example for Avro data stored in Google Cloud Storage:
gs://[BUCKET]/[PATH]/[FILE_PREFIX]*.avsc
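A client could resolve such a wildcard against a bucket listing with ordinary glob matching. The sketch below assumes the object names have already been listed by some other means; the bucket, the object names and the `match_objects` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def match_objects(url_pattern, object_names):
    """Return the object names in a listing that match a gs:// wildcard URL."""
    # Object names in a bucket listing carry no leading slash.
    pattern = urlsplit(url_pattern).path.lstrip("/")
    return [name for name in object_names if fnmatchcase(name, pattern)]


# Hypothetical bucket listing, for illustration only.
listing = [
    "exports/events-001.avsc",
    "exports/events-002.avsc",
    "exports/users-001.avsc",
    "exports/events-001.json",
]
matches = match_objects("gs://my-bucket/exports/events-*.avsc", listing)
```

Note that `fnmatchcase` lets `*` match across `/` as well, so a standard would need to say whether a wildcard may span path segments.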
Suggestion 4:
Allow "URL parameters" to specify additional information required to read the data. It's up to each protocol to decide how to use them. For a stored procedure (e.g. a table-valued function in BigQuery) it could look like:
bq://tvf.[PROJECT_CODE]/[DATASET_NAME]/[PROCEDURE_NAME]?date=DATE&some_nr=INTEGER
Note that the parameters above are not columns in some table, but rather the input parameters to a stored function -- which returns a table that could be covered by a contract.
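As a rough sketch of how a client might read those parameters, the snippet below extracts the query string and converts each value to the type the function expects. The concrete URL, the `converters` mapping and the `parse_params` helper are assumptions made for illustration; a real client would presumably take the parameter types from the contract or the function signature.

```python
from datetime import date
from urllib.parse import parse_qsl, urlsplit


def parse_params(url, converters):
    """Read URL parameters and convert each one with the given converter."""
    raw = dict(parse_qsl(urlsplit(url).query))
    # Parameters without a declared converter are kept as plain strings.
    return {key: converters.get(key, str)(value) for key, value in raw.items()}


# Hypothetical function URL and parameter types, for illustration only.
url = "bq://tvf.my-project/my_dataset/my_function?date=2024-03-05&some_nr=7"
params = parse_params(url, {"date": date.fromisoformat, "some_nr": int})
```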
Suggestion 5:
Discourage/disallow the use of the dollar sign, and any other symbols that have to be escaped when used in a terminal CLI.
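One way to check this rule mechanically is to ask whether a POSIX shell would need the URL quoted. The sketch below uses Python's `shlex.quote`, which returns its input unchanged only when every character is shell-safe; `needs_shell_quoting` is a hypothetical helper and the URLs are made-up examples.

```python
import shlex


def needs_shell_quoting(url):
    """True if the URL contains characters a POSIX shell would interpret."""
    return shlex.quote(url) != url


# Hypothetical URLs, for illustration only.
plain = needs_shell_quoting("bq://table.my-project/my_dataset/events_YYYYMMDD")
dollar = needs_shell_quoting("gs://my-bucket/path/$latest")
wildcard = needs_shell_quoting("gs://my-bucket/path/prefix*.avsc")
```

This check is deliberately strict: besides `$`, it also flags `*`, `?` and `&`, so wildcard and parameter URLs would still have to be quoted on the command line.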
Suggestion 6:
A rule of thumb could/should be that knowing the data URL is enough to find and read the data, without additional information from the contract, and that knowing where the data is stored is enough to construct the URL.
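The second half of that rule, going from a storage location to a URL, could be as small as the sketch below. The helper names and the example locations are hypothetical, and the URL shapes simply follow the examples earlier in this post.

```python
def bq_table_url(project, dataset, table, sharded=False):
    """Build a bq:// URL for a table, optionally as a date-sharded pattern."""
    suffix = "_YYYYMMDD" if sharded else ""
    return f"bq://table.{project}/{dataset}/{table}{suffix}"


def gs_url(bucket, object_path):
    """Build a gs:// URL for an object (or wildcard pattern) in a bucket."""
    return f"gs://{bucket}/{object_path.lstrip('/')}"


# Hypothetical locations, for illustration only.
table_url = bq_table_url("my-project", "my_dataset", "events", sharded=True)
object_url = gs_url("my-bucket-name", "path/to/object")
```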