[WIP] feature: add a new history format and history REST API #513
Conversation
We should have a way to specify the resampling per period length
```json
"endDate": "2018-10-06T04:00:02Z",
"generator": "Hand-written with love with the spec",
"objects": [
```
How about `results`?
`results` implies that there is a request. I think this format is also well suited to log files.
I initially had `vessels`, `aircraft`, `aton`, etc. but then decided to group them all together, because the context is enough to distinguish them. Do we have a standard name for what all those things are? Maybe we could replace `objects` with `contexts`?
I don't think this format is suitable for log files, as it is not appendable. Maybe this is just terminology: this works for capturing a set of data in a self-contained file, and "response" is not the best choice in that case.

I'd go with something like `data` or `seriesdata`.
> I don't think this format is suitable for log files, as it is not appendable.

Yes, I agree about it not being appendable. This is a vocabulary problem. I meant "log" as in "log of one sail" or "record of one race/cruise", not a running log that would be generated at runtime.

`data` works for me. We could do `history` too, since that is the name we are giving the format. `seriesData` is another very loaded term that will have a different meaning for everyone. (Of course `history` and `data` are pretty loaded too...)
```json
},
{
  "path": "navigation.speedOverGround",
  "source": { "label": "NMEA1" },
```
Alternatively, the source could be a single string in dot notation.
I am not following. Can you give an example?
Are you sure this is actually much less verbose? Looking at it with a human eye is misleading, as we pretty-print and compare. When you remove whitespace and CR/LF the effect is different. Waaay back I did a size comparison of delta vs. sparse full format; there was very little difference, mostly depending on how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition. But we do need a format for history. BTW, we should avoid calling the playback and snapshot functionality 'history'. It's becoming confusing, since they are likely to be different APIs and formats, e.g.
I do recognize that this is useful and I have just tried a few formats that feel very contrived. I think the best way might be to use the source for this. For example:
I think this fits well into our general model. Thoughts? Notes:
Yes for the query. For the response, it would be great to know the time slice so you can directly access time t by index. If we decide to do this, I will extend the validation to actually verify this too (I also need to check that the timestamps are all in chronological order and that none are repeated).
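As a sketch of what that direct access could look like if the response declared a start time and a fixed period (the names `startDate` and `periodMs` and the overall layout are assumptions, not an agreed format):

```javascript
// If the response declares a start time and a fixed period, a consumer can
// index straight into the values array without scanning timestamps.
const startDate = Date.parse('2018-10-06T04:00:00Z');
const periodMs = 60 * 1000; // one-minute slices (assumed for the example)
const values = [3.5, 3.6, null, 3.8]; // null = no data for that slice

function valueAt(t) {
  const index = Math.round((Date.parse(t) - startDate) / periodMs);
  return index >= 0 && index < values.length ? values[index] : undefined;
}

console.log(valueAt('2018-10-06T04:01:00Z')); // 3.6
```

This only works if the producer guarantees the chronological-order and no-gaps invariants discussed here, which is exactly what the extended validation would check.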
So I added support for 'history' format (very WIP) to my conversion tool in
Conclusions:
It would also be interesting to measure the 'in memory' size of the file in different engines. I know how to do this with Chrome but not with NodeJS. If anyone has ideas, let me know!
```json
"objects": [
  {
    "context": "vessels.urn:mrn:xxx",
    "timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z" ],
```
Is the history always contiguous? If yes, the timestamps as JSON strings are redundant. Parsing them into any native datetime format also takes a nontrivial amount of effort, compared to generating the time data during processing.

Should there be two formats, one for arbitrary time series with gaps and another for (start, end, periodLength) data?
I think having a history-like format for data without requiring that the history be contiguous is helpful. Making the history contiguous and regularly spaced is very hard. Making sure that all objects in the file have data on the same rhythm is also hard.
I think having two formats (or two variants of one format) might make sense here.
One word of caution though: if start/end/period is defined at the top level, then every AIS target will need to provide values for every individual timestamp, which probably means a huge number of `null`s in the file.

If we think one main vessel plus a bunch of AIS targets is going to be a frequent and important use case (I do), then the start/end should be defined at the vessel/object level.
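To make the concern concrete, here is a hypothetical variant (all field names here are assumptions, not an agreed format) where each object carries its own start and period, so a sparsely reporting AIS target does not force `null` padding across the whole file:

```json
{
  "objects": [
    {
      "context": "vessels.self",
      "startDate": "2018-10-06T04:00:00Z",
      "period": "PT1S",
      "properties": {
        "navigation.speedOverGround": { "values": [ 3.5, 3.6, 3.6 ] }
      }
    },
    {
      "context": "vessels.urn:mrn:ais-target",
      "startDate": "2018-10-06T04:02:10Z",
      "period": "PT10S",
      "properties": {
        "navigation.speedOverGround": { "values": [ 7.1, 7.0 ] }
      }
    }
  ]
}
```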
> Making the history contiguous and regularly spaced is very hard.

That totally depends on the implementation. For example, InfluxDB does it for you out of the box, filling holes with nulls.

Often you want data in coarse time intervals. Then some small holes or differing sampling intervals are not a problem, but missing head or tail parts naturally are.
An important use case is "give me all history between these times", e.g. I want a copy of all the raw data. So assuming there will always be a time slice is not valid.

Also, assuming startTime + (timeslice * n) = actual time is usually going to be valid, but it may fail erratically when we have nanos in the db (Influx) and ms in the data, or other rounding. Key values may move from one time slice to another.
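A tiny illustration, with made-up numbers, of the rounding failure mode described above:

```javascript
// Sub-slice jitter (e.g. nanosecond storage vs. millisecond data) can push
// a sample into the wrong slice if the index is computed with floor.
const startTime = 1538798400000; // ms since epoch (assumed)
const periodMs = 1000;

// A sample that logically belongs to slice 3, with tiny negative jitter.
const recordedAt = startTime + 3 * periodMs - 0.4;

const floored = Math.floor((recordedAt - startTime) / periodMs); // 2: off by one
const rounded = Math.round((recordedAt - startTime) / periodMs); // 3: tolerates jitter

console.log(floored, rounded);
```

Rounding to the nearest slice tolerates small jitter, but only papers over the problem if the jitter can exceed half a period.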
For raw data retrieval I don't think we really need anything more than deltas in an array, possibly with a bit of metadata in the header.
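A minimal sketch of that idea (the header field names are assumptions; the entries in the array reuse the existing delta shape):

```json
{
  "startDate": "2018-10-06T04:00:00Z",
  "endDate": "2018-10-06T04:00:02Z",
  "deltas": [
    {
      "context": "vessels.self",
      "updates": [
        {
          "timestamp": "2018-10-06T04:00:00Z",
          "values": [ { "path": "navigation.speedOverGround", "value": 3.5 } ]
        }
      ]
    }
  ]
}
```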
But then you have two formats for history depending on the query. Pushing out gzipped deltas works for both cases, and it solves the RAM problem at both ends, as they can be written and read/processed as a stream.

The in-memory size will quickly overrun the little RPi :-( The history implementation should use streaming, as we can easily produce arbitrarily large datasets. That doesn't mean ws etc., just streaming internally so memory stays tiny. If we reply with a gzipped stream of updates (via HTTP) then the format problem is also solved. Size is excellent, and we already have the handling code.
Unless we agree on the use cases, I think we can all be right at the same time and still disagree on the best solution. Reading the comments here, I think we have very different scenarios in mind. We need to clarify what use cases we are trying to solve for so we can make a decision: what are the types of apps that are consuming this history format, and what are they doing with it?

My main use case is displaying how data series change over time on a map or a graph. To do this, I need all the data in memory at once, I need to be able to quickly find the min/max of the values (for the scale of a graph, for example), and I need quick indexed access to the data. This is very expensive to do with the delta format, and that is why I would be happy to see a different format (such as the one I proposed in this PR) that would be much easier to consume directly.
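As an illustration of why the columnar layout proposed in this PR is cheap for the graphing use case (the values here are made up):

```javascript
// With one plain array of values per path, finding the min/max for a graph
// scale is a single pass, and a given time slot is a direct array index.
const values = [3.5, 3.6, null, 4.1, 2.9]; // null marks a missing sample

let min = Infinity;
let max = -Infinity;
for (const v of values) {
  if (v === null) continue;
  if (v < min) min = v;
  if (v > max) max = v;
}

console.log(min, max);   // 2.9 4.1
console.log(values[3]);  // direct indexed access: 4.1
```

Doing the same over a delta stream means unwrapping every `update`/`values` object and matching paths along the way.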
I agree with @sarfata, we should start from real-world use cases. We can also build some of the applications first, at the risk of doing things a few times over, and then come back to the spec issue once the most predominant use cases have been worked out.
We had a lengthy discussion with Rob over Slack, where the main point (from my point of view at least) was that to enable efficient stream-based handling, the format should have self-contained units: for example, having first a list of timestamps and then the corresponding data forces the consumer to collate data from different parts of the dataset. Putting the related items together instead allows more efficient processing in most circumstances. Too bad GeoJSON does not allow extending the coordinates...
My additional use case is to collect track, depth, and other info within a bounding box and display it on the chart. The same goes for wind, SOG, etc. for polar comparisons, engine performance comparisons, and so on.
To me the whole point of having a history API is to provide fast and convenient access to historical data with different aggregation options. Digesting the original delta stream does not fit any of those criteria.
Cross-posting the format being discussed:
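A sketch of the shape under discussion, reconstructed from the fragments quoted earlier in the thread (`endDate`, `generator`, `objects`, `context`, `timestamps`, `source`, and the `properties.*.values` rule appear there; `startDate` and the exact nesting of `properties` are assumptions):

```json
{
  "self": "vessels.urn:mrn:xxx",
  "startDate": "2018-10-06T04:00:00Z",
  "endDate": "2018-10-06T04:00:02Z",
  "generator": "Hand-written with love with the spec",
  "objects": [
    {
      "context": "vessels.urn:mrn:xxx",
      "timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:00:01Z", "2018-10-06T04:00:02Z" ],
      "properties": {
        "navigation.speedOverGround": {
          "source": { "label": "NMEA1" },
          "values": [ 3.5, null, 3.6 ]
        }
      }
    }
  ]
}
```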
I'm open to other formats that:
A thought here: a lot of the discussion is about the use case, and what suits it. Obviously that's different for different use cases. I think we should concentrate on the best format to transfer data, not to process data. This format could be used to dump data for backup, transfer data between Signal K instances, consolidate data to a different time slice, or return data for a history query. It should handle bad connections, potentially huge data sizes, and multi-vessel/complex query responses. In the delta format we only considered now(), so it was keyed on the vessel. In this case the natural key is

The production of data, and the clients' use of the data, is actually an implementation detail. There are so many use cases that we can't optimise a generic format for any specific one. If we really need to do that, then we should have a specific API for that use case, aka
It sounds like you are after a transfer format, and we should create it separately from a more use-case-driven format.
I agree with @tkurki; it seems like we should have a separate API for the transfer use cases.
So, what will this format look like? Like @rob42 mentioned above?
I think this is still useful. While the PR is stale now, the issue is going to come up again as soon as we use history in more complex ways. It also relates to #543, since both need a high-volume and very efficient transfer format.
Closing as stale, to be implemented as OpenApi description in the future, see SignalK/signalk-server#1653 |
Summary
This proposes a new SignalK data format that can be used to convey data about multiple SignalK objects (vessels, aton, aircraft, etc.) over a period of time.
Motivation
Multiple developers are working on history-related features, and there is a strong interest in agreeing on one common data format. Discussion was started and continues on Slack #history-api.
In this proposal, I suggest a new format for the following reasons:

- The delta format requires a new `update` object and a `value` object for each timestamp represented. It also repeats the key names for each timestamp.

This format can be used as both a "log" file format and a format served over HTTP from a server.
Detailed design
History data format
The history object provides essential information about the data included:

- `self`, if it is known (it is not required, because in some cases we do not know it)
- a `generator` key to include information on who generated this file (especially useful for log files)

And then a list of `objects`. Each object is identified by a `context`, like we already do on `delta` objects, but then for each one we provide:

A few important notes:
- The length of the `timestamps` array must be equal to the length of the `properties.*.values` arrays.
- Missing values are represented by `null`s in the `values` array.

History REST endpoint
(WIP - Need more work on this)
Drawback
This is one more format.
Alternatives
See the slack #history-api archive or https://docs.google.com/document/d/1s4_lHVVyKJlfacpq5LcUPQEZHvU5nSVtdtE0BxSbbSw/edit# for a summary of other proposals discussed.
Using delta format

Using geo-json

- The `navigation.position` key is not included