
[WIP] feature: add a new history format and history REST API #513

Closed
wants to merge 1 commit into from

Conversation

sarfata
Contributor

@sarfata sarfata commented Oct 9, 2018

Summary

I propose a new SignalK data format that can convey data about multiple SignalK objects (vessels, aton, aircraft, etc.) over a period of time.

Motivation

Multiple developers are working on history related features and there is a strong interest in agreeing on one common data format. Discussion was started and continues on Slack #history-api.

In this proposal, I suggest a new format for the following reasons:

  • The delta format can only carry information about one context;
  • The delta format is very verbose in this use case because it needs to provide at least one update object and one value object for each timestamp represented. It also repeats the key names for each timestamp;
  • The client needs to apply a lot of logic on a delta object to figure out:
    • which values (paths) are available,
    • which sources are available for each value,
    • the list of values for a given path.

This format can be used as both a "log" file format and a format served over HTTP from a server.

Detailed design

History data format

The history object provides essential information about the data included:

  • startDate / endDate
  • the identifier of self, if it is known (it is not required because in some cases we do not know it)
  • an optional generator key to include information about who generated this file (especially useful for log files)

For example:

  "version": "1.1.0",
  "startDate": "2018-10-06T04:00:00Z",
  "endDate": "2018-10-06T04:00:02Z",

Then comes a list of objects. Each object is identified by a context, as we already do in delta objects, and for each object we provide:

  • A list of timestamps
  • A list of properties
    • Each property has a path and a source (this way we can have data from multiple sources)
    • Each property has a list of values, which must have the same length as the list of timestamps
  "objects": [
    {
      "context": "vessels.urn:mrn:xxx",
      "timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z"],
      "properties": [
        {
          "path": "navigation.position",
          "source": { "label": "NMEA1" },
          "values": [
            { "longitude": -182.2, "latitude": -42.1},
            { "longitude": -182.2, "latitude": -42.1}
          ]
        }
      ]
    }
  ]

A few important notes:

  • The timestamps may be different for each object. They just need to be within the bounds of the file. That is because in many situations we will not have the same amount of information on all objects (an AIS target may have been visible for only a few minutes), and having a single list of timestamps for the entire file would force us to include a lot of nulls.
  • The length of the timestamps array must be equal to the length of the properties.*.values arrays.
  • If a value is not available for a given timestamp, it can be null in the values array.
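
A minimal TypeScript sketch of the shape described above; the field names mirror the example, but the keys for self and generator are not pinned down in this PR, so treat them as illustrative placeholders rather than part of the proposal:

    // Sketch of the proposed history format as TypeScript types. Nothing here is normative.
    interface HistoryFile {
      version: string;
      startDate: string;            // ISO 8601, e.g. "2018-10-06T04:00:00Z"
      endDate: string;
      self?: string;                // identifier of self, when known (assumed key name)
      generator?: string;           // who generated the file (useful for log files)
      objects: HistoryObject[];
    }

    interface HistoryObject {
      context: string;              // e.g. "vessels.urn:mrn:xxx"
      timestamps: string[];         // per-object timestamps, within [startDate, endDate]
      properties: HistoryProperty[];
    }

    interface HistoryProperty {
      path: string;                 // e.g. "navigation.position"
      source: { label: string };
      values: unknown[];            // same length as timestamps; an entry may be null to mark a gap
    }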

History REST endpoint

(WIP - Need more work on this)

Drawback

This is one more format.

Alternatives

See the Slack #history-api archive or https://docs.google.com/document/d/1s4_lHVVyKJlfacpq5LcUPQEZHvU5nSVtdtE0BxSbbSw/edit# for a summary of the other proposals discussed.

Using delta format

  • Very verbose
  • Only one context at a time
  • Harder to use for a client

Using geo-json

  • Not very appropriate when the navigation.position key is not included

@sarfata sarfata changed the title feature: add a new history format and history REST API (wip) feature: add a new history format and history REST API Oct 9, 2018
@tkurki tkurki changed the title (wip) feature: add a new history format and history REST API [WIP] feature: add a new history format and history REST API Oct 9, 2018
@tkurki
Member

tkurki commented Oct 9, 2018

We should have a way to specify the resampling per period length:

  • what samples the client requests: min/max/average/first/last, and a way to have, for example, both min and max for navigation.speedOverGround for each time period
  • period length / time slice, in query and in response

@tkurki
Member

tkurki commented Oct 9, 2018

See also #89 History/time series api and #363 Track API.

"endDate": "2018-10-06T04:00:02Z",
"generator": "Hand-written with love with the spec",

"objects": [
Member

How about results?

Contributor Author

results implies that there is a request. I think this format is also well suited to log files.

I initially had vessels, aircraft, aton, etc. but then decided to group them all together because the context is enough to distinguish them. Do we have a standard name for what all those things are? Maybe we could replace objects with contexts?

Member

I don't think this format is suitable for log files, as it is not appendable. Maybe this is just terminology: this works for capturing a set of data in a self-contained file, and response is not the best choice in that case.

I'd go with something like data or seriesdata.

Contributor Author

I don't think this format is suitable for log files, as it is not appendable.

Yes, I agree it is not appendable. This is a vocabulary problem. I meant "log" as in "log of one sail" or "record of one race/cruise", not a running log that would be generated at runtime.

data works for me. We could also do history, since that is the name we are giving the format. seriesData is another heavily loaded term that will mean something different to everyone (of course history and data are pretty loaded too...).

},
{
"path": "navigation.speedOverGround",
"source": { "label": "NMEA1" },
Member

Alternatively the single string, dot notation.

Contributor Author

I am not following. Can you give an example?

@rob42
Contributor

rob42 commented Oct 9, 2018

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

But we do need a format for history.

BTW we should avoid calling the playback and snapshot functionality 'history'. It's becoming confusing since they are likely to be different APIs and formats, e.g.:

  1. 'Playback' - extension to /signalk/v1/stream to replay data
  2. 'Snapshot' - extension to /signalk/v1/api to retrieve data at a point in time
  3. 'History' - new api to get bulk historic data by date-range.

@sarfata
Contributor Author

sarfata commented Oct 10, 2018

  • what samples the client requests: min/max/average/first/last, and a way to have, for example, both min and max for navigation.speedOverGround for each time period

I do recognize that this is useful, but the formats I have tried so far all feel very contrived. I think the best way might be to use the source for this.

For example:

  {
          "path": "navigation.speedOverGround",
          "source": { "label": "aggregation.max", type: "max", "originalSource": { "label": "NMEA1" } },
          "values": [ 12.2, 12.1, 10.9 ]
  }

I think this fits well into our general model. Thoughts?

Notes:

  • We still need to discuss how to request it
  • Samples easily map to timestamps, but with aggregation we will need to clarify whether the samples apply to the interval between two timestamps (in which case there should be one fewer value than timestamps) or "at" the timestamp. Need to look more into how InfluxDB does this.

period length / time slice, in query and in response

Yes for the query.

For the response, it would be great to know the time slice so you can directly access time t with values[(t - startTime) / period], but that means we would really be enforcing that all the timestamps precisely follow the period. Are we ready to include that requirement? Maybe it's an optional field, but when it is provided, timestamp n must be equal to startTime + n * period?
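
A minimal sketch of that lookup, assuming a hypothetical optional period field (in seconds) and timestamps that satisfy timestamp[n] = startTime + n * period:

    // Illustration only: direct index access when the series is regularly spaced.
    function valueAt<T>(values: (T | null)[], startTime: Date, periodSeconds: number, t: Date): T | null {
      const index = Math.round((t.getTime() - startTime.getTime()) / (periodSeconds * 1000));
      return index >= 0 && index < values.length ? values[index] : null;
    }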

If we decide to do this, I will extend the validation to actually verify this too (also need to check that they are all in chronological order and that they are not repeated).

@sarfata
Contributor Author

sarfata commented Oct 10, 2018

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

So I added support for the 'history' format (very WIP) to my conversion tool in strongly-signalk and did some tests using real log files people have sent me during ChartedSails development.

File | Original size | Original size + GZ | Size in SK Delta | Size in SK Delta + GZ | Size in SK History | Size in SK History + GZ
--- | --- | --- | --- | --- | --- | ---
Velocitek logfile (.vcc, which is GPX-like) (9 hours) | 1.7 MB | 208 kB | 3.4 MB | 312 kB | 1.3 MB | 236 kB
Log from Cassiopeia (1 hour SignalK log with 176k updates / 217k values) | 29 MB | 3.2 MB | 33 MB | 1.5 MB | 28 MB | 1.1 MB
Log from tkurki in SK (38 min SignalK log with 37k updates / 65k values) | 7.1 MB | 916 kB | 8.1 MB | 372 kB | 5.8 MB | 280 kB
Expedition san-francisco.csv example (16k updates / 92k values) | 1.4 MB | 328 kB | 8.1 MB | 632 kB | 2.7 MB | 316 kB

Conclusions:

  • The best gains are obtained through compression. No doubt about this.
  • History format as proposed here is always smaller than delta, both before and after compression and by a significant amount (30% to 70% reduction).
  • CSV is not a bad format for what we want to do here... (but it would be hard to include sources, multiple boats, etc)

It would also be interesting to measure the in-memory size of the file in different engines. I know how to do this with Chrome but not with Node.js. If anyone has ideas, let me know!

"objects": [
{
"context": "vessels.urn:mrn:xxx",
"timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z"],
Member

Is the history always contiguous? If so, the timestamps as JSON strings are redundant. Parsing them into any native datetime format also takes a nontrivial amount of effort, compared to generating the time data during processing.

Should there be two formats, one for arbitrary timeseries with gaps and another for (start, end, periodlength) data?

Contributor Author

I think having a history-like format for data without requiring that the history be contiguous is helpful. Making the history contiguous and regularly spaced is very hard. Making sure that all objects in the file have data on the same rhythm is also hard.

I think having two formats (or two variants of one format) might make sense here.

Contributor Author

@sarfata sarfata Oct 10, 2018

One word of caution though: if start/end/period is defined at the top level, then all AIS targets will need to provide values for every individual timestamp, which probably means a huge number of nulls in the file.

If we think one main vessel and a bunch of AIS targets is going to be a frequent and important use-case (I do) then the start/end should be defined at the vessel/object level.

Member

Making the history contiguous and regularly spaced is very hard.

Totally depends on the implementation. For example, InfluxDB does it for you out of the box: nulls for holes.

Often you want data in coarse time intervals. Then some small holes or having different sampling intervals are not a problem, but missing head or tail parts naturally are.
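
For illustration only (not part of the proposal), this is roughly what the "nulls for holes" behaviour looks like as an InfluxQL query; the measurement name and time range are made up:

    // Buckets samples into fixed 1-minute slices and returns null for empty buckets.
    const influxQuery = `
      SELECT MEAN("value")
      FROM "navigation.speedOverGround"
      WHERE time >= '2018-10-06T04:00:00Z' AND time < '2018-10-06T05:00:00Z'
      GROUP BY time(1m) fill(null)`;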

Contributor

@rob42 rob42 Oct 10, 2018

An important use case is "give me all history between these times", e.g. I want a copy of all the raw data. So assuming there will always be a timeslice is not valid.
Also, assuming startTime + (timeslice * n) = actual time is usually going to be valid, but it may fail erratically when we have nanoseconds in the db (Influx) and milliseconds in the data, or other rounding. Key values may move from one timeslice to another.

Member

For raw data retrieval I don't think we really need anything other than deltas in an array, possibly with a bit of metadata in the header.

Contributor

But then you have two formats for history depending on the query. Pushing out gzipped deltas works for both cases, and it solves the RAM problems at both ends since they can be written and read/processed as a stream.

@rob42
Contributor

rob42 commented Oct 10, 2018

In-memory size will quickly overrun the little RPi :-(

The history implementation should use streaming, as we can easily produce arbitrarily large datasets. That doesn't mean WebSockets etc., just streaming internally so memory stays tiny.

If we reply with a gzipped stream of updates (via http) then the format problem is also solved. Size is excellent, and we already have the handling code.
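
A minimal Node.js/TypeScript sketch of that idea, assuming the log is stored as gzipped, newline-delimited deltas (that file layout is an assumption for illustration, not something specified here):

    import { createReadStream } from "fs";
    import { createGunzip } from "zlib";
    import { createInterface } from "readline";

    // Reads a gzipped delta log one line at a time, so memory stays flat
    // regardless of how large the dataset is.
    async function processDeltaLog(path: string, onDelta: (delta: unknown) => void) {
      const lines = createInterface({
        input: createReadStream(path).pipe(createGunzip()),
        crlfDelay: Infinity,
      });
      for await (const line of lines) {
        if (line.trim().length === 0) continue;
        onDelta(JSON.parse(line)); // one self-contained delta per line
      }
    }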

@sarfata
Contributor Author

sarfata commented Oct 14, 2018

Unless we agree on the use-cases, I think we can all be right at the same time and still disagree on the best solution. Reading comments here, I think we have very different scenarios in mind. We need to clarify what use-cases we are trying to solve for so we can make a decision: What are the types of apps that are consuming this history format? What are they doing with it?

My main use case is displaying how data series change over time on a map or a graph. To do this, I need all the data in memory at once, and I need to be able to quickly find the min/max of a value (for the scale of a graph, for example) and to quickly access data by index. This is very expensive to do with the delta format, and that is why I would be happy to see a different format (such as the one I proposed in this PR) that would be much easier to consume directly.
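
As a rough sketch of what "easy to consume" means here: with the column-oriented layout proposed above, finding a graph scale is a single pass over one values array (the types are only illustrative):

    // Compute the min/max of one property's values, skipping null gaps.
    function valueRange(values: (number | null)[]): { min: number; max: number } | null {
      let min = Infinity;
      let max = -Infinity;
      for (const v of values) {
        if (v === null) continue;
        if (v < min) min = v;
        if (v > max) max = v;
      }
      return min <= max ? { min, max } : null;
    }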

@tkurki
Member

tkurki commented Oct 14, 2018

I agree with @sarfata, we should start from real world use cases.

We can also first build some of the applications, at the risk of doing things a few times over, and then come back to the spec issue once the most predominant use cases have been worked out.

@tkurki
Member

tkurki commented Oct 14, 2018

We had a lengthy discussion with Rob over Slack, where the main point (from my point of view at least) was that to enable efficient stream-based handling the format should have self-contained units: for example, having first a list of timestamps and then the corresponding data forces the consumer to collate data from different parts of the dataset. Putting the related items together instead allows more efficient processing in most circumstances.

Too bad GeoJSON does not allow extending the coordinates...

@rob42
Contributor

rob42 commented Oct 14, 2018

My additional use case is to collect track, depth and other info within a bounding box and display it on the chart. The same goes for wind, SOG, etc. for polar comparisons, engine performance comparisons, and so on.
Also data export to the cloud over an intermittent connection.
@tkurki since you need the full message to make the in-RAM array work, you could just ingest the stream and build the array locally?

@tkurki
Member

tkurki commented Oct 14, 2018

To me the whole point of having a history API is to provide fast and convenient access to historical data with different aggregation options.

Digesting the original delta stream does not fit any of those criteria.

@rob42
Contributor

rob42 commented Oct 14, 2018

Cross-posting the format being discussed:

[
    {"2015-03-07T12:37:10.523+13:00": [
            {"vessels.urn:mrn:imo:mmsi:234567890": [
                    {
                        "navigation.position": {
                            "$source": "a.suitable.path",
                            "average": {
                                "longitude": 24.9025173,
                                "latitude": 60.039317
                            }
                        },
                        "environment.depth.belowSurface": {
                            "$source": "a.suitable.path",
                            "avg": 2.5,
                            "max": 2.8,
                            "min": 2.5
                        }
                    }
                ]
            },
            ...more vessels
        ]
    },
    ...more timestamps
]

@rob42
Contributor

rob42 commented Oct 14, 2018

I'm open to other formats that:

  1. Are 'packetised' so we can write/read them with low resources and they are resilient when partially sent/received over intermittent links.
  2. Handle multiple vessels.
  3. Handle complex combinations of paths, nulls, missing data, etc.

@rob42
Contributor

rob42 commented Oct 15, 2018

A thought here: a lot of the discussion is about the use case, and what suits it. Obviously that is different for different use cases. I think we should concentrate on the best format to transfer data, not to process data.

This format could be used to dump data for backup, transfer data between SignalK instances, consolidate data to a different timeslice, or return data for a history query. It should handle bad connections, potentially huge data sizes, and multi-vessel/complex query responses. In the delta format we only considered now(), so it was keyed on the vessel. In this case the natural key is the timestamp, hence the natural self-contained unit or packet is one timeSlice.

The production of data and the client's use of the data are actually implementation details. There are so many use cases that we can't optimise a generic format for any specific one. If we really need to do that, then we should have a specific API for that use case, e.g. /track/.

@tkurki
Member

tkurki commented Oct 15, 2018

It sounds like you are after a transfer format, and we should create it separately from a more use-case-driven format.

@sbender9
Member

I agree with @tkurki, it seems like we should have a separate API for the transfer use cases.

@gdavydov

So, what will this format look like? Like @rob42 mentioned above?

@fabdrol
Member

fabdrol commented Jun 5, 2019

@sarfata @rob42 @tkurki this seems stale. In any case there are conflicts. Please update or we should close and revisit at a later time. Thoughts?

@rob42
Contributor

rob42 commented Jun 6, 2019

I think this is still useful. While the PR is stale now, the issue is going to come up again as soon as we use history in more complex ways. It also relates to #543, since both need a high-volume, very efficient transfer format.

@tkurki
Member

tkurki commented Mar 1, 2024

Closing as stale, to be implemented as an OpenAPI description in the future; see SignalK/signalk-server#1653.

@tkurki tkurki closed this Mar 1, 2024