
[WIP] feature: add a new history format and history REST API #513

Closed
wants to merge 1 commit into from

Conversation

sarfata
Contributor

@sarfata sarfata commented Oct 9, 2018

Summary

I propose a new SignalK data format that can convey data about multiple SignalK objects (vessels, aton, aircraft, etc.) over a period of time.

Motivation

Multiple developers are working on history related features and there is a strong interest in agreeing on one common data format. Discussion was started and continues on Slack #history-api.

In this proposal, I suggest a new format for the following reasons:

  • The delta format can only carry information about one context;
  • The delta format is very verbose in this use case because it needs to provide at least one update object and one value object for each timestamp represented. It also repeats the key names for each timestamp;
  • The client needs to apply a lot of logic on a delta object to figure out:
    • which values (paths) are available,
    • which sources are available for each value,
    • the list of values for a given path.

This format can be used as both a "log" file format and a format served over HTTP from a server.

Detailed design

History data format

The history object provides essential information about the data included:

  • startDate / endDate
  • the identifier of self, if it is known (it is not required because in some cases we do not know it)
  • an optional generator key to include information about who generated this file (especially useful for log files)

For example:

  "version": "1.1.0",
  "startDate": "2018-10-06T04:00:00Z",
  "endDate": "2018-10-06T04:00:02Z",

Then comes a list of objects. Each object is identified by a context, as we already do in delta objects, and for each object we provide:

  • A list of timestamps
  • A list of properties
    • Each property has a path and a source (this way we can have data from multiple sources)
    • Each property has a list of values, which must have the same length as the list of timestamps
  "objects": [
    {
      "context": "vessels.urn:mrn:xxx",
      "timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z"],
      "properties": [
        {
          "path": "navigation.position",
          "source": { "label": "NMEA1" },
          "values": [
            { "longitude": -182.2, "latitude": -42.1},
            { "longitude": -182.2, "latitude": -42.1}
          ]
        }
      ]
    }
  ]

A few important notes:

  • The timestamps may be different for each object. They just need to be within the bounds of the file. That is because in many situations we will not have the same amount of information on all objects (an AIS target may have been visible for only a few minutes), and having a single list of timestamps for the entire file would force us to include a lot of nulls.
  • The length of the timestamps array must be equal to the length of the properties.*.values arrays.
  • If a value is not available for a given timestamp, it can be null in the values array.
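
A minimal TypeScript sketch of the shape described above; the field names mirror the example, but the keys for self and generator are not pinned down in this PR, so treat them as illustrative placeholders rather than part of the proposal:

    // Sketch of the proposed history format as TypeScript types. Nothing here is normative.
    interface HistoryFile {
      version: string;
      startDate: string;            // ISO 8601, e.g. "2018-10-06T04:00:00Z"
      endDate: string;
      self?: string;                // identifier of self, when known (assumed key name)
      generator?: string;           // who generated the file (useful for log files)
      objects: HistoryObject[];
    }

    interface HistoryObject {
      context: string;              // e.g. "vessels.urn:mrn:xxx"
      timestamps: string[];         // per-object timestamps, within [startDate, endDate]
      properties: HistoryProperty[];
    }

    interface HistoryProperty {
      path: string;                 // e.g. "navigation.position"
      source: { label: string };
      values: unknown[];            // same length as timestamps; an entry may be null to mark a gap
    }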

History REST endpoint

(WIP - Need more work on this)

Drawback

This is one more format.

Alternatives

See the Slack #history-api archive or https://docs.google.com/document/d/1s4_lHVVyKJlfacpq5LcUPQEZHvU5nSVtdtE0BxSbbSw/edit# for a summary of the other proposals discussed.

Using delta format

  • Very verbose
  • Only one context at a time
  • Harder to use for a client

Using geo-json

  • Not very appropriate when the navigation.position key is not included

@sarfata sarfata changed the title feature: add a new history format and history REST API (wip) feature: add a new history format and history REST API Oct 9, 2018
@tkurki tkurki changed the title (wip) feature: add a new history format and history REST API [WIP] feature: add a new history format and history REST API Oct 9, 2018
@tkurki
Member

tkurki commented Oct 9, 2018

We should have a way to specify the resampling per period length:

  • what samples the client requests: min/max/average/first/last, and a way to have, for example, both min and max for navigation.speedOverGround for each time period
  • period length / time slice, in query and in response

@tkurki
Member

tkurki commented Oct 9, 2018

See also #89 History/time series api and #363 Track API.

"endDate": "2018-10-06T04:00:02Z",
"generator": "Hand-written with love with the spec",

"objects": [
Member

How about results?

Contributor Author

results implies that there is a request. I think this format is also well suited to log files.

I initially had vessels, aircraft, aton, etc. but then decided to group them all together because the context is enough to distinguish them. Do we have a standard name for what all those things are? Maybe we could replace objects with contexts?

Member

I don't think this format is suitable for log files, as it is not appendable. Maybe this is just terminology: this works for capturing a set of data in a self-contained file, and response is not the best choice in that case.

I'd go with something like data or seriesdata.

Contributor Author

I don't think this format is suitable for log files, as it is not appendable.

Yes, I agree it is not appendable. This is a vocabulary problem. I meant "log" as in "log of one sail" or "record of one race/cruise", not a running log that would be generated at runtime.

data works for me. We could also do history, since that is the name we are giving the format. seriesData is another heavily loaded term that will mean something different to everyone (of course history and data are pretty loaded too...).

},
{
"path": "navigation.speedOverGround",
"source": { "label": "NMEA1" },
Member

Alternatively the single string, dot notation.

Contributor Author

I am not following. Can you give an example?

@rob42
Contributor

rob42 commented Oct 9, 2018

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

But we do need a format for history.

BTW we should avoid calling the playback and snapshot functionality 'history'. It's becoming confusing since they are likely to be different APIs and formats, e.g.:

  1. 'Playback' - extension to /signalk/v1/stream to replay data
  2. 'Snapshot' - extension to /signalk/v1/api to retrieve data at a point in time
  3. 'History' - new api to get bulk historic data by date-range.

@sarfata
Contributor Author

sarfata commented Oct 10, 2018

  • what samples the client requests: min/max/average/first/last, and a way to have, for example, both min and max for navigation.speedOverGround for each time period

I do recognize that this is useful, but the formats I have tried so far all feel very contrived. I think the best way might be to use the source for this.

For example:

  {
          "path": "navigation.speedOverGround",
          "source": { "label": "aggregation.max", type: "max", "originalSource": { "label": "NMEA1" } },
          "values": [ 12.2, 12.1, 10.9 ]
  }

I think this fits well into our general model. Thoughts?

Notes:

  • We still need to discuss how to request it
  • Samples easily map to timestamps, but with aggregation we will need to clarify whether the samples apply to the interval between two timestamps (in which case there should be one fewer value than timestamps) or "at" the timestamp. Need to look more into how InfluxDB does this.

period length / time slice, in query and in response

Yes for the query.

For the response, it would be great to know the time slice so you can directly access time t with values[(t - startTime) / period], but that means we would really be enforcing that all the timestamps precisely follow the period. Are we ready to include that requirement? Maybe it's an optional field, but when it is provided, timestamp n must be equal to startTime + n * period?
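
A minimal sketch of that lookup, assuming a hypothetical optional period field (in seconds) and timestamps that satisfy timestamp[n] = startTime + n * period:

    // Illustration only: direct index access when the series is regularly spaced.
    function valueAt<T>(values: (T | null)[], startTime: Date, periodSeconds: number, t: Date): T | null {
      const index = Math.round((t.getTime() - startTime.getTime()) / (periodSeconds * 1000));
      return index >= 0 && index < values.length ? values[index] : null;
    }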

If we decide to do this, I will extend the validation to actually verify this too (also need to check that they are all in chronological order and that they are not repeated).

@sarfata
Contributor Author

sarfata commented Oct 10, 2018

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

So I added support for the 'history' format (very WIP) to my conversion tool in strongly-signalk and did some tests using real log files people have sent me during ChartedSails development.

File | Original size | Original size + GZ | Size in SK Delta | Size in SK Delta + GZ | Size in SK History | Size in SK History + GZ
--- | --- | --- | --- | --- | --- | ---
Velocitek logfile (.vcc, which is GPX-like) (9 hours) | 1.7 MB | 208 kB | 3.4 MB | 312 kB | 1.3 MB | 236 kB
Log from Cassiopeia (1 hour SignalK log with 176k updates / 217k values) | 29 MB | 3.2 MB | 33 MB | 1.5 MB | 28 MB | 1.1 MB
Log from tkurki in SK (38 min SignalK log with 37k updates / 65k values) | 7.1 MB | 916 kB | 8.1 MB | 372 kB | 5.8 MB | 280 kB
Expedition san-francisco.csv example (16k updates / 92k values) | 1.4 MB | 328 kB | 8.1 MB | 632 kB | 2.7 MB | 316 kB

Conclusions:

  • The best gains are obtained through compression. No doubt about this.
  • History format as proposed here is always smaller than delta, both before and after compression and by a significant amount (30% to 70% reduction).
  • CSV is not a bad format for what we want to do here... (but it would be hard to include sources, multiple boats, etc)

It would also be interesting to measure the in-memory size of the file in different engines. I know how to do this with Chrome but not with Node.js. If anyone has ideas, let me know!

"objects": [
{
"context": "vessels.urn:mrn:xxx",
"timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z"],
Member

Is the history always contiguous? If so, the timestamps as JSON strings are redundant. Parsing them into any native datetime format also takes a nontrivial amount of effort, compared to generating the time data during processing.

Should there be two formats, one for arbitrary timeseries with gaps and another for (start, end, periodlength) data?

Contributor Author

I think having a history-like format for data without requiring that the history be contiguous is helpful. Making the history contiguous and regularly spaced is very hard. Making sure that all objects in the file have data on the same rhythm is also hard.

I think having two formats (or two variants of one format) might make sense here.

Contributor Author

@sarfata sarfata Oct 10, 2018

One word of caution though: if start/end/period is defined at the top level, then all AIS targets will need to provide values for every individual timestamp, which probably means a huge number of nulls in the file.

If we think one main vessel and a bunch of AIS targets is going to be a frequent and important use-case (I do) then the start/end should be defined at the vessel/object level.

Member

Making the history contiguous and regularly spaced is very hard.

Totally depends on the implementation. For example, InfluxDB does it for you out of the box: nulls for holes.

Often you want data in coarse time intervals. Then some small holes or having different sampling intervals are not a problem, but missing head or tail parts naturally are.
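
For illustration only (not part of the proposal), this is roughly what the "nulls for holes" behaviour looks like as an InfluxQL query; the measurement name and time range are made up:

    // Buckets samples into fixed 1-minute slices and returns null for empty buckets.
    const influxQuery = `
      SELECT MEAN("value")
      FROM "navigation.speedOverGround"
      WHERE time >= '2018-10-06T04:00:00Z' AND time < '2018-10-06T05:00:00Z'
      GROUP BY time(1m) fill(null)`;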

Contributor

@rob42 rob42 Oct 10, 2018

An important use case is "give me all history between these times", e.g. I want a copy of all the raw data. So assuming there will always be a timeslice is not valid.
Also, assuming startTime + (timeslice * n) = actual time is usually going to be valid, but it may fail erratically when we have nanoseconds in the db (Influx) and milliseconds in the data, or other rounding. Key values may move from one timeslice to another.

Member

For raw data retrieval I don't think we really need anything other than deltas in an array, possibly with a bit of metadata in the header.

Contributor

But then you have two formats for history depending on the query. Pushing out gzipped deltas works for both cases, and it solves the RAM problems at both ends since they can be written and read/processed as a stream.

@rob42
Contributor

rob42 commented Oct 10, 2018

In-memory size will quickly overrun the little RPi :-(

The history implementation should use streaming, as we can easily produce arbitrarily large datasets. That doesn't mean WebSockets etc., just streaming internally so memory stays tiny.

If we reply with a gzipped stream of updates (via http) then the format problem is also solved. Size is excellent, and we already have the handling code.
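
A minimal Node.js/TypeScript sketch of that idea, assuming the log is stored as gzipped, newline-delimited deltas (that file layout is an assumption for illustration, not something specified here):

    import { createReadStream } from "fs";
    import { createGunzip } from "zlib";
    import { createInterface } from "readline";

    // Reads a gzipped delta log one line at a time, so memory stays flat
    // regardless of how large the dataset is.
    async function processDeltaLog(path: string, onDelta: (delta: unknown) => void) {
      const lines = createInterface({
        input: createReadStream(path).pipe(createGunzip()),
        crlfDelay: Infinity,
      });
      for await (const line of lines) {
        if (line.trim().length === 0) continue;
        onDelta(JSON.parse(line)); // one self-contained delta per line
      }
    }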

@sarfata
Contributor Author

sarfata commented Oct 14, 2018

Unless we agree on the use-cases, I think we can all be right at the same time and still disagree on the best solution. Reading comments here, I think we have very different scenarios in mind. We need to clarify what use-cases we are trying to solve for so we can make a decision: What are the types of apps that are consuming this history format? What are they doing with it?

My main use case is displaying how data series change over time on a map or a graph. To do this, I need all the data in memory at once, and I need to be able to quickly find the min/max of a value (for the scale of a graph, for example) and to quickly access data by index. This is very expensive to do with the delta format, and that is why I would be happy to see a different format (such as the one I proposed in this PR) that would be much easier to consume directly.
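
As a rough sketch of what "easy to consume" means here: with the column-oriented layout proposed above, finding a graph scale is a single pass over one values array (the types are only illustrative):

    // Compute the min/max of one property's values, skipping null gaps.
    function valueRange(values: (number | null)[]): { min: number; max: number } | null {
      let min = Infinity;
      let max = -Infinity;
      for (const v of values) {
        if (v === null) continue;
        if (v < min) min = v;
        if (v > max) max = v;
      }
      return min <= max ? { min, max } : null;
    }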

@tkurki
Member

tkurki commented Oct 14, 2018

I agree with @sarfata, we should start from real world use cases.

We can also first build some of the applications, at the risk of doing things a few times over, and then come back to the spec issue once the most predominant use cases have been worked out.

@tkurki
Member

tkurki commented Oct 14, 2018

We had a lengthy discussion with Rob over Slack, where the main point (from my point of view at least) was that to enable efficient stream-based handling the format should have self-contained units: for example, having first a list of timestamps and then the corresponding data forces the consumer to collate data from different parts of the dataset. Putting the related items together instead allows more efficient processing in most circumstances.

Too bad GeoJSON does not allow extending the coordinates...

@rob42
Contributor

rob42 commented Oct 14, 2018

My additional use case is to collect track, depth and other info within a bounding box and display it on the chart. The same goes for wind, SOG, etc. for polar comparisons, engine performance comparisons, and so on.
Also data export to the cloud over an intermittent connection.
@tkurki since you need the full message to make the in-RAM array work, you could just ingest the stream and build the array locally?

@tkurki
Member

tkurki commented Oct 14, 2018

To me the whole point of having a history API is to provide fast and convenient access to historical data with different aggregation options.

Digesting the original delta stream does not fit any of those criteria.

@rob42
Contributor

rob42 commented Oct 14, 2018

Cross-posting the format being discussed:

[
    {"2015-03-07T12:37:10.523+13:00": [
            {"vessels.urn:mrn:imo:mmsi:234567890": [
                    {
                        "navigation.position": {
                            "$source": "a.suitable.path",
                            "average": {
                                "longitude": 24.9025173,
                                "latitude": 60.039317
                            }
                        },
                        "environment.depth.belowSurface": {
                            "$source": "a.suitable.path",
                            "avg": 2.5,
                            "max": 2.8,
                            "min": 2.5
                        }
                    }
                ]
            },
            ...more vessels
        ]
    },
    ...more timestamps
]

@rob42
Contributor

rob42 commented Oct 14, 2018

I'm open to other formats that:

  1. Are 'packetised' so we can write/read them with low resources and they are resilient when partially sent/received over intermittent links.
  2. Handle multiple vessels.
  3. Handle complex combinations of paths, nulls, missing data, etc.

@rob42
Contributor

rob42 commented Oct 15, 2018

A thought here: a lot of the discussion is about the use case, and what suits it. Obviously that is different for different use cases. I think we should concentrate on the best format to transfer data, not to process data.

This format could be used to dump data for backup, transfer data between SignalK instances, consolidate data to a different timeslice, or return data for a history query. It should handle bad connections, potentially huge data sizes, and multi-vessel/complex query responses. In the delta format we only considered now(), so it was keyed on the vessel. In this case the natural key is the timestamp, hence the natural self-contained unit or packet is one timeSlice.

The production of data and the client's use of the data are actually implementation details. There are so many use cases that we can't optimise a generic format for any specific one. If we really need to do that, then we should have a specific API for that use case, e.g. /track/.

@tkurki
Member

tkurki commented Oct 15, 2018

It sounds like you are after a transfer format, and we should create it separately from a more use-case-driven format.

@sbender9
Member

I agree with @tkurki, it seems like we should have a separate API for the transfer use cases.

@gdavydov

So, what will this format look like? Like @rob42 mentioned above?

@fabdrol
Member

fabdrol commented Jun 5, 2019

@sarfata @rob42 @tkurki this seems stale. In any case there are conflicts. Please update or we should close and revisit at a later time. Thoughts?

@rob42
Contributor

rob42 commented Jun 6, 2019

I think this is still useful. While the PR is stale now, the issue is going to come up again as soon as we use history in more complex ways. It also relates to #543, since both need a high-volume, very efficient transfer format.

@tkurki
Member

tkurki commented Mar 1, 2024

Closing as stale, to be implemented as an OpenAPI description in the future; see SignalK/signalk-server#1653.

@tkurki tkurki closed this Mar 1, 2024