High level description
Storing the OD results (estimates and residuals) in CSV files is inefficient and cumbersome. We propose implementing a to_parquet function for the ODProcess structure. The Apache Parquet format is well suited for this type of data: by storing the data in Parquet, users can query and analyze the estimates and residuals much more efficiently. The data should be stored in base units (it is currently in kilometers, which leads to error plots expressed in confusing "micro kilometers").
Requirements
The ODProcess structure must have a to_parquet method to export its data to Parquet.
The to_parquet method must take (a possible signature is sketched after this list):
A path to the Parquet file.
(Optional) A list of EventEvaluator to evaluate events. If provided, the events data is also exported.
An ExportCfg configuration object to configure the export (consider reusing the configuration currently used for trajectory export).
The method must export:
The estimates (state, state deviation, nominal state)
The residuals (prefit and postfit for each measurement type)
The events data (if evaluators were provided)
Null values should be used when there is no data for an epoch (e.g. no measurement, no event). (Human: this is achievable already because the FloatType is an Option<f64>.)
The schema of the Parquet file should be flexible to support different measurement and event types between OD processes.
The export process should be efficient and not require much overhead. The larger the OD result set, the greater the benefit of using Parquet.
Appropriate error handling should be in place in case of issues exporting or writing to the Parquet file.
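To make these requirements concrete, here is a minimal, hypothetical sketch of what the signature could look like. The stub types below merely stand in for the real ODProcess, ExportCfg, and EventEvaluator definitions; the actual generics, trait bounds, and error types will differ.

```rust
use std::error::Error;
use std::path::{Path, PathBuf};

// Placeholder stubs: the real ODProcess, ExportCfg, and EventEvaluator carry
// generics and state that are omitted here for brevity.
pub struct ODProcess;
pub struct ExportCfg;
pub trait EventEvaluator {}

impl ODProcess {
    /// Proposed export entry point: writes the estimates, residuals, and
    /// (if evaluators are provided) event data to a Parquet file, returning
    /// the path that was actually written.
    pub fn to_parquet<P: AsRef<Path>>(
        &self,
        path: P,
        events: Option<Vec<Box<dyn EventEvaluator>>>,
        cfg: ExportCfg,
    ) -> Result<PathBuf, Box<dyn Error>> {
        let _ = (events, cfg); // placeholders until the exporter is implemented
        // 1. Build a flexible Arrow schema from the measurement/event types present.
        // 2. Fill column builders epoch by epoch, inserting nulls where data is missing.
        // 3. Write the record batches to `path` and return the resolved path.
        Ok(path.as_ref().to_path_buf())
    }
}
```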
Test plans
Unit tests:
Export a small OD result set (a few epochs) to Parquet and read it back, asserting the data is correct (a round-trip sketch follows this list).
Export an OD result set with null values (missing measurements/events) and assert they are handled properly.
Export an OD result set with different measurement types and assert the schema is flexible enough.
Assert any error during export is handled properly.
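A possible shape for the round-trip and null-handling tests, assuming the export is built on the arrow and parquet crates (the column names and epochs below are made up):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::arrow_writer::ArrowWriter;

#[test]
fn od_parquet_roundtrip_with_nulls() {
    // Tiny stand-in for an OD result set: two epochs, one residual column,
    // with a null marking a missing measurement at the second epoch.
    let schema = Arc::new(Schema::new(vec![
        Field::new("epoch", DataType::Utf8, false),
        Field::new("residual_range_km", DataType::Float64, true),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec![
                "2023-01-01T00:00:00 UTC",
                "2023-01-01T00:01:00 UTC",
            ])) as ArrayRef,
            Arc::new(Float64Array::from(vec![Some(1.23), None])) as ArrayRef,
        ],
    )
    .unwrap();

    // Write the batch to a temporary Parquet file.
    let path = std::env::temp_dir().join("od_roundtrip.parquet");
    let mut writer = ArrowWriter::try_new(File::create(&path).unwrap(), schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Read it back and check that values and nulls survived unchanged.
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(&path).unwrap())
        .unwrap()
        .build()
        .unwrap();
    let read_back: Vec<RecordBatch> = reader.map(Result::unwrap).collect();
    assert_eq!(read_back.len(), 1);
    assert_eq!(read_back[0], batch);
}
```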
Integration tests:
Export a large OD result set (thousands of epochs) to Parquet and read it back, asserting that no data is lost and that performance remains acceptable.
~~Query and analyze the data in various ways to ensure the benefits of using Parquet are achieved.~~ This is evident compared to the current CSV format.
Edge cases:
An empty OD result set (no estimates or residuals).
An OD result set with only estimates and no residuals. Not possible.
A malformed ExportCfg object. Not possible.
Invalid path provided. Handled by the path object.
Lack of write permissions to the path. Handled by the path object.
Corrupted Parquet file as input. Not possible since I create a new Parquet file.
Out of memory issues when exporting very large result sets. These failures would need to be handled gracefully; they would surface as a dyn Error, which is supported. Not sure how to test this on any of the machines I have since they have several GBs of RAM.
Benchmark tests:
Compare the performance of exporting to Parquet vs. CSV for large result sets; Parquet should provide major speedups and efficiency gains (a rough timing harness is sketched below).
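A rough comparison could be as simple as timing both exports on the same result set. The closures below are hypothetical stand-ins for the existing CSV export and the proposed to_parquet call; a Criterion benchmark would be the more rigorous option.

```rust
use std::time::Instant;

/// Rough comparison harness; `export_csv` and `export_parquet` are hypothetical
/// closures wrapping the existing CSV export and the proposed to_parquet call.
fn compare_export_times(export_csv: impl FnOnce(), export_parquet: impl FnOnce()) {
    let start = Instant::now();
    export_csv();
    println!("CSV export took     {:?}", start.elapsed());

    let start = Instant::now();
    export_parquet();
    println!("Parquet export took {:?}", start.elapsed());
}
```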
Documentation and examples:
Clearly document the to_parquet method and ExportCfg to ensure proper usage.
Provide examples of querying and analyzing the exported Parquet data (e.g. column projection, as sketched below).
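For instance, a documentation example could show how to read back only a couple of columns of the exported file, which is exactly where Parquet beats CSV. The file name and column indices below are illustrative only.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn read_selected_columns() -> Result<(), Box<dyn std::error::Error>> {
    // "od_results.parquet" and the column indices are illustrative only.
    let file = File::open("od_results.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Only decode the columns of interest (e.g. the epoch and one residual)
    // instead of re-parsing the entire file as a CSV reader would.
    let mask = ProjectionMask::roots(builder.parquet_schema(), [0, 3]);
    let reader = builder.with_projection(mask).build()?;

    for maybe_batch in reader {
        let batch = maybe_batch?;
        println!("read {} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```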
Design
Here is a Mermaid JS diagram showing the proposed implementation:
```mermaid
sequenceDiagram
    participant User
    participant ODProcess
    participant ParquetExporter
    User->>ODProcess: Call to_parquet() method
    ODProcess->>ParquetExporter: Instantiate exporter
    ODProcess->>ParquetExporter: Provide estimates and residuals data
    ParquetExporter->>ParquetExporter: Validate input data
    ParquetExporter->>ParquetExporter: Create Parquet schema
    ParquetExporter->>ParquetExporter: Write data to Parquet file
    ParquetExporter-->>ODProcess: Return path to exported file
    ODProcess-->>User: Return path to exported file
```
Consider using the builder pattern from https://github.com/apache/arrow-rs/blob/master/arrow/examples/builders.rs, or parquet_derive directly (though this might not work because it needs a custom struct for each variation), cf. https://github.com/apache/arrow-rs/blob/master/parquet_derive/README.md .
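A minimal sketch of that builder approach, assuming one nullable Float64 column per measurement/residual type so that Option<f64> values map directly to Parquet nulls. The helper name and column layout are placeholders, and builder constructors vary slightly between arrow-rs versions.

```rust
use std::fs::File;
use std::path::Path;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Builder, StringBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

// Hypothetical helper: one (name, values) pair per measurement/residual type,
// where `None` marks an epoch without data for that column.
fn write_od_columns(
    path: &Path,
    epochs: &[String],
    columns: &[(String, Vec<Option<f64>>)],
) -> Result<(), Box<dyn std::error::Error>> {
    // Build the schema dynamically so each OD process can export a different set of columns.
    let mut fields = vec![Field::new("epoch", DataType::Utf8, false)];
    for (name, _) in columns {
        fields.push(Field::new(name.as_str(), DataType::Float64, true)); // nullable
    }
    let schema = Arc::new(Schema::new(fields));

    // Builder pattern: append values (or nulls) into one builder per column.
    let mut epoch_builder = StringBuilder::new();
    for epoch in epochs {
        epoch_builder.append_value(epoch);
    }
    let mut arrays: Vec<ArrayRef> = vec![Arc::new(epoch_builder.finish()) as ArrayRef];
    for (_, values) in columns {
        let mut builder = Float64Builder::new();
        for value in values {
            builder.append_option(*value); // Option<f64> maps directly to a Parquet null or value
        }
        arrays.push(Arc::new(builder.finish()) as ArrayRef);
    }

    // Write a single record batch; chunking every N epochs would bound memory usage.
    let batch = RecordBatch::try_new(schema.clone(), arrays)?;
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```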
Coauthors: Claude by Anthropic and GPT-4