Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve memory footprint of columnar deserialization using relevant filter #895

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

TonyXiang8787
Copy link
Member

@TonyXiang8787 TonyXiang8787 commented Feb 13, 2025

Background

When using PGM deserialization, the user can select a relevant filter for columnar data format. The de-serializer will return only relevant attributes for that component.

Internally, the Python routine still make buffers of all attributes. Then the attributes with only nan are discarded. This is un-wanted memory footprint.

Improvement

In general, we cannot know which attributes are relevant before parse the whole dataset. However, if the condense attribute list is given in the dataset header, and there are no map-like elements in the data, we know for sure that the relevant attributes can only be those attributes specified in the header. In this way, we can ask the Python routine to only create those buffers. This should save memory footprint of the de-serializer.

Implementation

  • C++ core
  • C API
  • Python API
  • C++ unit test
  • C API test
  • Python unit test

TonyXiang8787 and others added 19 commits January 18, 2025 14:42
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
Signed-off-by: Tony Xiang <[email protected]>
@TonyXiang8787 TonyXiang8787 marked this pull request as draft February 13, 2025 13:33
Signed-off-by: Tony Xiang <[email protected]>
@@ -336,6 +339,21 @@ template <dataset_type_tag dataset_type_> class Dataset {
add_component_info_impl(component, elements_per_scenario, total_elements);
}

void enable_atrribute_indications(std::string_view component)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Suggested change
void enable_atrribute_indications(std::string_view component)
void enable_attribute_indications(std::string_view component)

@@ -152,6 +152,14 @@ struct DefaultNullVisitor : msgpack::null_visitor {
}
};

struct NullVisitorCheckMap : DefaultNullVisitor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not really a null-visitor anymore, right?

Suggested change
struct NullVisitorCheckMap : DefaultNullVisitor {
struct CheckHasMap : DefaultNullVisitor {

Comment on lines +87 to +90
* @brief Return if a component has attribute indications.
*
* Attribute indications are used to indicate the presence of meaningful attributes for a certain component in the
* dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's good to also mention the behavior if it has both preamble attribute indications and in-data attributes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants