Allow to truncate decorator data when serializing MAST #1580

Open
yasonk opened this issue Nov 19, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@yasonk
Contributor

yasonk commented Nov 19, 2024

Feature description

For the use cases described in this comment, we need the ability to truncate decorator data when serializing MAST.

This comment describes one approach for implementing this feature.

We should probably introduce a CLI flag for the compile command that writes the compiled program with the optional data truncated.

Why is this feature needed?

Per the comment on Pull Request #1531 :
... general idea is that producers, e.g. the compiler, will emit a fully-fleshed out package containing all the things that could be useful for interacting with that package (e.g. debug info, documentation, sources in some cases, etc.), and the MAST contained in the package would have all of the decorators (naturally required to support debugging, etc.). There are contexts where all of that stuff is simply dead weight though, or you want to optimize for binary size, in which case you can strip all of that, resulting in a minimal package that only contains exactly what is needed to execute what it contains. One such potential use case, would be publishing packages on-chain, which would only be feasible if they are as small as possible.

@yasonk yasonk added the enhancement New feature or request label Nov 19, 2024
@yasonk
Contributor Author

yasonk commented Dec 2, 2024

With the advice map, calculating the decorator offset becomes a little more complicated. Potentially a SizeCalculator is warranted. You would first use SizeCalculator to dry-run all writing operations that you want to calculate, then write the decorator data offset, then write the data:

struct SizeCalculator {
    total_size: usize,
}

impl SizeCalculator {
    fn new() -> Self {
        Self { total_size: 0 }
    }

    fn size(&self) -> usize {
        self.total_size
    }
}

impl ByteWriter for SizeCalculator {
    fn write_u8(&mut self, _value: u8) {
        // Increment size for a single byte
        self.total_size += 1;
    }

    fn write_bytes(&mut self, bytes: &[u8]) {
        // Increment size by the length of the byte slice
        self.total_size += bytes.len();
    }

    // All other ByteWriter methods can be left to the trait's default
    // implementations, because they eventually call write_bytes or write_u8.
}

I can see some brittleness in this approach if the concrete ByteWriter implementation used for writing the program changes without a matching change to SizeCalculator. But I'm not sure how to couple them gracefully.

Using size_hints seems even more brittle, because any change in structure or implementation may then require modifying size_hint, which seems like a bigger mental burden than keeping up with ByteWriter changes.

I'm also not sure what the performance implications would be with a large number of nodes. I would expect the dry run to be significantly faster because it does not actually write to memory, but the CPU still has to go through all the nodes and the relevant serialization logic.
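
To make the dry-run flow above concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: the ByteWriter trait is a reduced local stand-in (not the real trait), and write_body, serialize, and the [offset][body][decorators] layout are hypothetical names and formats, not the actual MAST serialization.

```rust
// Hypothetical, reduced stand-in for the real ByteWriter trait so that this
// sketch compiles on its own.
trait ByteWriter {
    fn write_u8(&mut self, value: u8);
    fn write_bytes(&mut self, bytes: &[u8]);

    // Provided method built on write_bytes (fixed-width little-endian here).
    fn write_u32(&mut self, value: u32) {
        self.write_bytes(&value.to_le_bytes());
    }
}

// Counts bytes instead of storing them.
#[derive(Default)]
struct SizeCalculator {
    total_size: usize,
}

impl ByteWriter for SizeCalculator {
    fn write_u8(&mut self, _value: u8) {
        self.total_size += 1;
    }
    fn write_bytes(&mut self, bytes: &[u8]) {
        self.total_size += bytes.len();
    }
}

// The "real" writer used for the final output.
impl ByteWriter for Vec<u8> {
    fn write_u8(&mut self, value: u8) {
        self.push(value);
    }
    fn write_bytes(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes);
    }
}

// Hypothetical non-decorator payload: a node count followed by raw node bytes.
fn write_body<W: ByteWriter>(target: &mut W, nodes: &[u8]) {
    target.write_u32(nodes.len() as u32);
    target.write_bytes(nodes);
}

// Hypothetical layout: [decorator offset: u32][body][decorators].
fn serialize(nodes: &[u8], decorators: &[u8]) -> Vec<u8> {
    // Pass 1: dry run to learn how many bytes the body takes.
    let mut calc = SizeCalculator::default();
    write_body(&mut calc, nodes);
    // The offset field itself occupies the first 4 bytes.
    let decorator_offset = 4 + calc.total_size as u32;

    // Pass 2: write the offset, then the data.
    let mut out = Vec::new();
    out.write_u32(decorator_offset);
    write_body(&mut out, nodes);
    out.write_bytes(decorators);
    out
}
```

Because write_body is generic over the writer, both passes run the same serialization code, which reduces (but does not remove) the coupling concern mentioned above.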

@bobbinth
Contributor

bobbinth commented Dec 3, 2024

Do you mean the advice map referred to here? And is the problem that we don't know the size of the advice map before serializing it?

If so, we have two potential solutions (maybe more):

  • Add a method to the AdviceMap to get the number of bytes it'll take to serialize it. For example, something like AdviceMap::serialized_byte_len().
  • Serialize the map into a separate vector as one of the first steps. Get the length of the vector and use it to compute the required offset, and then, later on just write this vector to the target writer (this is similar to the dry-run approach you've suggested, but is limited to the advice map only).
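
The second option could look roughly like the sketch below. It is only an illustration under stated assumptions: AdviceMap here is a BTreeMap stand-in rather than the real type, and serialize_advice_map, write_forest, and the byte layout are hypothetical.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the advice map: keys mapped to byte values.
type AdviceMap = BTreeMap<u32, Vec<u8>>;

// Hypothetical layout: entry count, then (key, value length, value bytes)
// for each entry.
fn serialize_advice_map(map: &AdviceMap) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(map.len() as u32).to_le_bytes());
    for (key, value) in map {
        buf.extend_from_slice(&key.to_le_bytes());
        buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        buf.extend_from_slice(value);
    }
    buf
}

// Hypothetical layout: [decorator offset: u32][nodes][advice map][decorators].
fn write_forest(nodes: &[u8], map: &AdviceMap, decorators: &[u8]) -> Vec<u8> {
    // Serialize the map first so that its exact length is known.
    let map_bytes = serialize_advice_map(map);
    // 4 bytes for the offset field itself, then nodes, then the map.
    let decorator_offset = (4 + nodes.len() + map_bytes.len()) as u32;

    let mut out = Vec::with_capacity(decorator_offset as usize + decorators.len());
    out.extend_from_slice(&decorator_offset.to_le_bytes());
    out.extend_from_slice(nodes);
    // Reuse the already-serialized bytes instead of serializing again.
    out.extend_from_slice(&map_bytes);
    out.extend_from_slice(decorators);
    out
}
```

The map is serialized exactly once; the cost is the temporary buffer, which stays small as long as the map itself is small.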

@yasonk
Contributor Author

yasonk commented Dec 3, 2024

Yes, the original idea was to calculate the decorator data offset. But actually, I'm not sure this is needed. If the CLI gets a flag to truncate decorators, it could create a MAST forest without decorators and write that to the binary. There would be no need for a decorator data offset, because no decorators are written.

The only reason to actually store the "decorator data section offset" is if you wanted to take an existing binary and truncate it without first reading it. And I'm not sure that the "without first reading it" part is possible.

I'm seeing these two potential workflows:

  1. Write truncated binary. (decorator section offset doesn't seem to be needed in this case)
  2. Write full binary and then truncate it. (not sure when this would be better than the above)

If we do need to store the offset, here are the trade-offs:

  • SizeCalculator: do a dry run to calculate the size, then serialize into the binary. This requires more CPU cycles (the serialization logic runs twice) but uses little heap space.
  • Intermediate vector: serialize into a vector, take its length, then write the vector into the binary. This uses more heap space but less CPU.

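For anything that uses usize, it would be a good idea to use one of the methods above, because the serialized size of a usize is not fixed: write_usize uses a variable-length encoding, so a value can take anywhere from 1 to 9 bytes. (see ByteWriter)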
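
As a rough illustration of why a usize length cannot simply be assumed, the sketch below computes the encoded length under an assumed vint64-style scheme: 7 payload bits per byte for values that fit in 56 bits, and 9 bytes for anything larger. The function name is hypothetical, and the actual ByteWriter implementation should be checked before relying on these numbers.

```rust
// Hypothetical helper: number of bytes a value would occupy under an assumed
// vint64-style variable-length encoding (1 to 9 bytes).
fn vint64_encoded_len(value: u64) -> usize {
    let zeros = value.leading_zeros() as usize;
    // Every 7 leading zero bits (beyond the first) save one byte.
    let saved = zeros.saturating_sub(1) / 7;
    9 - saved.min(8)
}
```

Under this assumption, a length of 127 fits in one byte while 128 already needs two, so an offset computed from a fixed per-field size would drift as soon as a collection grows past such a boundary.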

How big can the AdviceMap get?

@bobbinth
Contributor

bobbinth commented Dec 9, 2024

How big can the AdviceMap get?

Shouldn't be too big for the use cases we are thinking about - probably less than a few dozen KB (and in most cases probably much less than that).

The only reason to actually store the "decorator data section offset" is if you wanted to take an existing binary and truncate it without first reading it. And I'm not sure that the "without first reading it" part is possible.

One possibility is to read the first few bytes of a MAST forest and then, based on that, read only the required data (e.g., skip the decorator portion). But I'm not sure where we'd do this in practice.
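
A minimal sketch of such a partial read, assuming a purely hypothetical layout in which the first 4 bytes hold a little-endian offset to the decorator section; read_without_decorators is an illustrative name, not an existing API:

```rust
use std::io::{self, Read};

// Hypothetical layout: [decorator offset: u32][body][decorators].
// Reads the header and the body, and never touches the decorator bytes.
fn read_without_decorators<R: Read>(mut source: R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    source.read_exact(&mut header)?;
    let decorator_offset = u32::from_le_bytes(header) as usize;

    // The body sits between the 4-byte header and the decorator section.
    let body_len = decorator_offset.checked_sub(4).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "offset points inside the header")
    })?;

    let mut body = vec![0u8; body_len];
    source.read_exact(&mut body)?;
    Ok(body)
}
```

A real reader would also want to bound body_len before allocating, since the offset comes from untrusted input.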

If we do need to store the offset, here are some trade offs: SizeCalculator to calculate size, then serialize into binary. Trade offs: Require more CPU cycles, but uses little heap space.

Serializing into vector, calculate size, then serialize into binary. Trade offs: uses more heap space, but uses less CPU.

Between these two options, I think serializing into a vector, calculating the size, and then writing the vector into the binary probably makes the most sense.


2 participants