Copy & embed any string/constant pool entry in the serialized output #1537

eregon · 2023-09-16T14:31:45Z

I think we should copy & embed any string/constant pool entry in the serialized output and therefore not need the source at all when deserializing.
That would mean larger serialized size but no need to read the source at all which could be a nice speedup in the case the serialized form is saved on disk and newer/equal the source file.
I think the serialized size wouldn't be worse than before + eager loading constant pool and strings, it would be about the same, but gaining to not have to read the source.
That might actually be fairly easy, and would be less temporal memory than serialized + source at the same time in memory (but that's not a big deal).

This would also remove the need for accessing owned when serializing the constant pool: #1449 (comment)

Originally posted by @eregon in #1479 (comment)

The text was updated successfully, but these errors were encountered:

kddnewton · 2023-12-06T22:27:28Z

I'm not sure about this. It makes it so that all serialization consumers now get a much bigger payload each time, even if they have the source available. This would make the JS one, for example, much bigger while it always has the source available.

@enebo what do you think?

eregon · 2023-12-06T22:43:15Z

There are two gains:

It would make loading from the serialized AST on disk quite a bit faster, by not needing to read the source (and only check its mtime to know if up to date).
It will be important when JRuby & TruffleRuby do lazy deserialization of method literals, so there is no need to keep the Source around for the whole program duration when it would otherwise not be needed.
I think that's a really valuable optimization to have and we already have some pieces ready in Prism (the offset on DefNode), but it's something for later, in 2024.

Also currently we need to hold onto the Source even after deserializing the entire Prism AST at once, for things like IntegerNode ( 😓 we should probably do something about that). That's rather suboptimal in memory usage, especially with lazy method compilation/translation (!= lazy deserialization, TruffleRuby already uses this with Prism), as then both the full AST and the source are kept for the entire program duration while maybe only a single/few methods of the file are used.

enebo · 2023-12-07T15:04:30Z

There are two levels here to consider: loaded_tree and serialized_blob.

For loaded tree the only source we depend on at this point is Integer, Float, and Rational. They need to be moved into the tree and not loaded from source. We just haven't made a decision on how to do this. Keeping source a round AND the tree is a huge memory cost over the legacy parser. I assume we 100% will be doing this soon. We are so close to not having source here.

For serialization I think there is a pretty big carrot to moving constants into serialization: pre-compilation. It would be nice to eventually have a stable serialized format we can load and not bother to load from source at all (assuming it is trusted). We have plans to replace our AOT serialization format with this since Loader requires no C.

As @eregon said we both do lazy method loading and whatever allows us to use as little memory as possible (also helps for startup time). Ultimately, what we would have for methods would be a slice of the constant pool and a slice of the offsets representing the method body (whether we actually slice this or keep whole serialized blob is up to experimentation). The time saved will be not loading huge portions of the tree until it is used. As some Rails commands I have see up to 80% of all method definitions loaded are not called.

The last bit also mentioned by @eregon but he only mentions memory...it is faster to make this pool in C than have us create it in Java when Java is just starting up and still somewhat cold. If we have to depend on C we may as well have C do as much heavy lifting for us as we can get.

@kddnewton I have said this many times now but there is a pretty large dual concern in prism and we keep trying to compromise to serve both. I think it is still working but there will always be some tradeoff so long as we use a single format. A larger serialized format is less memory for compilation but more memory for syntax consumers.

A question would be how much larger would moving constant pool into serialization actually be? We will swap start+length for pool_offset + bytes in pool but for things like identifiers we will have quite a bit of sharing. So in those sorts of cases we may actually reduce size. Strings will be larger period. So, I do think it will be larger but I am unsure how much larger.

kddnewton · 2023-12-07T15:11:20Z

Thanks @enebo for your thoughts here.

I imagine it's okay to slow down the serialization API as it's only being used for Java and JS at the moment. The only real thing I'm concerned about here in terms of size would be strings. That could be quite a large serialized blob if the files have large strings in them. We would also need the DATA constant to be stored in the file as well, which could end up being large.

If you all think it is worth it, I'm definitely game to try after we ship 3.3.

enebo · 2023-12-07T15:42:11Z

@kddnewton yeah I am not sure what percentage of strings are within files but some files will end up much larger being pushed into serialization (although smaller than having both serialized+source). For me, this is a nice to have thing and something I definitely want but not at the cost of having prism being 100% usable. So deferring this is ok in my opinion.

I would like the three numeric types to have values in the AST though. Carrying around source during build is some code I would love to delete and it will make a significant change in memory size (since we still do lazy method but just with AST+source). Removing source would get us near in memory to what we do with the legacy parser.

kddnewton · 2023-12-07T16:00:11Z

@enebo I added the integer value storage to the next milestone so we'll do that. From reading I think I understand how to do that. However I am very unclear on how to do floats/rationals. I'm not tackling it today so it's not pressing, but if you have some resources I'm going to need them.

enebo · 2023-12-07T16:07:48Z

@kddnewton an interim solution may just be for us to copy the number strings to the constant pool and then entertain the native storage later.

eregon · 2023-12-08T17:03:21Z

We would also need the DATA constant to be stored in the file as well

True, although this would only be for the main file/script. I think Prism doesn't know about this currently, but it's a boolean we could pass in, and we'd only need Prism to serialize DATA in that case.

an interim solution may just be for us to copy the number strings to the constant pool and then entertain the native storage later.

Yeah, this would be an easy way for now. OTOH that might be suboptimal for non-serializing use cases?

I was thinking we might be able to just special-case them in Loader.java to store an extra byte[] for them (and declare that extra field in IntegerNode and FloatNode). Not the cleanest, but if the extra type: constant field for IntegerNode and FloatNode is undesirable then it might be a good trade-off.

enebo · 2023-12-08T17:11:08Z

@eregon yeah I thought about us just adding stuff not in config.yml but it feels messy. If you do it though I won't complain. We just need to revisit this later when the real solutions occurs.

eregon mentioned this issue Sep 16, 2023

[rust] fix accessing ConstantId slices #1449

Merged

eregon added the enhancement New feature or request label Sep 16, 2023

froydnj mentioned this issue Sep 19, 2023

make yp_constant_id_ts useful for constant pool indexing #1554

Closed

kddnewton added suggestion and removed enhancement New feature or request labels Oct 23, 2023

kddnewton added this to the Ruby 3.4.0 milestone Dec 7, 2023

kddnewton added enhancement New feature or request and removed suggestion labels Dec 14, 2023

kddnewton modified the milestones: CRuby unblocked, Long-term Feb 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copy & embed any string/constant pool entry in the serialized output #1537

Copy & embed any string/constant pool entry in the serialized output #1537

eregon commented Sep 16, 2023 •

edited

Loading

kddnewton commented Dec 6, 2023

eregon commented Dec 6, 2023

enebo commented Dec 7, 2023

kddnewton commented Dec 7, 2023

enebo commented Dec 7, 2023

kddnewton commented Dec 7, 2023

enebo commented Dec 7, 2023

eregon commented Dec 8, 2023

enebo commented Dec 8, 2023

Copy & embed any string/constant pool entry in the serialized output #1537

Copy & embed any string/constant pool entry in the serialized output #1537

Comments

eregon commented Sep 16, 2023 • edited Loading

kddnewton commented Dec 6, 2023

eregon commented Dec 6, 2023

enebo commented Dec 7, 2023

kddnewton commented Dec 7, 2023

enebo commented Dec 7, 2023

kddnewton commented Dec 7, 2023

enebo commented Dec 7, 2023

eregon commented Dec 8, 2023

enebo commented Dec 8, 2023

eregon commented Sep 16, 2023 •

edited

Loading