Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF8 abstraction in the TASTy format #19090

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

nicolasstucki
Copy link
Contributor

We add a Utf8 encoding to the grammar. This should not to be confused with the UTF8 name tag. This mistake was made in the Comment format. We also add corresponding writeUtf8 and readUtf8 methods to the TastyBuffer.

This is also useful for #18948

@@ -20,12 +20,9 @@ class CommentUnpickler(reader: TastyReader) {
while (!isAtEnd) {
val addr = readAddr()
val length = readNat()
if (length > 0) {
val bytes = readBytes(length)
val position = new Span(readLongInt())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug. If the comment is empty we should read this long int, otherwise the next comment will start by reading this value instead of the the length of the comment.

I assume it was fine because we never pickled empty documentation. I wonder what should be the behaviour of /***/. In that case we still have some coordinates we should pickle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All pickled comments contain the /** and */ in the comments section. Therefore they can never be empty. I wonder if we could optimize that away at some point.

Copy link
Member

@bishabosha bishabosha Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this long int should always be read

We add a `Utf8` encoding to the grammar. This should not to be confused
with the `UTF8` name tag. This mistake was made in the `Comment` format.
We also add corresponding `writeUtf8` and `readUtf8` methods to the
`TastyBuffer`.
@@ -48,13 +42,12 @@ class TastyPickler(val rootCls: ClassSymbol) {
val uuidHi: Long = otherSectionHashes.fold(0L)(_ ^ _)

val headerBuffer = {
val buf = new TastyBuffer(header.length + TastyPickler.versionStringBytes.length + 32)
val buf = new TastyBuffer(header.length + TastyPickler.versionString.length + 32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems wrong - string length != utf-8 bytes length, e.g.

scala> val sc = "Scala 3.3.1➽"
val sc: String = Scala 3.3.1➽
                                                                                                                                                                                  
scala> sc.length
val res0: Int = 12
                                                                                                                                                                                  
scala> val scBytes = sc.getBytes(java.nio.charset.StandardCharsets.UTF_8)
val scBytes: Array[Byte] = Array(83, 99, 97, 108, 97, 32, 51, 46, 51, 46, 49, -30, -98, -67)
                                                                                                                                                                                  
scala> scBytes.length
val res1: Int = 14

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The + 32 covers a bit more than it needs. At least for the current version we do not have to relocate the buffer.

We also do not have an exact formula to know how much space the Nats will take.

Copy link
Member

@bishabosha bishabosha Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in practice we shouldn't have these non-ascii strings but :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my assumption.

/** Write a UTF8 string encoded as `Nat UTF8-CodePoint*`,
* where the `Nat` is the length of the code-points bytes.
*/
def writeUtf8(x: String): Unit = {
Copy link
Member

@bishabosha bishabosha Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can have an overload for Array[Byte] (IArray?) where you assume the bytes are already utf-8 encoded (so you can cache the bytes of Tool Version string)

@bishabosha bishabosha merged commit 78c3721 into scala:main Nov 27, 2023
19 checks passed
@bishabosha bishabosha deleted the tasty-format-utf8 branch November 27, 2023 16:05
@Kordyjan Kordyjan added this to the 3.4.0 milestone Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants