-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UTF8 abstraction in the TASTy format #19090
Conversation
@@ -20,12 +20,9 @@ class CommentUnpickler(reader: TastyReader) { | |||
while (!isAtEnd) { | |||
val addr = readAddr() | |||
val length = readNat() | |||
if (length > 0) { | |||
val bytes = readBytes(length) | |||
val position = new Span(readLongInt()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a bug. If the comment is empty we should read this long int, otherwise the next comment will start by reading this value instead of the the length of the comment.
I assume it was fine because we never pickled empty documentation. I wonder what should be the behaviour of /***/
. In that case we still have some coordinates we should pickle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All pickled comments contain the /**
and */
in the comments section. Therefore they can never be empty. I wonder if we could optimize that away at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this long int should always be read
b55f95d
to
b6cbad1
Compare
We add a `Utf8` encoding to the grammar. This should not to be confused with the `UTF8` name tag. This mistake was made in the `Comment` format. We also add corresponding `writeUtf8` and `readUtf8` methods to the `TastyBuffer`.
b6cbad1
to
486af2f
Compare
@@ -48,13 +42,12 @@ class TastyPickler(val rootCls: ClassSymbol) { | |||
val uuidHi: Long = otherSectionHashes.fold(0L)(_ ^ _) | |||
|
|||
val headerBuffer = { | |||
val buf = new TastyBuffer(header.length + TastyPickler.versionStringBytes.length + 32) | |||
val buf = new TastyBuffer(header.length + TastyPickler.versionString.length + 32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems wrong - string length != utf-8 bytes length, e.g.
scala> val sc = "Scala 3.3.1➽"
val sc: String = Scala 3.3.1➽
scala> sc.length
val res0: Int = 12
scala> val scBytes = sc.getBytes(java.nio.charset.StandardCharsets.UTF_8)
val scBytes: Array[Byte] = Array(83, 99, 97, 108, 97, 32, 51, 46, 51, 46, 49, -30, -98, -67)
scala> scBytes.length
val res1: Int = 14
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The + 32
covers a bit more than it needs. At least for the current version we do not have to relocate the buffer.
We also do not have an exact formula to know how much space the Nat
s will take.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in practice we shouldn't have these non-ascii strings but :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my assumption.
/** Write a UTF8 string encoded as `Nat UTF8-CodePoint*`, | ||
* where the `Nat` is the length of the code-points bytes. | ||
*/ | ||
def writeUtf8(x: String): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can have an overload for Array[Byte]
(IArray
?) where you assume the bytes are already utf-8 encoded (so you can cache the bytes of Tool Version string)
We add a
Utf8
encoding to the grammar. This should not to be confused with theUTF8
name tag. This mistake was made in theComment
format. We also add correspondingwriteUtf8
andreadUtf8
methods to theTastyBuffer
.This is also useful for #18948