Add UTF8 abstraction in the TASTy format #19090

nicolasstucki · 2023-11-27T09:11:13Z

We add a Utf8 encoding to the grammar. This should not be confused with the UTF8 name tag, a mistake that was made in the Comment format. We also add corresponding writeUtf8 and readUtf8 methods to the TastyBuffer.

This is also useful for #18948

nicolasstucki · 2023-11-27T09:17:55Z

compiler/src/dotty/tools/dotc/core/tasty/CommentUnpickler.scala

@@ -20,12 +20,9 @@ class CommentUnpickler(reader: TastyReader) {
    while (!isAtEnd) {
      val addr = readAddr()
      val length = readNat()
-      if (length > 0) {
-        val bytes = readBytes(length)
-        val position = new Span(readLongInt())


This seems like a bug. Even if the comment is empty we should read this long int; otherwise the next comment will start by reading this value instead of the length of the comment.

I assume it was fine because we never pickled empty documentation. I wonder what should be the behaviour of /***/. In that case we still have some coordinates we should pickle.

All pickled comments contain the /** and */ in the comments section. Therefore they can never be empty. I wonder if we could optimize that away at some point.

I agree this long int should always be read
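To make the desynchronization concrete, here is a minimal sketch. It does not use the real TASTy encoding: `CommentSectionSketch`, `Entry` and the fixed-width `Int`/`Long` fields are hypothetical stand-ins for `readAddr`, `readNat` and `readLongInt`, chosen only to keep the example self-contained. The flag `skipPositionWhenEmpty` reproduces the removed `if (length > 0)` guard.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical, simplified comment section: each entry is
//   addr (Int) | length (Int) | UTF-8 bytes | position (Long)
object CommentSectionSketch {
  final case class Entry(addr: Int, text: String, position: Long)

  def write(entries: Seq[Entry]): Array[Byte] = {
    val encoded = entries.map(e => (e, e.text.getBytes(UTF_8)))
    // 4 + 4 + 8 bytes of fixed-width fields per entry, plus the text bytes
    val buf = ByteBuffer.allocate(encoded.map { case (_, bs) => 16 + bs.length }.sum)
    for ((e, bs) <- encoded) {
      buf.putInt(e.addr).putInt(bs.length).put(bs).putLong(e.position)
    }
    buf.array()
  }

  def read(data: Array[Byte], skipPositionWhenEmpty: Boolean): List[Entry] = {
    val buf = ByteBuffer.wrap(data)
    val result = List.newBuilder[Entry]
    while (buf.remaining() >= 8) {
      val addr = buf.getInt()
      val length = buf.getInt()
      if (length > 0 || !skipPositionWhenEmpty) {
        val bytes = new Array[Byte](length)
        buf.get(bytes)
        // The position is always consumed on this path, even when length == 0.
        result += Entry(addr, new String(bytes, UTF_8), buf.getLong())
      }
      // Otherwise (old guard): the 8 position bytes stay unread, so the next
      // iteration interprets them as the following entry's addr and length.
    }
    result.result()
  }
}
```

With an empty entry followed by a non-empty one, the guarded reader either throws or returns entries that differ from what was written, while the unguarded reader round-trips.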


bishabosha · 2023-11-27T14:53:46Z

compiler/src/dotty/tools/dotc/core/tasty/TastyPickler.scala

@@ -48,13 +42,12 @@ class TastyPickler(val rootCls: ClassSymbol) {
    val uuidHi: Long = otherSectionHashes.fold(0L)(_ ^ _)

    val headerBuffer = {
-      val buf = new TastyBuffer(header.length + TastyPickler.versionStringBytes.length + 32)
+      val buf = new TastyBuffer(header.length + TastyPickler.versionString.length + 32)


this seems wrong - string length != utf-8 bytes length, e.g.

    scala> val sc = "Scala 3.3.1➽"
    val sc: String = Scala 3.3.1➽

    scala> sc.length
    val res0: Int = 12

    scala> val scBytes = sc.getBytes(java.nio.charset.StandardCharsets.UTF_8)
    val scBytes: Array[Byte] = Array(83, 99, 97, 108, 97, 32, 51, 46, 51, 46, 49, -30, -98, -67)

    scala> scBytes.length
    val res1: Int = 14

The + 32 covers a bit more than it needs. At least for the current version we do not have to relocate the buffer.

We also do not have an exact formula to know how much space the Nats will take.

I guess in practice we shouldn't have these non-ascii strings but :/

That was my assumption.
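The point that the initial size is only a capacity hint can be illustrated with a growable buffer from the JDK. This is a sketch under assumptions: `java.io.ByteArrayOutputStream` stands in for `TastyBuffer` here, and `CapacityHintSketch` and `encodeWithHint` are hypothetical names.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object CapacityHintSketch {
  /** Writes the UTF-8 bytes of `s` into a growable buffer created with
   *  `initialCapacity`. An undersized hint costs a reallocation but does
   *  not change the bytes that are produced.
   */
  def encodeWithHint(s: String, initialCapacity: Int): Array[Byte] = {
    val out = new ByteArrayOutputStream(initialCapacity)
    val bytes = s.getBytes(UTF_8)
    out.write(bytes, 0, bytes.length)
    out.toByteArray
  }
}
```

Sizing the buffer from `String.length` can therefore underestimate for non-ASCII input without corrupting the output, as long as the buffer is able to grow.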

bishabosha · 2023-11-27T14:57:29Z

tasty/src/dotty/tools/tasty/TastyBuffer.scala

+  /** Write a UTF8 string encoded as `Nat UTF8-CodePoint*`,
+   *  where the `Nat` is the length of the code-points bytes.
+   */
+  def writeUtf8(x: String): Unit = {


maybe you can have an overload for Array[Byte] (IArray?) where you assume the bytes are already utf-8 encoded (so you can cache the bytes of Tool Version string)
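A sketch of what such a pair of overloads could look like, assuming a length-prefixed layout as described in the doc comment above. This is not the code from the PR: `Utf8Sketch` and the `ArrayBuffer`-based signatures are hypothetical, and `writeNat` is a local helper modelled on the TASTy `Nat` (big-endian base-128 digits, with the high bit set on the last digit).

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable.ArrayBuffer

object Utf8Sketch {
  /** Appends `x` as big-endian 7-bit digits; the last digit has its high bit set. */
  def writeNat(out: ArrayBuffer[Byte], x: Int): Unit = {
    require(x >= 0, "Nat must be non-negative")
    def writePrefix(y: Int): Unit = if (y != 0) {
      writePrefix(y >>> 7)
      out += (y & 0x7f).toByte
    }
    writePrefix(x >>> 7)
    out += ((x & 0x7f) | 0x80).toByte
  }

  /** Overload for bytes that the caller guarantees are already UTF-8 encoded,
   *  so a constant string can be encoded once and its bytes cached.
   */
  def writeUtf8(out: ArrayBuffer[Byte], bytes: Array[Byte]): Unit = {
    writeNat(out, bytes.length) // the prefix counts bytes, not chars
    out ++= bytes
  }

  def writeUtf8(out: ArrayBuffer[Byte], x: String): Unit =
    writeUtf8(out, x.getBytes(UTF_8))

  /** Reads a length-prefixed string starting at `start`.
   *  Returns the string and the index just past it.
   */
  def readUtf8(in: Array[Byte], start: Int): (String, Int) = {
    var i = start
    var length = 0
    var done = false
    while (!done) {
      val b = in(i) & 0xff
      i += 1
      length = (length << 7) | (b & 0x7f)
      done = (b & 0x80) != 0
    }
    (new String(in, i, length, UTF_8), i + length)
  }
}
```

The `String` overload delegates to the `Array[Byte]` one, so both paths write the same prefix and the byte length is computed in exactly one place.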


nicolasstucki force-pushed the tasty-format-utf8 branch from b55f95d to b6cbad1 Compare November 27, 2023 09:23

Add UTF8 abstraction in the TASTy format

486af2f

We add a `Utf8` encoding to the grammar. This should not be confused with the `UTF8` name tag, a mistake that was made in the `Comment` format. We also add corresponding `writeUtf8` and `readUtf8` methods to the `TastyBuffer`.

nicolasstucki force-pushed the tasty-format-utf8 branch from b6cbad1 to 486af2f Compare November 27, 2023 09:25

nicolasstucki requested a review from bishabosha November 27, 2023 09:29

nicolasstucki assigned bishabosha Nov 27, 2023

nicolasstucki marked this pull request as ready for review November 27, 2023 11:04

bishabosha reviewed Nov 27, 2023


nicolasstucki requested a review from bishabosha November 27, 2023 15:39

bishabosha approved these changes Nov 27, 2023


bishabosha merged commit 78c3721 into scala:main Nov 27, 2023

bishabosha deleted the tasty-format-utf8 branch November 27, 2023 16:05

Kordyjan added this to the 3.4.0 milestone Dec 20, 2023

WojciechMazur mentioned this pull request Jun 23, 2024

Backport "Add UTF8 abstraction in the TASTy format" to LTS #20766

Closed
