Improve algorithm to count digits in Long #413

Open · wants to merge 1 commit into develop
Conversation

@Egorand commented Nov 14, 2024

Copies the PR merged into Okio: square/okio#1548.

The algorithm is based on "Down Another Rabbit Hole" by Romain Guy.

TL;DR: this algorithm improves the performance of counting the digits in a Long by 40%, based on Romain's benchmarks.

@@ -135,6 +107,34 @@ public fun Sink.writeDecimalLong(long: Long) {
}
}

private fun countDigitsIn(v: Long): Int {
    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
    return guess + (if (v > DigitCountToLargestValue[guess]) 1 else 0)
}
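To make the guess formula concrete: 64 - v.countLeadingZeroBits() is the bit length of v, and multiplying by 10 and shifting right by 5 scales it by 10/32 = 0.3125, a cheap integer approximation of log10(2) ≈ 0.30103. For v = 1000 the bit length is 10, so guess = (10 * 10) ushr 5 = 3; since 1000 exceeds the largest 3-digit value stored in the table, the correction term bumps the result to the correct 4 digits.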
@fzhinkin (Collaborator) commented Nov 14, 2024

IIRC from when I read Romain's blogpost: by extending DigitCountToLargestValue's length to the next power of two (32 in this case) and replacing DigitCountToLargestValue[guess] with DigitCountToLargestValue[guess.and(0x1f)], you can gain a few extra percent of performance on the JVM, as it should then optimize out the bounds checks performed on array access.

@Egorand (Author) replied:

DigitCountToLargestValue is actually slightly different from the table used in the blogpost:

private val PowersOfTen = longArrayOf(
    0,
    10,
    100,
    1000,
    10000,
    100000,
    1000000,
    10000000,
    100000000,
    1000000000,
    10000000000,
    100000000000,
    1000000000000,
    10000000000000,
    100000000000000,
    1000000000000000,
    10000000000000000,
    100000000000000000,
    1000000000000000000
)

The main reason is that the original table doesn't work when the input is Long.MAX_VALUE: it's bigger than 10^18 (the last value in the array), but 10^19 is outside the Long range.

I wonder if the one in the PR performs better? Worth benchmarking them against each other.
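The table itself isn't quoted in this thread, but given how countDigitsIn indexes it, a sketch consistent with the code above would store, at index g, the largest value with g digits, with Long.MAX_VALUE capping the last slot (the name matches the diff; the exact contents are an assumption):

private val DigitCountToLargestValue = longArrayOf(
    0, // serves inputs 1..7, whose guess truncates to 0
    9,
    99,
    999,
    9999,
    99999,
    999999,
    9999999,
    99999999,
    999999999,
    9999999999,
    99999999999,
    999999999999,
    9999999999999,
    99999999999999,
    999999999999999,
    9999999999999999,
    99999999999999999,
    999999999999999999, // largest 18-digit value
    Long.MAX_VALUE, // index 19: 19-digit inputs never trigger the +1 correction
)

With this layout Long.MAX_VALUE works out naturally: its bit length is 63, so guess = (63 * 10) ushr 5 = 19, and the comparison against index 19 can never overflow.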

@fzhinkin (Collaborator) replied:

What I meant is that loads from the DigitCountToLargestValue table are compiled into code that checks whether an index is within the array's bounds before performing the load.
However, if the compiler can prove that indices are always in bounds, it will abstain from generating the check.
By expanding the table to a power-of-two length (and filling the meaningless cells with, say, -1) and then explicitly truncating the index's most significant bits (i.e., taking the index modulo the table's length), we can hint to the compiler that the value is always in bounds, and it will generate faster code: https://gist.github.com/fzhinkin/42997a2cfc18a437f88e9c31bef969c9
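A minimal sketch of that variant, building on the table sketched earlier (the padded array and the function name are illustrative, not from the gist):

private val PaddedDigitCountToLargestValue = LongArray(32) { i ->
    // Pad to the next power of two; for positive inputs guess is at most 19,
    // so the filler value in cells 20..31 is never compared against.
    if (i < DigitCountToLargestValue.size) DigitCountToLargestValue[i] else -1L
}

private fun countDigitsInMasked(v: Long): Int {
    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
    // guess and 0x1f is provably within 0..31, matching the array length,
    // so the JIT can elide the bounds check on the load.
    return guess + (if (v > PaddedDigitCountToLargestValue[guess and 0x1f]) 1 else 0)
}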


BTW, I checked, and on Android the power-of-two array + truncation doesn't remove the bounds check; it just adds an extra instruction. See https://godbolt.org/z/jdTzMcxbf

@fzhinkin (Collaborator) commented:
@Egorand thanks for opening the PR!

@fzhinkin (Collaborator) commented:

We have a benchmark of writeDecimalLong performance (this one), but it writes the same value over and over again, so the old implementation might have an advantage: a constant input makes its branches perfectly predictable.

So I drafted a benchmark that writes a pack of different values:

@State(Scope.Benchmark)
open class DecimalLongWriteOnlyBenchmark : BufferRWBenchmarkBase() {
    val rng = Random(42)
    val limits = longArrayOf(
        0L,
        10L,
        100L,
        1000L,
        10000L,
        100000L,
        1000000L,
        10000000L,
        100000000L,
        1000000000L,
        10000000000L,
        100000000000L,
        1000000000000L,
        10000000000000L,
        100000000000000L,
        1000000000000000L,
        10000000000000000L,
        100000000000000000L,
        1000000000000000000L,
        Long.MAX_VALUE
    )

    // TODO: It might be better to have values following Zipf-distribution
    val values = (1 ..< limits.size).asSequence()
        .flatMap {
            val lb = limits[it - 1]
            val up = limits[it]

            generateSequence { rng.nextLong(lb, up) }.take(10)
        }
        .toList()
        .shuffled(rng)
        .toLongArray()

    override fun padding(): ByteArray {
        return with(Buffer()) {
            for (value in values) {
                writeDecimalLong(value)
                writeByte(' '.code.toByte())
            }
            readByteArray()
        }
    }

    @Benchmark
    fun benchmark() {
        val sz = buffer.size
        for (value in values) {
            buffer.writeDecimalLong(value)
            buffer.writeByte(' '.code.toByte())
        }
        buffer.skip(buffer.size - sz)
    }
}

For some reason, code using the old implementation (from develop) outperforms code using the new one (from this PR); results were collected on a MacBook with an Apple Silicon M3 CPU, JDK 17.0.12:

# results for the benchmark built from develop branch
Benchmark                                (minGap)   Mode  Cnt       Score      Error  Units
DecimalLongWriteOnlyBenchmark.benchmark       128  thrpt   15  387634.472 ± 2489.095  ops/s
# results for the benchmark built from this PR:
Benchmark                                (minGap)   Mode  Cnt       Score     Error  Units
DecimalLongWriteOnlyBenchmark.benchmark       128  thrpt   15  362477.693 ± 869.341  ops/s

It's worth checking what's causing the regression.

@Egorand (Author) commented Nov 15, 2024

> For some reason, code using the old implementation (from develop) outperforms code using the new one (from this PR)

That's interesting! @romainguy, I wonder if you could share your benchmarks for comparison, and whether you have any thoughts on what could be causing these results.

I'll find some time to dig deeper and investigate!

@romainguy commented:
I don't have the original benchmark, but it wasn't run on the JVM; it was run on Android, so a different runtime and hardware. However, I used a dataset with a Zipf distribution to be somewhat realistic and to avoid favoring well-predicted branches.

@fzhinkin's trick is something I've used in the past (it works great in C++, though for other reasons), and it's definitely worth a try.
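For what it's worth, here is a minimal sketch of generating such a dataset for the benchmark above (the function name, the exponent s = 1.0, and the sampling scheme are all assumptions, not Romain's actual setup):

import kotlin.math.pow
import kotlin.random.Random

// Draws `count` positive Longs whose digit counts follow a Zipf-like
// distribution: digit count k is picked with weight 1 / k^s, so short
// numbers dominate; a value with that many digits is then drawn uniformly.
fun zipfDistributedLongs(rng: Random, count: Int, s: Double = 1.0): LongArray {
    val maxDigits = 19
    val weights = DoubleArray(maxDigits) { 1.0 / (it + 1).toDouble().pow(s) }
    val total = weights.sum()
    // Cumulative distribution over digit counts 1..19.
    val cdf = DoubleArray(maxDigits)
    var acc = 0.0
    for (i in weights.indices) {
        acc += weights[i] / total
        cdf[i] = acc
    }
    fun pow10(k: Int): Long {
        var r = 1L
        repeat(k) { r *= 10 }
        return r
    }
    return LongArray(count) {
        val u = rng.nextDouble()
        var digits = 1
        while (digits < maxDigits && u > cdf[digits - 1]) digits++
        val from = if (digits == 1) 1L else pow10(digits - 1)
        val until = if (digits == maxDigits) Long.MAX_VALUE else pow10(digits)
        // nextLong's upper bound is exclusive, so Long.MAX_VALUE itself is skipped.
        rng.nextLong(from, until)
    }
}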
