6.18 Handling strings in WASM without burning yourself

Claude Roux edited this page Apr 7, 2023 · 10 revisions

Introduction

Let's be honest, the arrival of ChatGPT has somewhat changed the environment in which we were quietly working. However, we have to come back to reality, and a little reminder will do us all good. (By the way, I'm quite a fan of ChatGPT.)

Computer science as a profession still exists...

For those who don't know, WebAssembly is the new W3C standard that turns our favorite browsers into virtual machines. As it is said in the Bible: "nihil novi sub sole", "there's nothing new under the sole" or something similar.

Frankly, "docker" wasn't enough for you as an infinite source of bugs?

You had to put a VM in the browser.

Well... The idea in itself, back in -7A.G. (2015), was good: we'll extend JS capabilities with code in C++, C# or Rust(re), which we'll compile with LLVM to generate libraries that we'll be able to execute in the browser.

Honestly, the first part, the compilation, is super simple. You install Emscripten and bam!!! You just have to replace gcc with em++ and off you go.

Really it's not more complicated than that... Just take a look at the following Makefile to see for yourself.

Options

Let's have a look at the C++ compilation options that you have to master a little bit.

I have compiled a language of my own called lispe, which is written in C++. I have written a post about it.

 -o lispe.html -O3 -sEXPORT_ALL=1 -sWASM=1 -fexceptions -sINITIAL_MEMORY=47972352 -sSTACK_SIZE=20971520
  • -o lispe.html: in this case, it also generates a test HTML file plus a loading JS file.
  • -O3: the optimization level, which affects the speed and size of the library.
  • -sWASM=1: you have to tell it that the compilation target is WebAssembly.
  • -fexceptions: this is for handling C++ exceptions. It also allows you to export malloc to JS.
  • The rest is mostly memory initialization to handle the library: -sINITIAL_MEMORY and -sSTACK_SIZE are sizes in bytes.

As you can see, to compile C++, it doesn't burn that many neurons...

Just a word about -o lispe.html: if you replace it with -o lispe.wasm, it only compiles the WASM library.

Don't believe in Santa Claus

It's a principle of life.

Especially since it stings a bit.

Because the road from compiling to executing is rarely a clear path in the sunshine on a mild spring afternoon. Usually, that's when the nettles and brambles start to invade the muddy path, with steel spikes in the potholes, and a rain that could cut through wrought iron.

You sigh with relief at having compiled your biniou and then, the first time you blow into it, you discover that there are holes everywhere.

For example, WebAssembly doesn't know what a string is.

Unchained

We say to ourselves, okay, let's take a better look at what a string is in JS. This is the normal, reassuring, mature approach.

JS handles strings encoded in... UTF-16. Was... Warum so viel Hass?

Yes, not UTF-8 or clean Unicode in UTF-32, no... UTF-16.
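To see what this means in practice, here is a minimal sketch in plain JS (nothing WASM-specific yet). The helper name describeUtf16 is made up for the illustration. It lists the UTF-16 code units that JS really stores next to the Unicode code points we would like to have.

```javascript
// Hypothetical helper, for illustration only: lists what JS stores
// (16-bit code units) next to what we would expect (code points).
function describeUtf16(text) {
    const units = [];
    for (let i = 0; i < text.length; i++) {
        // charCodeAt returns one 16-bit code unit at a time
        units.push(text.charCodeAt(i));
    }
    // Iterating a string walks it code point by code point
    const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
    return { length: text.length, units, codePoints };
}

// U+1F600 sits outside the Basic Multilingual Plane:
// length is 2, units are 0xD83D and 0xDE00, the single code point is 0x1F600
console.log(describeUtf16("\u{1F600}"));
```

One visible character, two "characters" for charCodeAt. Keep that in mind for everything that follows.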

Just for fun, here is a little routine I wrote to transform UTF-16 into UTF-32...

I give it to you for free:

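//Returns true when code was a high surrogate: call again with the next
//code unit and second set to true to complete r
bool c_utf16_to_unicode(char32_t& r, char32_t code, bool second) {
    //The previous call found a high surrogate, we add the low 10 bits of the second part
    if (second) {
        r |= code & 0x3FF;
        return false;
    }
    
    //if code is a high surrogate (0xD800 to 0xDBFF), the character is encoded on four bytes
    //The 0xFC00 mask covers the whole range, 0xFF00 would stop at 0xD8FF
    if ((code & 0xFC00) == 0xD800) {
        //You like it, isn't it beautiful?
        r = ((((code & 0x03C0) >> 6) + 1) << 16) | ((code & 0x3F) << 10);
        return true;
    }
    
    //otherwise the code is the same as in UTF-32
    r = code;
    return false;
}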

In fact, I've already forgotten how I ever wrote this code... Sport is a bitch... Don't start...
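Since the bit juggling in c_utf16_to_unicode is easy to forget, here is a hedged JS transcription of the same arithmetic, which can be checked against what the browser computes by itself with codePointAt. The function name utf16ToCodePoints is invented for the example. It tests for a high surrogate with the mask 0xFC00, which covers the full range 0xD800 to 0xDBFF, and it assumes the input is well formed (no lonely surrogate).

```javascript
// Hypothetical helper: decodes a JS string (UTF-16) into an array of
// code points with the same bit arithmetic as c_utf16_to_unicode.
function utf16ToCodePoints(text) {
    const result = [];
    let pending = 0;      // code point being assembled
    let second = false;   // true when the next unit is the low surrogate
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        if (second) {
            // Low surrogate: it carries the 10 low bits
            result.push(pending | (code & 0x3FF));
            second = false;
        } else if ((code & 0xFC00) === 0xD800) {
            // High surrogate: 4 bits (+1 for the 0x10000 offset) and 6 bits
            pending = ((((code & 0x03C0) >> 6) + 1) << 16) | ((code & 0x3F) << 10);
            second = true;
        } else {
            // A single code unit is already the code point
            result.push(code);
        }
    }
    return result;
}

const sample = "a\u{1F600}\u{10FFFF}";
const expected = Array.from(sample, (ch) => ch.codePointAt(0));
// Compares our decoding with the one built into JS
console.log(utf16ToCodePoints(sample).join() === expected.join());
```

The "+ 1" is the whole trick: a surrogate pair encodes the code point minus 0x10000, and adding 1 at bit 16 puts that offset back.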

Table of numbers

We search, we wonder, we despair of ever understanding what is on StackOverflow, and sometimes we discover bits of explanation. (Anyway, StackOverflow is a divine punishment for those who still believe that computer science is learned from the demons of the 9th circle).

For example, to pass a string to WebAssembly, you have to pass it as an integer array.

But beware (see the remark about Santa Claus), the array of numbers must be allocated in the memory space that JS shares with the WASM library.

Ok... What's the result?

I also give it to you for free (these functions are present in lispe_functions.js):

function provideStringAsInt32(code) {
    //We give ourselves some extra space
    const nb = Math.max(20, code.length + 1);

    //first we create our integer array
    const arr = new Int32Array(nb);
    //in which we arrange our string, one UTF-16 code unit at a time
    for (let i = 0; i < code.length; i++) {
        arr[i] = code.charCodeAt(i);
    }
    //The final 0 marks the end of the string
    arr[code.length] = 0;
    //Then we allocate nb*4 bytes in the WASM memory
    //An Int32 is stored on 4 bytes
    const a_buffer = Module._malloc(nb * 4);
    //We store the values in our array
    //Note the division by 4 of a_buffer (>> 2) in order to get the correct index in the HEAP32 array.
    //Once again an Int32 is on 4 bytes.
    Module.HEAP32.set(arr, a_buffer >> 2);
    //The caller gets the address back, and has to free it at some point
    return a_buffer;
}

And some people say that JS is an easy language.
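The ">> 2" is the part that usually stings, so here is a small sketch without any WASM at all. It assumes nothing more than a plain ArrayBuffer standing in for the WASM memory, and the function name demoByteAddressToIndex is invented. The point: malloc hands out byte addresses, while HEAP32 is indexed in 4-byte slots over the very same bytes.

```javascript
// A plain buffer plays the role of the WASM memory (an assumption for
// the demo, the real one belongs to Module).
function demoByteAddressToIndex() {
    const memory = new ArrayBuffer(64);
    const heapU8 = new Uint8Array(memory);   // one slot per byte
    const heap32 = new Int32Array(memory);   // one slot per 4 bytes, same memory

    const byteAddress = 16;             // the kind of value a malloc returns
    const index = byteAddress >> 2;     // 16 / 4 = 4: the matching Int32 slot

    // "AB" followed by its terminating 0, one Int32 per character
    heap32.set([0x41, 0x42, 0], index);

    return {
        slots: heap32.length,                    // 64 bytes / 4 = 16 slots
        bytes: heapU8.length,                    // 64
        first: heap32[byteAddress >> 2],         // 0x41
        second: heap32[(byteAddress + 4) >> 2]   // 0x42, 4 bytes further
    };
}

console.log(demoByteAddressToIndex());
```

Forget the shift and you write four times too far, somewhere in the middle of someone else's data.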

But where it gets even more fun is the opposite operation:

//array is an address in the WASM memory, sz is the number of elements stored there
function arrayToString(array, sz) {
    let str = "";
    //Each element is stored on 4 bytes
    sz *= 4;
    //Same thing, we divide the byte address by 4 (>> 2) to get the index in HEAP32
    //And we walk 4 bytes at a time, turning each element into a character
    for (let pointer = 0; pointer < sz; pointer += 4) {
        str += String.fromCharCode(Module.HEAP32[(pointer + array) >> 2]);
    }
    //And we don't forget to free it
    Module._free(array);
    return str;
}
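To try the round trip without compiling anything, here is a sketch that replaces Module with a fake one. Everything in it is an assumption made for the demonstration: createMockModule, its naive bump allocator and the two helpers writeString and readString are invented names that only mimic the steps of provideStringAsInt32 and arrayToString. The real Module is the one generated by em++.

```javascript
// Hypothetical stand-in for the Emscripten Module: a fixed buffer,
// a bump allocator as _malloc and a _free that does nothing.
function createMockModule(sizeInBytes) {
    const memory = new ArrayBuffer(sizeInBytes);
    let next = 8;   // address 0 is kept free, like a null pointer
    return {
        HEAP32: new Int32Array(memory),
        _malloc(size) {
            const address = next;
            next += (size + 7) & ~7;   // keeps addresses 8-byte aligned
            return address;
        },
        _free(address) { /* nothing to release in the mock */ }
    };
}

// Same steps as provideStringAsInt32, with the module passed explicitly
function writeString(module, text) {
    const nb = Math.max(20, text.length + 1);
    const address = module._malloc(nb * 4);
    const base = address >> 2;   // byte address to Int32 index
    for (let i = 0; i < text.length; i++) {
        module.HEAP32[base + i] = text.charCodeAt(i);
    }
    module.HEAP32[base + text.length] = 0;
    return address;
}

// Same steps as arrayToString: read count elements, then free
function readString(module, address, count) {
    let str = "";
    for (let i = 0; i < count; i++) {
        str += String.fromCharCode(module.HEAP32[(address >> 2) + i]);
    }
    module._free(address);
    return str;
}

const mock = createMockModule(1024);
const text = "(car '(\u00E9 \u{1F600}))";
const address = writeString(mock, text);
const roundTrip = readString(mock, address, text.length);
console.log(roundTrip === text);
```

Note that the character outside the BMP travels as two Int32 values, one per surrogate, and String.fromCharCode glues them back together on the way out.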