Decoding from uint16_t[] #15
I'm a bit unclear on what exactly you want to do, and with which data type. Do you want to base64-decode an array of `uint16_t` directly, like this?

    // `src` is an array of uint16_t.
    // `srclen` is the length of `src` counted in 16-bit characters.
    int ret = base64_decode((char *)src, srclen * 2, out, &outlen, 0);

Here we just cast the pointer and double the length, since each 16-bit character takes up two bytes. I have a feeling that's not what you mean though. Is it something more like this? V8 is giving you a base64-encoded string where each 16-bit character only has the lower byte set. Your plain ASCII strings (including your encoded base64 strings) are stored as 16-bit characters whose high byte is always zero.

So in order to do the base64 decoding on plain, unpadded 8-bit input, you would need to preprocess the input to squeeze out every second byte. Conversely, to encode a plain 8-bit bytestream, you would have to "upsample" the output from 8-bit to this 16-bit representation by adding padding bytes. Is that what you mean?
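As a rough sketch of what that decode-side preprocessing could look like (plain scalar code, assuming little-endian `uint16_t` input; the function name and error convention are illustrative, not part of the library):

```c
#include <stddef.h>
#include <stdint.h>

// Narrow 16-bit characters to 8-bit bytes, rejecting anything that is
// not plain ASCII. Returns 0 on success, -1 if any high byte is set.
static int narrow_chars16(const uint16_t *src, size_t srclen, uint8_t *dst)
{
    for (size_t i = 0; i < srclen; i++) {
        if (src[i] > 0xFF)
            return -1;
        dst[i] = (uint8_t) src[i];
    }
    return 0;
}
```

The narrowed buffer can then be handed to `base64_decode()`, with the character count used as the byte length. Since the output is always shorter than the input, `dst` may even point at the start of the same buffer.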
@aklomp FYI V8 uses UTF-16 for representing strings and abuses `uint16_t` as its raw UTF-16 code unit type. So while the high byte is always zero for valid base64 input, it isn't sufficient to "squeeze out every second byte", because invalid input (i.e. non-base64 characters) might be silently converted to valid input, which might be dangerous in certain contexts.
Yes, what @BurningEnlightenment said. The example of simply casting to a char array won't work because of the leading zero bytes causing early termination. Now that I've thought about it some more, it makes more sense to keep the length the same and just add a check on the upper byte, breaking early if it's non-zero.
So the proposed solution would be to duplicate the API and the underlying logic to add first-class support for these wide chars (as I'll call them). I'm reluctant to do this, because of separation of concerns. In my view, this library does one simple thing, but does it in the best possible way, and that is encoding and decoding of flat base64 strings. I see any enlargement of that scope as being principally undesirable. So I'm reserved about making the big changes that would be necessary to support wide chars in the streaming API. I'd need to duplicate the API, add flags into all the codecs, and probably duplicate code. I'd really rather not, though it might be interesting to set up a development branch to test how feasible all this is.

The good news is that for the case where you don't need streaming conversions (i.e. you have a fixed buffer of known length), you can get along with implementing simple input and output wrappers around the non-streaming API functions (`base64_encode()` and `base64_decode()`).

The input filter for decoding would (1) check if any of the high bytes are set, and (2) squeeze out every second byte. It would then feed the modified buffer to `base64_decode()`. The output filter for encoding would simply pad the output with zero bytes. This too can be done very efficiently with SIMD instructions. Of course, SIMD instructions work on many bytes at the same time, so you'd also need to handle individual bytes at the tail, and so on. But on the whole it's not infeasible.

Also, if you're not interested in duplicating the whole API and whatnot, and if you're fine with using a development branch, you could "hack" the streaming codecs to do the byte shuffling on the fly. Was that the kind of adaptation you were alluding to when you said that you looked at the SSSE3 implementation but were confused by the numbers? I could perhaps help you there.
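A matching sketch of the encode-side output filter described above, which widens the 8-bit base64 output back to 16-bit characters by inserting zero high bytes (again scalar, with an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

// Zero-extend each 8-bit character to a 16-bit character.
static void widen_chars8(const uint8_t *src, size_t srclen, uint16_t *dst)
{
    for (size_t i = 0; i < srclen; i++)
        dst[i] = src[i];
}
```

In the non-streaming case, `base64_encode()` would write into a temporary 8-bit buffer, which is then widened into the `uint16_t` array.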
Oops, guess I shouldn't think before coffee... the non-plain codec changes won't be that trivial. I really only need 16-bit decoding, so that makes it a little easier. I'm still a little stumped as to what the non-plain codec changes would need to be (as far as adjusting the calculations and such), as I'm not very familiar with the intrinsics. After some brief research it does look like there are 16-bit intrinsics that could be used, but there are other (less obvious, to me) changes that I suspect would also need to be made, such as a 16-bit-compatible counterpart of the existing lookup and shuffle steps.

EDIT: I don't mind duplicating the decode functions locally, but figuring out the necessary (efficient) intrinsics changes is what I guess I'd be struggling with.

EDIT 2: I should also note, I'm currently not using the streaming functionality, just the one-shot `base64_decode()`.
If you're not using the streaming functions, why not use a two-stage pipeline and preprocess the input string using SSSE3 functions before sending it to the decoder? That should be roughly as fast as doing the conversion on the fly (we're doing the same amount of work), but wouldn't require any changes to the base64 library. You'd never have to touch the decoder code. You can do the conversion in-place, because it's guaranteed that the output will always be shorter than the input.

As for the preprocessor function, it would contain a loop that branches on a condition: if the remaining input size is >= 16 bytes, then use the SSSE3 code, else use the fallback code (which works byte-by-byte). The fallback code is pretty trivial. The SSSE3 code can look something like this (off-the-cuff pseudocode and not tested):

    // `in` is an __m128i containing 8 16-bit characters.
    // `o` is a uint8_t* pointing to the output buffer.

    // Compare with zero.
    // `iszero` will contain 0xFF where the input byte was zero, 0x00 where it was not.
    __m128i iszero = _mm_cmpeq_epi8(in, _mm_setzero_si128());

    // Extract a bitmask from the iszero mask.
    // Bits will be 1 if the corresponding byte in `in` was equal to zero:
    int mask = _mm_movemask_epi8(iszero);

    // Check if any of the high bytes were nonzero, return error if so
    // (FIXME: this is GCC-only binary literal syntax, just to give the idea):
    if ((mask & 0b0101010101010101) != 0b0101010101010101)
        return -1;

    // Squeeze out every high byte, so that only the low bytes remain
    // (FIXME: not sure of the byte mask pattern):
    __m128i shuf = _mm_shuffle_epi8(in, _mm_set_epi8(
        -1, -1, -1, -1,
        -1, -1, -1, -1,
         0,  2,  4,  6,
         8, 10, 12, 14));

    // Write back 16 bytes, of which 8 bytes are output:
    _mm_storeu_si128((__m128i *)o, shuf);

    // Move output pointer 8 bytes:
    o += 8;
Thanks, that got me started in the right direction. FWIW the byte mask pattern ended up needing to be (at least for little endian?):
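A plausible little-endian version of that block, reconstructed here as a self-contained sketch (this is an assumption, not the snippet from the original comment; note that on little-endian the high bytes land on the odd bit positions of the movemask, so the high-byte test differs from the pseudocode above):

```c
#include <stdint.h>
#include <tmmintrin.h>

// Narrow 8 little-endian 16-bit characters (16 input bytes) to 8 bytes.
// Returns 0 on success, -1 if any high byte was nonzero.
static int narrow8_ssse3(__m128i in, uint8_t *o)
{
    // 0xFF where the input byte is zero, 0x00 where it is not.
    __m128i iszero = _mm_cmpeq_epi8(in, _mm_setzero_si128());

    // One bit per byte; the high bytes sit at the odd positions, so all
    // odd bits must be set for the input to be valid ASCII.
    int mask = _mm_movemask_epi8(iszero);
    if ((mask & 0xAAAA) != 0xAAAA)
        return -1;

    // Gather the low bytes (even positions) into the first 8 output bytes;
    // the remaining control bytes have their high bit set, which zeroes
    // the upper 8 output bytes.
    __m128i shuf = _mm_shuffle_epi8(in, _mm_set_epi8(
        -1, -1, -1, -1,
        -1, -1, -1, -1,
        14, 12, 10,  8,
         6,  4,  2,  0));

    _mm_storeu_si128((__m128i *) o, shuf);
    return 0;
}
```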
The high byte check isn't working, but I haven't really looked at that yet. One other problem I have when decoding is that I need to ignore invalid characters instead of returning early on such characters. Currently I'm doing something like this inside dec/ssse3.c to ignore invalid characters:

    int invalid = _mm_movemask_epi8(CMPEQ(delta, 0));
    if (invalid) {
        // Pack valid bytes together
        uint8_t pos = 0;
        #define CHECK_BYTE(n) \
            if (!(invalid & 1 << n)) \
                c[pos++] = c[n];
        CHECK_BYTE(0);
        CHECK_BYTE(1);
        CHECK_BYTE(2);
        CHECK_BYTE(3);
        CHECK_BYTE(4);
        CHECK_BYTE(5);
        CHECK_BYTE(6);
        CHECK_BYTE(7);
        CHECK_BYTE(8);
        CHECK_BYTE(9);
        CHECK_BYTE(10);
        CHECK_BYTE(11);
        CHECK_BYTE(12);
        CHECK_BYTE(13);
        CHECK_BYTE(14);
        CHECK_BYTE(15);
        // Shift remaining bytes to fill in the gap created by invalid bytes
        memmove(c + pos, c + 16, srclen - (16 - pos));
        // Retry the current position with the newly shifted-in data, with
        // the source length adjusted appropriately
        srclen -= (16 - pos);
        continue;
    }

This seems to work, but do you know if there is a more efficient way to do this? The only other way I could think of would be to load the position changes into a new shuffle mask and use `_mm_shuffle_epi8()`.
The reason to ignore invalid bytes on decode is to skip newlines and carriage returns, right? I've been thinking about adding an option to the library for that, actually. As you've found, it's not trivial in pure SSE to remove those bytes from the input stream and shift over the rest. But there must be a way; there's usually a crafty left-field approach if you think about it hard enough. I'll see if I can come up with something.

In the meantime, you could perhaps get away with a slight alteration in the current library. When the SSE code encounters an invalid input char, it currently falls back to the bytewise, per-character handling; that bytewise path could be changed to skip over invalid characters instead of returning an error. This starts to get slow if you have a lot of invalid characters, since you're doing bytewise decoding on each one of them, but if there's occasionally a character here or there, it shouldn't matter too much.
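To illustrate the bytewise skipping behaviour in isolation, here is a generic, self-contained sketch (this is not the library's internal fallback loop; the function name and return convention are made up):

```c
#include <stddef.h>
#include <stdint.h>

// Decode unpadded base64, silently skipping any byte that is not in the
// base64 alphabet (whitespace, newlines, '=', ...).
// Returns the number of bytes written to `out`.
static size_t b64_decode_skip_invalid(const char *src, size_t srclen, uint8_t *out)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Reverse lookup table: -1 marks an invalid character.
    int8_t table[256];
    for (int i = 0; i < 256; i++)
        table[i] = -1;
    for (int i = 0; i < 64; i++)
        table[(uint8_t) alphabet[i]] = (int8_t) i;

    uint32_t buf = 0;
    int bits = 0;
    size_t outlen = 0;

    for (size_t i = 0; i < srclen; i++) {
        int8_t v = table[(uint8_t) src[i]];
        if (v < 0)
            continue;               // skip instead of returning an error
        buf = (buf << 6) | (uint32_t) v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[outlen++] = (uint8_t) (buf >> bits);
        }
    }
    return outlen;
}
```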
The main purpose is to skip whitespace, but all other invalid characters should be ignored as well. The reason being I need it for backwards compatibility with an existing base64 decoder. |
Here's the scheme I've come up with so far. It's probably not optimal, and I didn't benchmark or even compile it.

Say we have nine valid bytes of input. We pick all nine of them for the output. Now say we have the same thing but with two invalid characters. Each time we encounter an invalid character, we need to increment the shuffle mask index. Also, in that example the total valid length becomes two characters shorter; we must account for that when we calculate the output length.

This is probably not the fastest possible method, but we can construct the shuffle mask on-the-fly from the bitmask. In code, it might look something like this (untested):

    int i = 0;

    // Invalid character bitmask:
    uint16_t mask = _mm_movemask_epi8(CMPEQ(delta, 0));

    // Temporary array in which to construct the shuffle mask:
    uint8_t shm[16] __attribute__((aligned(16)));

    // Create the shuffle mask:
    for (int j = 0; j < 16; j++)
        if (!(mask & (1U << j)))
            shm[i++] = j;

    // This is not strictly necessary:
    while (i < 16)
        shm[i++] = -1;

    // Shuffle the input string:
    str = _mm_shuffle_epi8(str, _mm_load_si128((__m128i *)shm));

    // Number of error characters encountered
    // (the string length should be decremented by this amount):
    int errchars = __builtin_popcount(mask);

I didn't test this code, so maybe the endianness is wrong or something. Just to give an idea.

EDIT: the first version of the loop was incorrect.
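As a standalone illustration of the mask-construction idea (GCC or Clang with SSSE3 assumed; the input block and the bitmask here are made up for the example):

```c
#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h>

int main(void)
{
    // A 16-byte block with two "invalid" characters ('*') at positions 3 and 9.
    __m128i str = _mm_loadu_si128((const __m128i *) "abc*defgh*ijklmn");
    uint16_t mask = (1U << 3) | (1U << 9);

    // Build the shuffle mask: valid positions packed to the front, the
    // rest set to 0xFF so the corresponding output bytes become zero.
    uint8_t shm[16] __attribute__((aligned(16)));
    int i = 0;
    for (int j = 0; j < 16; j++)
        if (!(mask & (1U << j)))
            shm[i++] = (uint8_t) j;
    while (i < 16)
        shm[i++] = 0xFF;

    str = _mm_shuffle_epi8(str, _mm_load_si128((const __m128i *) shm));

    char out[17] = { 0 };
    _mm_storeu_si128((__m128i *) out, str);
    printf("%s\n", out);    // prints "abcdefghijklmn"
    return 0;
}
```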
Without taking into account the whitespace skipping (which would also apply to the existing 8-bit decoder), I tested 2 things: a two-stage implementation (a preprocessing pass that narrows the input, followed by the regular decoder) and direct decoding from `uint16_t` inside the codec.

What I like about the first one is that it requires less work and would allow for other preprocessing stages to be implemented the same way. But... it's slower than the second one. The second implementation didn't require that much code duplication after all. Overall, it's 15% faster.

Here are the results from the two-stage implementation:

Direct decoding:
I had a gut feeling that a separate `uint16_t` decoder would come out ahead. Some questions:
Yes, having a direct …
@mayeut Thanks, I ended up using blend to (first) replace instances of … I also noticed … It would be nice to see the built-in 16-bit decoding merged in the main repo here, but I understand if you feel like it may not be common enough to warrant its inclusion. At any rate, I want to thank you all for your input, guidance, and patience :-)
@mayeut while I think your proof-of-concept is remarkable (how are you so damn fast? :) I'm hesitant to merge something like it, for the reasons mentioned above. In my opinion, Base64 is an algorithm that works on octets and produces octets. Limited scope. (It's easier to argue this when the code and API changes involved to support `uint16_t` are as substantial as they are.)

On the other hand, I think that skipping whitespace and newlines is a good candidate for inclusion into this library, because invalid characters are a runtime property of the data instead of a compile-time property of the data's type. Skipping invalid characters is also a feature that has been talked about in the past.
First off, thank you for providing this open source library (and under the BSD license).

My question is: would it be possible to see versions of the decoding functions that support `uint16_t[]` strings in addition to `const char[]` strings? I tried looking at just the ssse3 implementation (since that is what is used on my dev machine) to see how that might be implemented, but the numbers in the comments have me confused.

My particular use case is that I get a `uint16_t[]` from the V8 JS engine, which is how it internally stores its strings. Being able to use that value type directly would be more efficient than having to make a copy to an 8-bit array first.

I'd like to get `uint16_t[]` supported for each codec, since we run on a variety of platforms. Any help you can give would be much appreciated.