Implementation of Phi-3-mini-4k-instruct (Q8_0 or Q4_0). #6

Open · wants to merge 1 commit into base: main
Conversation

srogmann

This is a PR to run Phi-3-Mini-4K. It only includes Phi3.java. I wrote this file because Phi-3 is faster at simple tasks.

I intentionally left Llama3.java unchanged, even though some synergies could have been achieved. The current Llama3.java is a beautiful, complete example of a transformer model, and it might be a shame to clutter it with additions for Phi-3, even though they would help reusability.

The nice thing is that, thanks to the roughly 2,000 lines in Llama3.java, Phi-3 can be added with only about 800 lines (a bit less once the debug lines are removed). But Phi-3 is not Llama 3, so it would be understandable if you decided that Phi-3 does not belong here; who knows where this ends (Gemma 2 is also interesting ;-) ).

@mukel
Owner

mukel commented Jul 10, 2024

Hey Sascha, that's awesome!
I also ported Phi3; in my implementation I added support for tensor slices, so there is one large qkv tensor and q, k, v are just slices of it, which avoids the copies. The same goes for the combined ffn_gate and ffn_up tensors.
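A minimal sketch of the idea (the Tensor types below are made up for illustration, not the actual implementation):

// Sketch: a tensor slice is just a view (offset + length) into a backing tensor,
// so q, k and v can share the storage of one fused qkv tensor without copying.
interface Tensor {
    int size();
    float get(int index);
}

record TensorSlice(Tensor backing, int offset, int length) implements Tensor {
    @Override
    public int size() {
        return length;
    }

    @Override
    public float get(int index) {
        return backing.get(offset + index);
    }
}

// Usage (dimensions illustrative): slice q, k, v out of one fused qkv tensor.
// Tensor q = new TensorSlice(qkv, 0, dim);
// Tensor k = new TensorSlice(qkv, dim, kvDim);
// Tensor v = new TensorSlice(qkv, dim + kvDim, kvDim);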

To avoid duplicated work, I also ported Mistral/Codestral, Gemma 1 and 2 (no sliding window attention yet), Qwen2 and Phi3.
You are quite resourceful: since you were able to port Phi3 on your own, I assume you are already familiar with the codebase. Would you like to contribute further, e.g. new models, refactors, improvements ...?

@mukel
Owner

mukel commented Jul 10, 2024

I see you also implemented the right RoPE. Nice!

@srogmann
Author

> I added support for tensor slices

That's great! I didn't like the copies.

> would you like to contribute further e.g. new models, refactors, improvements ... ?

I could take a look at Q6_K, if there's no objection. I was planning on looking into that part anyway.

@mukel
Owner

mukel commented Jul 11, 2024

Great!
With Q6_K support I won't need to re-publish pure quantized versions of the models anymore.

I found the graphical representation of the quantized blocks quite understandable here: https://www.modular.com/blog/whats-new-in-max-24-4-max-on-macos-fast-local-llama3-native-quantization-and-gguf-support

Also, a small note on performance: in the current implementation, if you mix several quantization implementations, e.g. Q4_0 and Q6_K, the compiler will generate a not-so-good version of the matmul. This can be fixed with minor adjustments; it's been in my backlog for some time.
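Roughly, the problem is call-site pollution in a shared matmul loop. A sketch with made-up names (not the actual Llama3.java classes, and not necessarily the adjustment meant here):

// Once tensors of several quantization types flow through one shared loop, the
// dot() call site in that loop becomes polymorphic and C2 may stop inlining the
// vectorized implementations.
abstract class QuantTensor {
    abstract float dot(int thisOffset, float[] that, int size);

    void matmul(float[] that, float[] out, int dim0, int dim1) {
        for (int i = 0; i < dim0; i++) {
            out[i] = dot(i * dim1, that, dim1); // shared, possibly megamorphic call site
        }
    }
}

// One possible adjustment: give each concrete type its own copy of the loop,
// so every compiled loop body only ever sees a single dot() implementation.
final class Q6KTensor extends QuantTensor {
    @Override
    float dot(int thisOffset, float[] that, int size) {
        return 0f; // stand-in for the vectorized Q6_K dot product
    }

    @Override
    void matmul(float[] that, float[] out, int dim0, int dim1) {
        for (int i = 0; i < dim0; i++) {
            out[i] = dot(i * dim1, that, dim1); // receiver is exactly Q6KTensor, so C2 can devirtualize
        }
    }
}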

@srogmann
Author

@mukel I have a Q6_K implementation and first measurements, but I'm not really happy with them.

Q8_0 with 512 bits runs at 8.2 tokens/s and Q8_0 with 256 bits at 6.1 tokens/s, but Q6_K with 256 bits runs at only 1.5 tokens/s.

Q6_K with 512 bits runs at only 0.34 tokens/s. Initially I assumed that 512 bits would be ideal for Q6_K quantization. Since Q6_K is more complicated than Q8_0 or Q4_0, I expected it to be a bit slower, but not to this extent.

The current dot method with 256 bits:

private static float vectorDot256(Q6_KFloatTensor thiz, int thisOffset, FloatTensor that, int thatOffset, int size) {
    float result = 0f;
    int j = 0;

    // Align thisOffset + j to type().getBlockSize().
    int blockSize = GGMLType.Q6_K.getBlockSize();
    assert Integer.bitCount(blockSize) == 1 : "power of 2";
    int alignmentBound = Math.min(size, -thisOffset & (blockSize - 1));
    if (alignmentBound > 0) {
        result += FloatTensor.scalarDot(thiz, thisOffset, that, thatOffset, alignmentBound);
        j += alignmentBound;
    }
    assert (thisOffset + j) % blockSize == 0;

    FloatVector val = FloatVector.zero(F_SPECIES); // FloatVector.SPECIES_256
    int typeSize = GGMLType.Q6_K.getTypeSize();
    int superblockSize = GGMLType.QK_K / 16;
    int blockOffset = (thisOffset + j) / blockSize * typeSize;
    int upperBound = size / blockSize * blockSize;
    var bSpecies128 = ByteVector.SPECIES_128; // 16 bytes
    var bSpecies256 = ByteVector.SPECIES_256; // 32 bytes
    // Layout:
    //   uint8 ql[QK_K/2]: quants, lower 4 bits
    //   uint8 qh[QK_K/4]: quants, upper 2 bits
    //   int8  scales[QK_K/16]: scales, quantized with 8 bits
    //   fp16  d: super-block scale
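    //   Dequantization per element: value = d * scales[i/16] * (((ql & 0x0F) | ((qh & 0x03) << 4)) - 32)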
    for (; j < upperBound; j += blockSize, blockOffset += typeSize) {
        int offsetQl = blockOffset; // size QK_K/2 = 128 bytes = lower 4 bits of 256 quants
        int offsetQh = offsetQl + GGMLType.QK_K / 2; // size QK_K/4 = 64 bytes = upper 2 bits of 256 quants
        int offsetScales = offsetQh + GGMLType.QK_K / 4; // size QK_K/16 = 16 bytes = 16 block-scales
        int offsetD = offsetScales + superblockSize;
        float wScaleValue = Float.float16ToFloat(thiz.memorySegment.get(JAVA_SHORT_LE, offsetD));
        var wScale = FloatVector.broadcast(F_SPECIES, wScaleValue);
        var wScale8 = ByteVector.fromMemorySegment(bSpecies128, thiz.memorySegment, offsetScales, ByteOrder.LITTLE_ENDIAN);
        for (int blk128 = 0; blk128 < 2; blk128++) {
            final ByteVector bytesQl0 = ByteVector.fromMemorySegment(bSpecies256, thiz.memorySegment, offsetQl + 64 * blk128, ByteOrder.LITTLE_ENDIAN);
            final ByteVector bytesQl1 = ByteVector.fromMemorySegment(bSpecies256, thiz.memorySegment, offsetQl + 64 * blk128 + 32, ByteOrder.LITTLE_ENDIAN);
            final ByteVector bytesQh0 = ByteVector.fromMemorySegment(bSpecies256, thiz.memorySegment, offsetQh + 32 * blk128, ByteOrder.LITTLE_ENDIAN);

            var ql00 = bytesQl0.and((byte) 0x0f);
            var ql01 = bytesQl1.and((byte) 0x0f);
            var ql10 = bytesQl0.lanewise(VectorOperators.LSHR, 4); // index(ql1) = index(ql0) + QK_K/4
            var ql11 = bytesQl1.lanewise(VectorOperators.LSHR, 4); // index(ql1) = index(ql0) + QK_K/4
            var qh00 = bytesQh0.lanewise(VectorOperators.LSHL, 4).and((byte) 0b110000);
            var qh10 = bytesQh0.lanewise(VectorOperators.LSHL, 2).and((byte) 0b110000); // index(qh1) = index(qh0) + QK_K/8
            var qh20 = bytesQh0.and((byte) 0b110000);
            var qh30 = bytesQh0.lanewise(VectorOperators.LSHR, 2).and((byte) 0b110000);
            ByteVector[] q = new ByteVector[] {
                ql00.add(qh00).sub((byte) 32),
                ql01.add(qh10).sub((byte) 32),
                ql10.add(qh20).sub((byte) 32),
                ql11.add(qh30).sub((byte) 32),
            };
            final int blk128Idx = 8 * blk128;
            FloatVector sum = FloatVector.broadcast(F_SPECIES, 0f);
            for (int i = 0; i < 16; i++) {
                final int idx2 = i / 2;
                final int idx4 = i / 4;
                sum = sum.add(that.getFloatVector(F_SPECIES, thatOffset + j + (2 * blk128Idx + i) * F_SPECIES.length()).mul(q[idx4].castShape(F_SPECIES, i % 4)).mul(wScale8.lane(blk128Idx + idx2)));
            }
            val = sum.fma(wScale, val);                    
        }
    }
    result += val.reduceLanes(VectorOperators.ADD);

    // Remaining entries.
    if (j < size) {
        System.out.println("size-j=" + (size - j));
        result += FloatTensor.scalarDot(thiz, thisOffset + j, that, thatOffset + j, size - j);
    }

    return result;
}

But perhaps there is a better arrangement of the bit operations :-).

@mukel
Owner

mukel commented Jul 29, 2024

Are you on Apple Silicon or Intel?
Apple Silicon (up to M3) only supports NEON with up to 128-bit vectors; anything larger falls back to a slow default path (slower than scalar code).
Just using the Vector API doesn't guarantee proper compilation; if that's what is happening here, this may serve as a real-world benchmark for C2 and Graal to ensure they compile it well.
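For reference, a quick way to check what the JVM will actually vectorize on a given machine (a sketch; Llama3.java may select its species differently):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector SpeciesCheck.java
public class SpeciesCheck {
    public static void main(String[] args) {
        // SPECIES_PREFERRED reflects the widest vectors the JVM will use on this CPU,
        // e.g. 128-bit NEON on Apple Silicon, 256/512-bit AVX on recent x86.
        VectorSpecies<Float> preferred = FloatVector.SPECIES_PREFERRED;
        System.out.println("preferred float species: " + preferred
                + " (" + preferred.vectorBitSize() + " bits)");

        // Cap the species used by hand-written kernels at the preferred width, so a
        // 256/512-bit code path is never taken on a machine with only 128-bit vectors.
        VectorSpecies<Float> species = preferred.vectorBitSize() >= 256
                ? FloatVector.SPECIES_256
                : FloatVector.SPECIES_128;
        System.out.println("kernel species: " + species);
    }
}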

@mukel
Owner

mukel commented Jul 30, 2024

You should not store vectors in arrays or fields; otherwise they get materialized, which is slow. It may work, but I wouldn't trust C2's escape analysis here.
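Applied to the vectorDot256 snippet above, that means replacing the ByteVector[] q array with plain locals, roughly like this (sketch):

// Keep the four combined quant vectors in locals instead of a ByteVector[]
// so they can stay in registers; the inner loop is then unrolled in steps of
// four (or selects q0..q3 explicitly) instead of indexing q[idx4].
ByteVector q0 = ql00.add(qh00).sub((byte) 32);
ByteVector q1 = ql01.add(qh10).sub((byte) 32);
ByteVector q2 = ql10.add(qh20).sub((byte) 32);
ByteVector q3 = ql11.add(qh30).sub((byte) 32);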

@srogmann
Author

I'm on Intel (i7-11800H, openjdk 21.0.2 2024-01-16) and AMD (Ryzen 9 7900X, openjdk 22 2024-03-19).

CPU        Quant  Species     Speed
AMD Ryzen  Q8_0   S_512_BIT    8.95 tokens/s
AMD Ryzen  Q8_0   S_256_BIT   10.58 tokens/s
AMD Ryzen  Q8_0   S_128_BIT   10.48 tokens/s
AMD Ryzen  Q6_K   S_512_BIT    0.65 tokens/s
AMD Ryzen  Q6_K   S_256_BIT    2.45 tokens/s
AMD Ryzen  Q6_K   S_128_BIT    0.45 tokens/s

The vectorDot512 method didn't use arrays, but the vectorDot256 and vectorDot128 methods of Q6_K did. I will try it without arrays. The JVM reported S_512_BIT as the preferred size. I'll have to check the code to see if I missed something.

@srogmann
Author

srogmann commented Aug 5, 2024

@mukel
In https://github.com/srogmann/llama3.java/tree/feature/llm_server there is a small digression on the provision of a tcp based LLM-server. I used it to access llama3.java in an IntelliJ-plugin I wrote (not yet online but planned). The server uses a simple binary protocol because I didn't want to add dependencies.

Do you plan to distinguish between llama3.java (simple) and llama3.java (extended)? The first one would be a nice 2.000 liner. The second one could have some extensions (Q6_K, server, ...).

@mukel
Owner

mukel commented Aug 6, 2024

The goal is to promote Llama3.java into a larger effort, e.g. https://github.com/llama4j, to implement more models in the same place and share common parts.
The umbrella project will focus on the "inference engine" part but can easily be consumed by high-level projects like https://github.com/langchain4j/langchain4j.
Having an HTTP server there is a welcome addition.

@mukel
Owner

mukel commented Aug 8, 2024

Hi @srogmann, can you share the Q6_K implementations (all the variants)?
My colleagues in the GraalVM team are extending the compiler's support for the Vector API; Graal can already properly compile Q8_0 and Q4_0, and I expect it to do a good job on the other quantizations as well. I'd like to test Q6_K to make sure Graal is not missing anything.

@srogmann
Author

srogmann commented Aug 8, 2024

Hi @mukel ,

> can you share the Q6_K implementations (all the variants)?

Yes, I will share the Q6_K implementation(s) (it's on another device).

Some remarks: I had a look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c because I wasn't satisfied with the performance of my Q6_K implementation. In the AVX2 branch of the function ggml_vec_dot_q6_K_q8_K in https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c there is only a 256-bit version:

    const __m256i m4 = _mm256_set1_epi8(0xF);
    const __m256i m2 = _mm256_set1_epi8(3);
    const __m256i m32s = _mm256_set1_epi8(32);

So I'm not disappointed that my 512-bit implementation was not as fast as I had hoped. But I was surprised to see that the second factor of the dot product in ggml_vec_dot_q6_K_q8_K is Q8_K, not FLOAT32. This gives the Q6_K dot implementation more performance, and its tail handling is more compact.

@mukel
Owner

mukel commented Aug 8, 2024

I'm very surprised as well.
Does llama.cpp really quantize all vectors in the matrix-vector multiplication to Q8_K? I find this very hard to believe...
Q8_K is the same as Q8_0 but with blocks of 256 elements plus additional sums for each 16-element block.

EDIT: Confirmed: ggerganov/llama.cpp#3477
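For reference, the Q8_K super-block described above, sketched as a plain Java record (the field names follow the llama.cpp convention from memory; this is a sketch of the description, not a byte-exact layout):

// Q8_K super-block: 256 int8 quants, one fp32 scale, plus one sum per
// 16-element group of quants.
record BlockQ8K(
        float d,        // super-block scale
        byte[] qs,      // QK_K = 256 quantized values (int8)
        short[] bsums   // QK_K/16 = 16 sums, one per 16-element group
) { }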

@mukel
Owner

mukel commented Aug 8, 2024

I remember reading somewhere that tensors with dimensions % 256 != 0 were problematic; this may be an explanation.
