
[Good First Issue]: Create a GGUF reader #1665

Open
AlexKoff88 opened this issue Feb 3, 2025 · 14 comments
Assignees
Labels
good first issue (Good for newcomers)

Comments

@AlexKoff88
Collaborator

The idea is to have functionality that allows reading the GGUF format and creating an OpenVINO GenAI compatible representation that can be used to instantiate LLMPipeline() from it.
This task includes:

  • The initial scope can include support of llama-based LLMs (e.g. Llama-3.2 and SmolLMs) and FP16, Q8_0, Q4_0, Q4_1 models.
  • All the code should be written in C++.
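
For reference, a GGUF file starts with a small fixed header defined by the GGUF specification: a 4-byte magic ("GGUF"), a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian; the metadata key/value pairs and tensor infos that follow carry the model hyperparameters, tokenizer data, tensor shapes and quantization types. Below is a minimal, illustrative sketch of reading and validating that header; GGUFHeader and read_gguf_header are hypothetical names, not part of any existing OpenVINO GenAI API.

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

// Fixed-size GGUF file header as defined by the GGUF specification.
struct GGUFHeader {
    uint32_t magic;        // "GGUF" = 0x46554747 when read little-endian
    uint32_t version;      // format version (v3 at the time of writing)
    uint64_t tensor_count; // number of tensor infos following the metadata
    uint64_t kv_count;     // number of metadata key/value pairs
};

// Illustrative helper: read and validate the header of a .gguf file.
GGUFHeader read_gguf_header(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) throw std::runtime_error("cannot open " + path);

    GGUFHeader header{};
    file.read(reinterpret_cast<char*>(&header.magic), sizeof(header.magic));
    file.read(reinterpret_cast<char*>(&header.version), sizeof(header.version));
    file.read(reinterpret_cast<char*>(&header.tensor_count), sizeof(header.tensor_count));
    file.read(reinterpret_cast<char*>(&header.kv_count), sizeof(header.kv_count));
    if (!file || header.magic != 0x46554747u)
        throw std::runtime_error(path + " is not a GGUF file");
    return header;
}

int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    GGUFHeader h = read_gguf_header(argv[1]);
    std::printf("GGUF v%u: %llu tensors, %llu metadata entries\n",
                h.version,
                (unsigned long long)h.tensor_count,
                (unsigned long long)h.kv_count);
    return 0;
}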

@AlexKoff88 AlexKoff88 converted this from a draft issue Feb 3, 2025
@ilya-lavrenov ilya-lavrenov added the good first issue Good for newcomers label Feb 3, 2025
@ilya-lavrenov ilya-lavrenov changed the title Create a GGUF reader [Good First Issue]: Create a GGUF reader Feb 3, 2025
@Geeks-Sid

Can this be broken down into smaller, more concrete tasks? That would allow contributors to pick tasks off one by one and build things up gradually instead of all at once.

@AlexKoff88
Collaborator Author

AlexKoff88 commented Feb 4, 2025

It can be, for sure, but the way I see it, these tasks should be executed sequentially. For example:

  • One can start by enabling llama-3.2-1b in FP16.
  • Parsing and converting the tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have the core functionality in place.
  • Then, a few tasks can be executed in parallel:
    • Enable Q8_0 llama
    • Enable Q4_0 and Q4_1 llama (see the dequantization sketch after this list for the block layouts)
    • Enable and verify other llama-based models such as Llama-3.1-8B, SmolLMs
    • Enable the most popular quantization schemes such as Q4_K_M
    • Enable Qwen model family
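
For the Q8_0 and Q4_0/Q4_1 items above: in the ggml/GGUF layout, Q8_0 stores blocks of 32 weights as an FP16 scale followed by 32 int8 quants, while Q4_0 stores an FP16 scale followed by 16 bytes of packed 4-bit quants (low nibbles are elements 0..15, high nibbles 16..31, each offset by 8); Q4_1 additionally carries a per-block minimum. A minimal dequantization sketch along those lines is below; it is illustrative only and not taken from the POC.

#include <cstdint>
#include <cstring>
#include <vector>

// Minimal IEEE half -> float conversion (normals, zero, subnormals, inf/NaN).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) { bits = sign; }              // signed zero
        else {                                        // subnormal: renormalize
            exp = 113;                                // 127 - 15 + 1
            while ((mant & 0x400) == 0) { mant <<= 1; --exp; }
            bits = sign | (exp << 23) | ((mant & 0x3FF) << 13);
        }
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000 | (mant << 13);      // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Q8_0: blocks of 32 weights, each block = FP16 scale + 32 int8 quants.
static void dequant_q8_0(const uint8_t* src, size_t n_blocks, std::vector<float>& dst) {
    for (size_t b = 0; b < n_blocks; ++b) {
        const uint8_t* blk = src + b * (2 + 32);
        uint16_t d_bits; std::memcpy(&d_bits, blk, 2);
        float d = fp16_to_fp32(d_bits);
        const int8_t* q = reinterpret_cast<const int8_t*>(blk + 2);
        for (int i = 0; i < 32; ++i) dst.push_back(d * q[i]);
    }
}

// Q4_0: blocks of 32 weights, each block = FP16 scale + 16 bytes of packed
// 4-bit quants (low nibbles are elements 0..15, high nibbles 16..31, offset by 8).
static void dequant_q4_0(const uint8_t* src, size_t n_blocks, std::vector<float>& dst) {
    for (size_t b = 0; b < n_blocks; ++b) {
        const uint8_t* blk = src + b * (2 + 16);
        uint16_t d_bits; std::memcpy(&d_bits, blk, 2);
        float d = fp16_to_fp32(d_bits);
        const uint8_t* q = blk + 2;
        float lo[16], hi[16];
        for (int i = 0; i < 16; ++i) {
            lo[i] = d * (int(q[i] & 0x0F) - 8);
            hi[i] = d * (int(q[i] >> 4) - 8);
        }
        dst.insert(dst.end(), lo, lo + 16);
        dst.insert(dst.end(), hi, hi + 16);
    }
}

A real reader could also keep the data quantized and map it onto OpenVINO's compressed-weight representation rather than expanding everything to float32.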

...

@11happy

11happy commented Feb 12, 2025

.take


Thank you for looking into this issue! Please let us know if you have any questions or require any help.

@Captain-MUDIT

Hello @AlexKoff88, I would like to work on this. Can it be assigned to me?

@11happy

11happy commented Feb 15, 2025

Hi @AlexKoff88, @ilya-lavrenov, I am able to parse the GGUF model using the gguf tools you provided above; however, they are written in C (I had to do some linking on my end). In the POC you provided, in the load_gguf_model function,
the weights is a very complex map containing vector and vector<vector> entries (layers). Should I continue with a similar implementation on my end? Please point me in the right direction. Also, I had a doubt about what you mentioned earlier:

Parsing and converting tokenizer from GGUF format to OpenVINO (tokenizer/detokenizer models). After that, we will have core functionality in place.

Do we only need to parse the tokenizer, not the whole model?

Thank you

This is the update on my end:

main.cpp

extern "C" {
    #include "gguf-tools/gguflib.h"
}
#include "openvino/op/concat.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/gather_elements.hpp"
#include "openvino/op/unsqueeze.hpp"
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <assert.h>
#include <errno.h>
#include <math.h>
#include <inttypes.h>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct {
    int verbose;        // --verbose option
    int diffable;       // --diffable option
} Opt = {0};


// Reads GGUF metadata, derives the llama config from it and (eventually)
// returns the weights as well.
std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> load_gguf_model(const char *model_path){

    std::map<std::string,int> config, meta_data;
    std::map<std::string,std::vector<double>> weights;  // weight loading not implemented yet
    gguf_ctx *ctx = gguf_open(model_path);
    if (ctx == NULL) {
        perror(model_path);
        exit(1);
    }
    gguf_key key;
    while (gguf_get_key(ctx,&key)) {
        // NOTE: every value is read as uint32 here, so float metadata such as
        // llama.attention.layer_norm_rms_epsilon and llama.rope.freq_base is
        // not decoded correctly yet and the corresponding config entries below
        // are truncated to int.
        meta_data[std::string(key.name,key.namelen)] = key.val->uint32;
        gguf_next(ctx,key.type,key.val,Opt.verbose);
    }
    config.emplace("layer_num",meta_data["llama.block_count"]);
    config.emplace("head_num",meta_data["llama.attention.head_count"]);
    config.emplace("head_size",meta_data["llama.embedding_length"]/meta_data["llama.attention.head_count"]);
    config.emplace("head_num_kv",(meta_data.count("llama.attention.head_count_kv")?meta_data["llama.attention.head_count_kv"]:meta_data["llama.attention.head_count"]));
    config.emplace("max_position_embeddings",((meta_data.count("llama.context_length")?meta_data["llama.context_length"]:2048)));
    config.emplace("rotary_dims",meta_data["llama.rope.dimension_count"]);
    config.emplace("rms_norm_eps",meta_data["llama.attention.layer_norm_rms_epsilon"]);
    config.emplace("rope_freq_base",((meta_data.count("llama.rope.freq_base")?meta_data["llama.rope.freq_base"]:10000.0)));
    
    for(auto x : config){
        std::cout<<x.first<<" : "<<x.second<<std::endl;
    }
    return{config,weights};
}

int main(int argc, char* argv[]){
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <model.gguf>\n";
        return 1;
    }
    std::pair<std::map<std::string,int>,std::map<std::string,std::vector<double>>> model = load_gguf_model(argv[1]);

    return 0;
}

Makefile:

CC=gcc
CXX=g++
CFLAGS=-march=native -ffast-math -g -ggdb -Wall -W -pedantic -O3
INCLUDES=-I./gguf-tools

OBJECTS=gguf-tools/gguflib.o gguf-tools/sds.o gguf-tools/fp16.o

main: $(OBJECTS) main.cpp
	$(CXX) $(CFLAGS) $(INCLUDES) main.cpp $(OBJECTS) -o main

%.o: %.c
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

clean:
	rm -f main $(OBJECTS)

I am also implementing the model conversion logic as outlined in the POC; I will update once I finish it.
Thank you
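
For the model-conversion step mentioned above, one building block is wrapping a parsed (and, for quantized types, dequantized) GGUF tensor into an OpenVINO constant node, which can then be combined with the ops already included in main.cpp (Concat, Convert, GatherElements, Unsqueeze) to assemble the graph. A minimal sketch follows; make_weight_node is an illustrative helper, not part of the POC.

#include <memory>
#include <vector>

#include "openvino/core/shape.hpp"
#include "openvino/op/constant.hpp"

// Illustrative helper: turn one GGUF tensor (already expanded to float32)
// into an OpenVINO Constant node for later use in the LLM graph.
std::shared_ptr<ov::Node> make_weight_node(const std::vector<float>& data,
                                           const ov::Shape& shape) {
    // ov::op::v0::Constant copies the buffer, so `data` may be freed afterwards.
    return std::make_shared<ov::op::v0::Constant>(ov::element::f32, shape, data);
}

An alternative is to keep the weights in FP16 in the Constant and insert a Convert node, which the already-included openvino/op/convert.hpp would cover.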

@11happy

11happy commented Feb 15, 2025

@Captain-MUDIT would love to collaborate with you; let's wait for Alex's response and then we can decide how to proceed.

@Captain-MUDIT

Captain-MUDIT commented Feb 16, 2025

Sure @11happy

@Captain-MUDIT

@AlexKoff88
.take


Thanks for being interested in this issue. It looks like this ticket is already assigned to a contributor. Please communicate with the assigned contributor to confirm the status of the issue.

@AlexKoff88
Collaborator Author

Hi @11happy and @Captain-MUDIT, thank you for your interest.

@11happy, regarding your question about how to load the GGUF file in the right way: you can follow how MLX does it, i.e. add "gguf-tools" to the submodules and borrow the code from MLX that parses GGUF with it. Details are here: https://github.com/ml-explore/mlx/blob/main/mlx/io/gguf.cpp

@Captain-MUDIT, you can take the tokenizer conversion part. The task is to transform the GGUF tokenizer data into an OpenVINO tokenizer. OpenVINO has a dedicated project for converting tokenizers from HF Transformers to the OpenVINO format: https://github.com/openvinotoolkit/openvino_tokenizers. The idea is to take the tokenizer config, vocab and metadata and use part of the openvino_tokenizers lib to do the conversion. Adding @apaniukov for consultations.

@janviisonii23

@AlexKoff88 can I also work on this issue?

@AlexKoff88
Collaborator Author

@janviisonii23, you can, as there are a few subtasks in it, but you will have to wait a bit until the core part is implemented.

@apaniukov
Contributor

@11happy @Captain-MUDIT

There is a TokenizerPipeline class for building tokenizer/detokenizer models. The easiest way is to parse the tokenizer data from the .gguf file, build such a pipeline, and get the models from it; see the HF-tiktoken tokenizer example.
You can get an example of what steps are created by checking the steps attribute of the pipeline object that is created here by converting the GGUF tokenizer using the HF AutoTokenizer class. Note that the resulting tokenizer might not accurately represent the GGUF tokenizer, because each conversion step (GGUF → HF → OV) might introduce some errors.

The other way is to build the tokenizer by creating the model graph directly, like in this RWKV example.

You might also have to create several base pipelines for different tokenizer types (a diagram of the tokenizer types is attached in the original comment).
