[Feature]: Support huggingface transformers LLM model #335

Closed
Zjq9409 opened this issue May 11, 2023 · 18 comments

Zjq9409 commented May 11, 2023

Is your feature request related to a problem? Please describe.

Can Hugging Face LLM model chat caching be supported?

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

SimFG (Collaborator) commented May 11, 2023

You can try to use the GPTCache API; a simple example:

from gptcache.adapter.api import put, get, init_similar_cache

init_similar_cache()
put("hello", "foo")
print(get("hello"))
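As a side note (my own hedged sketch, not from the thread): since init_similar_cache builds a similarity cache, a query that is merely close to the stored one should also hit, depending on the similarity threshold:

from gptcache.adapter.api import put, get, init_similar_cache

init_similar_cache()
put("hello", "foo")
print(get("hello"))         # exact hit, prints "foo"
print(get("hello, world"))  # may also hit via the similarity search, depending on the threshold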

Zjq9409 (Author) commented May 11, 2023

Thank you for your prompt response. Does calling the cache through LangChainLLMs mean wrapping the Hugging Face model? From reading the documentation, GPTCache does not seem to support Hugging Face Hub.


SimFG (Collaborator) commented May 11, 2023

Yes, you can use LangChainLLMs, like:

from gptcache.adapter.api import init_similar_cache
from gptcache.adapter.langchain_models import LangChainLLMs
from langchain.llms import OpenAI

init_similar_cache()  # initialize the global GPTCache cache before the first call

question = "What is GitHub?"  # any prompt
langchain_openai = OpenAI(model_name="text-ada-001")
llm = LangChainLLMs(llm=langchain_openai)
answer = llm(prompt=question)

If you use LangChain, you can also use it like this:

import langchain
from langchain.cache import GPTCache
from gptcache.adapter.api import init_similar_cache

langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))
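For a Hugging Face transformers model specifically, here is a hedged sketch of my own (not an official example; it assumes the google/flan-t5-base model and langchain's HuggingFacePipeline wrapper):

import langchain
from langchain.cache import GPTCache
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
from gptcache.adapter.api import init_similar_cache

# route every LangChain LLM call through GPTCache
langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))

# wrap a local transformers pipeline as a LangChain LLM
pipe = pipeline("text2text-generation", model="google/flan-t5-base", max_length=100)
local_llm = HuggingFacePipeline(pipeline=pipe)

print(local_llm("What is the capital of France?"))  # first call runs the model
print(local_llm("What is the capital of France?"))  # second call should be served from the cache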

SimFG (Collaborator) commented May 12, 2023

@Zjq9409 If there are no other questions, I will close the issue.

Zjq9409 (Author) commented May 12, 2023

import os
import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

from gptcache import Cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.embedding import Onnx
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''
model_id = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)
llm_cache = Cache()
onnx = Onnx()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
question = "上海有什么好吃的"  # "What's good to eat in Shanghai?"
before = time.time()
cached_llm = LangChainLLMs(llm=local_llm)
answer = cached_llm(prompt=question, cache_obj=llm_cache)
print(answer)
print("Read through Time Spent =", time.time() - before)

questions = [
    '上海有哪些好吃的地方',  # where are good places to eat in Shanghai
    '上海有哪些好吃的美食',  # what tasty foods does Shanghai have
    '上海的美食有什么',      # what are Shanghai's delicacies
    '上海有什么好玩的地方',  # what fun places are there in Shanghai
    '怎么还花呗'             # how do I repay Huabei (an unrelated question)
]
for question in questions:
    before = time.time()
    answer = cached_llm(prompt=question, cache_obj=llm_cache)
    print(f'Question: {question}')
    print(answer)
    print("Cache Hit Time Spent =", time.time() - before)

Output:
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Read through Time Spent = 0.35816311836242676
Question: 上海有哪些好吃的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14227008819580078
Question: 上海有哪些好吃的美食
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13515877723693848
Question: 上海的美食有什么
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13379120826721191
Question: 上海有什么好玩的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14519858360290527
Question: 怎么还花呗
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14341497421264648

Why does the last question hit the cache?

SimFG (Collaborator) commented May 12, 2023

Because onnx.to_embeddings uses an English embedding model; you need a Chinese embedding model. Reference: #317
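For example, a hedged sketch of initializing the cache with a Chinese Hugging Face embedding model (the model name here is only an illustration, pick whatever fits your data):

from gptcache import Cache
from gptcache.embedding import Huggingface
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# a Chinese sentence-embedding model instead of the default English ONNX one
huggingface = Huggingface(model="uer/albert-base-chinese-cluecorpussmall")

llm_cache = Cache()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=huggingface.dimension)
data_manager = get_data_manager(cache_base, vector_base)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)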

Zjq9409 (Author) commented May 13, 2023

Does it support huggingface conversation caching?

SimFG (Collaborator) commented May 15, 2023

Regarding the conversation case, it depends on how you use the LLM. If you can give all the conversation info to GPTCache, it will work fine, like the messages of OpenAI's chat completion, which provide the full conversation info.
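As an illustration (my own hedged sketch; the helper name is made up), a custom pre_embedding_func could flatten the whole conversation into the string that gets embedded, so the cache key reflects the full dialogue rather than only the last turn:

def all_messages_as_prompt(data, **_):
    # join every message of an OpenAI-style "messages" list into one string
    messages = data.get("messages", [])
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)

# then pass it when initializing the cache:
# llm_cache.init(pre_embedding_func=all_messages_as_prompt, ...)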

Zjq9409 (Author) commented May 17, 2023

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms.base import LLM
from langchain.document_loaders import UnstructuredFileLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from transformers import AutoTokenizer, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer
import torch
import os
from typing import List
import re
from tqdm import tqdm
import datetime
import numpy as np
import intel_extension_for_pytorch as ipex

EMBEDDING_MODEL = "text2vec"  # embedding model, key into embedding_model_dict
VECTOR_SEARCH_TOP_K = 6
LLM_MODEL = "chatglm-6b"      # LLM model name, key into llm_model_dict
LLM_HISTORY_LEN = 3
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
STREAMING = False
SENTENCE_SIZE = 100
CHUNK_SIZE = 250
embeddings = None
embedding_model_dict = {
    "text2vec": "/home/intel/zjq/prompt_test/text2vec-large-chinese/",
}

llm_model_dict = {
    "chatglm-6b-int4-qe": "THUDM/chatglm-6b-int4-qe",
    "chatglm-6b-int4": "THUDM/chatglm-6b-int4",
    "chatglm-6b": "/home/intel/zjq/chatglm",
}
VS_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "vector_store")
class ChatGLM(LLM):
    max_token: int = 10000
    temperature: float = 0.8
    top_p = 0.9
    tokenizer: object = None
    model: object = None
    model_bf16 : object = None
    history_len: int = 10

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "ChatGLM"

    def _call(self,
              prompt: str,
              history: List[List[str]] = [],
              streaming: bool = STREAMING,
              stop: List[str] = []
              ):  # -> Tuple[str, List[List[str]]]:
        if streaming:
            
            for inum, (stream_resp, _) in enumerate(self.model_bf16.stream_chat(
                    self.tokenizer,
                    prompt,
                    history=history[-self.history_len:-1] if self.history_len > 0 else [],
                    max_length=self.max_token,
                    temperature=self.temperature,
                    top_p=self.top_p,
            )):
                if inum == 0:
                    history += [[prompt, stream_resp]]
                else:
                    history[-1] = [prompt, stream_resp]
                yield stream_resp, history
        else:
            response, _ = self.model_bf16.chat(
                self.tokenizer,
                prompt,
                history=history[-self.history_len:] if self.history_len > 0 else [],
                max_length=self.max_token,
                temperature=self.temperature,
                top_p=self.top_p,
            )
            history += [[prompt, response]]
            yield response, history

    def load_model(self,
                   model_name_or_path: str = "THUDM/chatglm-6b-int4",
                   llm_device=DEVICE,
                   **kwargs):
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path,
            trust_remote_code=True
        )
        model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_name_or_path, config=model_config, trust_remote_code=True,
                                              **kwargs)
        self.model = self.model.float().to(llm_device)
        self.model = self.model.eval()
        self.model_bf16 = ipex.optimize(
                self.model,
                dtype=torch.bfloat16,
                graph_mode=True,
                auto_kernel_selection=True,
                inplace=True,
                replace_dropout_with_identity=True)
        self.model_bf16 = self.model_bf16.eval()

from gptcache import Cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.embedding import Huggingface
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


huggingface = Huggingface()  # NOTE: the embedding model was not shown in the original snippet; for Chinese prompts pick a Chinese model (see #317)
llm_cache = Cache()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)

llm = ChatGLM()
llm.load_model(model_name_or_path="THUDM/chatglm-6b-int4", llm_device='cpu')
cached_llm = LangChainLLMs(llm=llm)

answer = cached_llm(prompt="你好", cache_obj=llm_cache)

I used ChatGLM to generate chat responses and need to cache them, but got an error:

 /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:60 in __call__                                                                          │
│                                                                                                  │
│    57 │   │   )                                                                                  │
│    58 │                                                                                          │
│    59 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:    │
│ ❱  60 │   │   return self._call(prompt=prompt, stop=stop, **kwargs)                              │
│    61                                                                                            │
│    62                                                                                            │
│    63 # pylint: disable=protected-access                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:49 in _call                                                                             │
│                                                                                                  │
│    46 │                                                                                          │
│    47 │   def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:       │
│    48 │   │   session = self.session if "session" not in kwargs else kwargs.pop("session")       │
│ ❱  49 │   │   return adapt(                                                                      │
│    50 │   │   │   self.llm,                                                                      │
│    51 │   │   │   cache_data_convert,                                                            │
│    52 │   │   │   update_cache_callback,                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/adapter.py: │
│ 142 in adapt                                                                                     │
│                                                                                                  │
│   139 │   │   │   llm_handler, cache_data_convert, update_cache_callback, *args, **kwargs        │
│   140 │   │   )                                                                                  │
│   141 │   else:                                                                                  │
│ ❱ 142 │   │   llm_data = llm_handler(*args, **kwargs)                                            │
│   143 │                                                                                          │
│   144 │   if cache_enable:                                                                       │
│   145 │   │   try:                                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:246   │
│ in __call__                                                                                      │
│                                                                                                  │
│   243 │                                                                                          │
│   244 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str:              │
│   245 │   │   """Check Cache and run the LLM on the given prompt and input."""                   │
│ ❱ 246 │   │   return self.generate([prompt], stop=stop).generations[0][0].text                   │
│   247 │                                                                                          │
│   248 │   @property                                                                              │
│   249 │   def _identifying_params(self) -> Mapping[str, Any]:                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:140   │
│ in generate                                                                                      │
│                                                                                                  │
│   137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│ ❱ 140 │   │   │   │   raise e                                                                    │
│   141 │   │   │   self.callback_manager.on_llm_end(output, verbose=self.verbose)                 │
│   142 │   │   │   return output                                                                  │
│   143 │   │   params = self.dict()                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:137   │
│ in generate                                                                                      │
│                                                                                                  │
│   134 │   │   │   │   {"name": self.__class__.__name__}, prompts, verbose=self.verbose           │
│   135 │   │   │   )                                                                              │
│   136 │   │   │   try:                                                                           │
│ ❱ 137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│   140 │   │   │   │   raise e                                                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:325   │
│ in _generate                                                                                     │
│                                                                                                  │
│   322 │   │   generations = []                                                                   │
│   323 │   │   for prompt in prompts:                                                             │
│   324 │   │   │   text = self._call(prompt, stop=stop)                                           │
│ ❱ 325 │   │   │   generations.append([Generation(text=text)])                                    │
│   326 │   │   return LLMResult(generations=generations)                                          │
│   327 │                                                                                          │
│   328 │   async def _agenerate(                                                                  │
│                                                                                                  │
│ /home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py:341 in                                 │
│ pydantic.main.BaseModel.__init__                                                                 │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 1 validation error for Generation

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 From the error stack, can you try to run the llm directly, like:

llm(prompt="你好")

because I guess the error is caused by empty text in generations.append([Generation(text=text)]) in the last frame of the error trace.

Zjq9409 (Author) commented May 17, 2023

It still reports the same problem.

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 If the same problem occurs when you run the code:

llm(prompt="你好")

then it looks like it is not caused by GPTCache but by the LLM model itself.
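For what it is worth, my own guess (not confirmed in this thread): LangChain expects LLM._call to return a plain str, while the ChatGLM._call above is written as a generator (it uses yield), so Generation(text=...) would receive a generator object and pydantic would reject it. A minimal non-streaming sketch of a _call that slots into the ChatGLM class above (same attributes as in the original snippet):

    def _call(self, prompt: str, stop=None) -> str:
        # return the answer text itself instead of yielding (response, history)
        response, _ = self.model_bf16.chat(
            self.tokenizer,
            prompt,
            history=[],
            max_length=self.max_token,
            temperature=self.temperature,
            top_p=self.top_p,
        )
        return response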

Zjq9409 (Author) commented May 17, 2023

Actually, the usage method is llm._call("你好"), but I need to combine it with LangChainLLMs to use the cache, right?

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 Yep.

SimFG (Collaborator) commented May 17, 2023

When the cache is empty, it will call the original LLM model to get the answer, and then the answer will be saved to the cache. The next time you make a similar request, you will get the answer from the cache.

SimFG (Collaborator) commented May 22, 2023

@Zjq9409 Is your problem solved? If you want to use a Hugging Face transformers LLM model, you can use the GPTCache API. If you encounter other problems, you can open a new issue.

SimFG closed this as completed May 22, 2023
iunique commented May 26, 2023

I also encountered this problem. How can I solve it?

SimFG (Collaborator) commented May 26, 2023

@iunique hi, you can open a new issue and describe your problem.
