[Feature]: Support huggingface transformers LLM model #335

Closed
Zjq9409 opened this issue May 11, 2023 · 18 comments

Zjq9409 commented May 11, 2023

Is your feature request related to a problem? Please describe.

Can Hugging Face LLM model chat caching be supported?

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

SimFG (Collaborator) commented May 11, 2023

You can try to use the GPTCache API; a simple example:

from gptcache.adapter.api import put, get, init_similar_cache

init_similar_cache()
put("hello", "foo")
print(get("hello"))
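As a side note (my own hedged sketch, not from the thread): since init_similar_cache builds a similarity cache, a query that is merely close to the stored one should also hit, depending on the similarity threshold:

from gptcache.adapter.api import put, get, init_similar_cache

init_similar_cache()
put("hello", "foo")
print(get("hello"))         # exact hit, prints "foo"
print(get("hello, world"))  # may also hit via the similarity search, depending on the threshold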

Zjq9409 (Author) commented May 11, 2023

Thank you for your prompt response. Does calling the cache through LangChainLLMs mean wrapping the Hugging Face model? From reading the documentation, GPTCache does not seem to support Hugging Face Hub.


SimFG (Collaborator) commented May 11, 2023

Yes, you can use LangChainLLMs, like:

from gptcache.adapter.api import init_similar_cache
from gptcache.adapter.langchain_models import LangChainLLMs
from langchain.llms import OpenAI

init_similar_cache()  # initialize the global GPTCache cache before the first call

question = "What is GitHub?"  # any prompt
langchain_openai = OpenAI(model_name="text-ada-001")
llm = LangChainLLMs(llm=langchain_openai)
answer = llm(prompt=question)

If you use LangChain, you can also use it like this:

import langchain
from langchain.cache import GPTCache
from gptcache.adapter.api import init_similar_cache

langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))
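For a Hugging Face transformers model specifically, here is a hedged sketch of my own (not an official example; it assumes the google/flan-t5-base model and langchain's HuggingFacePipeline wrapper):

import langchain
from langchain.cache import GPTCache
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
from gptcache.adapter.api import init_similar_cache

# route every LangChain LLM call through GPTCache
langchain.llm_cache = GPTCache(init_func=lambda cache: init_similar_cache(cache_obj=cache))

# wrap a local transformers pipeline as a LangChain LLM
pipe = pipeline("text2text-generation", model="google/flan-t5-base", max_length=100)
local_llm = HuggingFacePipeline(pipeline=pipe)

print(local_llm("What is the capital of France?"))  # first call runs the model
print(local_llm("What is the capital of France?"))  # second call should be served from the cache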

SimFG (Collaborator) commented May 12, 2023

@Zjq9409 If there are no other questions, I will close the issue.

Zjq9409 (Author) commented May 12, 2023

import os
import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

from gptcache import Cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.embedding import Onnx
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

os.environ['HUGGINGFACEHUB_API_TOKEN'] = ''
model_id = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)
llm_cache = Cache()
onnx = Onnx()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
question = "上海有什么好吃的"  # "What's good to eat in Shanghai?"
before = time.time()
cached_llm = LangChainLLMs(llm=local_llm)
answer = cached_llm(prompt=question, cache_obj=llm_cache)
print(answer)
print("Read through Time Spent =", time.time() - before)

questions = [
    '上海有哪些好吃的地方',  # where are good places to eat in Shanghai
    '上海有哪些好吃的美食',  # what tasty foods does Shanghai have
    '上海的美食有什么',      # what are Shanghai's delicacies
    '上海有什么好玩的地方',  # what fun places are there in Shanghai
    '怎么还花呗'             # how do I repay Huabei (an unrelated question)
]
for question in questions:
    before = time.time()
    answer = cached_llm(prompt=question, cache_obj=llm_cache)
    print(f'Question: {question}')
    print(answer)
    print("Cache Hit Time Spent =", time.time() - before)

Output:
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Read through Time Spent = 0.35816311836242676
Question: 上海有哪些好吃的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14227008819580078
Question: 上海有哪些好吃的美食
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13515877723693848
Question: 上海的美食有什么
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.13379120826721191
Question: 上海有什么好玩的地方
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14519858360290527
Question: 怎么还花呗
上海有什么好吃的 上海是一个美食之都,有很多好吃的选择。以下是一些可以尝试的:

1. 小笼包:小笼包是上海最有名的美食之一。可以去著名的“朱家角小笼包”品尝。

2. 生煎包:生煎包是上海另一种著名的美食。可以去“沈大成”或者“老正兴”品尝。

3. 红烧肉:红烧肉是上海传统的名菜之一。可以去“红烧肉大师”
Cache Hit Time Spent = 0.14341497421264648

Why does the last question hit the cache?

SimFG (Collaborator) commented May 12, 2023

Because onnx.to_embeddings uses an English embedding model; you need a Chinese embedding model. Reference: #317
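For example, a hedged sketch of initializing the cache with a Chinese Hugging Face embedding model (the model name here is only an illustration, pick whatever fits your data):

from gptcache import Cache
from gptcache.embedding import Huggingface
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# a Chinese sentence-embedding model instead of the default English ONNX one
huggingface = Huggingface(model="uer/albert-base-chinese-cluecorpussmall")

llm_cache = Cache()
cache_base = CacheBase("sqlite")
vector_base = VectorBase("faiss", dimension=huggingface.dimension)
data_manager = get_data_manager(cache_base, vector_base)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)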

Zjq9409 (Author) commented May 13, 2023

Does it support huggingface conversation caching?

SimFG (Collaborator) commented May 15, 2023

Regarding the conversation case, it depends on how you use the LLM. If you can give all the conversation info to GPTCache, it will work fine, like the messages of OpenAI's chat completion, which provide the full conversation info.
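As an illustration (my own hedged sketch; the helper name is made up), a custom pre_embedding_func could flatten the whole conversation into the string that gets embedded, so the cache key reflects the full dialogue rather than only the last turn:

def all_messages_as_prompt(data, **_):
    # join every message of an OpenAI-style "messages" list into one string
    messages = data.get("messages", [])
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)

# then pass it when initializing the cache:
# llm_cache.init(pre_embedding_func=all_messages_as_prompt, ...)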

Zjq9409 (Author) commented May 17, 2023

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms.base import LLM
from langchain.document_loaders import UnstructuredFileLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from transformers import AutoTokenizer, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer
import torch
import os
from typing import List
import re
from tqdm import tqdm
import datetime
import numpy as np
import intel_extension_for_pytorch as ipex

EMBEDDING_MODEL = "text2vec"  # embedding model, key into embedding_model_dict
VECTOR_SEARCH_TOP_K = 6
LLM_MODEL = "chatglm-6b"      # LLM model name, key into llm_model_dict
LLM_HISTORY_LEN = 3
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
STREAMING = False
SENTENCE_SIZE = 100
CHUNK_SIZE = 250
embeddings = None
embedding_model_dict = {
    "text2vec": "/home/intel/zjq/prompt_test/text2vec-large-chinese/",
}

llm_model_dict = {
    "chatglm-6b-int4-qe": "THUDM/chatglm-6b-int4-qe",
    "chatglm-6b-int4": "THUDM/chatglm-6b-int4",
    "chatglm-6b": "/home/intel/zjq/chatglm",
}
VS_ROOT_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "vector_store")
class ChatGLM(LLM):
    max_token: int = 10000
    temperature: float = 0.8
    top_p = 0.9
    tokenizer: object = None
    model: object = None
    model_bf16 : object = None
    history_len: int = 10

    def __init__(self):
        super().__init__()

    @property
    def _llm_type(self) -> str:
        return "ChatGLM"

    def _call(self,
              prompt: str,
              history: List[List[str]] = [],
              streaming: bool = STREAMING,
              stop: List[str] = []
              ):  # -> Tuple[str, List[List[str]]]:
        if streaming:
            
            for inum, (stream_resp, _) in enumerate(self.model_bf16.stream_chat(
                    self.tokenizer,
                    prompt,
                    history=history[-self.history_len:-1] if self.history_len > 0 else [],
                    max_length=self.max_token,
                    temperature=self.temperature,
                    top_p=self.top_p,
            )):
                if inum == 0:
                    history += [[prompt, stream_resp]]
                else:
                    history[-1] = [prompt, stream_resp]
                yield stream_resp, history
        else:
            response, _ = self.model_bf16.chat(
                self.tokenizer,
                prompt,
                history=history[-self.history_len:] if self.history_len > 0 else [],
                max_length=self.max_token,
                temperature=self.temperature,
                top_p=self.top_p,
            )
            history += [[prompt, response]]
            yield response, history

    def load_model(self,
                   model_name_or_path: str = "THUDM/chatglm-6b-int4",
                   llm_device=DEVICE,
                   **kwargs):
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path,
            trust_remote_code=True
        )
        model_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_name_or_path, config=model_config, trust_remote_code=True,
                                              **kwargs)
        self.model = self.model.float().to(llm_device)
        self.model = self.model.eval()
        self.model_bf16 = ipex.optimize(
                self.model,
                dtype=torch.bfloat16,
                graph_mode=True,
                auto_kernel_selection=True,
                inplace=True,
                replace_dropout_with_identity=True)
        self.model_bf16 = self.model_bf16.eval()

from gptcache import Cache
from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.embedding import Huggingface
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache.processor.pre import get_prompt
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


huggingface = Huggingface()  # NOTE: the embedding model was not shown in the original snippet; for Chinese prompts pick a Chinese model (see #317)
llm_cache = Cache()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('faiss', dimension=huggingface.dimension)
data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
llm_cache.init(
    pre_embedding_func=get_prompt,
    embedding_func=huggingface.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)

llm = ChatGLM()
llm.load_model(model_name_or_path="THUDM/chatglm-6b-int4", llm_device='cpu')
cached_llm = LangChainLLMs(llm=llm)

answer = cached_llm(prompt="你好", cache_obj=llm_cache)

I used ChatGLM to generate chat responses and need to cache them, but got an error:

 /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:60 in __call__                                                                          │
│                                                                                                  │
│    57 │   │   )                                                                                  │
│    58 │                                                                                          │
│    59 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:    │
│ ❱  60 │   │   return self._call(prompt=prompt, stop=stop, **kwargs)                              │
│    61                                                                                            │
│    62                                                                                            │
│    63 # pylint: disable=protected-access                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/langchain_m │
│ odels.py:49 in _call                                                                             │
│                                                                                                  │
│    46 │                                                                                          │
│    47 │   def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:       │
│    48 │   │   session = self.session if "session" not in kwargs else kwargs.pop("session")       │
│ ❱  49 │   │   return adapt(                                                                      │
│    50 │   │   │   self.llm,                                                                      │
│    51 │   │   │   cache_data_convert,                                                            │
│    52 │   │   │   update_cache_callback,                                                         │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/gptcache/adapter/adapter.py: │
│ 142 in adapt                                                                                     │
│                                                                                                  │
│   139 │   │   │   llm_handler, cache_data_convert, update_cache_callback, *args, **kwargs        │
│   140 │   │   )                                                                                  │
│   141 │   else:                                                                                  │
│ ❱ 142 │   │   llm_data = llm_handler(*args, **kwargs)                                            │
│   143 │                                                                                          │
│   144 │   if cache_enable:                                                                       │
│   145 │   │   try:                                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:246   │
│ in __call__                                                                                      │
│                                                                                                  │
│   243 │                                                                                          │
│   244 │   def __call__(self, prompt: str, stop: Optional[List[str]] = None) -> str:              │
│   245 │   │   """Check Cache and run the LLM on the given prompt and input."""                   │
│ ❱ 246 │   │   return self.generate([prompt], stop=stop).generations[0][0].text                   │
│   247 │                                                                                          │
│   248 │   @property                                                                              │
│   249 │   def _identifying_params(self) -> Mapping[str, Any]:                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:140   │
│ in generate                                                                                      │
│                                                                                                  │
│   137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│ ❱ 140 │   │   │   │   raise e                                                                    │
│   141 │   │   │   self.callback_manager.on_llm_end(output, verbose=self.verbose)                 │
│   142 │   │   │   return output                                                                  │
│   143 │   │   params = self.dict()                                                               │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:137   │
│ in generate                                                                                      │
│                                                                                                  │
│   134 │   │   │   │   {"name": self.__class__.__name__}, prompts, verbose=self.verbose           │
│   135 │   │   │   )                                                                              │
│   136 │   │   │   try:                                                                           │
│ ❱ 137 │   │   │   │   output = self._generate(prompts, stop=stop)                                │
│   138 │   │   │   except (KeyboardInterrupt, Exception) as e:                                    │
│   139 │   │   │   │   self.callback_manager.on_llm_error(e, verbose=self.verbose)                │
│   140 │   │   │   │   raise e                                                                    │
│                                                                                                  │
│ /home/intel/zjq/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/llms/base.py:325   │
│ in _generate                                                                                     │
│                                                                                                  │
│   322 │   │   generations = []                                                                   │
│   323 │   │   for prompt in prompts:                                                             │
│   324 │   │   │   text = self._call(prompt, stop=stop)                                           │
│ ❱ 325 │   │   │   generations.append([Generation(text=text)])                                    │
│   326 │   │   return LLMResult(generations=generations)                                          │
│   327 │                                                                                          │
│   328 │   async def _agenerate(                                                                  │
│                                                                                                  │
│ /home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py:341 in                                 │
│ pydantic.main.BaseModel.__init__                                                                 │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/home/intel/zjq/ChatGLM-6B/prompt_engine/pydantic/main.py' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 1 validation error for Generation

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 From the error stack, can you try to run the llm directly, like:

llm(prompt="你好")

because I guess the error is caused by empty text in generations.append([Generation(text=text)]) in the last frame of the error trace.

Zjq9409 (Author) commented May 17, 2023

It still reports the same problem.

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 If the same problem occurs when you run the code:

llm(prompt="你好")

then it looks like it is not caused by GPTCache but by the LLM model itself.
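For what it is worth, my own guess (not confirmed in this thread): LangChain expects LLM._call to return a plain str, while the ChatGLM._call above is written as a generator (it uses yield), so Generation(text=...) would receive a generator object and pydantic would reject it. A minimal non-streaming sketch of a _call that slots into the ChatGLM class above (same attributes as in the original snippet):

    def _call(self, prompt: str, stop=None) -> str:
        # return the answer text itself instead of yielding (response, history)
        response, _ = self.model_bf16.chat(
            self.tokenizer,
            prompt,
            history=[],
            max_length=self.max_token,
            temperature=self.temperature,
            top_p=self.top_p,
        )
        return response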

Zjq9409 (Author) commented May 17, 2023

Actually, the usage method is llm._call("你好"), but I need to combine it with LangChainLLMs to use the cache, right?

SimFG (Collaborator) commented May 17, 2023

@Zjq9409 Yep.

SimFG (Collaborator) commented May 17, 2023

When the cache is empty, it will call the original LLM model to get the answer, and then the answer will be saved to the cache. The next time you make a similar request, you will get the answer from the cache.

SimFG (Collaborator) commented May 22, 2023

@Zjq9409 Is your problem solved? If you want to use a Hugging Face transformers LLM model, you can use the GPTCache API. If you encounter other problems, you can open a new issue.

SimFG closed this as completed May 22, 2023
iunique commented May 26, 2023

I also encountered this problem. How can I solve it?

SimFG (Collaborator) commented May 26, 2023

@iunique hi, you can open a new issue and describe your problem.
