RAG PDF Agent with BitNet #269
Replies: 1 comment
-
I got a solution for this problem: llama.cpp (and therefore llama-cpp-python) does not support the custom I2_S quantization, so I replaced it with the BitNet-native binary. BitNet's own C++ entry point (llama-cli, built from the BitNet repo) understands I2_S and the custom IQ4_NL layout. Copy or symlink that binary into your RAG app, then replace the llama-cpp-python calls with a subprocess helper (see the sketch below). This path preserves the full 1-bit size and speed improvements without waiting on upstream llama.cpp support. Like if this helps you 🙏
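A minimal sketch of that helper, assuming the binary was copied to bin/bitnet-cli and that your build accepts the standard llama.cpp CLI flags (-m, -p, -n, -t); adjust paths and flags to your setup:

```python
import os
import subprocess

# Assumed locations; adjust to wherever you copied/symlinked the BitNet
# llama-cli binary and wherever your GGUF model lives.
CLI_PATH = os.path.abspath("bin/bitnet-cli")
MODEL_PATH = os.path.abspath("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")

def generate_with_bitnet(prompt: str, max_tokens: int = 512, threads: int = 4) -> str:
    """Run one generation through the BitNet CLI and return its stdout."""
    result = subprocess.run(
        [
            CLI_PATH,
            "-m", MODEL_PATH,       # the I2_S GGUF that llama-cpp-python rejects
            "-p", prompt,           # prompt text
            "-n", str(max_tokens),  # tokens to generate
            "-t", str(threads),     # CPU threads
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # llama-cli typically echoes the prompt before the completion, so you may
    # need to strip the prompt prefix from the returned text.
    return result.stdout.strip()
```

Note that each call spawns a fresh process and reloads the model, so you lose llama-cpp-python's in-process token streaming; in a Streamlit app you would render the returned string in one shot rather than token by token.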
-
In my Python-based RAG pipeline I am using the BitNet b1.58-2B-4T model for generation, and I am getting the error below:
```
Loading model from: /home/rag_pdf/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
Exists? True
ValueError: Failed to load model from file: /home/rag_pdf/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
```
Related Hugging Face discussion: “gguf not llama.cpp compatible yet”. Here is my code:
```python
import os
import streamlit as st
import asyncio
import time
# App configuration: must be first Streamlit command
st.set_page_config(page_title="PDF Chat Expert", layout="wide")
# Fix for Windows event loop issue
if os.name == 'nt':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
# PDF processing dependencies
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from llama_cpp import Llama
# Initialize session state
st.session_state.setdefault('chat_history', [])
st.session_state.setdefault('vector_store', None)
st.session_state.setdefault('all_docs', [])
# Model configuration
# Path to your BitNet GGUF model
MODEL_PATH = os.path.abspath("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")
print("Loading model from:", MODEL_PATH)
print("Exists?", os.path.exists(MODEL_PATH))
llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=4,
)
# Load BitNet model via llama-cpp-python
if 'llm' not in st.session_state:
    with st.spinner("Loading BitNet on CPU..."):
        st.session_state.llm = Llama(
            model_path=MODEL_PATH,
            n_ctx=4096,    # match your model's context length
            n_threads=2,   # parallel threads for inference
            verbose=True,
        )  # streams directly in-process
# App title
st.title("🤖 PDF Chat Expert")
st.write("Upload and initialize your PDF knowledge base, then ask expert-level questions.")
# Initialize PDF database (developer-triggered)
if st.button("Initialize PDF Database"):
    with st.spinner("Processing PDFs and creating vector store..."):
        def load_and_clean_docs():
            pdf_paths = [os.path.join(root, f)
                         for root, _, files in os.walk('rag-dataset')
                         for f in files if f.lower().endswith('.pdf')]
            if not pdf_paths:
                st.error("No PDF files found in 'rag-dataset' directory!")
                return []
            # Load every page of every PDF as a document
            return [doc for p in pdf_paths for doc in PyMuPDFLoader(p).load()]

        st.session_state.all_docs = load_and_clean_docs()
        if st.session_state.all_docs:
            # Chunk the pages and index them in FAISS
            chunks = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=100).split_documents(st.session_state.all_docs)
            st.session_state.vector_store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
# User question input and submit button
user_input = st.text_area("Your question:", height=150)
submit = st.button("Submit")
# Query handler: retrieval + streaming response
def handle_query(question: str) -> str:
    if not st.session_state.vector_store:
        st.error("Please initialize the PDF database first.")
        return ""
    # Retrieve the most relevant chunks and build the prompt
    docs = st.session_state.vector_store.similarity_search(question, k=3)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    placeholder = st.empty()
    output = ""
    # Streaming generation with BitNet
    for resp in st.session_state.llm(
        prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.4,
        top_k=40,
        stream=True,  # enable token-by-token streaming
    ):
        token = resp['choices'][0]['text']
        output += token
        placeholder.markdown(output)
    return output
# Trigger on submit
if user_input and submit:
    with st.spinner("Generating answer..."):
        start = time.time()
        ans = handle_query(user_input)
        elapsed = time.time() - start
        st.session_state.chat_history.append((user_input, ans, elapsed))
# Display chat history
st.divider()
st.subheader("Chat History")
for q, a, t in reversed(st.session_state.chat_history):
    with st.expander(f"Q: {q}", expanded=True):
        st.markdown(f"Answer:\n{a}")
        st.markdown(f"Response Time: {t:.2f} seconds")
        st.divider()
```