RAG PDF Agent with BitNet #269
Replies: 1 comment
-
I got a solution for this problem: llama.cpp (and therefore llama-cpp-python) does not support the custom I2_S quantization, so I replaced it with the BitNet-native binary. BitNet's own C++ entry point (llama-cli, built from the BitNet repo) understands I2_S and the custom IQ4_NL layout. Copy or symlink that binary into your RAG app, then replace the llama-cpp-python calls with a subprocess helper (see the sketch below). This path preserves the full 1-bit size and speed improvements without waiting on upstream llama.cpp support. Like if this helps you 🙏
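A minimal sketch of that helper, assuming the binary was copied to bin/bitnet-cli and that your build accepts the standard llama.cpp CLI flags (-m, -p, -n, -t); adjust paths and flags to your setup:

```python
import os
import subprocess

# Assumed locations; adjust to wherever you copied/symlinked the BitNet
# llama-cli binary and wherever your GGUF model lives.
CLI_PATH = os.path.abspath("bin/bitnet-cli")
MODEL_PATH = os.path.abspath("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")

def generate_with_bitnet(prompt: str, max_tokens: int = 512, threads: int = 4) -> str:
    """Run one generation through the BitNet CLI and return its stdout."""
    result = subprocess.run(
        [
            CLI_PATH,
            "-m", MODEL_PATH,       # the I2_S GGUF that llama-cpp-python rejects
            "-p", prompt,           # prompt text
            "-n", str(max_tokens),  # tokens to generate
            "-t", str(threads),     # CPU threads
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # llama-cli typically echoes the prompt before the completion, so you may
    # need to strip the prompt prefix from the returned text.
    return result.stdout.strip()
```

Note that each call spawns a fresh process and reloads the model, so you lose llama-cpp-python's in-process token streaming; in a Streamlit app you would render the returned string in one shot rather than token by token.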
-
In my Python-based RAG pipeline I am using the BitNet b1.58-2B-4T model for generation, and I am getting the error below:
```
Loading model from: /home/rag_pdf/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
Exists? True
ValueError: Failed to load model from file: /home/rag_pdf/models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf
```
Related Hugging Face discussion: “gguf not llama.cpp compatible yet”. Here is my code:
```python
import os
import streamlit as st
import asyncio
import time
# App configuration: must be first Streamlit command
st.set_page_config(page_title="PDF Chat Expert", layout="wide")
# Fix for Windows event loop issue
if os.name == 'nt':
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
# PDF processing dependencies
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from llama_cpp import Llama
# Initialize session state
st.session_state.setdefault('chat_history', [])
st.session_state.setdefault('vector_store', None)
st.session_state.setdefault('all_docs', [])
# Model configuration
# Path to your BitNet GGUF model
MODEL_PATH = os.path.abspath("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")
print("Loading model from:", MODEL_PATH)
print("Exists?", os.path.exists(MODEL_PATH))
llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,
    n_threads=4,
)
# Load BitNet model via llama-cpp-python
if 'llm' not in st.session_state:
    with st.spinner("Loading BitNet on CPU..."):
        st.session_state.llm = Llama(
            model_path=MODEL_PATH,
            n_ctx=4096,    # match your model's context length
            n_threads=2,   # parallel threads for inference
            verbose=True,
        )  # streams directly in-process
# App title
st.title("🤖 PDF Chat Expert")
st.write("Upload and initialize your PDF knowledge base, then ask expert-level questions.")
# Initialize PDF database (developer-triggered)
if st.button("Initialize PDF Database"):
    with st.spinner("Processing PDFs and creating vector store..."):
        def load_and_clean_docs():
            pdf_paths = [os.path.join(root, f)
                         for root, _, files in os.walk('rag-dataset')
                         for f in files if f.lower().endswith('.pdf')]
            if not pdf_paths:
                st.error("No PDF files found in 'rag-dataset' directory!")
                return []
            # Load every page of every PDF as a document
            return [doc for p in pdf_paths for doc in PyMuPDFLoader(p).load()]

        st.session_state.all_docs = load_and_clean_docs()
        if st.session_state.all_docs:
            # Chunk the pages and index them in FAISS
            chunks = RecursiveCharacterTextSplitter(
                chunk_size=1000, chunk_overlap=100).split_documents(st.session_state.all_docs)
            st.session_state.vector_store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
# User question input and submit button
user_input = st.text_area("Your question:", height=150)
submit = st.button("Submit")
# Query handler: retrieval + streaming response
def handle_query(question: str) -> str:
    if not st.session_state.vector_store:
        st.error("Please initialize the PDF database first.")
        return ""
    # Retrieve the most relevant chunks and build the prompt
    docs = st.session_state.vector_store.similarity_search(question, k=3)
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    placeholder = st.empty()
    output = ""
    # Streaming generation with BitNet
    for resp in st.session_state.llm(
        prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.4,
        top_k=40,
        stream=True,  # enable token-by-token streaming
    ):
        token = resp['choices'][0]['text']
        output += token
        placeholder.markdown(output)
    return output
# Trigger on submit
if user_input and submit:
    with st.spinner("Generating answer..."):
        start = time.time()
        ans = handle_query(user_input)
        elapsed = time.time() - start
        st.session_state.chat_history.append((user_input, ans, elapsed))
# Display chat history
st.divider()
st.subheader("Chat History")
for q, a, t in reversed(st.session_state.chat_history):
    with st.expander(f"Q: {q}", expanded=True):
        st.markdown(f"Answer:\n{a}")
        st.markdown(f"Response Time: {t:.2f} seconds")
        st.divider()
```