A guide for building a basic chatbot

This guide shows one example of how to make a chatbot which can answer questions about PDF files.

In Part 1, we will build a command-line bot application, and in Part 2 we add a UI to the bot.

But first, here is an overview of the components needed to build such a bot.

Building blocks

LLM

Large Language Model. For this guide, we'll use GPT-4o via OpenAI APIs.

In addition to GPTs, there are many other models, such as PaLM, Claude, Llama, Falcon, Cohere and Mistral.

Vector database

A database that stores vectors. Data is converted into a vector space, where similar items get clustered closer to each other. When searching for data, the search term is converted to the same vector space, and results are then fetched using a similarity metric such as cosine similarity.

We'll use ChromaDB, which is an open source vector database. It uses SQLite as the default backend, which is nice and easy for simple prototyping.

Here's a few other vector databases:

Pinecone. Free account allows one vector index
OpenSearch
pgvector, a vector db extension for Postgres
Milvus
Qdrant
Marqo

Embedding model

Embedding model is a neural network whose job is to get data as input and output numerical vectors.

ChromaDB ships with a default embedding model, let's use that for simplicity. Basically one could use any embedding model to vectorize and then search for data, as long as you use the same model for data and the search query.

More about embeddings:

Data processing

Preparing data to be stored in a vector database is a large topic and its complexity depends on things like how complex/varying the data is (e.g. a simple text file vs images containing text on a PDF) and the intended use case.

The basic idea is to chop data into short pieces that are then converted into vectors. The data can be anything, such a website, txt file, csv, pdf, transcribed audio, output from an LLM, and so on.

For this guide we'll go with the document loader api of Langchain and PyPdfLoader, which reads text from PDFs.

More about data processing:

https://www.pinecone.io/learn/chunking-strategies/
For a more complex set of data processing tools, there are libraries like Unstructured, which also has a Langchain integration.

Chat user interface

A chat UI can naturally be built with any tech, but instead of focusing on React hooks or Vue templates, we'll take an off-the-shelf solution and use a library called Chainlit.

Another library for LLM/data focused UI creation is Streamlit.

Middleware / LLM orchestration

LLM orhcestration libraries wrap the above concepts (data processing, vector databases, LLMs) into a package that's easier to use and build pipelines where one can easily switch the different components such as LLMs and vector databases. They also implement more complex use cases such as different kind of agents.

While an orchestration library is very useful in practice, it's not mandatory. For the sake of learning, we'll go without a library in this guide.

Here's a few libraries for LLM orchestration:

Let's begin

This guide was made with Python 3.11 and a Mac. In addition to Python and a shell, you'll be needing an OpenAI API key.

Setup

First, create a Python virtual env and activate it:

python -m venv .venv
source .venv/bin/activate

(to get out of the env, call deactivate)

Then install the dependencies:

pip install -r requirements.txt

(requirements.txt is basically the contents from this)
pip install chromadb openai langchain pypdf chainlit

Part 1: Building a command-line version

Head on to PART1_CMDLINE.md for building the first part.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.chainlit		.chainlit
img		img
.gitignore		.gitignore
PART1_CMDLINE.md		PART1_CMDLINE.md
PART2_UI.md		PART2_UI.md
README.md		README.md
chainlit.md		chainlit.md
cmd_example.py		cmd_example.py
luoto.pdf		luoto.pdf
requirements.txt		requirements.txt
ui_example.py		ui_example.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A guide for building a basic chatbot

Building blocks

LLM

Vector database

Embedding model

Data processing

Chat user interface

Middleware / LLM orchestration

Let's begin

Setup

Part 1: Building a command-line version

About

Releases

Packages

Languages

LuotoCompany/basic-bot-tutorial

Folders and files

Latest commit

History

Repository files navigation

A guide for building a basic chatbot

Building blocks

LLM

Vector database

Embedding model

Data processing

Chat user interface

Middleware / LLM orchestration

Let's begin

Setup

Part 1: Building a command-line version

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages