Commit v1.0.1
Boomm-shakalaka committed Jun 5, 2024
1 parent 4a8f6c1 commit 94acecc
Showing 14 changed files with 242 additions and 55 deletions.
31 changes: 20 additions & 11 deletions Dockerfile
@@ -1,25 +1,34 @@
# Set the base image; Python 3.8 is used here
FROM python:3.8.19
# Use Ubuntu 22.04 as the base image
FROM ubuntu:22.04

# Set the working directory
WORKDIR /app

# Copy the project files into the container's working directory
COPY . /app

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Install system dependencies
RUN apt-get update && \
    apt-get install -y libgl1-mesa-glx libpython3-dev

# Install Node.js and npm
RUN curl -fsSL https://deb.nodesource.com/setup_14.x | bash - \
    && apt-get install -y nodejs \
    && rm -rf /var/lib/apt/lists/*
# Install Python 3.9
RUN apt-get install -y python3.9

# Install npm dependencies and Playwright browsers
RUN npm install && npx playwright install
# Install pip
RUN apt-get install -y python3-pip

# Install Python dependencies
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Playwright and its dependencies
RUN playwright install --with-deps chromium

# Expose the port
EXPOSE 8501

# Set an environment variable to specify the operating system
ENV OS_TYPE="linux"

# Run the Streamlit app
CMD ["streamlit", "run", "web_ui.py", "--server.port", "8501"]
CMD ["python3", "-m", "streamlit", "run", "web_ui.py", "--server.port", "8501"]

49 changes: 41 additions & 8 deletions README.md
@@ -1,4 +1,4 @@
### AI Bot Based on LLMs
# AI Bot Based on LLMs
An open-source AI language-model bot that integrates human-machine chat, information retrieval and generation, and conversational parsing of PDFs and URLs. Its key advantage is that it relies entirely on free, open APIs, delivering customized LLM functionality at minimal cost.

## Tools and Platforms
@@ -7,8 +7,6 @@ Langchain, Streamlit, Oracle Cloud, Groq, Google Cloud, Baidu Cloud, Docker
## Demo Link
[Link](http://168.138.28.54:8501)

## File Structure
<pre>
.
@@ -39,12 +37,12 @@
├── web_ui.py # main interface
</pre>


## Features

### Crawler Module


* The module provides three crawler methods: [Selenium](https://selenium-python.readthedocs.io/), [Playwright](https://playwright.dev/python/docs/intro), and [Langchain-based DuckDuckGo](https://api.python.langchain.com/en/latest/tools/langchain_community.tools.ddg_search.tool.DuckDuckGoSearchResults.html)

* Experiments show that Playwright takes about half as long as Selenium:
@@ -53,7 +51,8 @@
| Crawler | Time |
| --- | --- |
| selenium_url_crawler | 27s |
| playwright_url_crawler | 11s |

* Because Streamlit conflicts with Playwright's synchronous API, the asynchronous approach must be used, as shown in the sketch below. [Reference](https://discuss.streamlit.io/t/using-playwright-with-streamlit/28380/5)
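
A minimal sketch of that async pattern (not the project's exact code): it builds a platform-appropriate event loop explicitly rather than relying on `asyncio.run()`, and drives a headless Chromium crawl from a Streamlit widget. The `playwright_url_crawler` coroutine, the example URL, and the widget labels are illustrative assumptions.

```python
# Hedged sketch: run an async Playwright crawl from Streamlit with an explicit event loop.
# Assumes Playwright browsers are already installed (`playwright install chromium`).
import asyncio
import sys

import streamlit as st
from playwright.async_api import async_playwright


async def playwright_url_crawler(url: str) -> str:
    """Fetch the rendered text of a page with headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        text = await page.inner_text("body")
        await browser.close()
        return text


url = st.text_input("URL to crawl", "https://example.com")
if st.button("Crawl"):
    # Playwright needs a ProactorEventLoop on Windows for subprocess support;
    # a SelectorEventLoop is used on Linux.
    if sys.platform == "win32":
        loop = asyncio.ProactorEventLoop()
    else:
        loop = asyncio.SelectorEventLoop()
    st.write(loop.run_until_complete(playwright_url_crawler(url)))
```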


### Chat Module (Online and Offline)

@@ -92,6 +91,13 @@
3. Retrieve the top_k most relevant documents for the question.
4. Answer the question based on the retrieved documents (see the sketch below).
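
As a rough illustration of steps 3 and 4, the sketch below retrieves the top_k chunks from a vector store and answers from them. The Chroma store, the prompt wording, and the Groq model name are assumptions for the example, not the project's exact configuration.

```python
# Hedged sketch of retrieve-then-answer: step 3 fetches top_k chunks, step 4 answers from them.
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq


def answer_from_docs(vectorstore: Chroma, question: str, top_k: int = 4) -> str:
    # Step 3: retrieve the top_k most relevant chunks for the question.
    docs = vectorstore.similarity_search(question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step 4: answer the question using only the retrieved context.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    chain = prompt | ChatGroq(model_name="llama3-8b-8192") | StrOutputParser()
    return chain.invoke({"context": context, "question": question})
```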

### PDF Parsing Module
1. Built on the [Streamlit PDF display approach](https://discuss.streamlit.io/t/display-pdf-in-streamlit/62274) and [Langchain PDFMinerLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html)
2. Workflow (sketched in the code below):
   1. Upload a PDF
   2. Parse the PDF content; the LLM summarizes it based on a prompt
   3. Answer questions from the question together with the PDF content
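
A hedged sketch of that flow, using Streamlit's file uploader and Langchain's PDFMinerLoader; the prompt text and the Groq model name are placeholders rather than the project's actual configuration.

```python
# Hedged sketch: upload a PDF, parse it with PDFMinerLoader, then answer questions from its text.
import tempfile

import streamlit as st
from langchain_community.document_loaders import PDFMinerLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

uploaded = st.file_uploader("Upload a PDF", type="pdf")  # step 1: upload
if uploaded is not None:
    # Persist the upload to a temporary file so PDFMinerLoader can read it by path.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded.getvalue())
        pdf_path = tmp.name

    # Step 2: parse the PDF into plain text.
    pdf_text = "\n".join(doc.page_content for doc in PDFMinerLoader(pdf_path).load())

    # Step 3: answer from the question plus the PDF content.
    question = st.text_input("Ask a question about the PDF")
    if question:
        prompt = ChatPromptTemplate.from_template(
            "Answer the question using the PDF content below.\n\n"
            "PDF content:\n{pdf}\n\nQuestion: {question}"
        )
        chain = prompt | ChatGroq(model_name="llama3-8b-8192") | StrOutputParser()
        st.write(chain.invoke({"pdf": pdf_text, "question": question}))
```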

## Usage

### Local Deployment
@@ -125,11 +131,37 @@
streamlit run web_ui.py
```
### Server Deployment
1. [Docker link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)
2. Server deployment tutorial: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)
Method 1: install and run Docker locally in a Linux environment
* Pull the GitHub repository onto the server
* Build the image

Method 2: pull and run the image from Docker Hub
* [Docker Hub link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)

Deployment tutorial
* Server deployment tutorial: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)

## Known Issues When Building the Docker Image
1. Packaging langchain-google-genai fails; the root cause has not been found:
```bash
ERROR: Could not find a version that satisfies the requirement langchain-google-genai (from -r requirements.txt (line 11)) (from versions: none)
ERROR: No matching distribution found for langchain-google-genai (from -r requirements.txt (line 11))
```
2. The asyncio event-loop setup differs between Windows and Linux, so the code must branch on the operating system. [Reference](https://stackoverflow.com/questions/67964463/what-are-selectoreventloop-and-proactoreventloop-in-python-asyncio)
```python
if sys.platform == "win32":
    loop = asyncio.ProactorEventLoop()  # Windows
else:
    loop = asyncio.SelectorEventLoop()  # Linux
```
3. Playwright cannot be packaged directly into the Docker image; it needs an Ubuntu-based image environment. [Reference](https://stackoverflow.com/questions/72181737/issue-running-playwright-python-in-docker-container)
## Version History
v1.0.0 (oracle cloud)
v1.0.1 (oracle)
1. Fixed the Docker image build issues and handled the async event-loop differences across operating systems
v1.0.0
1. Improved the resume-evaluation feature in PDF chat and added follow-up conversation
2. Added the Playwright crawler module and improved async calls
3. Added crawler-module invocation and search-source selection to URL chat
@@ -139,6 +171,7 @@
7. Consolidated the prompt configuration
8. Polished the UI
9. Added an About page
10. Updated the Dockerfile
v0.0.5
1. Added Baidu Qianfan models (ERNIE-Lite-8K and ERNIE-Speed-128K, free to use)
Binary file modified __pycache__/crawler_modules.cpython-39.pyc
20 changes: 15 additions & 5 deletions crawler_modules.py
@@ -1,5 +1,6 @@
import asyncio
import re
import sys
from bs4 import BeautifulSoup
import requests
from langchain_core.documents import Document
@@ -169,10 +170,13 @@ async def playwright_crawler_async(url):

def selenium_url_crawler(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument("--headless")  # Run Chrome in headless mode
    options.add_argument("--no-sandbox")  # Bypass OS security model
    options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
    # options.add_argument('--window-size=1920x1080')

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    # driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # time.sleep(2)

@@ -241,16 +245,22 @@ def duckduck_search(question):
# print(data_playwright)

'''playwright_async'''
# loop = asyncio.ProactorEventLoop()
# if sys.platform == "win32":
# loop = asyncio.ProactorEventLoop()
# else:
# loop = asyncio.SelectorEventLoop()
# data_playwright_async = loop.run_until_complete(playwright_crawler_async('https://www.google.com/search?q=墨尔本天气'))
# print(data_playwright_async)

'''google_search_sync'''
# data_sync = google_search_sync(question)
# print(data_sync)

'''google_search_async'''
loop = asyncio.ProactorEventLoop()
'''google_search_async'''
if sys.platform == "win32":
    loop = asyncio.ProactorEventLoop()
else:
    loop = asyncio.SelectorEventLoop()
data_async = loop.run_until_complete(google_search_async(question))
print(data_async)

Expand Down
8 changes: 6 additions & 2 deletions requirements.txt
@@ -8,6 +8,10 @@ BeautifulSoup4
langchain_cohere
chromadb
duckduckgo-search
langchain-google-genai
qianfan
asyncio
webdriver-manager
# langchain-google-genai
pdfminer.six
selenium
selenium
playwright
Binary file modified web_pages/__pycache__/chat_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/online_chat_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/pdf_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/url_page.cpython-39.pyc
4 changes: 2 additions & 2 deletions web_pages/about_page.py
@@ -40,9 +40,9 @@ def about_page():
* [Bootstrap website](https://getbootstrap.com/)
Author: Boomm-shakalaka
Version: 1.0
Version: 1.1
GitHub project: [AIBot-LLM](https://github.com/Boomm-shakalaka/AIBot-LLM)
'Report bugs': [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
Report bugs: [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
"""
)

34 changes: 29 additions & 5 deletions web_pages/chat_page.py
@@ -6,23 +6,36 @@
from langchain_groq import ChatGroq
from langchain_community.chat_models import QianfanChatEndpoint
from config_setting import model_config,prompt_config
from langchain_google_genai import ChatGoogleGenerativeAI
# from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import find_dotenv, load_dotenv

class chatbot:
    def __init__(self):
        """
        Initialize the chatbot instance: load environment variables and set the model option and token count.
        """
        load_dotenv(find_dotenv())  # load environment variables
        self.model_option = None
        self.model_tokens = None
        self.llm = None

    def get_response(self, question, chat_history):
        """
        Get a response based on the user's question and the chat history.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str or generator: a generator object if streaming output is used, otherwise a string.
        """
        try:
            if self.model_option == 'ERNIE-Lite-8K' or self.model_option == 'ERNIE-speed-128k':  # Baidu Qianfan models
                self.llm = QianfanChatEndpoint(model=self.model_option)
            elif self.model_option == 'gemini-1.5-flash-latest':  # Google Gemini models; streaming output not supported, currently unused
                model_choice = random.choice(["gemini-1.5-flash-latest", 'gemini-1.0-pro-001', 'gemini-1.5-pro-latest', "gemini-1.0-pro"])
                self.llm = ChatGoogleGenerativeAI(model=model_choice, temperature=0.7)  # ChatGoogleGenerativeAI model
            # elif self.model_option == 'gemini-1.5-flash-latest':  # Google Gemini models; streaming output not supported, currently unused
            #     model_choice = random.choice(["gemini-1.5-flash-latest", 'gemini-1.0-pro-001', 'gemini-1.5-pro-latest', "gemini-1.0-pro"])
            #     self.llm = ChatGoogleGenerativeAI(model=model_choice, temperature=0.7)  # ChatGoogleGenerativeAI model
            else:
                self.llm = ChatGroq(model_name=self.model_option, temperature=0.5, max_tokens=self.model_tokens)  # ChatGroq model
            prompt = ChatPromptTemplate.from_template(prompt_config.chatbot_prompt)
@@ -37,13 +50,24 @@ def get_response(self,question,chat_history):
            return f"The model {self.model_option} is currently unavailable; please select another model in the sidebar."

def init_params():
    """
    Initialize session-state parameters.

    If "chat_message" is not in the session state, create an empty list for it.
    If "chat_bot" is not in the session state, create a new chatbot instance for it.
    """
    if "chat_message" not in st.session_state:
        st.session_state.chat_message = []
    if "chat_bot" not in st.session_state:
        st.session_state.chat_bot = chatbot()

# Clear the chat history
def clear():
    """
    Clear the chat history and the model instance from the session state.

    Reset "chat_message" to an empty list and assign a fresh chatbot instance to "chat_bot".
    """
    st.session_state.chat_message = []  # clear the chat history
    st.session_state.chat_bot = chatbot()  # re-initialize the model

53 changes: 52 additions & 1 deletion web_pages/online_chat_page.py
@@ -1,4 +1,5 @@
import asyncio
import sys
import streamlit as st
from config_setting import prompt_config
from langchain_core.messages import AIMessage, HumanMessage
@@ -13,12 +14,25 @@

class SearchBot:
    def __init__(self):
        """
        Initialize the SearchBot instance: load environment variables and set the model option and token count.
        """
        load_dotenv(find_dotenv())  # load environment variables from .env
        self.model_option = None
        self.model_tokens = None
        self.content = None

    def generate_based_history_query(self, question, chat_history):
        """
        Generate a search query from the question and the chat history.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str: the generated search query.
        """
        prompt = PromptTemplate.from_template(prompt_config.query_generated_prompt)
        rag_chain = prompt | self.llm | StrOutputParser()
        result = rag_chain.invoke(
@@ -30,6 +44,16 @@ def generate_based_history_query(self,question,chat_history):
        return result

    def judge_search(self, question, chat_history):
        """
        Decide whether a web search is needed.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str: the judgment result.
        """
        judge_model = QianfanChatEndpoint(model='ERNIE-Lite-8K')
        prompt = PromptTemplate.from_template(prompt_config.judge_search_prompt)
        chain = prompt | judge_model | StrOutputParser()
@@ -40,6 +64,17 @@ def judge_search(self,question,chat_history):
        return response

    def get_response(self, question, select_search_type, chat_history):
        """
        Get a response based on the user's question and the chat history.

        Parameters:
            question (str): the user's question.
            select_search_type (str): the selected search type.
            chat_history (list): the chat history.

        Returns:
            str or generator: a generator object if streaming output is used, otherwise a string.
        """
        if self.model_option == 'ERNIE-Lite-8K' or self.model_option == 'ERNIE-speed-128k':  # Baidu Qianfan models
            self.llm = QianfanChatEndpoint(model=self.model_option)
        else:
@@ -51,7 +86,11 @@ def get_response(self, question,select_search_type,chat_history):
        if select_search_type == "duckduckgo":
            self.content = crawler_modules.duckduck_search(query)
        else:
            loop = asyncio.ProactorEventLoop()  # create the event loop for the async Playwright search
            sys_type = sys.platform
            if sys_type == "win32":
                loop = asyncio.ProactorEventLoop()  # Windows
            else:
                loop = asyncio.SelectorEventLoop()  # Linux
            self.content = loop.run_until_complete(crawler_modules.google_search_async(query))  # async search
        prompt = ChatPromptTemplate.from_template(prompt_config.searchbot_prompt)
        chain = prompt | self.llm | StrOutputParser()
@@ -71,13 +110,25 @@ def get_response(self, question,select_search_type,chat_history):
            return f"The model {self.model_option} is currently unavailable; please select another model in the sidebar."

def init_params():
    """
    Initialize session-state parameters.

    If "search_message" is not in the session state, create an empty list for it.
    If "searchbot" is not in the session state, create a new SearchBot instance for it.
    """
    if "search_message" not in st.session_state:
        st.session_state.search_message = []
    if "searchbot" not in st.session_state:
        st.session_state.search_bot = SearchBot()


def clear():
    """
    Clear the search history and the SearchBot instance from the session state.

    Reset "search_message" to an empty list and assign a fresh SearchBot instance to "searchbot".
    """
    st.session_state.search_message = []
    st.session_state.search_bot = SearchBot()
