Commit v1.0.1
Boomm-shakalaka committed Jun 5, 2024
1 parent 4a8f6c1 commit 94acecc
Showing 14 changed files with 242 additions and 55 deletions.
31 changes: 20 additions & 11 deletions Dockerfile
@@ -1,25 +1,34 @@
# Set the base image; Python 3.8 is used here
FROM python:3.8.19
# Use Ubuntu 22.04 as the base image
FROM ubuntu:22.04

# Set the working directory
WORKDIR /app

# Copy the project files into the container's working directory
COPY . /app

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Install system dependencies
RUN apt-get update && \
    apt-get install -y libgl1-mesa-glx libpython3-dev

# Install Node.js and npm
RUN curl -fsSL https://deb.nodesource.com/setup_14.x | bash - \
    && apt-get install -y nodejs \
    && rm -rf /var/lib/apt/lists/*
# Install Python 3.9
RUN apt-get install -y python3.9

# Install npm dependencies and Playwright browsers
RUN npm install && npx playwright install
# Install pip
RUN apt-get install -y python3-pip

# Install Python dependencies
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Playwright and its dependencies
RUN playwright install --with-deps chromium

# Expose the port
EXPOSE 8501

# Set an environment variable to specify the operating system
ENV OS_TYPE="linux"

# Run the Streamlit app
CMD ["streamlit", "run", "web_ui.py", "--server.port", "8501"]
CMD ["python3", "-m", "streamlit", "run", "web_ui.py", "--server.port", "8501"]

49 changes: 41 additions & 8 deletions README.md
@@ -1,4 +1,4 @@
### AI Bot Based on LLMs
# AI Bot Based on LLMs
An open-source AI language-model bot that integrates human-machine chat, information retrieval and generation, and conversational parsing of PDFs and URLs. Its key advantage is that it relies entirely on free, open APIs, delivering customized LLM functionality at minimal cost.

## Tools and Platforms
@@ -7,8 +7,6 @@ Langchain, Streamlit, Oracle Cloud, Groq, Google Cloud, Baidu Cloud, Docker
## Demo Link
[Link](http://168.138.28.54:8501)

## File Structure
<pre>
.
@@ -39,12 +37,12 @@
├── web_ui.py # main interface
</pre>


## Features

### Crawler Module


* The module provides three crawler methods: [Selenium](https://selenium-python.readthedocs.io/), [Playwright](https://playwright.dev/python/docs/intro), and [Langchain-based DuckDuckGo](https://api.python.langchain.com/en/latest/tools/langchain_community.tools.ddg_search.tool.DuckDuckGoSearchResults.html)

* Experiments show that Playwright takes about half as long as Selenium:
@@ -53,7 +51,8 @@
| Crawler | Time |
| --- | --- |
| selenium_url_crawler | 27s |
| playwright_url_crawler | 11s |

* Because Streamlit conflicts with Playwright's synchronous API, the asynchronous approach must be used, as shown in the sketch below. [Reference](https://discuss.streamlit.io/t/using-playwright-with-streamlit/28380/5)
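
A minimal sketch of that async pattern (not the project's exact code): it builds a platform-appropriate event loop explicitly rather than relying on `asyncio.run()`, and drives a headless Chromium crawl from a Streamlit widget. The `playwright_url_crawler` coroutine, the example URL, and the widget labels are illustrative assumptions.

```python
# Hedged sketch: run an async Playwright crawl from Streamlit with an explicit event loop.
# Assumes Playwright browsers are already installed (`playwright install chromium`).
import asyncio
import sys

import streamlit as st
from playwright.async_api import async_playwright


async def playwright_url_crawler(url: str) -> str:
    """Fetch the rendered text of a page with headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        text = await page.inner_text("body")
        await browser.close()
        return text


url = st.text_input("URL to crawl", "https://example.com")
if st.button("Crawl"):
    # Playwright needs a ProactorEventLoop on Windows for subprocess support;
    # a SelectorEventLoop is used on Linux.
    if sys.platform == "win32":
        loop = asyncio.ProactorEventLoop()
    else:
        loop = asyncio.SelectorEventLoop()
    st.write(loop.run_until_complete(playwright_url_crawler(url)))
```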


### Chat Module (Online and Offline)

@@ -92,6 +91,13 @@
3. Retrieve the top_k most relevant documents for the question.
4. Answer the question based on the retrieved documents (see the sketch below).
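
As a rough illustration of steps 3 and 4, the sketch below retrieves the top_k chunks from a vector store and answers from them. The Chroma store, the prompt wording, and the Groq model name are assumptions for the example, not the project's exact configuration.

```python
# Hedged sketch of retrieve-then-answer: step 3 fetches top_k chunks, step 4 answers from them.
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq


def answer_from_docs(vectorstore: Chroma, question: str, top_k: int = 4) -> str:
    # Step 3: retrieve the top_k most relevant chunks for the question.
    docs = vectorstore.similarity_search(question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step 4: answer the question using only the retrieved context.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    chain = prompt | ChatGroq(model_name="llama3-8b-8192") | StrOutputParser()
    return chain.invoke({"context": context, "question": question})
```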

### PDF Parsing Module
1. Built on the [Streamlit PDF display approach](https://discuss.streamlit.io/t/display-pdf-in-streamlit/62274) and [Langchain PDFMinerLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html)
2. Workflow (sketched in the code below):
   1. Upload a PDF
   2. Parse the PDF content; the LLM summarizes it based on a prompt
   3. Answer questions from the question together with the PDF content
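
A hedged sketch of that flow, using Streamlit's file uploader and Langchain's PDFMinerLoader; the prompt text and the Groq model name are placeholders rather than the project's actual configuration.

```python
# Hedged sketch: upload a PDF, parse it with PDFMinerLoader, then answer questions from its text.
import tempfile

import streamlit as st
from langchain_community.document_loaders import PDFMinerLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

uploaded = st.file_uploader("Upload a PDF", type="pdf")  # step 1: upload
if uploaded is not None:
    # Persist the upload to a temporary file so PDFMinerLoader can read it by path.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded.getvalue())
        pdf_path = tmp.name

    # Step 2: parse the PDF into plain text.
    pdf_text = "\n".join(doc.page_content for doc in PDFMinerLoader(pdf_path).load())

    # Step 3: answer from the question plus the PDF content.
    question = st.text_input("Ask a question about the PDF")
    if question:
        prompt = ChatPromptTemplate.from_template(
            "Answer the question using the PDF content below.\n\n"
            "PDF content:\n{pdf}\n\nQuestion: {question}"
        )
        chain = prompt | ChatGroq(model_name="llama3-8b-8192") | StrOutputParser()
        st.write(chain.invoke({"pdf": pdf_text, "question": question}))
```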

## Usage

### Local Deployment
@@ -125,11 +131,37 @@
streamlit run web_ui.py
```
### Server Deployment
1. [Docker link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)
2. Server deployment tutorial: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)
Method 1: install and run Docker locally in a Linux environment
* Pull the GitHub repository onto the server
* Build the image

Method 2: pull and run the image from Docker Hub
* [Docker Hub link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)

Deployment tutorial
* Server deployment tutorial: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)

## Known Issues When Building the Docker Image
1. Packaging langchain-google-genai fails; the root cause has not been found:
```bash
ERROR: Could not find a version that satisfies the requirement langchain-google-genai (from -r requirements.txt (line 11)) (from versions: none)
ERROR: No matching distribution found for langchain-google-genai (from -r requirements.txt (line 11))
```
2. The asyncio event-loop setup differs between Windows and Linux, so the code must branch on the operating system. [Reference](https://stackoverflow.com/questions/67964463/what-are-selectoreventloop-and-proactoreventloop-in-python-asyncio)
```python
if sys.platform == "win32":
    loop = asyncio.ProactorEventLoop()  # Windows
else:
    loop = asyncio.SelectorEventLoop()  # Linux
```
3. Playwright cannot be packaged directly into the Docker image; it needs an Ubuntu-based image environment. [Reference](https://stackoverflow.com/questions/72181737/issue-running-playwright-python-in-docker-container)
## Version History
v1.0.0 (oracle cloud)
v1.0.1 (oracle)
1. Fixed the Docker image build issues and handled the async event-loop differences across operating systems
v1.0.0
1. Improved the resume-evaluation feature in PDF chat and added follow-up conversation
2. Added the Playwright crawler module and improved async calls
3. Added crawler-module invocation and search-source selection to URL chat
@@ -139,6 +171,7 @@
7. Consolidated the prompt configuration
8. Polished the UI
9. Added an About page
10. Updated the Dockerfile
v0.0.5
1. Added Baidu Qianfan models (ERNIE-Lite-8K and ERNIE-Speed-128K, free to use)
Binary file modified __pycache__/crawler_modules.cpython-39.pyc
20 changes: 15 additions & 5 deletions crawler_modules.py
@@ -1,5 +1,6 @@
import asyncio
import re
import sys
from bs4 import BeautifulSoup
import requests
from langchain_core.documents import Document
@@ -169,10 +170,13 @@ async def playwright_crawler_async(url):

def selenium_url_crawler(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument("--headless")  # Run Chrome in headless mode
    options.add_argument("--no-sandbox")  # Bypass OS security model
    options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
    # options.add_argument('--window-size=1920x1080')

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    # driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # time.sleep(2)

@@ -241,16 +245,22 @@ def duckduck_search(question):
# print(data_playwright)

'''playwright_async'''
# loop = asyncio.ProactorEventLoop()
# if sys.platform == "win32":
# loop = asyncio.ProactorEventLoop()
# else:
# loop = asyncio.SelectorEventLoop()
# data_playwright_async = loop.run_until_complete(playwright_crawler_async('https://www.google.com/search?q=墨尔本天气'))
# print(data_playwright_async)

'''google_search_sync'''
# data_sync = google_search_sync(question)
# print(data_sync)

'''google_search_async'''
loop = asyncio.ProactorEventLoop()
'''google_search_async'''
if sys.platform == "win32":
    loop = asyncio.ProactorEventLoop()
else:
    loop = asyncio.SelectorEventLoop()
data_async = loop.run_until_complete(google_search_async(question))
print(data_async)

Expand Down
8 changes: 6 additions & 2 deletions requirements.txt
@@ -8,6 +8,10 @@ BeautifulSoup4
langchain_cohere
chromadb
duckduckgo-search
langchain-google-genai
qianfan
asyncio
webdriver-manager
# langchain-google-genai
pdfminer.six
selenium
selenium
playwright
Binary file modified web_pages/__pycache__/chat_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/online_chat_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/pdf_page.cpython-39.pyc
Binary file modified web_pages/__pycache__/url_page.cpython-39.pyc
4 changes: 2 additions & 2 deletions web_pages/about_page.py
@@ -40,9 +40,9 @@ def about_page():
* [Bootstrap website](https://getbootstrap.com/)
Author: Boomm-shakalaka
Version: 1.0
Version: 1.1
GitHub project: [AIBot-LLM](https://github.com/Boomm-shakalaka/AIBot-LLM)
'Report bugs': [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
Report bugs: [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
"""
)

34 changes: 29 additions & 5 deletions web_pages/chat_page.py
@@ -6,23 +6,36 @@
from langchain_groq import ChatGroq
from langchain_community.chat_models import QianfanChatEndpoint
from config_setting import model_config,prompt_config
from langchain_google_genai import ChatGoogleGenerativeAI
# from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import find_dotenv, load_dotenv

class chatbot:
    def __init__(self):
        """
        Initialize the chatbot instance: load environment variables and set the model option and token count.
        """
        load_dotenv(find_dotenv())  # load environment variables
        self.model_option = None
        self.model_tokens = None
        self.llm = None

    def get_response(self, question, chat_history):
        """
        Get a response based on the user's question and the chat history.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str or generator: a generator object if streaming output is used, otherwise a string.
        """
        try:
            if self.model_option == 'ERNIE-Lite-8K' or self.model_option == 'ERNIE-speed-128k':  # Baidu Qianfan models
                self.llm = QianfanChatEndpoint(model=self.model_option)
            elif self.model_option == 'gemini-1.5-flash-latest':  # Google Gemini models; streaming output not supported, currently unused
                model_choice = random.choice(["gemini-1.5-flash-latest", 'gemini-1.0-pro-001', 'gemini-1.5-pro-latest', "gemini-1.0-pro"])
                self.llm = ChatGoogleGenerativeAI(model=model_choice, temperature=0.7)  # ChatGoogleGenerativeAI model
            # elif self.model_option == 'gemini-1.5-flash-latest':  # Google Gemini models; streaming output not supported, currently unused
            #     model_choice = random.choice(["gemini-1.5-flash-latest", 'gemini-1.0-pro-001', 'gemini-1.5-pro-latest', "gemini-1.0-pro"])
            #     self.llm = ChatGoogleGenerativeAI(model=model_choice, temperature=0.7)  # ChatGoogleGenerativeAI model
            else:
                self.llm = ChatGroq(model_name=self.model_option, temperature=0.5, max_tokens=self.model_tokens)  # ChatGroq model
            prompt = ChatPromptTemplate.from_template(prompt_config.chatbot_prompt)
@@ -37,13 +50,24 @@ def get_response(self,question,chat_history):
            return f"The model {self.model_option} is currently unavailable; please select another model in the sidebar."

def init_params():
    """
    Initialize session-state parameters.

    If "chat_message" is not in the session state, create an empty list for it.
    If "chat_bot" is not in the session state, create a new chatbot instance for it.
    """
    if "chat_message" not in st.session_state:
        st.session_state.chat_message = []
    if "chat_bot" not in st.session_state:
        st.session_state.chat_bot = chatbot()

# Clear the chat history
def clear():
    """
    Clear the chat history and the model instance from the session state.

    Reset "chat_message" to an empty list and assign a fresh chatbot instance to "chat_bot".
    """
    st.session_state.chat_message = []  # clear the chat history
    st.session_state.chat_bot = chatbot()  # re-initialize the model

53 changes: 52 additions & 1 deletion web_pages/online_chat_page.py
@@ -1,4 +1,5 @@
import asyncio
import sys
import streamlit as st
from config_setting import prompt_config
from langchain_core.messages import AIMessage, HumanMessage
@@ -13,12 +14,25 @@

class SearchBot:
    def __init__(self):
        """
        Initialize the SearchBot instance: load environment variables and set the model option and token count.
        """
        load_dotenv(find_dotenv())  # load environment variables from .env
        self.model_option = None
        self.model_tokens = None
        self.content = None

    def generate_based_history_query(self, question, chat_history):
        """
        Generate a search query from the question and the chat history.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str: the generated search query.
        """
        prompt = PromptTemplate.from_template(prompt_config.query_generated_prompt)
        rag_chain = prompt | self.llm | StrOutputParser()
        result = rag_chain.invoke(
@@ -30,6 +44,16 @@ def generate_based_history_query(self,question,chat_history):
        return result

    def judge_search(self, question, chat_history):
        """
        Decide whether a web search is needed.

        Parameters:
            question (str): the user's question.
            chat_history (list): the chat history.

        Returns:
            str: the judgment result.
        """
        judge_model = QianfanChatEndpoint(model='ERNIE-Lite-8K')
        prompt = PromptTemplate.from_template(prompt_config.judge_search_prompt)
        chain = prompt | judge_model | StrOutputParser()
@@ -40,6 +64,17 @@ def judge_search(self,question,chat_history):
        return response

    def get_response(self, question, select_search_type, chat_history):
        """
        Get a response based on the user's question and the chat history.

        Parameters:
            question (str): the user's question.
            select_search_type (str): the selected search type.
            chat_history (list): the chat history.

        Returns:
            str or generator: a generator object if streaming output is used, otherwise a string.
        """
        if self.model_option == 'ERNIE-Lite-8K' or self.model_option == 'ERNIE-speed-128k':  # Baidu Qianfan models
            self.llm = QianfanChatEndpoint(model=self.model_option)
        else:
@@ -51,7 +86,11 @@ def get_response(self, question,select_search_type,chat_history):
        if select_search_type == "duckduckgo":
            self.content = crawler_modules.duckduck_search(query)
        else:
            loop = asyncio.ProactorEventLoop()  # create the event loop for the async Playwright search
            sys_type = sys.platform
            if sys_type == "win32":
                loop = asyncio.ProactorEventLoop()  # Windows
            else:
                loop = asyncio.SelectorEventLoop()  # Linux
            self.content = loop.run_until_complete(crawler_modules.google_search_async(query))  # async search
        prompt = ChatPromptTemplate.from_template(prompt_config.searchbot_prompt)
        chain = prompt | self.llm | StrOutputParser()
@@ -71,13 +110,25 @@ def get_response(self, question,select_search_type,chat_history):
            return f"The model {self.model_option} is currently unavailable; please select another model in the sidebar."

def init_params():
    """
    Initialize session-state parameters.

    If "search_message" is not in the session state, create an empty list for it.
    If "searchbot" is not in the session state, create a new SearchBot instance for it.
    """
    if "search_message" not in st.session_state:
        st.session_state.search_message = []
    if "searchbot" not in st.session_state:
        st.session_state.search_bot = SearchBot()


def clear():
    """
    Clear the search history and the SearchBot instance from the session state.

    Reset "search_message" to an empty list and assign a fresh SearchBot instance to "searchbot".
    """
    st.session_state.search_message = []
    st.session_state.search_bot = SearchBot()
