Feature/title decision #164

Merged
merged 19 commits into from
Jan 16, 2025
Changes from 6 commits
62 changes: 62 additions & 0 deletions docs/document_formatter.md
@@ -0,0 +1,62 @@
# DocumentFormatter

When adding documents through the `CosmosDBManager` class, the functions in `src/sc_system_ai/template/document_formatter.py` are used to format them.

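## Basic behavior

The text-splitting functions split text into chunks of roughly 1000 characters. The chunk size and the overlap between adjacent chunks can be set with the `chunk_size` and `chunk_overlap` arguments.

The following metadata is attached to each document:

- created_at : creation timestamp
- updated_at : last-update timestamp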
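
As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch in plain Python. It is not the splitter used in `document_formatter.py`: `split_with_overlap` and `with_timestamps` are hypothetical helpers, the default overlap of 200 is an assumed value, and a plain dict stands in for the `Document` class. The date format follows the `"%Y-%m-%d"` pattern that appears in `add_metadata`.

```python
from datetime import datetime


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def with_timestamps(chunk: str) -> dict:
    """Wrap a chunk in a dict with created_at / updated_at metadata (hypothetical layout)."""
    today = datetime.now().strftime("%Y-%m-%d")
    return {"page_content": chunk, "metadata": {"created_at": today, "updated_at": today}}


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
docs = [with_timestamps(c) for c in chunks]
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.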

### `md_formatter()`

#### Arguments

| Argument       | Type                      | Description                    |
|----------------|---------------------------|--------------------------------|
| `text`         | str                       | Markdown-formatted text        |
| `title`        | str (optional)            | Title                          |
| `metadata`     | dict[str, Any] (optional) | Metadata                       |
| `chunk_size`   | int (optional)            | Chunk size for splitting       |
| `chunk_overlap`| int (optional)            | Overlap size between chunks    |

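#### Behavior

Splits Markdown-formatted text and attaches metadata.
Returns a list of `Document` objects.
The metadata includes the Markdown headers.

The text is split at each header.
If a split section still exceeds the chunk size, it is split again.
Chunks produced by this second split receive a section number in their metadata.

When called without `title`, the corresponding header is stored in the metadata as the title.
If there is no header, the text of the first chunk after splitting is used as the title.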
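
The flow described above (split by header, re-split oversized sections, fall back to the header or the first chunk for the title) can be sketched in plain Python. This is a simplified stand-in, not the actual `md_formatter`: `split_by_header` and `format_markdown` are hypothetical names, the metadata keys `title` and `section_number` are assumed, the second split uses fixed-size slices without overlap, and dicts stand in for `Document` objects.

```python
import re
from typing import Optional


def split_by_header(text: str) -> list[tuple[Optional[str], str]]:
    """Return (header, body) pairs, one per Markdown header section."""
    sections: list[tuple[Optional[str], str]] = []
    header: Optional[str] = None
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if header is not None or content:
            sections.append((header, content))

    for line in text.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the previous section before starting a new one
            header, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def format_markdown(text: str, title: Optional[str] = None, chunk_size: int = 1000) -> list[dict]:
    docs: list[dict] = []
    for header, body in split_by_header(text):
        t = header if title is None else title  # an explicit title wins over the header
        if len(body) > chunk_size:
            # Second split: fixed-size pieces, numbered within their section
            pieces = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]
            for number, piece in enumerate(pieces, start=1):
                docs.append({
                    "page_content": piece,
                    "metadata": {"title": t if t is not None else pieces[0], "section_number": number},
                })
        else:
            docs.append({"page_content": body, "metadata": {"title": t if t is not None else body}})
    return docs


md = "# Intro\nshort\n\n# Long\n" + "a" * 25
for doc in format_markdown(md, chunk_size=10):
    print(doc["metadata"])
```

Only the chunks that came out of the second split carry a section number; a section that fits within `chunk_size` is stored as a single document without one.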

### `text_formatter()`

#### Arguments

| Argument       | Type                      | Description                    |
|----------------|---------------------------|--------------------------------|
| `text`         | str                       | Text                           |
| `title`        | str (optional)            | Title                          |
| `metadata`     | dict[str, Any] (optional) | Metadata                       |
| `separator`    | str (optional)            | Separator string               |
| `chunk_size`   | int (optional)            | Chunk size for splitting       |
| `chunk_overlap`| int (optional)            | Overlap size between chunks    |

#### Behavior

Splits the text by the separator and the chunk size, then attaches metadata.

When called without `title`, the text of the first chunk after splitting is used as the title.
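
A minimal sketch of this behavior in plain Python follows. It is an approximation under stated assumptions, not the actual `text_formatter`: `format_text` is a hypothetical name, the metadata keys are assumed, overlap is left out, and a piece longer than `chunk_size` is kept whole. The rule that section numbers are added only when more than one chunk results mirrors the `with_section_number` expression in this PR's diff.

```python
from typing import Optional


def format_text(
    text: str,
    separator: str = "\n\n",
    title: Optional[str] = None,
    chunk_size: int = 1000,
) -> list[dict]:
    """Split on separator, merge pieces up to chunk_size, then attach metadata."""
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(separator) if p):
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # the current chunk is full, start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    if not chunks:
        return []

    resolved_title = chunks[0] if title is None else title
    docs = []
    for number, chunk in enumerate(chunks, start=1):
        metadata = {"title": resolved_title}
        if len(chunks) > 1:  # number the sections only when the text was actually split
            metadata["section_number"] = number
        docs.append({"page_content": chunk, "metadata": metadata})
    return docs


for doc in format_text("aaa\n\nbbb\n\nccc", chunk_size=8):
    print(doc)
```

Because the title falls back to the whole first chunk, passing a short explicit `title` keeps the metadata compact for long texts.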

## Behavior in `CosmosDBManager`

The `create_document` method creates documents in the vector store.

The `updata_document` method updates the `updated_at` metadata.
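
The difference between the two timestamps can be illustrated with a small sketch. It does not use `CosmosDBManager`: `new_metadata` and `touch` are hypothetical helpers, and the sketch assumes that both timestamps start out equal and that an update changes only `updated_at`. The date format follows the `"%Y-%m-%d"` pattern used in `add_metadata`.

```python
from datetime import datetime
from typing import Any, Optional

DATE_FORMAT = "%Y-%m-%d"


def new_metadata(title: str, now: Optional[datetime] = None) -> dict[str, Any]:
    """Metadata for a freshly created document: both timestamps are the same."""
    date = (now or datetime.now()).strftime(DATE_FORMAT)
    return {"title": title, "created_at": date, "updated_at": date}


def touch(metadata: dict[str, Any], now: Optional[datetime] = None) -> dict[str, Any]:
    """Return a copy with updated_at refreshed; created_at is left untouched."""
    date = (now or datetime.now()).strftime(DATE_FORMAT)
    return {**metadata, "updated_at": date}


created = new_metadata("Guide", now=datetime(2025, 1, 10))
updated = touch(created, now=datetime(2025, 1, 16))
print(updated)  # {'title': 'Guide', 'created_at': '2025-01-10', 'updated_at': '2025-01-16'}
```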
29 changes: 15 additions & 14 deletions requirements.txt
@@ -1,43 +1,44 @@
aiohappyeyeballs==2.4.3 ; python_version >= "3.10" and python_version < "4.0"
aiohttp==3.10.10 ; python_version >= "3.10" and python_version < "4.0"
aiosignal==1.3.2 ; python_version >= "3.10" and python_version < "4.0"
aiohttp==3.10.11 ; python_version >= "3.10" and python_version < "4.0"
aiosignal==1.3.1 ; python_version >= "3.10" and python_version < "4.0"
annotated-types==0.7.0 ; python_version >= "3.10" and python_version < "4.0"
anyio==4.6.0 ; python_version >= "3.10" and python_version < "4.0"
async-timeout==4.0.3 ; python_version >= "3.10" and python_version < "3.11"
attrs==24.2.0 ; python_version >= "3.10" and python_version < "4.0"
azure-core==1.31.0 ; python_version >= "3.10" and python_version < "4.0"
azure-cosmos==4.7.0 ; python_version >= "3.10" and python_version < "4.0"
azure-cosmos==4.9.0 ; python_version >= "3.10" and python_version < "4.0"
certifi==2024.8.30 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.4.1 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.4.0 ; python_version >= "3.10" and python_version < "4.0"
click==8.1.7 ; python_version >= "3.10" and python_version < "4.0"
colorama==0.4.6 ; python_version >= "3.10" and python_version < "4.0" and platform_system == "Windows"
dataclasses-json==0.6.7 ; python_version >= "3.10" and python_version < "4.0"
distro==1.9.0 ; python_version >= "3.10" and python_version < "4.0"
duckduckgo-search==6.3.2 ; python_version >= "3.10" and python_version < "4.0"
duckduckgo-search==6.3.7 ; python_version >= "3.10" and python_version < "4.0"
exceptiongroup==1.2.2 ; python_version >= "3.10" and python_version < "3.11"
frozenlist==1.4.1 ; python_version >= "3.10" and python_version < "4.0"
greenlet==3.1.1 ; python_version < "3.13" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and python_version >= "3.10"
h11==0.14.0 ; python_version >= "3.10" and python_version < "4.0"
httpcore==1.0.6 ; python_version >= "3.10" and python_version < "4.0"
httpx-sse==0.4.0 ; python_version >= "3.10" and python_version < "4.0"
httpx==0.27.2 ; python_version >= "3.10" and python_version < "4.0"
idna==3.10 ; python_version >= "3.10" and python_version < "4.0"
jiter==0.6.1 ; python_version >= "3.10" and python_version < "4.0"
jsonpatch==1.33 ; python_version >= "3.10" and python_version < "4.0"
jsonpointer==3.0.0 ; python_version >= "3.10" and python_version < "4.0"
langchain-community==0.3.3 ; python_version >= "3.10" and python_version < "4.0"
langchain-core==0.3.12 ; python_version >= "3.10" and python_version < "4.0"
langchain-openai==0.2.3 ; python_version >= "3.10" and python_version < "4.0"
langchain-text-splitters==0.3.0 ; python_version >= "3.10" and python_version < "4.0"
langchain==0.3.4 ; python_version >= "3.10" and python_version < "4.0"
langsmith==0.1.137 ; python_version >= "3.10" and python_version < "4.0"
langchain-community==0.3.13 ; python_version >= "3.10" and python_version < "4.0"
langchain-core==0.3.28 ; python_version >= "3.10" and python_version < "4.0"
langchain-openai==0.2.14 ; python_version >= "3.10" and python_version < "4.0"
langchain-text-splitters==0.3.4 ; python_version >= "3.10" and python_version < "4.0"
langchain==0.3.13 ; python_version >= "3.10" and python_version < "4.0"
langsmith==0.1.147 ; python_version >= "3.10" and python_version < "4.0"
marshmallow==3.22.0 ; python_version >= "3.10" and python_version < "4.0"
multidict==6.1.0 ; python_version >= "3.10" and python_version < "4.0"
mypy-extensions==1.0.0 ; python_version >= "3.10" and python_version < "4.0"
numpy==1.26.4 ; python_version >= "3.10" and python_version < "4.0"
openai==1.52.0 ; python_version >= "3.10" and python_version < "4.0"
orjson==3.10.7 ; python_version >= "3.10" and python_version < "4.0"
openai==1.58.1 ; python_version >= "3.10" and python_version < "4.0"
orjson==3.10.7 ; python_version >= "3.10" and python_version < "4.0" and platform_python_implementation != "PyPy"
packaging==24.1 ; python_version >= "3.10" and python_version < "4.0"
primp==0.6.4 ; python_version >= "3.10" and python_version < "4.0"
primp==0.8.1 ; python_version >= "3.10" and python_version < "4.0"
propcache==0.2.0 ; python_version >= "3.10" and python_version < "4.0"
pydantic-core==2.23.4 ; python_version >= "3.10" and python_version < "4.0"
pydantic-settings==2.5.2 ; python_version >= "3.10" and python_version < "4.0"
7 changes: 5 additions & 2 deletions src/sc_system_ai/template/azure_cosmos.py
@@ -84,12 +84,15 @@ def __init__(
     def create_document(
         self,
         text: str,
-        text_type: Literal["markdown", "plain"] = "markdown"
+        text_type: Literal["markdown", "plain"] = "markdown",
+        title: str | None = None,
+        metadata: dict[str, Any] | None = None,
     ) -> list[str]:
         """Create a new document in the database."""
         logger.info("Creating a new document")
         texts, metadatas = self._division_document(
-            md_formatter(text) if text_type == "markdown" else text_formatter(text)
+            md_formatter(text, title, metadata) if text_type == "markdown"
+            else text_formatter(text, title=title, metadata=metadata)
         )
         ids = self._insert_texts(texts, metadatas)
         return ids
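
The two formatter calls in this hunk differ in style: `md_formatter` receives `title` and `metadata` positionally, while `text_formatter` receives them by keyword. The signatures in the `document_formatter.py` diff suggest why: `text_formatter` takes `separator` as its second parameter. The sketch below uses stub functions (hypothetical; they only echo their arguments) with the same parameter order to show the difference.

```python
from typing import Any, Optional


def md_formatter_stub(text: str, title: Optional[str] = None,
                      metadata: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    # Stub: same leading parameter order as md_formatter in the diff
    return {"title": title, "metadata": metadata}


def text_formatter_stub(text: str, separator: str = "\n\n", title: Optional[str] = None,
                        metadata: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    # Stub: separator comes second, as in text_formatter in the diff
    return {"separator": separator, "title": title, "metadata": metadata}


def format_for(text: str, text_type: str = "markdown", title: Optional[str] = None,
               metadata: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    if text_type == "markdown":
        return md_formatter_stub(text, title, metadata)
    return text_formatter_stub(text, title=title, metadata=metadata)


ok = format_for("body", "plain", title="My title")
wrong = text_formatter_stub("body", "My title")  # positional: the title lands in `separator`
print(ok["title"], repr(wrong["separator"]), wrong["title"])  # My title 'My title' None
```

Passing both arguments by keyword to both formatters would make the call site independent of parameter order.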
29 changes: 19 additions & 10 deletions src/sc_system_ai/template/document_formatter.py
@@ -97,6 +97,7 @@ def add_metadata(
         source (str, optional): Source.
         with_timestamp (bool, optional): Whether to add timestamps. Defaults to True.
         with_section_number (bool, optional): Whether to add section numbers. Defaults to False.
+        **kwargs: Additional metadata.
     """
     i = 1
     date = datetime.now().strftime("%Y-%m-%d")
@@ -124,21 +125,26 @@ def add_metadata(

 def md_formatter(
     text: str,
+    title: str | None = None,
+    metadata: dict[str, Any] | None = None,
     chunk_size: int = CHUNK_SIZE,
     chunk_overlap: int = CHUNK_OVERLAP,
-    **kwargs: Any
 ) -> list[Document]:
     """Format Markdown text.
     Args:
         text (str): Markdown-formatted text.
+        title (str, optional): Title.
+        metadata (dict[str, Any], optional): Metadata.
         chunk_size (int, optional): Chunk size for splitting.
         chunk_overlap (int, optional): Overlap size between chunks.

     Text longer than chunk_size is split again, and a section number is added to the metadata.
     """
     formatted_docs: list[Document] = []
+    _metadata = metadata if metadata is not None else {}
+
     for doc in markdown_splitter(text):
-        t = _find_header(doc)
+        t = _find_header(doc) if title is None else title
         if len(doc.page_content) > chunk_size:
             rdocs = recursive_document_splitter(
                 [doc],
@@ -149,27 +155,30 @@ def md_formatter(
                 rdocs,
                 title=t if t is not None else rdocs[0].page_content,
                 with_section_number=True,
-                **kwargs
+                **_metadata
             )
         else:
             formatted_docs += add_metadata(
                 [doc],
                 title=t if t is not None else doc.page_content,
-                **kwargs
+                **_metadata
             )

     return formatted_docs

 def text_formatter(
     text: str,
     separator: str = "\n\n",
+    title: str | None = None,
+    metadata: dict[str, Any] | None = None,
     chunk_size: int = CHUNK_SIZE,
     chunk_overlap: int = CHUNK_OVERLAP,
-    **kwargs: Any
 ) -> list[Document]:
     """Format plain text.
     Args:
         text (str): Text.
+        title (str, optional): Title.
+        metadata (dict[str, Any], optional): Metadata.
         separator (str, optional): Separator string.
         chunk_size (int, optional): Chunk size for splitting.
         chunk_overlap (int, optional): Overlap size between chunks.
@@ -184,9 +193,9 @@ def text_formatter(
     )
     return add_metadata(
         docs,
-        title=docs[0].page_content,
-        with_section_number=True,
-        **kwargs
+        title=docs[0].page_content if title is None else title,
+        with_section_number=True if len(docs) > 1 else False,
+        **metadata if metadata is not None else {},
     )

 if __name__ == "__main__":
@@ -219,8 +228,8 @@ def print_docs(docs: list[Document]) -> None:
             print()


-    docs = md_formatter(md_text)
+    docs = md_formatter(md_text, title="hogehogehoge", metadata={"fuga": "piyopiyo"})
     print_docs(docs)

-    docs = text_formatter(md_text)
+    docs = text_formatter(md_text, title="hogehogehoge", metadata={"fuga": "piyopiyo"})
     print_docs(docs)
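
The calls above pass a `title` and a `metadata` dict, which the formatters forward to `add_metadata` as keyword arguments. The sketch below imitates that hand-off with a simplified stand-in, `add_metadata_stub` (hypothetical; the real `add_metadata` also handles timestamps and `source`, and the metadata key names here are assumed). It shows what ends up on each chunk and one caveat of the unpacking approach: a `metadata` key that matches a named parameter such as `title` raises a `TypeError`.

```python
from typing import Any, Optional


def add_metadata_stub(chunks: list[str], title: str,
                      with_section_number: bool = False, **kwargs: Any) -> list[dict]:
    docs = []
    for i, chunk in enumerate(chunks, start=1):
        metadata: dict[str, Any] = {"title": title, **kwargs}  # extra keys are merged in
        if with_section_number:
            metadata["section_number"] = i
        docs.append({"page_content": chunk, "metadata": metadata})
    return docs


def format_stub(chunks: list[str], title: Optional[str] = None,
                metadata: Optional[dict[str, Any]] = None) -> list[dict]:
    _metadata = metadata if metadata is not None else {}  # None default avoids a shared mutable dict
    return add_metadata_stub(
        chunks,
        title=chunks[0] if title is None else title,
        with_section_number=len(chunks) > 1,
        **_metadata,
    )


docs = format_stub(["first", "second"], title="hogehogehoge", metadata={"fuga": "piyopiyo"})
print(docs[0]["metadata"])  # {'title': 'hogehogehoge', 'fuga': 'piyopiyo', 'section_number': 1}

try:
    format_stub(["first"], metadata={"title": "clash"})
except TypeError as exc:
    print("TypeError:", exc)  # duplicate keyword argument 'title'
```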