Feature/title decision #164

Merged
merged 19 commits into from
Jan 16, 2025
62 changes: 62 additions & 0 deletions docs/document_formatter.md
@@ -0,0 +1,62 @@
# DocumentFormatter

When adding documents through the `CosmosDBManager` class, use the functions in `src/sc_system_ai/template/document_formatter.py`.

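## Basic behavior

The text-splitting functions split text into chunks of roughly 1000 characters. The size can be set through the `chunk_size` and `chunk_overlap` arguments.

The following metadata is attached:

- created_at : creation timestamp
- updated_at : last-updated timestamp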
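
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal standard-library sketch of overlapping character chunks with timestamp metadata. It is not the code in `document_formatter.py`: the helper names `split_with_overlap` and `with_timestamps`, the overlap default of 200, and the ISO 8601 timestamp format are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper; the default overlap of 200 is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def with_timestamps(metadata: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return a copy of metadata with created_at and updated_at set to the current time."""
    now = datetime.now(timezone.utc).isoformat()  # timestamp format is an assumption
    return {**(metadata or {}), "created_at": now, "updated_at": now}
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts with the last character of the previous one, which is the effect the overlap argument is meant to have.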

### `md_formatter()`

#### Arguments

| Argument | Type | Description |
|----------------|-------------------|--------------------------------|
| `text` | str | Markdown-formatted text |
| `title` | str (optional) | Title |
| `metadata` | dict[str, Any] (optional) | Metadata |
| `chunk_size` | int (optional) | Chunk size used for splitting |
| `chunk_overlap`| int (optional) | Overlap size between chunks |

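#### Behavior

Splits Markdown text and attaches metadata.
Returns `Document` objects.
The header is included in the metadata.

The text is split at each header.
If a resulting piece exceeds the chunk size, it is split again.
Text produced by this second split has a section number attached as metadata.

If called without `title`, the corresponding header is stored as the title in the metadata.
If there is no header, the first text of the split result is used as the title.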
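
The two-stage split and the title fallback described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `md_formatter()` implementation: the function names, the metadata keys `title`, `header`, and `section_number`, the plain-dict return type, and the 20-character title cut-off are all hypothetical, and the second split ignores `chunk_overlap` for brevity.

```python
import re
from typing import Any, Optional

# Matches ATX-style Markdown headers such as "# Title" or "### Sub".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(text: str) -> list[tuple[Optional[str], str]]:
    """Return (header, body) pairs; header is None for text before the first header."""
    sections: list[tuple[Optional[str], str]] = []
    header: Optional[str] = None
    lines: list[str] = []

    def flush() -> None:
        # Skip an empty preamble, but keep headers even if their body is empty.
        if header is not None or any(line.strip() for line in lines):
            sections.append((header, "\n".join(lines).strip()))

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return sections


def format_markdown(text: str, title: Optional[str] = None, chunk_size: int = 1000) -> list[dict[str, Any]]:
    """Split by header first, then by size; number the pieces of oversized sections."""
    docs: list[dict[str, Any]] = []
    for header, body in split_by_headers(text):
        # Second split: only produces more than one piece when body exceeds chunk_size.
        pieces = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)] or [""]
        for number, piece in enumerate(pieces, start=1):
            # Title fallback: explicit title, then header, then the start of the text.
            meta: dict[str, Any] = {"title": title or header or piece[:20]}
            if header is not None:
                meta["header"] = header
            if len(pieces) > 1:
                meta["section_number"] = number  # hypothetical key name
            docs.append({"text": piece, "metadata": meta})
    return docs
```

Sections that fit within `chunk_size` carry no section number in this sketch, which mirrors the statement that only text split a second time receives one.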

### `text_formatter()`

#### Arguments

| Argument | Type | Description |
|----------------|-------------------|--------------------------------|
| `text` | str | Text |
| `title` | str (optional) | Title |
| `metadata` | dict[str, Any] (optional) | Metadata |
| `separator` | str (optional) | Separator string |
| `chunk_size` | int (optional) | Chunk size used for splitting |
| `chunk_overlap`| int (optional) | Overlap size between chunks |

#### Behavior

Splits the text by separator and chunk size, then attaches metadata.

If called without `title`, the first text of the split result is used as the title.
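
As a rough illustration of splitting by separator and chunk size, the sketch below breaks the text on the separator and then packs neighbouring parts back together while they still fit. The names `split_on_separator` and `default_title`, the default separator `"\n\n"`, and the 20-character title limit are assumptions, not the behavior of `text_formatter()` itself; overlap handling is omitted.

```python
def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 1000) -> list[str]:
    """Split on separator, then merge adjacent parts while the result fits in chunk_size.

    A single part longer than chunk_size is kept as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for part in text.split(separator):
        if not part:
            continue  # ignore empty parts produced by repeated separators
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


def default_title(chunk: str, max_len: int = 20) -> str:
    """Fallback title: the start of the first line of the chunk (hypothetical rule)."""
    lines = chunk.strip().splitlines()
    return lines[0][:max_len] if lines else ""
```

Merging parts greedily keeps chunks close to the requested size without cutting through a separator boundary.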

## Behavior in `CosmosDBManager`

The `create_document` method creates documents in the vector store.

The `updata_document` method updates the `updated_at` metadata.
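
The timestamp handling on update can be pictured with the small sketch below. It does not call `CosmosDBManager` or Azure Cosmos DB; `touch_updated_at` is a hypothetical helper, and the ISO 8601 format is an assumption. It shows only the idea that `created_at` is preserved while `updated_at` is refreshed.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def touch_updated_at(metadata: dict[str, Any], now: Optional[datetime] = None) -> dict[str, Any]:
    """Return a copy of metadata with updated_at refreshed and created_at preserved."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    updated = dict(metadata)  # leave the caller's dict untouched
    updated.setdefault("created_at", stamp)  # only set when missing
    updated["updated_at"] = stamp
    return updated
```

Passing `now` explicitly keeps the helper deterministic, which makes it easy to test.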
29 changes: 15 additions & 14 deletions requirements.txt
@@ -1,43 +1,44 @@
aiohappyeyeballs==2.4.3 ; python_version >= "3.10" and python_version < "4.0"
aiohttp==3.10.10 ; python_version >= "3.10" and python_version < "4.0"
aiosignal==1.3.2 ; python_version >= "3.10" and python_version < "4.0"
aiohttp==3.10.11 ; python_version >= "3.10" and python_version < "4.0"
aiosignal==1.3.1 ; python_version >= "3.10" and python_version < "4.0"
annotated-types==0.7.0 ; python_version >= "3.10" and python_version < "4.0"
anyio==4.6.0 ; python_version >= "3.10" and python_version < "4.0"
async-timeout==4.0.3 ; python_version >= "3.10" and python_version < "3.11"
attrs==24.2.0 ; python_version >= "3.10" and python_version < "4.0"
azure-core==1.31.0 ; python_version >= "3.10" and python_version < "4.0"
azure-cosmos==4.7.0 ; python_version >= "3.10" and python_version < "4.0"
azure-cosmos==4.9.0 ; python_version >= "3.10" and python_version < "4.0"
certifi==2024.8.30 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.4.1 ; python_version >= "3.10" and python_version < "4.0"
charset-normalizer==3.4.0 ; python_version >= "3.10" and python_version < "4.0"
click==8.1.7 ; python_version >= "3.10" and python_version < "4.0"
colorama==0.4.6 ; python_version >= "3.10" and python_version < "4.0" and platform_system == "Windows"
dataclasses-json==0.6.7 ; python_version >= "3.10" and python_version < "4.0"
distro==1.9.0 ; python_version >= "3.10" and python_version < "4.0"
duckduckgo-search==6.3.2 ; python_version >= "3.10" and python_version < "4.0"
duckduckgo-search==6.3.7 ; python_version >= "3.10" and python_version < "4.0"
exceptiongroup==1.2.2 ; python_version >= "3.10" and python_version < "3.11"
frozenlist==1.4.1 ; python_version >= "3.10" and python_version < "4.0"
greenlet==3.1.1 ; python_version < "3.13" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32") and python_version >= "3.10"
h11==0.14.0 ; python_version >= "3.10" and python_version < "4.0"
httpcore==1.0.6 ; python_version >= "3.10" and python_version < "4.0"
httpx-sse==0.4.0 ; python_version >= "3.10" and python_version < "4.0"
httpx==0.27.2 ; python_version >= "3.10" and python_version < "4.0"
idna==3.10 ; python_version >= "3.10" and python_version < "4.0"
jiter==0.6.1 ; python_version >= "3.10" and python_version < "4.0"
jsonpatch==1.33 ; python_version >= "3.10" and python_version < "4.0"
jsonpointer==3.0.0 ; python_version >= "3.10" and python_version < "4.0"
langchain-community==0.3.3 ; python_version >= "3.10" and python_version < "4.0"
langchain-core==0.3.12 ; python_version >= "3.10" and python_version < "4.0"
langchain-openai==0.2.3 ; python_version >= "3.10" and python_version < "4.0"
langchain-text-splitters==0.3.0 ; python_version >= "3.10" and python_version < "4.0"
langchain==0.3.4 ; python_version >= "3.10" and python_version < "4.0"
langsmith==0.1.137 ; python_version >= "3.10" and python_version < "4.0"
langchain-community==0.3.13 ; python_version >= "3.10" and python_version < "4.0"
langchain-core==0.3.28 ; python_version >= "3.10" and python_version < "4.0"
langchain-openai==0.2.14 ; python_version >= "3.10" and python_version < "4.0"
langchain-text-splitters==0.3.4 ; python_version >= "3.10" and python_version < "4.0"
langchain==0.3.13 ; python_version >= "3.10" and python_version < "4.0"
langsmith==0.1.147 ; python_version >= "3.10" and python_version < "4.0"
marshmallow==3.22.0 ; python_version >= "3.10" and python_version < "4.0"
multidict==6.1.0 ; python_version >= "3.10" and python_version < "4.0"
mypy-extensions==1.0.0 ; python_version >= "3.10" and python_version < "4.0"
numpy==1.26.4 ; python_version >= "3.10" and python_version < "4.0"
openai==1.52.0 ; python_version >= "3.10" and python_version < "4.0"
orjson==3.10.7 ; python_version >= "3.10" and python_version < "4.0"
openai==1.58.1 ; python_version >= "3.10" and python_version < "4.0"
orjson==3.10.7 ; python_version >= "3.10" and python_version < "4.0" and platform_python_implementation != "PyPy"
packaging==24.1 ; python_version >= "3.10" and python_version < "4.0"
primp==0.6.4 ; python_version >= "3.10" and python_version < "4.0"
primp==0.8.1 ; python_version >= "3.10" and python_version < "4.0"
propcache==0.2.0 ; python_version >= "3.10" and python_version < "4.0"
pydantic-core==2.23.4 ; python_version >= "3.10" and python_version < "4.0"
pydantic-settings==2.5.2 ; python_version >= "3.10" and python_version < "4.0"