[Template] add unified template. #10633

Open
wants to merge 1 commit into
base: develop
195 changes: 195 additions & 0 deletions paddlenlp/datasets/formatter.py
@@ -0,0 +1,195 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum, unique
from typing import Optional, Union

from typing_extensions import override

SLOTS = list[Union[str, set[str], dict[str, str]]]


KG_RES_MARKUPS = [
    "[<kg-res>]",
    "[</kg-res>]",
    "[<kg-yes>]",
    "[</kg-yes>]",
    "[<kg-cs-yes>]",
    "[</kg-cs-yes>]",
    "[<kg-cs-no>]",
    "[</kg-cs-no>]",
]


@unique
class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"


def extract_knowledge(text):
    """Extracts structured knowledge from text markup.

    Args:
        text (str): Input text containing markup.

    Returns:
        str: Processed knowledge string.

    Raises:
        ValueError: If no valid knowledge pattern found.
    """

    # Knowledge-graph markup: strip the tags and wrap the remaining text.
    if any(markup in text for markup in KG_RES_MARKUPS):
        for markup in KG_RES_MARKUPS + ["[<image>]", "[</image>]"]:
            text = text.replace(markup, "")
        # Prompt (zh): "Knowledge base: ... / Based on the provided knowledge base
        # information, answer the question and complete the dialogue:"
        text = f"知识库:{text.strip()}\n根据所提供的知识库信息,回答问题并补全对话:"
        return text

    # For every tag below, only the first matching block is used.
    res = re.findall(
        r"\[<search-res>\](.*?)\[<\/search-res>\]",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if len(res) > 0:
        text = res[0]
        # Prompt (zh): "Answer the question based on the reference articles above
        # and complete the dialogue"
        text = f"{text.strip()}\n根据以上参考文章回答问题,补全对话"
        return text

    res = re.findall(
        r"\[<prompt-res>\](.*?)\[<\/prompt-res>\]",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if len(res) > 0:
        text = res[0]
        text = text.strip()
        return text

    res = re.findall(
        r"\[<compute-res>\](.*?)\[<\/compute-res>\]",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if len(res) > 0:
        text = res[0]
        # Prompt (zh): "Reference article 1: ... / Answer the question based on the
        # reference articles above and complete the dialogue"
        text = f"参考文章1:{text.strip()}\n根据以上参考文章回答问题,补全对话"
        return text

    res = re.findall(
        r"\[<citation-ref>\](.*?)\[<\/citation-ref>\]",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if len(res) > 0:
        text = res[0]
        # Prompt (zh): answer using the search results and mark each claim with the
        # index of the result it relies on, e.g. ^[1]^ (one result) or ^[1][2]^
        # (several); citation marks may only appear before sentence-final
        # punctuation. Each result line starts with its index ([1], [2], ...), and
        # the core of the answer may be bolded in markdown.
        text = (
            "请参考搜索结果回答下面问题并使用引用标记来标注回答内容参考的搜索结果序号,"
            "例如^[1]^ (引用单个搜索结果),^[1][2]^(引用多个搜索结果),"
            "其中方括号中的数字是搜索结果序号。引用标记只能出现在句尾标点符号前。\n"
            "以下是搜索结果(每行开头[1]、[2]、...是搜索结果序号),"
            f"可以对答案中的核心部分进行markdown加粗(**加粗内容**):\n{text.strip()}\n"
            "根据以上搜索结果回答问题并标注引用,补全对话"
        )
        return text

    res = re.findall(
        r"\[<retrieve-ref>\](.*?)\[<\/retrieve-ref>\]",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if len(res) > 0:
        text = res[0]
        # Prompt (zh): act as an expert, answer from correct, credible, high-quality
        # information in the search results, and cite them (^[2]^ cites result 2,
        # ^[1][3]^ cites results 1 and 3). Rules: 1. follow the search results
        # strictly; admitting ignorance and giving related background is allowed.
        # 2. If several answers are possible, list each case. 3. For risky domains
        # such as finance, medicine, or law, end with a reminder and a disclaimer.
        text = (
            "请你扮演一个专家,参考搜索结果中正确、可信、高质量的信息回答问题,并注明答案中引用的搜索结果,"
            "格式为^[2]^表示引用了第2条搜索结果,^[1][3]^表示引用第1和第3条搜索结果。"
            "每条搜索结果包含若干相关内容片段。同时你需要遵循以下原则回答问题:\n"
            "1. 严格遵循搜索结果作答,可以承认不知道答案,并尝试给出一些搜索结果中的相关背景信息。\n"
            "2. 如果搜索结果存在多种可能的答案,要罗列出每种情况。\n"
            "3. 如果问题涉及金融、医疗、法律等存在风险的领域,请在结尾提醒用户注意并进行免责说明。\n"
            f"搜索结果:\n{text.strip()}\n\n现在,请根据上面的搜索结果回答问题并标注引用,补全对话"
        )
        return text

    raise ValueError(f"Cannot extract knowledge from `{text}`")
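
Review note: the tag branches of `extract_knowledge` all use the same shape of regular expression, a non-greedy group under `re.DOTALL`, and keep only the first match. The following is a minimal, stand-alone sketch of that behavior, not the PR's code; the names `SEARCH_RES` and `sample` are made up for illustration.

```python
import re

# Same pattern shape as the search-res branch: a non-greedy capture that
# may span newlines because of re.DOTALL.
SEARCH_RES = re.compile(r"\[<search-res>\](.*?)\[</search-res>\]", re.DOTALL)

# Hypothetical input with two tagged blocks, the first spanning several lines.
sample = (
    "[<search-res>]\nfirst result\nspans two lines\n[</search-res>]"
    "[<search-res>]second result[</search-res>]"
)

matches = SEARCH_RES.findall(sample)

# extract_knowledge keeps only the first block and strips outer whitespace.
first = matches[0].strip()
print(matches)
print(first)
```

If input can legitimately carry several blocks of the same tag, everything after the first is silently dropped, which may be worth a line in the docstring.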


@dataclass
class Formatter(ABC):
    slots: SLOTS = field(default_factory=list)
    tool_format: Optional[str] = None

    @abstractmethod
    def apply(self, **kwargs) -> SLOTS:
        r"""Forms a list of slots according to the inputs to encode."""
        ...


@dataclass
class EmptyFormatter(Formatter):
    def __post_init__(self):
        has_placeholder = False
        for slot in filter(lambda s: isinstance(s, str), self.slots):
            if re.search(r"\{\{[a-zA-Z_][a-zA-Z0-9_]*\}\}", slot):
                has_placeholder = True

        if has_placeholder:
            raise ValueError("Empty formatter should not contain any placeholder.")

    @override
    def apply(self, **kwargs) -> SLOTS:
        return self.slots
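
Review note: both `__post_init__` checks rely on the same placeholder pattern, which accepts only `{{identifier}}` with no inner whitespace. A small sketch of what that pattern does and does not match follows; `PLACEHOLDER` and `has_placeholder` are hypothetical names that do not appear in the PR.

```python
import re

# Same pattern as in __post_init__: "{{", a Python-style identifier, "}}".
PLACEHOLDER = re.compile(r"\{\{[a-zA-Z_][a-zA-Z0-9_]*\}\}")


def has_placeholder(slots):
    """Hypothetical helper: True if any string slot contains a placeholder."""
    return any(isinstance(s, str) and PLACEHOLDER.search(s) for s in slots)


print(has_placeholder(["User: {{content}}\n"]))   # identifier inside braces
print(has_placeholder(["<s>", {"bos_token"}]))    # set slots are skipped
print(has_placeholder(["{{ content }}"]))         # inner spaces do not match
print(has_placeholder(["{{1st}}"]))               # cannot start with a digit
```

Because near-misses such as `{{ content }}` are not treated as placeholders, `EmptyFormatter` would accept them silently and `StringFormatter` would leave them unsubstituted.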


@dataclass
class StringFormatter(Formatter):
    def __post_init__(self):
        has_placeholder = False
        for slot in filter(lambda s: isinstance(s, str), self.slots):
            if re.search(r"\{\{[a-zA-Z_][a-zA-Z0-9_]*\}\}", slot):
                has_placeholder = True

        if not has_placeholder:
            raise ValueError("A placeholder is required in the string formatter.")

    @override
    def apply(self, **kwargs) -> SLOTS:
        elements = []
        for slot in self.slots:
            if isinstance(slot, str):
                for name, value in kwargs.items():
                    if not isinstance(value, str):
                        raise RuntimeError(f"Expected a string, got {name}: {value}")

                    # Only the first occurrence of each placeholder in a slot is replaced.
                    slot = slot.replace("{{" + name + "}}", value, 1)
                elements.append(slot)
            elif isinstance(slot, (dict, set)):
                elements.append(slot)
            else:
                raise RuntimeError(f"Input must be string, set[str] or dict[str, str], got {type(slot)}.")

        return elements
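
Review note: the substitution loop in `StringFormatter.apply` has two properties that are easy to miss. Each placeholder is replaced at most once per slot, and every keyword argument must already be a string. The stand-alone function below is a sketch that mirrors the loop for illustration only; `fill_slots` is a hypothetical name, not part of the PR.

```python
def fill_slots(slots, **kwargs):
    """Hypothetical stand-alone mirror of the StringFormatter.apply loop."""
    elements = []
    for slot in slots:
        if isinstance(slot, str):
            for name, value in kwargs.items():
                if not isinstance(value, str):
                    raise RuntimeError(f"Expected a string, got {name}: {value}")
                # count=1: only the first occurrence in this slot is replaced.
                slot = slot.replace("{{" + name + "}}", value, 1)
            elements.append(slot)
        elif isinstance(slot, (dict, set)):
            # Non-string slots (e.g. special-token markers) pass through untouched.
            elements.append(slot)
        else:
            raise RuntimeError(f"Unsupported slot type: {type(slot)}.")
    return elements


filled = fill_slots(
    ["User: {{content}}\n", {"bos_token"}, "{{content}} {{content}}"],
    content="hi",
)
print(filled)

# Non-string values are rejected before any substitution happens.
try:
    fill_slots(["Turn {{idx}}"], idx=1)
    rejected = False
except RuntimeError:
    rejected = True
print(rejected)
```

The second call shows why integer arguments such as a turn index need converting to `str` before they reach `apply`.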


@dataclass
class KnowledgeFormatter(StringFormatter):
    @override
    def apply(self, **kwargs) -> SLOTS:
        content: str = extract_knowledge(kwargs.pop("content")) + "\n"
        # StringFormatter.apply rejects non-string values, so convert idx first.
        idx: str = str(kwargs.pop("idx"))
        return super().apply(content=content, idx=idx)