bug/partition_html removes HTML tables when parsing #3717

Closed
deku0818 opened this issue Oct 12, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@deku0818

Describe the bug
partition_html removes HTML tables while parsing.

To Reproduce
from unstructured.partition.html import partition_html

# Parse the file and print the type and text of every element found.
res = partition_html(filename="/app/KnowledgeBase/test_user/kb0001/1226BL troubleshooting chart/1226BL troubleshooting chart.md")
for i, element in enumerate(res, 1):
    print(f"Element {i}:")
    print(f"Type: {type(element).__name__}")
    print(f"Content: {element.text}")
    print("-" * 50)

Expected behavior
It should output the content of the document, and it should not remove the table content, right?
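
One way to make this expectation checkable without relying on a screenshot is to list the table cell texts in the source file and see which of them never show up in the parsed elements. The sketch below uses only the standard library; `TableCellCollector` and `missing_table_cells` are hypothetical helper names introduced here for illustration, not part of unstructured, and the substring comparison is a rough heuristic that assumes cell text survives parsing unchanged apart from whitespace.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell found in an HTML string."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside a <td> or <th>
        self._buffer = []  # text fragments of the cell being read
        self.cells = []    # whitespace-normalized text of each non-empty cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.cells.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def missing_table_cells(source_html, element_texts):
    """Return the cell texts of source_html that occur in none of element_texts."""
    collector = TableCellCollector()
    collector.feed(source_html)
    collector.close()
    # Normalize whitespace so line breaks inside elements do not hide a match.
    haystack = " ".join(" ".join(text.split()) for text in element_texts)
    return [cell for cell in collector.cells if cell not in haystack]
```

With the reproduction above, the second argument would be `[element.text for element in res]` and the first the raw contents of the same file. A non-empty result lists the cells that were lost; an empty result suggests the table text was kept, possibly merged into other elements.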

Screenshots
image

Environment Info
OS version: Linux-5.4.0-70-generic-x86_64-with-glibc2.35
Python version: 3.10.14
unstructured version: 0.15.14
unstructured-inference is not installed
pytesseract is not installed
Torch version: 2.3.1
Detectron2 version: 0.6
PaddleOCR version: 3.0.0b1

Additional context
The screenshot shows part of the document's content.

@deku0818 deku0818 added the bug Something isn't working label Oct 12, 2024
@scanny scanny closed this as completed Dec 16, 2024