Refactor extension #11

Merged
merged 4 commits into from
Jul 9, 2024


shaokeyibb
Collaborator

This PR refactors the entire extension and bumps the version to v0.1.0. The following changed:

  1. The main offset calculation logic has changed: we now use difflib to generate diff opcodes.
  2. Added a debug mode option.
  3. Tests now skip elements whose offsets are not accurate.

@Enter-tainer
Member

Could you describe how this version works?

@Enter-tainer
Member

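It would probably be best to have a tool that makes large-scale testing convenient. For example, we could also emit the corresponding original text into the HTML and add some CSS to render it as an overlay, so that it is obvious at a glance where there are problems and where there are none.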

@shaokeyibb
Collaborator Author

shaokeyibb commented Jul 9, 2024

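> It would probably be best to have a tool that makes large-scale testing convenient. For example, we could also emit the corresponding original text into the HTML and add some CSS to render it as an overlay, so that it is obvious at a glance where there are problems and where there are none.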

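The original text is already in the HTML, in a data-original-document attribute. As for the CSS, I'm afraid that would have to be done on the mkdocs side; adding it directly to the compiler output would be odd.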

@shaokeyibb
Collaborator Author

shaokeyibb commented Jul 9, 2024

> Could you describe how this version works?

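First, compute the diff between the document before preprocessing (the original document) and the document after preprocessing (where some of the content has been overwritten with placeholders). This is done with Python's built-in difflib module. From that diff, generate restore_opcodes, which describe the operations (insert, delete, replace) needed to restore the latter to the former. In the block processor stage, I use these opcodes to try to match each block's offset in the source document, either exactly or fuzzily. The accuracy is decent.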
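The idea described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the helper names `restore_opcodes` and `map_offset` and the `\x02PH0\x03` placeholder are hypothetical, and the sketch diffs the processed text against the original directly instead of inverting a forward diff.

```python
import difflib


def restore_opcodes(original, processed):
    """Opcodes describing how to turn `processed` back into `original`.

    Hypothetical helper: a=processed, b=original, so each opcode reads as
    "do this to the processed text to get the original text".
    """
    matcher = difflib.SequenceMatcher(None, processed, original, autojunk=False)
    return matcher.get_opcodes()


def map_offset(opcodes, pos):
    """Map an offset in the processed text to (offset in original, exact?)."""
    for tag, i1, i2, j1, j2 in opcodes:
        if i1 <= pos < i2:
            if tag == "equal":
                # Unchanged text: shift by the same distance into the span.
                return j1 + (pos - i1), True
            # Replaced or deleted text: only the span start is known,
            # so this is a fuzzy (inexact) match.
            return j1, False
    # pos is past the last processed character: map to the end of the original.
    end = opcodes[-1][4] if opcodes else 0
    return end, True


# A math block replaced by a made-up placeholder during preprocessing.
original = "Intro\n\n$$x^2$$\n\nOutro\n"
processed = "Intro\n\n\x02PH0\x03\n\nOutro\n"

ops = restore_opcodes(original, processed)
print(map_offset(ops, processed.index("Outro")))
```

Offsets that land in an `equal` span map back exactly, while offsets inside a placeholder can only be pinned to the start of the replaced region, which is one way an "exact or fuzzy" distinction can arise.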

@Enter-tainer
Member

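What I'm hoping for is a convenient testing tool that needs little human involvement, so that we can easily verify the correctness of our algorithm on oi-wiki/oi-wiki.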

@shaokeyibb
Collaborator Author

shaokeyibb commented Jul 9, 2024

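> What I'm hoping for is a convenient testing tool that needs little human involvement, so that we can easily verify the correctness of our algorithm on oi-wiki/oi-wiki.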

There is one. Run the mkdocs build in debug mode and you will see some WARN and ERROR messages:

WARNING - Trying to patch a block without original document start, patching to current block offset in {}-{}

This means the text block was resolved inaccurately (some text may be missing).

ERROR - Failed to restore the document offsets for the block {}-{}, restored {}-{}

This means the text block could not be restored to the original document, so no offset is generated for it at all.

In addition, debug mode inserts the attributes data-offset-accurate-start and data-offset-accurate-end. Both are boolean and indicate whether the start/end of the offset was matched exactly.
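As a rough illustration of how these debug attributes could be checked in bulk, the sketch below counts annotated elements in a built HTML page using only the standard library. The class and function names are hypothetical, and the sketch assumes the booleans are serialized as "true"/"false" (in any letter case), which the discussion does not specify.

```python
from html.parser import HTMLParser


def _is_true(value):
    # Assumption: booleans appear as "true"/"false", in any letter case.
    return value is not None and value.strip().lower() == "true"


class AccuracyCounter(HTMLParser):
    """Counts elements carrying the debug accuracy attributes."""

    def __init__(self):
        super().__init__()
        self.total = 0    # elements with at least one accuracy attribute
        self.inexact = 0  # elements whose start or end is not exact

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        start = attrs.get("data-offset-accurate-start")
        end = attrs.get("data-offset-accurate-end")
        if start is None and end is None:
            return  # element was not annotated
        self.total += 1
        if not (_is_true(start) and _is_true(end)):
            self.inexact += 1


def count_accuracy(html):
    """Return (annotated elements, elements with an inexact start or end)."""
    counter = AccuracyCounter()
    counter.feed(html)
    counter.close()
    return counter.total, counter.inexact


sample = (
    '<p data-offset-accurate-start="true" data-offset-accurate-end="true">a</p>'
    '<p data-offset-accurate-start="true" data-offset-accurate-end="false">b</p>'
)
print(count_accuracy(sample))
```

Running `count_accuracy` over every generated page would give a per-page ratio of inexact matches without anyone reading the HTML by hand.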

@Enter-tainer
Member

That's not really a test, though. It's just some logs your algorithm prints while it runs.

@shaokeyibb
Collaborator Author

> That's not really a test, though. It's just some logs your algorithm prints while it runs.

True, it isn't. But actually doing a real check would be easy; a script would probably be enough.
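Such a script could be as small as the following sketch, which tallies the two messages quoted earlier from a saved build log. The names `PATTERNS` and `tally` are hypothetical, and it assumes each message sits on its own line after a "WARNING -" or "ERROR -" prefix; how the log is captured to a file is left open.

```python
import re
from collections import Counter

# Message prefixes taken from the log lines quoted in this thread.
PATTERNS = {
    "inexact": re.compile(
        r"WARNING\s+-\s+Trying to patch a block without original document start"
    ),
    "failed": re.compile(
        r"ERROR\s+-\s+Failed to restore the document offsets for the block"
    ),
}


def tally(log_lines):
    """Count inexact-patch warnings and restore failures in build output."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Made-up sample lines in the assumed format, for illustration only.
sample_log = [
    "WARNING - Trying to patch a block without original document start, "
    "patching to current block offset in 10-42",
    "ERROR - Failed to restore the document offsets for the block 50-80, "
    "restored 50-60",
    "INFO - Building documentation...",
]
print(dict(tally(sample_log)))
```

A CI job could then fail the build when the "failed" count is non-zero or when the "inexact" count rises compared with a previous run.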

@shaokeyibb shaokeyibb merged commit d8259d3 into master Jul 9, 2024
2 checks passed
@shaokeyibb shaokeyibb deleted the hikarilan branch July 9, 2024 12:25