[en] Filter out LINK nodes that are files from headers
Fixes #910

File embeds are parsed as links, so if one has alt text, that alt text
would show up in heads: we handle En heads by walking the nodes
directly rather than using clean_value.

We might want to consider creating LinkNode and FileLinkNode classes,
like we have for TemplateNode, so that we can more easily filter
out file links.
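
The check this commit adds can be sketched in isolation. This is a minimal illustration, not the extractor's code: `Node` and the two-member `NodeKind` below are stand-ins for the real wikitextprocessor classes, `is_file_link` is a hypothetical helper name, and the sample children are made up.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeKind(Enum):  # stand-in for the real NodeKind (assumption)
    TEMPLATE = auto()
    LINK = auto()


@dataclass
class Node:  # stand-in for WikiNode; only the fields the check needs
    kind: NodeKind
    largs: list = field(default_factory=list)


def is_file_link(node: object) -> bool:
    """Return True for LINK nodes whose target starts with "file:"."""
    if not isinstance(node, Node) or node.kind != NodeKind.LINK:
        return False
    if not node.largs or not node.largs[0]:
        return False
    target = node.largs[0][0]
    # The first argument part may be a node rather than a string,
    # so check the type before calling str methods on it.
    return isinstance(target, str) and target.lower().startswith("file:")


# Made-up header children: plain text, a file embed with alt text,
# and an ordinary link.
children = [
    "word ",
    Node(NodeKind.LINK, [["File:Example.jpg"], ["thumb"], ["alt text"]]),
    Node(NodeKind.LINK, [["other page"]]),
]
kept = [c for c in children if not is_file_link(c)]
```

Here `kept` retains the plain text and the ordinary link and drops the file embed, so its alt text never reaches the head. A dedicated FileLinkNode class would replace the string-prefix test with an isinstance check.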
kristian-clausal committed Nov 15, 2024
1 parent d1aaa63 commit 11ede27
Showing 1 changed file with 12 additions and 3 deletions.
15 changes: 12 additions & 3 deletions src/wiktextract/extractor/en/page.py
@@ -1165,8 +1165,16 @@ def parse_part_of_speech(posnode: WikiNode, pos: str) -> None:
  posnode.children,
  lambda x: (
  isinstance(x, WikiNode)
- and x.kind == NodeKind.TEMPLATE
- and x.largs[0][0] in FLOATING_TABLE_TEMPLATES
+ and (
+ (
+ x.kind == NodeKind.TEMPLATE
+ and x.largs[0][0] in FLOATING_TABLE_TEMPLATES
+ )
+ or (
+ x.kind == NodeKind.LINK
+ and x.largs[0][0].lower().startswith("file:") # type:ignore[union-attr]
+ )
+ )
  ),
  )
  tempnode = WikiNode(NodeKind.LEVEL6, 0)
@@ -1445,6 +1453,7 @@ def process_gloss_header(
  new_nodes = []
  info_template_data = []
  for node in header_nodes:
+ # print(f"{node=}")
  info_data, info_out = parse_info_template_node(wxr, node, "head")
  if info_data or info_out:
  if info_data:
@@ -1913,7 +1922,7 @@ def extract_link_texts(item: GeneralNode) -> None:
  elif rawgloss == "Technical or specialized senses.":
  rawgloss = ""
  elif rawgloss.startswith("inflection of "):
- parsed = parse_alt_or_inflection_of(wxr, rawgloss, set())
+ parsed = parse_alt_or_inflection_of(wxr, rawgloss, set())
  if parsed is not None:
  tags, origins = parsed
  if origins is not None: