[en] Adjustment to form_description heuristics
"clipping" was skipped because its distw() against "clip" was
too high (0.5, above the 0.4 cutoff), so I added yet another
kludge for when the language section is English: if a part starts
with one of the title words, accept it anyway. This will cause
problems elsewhere, but we'll hunt those down...
kristian-clausal committed Jan 7, 2025
1 parent 9a96ef4 commit ab7cc50
Showing 1 changed file with 19 additions and 10 deletions.
29 changes: 19 additions & 10 deletions src/wiktextract/extractor/en/form_descriptions.py
@@ -2447,16 +2447,25 @@ def strokes_repl(m: re.Match) -> str:
 if (
     i > 1
     and len(parts[i - 1]) >= 4
-    and distw(titleparts, parts[i - 1]) <= 0.4
-    # Fixes wiktextract #983, where "participle"
-    # was too close to "Martinize" and so this accepted
-    # ["participle", "Martinize"] as matching; this
-    # kludge prevents this from happening if titleparts
-    # is shorter than what would be 'related'.
-    # This breaks if we want to detect stuff that
-    # actually gets an extra space-separated word when
-    # 'inflected'.
-    and len(titleparts) >= len(parts[i - 1:])
+    and (
+        distw(titleparts, parts[i - 1]) <= 0.4
+        # Fixes wiktextract #983, where "participle"
+        # was too close to "Martinize" and so this accepted
+        # ["participle", "Martinize"] as matching; this
+        # kludge prevents this from happening if titleparts
+        # is shorter than what would be 'related'.
+        # This breaks if we want to detect stuff that
+        # actually gets an extra space-separated word when
+        # 'inflected'.
+        or (
+            wxr.wtp.section == "English"
+            and any(
+                parts[i - 1].startswith(title)
+                for title in titleparts
+            )
+        )
+    )
+    and len(titleparts) >= len(parts[i - 1 :])
 ):
     # print(f"Reached; {parts=}, {parts[i-1]=}")
     alt_related = related
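
To make the effect of the change easier to follow, here is a minimal, self-contained sketch of the condition before and after this commit. It is not the project's code: `levenshtein`, `distw_sketch`, and `accepts` are hypothetical helpers, `distw_sketch` only assumes that distw() behaves like an edit distance normalized by the longer word's length (which is consistent with the 0.5 quoted for "clipping" vs. "clip"), and the `parts` list in the usage note is made up for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # delete ca
                    cur[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = cur
    return prev[-1]


def distw_sketch(titleparts: Sequence[str], word: str) -> float:
    """ASSUMED stand-in for distw(): 0.0 means identical to some title
    word, 1.0 means completely different from all of them."""
    return min(
        levenshtein(word, tw) / max(len(tw), len(word)) for tw in titleparts
    )


def accepts(
    titleparts: Sequence[str],
    parts: Sequence[str],
    i: int,
    section: str,
    english_kludge: bool = True,
) -> bool:
    """Mirror of the `if` condition in the diff; `english_kludge=False`
    reproduces the condition as it was before this commit."""
    word = parts[i - 1]
    close_enough = distw_sketch(titleparts, word) <= 0.4
    starts_with_title = (
        english_kludge
        and section == "English"
        and any(word.startswith(title) for title in titleparts)
    )
    return (
        i > 1
        and len(word) >= 4
        and (close_enough or starts_with_title)
        and len(titleparts) >= len(parts[i - 1 :])
    )
```

With `titleparts = ["clip"]` and the hypothetical `parts = ["gerund", "clipping"]`, `accepts(titleparts, parts, 2, "English", english_kludge=False)` is false because the normalized distance is 0.5, while the same call with the kludge enabled is true. In a non-English section the result stays false either way. Note that the prefix test is broad: any part of four or more characters that begins with a title word passes it, which is presumably the kind of side effect the commit message expects to have to hunt down.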