Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wiktextract typing #432

Merged
merged 12 commits into from
Jan 3, 2024
Merged

Wiktextract typing #432

merged 12 commits into from
Jan 3, 2024

Conversation

kristian-clausal
Copy link
Collaborator

I'm going to start slowly typing Wiktextract too (I already found some stuff that I maybe should have PRed separately, but it's such a small issue it doesn't matter at all: --statistics isn't actually implemented).

Keep on trucking with whatever is being done, I will manually merge if need be.

This is low priority, but really nice to have.

Current workflow: alphabetically. categories.py, config.py...

--statistics command line parameter was never implemented.

There are bits of logic and code left over from the previous
attempt, like in merge_return that assumes somewhere certain
statistics data has been collected to be handled in the
central process, but that statistics collection has never
actually been done except for one place: wxr.config.section_counts
gets incremented in en/page.py, BUT even that data is then
discarded if multiprocessing is used and merging is required
(because there is no data sent with the merge).

Using --statistics doesn't actually print any statistics, except
if you are processing one page (or one process): then you get
section_counts.

What is currently typed as CollagedErrorReturnData is assumed to be
actually a larger collection that includes these count statistics.

I've commented out the code related to all of this; nothing
should break, because nothing actually touches this.

To make it easier to find all these random bits, I've commented
them with the term `STATISTICS_IMPLEMENTATION`.
Set mypy to use Python 3.9 as baseline in pyproject.toml.

Add typing to easy files.

Found out that --statistics wasn't being actually used,
so commented that out for later in the previous commit.
src/wiktextract/config.py Outdated Show resolved Hide resolved
src/wiktextract/config.py Outdated Show resolved Hide resolved
src/wiktextract/clean.py Outdated Show resolved Hide resolved
Because the code in extractor/en/thesaurus.py will
fail due to missing/None assignments from .get()
in any case, it's quicker to fail using indexing.

Similarly, some other places are also changed to
use [] instead.
For some reason HEAD_TAG_RE was being initialized in a specific
function for it, without any apparent reason; 1 interaction
with HEAD_TAG_RE in the codebase, no branching, no changes
to HEAD_TAG_RE. Most probably remnants of some removed logic
or an to-do that was forgotten or became unnecessary.

This was previously touched by xxyzz to fix a but in the
regex, but the weird initialization function is from earlier.

A test was modified.
)


WordData = dict[str, Union[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think list[dict[str, any]] is fine for parse_page(), since these functions are not directly called and the data are written to JSON file soon after. Plus mypy doesn't work well with union types.

And could you please format the newly added code and rebase the commits instead of merge master branch?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between rebasing and merging here?

Copy link
Collaborator

@xxyzz xxyzz Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase is like the branch is created at the current master branch so the commits will looks like they are created linearly otherwise old commits could appear before newer commits in the master branch after merge and it's harder to read the git log history, and rebase don't have distracting merge commits. Git book has some nice pictures: https://git-scm.com/book/en/v2/Git-Branching-Rebasing

Basically I think it's like changing your start point, like the latest master commit but not the outdated commit. And also has other features like squash commits and reword a commit message.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried rebasing, take a look at the commit history now. 👍 I will soon undo this and revert to previous commits, but suffice to say I will not be using rebase for this branch. I might restart the branch just to keep things simple and try, try if rebasing makes any sense then.

Copy link
Collaborator

@xxyzz xxyzz Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much better now, however there are still two extra merge commits. You could create a new branch from the wiktextract-typing branch locally and experimenting on that branch, then delete the old wiktextract-typing and rename the new branch. Wait, this is not rebase, old commits are still there. Usually we need to force push after a rebase.

A comment to 0d41c02: the long HEAD_TAG_RE was created in a function before is because language data are loaded later from JSON files so they are not available from the beginning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, I thought I might have missed something.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how could you merge these commits and start over?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've merged the branch and now I'll start a new branch from master. Then I can use rebase to keep the branch updated without weird duplicate commits that happened when I rebased previously when I already had merges there. When I rebased previously, there were suddenly 18 commits instead of the actual 12, with duplicates. I would probably have to undo all the merges somehow to make a 'clean' branch and rebase that.

Continuing making small type checking changes in a new branch should be fine. It's not like it's actually code that breaks things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think of the merge commits, not sure how to rebase in that case...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could start a new branch from the remote master branch and then cherry-pick individual commits from your working branch.

It's a bit of work and a good lesson for the future to keep the git history clean. 😁 A lesson I had to learn myself recently. 😇

@kristian-clausal kristian-clausal marked this pull request as ready for review January 3, 2024 07:58
@kristian-clausal kristian-clausal merged commit 1e2ff3e into master Jan 3, 2024
15 checks passed
@kristian-clausal kristian-clausal deleted the wiktextract-typing branch January 3, 2024 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants