-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wiktextract typing #432
Wiktextract typing #432
Conversation
--statistics command line parameter was never implemented. There are bits of logic and code left over from the previous attempt, like in merge_return that assumes somewhere certain statistics data has been collected to be handled in the central process, but that statistics collection has never actually been done except for one place: wxr.config.section_counts gets incremented in en/page.py, BUT even that data is then discarded if multiprocessing is used and merging is required (because there is no data sent with the merge). Using --statistics doesn't actually print any statistics, except if you are processing one page (or one process): then you get section_counts. What is currently typed as CollagedErrorReturnData is assumed to be actually a larger collection that includes these count statistics. I've commented out the code related to all of this; nothing should break, because nothing actually touches this. To make it easier to find all these random bits, I've commented them with the term `STATISTICS_IMPLEMENTATION`.
Set mypy to use Python 3.9 as baseline in pyproject.toml. Add typing to easy files. Found out that --statistics wasn't being actually used, so commented that out for later in the previous commit.
Because the code in extractor/en/thesaurus.py will fail due to missing/None assignments from .get() in any case, it's quicker to fail using indexing. Similarly, some other places are also changed to use [] instead.
For some reason HEAD_TAG_RE was being initialized in a specific function for it, without any apparent reason; 1 interaction with HEAD_TAG_RE in the codebase, no branching, no changes to HEAD_TAG_RE. Most probably remnants of some removed logic or an to-do that was forgotten or became unnecessary. This was previously touched by xxyzz to fix a but in the regex, but the weird initialization function is from earlier. A test was modified.
) | ||
|
||
|
||
WordData = dict[str, Union[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think list[dict[str, any]]
is fine for parse_page()
, since these functions are not directly called and the data are written to JSON file soon after. Plus mypy doesn't work well with union types.
And could you please format the newly added code and rebase the commits instead of merge master branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between rebasing and merging here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebase is like the branch is created at the current master branch so the commits will looks like they are created linearly otherwise old commits could appear before newer commits in the master branch after merge and it's harder to read the git log history, and rebase don't have distracting merge commits. Git book has some nice pictures: https://git-scm.com/book/en/v2/Git-Branching-Rebasing
Basically I think it's like changing your start point, like the latest master commit but not the outdated commit. And also has other features like squash commits and reword a commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried rebasing, take a look at the commit history now. 👍 I will soon undo this and revert to previous commits, but suffice to say I will not be using rebase for this branch. I might restart the branch just to keep things simple and try, try if rebasing makes any sense then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's much better now, however there are still two extra merge commits. You could create a new branch from the wiktextract-typing
branch locally and experimenting on that branch, then delete the old wiktextract-typing
and rename the new branch. Wait, this is not rebase, old commits are still there. Usually we need to force push after a rebase.
A comment to 0d41c02: the long HEAD_TAG_RE
was created in a function before is because language data are loaded later from JSON files so they are not available from the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew, I thought I might have missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how could you merge these commits and start over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've merged the branch and now I'll start a new branch from master. Then I can use rebase to keep the branch updated without weird duplicate commits that happened when I rebased previously when I already had merges there. When I rebased previously, there were suddenly 18 commits instead of the actual 12, with duplicates. I would probably have to undo all the merges somehow to make a 'clean' branch and rebase that.
Continuing making small type checking changes in a new branch should be fine. It's not like it's actually code that breaks things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think of the merge commits, not sure how to rebase in that case...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could start a new branch from the remote master branch and then cherry-pick individual commits from your working branch.
It's a bit of work and a good lesson for the future to keep the git history clean. 😁 A lesson I had to learn myself recently. 😇
6a5322d
to
8c87081
Compare
I'm going to start slowly typing Wiktextract too (I already found some stuff that I maybe should have PRed separately, but it's such a small issue it doesn't matter at all: --statistics isn't actually implemented).
Keep on trucking with whatever is being done, I will manually merge if need be.
This is low priority, but really nice to have.
Current workflow: alphabetically. categories.py, config.py...