Optimize tagger logic using numpy #35
base: main
Conversation
Also update model file and a test case.
Doc test is failing with:
Any suggestion? I'm not sure whether "得" in this context is VERB or AUX.
I think AUX is definitely wrong here, but it isn't VERB either; I don't know what @jacksonllee thinks. I'm not that familiar with PyCantonese's annotation standards, but I think this "得" should be treated as a contraction of "得閒", so it should still be tagged ADJ?
Hello! Just a heads-up that I see this coming through and I'll be able to take a look at this PR this week. Thanks for the contribution!
Sorry for the delay. For the test failures, my hunch is that the noisy nature of the HKCanCor data might be to blame, but I'm digging into this and will come back with more comments. Thank you for your careful work!
```python
# Maps feature into column index of the weights matrix.
self._feature_to_index: Dict[Hashable, int] = {}
# Matrix represented by 2D array of shape (n_classes, n_features).
self._weights = numpy.zeros(1)
```
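For context, here is a minimal sketch (not the PR's actual code; the names, sizes, and feature strings are hypothetical) of how a feature-to-column mapping plus a dense weights matrix lets a tagger score all classes with one vectorized numpy call instead of per-feature dict arithmetic:

```python
import numpy

# Hypothetical sizes; a real tagger derives these from the training data.
n_classes, n_features = 4, 10
weights = numpy.zeros((n_classes, n_features), dtype=numpy.float32)
feature_to_index = {"bias": 0, "word=得": 1, "prev_tag=VERB": 2}

def score(active_features):
    # Gather the weight columns of the active features and sum them in a
    # single vectorized call, rather than looping over per-feature dicts.
    cols = [feature_to_index[f] for f in active_features if f in feature_to_index]
    return weights[:, cols].sum(axis=1)  # shape: (n_classes,)

best_class = int(numpy.argmax(score(["bias", "word=得"])))
```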
It looks like this `_weights` array is a pretty sparse matrix at the end of the day (2 million values, with ~93% being zero). What do you think about switching to one of the sparse matrix representations from scipy (I'd be okay with adding scipy as a dependency) to lower the memory footprint? Speaking of which, maybe we can use float32 instead of the default float64 in numpy/scipy to save more memory? I doubt that the higher precision of float64 matters for our purposes here.
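To make the suggestion concrete, here is a hedged sketch of the two memory levers being discussed (the shape is a hypothetical stand-in, not the tagger's real dimensions):

```python
import numpy
from scipy import sparse

dense64 = numpy.zeros((70, 30000))        # numpy defaults to float64
dense32 = dense64.astype(numpy.float32)   # float32 halves the footprint
print(dense64.nbytes, dense32.nbytes)     # 16800000 vs 8400000 bytes

# A CSR matrix stores only the nonzero entries plus index bookkeeping,
# so a ~93%-zero matrix would shrink considerably.
csr = sparse.csr_matrix(dense32)
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)
```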
Changed to float32. I tried scipy, but sparse matrices are not a good fit here: a sparse format that supports fast arithmetic is expensive to update, and we need to do such updates constantly in this case. As a result, it's much slower, e.g. a training iteration takes about 10 s while the current implementation takes under 1 s. After switching to float32, the memory footprint is under 8 MiB, which I don't think is really a concern. Also, importing scipy would itself take more than 8 MiB.
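As a rough illustration of the update-cost problem described above (hypothetical sizes; not the PR's actual benchmark):

```python
import numpy
from scipy import sparse

dense = numpy.zeros((70, 30000), dtype=numpy.float32)
dense[3, 12345] += 1.0   # perceptron-style weight nudge: cheap, in-place

csr = sparse.csr_matrix(dense)
# Writing to a previously-zero cell changes the CSR sparsity structure,
# which forces reallocation; scipy flags this with a SparseEfficiencyWarning.
csr[5, 999] += 1.0
```

Formats like LIL are cheap to update but slow at the arithmetic the tagger needs, which is the trade-off the comment above describes.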
No update since my last comment. As I remember, the problem is that the test data are too flaky and the updated code doesn't produce the same output.
For #33.

Use `numpy` to vectorize computations instead of doing vector/matrix computations with `dict`. The observed performance gain is around 3x, tested under the use case of https://github.com/CanCLID/typo-corrector. `train_tagger.py` shows a similar performance improvement.

To make the tests pass, I had to fix a bug: `prev, prev2 = self.START` should really be `prev2, prev = self.START`, since `prev2` is supposed to be the tag of the (i-2)th word (see the sketch below).

I also had to increase the training iteration count to make the tests pass. The expected test value `135` is changed to `136`, as `5` happens to have some weights in the trained model. The iteration count is now 50% higher, and the accuracy is over 98%.

Checklist:
- Passes `black` and `flake8` as well as the test suite.
- Updated `CHANGELOG.md` at the repository's root level.
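As a sketch of the unpacking fix described above, assuming a two-element START sentinel ordered as (tag at position i-2, tag at position i-1); the sentinel values here are hypothetical:

```python
# Hypothetical sentinel values; the real constant lives on the tagger class.
START = ("-START2-", "-START-")  # assumed order: (tag at i-2, tag at i-1)

prev, prev2 = START   # buggy: prev2 ends up holding the i-1 sentinel
prev2, prev = START   # fixed: prev2 is the tag of the (i-2)th word
```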