Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize tagger logic using numpy #35

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

ZhanruiLiang
Copy link

@ZhanruiLiang ZhanruiLiang commented Oct 17, 2022

For #33.
Use numpy to vectorize computations instead of doing vector/matrix computations using dict.
The observed performance gain is around 3x, tested under the use case of https://github.com/CanCLID/typo-corrector. train_tagger.py shows similar performance improvement.

To make the tests pass, I had to change a bug about prev, prev2 = self.START. It should really be prev2, prev = self.START as prev2 is supposed be the tag of the i-2th word.

Also had to increase the training interation count to make tests pass. The test 135 is changed to 136 as 5 happens to have some weights in the trained model. The iteration count is now 50% and the accuracy is over 98%.

  • Add a concise title to this pull request on the GitHub web interface.
  • Add a description in this box to describe what this pull request is about.
  • If code behavior is being updated (e.g., a bug fix), relevant tests should be added.
  • The CircleCI builds should pass, including both the code styling checks by
    black and flake8 as well as the test suite.
  • Add an entry to CHANGELOG.md at the repository's root level.

@ZhanruiLiang
Copy link
Author

Doc test is failing with:

Example at /home/circleci/project/docs/source/parsing.rst, line 26, column 1 did not evaluate as expected:
Expected:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得         唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  VERB|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>
Got:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得        唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  AUX|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>

Any suggestion? I'm not sure whether "得" in this context is VERB or AUX.

@ZhanruiLiang ZhanruiLiang changed the title Optimize tagger logic using numpy (#33) Optimize tagger logic using numpy Oct 17, 2022
@laubonghaudoi
Copy link

我覺得個 AUX 肯定係唔啱嘅,但係亦都唔係 VERB,唔知 @jacksonllee 點睇。我唔係瞭解 PyCantonese 嘅標準,我覺得應該係將「得」視作「得閒」嘅縮略所以都係標成 ADJ?

@jacksonllee
Copy link
Owner

Hello! Just a heads-up that I see this coming through and I'll be able to take a look at this PR this week. Thanks for the contribution!

Copy link
Owner

@jacksonllee jacksonllee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. For the test failures, my hunch is that the noisy nature of the HKCanCor data might be to blame, but I'm digging into this and will come back with more comments. Thank you for your careful work!

# Maps feature into column index of the weights matrix.
self._feature_to_index: Dict[Hashable, int] = {}
# Matrix represented by 2D array of shape (n_classes, n_features).
self._weights = numpy.zeros(1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this _weights array is a pretty sparse matrix at the end of the day (2 million values, with ~93% being zero). What do you think about switching to one of those sparse matrix representations from scipy (I'd be okay with adding scipy as a dependency) to lower the memory footprint? Speaking of which, maybe we can use float32 instead of the default float64 in numpy/scipy to save more memory? I doubt if the higher precision of float64 matters for our purposes here.

Copy link
Author

@ZhanruiLiang ZhanruiLiang Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to float32. I tried but scipy sparse matrices are not a good fit here. It's expensive to update sparse matrix that supports fast arithmetic, but we need to do such updates in this case. As a result, it's much slower, e.g. a training iteration takes like 10s while the current impl takes <1s. After changing the float32, the memory footprint is <8 MiB which I don't think it's really a concern. Also, importing scipy will definitely take more than 8 MiB.

@laubonghaudoi
Copy link

想問下呢個有冇咩進展?希望可以解決到 #33 噉樣就可以發佈個新版,解決埋 #43

@ZhanruiLiang
Copy link
Author

No update since my last comment. As I remember, the problem is that test data are too flaky and the updated code doesn't produce the same output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants