CU-8696nbm03: Remove unigram table #503

mart-r · 2024-11-14T16:18:23Z

The unigram table we've been using is massive (an array of 10M-100M elements).
As such, it takes up a lot of space when it gets saved to disk.

But really, it is just a long array of word indices, each repeated in proportion to the word's expected frequency (across all words).
E.g. if two words had equal counts, their indices would appear the same number of times in the unigram table. And if some words had far higher counts, their indices would be far more frequent in the unigram table.
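
To make the idea concrete, here is a minimal sketch of such a table with made-up counts. It is not the project's actual construction code: the names `build_unigram_table`, `counts` and `table_size` are hypothetical, and any smoothing the real table may apply to the counts is left out.

```python
import numpy as np

def build_unigram_table(counts, table_size):
    """Repeat each word index in proportion to its share of the total count."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    # Number of slots per word index; because of rounding, the final
    # length can differ slightly from table_size.
    repeats = np.round(probs * table_size).astype(int)
    return np.repeat(np.arange(len(counts)), repeats)

# Hypothetical counts: words 0 and 1 are equally common, word 2 is far more common.
counts = [10, 10, 80]
table = build_unigram_table(counts, table_size=100)

# How many slots each word index occupies in the table.
slots_per_word = np.bincount(table, minlength=len(counts))

# Negative sampling then amounts to picking random positions in the table.
rng = np.random.default_rng(0)
negatives = table[rng.integers(0, len(table), size=3)]
```

Sampling is a cheap array lookup, but the table has to be stored in full, which is what makes the saved vocab large.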

What this PR does is remove the unigram table in favour of another approach.
The new approach computes the frequency of every word and then the cumulative probabilities for each word index (a sorted array that starts near 0 and ends at 1, since we're adding up all the probabilities).
When indices are needed for negative sampling, it generates random numbers between 0 and 1 and uses np.searchsorted to find the positions at which they would have to be inserted to keep the array sorted. Those positions are the sampled word indices.
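
A minimal sketch of this cumulative-probability sampling, assuming raw counts are used as-is. The function names `cumulative_probs` and `sample_indices` and the example counts are hypothetical and not taken from the PR's code.

```python
import numpy as np

def cumulative_probs(counts):
    """Cumulative probabilities per word index (sorted, ending at 1)."""
    counts = np.asarray(counts, dtype=float)
    cum = np.cumsum(counts / counts.sum())
    cum[-1] = 1.0  # guard against floating-point drift in the last entry
    return cum

def sample_indices(cum_probs, n, rng):
    """Draw n word indices with probability proportional to their counts."""
    r = rng.random(n)  # uniform in [0, 1)
    # Index i is returned when cum_probs[i-1] < r <= cum_probs[i], so the
    # chance of hitting i equals the probability mass of word i.
    return np.searchsorted(cum_probs, r)

# Hypothetical counts: words 0 and 1 are equally common, word 2 is far more common.
cum = cumulative_probs([10, 10, 80])
rng = np.random.default_rng(42)
samples = sample_indices(cum, 100_000, rng)
observed = np.bincount(samples, minlength=3) / len(samples)
```

Here `observed` should come out close to the input proportions (0.1, 0.1, 0.8). Only one float per word needs to be stored, rather than millions of repeated indices.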

The PR also adds a test that makes sure the new method maintains expected frequency of words (in a simple example).

There are a few advantages to this new approach:

- The saved vocab.dat will be smaller*
  - With a 10M-length unigram table, the vocab is 314MB
  - With a 100M-length unigram table, the vocab is over 800MB
  - With the new approach, the vocab is 239MB
- Loading a smaller vocab (without a unigram table) is a little faster
  - From 0.62s with a 10M unigram table or 0.72s with a 100M unigram table
  - Down to 0.55s with cumulative frequencies (no unigram table)
- Computational cost stays roughly the same** (faster when getting 3, 9 or 18 at a time, somewhat slower when getting 27)
  - Tested with 100 000 repeats (so the time per individual call is far smaller)
  - When getting 3 at a time
    - Unigram table took: 0.5745s
    - New approach took: 0.2842s
  - When getting 9 at a time
    - Unigram table took: 0.5684s
    - New approach took: 0.4996s
  - When getting 18 at a time
    - Unigram table took: 0.8640s
    - New approach took: 0.6791s
  - When getting 27 at a time
    - Unigram table took: 0.8223s
    - New approach took: 0.9714s

* NOTE: The most widely used vocab currently has a unigram table of length 10M, but the default has long been 100M, so a unigram table newly calculated with no extra input will have length 100M.
** NOTE: The number of samples required at a time is dictated by the context vector sizes defined in the config (config.linking.context_vector_sizes). By default these are 3, 9, 18, and 27.
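
A timing comparison along these lines could be set up roughly as follows. This is only a sketch under stated assumptions, not the script used for the numbers above: the vocabulary counts are random, the table has 1M entries to keep memory use low, and the names `sample_table`, `sample_cumulative` and `benchmark` are hypothetical.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

# Made-up vocabulary: 5 000 words with random counts.
counts = rng.integers(100, 1000, size=5_000).astype(float)
probs = counts / counts.sum()

# Old approach: table of repeated word indices (1M entries in this sketch).
table = np.repeat(np.arange(len(counts)), np.round(probs * 1_000_000).astype(int))

# New approach: cumulative probabilities, one float per word.
cum_probs = np.cumsum(probs)
cum_probs[-1] = 1.0  # guard against floating-point drift

def sample_table(n):
    """Pick n random positions in the unigram table."""
    return table[rng.integers(0, len(table), size=n)]

def sample_cumulative(n):
    """Locate n uniform random numbers in the cumulative array."""
    return np.searchsorted(cum_probs, rng.random(n))

def benchmark(repeats=100_000, sizes=(3, 9, 18, 27)):
    """Return {sample size: (table seconds, cumulative seconds)} for `repeats` calls each."""
    results = {}
    for n in sizes:
        t_table = timeit.timeit(lambda: sample_table(n), number=repeats)
        t_cum = timeit.timeit(lambda: sample_cumulative(n), number=repeats)
        results[n] = (t_table, t_cum)
    return results
```

Calling `benchmark()` gives total times for 100 000 calls per sample size. Absolute values depend on the machine and vocabulary size, so they are not expected to match the figures quoted above.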

tomolopolis · 2024-11-14T16:18:30Z

Task linked: CU-8696nbm03 Remove use of unigram table

adam-sutton-1992

Looks good to me for decent speed ups.

mart-r added 8 commits November 14, 2024 14:38

- CU-8696nbm03: Remove use of unigram table (a46a48f)
- CU-8696nbm03: Fix usage of new unigram table alternative (3389d2c)
- CU-8696nbm03: Remove unigram table from loaded vocabs (d561428)
- CU-8696nbm03: Add tests for unigram table usage/negative sampling frequency (0a42064)
- CU-8696nbm03: Add small comment to tests (1670755)
- CU-8696nbm03: Calculate frequencies upon load if not present (a7bf18f)
- CU-8696nbm03: Update comment regarding probability calculatioons (d8898ad)
- CU-8696nbm03: Remove commented test case (107f334)
- CU-8696n7w95: Fix docstring issue (6756739)

mart-r mentioned this pull request Nov 14, 2024

CU-8696nbm9j: Add module to convert vocab vectors #504

Merged

mart-r added 2 commits November 14, 2024 19:04

- CU-8696nbm03: Fix serialisation tests (89288f8)
- CU-8696nbm03: Add python 3.9-friendly method for getting the total of a counter (da2f927)

adam-sutton-1992 approved these changes Nov 26, 2024

mart-r merged commit 3c44dcb into master Nov 27, 2024
7 checks passed

mart-r deleted the CU-8696nbm03-remove-unigram-table branch January 23, 2025 09:53

mart-r restored the CU-8696nbm03-remove-unigram-table branch February 18, 2025 16:24

mart-r deleted the CU-8696nbm03-remove-unigram-table branch February 18, 2025 16:26

mart-r mentioned this pull request Mar 25, 2025

CU-8698f8fgc: Fix negative sampling including indices for words without a vector #524

Merged
