
Could do with some intuition into how get_residual_entropy method works :) #3

Open
hughperkins opened this issue Sep 6, 2021 · 0 comments

hughperkins commented Sep 6, 2021

Hi, I've been trying to understand the get_residual_entropy method, and can't quite figure it out :) Awesome paper, by the way :)

Really, my underlying need is to be able to use it on input where the language L has a vocab size larger than 2. I'm unclear whether that's possible in the existing implementation? I'm sort of guessing that it uses a greedy approach to minimize over all partitions, and that the greedy approach only works for a vocab size of 2 in L?

One question I'm wondering is: is the current implementation actually normalizing, as per the formula in section 5.1 of the paper? My entropy skills are not that great, so I may be misunderstanding. Do you mind if I walk through a simplish case study, and you can tell me where I'm going wrong?

Let's suppose that we have num categories = 2, and each category C[i] can take 3 possible values. The utterances will have a vocab size of 2 (as per the paper), and total 6 bits.

Then, L* will comprise the following:

tensor([[0, 0],
        [0, 1],
        [0, 2],
        [1, 0],
        [1, 1],
        [1, 2],
        [2, 0],
        [2, 1],
        [2, 2]])

Actually, you are offsetting each category, so per your convention L* will look like:

meanings_for_resnick tensor([[0, 3],
        [0, 4],
        [0, 5],
        [1, 3],
        [1, 4],
        [1, 5],
        [2, 3],
        [2, 4],
        [2, 5]])
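
(For reference, this is roughly how I'm building meanings_for_resnick for this walkthrough; the per-category offset is just my reading of the code's convention, so correct me if I've got that wrong.)

import itertools
import torch

num_categories = 2
num_att_vals = 3

# enumerate every meaning, offsetting category i by i * num_att_vals,
# so that each attribute position draws values from its own disjoint range
meanings_for_resnick = torch.tensor([
    [i * num_att_vals + v for i, v in enumerate(meaning)]
    for meaning in itertools.product(range(num_att_vals), repeat=num_categories)
])
print('meanings_for_resnick', meanings_for_resnick)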

The entropy of each C[i] will be (and I'm using base 2 entropy here, since I like thinking of it as the number of bits required):

print('H_Ci', - 3 * 1 / 3 * math.log(1 / 3) / math.log(2))

which gives

H_Ci 1.5849625007211563
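
Equivalently, taking the plug-in (empirical) entropy of either column of meanings_for_resnick gives the same number, since offsetting/relabelling values doesn't change the entropy:

import math
from collections import Counter

def empirical_entropy_bits(values):
    # plug-in entropy estimate, base 2
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# each column of meanings_for_resnick has each of its 3 values appearing 3 times out of 9
print(empirical_entropy_bits([0, 0, 0, 1, 1, 1, 2, 2, 2]))  # ≈ 1.58496
print(empirical_entropy_bits([3, 4, 5, 3, 4, 5, 3, 4, 5]))  # ≈ 1.58496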

We generate a random holistic language, where every utterance is different, e.g.:

tensor([[1, 1, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 1],
        [1, 1, 0, 0]])
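
(Again just for reference, I'm generating this by rejection-sampling random binary utterances until all nine rows are distinct; the utterance length of 4 is simply what the example above happens to use.)

import torch

num_meanings = 9
utt_len = 4

# keep sampling random binary utterances until every meaning gets a distinct one
while True:
    _utts = torch.randint(0, 2, (num_meanings, utt_len))
    if torch.unique(_utts, dim=0).shape[0] == num_meanings:
        break
print(_utts)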

Now we pass L* and L through the get_residual_entropy method from this repo. I've modified it in three ways:

  • prints indx, so we can see the partition
  • uses base 2 entropy (since, as stated above, I find it easier to reason about)
  • replaces the hard-coded 10 with num_att_vals, which is passed in as a parameter

Then I call it like this:

resent = get_residual_entropy(target=meanings_for_resnick.numpy(), lang=_utts.numpy(), num_att_vals=3)
print('resent %.4f' % resent)

The result is:

indx [[0, 1, 2, 3], []]
resent 0.7925

So, the partition comprises one sub-sequence containing the entire z, and one empty sub-sequence.

Let's suppose that indx is the partition which results from minimizing over all partitions. I don't quite understand how that minimization happens (maybe by using some kind of greedy approach?), but let's suppose that it does.

Looking at the formula in section 5.1, we need to calculate the conditional entropy H(C[i] | z_j) twice: once for the empty sub-sequence, and once for the sub-sequence containing all positions:

  • the conditional entropy of C[i] given the entire utterances will simply be 0, since every meaning in L* is unique and every utterance in L is unique
    • so we can ignore this one
  • the conditional entropy of C[i] given no utterances at all will, I guess, simply be the entropy of C[i], which we calculated above to be 1.58496
  • then, if we ignore the denominator of H(C[i]), we have:

1/2 × (1.58496 + 0)
= 0.7925

... which is what the get_residual_entropy method returns.
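
To double-check that arithmetic outside the repo's code, here is the same calculation done by hand (the two conditional entropies come straight from the observations in the bullets above, not from any data):

import math

H_Ci = math.log2(3)  # entropy of a uniform 3-valued category, ≈ 1.58496

# sub-sequence containing all positions: the utterances identify each meaning
# exactly (every row of L and of L* is unique), so H(C[i] | z) = 0
H_given_all_positions = 0.0

# empty sub-sequence: conditioning on nothing leaves the entropy unchanged,
# so H(C[i] | {}) = H(C[i])
H_given_no_positions = H_Ci

print((H_given_all_positions + H_given_no_positions) / 2)  # ≈ 0.7925, matching resent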

However, this is without the denominator of H(C[i]). If we add the denominator of H(C[i]) back in, then we have... well, to be honest, I'm not sure about the difference between H(C[i]) with respect to L (in the numerator) and H(C[i]) with respect to L* (in the denominator). I'm assuming they are the same, since L and L* are the same length, and every utterance in L and every meaning in L* is unique. So this gives:

1/2 × (1.58496 / 1.58496 + 0)
= 1/2

... which is different from what the get_residual_entropy method returns?

I don't suppose you have a moment to go over with me what the difference is between H(C[i]) with respect to L and H(C[i]) with respect to L*? It looks like H(C[i]) with respect to L* must somehow evaluate to 1?

Also, if you have a moment, I'm really interested in knowing how the code manages to minimize over all possible partitions.
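
For reference, the naive brute-force minimization I have in mind, written straight from my reading of the formula in 5.1 (and including the H(C[i]) denominator), would be something like the sketch below. This is not this repo's code, just the kind of thing I'm imagining:

import itertools
import math
from collections import Counter, defaultdict

def entropy_bits(values):
    # empirical entropy (base 2) of a sequence of hashable values
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy_bits(targets, contexts):
    # H(target | context), both given as row-aligned hashable values
    n = len(targets)
    groups = defaultdict(list)
    for t, c in zip(targets, contexts):
        groups[c].append(t)
    return sum(len(g) / n * entropy_bits(g) for g in groups.values())

def residual_entropy_bruteforce(meanings, utts):
    # meanings: list of rows, one value per category; utts: list of symbol rows
    num_categories = len(meanings[0])
    utt_len = len(utts[0])
    best = float('inf')
    # assign each utterance position to exactly one category (blocks may be empty)
    for assignment in itertools.product(range(num_categories), repeat=utt_len):
        score = 0.0
        for cat in range(num_categories):
            block = [p for p, a in enumerate(assignment) if a == cat]
            cat_values = [m[cat] for m in meanings]
            sub_utts = [tuple(u[p] for p in block) for u in utts]
            score += conditional_entropy_bits(cat_values, sub_utts) / entropy_bits(cat_values)
        best = min(best, score / num_categories)
    return best

# e.g. residual_entropy_bruteforce(meanings_for_resnick.tolist(), _utts.tolist())

This is exponential in the utterance length, which is part of why I'm guessing the actual implementation does something greedier.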

And lastly, what do I need to modify in the code in order to measure residual entropy in languages L with a vocab size greater than 2? (This is my actual goal, in fact :) so if there is a way of short-circuiting my having to understand the code and just running it, that would be great too :) )
