bb2cf version V: TCH keys are stored as hex-encoded strings, should be bytes #40

Open
hrz6976 opened this issue Jun 8, 2024 · 3 comments


hrz6976 commented Jun 8, 2024

While developing the new Python driver, I was surprised to find that bb2cf never worked for me.
getValues does not work for any blob SHA either:

(base) ~ echo 125afaaefe99189eb2cec5aafc470770b79abbd0 | ~/lookup/getValues bb2cf
No 125afaaefe99189eb2cec5aafc470770b79abbd0 in /fast/bb2cfFullV
(base) ~ echo 125afaaefe99189eb2cec5aafc470770b79abbd0 | ~/lookup/getValues obb2cf
125afaaefe99189eb2cec5aafc470770b79abbd0;5664f2622743898c5ca094670f16b4fb8fb4b74f;4071672afedd71d8997f427ee8ffec5dd97a3a1c;src/templates/unitedstates/states/louisiana.ejs

Digging into the issue, I found that the keys in the bb2cf Tokyo Cabinet hash tables are hex-encoded strings:

In [8]: from woc.tch import TCHashDB
   ...: db = TCHashDB('/fast/woc_azure/da3-fast/bb2cfFullV.18.tch'.encode(), ro=True)
   ...: for k in db:
   ...:     print(k, k.hex())
   ...:     break
   ...:
b'125afaaefe99189eb2cec5aafc470770b79abbd0' 31323561666161656665393931383965623263656335616166633437303737306237396162626430

instead of raw bytes, like the keys in obb2cf:

In [7]: from woc.tch import TCHashDB
   ...: db = TCHashDB('/fast/woc_azure/da3-fast/obb2cfFullV.18.tch'.encode(), ro=True)
   ...: for k in db:
   ...:     print(k, k.hex())
   ...:     break
   ...:
b'\x12|O\x11\xcfS\x19+\xbeC\x93\xb2\x0c\xbd\xaf\xcfW\xe6<q' 127c4f11cf53192bbe4393b20cbdafcf57e63c71
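The difference between the two key encodings can be reproduced without opening a database. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of woc.tch:

```python
# Illustrative only: reproduces the two key encodings shown above.
sha_hex = "125afaaefe99189eb2cec5aafc470770b79abbd0"

# What the bb2cf table stores here: the 40 ASCII characters of the hex digest.
hex_key = sha_hex.encode("ascii")

# What obb2cf stores: the 20 raw bytes of the SHA-1.
bin_key = bytes.fromhex(sha_hex)

print(len(hex_key), hex_key.hex())
print(len(bin_key), bin_key.hex())
```

A hex-encoded key is twice as long as the binary one, so a lookup with the 20-byte form can never match a 40-byte stored key.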

It works with the following workaround:

# TODO: remove bb2cf quirk after fixing tch keys
# bb2cf: keys are stored as hex strings in tch db
if map_name == 'bb2cf':
    key = key.hex().encode('ascii')

Is this intended, or are you planning to fix it?
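For illustration, the quirk could be isolated in one helper so that it is easy to delete once the table is rebuilt. This is only a sketch: `HEX_KEYED_MAPS` and `encode_key` are hypothetical names, not part of python-woc.

```python
# Hypothetical helper; names are illustrative, not the python-woc API.
# Maps whose TCH keys are assumed to be stored as ASCII hex instead of raw bytes.
HEX_KEYED_MAPS = {"bb2cf"}


def encode_key(map_name: str, sha_hex: str) -> bytes:
    """Return the lookup key for a SHA-1 given as a 40-character hex string."""
    raw = bytes.fromhex(sha_hex)  # raises ValueError on non-hex input
    if len(raw) != 20:
        raise ValueError(f"expected a 20-byte SHA-1, got {len(raw)} bytes")
    if map_name in HEX_KEYED_MAPS:
        # Quirk: this table expects the hex digest as ASCII bytes.
        return raw.hex().encode("ascii")
    return raw
```

With this in place, `encode_key("obb2cf", sha)` yields the 20-byte key and `encode_key("bb2cf", sha)` yields the 40-byte ASCII form; emptying `HEX_KEYED_MAPS` removes the quirk.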

Collaborator

audrism commented Jun 11, 2024

  1. All SHA-1s are in binary form when they are keys. When they are mixed with strings in values, they may sometimes be in hex.
  2. 125afaaefe99189eb2cec5aafc470770b79abbd0 indeed has no parent, only a child, hence it is in obb2cf but not in bb2cf.
  3. I am surprised that bb2cf does not have binary keys:
    for i in {0..31}; do time zcat bb2cfFullV{$i,$((i+32)),$((i+64)),$((i+96))}.s | ~/lookup/h2fbbBinSorted.perl /fast/bb2cfFullV.$i.tch; done
  4. Recreating bb2cfFullV.
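To make the brace expansion in item 3 easier to read: each of the 32 output .tch files is fed from four input shards, spaced 32 apart. The sketch below only reproduces the file-name arithmetic of that loop; `shards_for` is a hypothetical name, and the total of 128 input shards is inferred from the offsets in the command.

```python
def shards_for(i: int, n_out: int = 32, n_in: int = 128) -> list[str]:
    """List the input shard files piped into output database number i."""
    # Same indices as {$i,$((i+32)),$((i+64)),$((i+96))} in the shell loop.
    return [f"bb2cfFullV{j}.s" for j in range(i, n_in, n_out)]


for i in (0, 31):
    print(f"/fast/bb2cfFullV.{i}.tch <-", shards_for(i))
```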

Author

hrz6976 commented Jun 11, 2024

Okay, I'll remove the quirk for bb2cf in python-woc after the fix.

Collaborator

audrism commented Jun 12, 2024

Finished
