Data files should probably be in a separate package. #251
Comments
Why? |
To update them separately and fully automatically. For example with a CI pipeline run by "cron". |
Currently there is only one data file emoji/unicode_codes/data_dict.py. Also the Unicode data is only updated twice per year, so at the moment it's not that much work to do it manually. |
I have created a tool merging some data from different sources: https://github.com/KOLANICH-tools/emojiSlugMappingGen.py |
In light of the pull request #252 adding Japanese & Korean and the recently added languages Chinese & Indonesian, I think the more important issue with the single data file is memory consumption. |
@cvzi, https://github.com/KOLANICH-tools/emojifilt.cpp uses |
I think something like that would be overkill for this library; it doesn't need to be that efficient. My suggestion would be to keep the current API of the library and still try to reduce memory usage a little. People use the big dictionary directly at the moment (it is in the public API). It is also nice that you can just open the file in a text editor and look at the emoji: in that way it is kind of like JSON, a human can easily read or even edit it, and it is simple to add custom slugs or even custom emoji. Maybe the language data could be in separate files and could be loaded on request:

```python
import emoji

print(emoji.EMOJI_DATA['🐕']['fr'])  # would throw an error because fr data has not been loaded
emoji.load_languages(['fr', 'zh', 'ja'])
print(emoji.EMOJI_DATA['🐕']['fr'])  # now it would work
```

If all the languages were in separate files, it would probably reduce memory usage by about 50% for a user who only uses one language. It would still be a breaking change for the API though, since the first access in the example fails. |
Maybe we could use a class to emulate a dictionary. Currently it looks like this:

```python
EMOJI_DATA = {
    u'\U0001F415': {  # 🐕
        'en': ':dog:',
        'status': fully_qualified,
        'E': 0.7,
        'alias': [':dog2:'],
        'variant': True,
        'de': ':hund:',
        'es': ':perro:',
        'fr': ':chien:',
        'ja': u':イヌ:',
        'ko': u':개:',
        'pt': ':cachorro:',
        'it': ':cane:',
        'fa': u':سگ:',
        'id': ':anjing:',
        'zh': u':狗:'
    },
    ...
}
```

Maybe the inner dictionaries could be objects instead and the language data could be in separate files:

```python
EMOJI_DATA = {
    u'\U0001F415': ClassLikeADictionary({  # 🐕
        'en': ':dog:',
        'status': fully_qualified,
        'E': 0.7,
        'alias': [':dog2:'],
        'variant': True
    }),
    ...
}

class ClassLikeADictionary:
    def __getitem__(self, key):
        # Load language data if it is not loaded yet
        if languageIsNotLoaded(key):
            loadLanguageFromDataFile(key)
        return valueFor(key)
    ...
```

So you could still access it with the same syntax. |
If you don't want to use binary gettext: since your implementation is going to spend memory on each opened file for each language, an explicit loading call should make more sense, because it makes opening a new file more explicit. |
Thinking about the idea of updating the data files in CI, I had another idea that could make memory usage smaller for the average user: we could release several flavors of this package instead of just one, for example one flavor per language.
I think this could be easy with CI/GitHub Actions. The main thing to do would be to remove all languages from EMOJI_DATA except the one language and then publish on PyPI. |
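The CI step for building such a flavor could be sketched roughly like this (the key names come from the `EMOJI_DATA` example above; `strip_languages` and `METADATA_KEYS` are hypothetical names, and treating every non-metadata key as a language code is an assumption):

```python
# Keys in each entry that are not translations and must always be kept.
METADATA_KEYS = {'en', 'status', 'E', 'alias', 'variant'}


def strip_languages(emoji_data, keep_lang):
    # Return a copy of emoji_data containing only the metadata keys
    # and the one language this flavor should ship with.
    keep = METADATA_KEYS | {keep_lang}
    return {
        emj: {k: v for k, v in entry.items() if k in keep}
        for emj, entry in emoji_data.items()
    }


EMOJI_DATA = {
    '\U0001F415': {
        'en': ':dog:', 'E': 0.7, 'variant': True,
        'de': ':hund:', 'fr': ':chien:', 'es': ':perro:',
    },
}

# A German-only flavor would be built and published from this dict.
DE_ONLY = strip_languages(EMOJI_DATA, 'de')
```

The pipeline would run this once per language, write the result back into the data module, and publish each variant under its own package name.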
I think keeping languages in separate files (like it was before) but in one "emoji" project is much better than creating separate projects on PyPI. |
And also not in Python code, but in JSON/CSV/TSV.
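For instance, per-language slugs could round-trip through a plain TSV file, one `emoji<TAB>slug` pair per line (the format and helper names here are an illustration, not what the package actually ships):

```python
def dump_tsv(slugs):
    # Serialize a {emoji: slug} mapping, one tab-separated pair per line.
    return "".join(f"{emj}\t{slug}\n" for emj, slug in slugs.items())


def load_tsv(text):
    # Parse the TSV text back into a {emoji: slug} mapping.
    slugs = {}
    for line in text.splitlines():
        emj, slug = line.split("\t", 1)
        slugs[emj] = slug
    return slugs


fr = {'\U0001F415': ':chien:', '\U0001F408': ':chat:'}
round_tripped = load_tsv(dump_tsv(fr))
```

A plain-text format like this keeps the data diffable and editable outside Python, and it can be regenerated by a cron-driven CI job without touching any code.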