pick_lstm_model parameters are too complicated to call #10
In this document under "input_name" I explain the relationship between the names of the models and their hyperparameters. In fact, it is possible to get rid of the variable embedding for pick_lstm_model if it is guaranteed that any trained model in the future follows the naming convention I explained in this link
<https://github.com/unicode-org/lstm_word_segmentation/blob/master/Models%20Specifications.md>,
but I just left it there because I wasn't sure if that's the case. |
So when I use Thai_graphclust_model5_heavy, the embedding should be
"grapheme_clusters_tf", right?
I think there is a bug there for grapheme_clusters_tf.
It does not make sense to me for some clusters.
For example, for the input
พิธีส
we should get 121, 234, 22 as the cluster IDs,
but right now in Python we get 121, 235, 22 as the cluster IDs:
~/lstm_word_segmentation$ jq . Models/Thai_graphclust_model5_heavy/weights.json | egrep " (22|121|234|235),"
"ส": 22,
"พิ": 121,
"ธี": 234,
"ป่": 235,
My C++ code gives me 121, 234, 22, which does not match the Python output;
this is before feeding into the LSTM.
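The same lookup can be reproduced in Python instead of jq. A minimal sketch, assuming weights.json holds the cluster-to-ID mapping under a "dic" key (consistent with the jq .dic queries later in this thread); the JSON excerpt below is a hypothetical fragment containing only the entries shown above:

```python
import json

# Hypothetical excerpt of Models/Thai_graphclust_model5_heavy/weights.json,
# restricted to the four entries from the jq output above.
weights_json = '{"dic": {"ส": 22, "พิ": 121, "ธี": 234, "ป่": 235}}'
dic = json.loads(weights_json)["dic"]

# The input พิธีส segments into three grapheme clusters; look up their IDs.
ids = [dic[c] for c in ["พิ", "ธี", "ส"]]
print(ids)  # per this JSON dictionary: [121, 234, 22]
```

Against the real weights.json this yields 121, 234, 22, matching the C++ side rather than the 121, 235, 22 reported by the Python pipeline.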
|
OK, I think I know what is going on.
I am using the data from the JSON file, while the Python code is using the .npy files
inside the Data directory. Somehow they do not match:
neither Thai_graph_clust_ratio.npy nor Thai_exclusive_graph_clust_ratio.npy
has "ธี" as 234, but
Models/Thai_graphclust_model5_heavy/weights.json has "ธี" as 234.
|
I am pretty sure Models/Thai_graphclust_model*/weights.json were not
generated from either the current version of
Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy in the
Data directory, and I am not sure what the output quality from the
Python code will be now with the current version of these two files in the Data
directory.
Somehow most of the ordering is the same, but about 5-10% is different.
Check:
~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 234,"
"ธี": 234,
"ธี": 234,
"ธี": 234,
You will see that all these Models/Thai_graphclust_model*/weights.json were
generated with "ธี" as item 234 in the grapheme clusters, but that is
not the case in either
Thai_exclusive_graph_clust_ratio.npy or Thai_graph_clust_ratio.npy.
There are other cases, for example 29:
~/lstm_word_segmentation$ ls Models/Thai_graphclust_model*/weights.json | xargs jq .dic | egrep ": 29,"
"ม่": 29,
"ม่": 29,
"ม่": 29,
but in Thai_graph_clust_ratio.npy,
'"': 29 and 'ม่': 30,
and in Thai_exclusive_graph_clust_ratio.npy,
'ว่': 29 and 'ม่': 28.
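A mismatch like this can be diffed programmatically rather than by eyeballing jq output. A sketch under two assumptions not confirmed here: that the .npy files store a pickled dict of cluster-to-frequency ratios (loadable via np.load(..., allow_pickle=True).item()), and that cluster IDs are assigned by descending frequency rank. The toy data below is hypothetical; substitute the real files:

```python
# Toy ratio dict (cluster -> frequency); assumption: the ID of a cluster is
# its rank when sorted by descending frequency, as the .npy files imply.
ratios = {"ก": 0.5, "ส": 0.3, "ธี": 0.2}
npy_dic = {c: i for i, c in enumerate(sorted(ratios, key=ratios.get, reverse=True))}

# Toy "dic" from a weights.json, with one deliberately mismatched ID.
json_dic = {"ก": 0, "ส": 1, "ธี": 5}

# Report every cluster whose ID differs between the two sources.
mismatches = {c: (npy_dic[c], json_dic[c])
              for c in npy_dic if c in json_dic and npy_dic[c] != json_dic[c]}
print(mismatches)  # {'ธี': (2, 5)}
```

Run against the real Thai_graph_clust_ratio.npy and each weights.json, this would quantify the "about 5-10% different" estimate exactly.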
|
Yes, apparently an old version of the dictionaries was on our shared Google Drive and I didn't notice it. Sorry it wasted some of your time. I updated the *.ratio files on our drive. I checked the updated file "Thai_graphclust_ratio.npy" and it seems to give the same numbers that you mentioned above. So the Python code that you ran was flawed (and I guess you got lower accuracy there), but whatever we had in the JSON files was up to date. @sffc this should not affect our model performance in Rust; that's probably why we didn't spot it sooner. |
Hmm, how about this: could you submit a PR to change
https://github.com/unicode-org/lstm_word_segmentation/tree/master/Data to
what it should be?
|
I made a commit that does this and left a comment for you there. I forgot to submit a PR, but I basically just changed the files and those lines of code that read/write dictionaries. Please see my commit. I also updated our Google drive accordingly. |
OK, thanks. Let me try.
|
I have the following simple program to see how to run all the different models under
https://github.com/unicode-org/lstm_word_segmentation/tree/master/Models
It currently works for Thai_codepoints_exclusive_model4_heavy, but I have trouble figuring out what values need to be passed in for the other models.
Could you specify what values should be used for embedding, train_data and eval_data for the other models?
Burmese_codepoints_exclusive_model4_heavy
Burmese_codepoints_exclusive_model5_heavy
Burmese_codepoints_exclusive_model7_heavy
Burmese_genvec1235_model4_heavy
Burmese_graphclust_model4_heavy
Burmese_graphclust_model5_heavy
Burmese_graphclust_model7_heavy
Thai_codepoints_exclusive_model4_heavy
Thai_codepoints_exclusive_model5_heavy
Thai_codepoints_exclusive_model7_heavy
Thai_genvec123_model5_heavy
Thai_graphclust_model4_heavy
Thai_graphclust_model5_heavy
Thai_graphclust_model7_heavy
Or is there a simple way we could just have a function
get_lstm_model(model_name) on top of pick_lstm_model() that fills in the necessary parameters and calls pick_lstm_model()?
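The wrapper suggested above could derive everything from the model name alone, per the naming convention in Models Specifications.md. A hedged sketch: the embedding strings and the data-set names below are guesses, not values confirmed by the repository, and pick_lstm_model's exact signature is left to the caller:

```python
def derive_lstm_model_args(model_name):
    """Derive embedding/train_data/eval_data from a model's name.

    Hypothetical sketch: the embedding strings and data-set names are
    assumptions based on the naming convention, not confirmed values.
    """
    language = model_name.split("_")[0]  # "Thai" or "Burmese"

    # Map the embedding token in the name to an embedding argument.
    if "graphclust" in model_name:
        embedding = "grapheme_clusters_tf"
    elif "codepoints" in model_name:
        embedding = "codepoints"
    elif "genvec" in model_name:
        embedding = "generalized_vectors"
    else:
        raise ValueError(f"cannot infer embedding from {model_name!r}")

    # Placeholder data-set names; substitute whatever pick_lstm_model expects
    # (e.g. the BEST corpus for Thai, the "my" corpus for Burmese).
    data = "BEST" if language == "Thai" else "my"
    return {"embedding": embedding, "train_data": data, "eval_data": data}

print(derive_lstm_model_args("Thai_graphclust_model5_heavy"))
```

A get_lstm_model(model_name) would then just forward these, e.g. pick_lstm_model(model_name, **derive_lstm_model_args(model_name)), with the kwargs renamed to match the real signature.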