Unicode support #94
Change the dictionary to Indian languages and modify DeepSpeechModel.lua: fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size)). Change dict_size to the length of the dictionary; for example, the length of dictionary_english is 29. |
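For illustration, a minimal sketch of that change (the dictionary file name and the rnnHiddenSize value below are placeholders, not the project's actual settings): count the lines of the dictionary file to get dict_size, then size the final linear layer accordingly.
require 'nn'

-- count the entries of the dictionary file, one (UTF-8) character per line; the path is a placeholder
local dict_size = 0
for _ in io.lines('dictionary_hindi') do
    dict_size = dict_size + 1
end

-- the last layer must produce one score per dictionary entry (the blank symbol '$' is
-- assumed to be one of the lines, as in the 29-entry English dictionary)
local rnnHiddenSize = 1760              -- placeholder; use the value set in DeepSpeechModel.lua
local fullyConnected = nn.Sequential()
fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size))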
Thank you.
|
In mapper.lua the input is read byte-wise, so that will break Unicode characters. I changed mapper.lua to read characters with Unicode support; the sketch below shows the difference.
|
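To see why byte-wise reading breaks Indic text, here is a standalone sketch using the luautf8 rock (not the project's mapper code); the sample string is arbitrary:
local utf8 = require 'lua-utf8'

local line = 'नमस्ते'    -- Devanagari sample; every letter spans several bytes in UTF-8

-- byte-wise loop (what plain string indexing gives you): prints broken fragments
for i = 1, #line do
    io.write(string.sub(line, i, i), ' ')
end
print()

-- UTF-8-aware loop: prints one complete character per step
for _, c in utf8.codes(line) do
    io.write(utf8.char(c), ' ')
end
print()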
If you could open a PR with those changes, that would be awesome :) |
@slbinilkumar Don't worry, use the Lua UTF-8 library instead of the string library. My script:
require 'torch'
local utf8 = require 'lua-utf8'

-- construct an object to deal with the mapping between characters and token ids
local mapper = torch.class('Mapper')

function mapper:__init(dictPath)
    assert(paths.filep(dictPath), dictPath .. ' not found')
    self.alphabet2token = {}
    self.token2alphabet = {}
    -- build both maps; each line of the dictionary is one (UTF-8) character
    local cnt = 0
    for line in io.lines(dictPath) do
        self.alphabet2token[line] = cnt
        self.token2alphabet[cnt] = line
        cnt = cnt + 1
    end
end

function mapper:encodeString(line)
    line = utf8.lower(line)
    local label = {}
    -- iterate over code points rather than bytes so multi-byte characters stay intact
    for _, c in utf8.codes(line) do
        local character = utf8.char(c)
        table.insert(label, self.alphabet2token[character])
    end
    return label
end

function mapper:decodeOutput(predictions)
    --[[
        Turns the predictions tensor into a list of the most likely tokens.
        NOTE: to compute WER we strip the beginning and ending spaces.
    --]]
    local tokens = {}
    local blankToken = self.alphabet2token['$']
    local preToken = blankToken
    -- the prediction is a sequence of likelihood vectors
    local _, maxIndices = torch.max(predictions, 2)
    maxIndices = maxIndices:float():squeeze()
    for i = 1, maxIndices:size(1) do
        local token = maxIndices[i] - 1 -- CTC indices start from 1, while tokens start from 0
        -- add the token if it is not blank and differs from the previous token
        if token ~= blankToken and token ~= preToken then
            table.insert(tokens, token)
        end
        preToken = token
    end
    return tokens
end

function mapper:tokensToText(tokens)
    local text = ""
    for _, t in ipairs(tokens) do
        text = text .. self.token2alphabet[t]
    end
    return text
end
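For a quick check of the class above, a usage sketch (the file name mapper.lua, the dictionary path, and the test string are assumptions; the dictionary is expected to hold one character per line, the test characters among them, plus the '$' blank):
require 'torch'
dofile('mapper.lua')                    -- load the class definition above; the path is an assumption

local m = Mapper('dictionary_hindi')    -- hypothetical dictionary file
local label = m:encodeString('नमस्ते')
for i, tok in ipairs(label) do          -- one token id per character of the input
    print(i, tok)
end

-- decoding network output (a seqLength x dict_size tensor of scores) would then be:
-- local tokens = m:decodeOutput(predictions)
-- print(m:tokensToText(tokens))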
|
Thank you. |
Hi,
What modifications need to be made for Unicode support? I need to do this for Indian languages.