Unicode support #94
Change the dictionary to Indian languages and modify DeepSpeechModel.lua: fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size)). Change dict_size to the length of the dictionary; for example, the length of dictionary_english is 29. |
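For illustration, a minimal sketch of that change (the dictionary file name and the rnnHiddenSize value below are placeholders, not the project's actual settings): count the lines of the dictionary file to get dict_size, then size the final linear layer accordingly.
require 'nn'

-- count the entries of the dictionary file, one (UTF-8) character per line; the path is a placeholder
local dict_size = 0
for _ in io.lines('dictionary_hindi') do
    dict_size = dict_size + 1
end

-- the last layer must produce one score per dictionary entry (the blank symbol '$' is
-- assumed to be one of the lines, as in the 29-entry English dictionary)
local rnnHiddenSize = 1760              -- placeholder; use the value set in DeepSpeechModel.lua
local fullyConnected = nn.Sequential()
fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size))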
Thank you.
|
In mapper.lua the input is read byte-wise, so that will break Unicode characters. I changed mapper.lua to read characters with Unicode support; the sketch below shows the difference.
|
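To see why byte-wise reading breaks Indic text, here is a standalone sketch using the luautf8 rock (not the project's mapper code); the sample string is arbitrary:
local utf8 = require 'lua-utf8'

local line = 'नमस्ते'    -- Devanagari sample; every letter spans several bytes in UTF-8

-- byte-wise loop (what plain string indexing gives you): prints broken fragments
for i = 1, #line do
    io.write(string.sub(line, i, i), ' ')
end
print()

-- UTF-8-aware loop: prints one complete character per step
for _, c in utf8.codes(line) do
    io.write(utf8.char(c), ' ')
end
print()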
If you could open a PR with those changes, that would be awesome :) |
@slbinilkumar Don't worry, use the Lua UTF-8 library instead of the string library. My script:
require 'torch'
local utf8 = require 'lua-utf8'

-- construct an object to deal with the mapping between characters and token ids
local mapper = torch.class('Mapper')

function mapper:__init(dictPath)
    assert(paths.filep(dictPath), dictPath .. ' not found')
    self.alphabet2token = {}
    self.token2alphabet = {}
    -- build both maps; each line of the dictionary is one (UTF-8) character
    local cnt = 0
    for line in io.lines(dictPath) do
        self.alphabet2token[line] = cnt
        self.token2alphabet[cnt] = line
        cnt = cnt + 1
    end
end

function mapper:encodeString(line)
    line = utf8.lower(line)
    local label = {}
    -- iterate over code points rather than bytes so multi-byte characters stay intact
    for _, c in utf8.codes(line) do
        local character = utf8.char(c)
        table.insert(label, self.alphabet2token[character])
    end
    return label
end

function mapper:decodeOutput(predictions)
    --[[
        Turns the predictions tensor into a list of the most likely tokens.
        NOTE: to compute WER we strip the beginning and ending spaces.
    --]]
    local tokens = {}
    local blankToken = self.alphabet2token['$']
    local preToken = blankToken
    -- the prediction is a sequence of likelihood vectors
    local _, maxIndices = torch.max(predictions, 2)
    maxIndices = maxIndices:float():squeeze()
    for i = 1, maxIndices:size(1) do
        local token = maxIndices[i] - 1 -- CTC indices start from 1, while tokens start from 0
        -- add the token if it is not blank and differs from the previous token
        if token ~= blankToken and token ~= preToken then
            table.insert(tokens, token)
        end
        preToken = token
    end
    return tokens
end

function mapper:tokensToText(tokens)
    local text = ""
    for _, t in ipairs(tokens) do
        text = text .. self.token2alphabet[t]
    end
    return text
end
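For a quick check of the class above, a usage sketch (the file name mapper.lua, the dictionary path, and the test string are assumptions; the dictionary is expected to hold one character per line, the test characters among them, plus the '$' blank):
require 'torch'
dofile('mapper.lua')                    -- load the class definition above; the path is an assumption

local m = Mapper('dictionary_hindi')    -- hypothetical dictionary file
local label = m:encodeString('नमस्ते')
for i, tok in ipairs(label) do          -- one token id per character of the input
    print(i, tok)
end

-- decoding network output (a seqLength x dict_size tensor of scores) would then be:
-- local tokens = m:decodeOutput(predictions)
-- print(m:tokensToText(tokens))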
|
Thank you. |
Hi,
What modifications need to be made for Unicode support? I need to do this for Indian languages.