Formatting Custom Training Data & ValueError: bucket_boundaries must not be empty #113

Open
dbarroso1 opened this issue Jun 15, 2018 · 3 comments

@dbarroso1

Hello, I've been trying to make my own training data, but there don't seem to be many resources on how the data should be formatted. I've compared against the LJ001 data and tried to imitate it, including splitting the wavs and writing the transcript.csv.

I have tested train.py with the LJ001 data and the trainer works, but when I try with my own data it fails with this error:

Traceback (most recent call last):
  File "train.py", line 96, in <module>
    g = Graph(); print("Training Graph loaded")
  File "train.py", line 33, in __init__
    self.x, self.y, self.z, self.fnames, self.num_batch = get_batch()
  File "C:\Users\...\tacotron-master\data_load.py", line 116, in get_batch
    dynamic_pad=True)
  File "C:\anaconda3\envs\...\training\bucket_ops.py", line 374, in bucket_by_sequence_length
    raise ValueError("bucket_boundaries must not be empty")
ValueError: bucket_boundaries must not be empty

Here is an example of the CSV file; I tried matching the ID, TEXT, LENGTH format.

SM001-0001|Oh happy fourth of July America|00:00:02
SM001-0002|Ready to fire up the grill and celebrate our victory over the Brits|00:00:03
SM001-0003|Well, I'm not|00:00:01
SM001-0004|Because despite that incredibly convincing American accent, I'm one of those Brits|00:00:04
SM001-0005|now I've acted in film and TV for years|00:00:02
SM001-0006|but my greatest performance is acting like I don't care that every summer you gobble down tube sausages and celebrate kicking our arses|00:00:07
SM001-0007|Or butts as you say incorrectly|00:00:02
SM001-0008|Do you really still have to celebrate your emancipation from us|00:00:02
SM001-0009|I mean that's like your girlfriend breaking up with you and then celebrating with fireworks|00:00:04
SM001-0010|every year for 300 years|00:00:03
SM001-0011|it gets my goat|00:00:01
SM001-0012|but what really gets my goat is imagining how great America would be if we were still in charge|00:00:04
SM001-0013|Oh America if we'd won the war you'd have better comedy news TV programs and way better rude words|00:00:07
SM001-0014|Oh I'm talking fanny, trollop, minger tar, Minjbag, bleeding, sodding, blooming, cocked up, get stuffed|00:00:06
SM001-0015|and of course wanker|00:00:01
SM001-0016|imagine how sophisticated you'd say when you're insulting someone|00:00:03 
SM001-0017|Oh Brad your wife's a slag don't piss off your wanker|00:00:04
SM001-0018|see how classy that sounded with our accents and your American self-confidence you'd be unstoppable|00:00:05
SM001-0019|yeah you'd have to pay a few more taxes but you can't put a price on that|00:00:03
SM001-0020|Great Britain two would be the greatest country on Earth|00:00:02
SM001-0021|your lawyers would all wear powdered wigs so criminals really respect them|00:00:04
SM001-0022|and you'd have all the mushy peas you can stuff down your bloody great gobs|00:00:03
SM001-0023|oh and if you get sick you don't need to worry about medical insurance because with a National Health Service a doctor will see you for free in about two years|00:00:08
SM001-0024|plus your taxes will be spent on things you really need like a royal family who do the tough jobs no one else wants to do|00:00:06
SM001-0025|like being driven around in a really nice car while waving|00:00:04
SM001-0026|you'all want to eat some apple pie then shoot some hoops and have hoedown|00:00:03

So, TL;DR, two questions:

  1. Why am I receiving this "bucket_boundaries must not be empty" error when Python finds the CSV and can read it?
  2. Based on the answer to 1, how can I properly format my data to work with the neural network?
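
One way to start on question 1 is to look at what a loader would see in each column of the transcript. Below is a minimal sketch, assuming the file is pipe-delimited with three fields per line, as in the sample above. The helper name `summarize_transcript` and the `text_column` parameter are hypothetical and not taken from data_load.py; which column the repository actually reads as text is not confirmed here.

```python
def summarize_transcript(lines, text_column):
    """Return (min_len, max_len) of the chosen column across pipe-delimited lines.

    Hypothetical helper for illustration; not part of the repository.
    """
    lengths = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        lengths.append(len(fields[text_column]))
    return min(lengths), max(lengths)


# Three lines copied from the sample transcript above.
sample = [
    "SM001-0001|Oh happy fourth of July America|00:00:02",
    "SM001-0003|Well, I'm not|00:00:01",
    "SM001-0011|it gets my goat|00:00:01",
]

# Length spread if the second column is treated as the text.
print(summarize_transcript(sample, text_column=1))
# Length spread if the third column is treated as the text.
print(summarize_transcript(sample, text_column=2))
```

If the third column turned out to be the one read as text, every entry would have the same length, since each duration string is 8 characters long. In the LJ Speech metadata the third field is a normalized transcription rather than a duration, so it may be worth checking which column data_load.py uses.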

@djebel-amila

Hi @dbarroso1,
Did you eventually find out how to format your data?
I'm at the same stage. I couldn't figure out how to do it properly, but duplicating the transcript.csv file from the LJ dataset and carefully pasting in my own dataset, sentence by sentence, did the trick. Not a particularly sustainable or elegant solution…

@aryachiranjeev

I am also facing this bucket error. I checked the lengths: maxlen is 151 and minlen is 149, so the for loop that builds the boundaries never iterates and the bucket list ends up empty. If anyone has solved this problem, kindly help me resolve it.
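
The empty-loop situation described above can be reproduced in isolation. This is a minimal sketch, assuming the boundaries are built with a stepped range between the shortest and longest text lengths. The function name `bucket_boundaries`, the offsets, and the step of 20 are assumptions for illustration and are not confirmed from data_load.py.

```python
def bucket_boundaries(min_len, max_len, step=20):
    """Hypothetical boundary builder: stepped values between min_len and max_len.

    The offsets (+1, -1) and the default step are assumptions for this sketch.
    """
    return list(range(min_len + 1, max_len - 1, step))


# Nearly identical lengths, as reported above (minlen 149, maxlen 151).
print(bucket_boundaries(149, 151))

# A wider spread of lengths.
print(bucket_boundaries(10, 150))
```

Under these assumptions the first call yields an empty list, which is the condition the ValueError in the traceback complains about, while the second yields several boundaries. If that is what happens in the repository, a dataset whose text lengths vary more widely would avoid the error.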

@ST2-EV

ST2-EV commented Mar 10, 2020

Hey, consider using this GUI to make the dataset.
