What is the information of your dataset in detail? #8

Open
mohsenoon opened this issue Feb 16, 2023 · 4 comments
Labels
documentation (Improvements or additions to documentation), good first issue (Good for newcomers)

Comments


mohsenoon commented Feb 16, 2023

Apart from the data size, please publish other details such as total duration, number of files, average file length, quality, number of speakers, source, and method of collection.

Also, since the transcriptions are the output of Google's speech-to-text service, it would be better to state this explicitly and report the approximate error rate.
The raw output of a speech-to-text model can be used, with some caveats, to train other models, but it certainly cannot be presented as a speech-to-text dataset.

mohsenoon changed the title from "What is the information of this speech data in detail?" to "What is the information of your dataset in detail?" Feb 16, 2023
masoudMZB pinned this issue Feb 19, 2023
Member

masoudMZB commented Feb 19, 2023

Hi,
[Temporary note: this answer will be updated as I make progress. This line will be removed once the TODO list is complete.]

Here is my TODO list for updating and editing the repo as soon as possible:

TODO

  • Pin Issue
  • Edit the README to describe the source of the data
  • Statistical and other information about the data
    • number of files
    • duration
    • average length of files
    • quality
    • number of speakers
    • method of collection
  • Write about the Google output results
  • Some suggestions on how to use this data for different tasks (STT, etc.)

Thanks to @mohsenoon for analysing the data.

masoudMZB added the documentation and good first issue labels Feb 19, 2023
@masoudMZB
Member

Update 1: 3/6/2023

New stats for the data are ready. They are not 100% accurate, but they are accurate enough to rely on:

Total duration: 1697.1423399942473 hours
Total size: 195510797567.33728 bytes
Mean duration: 4.834937608942991 seconds
Mean size: 154718.00348617567 bytes
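
As a quick sanity check, the four figures above can be cross-checked against each other. The sketch below assumes that the two smaller figures are per-file means, which the thread does not state explicitly; under that assumption, dividing each total by its mean should give the same file count. The variable names are illustrative only.

```python
# Figures copied from the update above.
total_hours = 1697.1423399942473
total_bytes = 195510797567.33728
mean_duration_s = 4.834937608942991   # assumed to be the mean per file
mean_size_bytes = 154718.00348617567  # assumed to be the mean per file

# If both means are per-file averages, each total divided by its mean
# should give the same file count.
files_from_duration = total_hours * 3600 / mean_duration_s
files_from_size = total_bytes / mean_size_bytes

# Implied average byte rate across the dataset.
bytes_per_second = mean_size_bytes / mean_duration_s

print(f"files (from duration): {files_from_duration:,.0f}")
print(f"files (from size):     {files_from_size:,.0f}")
print(f"total size:            {total_bytes / 1e9:.1f} GB")
print(f"bytes per second:      {bytes_per_second:,.0f}")
```

Both estimates come out at roughly 1.26 million files, so the figures are at least internally consistent. The implied rate of about 32,000 bytes per second would match 16 kHz, 16-bit mono PCM audio, but that is only an inference; the audio format is not stated here.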


hamjam commented Jun 10, 2023

Hi Masoud,
I have downloaded all parts of version 2, but after removing duplicated metadata entries from the CSVs, the remaining dataset contains only 625 hours of audio clips, not 1697 hours. What do you think the problem is?
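
For anyone who wants to reproduce this kind of check, here is a minimal sketch of deduplicating metadata rows and totalling the duration. The column names `file` and `duration`, the sample rows, and the function name are all hypothetical; the dataset's actual CSV schema is not shown in this thread, so adjust them to match the real files.

```python
import csv
import io

# Hypothetical sample metadata. The column names "file" and "duration"
# are assumptions, not the dataset's actual CSV schema.
SAMPLE_CSV = """file,duration
a.wav,4.0
b.wav,6.0
a.wav,4.0
c.wav,5.0
"""


def total_hours_without_duplicates(csv_file, key="file", duration_col="duration"):
    """Return (unique clip count, total hours), counting each clip once."""
    seen = set()
    total_seconds = 0.0
    for row in csv.DictReader(csv_file):
        if row[key] in seen:
            continue  # skip repeated metadata rows for the same clip
        seen.add(row[key])
        total_seconds += float(row[duration_col])
    return len(seen), total_seconds / 3600.0


count, hours = total_hours_without_duplicates(io.StringIO(SAMPLE_CSV))
print(f"{count} unique clips, {hours:.4f} hours")
```

To run it on real metadata, pass an open file object instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, and repeat per CSV while sharing one `seen` set if duplicates can span files.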

@masoudMZB
Member

@hamjam
Hi, sorry for the late response. Can you add the duration of the v1 data too, and report the total when both versions are combined?

Then, if my information is wrong, please send a pull request to correct the README file.

Thanks for your attention.
