Showing 13 changed files with 2,649 additions and 11 deletions.
@@ -0,0 +1,5 @@
*~
*#
__pycache__
*pyc
.ipynb_checkpoints
@@ -1,22 +1,124 @@
# PATS Dataset
For an overview of the dataset, check the [website](http://chahuja.com/pats).

<center>
<img src="https://user-images.githubusercontent.com/43928520/90454983-c022ba00-e0c2-11ea-991e-36bd5cb3b38b.png" width="500px">
</center>

# Structure of the dataset

```sh
pats/data
  - cmu_intervals_df.csv
  - missing_intervals.h5
  - processed
    - oliver # speakers
      - XXXX.h5
      - YYYY.h5
      ...
    - jon
      ...
    ...
    - bee
    - noah
  - raw
    - oliver
    - oliver_cropped
```

## Download
The dataset consists of:

- `cmu_intervals_df.csv`: a list of all intervals and the relevant meta information (similar to [Ginosar et al. 2019](https://github.com/amirbar/speech2gesture/blob/master/data/dataset.md)).
- `missing_intervals.h5`: a list of all intervals that have an incomplete set of features. For the sake of uniformity, they are excluded from the benchmark tests.
- `processed`: h5 files containing processed features for pose, audio and transcripts for all speakers.
- `raw`: mp3 audio files corresponding to each interval, which are useful during rendering.

## Processed Features
- pose
  - data: XY coordinates of the upper-body pose relative to the neck joint. Joint order and parents can be found [here](https://github.com/chahuja/pats/blob/master/data/skeleton.py#L40).
  - normalize: same as data, but the size of the body is normalized across speakers. In other words, each speaker is scaled to have the same shoulder length. This is especially useful in style-transfer experiments, where we would like the style of gestures to be independent of the size of the speaker.
  - confidence: confidence scores provided by OpenPose, with 1 being most confident and 0 being least confident.
- audio
  - log_mel_400: log mel spectrograms extracted with the function [here](https://github.com/chahuja/pats/blob/master/data/audio.py#L38).
  - log_mel_512: log mel spectrograms extracted with the function [here](https://github.com/chahuja/pats/blob/master/data/audio.py#L32).
  - silence: using [VAD](https://github.com/chahuja/pats/blob/master/data/audio.py#L65), we estimate which segments contain the speaker's voice and which contain only noise.
- text
  - bert: fixed pre-trained BERT embeddings of size 768.
  - tokens: tokens extracted using the BertTokenizer from [HuggingFace](https://huggingface.co).
  - w2v: Word2Vec features of size 300.
  - meta: pandas DataFrame with words, start_frame and end_frame.
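
A minimal sketch of reading these features directly with h5py, assuming the internal hierarchy of each interval file mirrors the feature names above (the file name `XXXX.h5` is a placeholder for an interval id):

```python
import h5py

# Placeholder interval file; real files are named by interval id.
path = 'pats/data/processed/oliver/XXXX.h5'
with h5py.File(path, 'r') as f:
    pose = f['pose/data'][()]           # pose features for this interval
    audio = f['audio/log_mel_512'][()]  # log mel spectrogram features
```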

## Raw Features
We provide links to the original YouTube videos to help download the relevant audio files. Rendering the generated animations with audio requires the raw audio, which is useful for user studies.

# Dataset Download
To download the **processed** features of the dataset, visit [here](http://chahuja.com/pats/download.html).

To download the **raw** features of the dataset, run:
```sh
# -base_path: path to the dataset folder
# -speaker: speaker name (optional); downloads all speakers if not specified
python youtube2croppedaudio/youtube2audio.py \
  -base_path pats/data/ \
  -speaker bee \
  -interval_path cmu_intervals_df.csv
```
As raw audio files are downloaded from video streaming websites such as YouTube, some of them may not be available at the time of download. For consistent benchmarking, the **processed** features should be used.

# Data Loader
As a part of this dataset, we provide a DataLoader in [PyTorch](https://pytorch.org) to jumpstart your research. This DataLoader samples batches of aligned processed features of pose, audio and transcripts for one or many speakers in a dictionary format. We describe the various [arguments](#arguments-of-class-data) of the class [`Data`](https://github.com/chahuja/pats/blob/master/data/dataUtils.py#L51), which generates the DataLoaders.

DataLoader examples: [Ipython Notebook](dataloader_tutorial.ipynb)
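
A minimal usage sketch, assuming the repository root is on the Python path and the processed features live in `pats/data/processed`; the `data.train` attribute and the batch keys follow the tutorial notebook and should be treated as assumptions:

```python
from data import Data

data = Data(path2data='pats/data/processed',
            speaker=['bee'],
            modalities=['pose/data', 'audio/log_mel_512'],
            fs_new=[15, 15],
            time=4.3,
            batch_size=32)

for batch in data.train:  # assumed train/dev/test DataLoader attributes
    pose = batch['pose/data']           # aligned pose features
    audio = batch['audio/log_mel_512']  # aligned audio features
    break
```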

## Requirements
* pycasper

```sh
mkdir ../pycasper
git clone https://github.com/chahuja/pycasper ../pycasper
ln -s ../pycasper/pycasper .
```

* Create an [anaconda](https://www.anaconda.com/) environment named `pats` from `env.yaml`:

```sh
conda env create -f env.yaml
```

## Arguments of class `Data`
There are way too many arguments (#research) for `Data`. For most use cases you will not need most of them and can leave them at their default values. We divide the arguments into **Essential**, **DataLoader Arguments**, **Modality Arguments**, **Sampler Arguments** and **Others**.
### Essential
- `path2data (str)`: path to processed data, e.g. "pats/data/processed".
- `speaker (str or list)`: one or more speaker names. Find the list of speakers [here](https://github.com/chahuja/pats/blob/master/data/common.py#L152).
- `modalities (list)`: list of processed features to be loaded. Default: ['pose/data', 'audio/log_mel_512']. Find the list of all processed features [here](https://github.com/chahuja/pats#processed-features).
- `fs_new (list)`: list of frame rates for each modality in `modalities`. Default: [15, 15]. The length of `fs_new` must equal the length of `modalities`.
- `time (float)`: length of the window for each sample in seconds. Default: 4.3. The default value is recommended; it results in 64 frames of audio and pose when `fs_new` is 15.
- `split (tuple or None)`: train, dev and test split as fractions. Default: None, which uses the pre-defined splits in cmu_intervals_df.csv. As an example of a tuple, (0.7, 0.1) represents the ratios of train and dev, hence the test split is 0.2.
- `window_hop (int)`: number of frames a window hops in an interval to construct samples. Default: 0, which implies non-overlapping windows. For `window_hop` > 0, samples are created as `[sample[i:i+int(time*fs_new[0])] for i in range(0, len(sample), window_hop)]` (see the sketch below).
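
A toy sketch of the windowing arithmetic above, using the default `time` and `fs_new` with an illustrative hop:

```python
time, fs_new = 4.3, [15, 15]
window = int(time * fs_new[0])  # 4.3 s at 15 fps -> 64 frames per window

sample = list(range(300))       # a toy interval, 300 frames long
window_hop = 32                 # illustrative value; 0 means non-overlapping
# Formula from above; note that trailing windows may be shorter.
samples = [sample[i:i + window] for i in range(0, len(sample), window_hop)]
```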

### DataLoader Arguments
- `batch_size (int)`: size of a batch. Default: 100.
- `shuffle (bool)`: shuffle samples after each epoch. Default: True.
- `num_workers (int)`: number of workers to load the data. Default: 0.

### Text Arguments
- `filler (int)`: get "text/filler" as a feature in the sampled batch. This feature is a tensor of shape `batch x time`, where each element indicates whether the spoken word was a filler word or not. The list of filler words is the same as nltk's English stopword list. Default: 0. Use 1 to get the "text/filler" feature.
- `repeat_text (int)`: if 1, the feature of each word token is repeated to match the length of its duration. For example, if a word is spoken for 10 frames of the pose and/or audio sequence, it is stacked 10 times; hence the time dimensions of pose, audio and transcripts are the same. If 0, word tokens are not repeated. As each sample could have a different number of words, shorter sequences are padded with zeros. Extra features "text/token_duration" and "text/token_count" are also part of the sample; they represent the duration of each token in frames and the number of tokens in each sequence respectively (see the sketch below).
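
A toy sketch of what `repeat_text=1` does, using hypothetical token features and durations:

```python
import numpy as np

# Hypothetical BERT token features (3 tokens, 768 dims) and per-token
# durations in frames, i.e. "text/token_duration".
tokens = np.random.randn(3, 768)
durations = np.array([10, 4, 2])

# Repeat each token's feature vector for the frames it spans, so the text
# stream aligns frame-by-frame with pose and audio.
repeated = np.repeat(tokens, durations, axis=0)  # shape (16, 768)
```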

### Sampler Arguments (mutually exclusive unless specified)
- `style_iters (int)`: if value > 0, [`AlternateClassSampler`](https://github.com/chahuja/pats/blob/master/data/dataUtils.py#L618) is used as the sampler argument while building the train dataloader. This sampler is useful when two or more speakers are trained together: it ensures that each mini-batch has an equal number of samples from each speaker. The value is the number of iterations in each epoch. Default: 0.
- `sample_all_styles (int)`: can only be used with the argument `style_iters`. If value > 0, randomly selects that many samples from each speaker to load. This is especially useful for performing inference in style-transfer experiments, where the number of permutations of style transfer increases exponentially with the number of speakers. This argument puts an upper bound on the number of samples for each speaker, hence limiting the time needed to generate gestures for a limited number of inputs. Default: 0.
- `num_training_sample (int or None)`: if value > 0, chooses a random subset of unique samples, with the cardinality of the set equal to the value, as the new training set. If value is None, all samples are considered for training. Default: None.
- `quantile_sample (float or int or None)`: Default: None.
- `quantile_num_training_sample (int or None)`: Default: None.
- `weighted (int)`: if value > 0, `torch.utils.data.WeightedRandomSampler` is used as the sampler argument while building the train dataloader. The weights are set to 1 for each sample. While this is equivalent to a uniform sampler, it provides the possibility of changing the weights for each sample during training (see the sketch below). Default: 0.
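
A minimal sketch of the uniform-weight sampler described above (the dataset size is hypothetical):

```python
import torch
from torch.utils.data import WeightedRandomSampler

num_samples = 1000                 # hypothetical dataset size
weights = torch.ones(num_samples)  # weight 1 per sample == uniform sampling

# The weights tensor can be updated during training to bias sampling.
sampler = WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)
```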

### Others
- `load_data (bool)`: if True, loads the hdf5 files into RAM. If False, the files are not loaded and the dataloaders will not work as intended; useful for quick debugging.
- `num_training_iters (int or None)`: if value > 0, changes the training sampler to sample with replacement, with the value giving the number of iterations per epoch. If value is None, the sampler samples without replacement and the number of iterations is inferred from the size of the dataset. Default: None.

## Render
% render script
% format of h5
% download audios to render
Todo..

## File Contents
% intervals.csv
% .h5
# Creating your own dataloader
In case you prefer to create your own dataloaders, we recommend checking out the [structure of the h5 files](#processed-features) and the last sections of the [Ipython Notebook](dataloader_tutorial.ipynb). We have a class [`HDF5`](data/common.py#L16) with many staticmethods that may be useful for loading HDF5 files. A minimal sketch of a custom dataset is shown below.
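
The sketch below builds a bare-bones PyTorch dataset over the processed files; the per-speaker directory layout and the `speaker`/`interval_id` columns of cmu_intervals_df.csv are assumptions inferred from the structure above:

```python
import h5py
import pandas as pd
from torch.utils.data import Dataset

class PATSIntervalDataset(Dataset):
    """Loads one feature array per interval; windowing and padding omitted."""
    def __init__(self, path2data, speaker, modality='pose/data'):
        df = pd.read_csv(f'{path2data}/cmu_intervals_df.csv', dtype=object)
        self.ids = df.loc[df['speaker'] == speaker, 'interval_id'].tolist()
        self.root = f'{path2data}/processed/{speaker}'
        self.modality = modality

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        with h5py.File(f'{self.root}/{self.ids[idx]}.h5', 'r') as f:
            return f[self.modality][()]
```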

# Issues
All research has a tag of work in progress. If you find any issues with this code, feel free to raise issues or, even better, pull requests, and I will get to them as soon as humanly possible.
@@ -0,0 +1,44 @@
import argparse
import itertools
from ast import literal_eval

def get_args_perm():
  parser = argparse.ArgumentParser()

  ## Dataset Parameters
  parser.add_argument('-path2data', nargs='+', type=str, default=['pats/data/'],
                      help='path to data')
  parser.add_argument('-speaker', nargs='+', type=literal_eval, default=['bee'],
                      help='choose speaker or `all` to use all the speakers available')
  parser.add_argument('-modalities', nargs='+', type=literal_eval, default=[['pose/data', 'audio/log_mel_512']],
                      help='choose a set of modalities to be loaded by the dataloader')
  parser.add_argument('-split', nargs='+', type=literal_eval, default=[None],
                      help='(train,dev) split of data. default=None')
  parser.add_argument('-batch_size', nargs='+', type=int, default=[32],
                      help='minibatch size. Use batch_size=1 when using time=0')
  parser.add_argument('-shuffle', nargs='+', type=int, default=[1],
                      help='shuffle the data after each epoch. default=True')
  parser.add_argument('-time', nargs='+', type=float, default=[4.3],  ## float, not int: window length in seconds
                      help='time (in seconds) for each sample')
  parser.add_argument('-fs_new', nargs='+', type=literal_eval, default=[[15, 15]],
                      help='subsample to the new frequency')

  args, unknown = parser.parse_known_args()
  print(args)
  print(unknown)

  ## Create a permutation of all the values in argparse
  args_dict = args.__dict__
  args_keys = sorted(args_dict)
  args_perm = [dict(zip(args_keys, prod)) for prod in itertools.product(*(args_dict[names] for names in args_keys))]

  return args, args_perm

def argparseNloop(loop):
  args, args_perm = get_args_perm()

  for i, perm in enumerate(args_perm):
    args.__dict__.update(perm)
    print(args)
    loop(args, i)
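
## Usage sketch (hypothetical): every argument is list-valued, so
## `argparseNloop` calls the given function once per permutation of the
## argument values, e.g. for a small grid search:
##   def train(args, exp_num):
##     print(exp_num, args.speaker, args.batch_size)
##   argparseNloop(train)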
@@ -0,0 +1,9 @@
import sys
sys.path.insert(0, '..')

from .dataUtils import *
from .transform import *
from .common import *
from .skeleton import *
from .audio import *
from .text import *
@@ -0,0 +1,102 @@
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from pathlib import Path
import librosa    ## used by the spectrogram and resampling methods below
import webrtcvad  ## used for voice activity detection in `silence`
import numpy as np
import pandas as pd
import pdb
from tqdm import tqdm
import warnings

from common import Modality, MissingData

class Audio(Modality):
  def __init__(self, path2data='../dataset/groot/data',
               path2outdata='../dataset/groot/data',
               speaker='all',
               preprocess_methods=['log_mel_512']):
    super(Audio, self).__init__(path2data=path2data)
    self.path2data = path2data
    self.df = pd.read_csv(Path(self.path2data)/'cmu_intervals_df.csv', dtype=object)
    self.df.loc[:, 'delta_time'] = self.df['delta_time'].apply(float)
    self.df.loc[:, 'interval_id'] = self.df['interval_id'].apply(str)

    self.path2outdata = path2outdata
    self.speaker = speaker
    self.preprocess_methods = preprocess_methods

    self.missing = MissingData(self.path2data)

  def log_mel_512(self, y, sr, eps=1e-10):
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
    ## replace exact zeros with eps to avoid log(0)
    mask = (spec == 0).astype(float)
    spec = mask * eps + (1-mask) * spec
    return np.log(spec).transpose(1, 0)

  def log_mel_400(self, y, sr, eps=1e-6):
    y = librosa.core.resample(y, orig_sr=sr, target_sr=16000)  ## resampling to 16k Hz
    sr = 16000
    n_fft = 512
    hop_length = 160
    win_length = 400
    S = librosa.core.stft(y=y.reshape((-1)),
                          n_fft=n_fft,
                          hop_length=hop_length,
                          win_length=win_length,
                          center=False)

    S = np.abs(S)
    spec = librosa.feature.melspectrogram(S=S,
                                          sr=sr,
                                          n_fft=n_fft,
                                          hop_length=hop_length,
                                          power=1,
                                          n_mels=64,
                                          fmin=125.0,
                                          fmax=7500.0,
                                          norm=None)
    ## replace exact zeros with eps to avoid log(0)
    mask = (spec == 0).astype(float)
    spec = mask * eps + (1-mask) * spec
    return np.log(spec).transpose(1, 0)

  def silence(self, y, sr, eps=1e-6):
    vad = webrtcvad.Vad(3)  ## aggressiveness 3: most aggressive filtering
    y = librosa.core.resample(y, orig_sr=sr, target_sr=16000)  ## resampling to 16k Hz
    fs_old = 16000
    fs_new = 15
    ## split the signal into 15 fps frames to align with the pose features
    ranges = np.arange(0, y.shape[0], fs_old/fs_new)
    starts = ranges[0:-1]
    ends = ranges[1:]

    is_speeches = []
    for start, end in zip(starts, ends):
      ## VAD operates on 10 ms sub-frames (fs_old/100 samples each)
      Ranges = np.arange(start, end, fs_old/100)
      is_speech = []
      for s, e in zip(Ranges[:-1], Ranges[1:]):
        try:
          is_speech.append(vad.is_speech(y[int(s):int(e)].tobytes(), fs_old))
        except:
          pdb.set_trace()
      ## mark the frame as silent (1) if at most half its sub-frames have speech
      is_speeches.append(int(np.array(is_speech, dtype=int).mean() <= 0.5))
    is_speeches.append(0)
    return np.array(is_speeches, dtype=int)

  @property
  def fs_map(self):
    ## frame rate of each processed audio feature
    return {
      'log_mel_512': int(45.6*1000/512),  #int(44.1*1000/512) #112 #round(22.5*1000/512)
      'log_mel_400': int(16.52*1000/160),
      'silence': 15
    }

  def fs(self, modality):
    modality = modality.split('/')[-1]
    return self.fs_map[modality]

  @property
  def h5_key(self):
    return 'audio'
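
## Usage sketch (hypothetical paths; assumes cmu_intervals_df.csv is present):
##   audio = Audio(path2data='pats/data')
##   y, sr = librosa.load('pats/data/raw/bee/interval.mp3', sr=None)
##   spec = audio.log_mel_512(y, sr)  ## log mel spectrogram, one row per hop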