Python Speech Features CUDA

This package is a re-implementation of Python Speech Features that offers up to hundreds of times performance boost on CUDA-enabled GPUs. The API is kept as close as possible to the original implementation so that existing projects can benefit from the acceleration with minimal code changes. Even without access to a CUDA GPU, this package can still deliver a decent speedup (roughly 20x) by utilizing multiple CPU cores, optimizing RAM usage, etc.

Speedup Plot

The performance of the three most important functions, namely mfcc, ssc, and delta, was tested on random signals of length 500,000, i.e. roughly 30 seconds each. Taking the speed of the original implementation as the baseline (i.e. 1x), the vertical axis shows the speed gain and the horizontal axis the batch size. The acceleration is universal whether the backend is NumPy (CPU) or CuPy (CUDA GPU), although the advantage of the GPU is far more significant. Please also note that the astonishing performance of the delta function is due to a reworked logic.

Note that the benchmark was run on a system with an Intel i9-7920X (12-core) and an NVIDIA RTX 2080 Ti; actual performance may vary on different setups.

Getting Started

This section will walk you through the installation and prerequisites.

Dependencies

The package was developed on the following dependencies:

  1. NumPy (1.19).
  2. CuPy (7.6).

Please note that the dependencies may require Python 3.7 or greater. It is recommended to install and maintain all packages using conda or pip. Installing CuPy requires additional effort to set up CUDA; please check the official CUDA website for detailed instructions. Also, since this package only uses the most generic functions, which are expected to be stable across dependency versions, it will likely work well even with lower versions.

Optional dependencies:

  1. pyFFTW (0.12)
  2. Numba (0.52)

If available, they will be auto-detected and loaded during initialization. A couple of routines in the _acc sub-module use both packages to enhance CPU performance. Of course, you do not need them if you have a CUDA-enabled GPU and use CuPy as the backend.

For Numba installation, it is highly recommended to build directly from the project's GitHub repository, as the framework is constantly improved through frequent updates. As of version 0.52, it can even beat Cython in numerical computation tasks.

pip install git+git://github.com/numba/numba

Installation

To install from PyPI:

pip install python_speech_features_cuda

To install from GitHub repo using pip:

pip install git+git://github.com/vkola-lab/python_speech_features_cuda

What Is Different

All changes are made around the point of performance gain.

Intermediate result buffer

Intermediate results (e.g. Mel filterbank and DCT matrix) can be buffered to avoid duplicated computation when all parameters remain the same. It is possibly the major reason why this implementation is still faster on CPU.
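A minimal sketch of the buffering idea (the cache and function names here are illustrative, not the package's internals): parameter-dependent matrices are memoized by their parameters, so repeated calls with unchanged settings skip the computation entirely.

```python
import numpy as np

# Illustrative cache, not the package's actual internals:
# intermediate matrices are memoized, keyed by their parameters.
_buffer = {}

def dct_matrix(numcep, nfilt):
    """Return a type-II DCT basis, computed once per (numcep, nfilt)."""
    key = ('dct', numcep, nfilt)
    if key not in _buffer:
        n = np.arange(nfilt)
        k = np.arange(numcep)[:, None]
        _buffer[key] = np.cos(np.pi * k * (2 * n + 1) / (2 * nfilt))
    return _buffer[key]

m1 = dct_matrix(13, 26)
m2 = dct_matrix(13, 26)   # cache hit: the very same array is returned
assert m1 is m2
```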

Batch process

The original implementation can process only one signal sequence at a time. That is sufficient in a CPU-only environment; in fact, over-vectorizing NumPy code can actually hurt performance due to cache misses. GPUs are another story: roughly speaking, they only unleash their parallelism when fed as many signals as possible at once. As the plot above shows, the GPU code gains performance consistently as the batch size increases. Here, functions can be fed multiple sequences as a single batch ndarray whose leading dimensions are batch dimensions.
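As a rough illustration of why batching pays off, the sketch below (plain NumPy, not the package's actual code; frame_batch is a hypothetical helper) cuts a whole (batch, n_samples) array into overlapping frames in one vectorized step, so every downstream stage (FFT, filterbank) can run over the entire batch at once:

```python
import numpy as np

# Hypothetical illustration of batched framing, not the package's internals.
# A (batch, n_samples) array is cut into overlapping frames in one shot;
# this simple version floors the frame count, i.e. drops any tail samples.
def frame_batch(sig, frame_len, frame_step):
    batch, n = sig.shape
    n_frames = 1 + (n - frame_len) // frame_step
    # (n_frames, frame_len) index grid, broadcast over the batch dimension
    idx = (np.arange(frame_len)[None, :]
           + frame_step * np.arange(n_frames)[:, None])
    return sig[:, idx]  # shape: (batch, n_frames, frame_len)

sig = np.random.rand(4, 500000)       # a batch of 4 signals
frames = frame_batch(sig, 400, 160)   # 25 ms window, 10 ms step at 16 kHz
print(frames.shape)                   # (4, 3123, 400)
```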

Strict floating-point control

Numerical data types are almost transparent to Python coders, but they must be explicit for GPU programming. To constrain the floating-point type, this implementation introduces a global 'knob' indicating which floating-point type (i.e. FP32 or FP64) is expected; any input ndarray must be consistent with it, or a TypeError is raised.

API changes

The API is kept almost the same except that sub-module sigproc is removed. All functions previously under sigproc can now be accessed at the package root level. This is to adopt the 'pythonic' principle of 'flat is better than nested.'

A few function argument names have also been changed to make them more uniform. For example, NFFT and nfft are both unified to nfft, though this goes unnoticed if arguments are passed positionally.

Examples

Import, then check backend and floating-point type

import python_speech_features_cuda as psf

print(psf.env.backend.__name__)  # >>> cupy
print(psf.env.dtype.__name__)    # >>> float64

By default, the backend is set to CuPy and the data type to float64. If CuPy is not found in the environment, the backend falls back to NumPy at package initialization.

Change backend and floating-point type

import numpy as np

psf.env.backend = np
psf.env.dtype = np.float32

print(psf.env.backend.__name__)  # >>> numpy
print(psf.env.dtype.__name__)    # >>> float32

Call MFCC()

# initialize a batch of 4 signals of length 500,000 each
# (cupy.random.rand accepts a dtype argument; with the NumPy backend,
#  use .astype(psf.env.dtype) instead)
sig = psf.env.backend.random.rand(4, 500000, dtype=psf.env.dtype)

# apply MFCC
fea = psf.mfcc(sig, samplerate=16000, winlen=.025, winstep=.01, numcep=13,
               nfilt=26, nfft=None, lowfreq=0, highfreq=None, preemph=.97,
               ceplifter=22, appendEnergy=True, winfunc=None)

print(fea.shape)  # >>> (4, 3124, 13)

Please note that the input array MUST be consistent with the package environment in terms of backend and dtype. If the raw data is loaded in a different format, use psf.env.backend.asarray(..., dtype=psf.env.dtype) for conversion.
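For instance, raw audio is often loaded as int16 PCM; the sketch below uses plain NumPy as a stand-in for psf.env.backend and assumes the environment dtype is float64:

```python
import numpy as np

# Plain-NumPy stand-in for psf.env.backend.asarray(..., dtype=psf.env.dtype);
# env_dtype represents whatever psf.env.dtype is currently set to.
env_dtype = np.float64

pcm = np.array([0, 16384, -32768], dtype=np.int16)  # typical raw PCM samples
sig = np.asarray(pcm, dtype=env_dtype)              # convert before any call

print(sig.dtype)  # float64
```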

Call MFCC() with nontrivial window

# calculate window function (vector)
samplerate, winlen = 16000, .025
win_len = int(np.round(samplerate * winlen))
win = psf.env.backend.hamming(win_len).astype(psf.env.dtype)

# apply MFCC
fea = psf.mfcc(sig, nfft=512, winfunc=win)

print(fea.shape)  # >>> (4, 3124, 13)

A window function (e.g. Hamming) has only one degree of freedom, the window/frame length. Since the window length rarely changes in most scenarios, there is no need to recalculate it at every call. This API change is consistent with the idea of buffering.

Clean buffer

# reset buffer
psf.buf.reset()

If the input signals vary a lot in length or batch size, or if multiple parameter combinations are tried in search of the optimum, the buffer can grow monotonically, since all intermediate results are kept. If RAM or GPU memory fills up, call reset() to release it.

Tips

Use FP32

FP32 computation is generally faster than FP64, unless the CPU or GPU offers advanced instruction sets that narrow the gap. For the output of MFCC on random signals normalized within (0, 1), the observed precision error is always below 1e-5, which is tolerable for most applications.
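A toy comparison of the FP32-vs-FP64 gap on a DCT-like reduction (illustrative numbers, not the package's benchmark) shows why the error stays tolerable:

```python
import numpy as np

# Toy FP32-vs-FP64 comparison on a DCT-like matrix product,
# not the package's actual benchmark.
rng = np.random.default_rng(0)
x64 = rng.random(26)                      # signal-like data in (0, 1)
basis = np.cos(np.pi * np.outer(np.arange(13), np.arange(26)) / 26)

out64 = basis @ x64                                         # FP64 reference
out32 = basis.astype(np.float32) @ x64.astype(np.float32)   # FP32 version

err = np.abs(out64 - out32).max()
print(err)            # well below 1e-5 for this toy case
assert err < 1e-5
```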

Interoperability

If we are using CuPy as the backend, all function outputs are CuPy ndarrays stored in GPU memory. If the next consumer of such an ndarray is another GPU function provided by a different package or library (e.g. PyTorch, Numba), we can simply pass the device memory 'pointer' instead of suffering the huge overhead of a GPU->RAM->GPU round trip. Please check the CuPy documentation on interoperability for details.
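This kind of zero-copy handoff commonly goes through the DLPack protocol (e.g. torch.from_dlpack on a CuPy array). Since the same protocol works between NumPy arrays on CPU, the sketch below demonstrates it without needing a GPU (requires NumPy >= 1.22 for np.from_dlpack):

```python
import numpy as np

# CPU demonstration of the DLPack zero-copy handoff; on GPU the same
# protocol lets e.g. PyTorch consume a CuPy array without copying.
a = np.arange(8, dtype=np.float64)
b = np.from_dlpack(a)   # zero-copy: b shares a's memory

a[0] = 42.0
print(b[0])             # 42.0 -- no data was copied
```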

Authors

  • Chonghua Xue, [email protected] - Kolachalama laboratory, Boston University School of Medicine
