
Noise is being added to generated speech in Python E2E flow (TFLite Models) #135

Open
barrylee111 opened this issue Oct 25, 2023 · 2 comments


Description

I am working on a Unity project in which I modulate voices (e.g. source speech → voice modulator → target speech (elf)). I have an E2E flow running with the TFLite models, but a significant amount of noise is being added to the generated speech; it sounds almost like clipping. I am using the TFLite models from the repo and have split the quantizer into a QuantizerEncoder and a QuantizerDecoder. I'm not sure whether a better solution would be to convert Lyra into a DLL and run that in Unity instead of using the models, but this is what I have so far.
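
As a side note on the split quantizer: one way to double-check how the two quantizer models expect their tensors to be wired is to print their input/output details. This is just a small inspection sketch, assuming the tflite files from the Code section below have already been downloaded:

import tensorflow as tf

# Print the expected input/output signatures of the split quantizer models
for path in ["quantizer_encoder.tflite", "quantizer_decoder.tflite"]:
    interp = tf.lite.Interpreter(model_path=path)
    interp.allocate_tensors()
    print(path)
    print("  inputs: ", [(d['name'], list(d['shape']), d['dtype']) for d in interp.get_input_details()])
    print("  outputs:", [(d['name'], list(d['shape']), d['dtype']) for d in interp.get_output_details()])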

E2E Flow

  • Load a wav file via librosa
  • Pad the data so that data_length % 320 == 0
  • Feed the data through the 4 models: Encoder, QuantizerEncoder, QuantizerDecoder, & Decoder
  • Store the waveform data as I go, both as:
    • One single array of data
    • A series of audio clips
  • Save the single array of waveform data as a wav file
  • Play back the file

Code

!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/soundstream_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/lyragan.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_encoder.tflite
!wget -q https://huggingface.co/rocca/lyra-v2-soundstream/resolve/main/tflite/1.3.0/quantizer_decoder.tflite

import tensorflow as tf
import numpy as np

import librosa

def getAudioData(audio_file, verbose=False):
    # Load at the file's native sample rate
    data, sr = librosa.load(audio_file, sr=None)

    if verbose:
        print(len(data))

    # Zero-pad so the length is a multiple of the 320-sample frame size
    # (the outer modulo avoids appending a full extra frame when it already is)
    batch_size = 320
    padding_length = (batch_size - (len(data) % batch_size)) % batch_size
    padded_data = np.pad(data, (0, padding_length), mode='constant', constant_values=0)

    return padded_data, sr

# Encoder:
def runEncoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="soundstream_encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Reshape the 320-sample frame to the model's (1, 320) input shape
    input_data = np.array(input_data, dtype=input_details[0]['dtype'])
    input_data = np.reshape(input_data, (1, 320))

    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data

# Quantizer Encoder:
def runQuantizerEncoderInference(input_data2, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_encoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # First input: scalar number of quantizers to use (46 here)
    input_data1 = np.array(46, dtype=np.int32)
    interpreter.set_tensor(input_details[0]['index'], input_data1)

    # Second input: the encoder's output features
    interpreter.set_tensor(input_details[1]['index'], input_data2)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data

# Quantizer Decoder:
def runQuantizerDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="quantizer_decoder.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Input: the quantized codes produced by the quantizer encoder
    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data

# Decoder:
def runDecoderInference(input_data, verbose=False):
    interpreter = tf.lite.Interpreter(model_path="lyragan.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Input: the dequantized features from the quantizer decoder
    interpreter.set_tensor(input_details[0]['index'], input_data)

    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    if verbose:
        print(output_data)

    return output_data

audio_file = "<wavfile_path>.wav"
data, sr = getAudioData(audio_file)  # already zero-padded to a multiple of 320 samples

batch_size = 320
num_batches = len(data) // batch_size
waveform_data = None
audio_clips = None

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    batch_data = data[start_idx:end_idx]

    enc_output = runEncoderInference(batch_data)
    qe_output = runQuantizerEncoderInference(enc_output)
    qd_output = runQuantizerDecoderInference(qe_output)
    dec_output = runDecoderInference(qd_output)

    if i == 0:
        waveform_data = dec_output[0] # Single array holding all decoded samples
        audio_clips = dec_output      # Decoded frames kept as separate clips
    else:
        waveform_data = np.concatenate((waveform_data, dec_output[0]))
        audio_clips = np.concatenate((audio_clips, dec_output))

import torchaudio
import torch

# torchaudio expects a (channels, frames) tensor
audio_tensor = torch.from_numpy(waveform_data).unsqueeze(0)
output_file = "<your_output_path>.wav"
torchaudio.save(output_file, audio_tensor, sr)
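
For the final "play back the file" step, a small notebook sketch I use to sanity-check the result; it is not part of the flow above, and the peak-amplitude print is only a quick way to spot clipping-like values (anything outside [-1.0, 1.0] will clip on playback):

import IPython.display as ipd

# Values outside [-1.0, 1.0] will clip when written/played back
print("peak amplitude:", np.abs(waveform_data).max())

# Play the saved wav directly in the notebook
ipd.Audio(output_file)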

Questions

  • Is the better solution to build Lyra as a DLL and run that in Unity?
  • Do the models encompass all of the pre- and post-processing needed to produce the clean output signal that the C++ implementation provides (e.g. the Integration Test Example)?
  • Have I made an error in my implementation? I haven't been able to find a Python implementation that runs the data and tests, so this is what I've come up with so far.
  • Is the noise possibly due to the fact that I am concatenating all of the data rather than playing each clip back iteratively? I attempted to play back the output iteratively as plain waveform data in a Jupyter Notebook instead of saved wav files, but no sound was produced (see the sketch after this list).
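
The per-clip playback sketch referenced above, a rough guess at how I would expect it to work in a notebook. Two assumptions here: IPython.display.Audio needs an explicit rate when it is given a raw NumPy array, and a single 320-sample clip (~20 ms) is too short to judge by ear, so the sketch groups frames before playing them (frames_per_group is an arbitrary choice):

import IPython.display as ipd
import numpy as np

frames_per_group = 50  # about 1 second per group, assuming 20 ms (320-sample, 16 kHz) frames
for start in range(0, len(audio_clips), frames_per_group):
    # Flatten a group of decoded frames into one 1-D array and play it
    group = np.asarray(audio_clips[start:start + frames_per_group]).reshape(-1)
    ipd.display(ipd.Audio(group, rate=sr))  # rate is required for raw arrays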

Resources

Sound samples.zip


shlomiez commented May 5, 2024

Same happens to me... Did you solve it?


barrylee111 commented May 7, 2024

@shlomiez The output data had a prefix added to it with values that were far out of range (very high or very low). I don't remember the exact cause of the issue, but I do remember the fix was in how we were staging and creating the DLL: one of the methods we added to the DLL was incorrectly prefixing those out-of-range data points to the data.

Hope this helps!
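
Not our actual fix (that was in how the DLL was staged and built), but for anyone hitting the same symptom, an illustrative sketch of what detecting and trimming such an out-of-range prefix could look like (the 1.0 limit is just an assumed threshold for normalized float audio):

import numpy as np

def trim_out_of_range_prefix(samples, limit=1.0):
    # Drop leading samples whose magnitude is outside the expected range;
    # limit=1.0 assumes normalized float audio in [-1.0, 1.0]
    in_range = np.abs(samples) <= limit
    first_valid = int(np.argmax(in_range)) if in_range.any() else len(samples)
    return samples[first_valid:]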
