
transcribe method broken? #11516

Open
hammondm opened this issue Dec 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments

hammondm commented Dec 9, 2024

Hi.

For some reason, the transcribe method causes a core dump with QuartzNet or Conformer. I'm running this in Docker. My code is tweaked from one of the tutorials.

Mike H

Here's my code:

import os,glob,subprocess,tarfile
import wget,nemo,librosa,json
from ruamel.yaml import YAML
import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl
from omegaconf import DictConfig

data_dir = '/data/an4'
config_path = 'quartzconf.yaml'
epochs = 3

if not os.path.exists(data_dir):
	os.makedirs(data_dir)

#download data
if not os.path.exists(
		data_dir + '/an4_sphere.tar.gz'
	):
	an4_url = 'https://dldata-public.s3.us' + \
		'-east-2.amazonaws.com/an4_sphere.tar.gz'
	an4_path = wget.download(an4_url,data_dir)
else:
	an4_path = data_dir + '/an4_sphere.tar.gz'

#convert to wav files
if not os.path.exists(data_dir + '/an4/'):
	tar = tarfile.open(an4_path)
	tar.extractall(path=data_dir)
	sph_list = glob.glob(
		data_dir + '/an4/**/*.sph',
		recursive=True
	)
	for sph_path in sph_list:
		wav_path = sph_path[:-4] + '.wav'
		cmd = ["sox",sph_path,wav_path]
		subprocess.run(cmd)

#function to create manifest file
def build_manifest(
		transcripts_path,
		manifest_path,wav_path
	):
	with open(transcripts_path,'r') as fin:
		with open(manifest_path,'w') as fout:
			for line in fin:
				transcript = line[: \
					line.find('(')-1].lower()
				transcript = transcript.replace(
					'<s>',''
				).replace('</s>','')
				transcript = transcript.strip()
				file_id = line[line.find('(')+1 : -2]
				audio_path = os.path.join(
					data_dir,wav_path,
					file_id[file_id.find('-')+1 : \
						file_id.rfind('-')],
					file_id + '.wav')
				duration = librosa.core.get_duration(
					filename=audio_path
				)
				metadata = {
					"audio_filepath": audio_path,
					"duration": duration,
					"text": transcript
				}
				json.dump(metadata,fout)
				fout.write('\n')
				
#make manifest files
train_transcripts = data_dir + \
	'/an4/etc/an4_train.transcription'
train_manifest = data_dir + \
	'/an4/train_manifest.json'
if not os.path.isfile(train_manifest):
	build_manifest(
		train_transcripts,
		train_manifest,
		'an4/wav/an4_clstk'
	)
test_transcripts = data_dir + \
	'/an4/etc/an4_test.transcription'
test_manifest = data_dir + \
	'/an4/test_manifest.json'
if not os.path.isfile(test_manifest):
	build_manifest(
		test_transcripts,
		test_manifest,
		'an4/wav/an4test_clstk'
	)

#read config from yaml file
yaml = YAML(typ='safe')
with open(config_path) as f:
	params = yaml.load(f)

print(params)

#build trainer
trainer = pl.Trainer(
	devices=1,
	accelerator='gpu',
	max_epochs=epochs
)

#specify training and validation data
params['model']['train_ds']\
	['manifest_filepath'] = train_manifest
params['model']['validation_ds']\
	['manifest_filepath'] = test_manifest

#build model
first_asr_model = \
	nemo_asr.models.EncDecCTCModel(
		cfg=DictConfig(params['model']),
		trainer=trainer
)

#train
trainer.fit(first_asr_model)

#do some inference
paths2audio_files = [
		os.path.join(
			data_dir,
			'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'
		),
		os.path.join(
			data_dir,
			'an4/wav/an4_clstk/fmjd/cen7-fmjd-b.wav'
		),
		os.path.join(
			data_dir,
			'an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav'
		),
		os.path.join(
			data_dir,
			'an4/wav/an4_clstk/fkai/cen8-fkai-b.wav'
		)
	]
#print(first_asr_model.transcribe(
#	paths2audio_files=paths2audio_files,
#	batch_size=4
#))
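
Since the crash below produces no Python traceback, one thing I can add at the top of the script is the standard-library faulthandler module, which prints the active Python frames when the process receives a fatal signal (a debugging sketch, not part of my original code):

import faulthandler

# dump the Python traceback of every thread if the process receives a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, ...), so the crashing call
# is visible even when the interpreter itself goes down
faulthandler.enable()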

Here's the output:

{'name': 'QuartzNet15x5', 'sample_rate': 16000, 'repeat': 1, 'dropout': 0.0, 'separable': True, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'model': {'train_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'batch_size': 32, 'trim_silence': True, 'max_duration': 16.7, 'shuffle': True, 'num_workers': 4, 'pin_memory': True, 'is_tarred': False, 'tarred_audio_filepaths': None, 'shuffle_n': 2048, 'bucketing_strategy': 'synced_randomized', 'bucketing_batch_size': None}, 'validation_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'batch_size': 32, 'shuffle': False, 'num_workers': 4, 'pin_memory': True}, 'preprocessor': {'_target_': 'nemo.collections.asr.modules.audio_preprocessing.AudioToMelSpectrogramPreprocessor', 'normalize': 'per_feature', 'window_size': 0.02, 'sample_rate': 16000, 'window_stride': 0.01, 'window': 'hann', 'features': 64, 'n_fft': 512, 'frame_splicing': 1, 'dither': 1e-05, 'stft_conv': False}, 'spec_augment': {'_target_': 'nemo.collections.asr.modules.SpectrogramAugmentation', 'rect_freq': 50, 'rect_masks': 5, 'rect_time': 120}, 'encoder': {'_target_': 'nemo.collections.asr.modules.ConvASREncoder', 'feat_in': 64, 'activation': 'relu', 'conv_mask': True, 'jasper': [{'filters': 128, 'repeat': 1, 'kernel': [11], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': True, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 256, 'repeat': 1, 'kernel': [13], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': True, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 256, 'repeat': 1, 'kernel': [15], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': True, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 256, 'repeat': 1, 'kernel': [17], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': True, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 256, 'repeat': 1, 'kernel': [19], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': True, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 256, 'repeat': 1, 'kernel': [21], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': False, 'separable': True, 'se': True, 'se_context_size': -1}, {'filters': 1024, 'repeat': 1, 'kernel': [1], 'stride': [1], 'dilation': [1], 'dropout': 0.0, 'residual': False, 'separable': True, 'se': True, 'se_context_size': -1}]}, 'decoder': {'_target_': 'nemo.collections.asr.modules.ConvASRDecoder', 'feat_in': 1024, 'num_classes': 28, 'vocabulary': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"]}, 'optim': {'name': 'novograd', 'lr': 0.01, 'betas': [0.8, 0.5], 'weight_decay': 0.001, 'sched': {'name': 'CosineAnnealing', 'monitor': 'val_loss', 'reduce_on_plateau': False, 'warmup_steps': None, 'warmup_ratio': None, 'min_lr': 0.0, 'last_epoch': -1}}}, 'trainer': {'devices': 1, 'max_epochs': 5, 'max_steps': -1, 'num_nodes': 1, 'accelerator': 'gpu', 'strategy': 'ddp', 'accumulate_grad_batches': 1, 'enable_checkpointing': False, 'logger': False, 'log_every_n_steps': 1, 'val_check_interval': 
1.0, 'benchmark': False}, 'exp_manager': {'exp_dir': None, 'name': 'QuartzNet15x5', 'create_tensorboard_logger': True, 'create_checkpoint_callback': True}}
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2024-12-09 16:13:59 audio_to_text_dataset:49] Model level config does not contain `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2024-12-09 16:13:59 audio_to_text_dataset:49] Model level config does not contain `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2024-12-09 16:13:59 collections:196] Dataset loaded with 948 files totalling 0.71 hours
[NeMo I 2024-12-09 16:13:59 collections:197] 0 files were filtered totalling 0.00 hours
[NeMo I 2024-12-09 16:13:59 audio_to_text_dataset:49] Model level config does not contain `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2024-12-09 16:13:59 audio_to_text_dataset:49] Model level config does not contain `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2024-12-09 16:13:59 collections:196] Dataset loaded with 130 files totalling 0.10 hours
[NeMo I 2024-12-09 16:13:59 collections:197] 0 files were filtered totalling 0.00 hours
[NeMo I 2024-12-09 16:13:59 features:289] PADDING: 16
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo I 2024-12-09 16:14:00 modelPT:728] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.01
        weight_decay: 0.001
    )
[NeMo I 2024-12-09 16:14:00 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f8ff9db6da0>" 
    will be used during training (effective maximum steps = 90) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1
    max_steps: 90
    )

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 1.2 M 
2 | decoder           | ConvASRDecoder                    | 29.7 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.836     Total estimated model params size (MB)
[NeMo W 2024-12-09 16:14:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (30) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
      rank_zero_warn(
    
Epoch 2: 100%|████████████████████████| 30/30 [00:01<00:00, 15.07it/s, v_num=15]`Trainer.fit` stopped: `max_epochs=3` reached.                                  
Epoch 2: 100%|████████████████████████| 30/30 [00:02<00:00, 14.87it/s, v_num=15]
Transcribing:   0%|                                       | 0/1 [00:00<?, ?it/s]Segmentation fault (core dumped)
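
The segfault happens as soon as transcription starts, so isolating transcribe from the training run might narrow it down. A minimal sketch, assuming NeMo 23.10 still takes the paths2audio_files keyword (later releases renamed it to audio) and that the QuartzNet15x5Base-En pretrained checkpoint downloads from NGC:

import nemo.collections.asr as nemo_asr

# load a pretrained checkpoint instead of the freshly trained model,
# to rule out the new weights as the cause of the crash
model = nemo_asr.models.EncDecCTCModel.from_pretrained(
	'QuartzNet15x5Base-En'
)

# a single file with batch_size=1 keeps the failing case minimal
print(model.transcribe(
	paths2audio_files=[
		'/data/an4/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'
	],
	batch_size=1
))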

Here's how the container was run:

docker run -it \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8888:8888 \
  --name ne \
  -v /data/:/mhdata \
  nvcr.io/nvidia/nemo:23.10
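
For the environment details requested below, the relevant versions can be printed from inside the container (a quick sketch using only standard version attributes):

import sys
import torch
import nemo

# versions that usually matter when reporting a NeMo segfault
print('Python :', sys.version.split()[0])
print('PyTorch:', torch.__version__)
print('CUDA   :', torch.version.cuda)
print('NeMo   :', nemo.__version__)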

Describe the bug

A clear and concise description of what the bug is.

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
  • Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

hammondm added the bug label Dec 9, 2024