magor_sgenglish

This is in development.

The final system takes audio/video files (both single files and multi-channel recordings) as input, then runs them through a pipeline of processing tasks: resampling, diarization, transcription using the Google Speech APIs or an in-house LVCSR system, keyframe captioning, visualization, etc.

The repository includes the core system, as well as a collection of reference modules for the above-mentioned tasks.

System pre-requisites

These are the prerequisites for the core system only. For module-specific requirements, consult the Included modules and procedures section below.

Linux
Python 2

The system was developed using Python 2.7.12 on Lubuntu 16.04.2 LTS.

Setup

Clone this project: $ git clone https://github.com/nguyenhuyanhh/magor_sgenglish.git --recursive
Set up the individual module dependencies automatically ($ python system.py setup --all) or manually (as prescribed in Included modules and procedures)
Put the files to process into /crawl (from a crawler or by hand)
Run the system: $ python system.py process [options] (for a list of options, see $ python system.py process -h or consult the Command-line interface section below).

Documentation

Command-line interface

System setup: `setup`

$ python system.py setup -h
usage: system.py setup [-h] [-m [module_id [module_id ...]]] [-a]

optional arguments:
  -h, --help            show this help message and exit
  -m [module_id [module_id ...]], --modules [module_id [module_id ...]]
                        module_ids to setup
  -a, --all             setup all available modules

Interactive processing mode: `process`

$ python system.py process -h
usage: system.py process [-h] [-b] [-p [procedure_id [procedure_id ...]]]
                         [-f [file_name [file_name ...]]] [-i [file_id]] [-t]
                         [-n]
                         [process_id]

positional arguments:
  process_id            process_id for this run

optional arguments:
  -h, --help            show this help message and exit
  -b, --batch           batch mode
  -p [procedure_id [procedure_id ...]], --procedures [procedure_id [procedure_id ...]]
                        procedures to run
  -f [file_name [file_name ...]], --files [file_name [file_name ...]]
                        file_names to process
  -i [file_id], --id [file_id]
                        file_id to process
  -t, --test            just do system checks and exit
  -n, --simulate        simulate the run, without processing any file
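
The two help listings above are the kind of output Python's argparse produces for a parser with two sub-commands. The sketch below is a reconstruction from that help text only, not the code in system.py; the function name `build_parser` and the `dest` names (`command`, `file_id`) are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Rebuild a parser whose options match the help text above (a sketch)."""
    parser = argparse.ArgumentParser(prog="system.py")
    commands = parser.add_subparsers(dest="command")

    # system.py setup [-m [module_id ...]] [-a]
    setup = commands.add_parser("setup")
    setup.add_argument("-m", "--modules", nargs="*", metavar="module_id",
                       help="module_ids to setup")
    setup.add_argument("-a", "--all", action="store_true",
                       help="setup all available modules")

    # system.py process [options] [process_id]
    process = commands.add_parser("process")
    process.add_argument("process_id", nargs="?",
                         help="process_id for this run")
    process.add_argument("-b", "--batch", action="store_true",
                         help="batch mode")
    process.add_argument("-p", "--procedures", nargs="*",
                         metavar="procedure_id", help="procedures to run")
    process.add_argument("-f", "--files", nargs="*", metavar="file_name",
                         help="file_names to process")
    process.add_argument("-i", "--id", nargs="?", metavar="file_id",
                         dest="file_id", help="file_id to process")
    process.add_argument("-t", "--test", action="store_true",
                         help="just do system checks and exit")
    process.add_argument("-n", "--simulate", action="store_true",
                         help="simulate the run, without processing any file")
    return parser


args = build_parser().parse_args(
    ["process", "process-1", "-p", "google", "-f", "a.mp3", "b.wav", "-n"])
print(args.process_id, args.procedures, args.files, args.simulate)
```

Because `-p` and `-f` accept any number of values, the positional `process_id` is safest placed directly after `process`, before those options.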

Batch processing mode: `process -b`

Specifying process options on the command line is tedious, so there is a batch processing mode that reads them from operations.json instead.

The format of operations.json is as follows:

[
    {
        "process_id": "process-1",
        "procedures": [],
        "file_names": [],
        "file_id": "file-id",
        "simulate": false
    }
]

Each entry mirrors the options of the interactive mode. If process_id and/or procedures are not specified, the defaults from the manifest are used.
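
To make the correspondence with the interactive mode concrete, the sketch below turns one operations.json entry into the equivalent `process` command line. The key-to-flag mapping (`procedures` to `-p`, `file_names` to `-f`, `file_id` to `-i`, `simulate` to `-n`) is inferred from the names and is an assumption, as is the helper name `entry_to_argv`.

```python
import json


def entry_to_argv(entry):
    """Build the interactive-mode argument list equivalent to one batch entry."""
    argv = ["python", "system.py", "process"]
    # The positional process_id goes first so that the multi-value options
    # (-p, -f) cannot swallow it.
    if entry.get("process_id"):
        argv.append(entry["process_id"])
    if entry.get("procedures"):
        argv += ["-p"] + list(entry["procedures"])
    if entry.get("file_names"):
        argv += ["-f"] + list(entry["file_names"])
    if entry.get("file_id"):
        argv += ["-i", entry["file_id"]]
    if entry.get("simulate"):
        argv.append("-n")
    return argv


operations = json.loads("""
[
    {
        "process_id": "process-1",
        "procedures": ["google"],
        "file_names": [],
        "file_id": "file-id",
        "simulate": false
    }
]
""")

for entry in operations:
    print(" ".join(entry_to_argv(entry)))
```

Empty lists and missing keys are simply skipped, which matches the statement that unspecified values fall back to the manifest defaults.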

Overall repository structure

Non-critical files are omitted for brevity.

crawl/                  # raw files (from crawler or manual input)
data/                   # main data 
modules/                # modules
utils/                  # utility scripts
manifest.json           # system manifest
system.py               # core system executable

Manifest file structure

The included manifest.json serves as an example:

{
    "processes": {
        "process-1": {
            "resample": "1.0",
            "diarize": "8.4.1",
            "google": "1",
            "lvcsr": "1701",
            "convert": "1.0",
            "capgen": "1.0",
            "visualize": "1.0",
            "vad": "1.0"
        },
        "process-2": {
            "resample": "0.9"
        }
    },
    "default_process": "process-1",
    "procedures":{
        "procedure-id-1":[
            "resample",
            "diarize",
            "google"
        ],
        "procedure-id-2":[]
    },
    "file_types":{
        "audio":[
            ".mp3",
            ".wav"
        ],
        "video":[
            ".mp4"
        ]
    }
}

All leaf values in the manifest file (module versions, module names, file extensions) are of type str.

The manifest is loaded as an instance of the class Manifest, and manifest integrity checks are run before any file is processed.
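
The exact integrity checks performed by the Manifest class are not listed here. As an illustration only, the sketch below runs a few checks that the structure above suggests: the default process must exist, every module named in a procedure must have a default version, and file extensions must start with a dot. The function `check_manifest` and its messages are hypothetical, and the inline manifest is a trimmed copy of the example.

```python
import json

MANIFEST_JSON = """
{
    "processes": {
        "process-1": {"resample": "1.0", "diarize": "8.4.1", "google": "1"},
        "process-2": {"resample": "0.9"}
    },
    "default_process": "process-1",
    "procedures": {"procedure-id-1": ["resample", "diarize", "google"]},
    "file_types": {"audio": [".mp3", ".wav"], "video": [".mp4"]}
}
"""


def check_manifest(manifest):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    processes = manifest.get("processes", {})
    default = manifest.get("default_process")
    if default not in processes:
        problems.append("default_process %s is not defined" % default)
    default_versions = processes.get(default, {})
    # Every module used by a procedure needs a version in the default process.
    for procedure_id, modules in sorted(manifest.get("procedures", {}).items()):
        for name in modules:
            if name not in default_versions:
                problems.append("%s: module %s has no default version"
                                % (procedure_id, name))
    # File extensions are written with a leading dot, e.g. ".mp3".
    for kind, extensions in sorted(manifest.get("file_types", {}).items()):
        for extension in extensions:
            if not extension.startswith("."):
                problems.append("%s: extension %s must start with a dot"
                                % (kind, extension))
    return problems


print(check_manifest(json.loads(MANIFEST_JSON)))  # []
```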

Modules, procedures and processes

Each module in the system performs a function: it takes input files from certain subfolders under the working folder (data/process-id/file-id) and produces output files in other subfolders under the same working folder. Modules can be pipelined into procedures if their input and output requirements are linked.
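
One way to read "linked" is that every input subfolder of a module is already present when the module runs, either as `raw` or as the output of an earlier module. The sketch below checks an ordered list of modules under that reading. The helper `unmet_inputs` is hypothetical, the `resample` entry follows the module manifest example in this README, and the `diarize` entry uses assumed values for illustration only.

```python
def unmet_inputs(pipeline, available=("raw",)):
    """Return (module name, missing input) pairs for an ordered pipeline."""
    produced = set(available)
    missing = []
    for module in pipeline:
        for folder in module["inputs"]:
            if folder not in produced:
                missing.append((module["name"], folder))
        # Outputs become available to every later module in the pipeline.
        produced.update(module["outputs"])
    return missing


RESAMPLE = {"name": "resample", "inputs": ["raw"], "outputs": ["resample"]}
# Assumed values for illustration; the real diarize manifest may differ.
DIARIZE = {"name": "diarize", "inputs": ["resample"], "outputs": ["diarization"]}

print(unmet_inputs([RESAMPLE, DIARIZE]))  # []
print(unmet_inputs([DIARIZE]))            # [('diarize', 'resample')]
```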

A process is a configuration that locks the modules used by each procedure to specific versions, overriding the default versions specified by the default process. The same procedure can therefore run with different module versions under different processes, which enables versioning of module outputs.
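
The override rule can be sketched as a lookup that prefers the version set by the chosen process and falls back to the default process otherwise. The function `resolve_module_ids` is a hypothetical helper, and the manifest is a trimmed copy of the example above; the resulting names follow the `{name}-{version}` form used for module folders.

```python
MANIFEST = {
    "processes": {
        "process-1": {"resample": "1.0", "diarize": "8.4.1", "google": "1"},
        "process-2": {"resample": "0.9"},
    },
    "default_process": "process-1",
    "procedures": {"procedure-id-1": ["resample", "diarize", "google"]},
}


def resolve_module_ids(manifest, process_id, procedure_id):
    """Return the versioned module-ids a procedure uses under a process."""
    defaults = manifest["processes"][manifest["default_process"]]
    overrides = manifest["processes"].get(process_id, {})
    module_ids = []
    for name in manifest["procedures"][procedure_id]:
        # A version set by the process wins over the default version.
        version = overrides.get(name, defaults[name])
        module_ids.append("%s-%s" % (name, version))
    return module_ids


print(resolve_module_ids(MANIFEST, "process-2", "procedure-id-1"))
# ['resample-0.9', 'diarize-8.4.1', 'google-1']
```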

Module file structure

The module-id (module folder name) must be a composition of the module name and its version (in the form {name}-{version}).

modules/
    module-id-1/
        manifest.json       # module manifest
        setup               # setup script (`#!/bin/bash` is recommended)
        module.py           # core module executable
        [optional executables and data files]
    module-id-2/
        ...
    ...

Module manifest file structure

This is an example manifest file for module resample-1.0.

{
    "name": "resample",
    "version": "1.0",
    "requires": [],
    "inputs": [
        "raw"
    ],
    "outputs": [
        "resample"
    ]
}

Field types and value constraints:

| Field | Type | Constraint |
| --- | --- | --- |
| `name` | `str` | |
| `version` | `str` | |
| `requires` | `list(str)` | Module dependencies (required paths under `/modules/module-id`) |
| `inputs` | `list(str)` | Module inputs (subfolders under `/data/process-id/file-id`) |
| `outputs` | `list(str)` | Module outputs (subfolders under `/data/process-id/file-id`) |
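
The field types above, together with the `{name}-{version}` folder naming rule, lend themselves to a small validation routine. The sketch below is one possible check, not the system's own; `check_module_manifest` and its messages are hypothetical.

```python
STRING_TYPES = (str, type(u""))  # also accept unicode strings on Python 2


def check_module_manifest(module_id, manifest):
    """Return a list of problems found in one module manifest."""
    problems = []
    for field in ("name", "version"):
        if not isinstance(manifest.get(field), STRING_TYPES):
            problems.append("%s must be a str" % field)
    for field in ("requires", "inputs", "outputs"):
        value = manifest.get(field)
        if not isinstance(value, list) or not all(
                isinstance(item, STRING_TYPES) for item in value):
            problems.append("%s must be a list of str" % field)
    # The folder name must be the module name and version joined by a hyphen.
    expected_id = "%s-%s" % (manifest.get("name"), manifest.get("version"))
    if module_id != expected_id:
        problems.append("folder %s does not match %s" % (module_id, expected_id))
    return problems


RESAMPLE = {
    "name": "resample",
    "version": "1.0",
    "requires": [],
    "inputs": ["raw"],
    "outputs": ["resample"],
}

print(check_module_manifest("resample-1.0", RESAMPLE))  # []
```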

Included modules and procedures

The modules included within this repository are:

| Module | Version | `module-id` | Author/Contributor | System requirements/setup | Python requirements |
| --- | --- | --- | --- | --- | --- |
| `resample` | 1.0 | `resample-1.0` | Nguyen Huy Anh | `ffmpeg` installed, via `$ sudo apt-get install ffmpeg` | `FFmpy` |
| `convert` | 1.0 | `convert-1.0` | Nguyen Huy Anh | `ffmpeg` installed, via `$ sudo apt-get install ffmpeg` | `FFmpy` |
| `vad` | 1.0 | `vad-1.0` | Pham Van Tung/Nguyen Huy Anh | `ffmpeg` installed, via `$ sudo apt-get install ffmpeg` | `scipy`, `numpy`, `soundfile`, `FFmpy` |
| `diarize` | 8.4.1 | `diarize-8.4.1` | Nguyen Huy Anh | Java 7 or later installed; JDK 7/8 is recommended | None |
| `google` | 1 | `google-1` | Nguyen Huy Anh | A valid Google Service Account Key as `google*/key.json`. How to acquire key | `google-cloud-speech` |
| `lvcsr` | 1701 | `lvcsr-1701` | Xu Haihua/Nguyen Huy Anh | Install Kaldi with `sequitur` (included in `/tools` after successful installation); include `$KALDI_ROOT` as an environment variable in `~/.bashrc`; acquire the models and put them into `/lvcsr/systems`. (The Singapore-English LVCSR models by Xu Haihua are the property of the Speech and Language Research Group, School of Computer Science and Engineering, NTU, and are not available outside NTU.) | None |
| `capgen` | 1.0 | `capgen-1.0` | Peter/Nguyen Huy Anh | Follow the instructions here. Also, put the CPU checkpoints in `capgen*/neuraltalk2/model/` | None |
| `visualize` | 1.0 | `visualize-1.0` | Nguyen Huy Anh | `ffmpeg` installed, via `$ sudo apt-get install ffmpeg` | `FFmpy` |

Most of the setup procedures are automated into setup scripts.

Five procedures are included with this repository:

| Procedure | Description |
| --- | --- |
| `google` | Transcribe audio using the Google Cloud Speech API |
| `lvcsr` | Transcribe audio using the in-house LVCSR system |
| `capgen` | Generate captions for keyframes in videos |
| `vad` | Transcribe multi-channel recordings |
| `visualize` | Visualize transcriptions and captions |

Data folder structure

The /data folder structure stores each module's inputs and outputs in their own folders, independently of other modules. Each record is identified by process_id, the process used, and file_id, a slug of the original file name.

data/
    process-id-1/
        file-id-1/
            raw/            # raw file (.m4a, .mp3, .mp4, .wav)
            resample/       # output of module resample
            convert/        # output of module convert
            vad/            # output of module vad
            diarization/    # output of module diarize
            transcript/
                google/     # output of module google
                lvcsr/      # output of module lvcsr
            keyframes/      # output of module capgen
            visualize/      # output of module visualize
            temp/           # temp files for each module
                google/
                lvcsr/
                vad/
        file-id-2/
            ...
        ...
    process-id-2/
        ...
    ...
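
As a sketch of how a record maps onto this layout, the helpers below derive a file_id from a file name and build the corresponding working folder. The slug rule shown (lower-case the base name, replace runs of non-alphanumeric characters with hyphens) is an assumption for illustration; the system's own slug function may differ. Both helper names are hypothetical.

```python
import os
import re


def slugify(file_name):
    """Hypothetical slug rule: lower-case base name, hyphens for other chars."""
    base = os.path.splitext(os.path.basename(file_name))[0]
    return re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")


def working_dir(process_id, file_id, data_root="data"):
    """Return the working folder data/<process_id>/<file_id> for a record."""
    return os.path.join(data_root, process_id, file_id)


file_id = slugify("crawl/My Interview (2017).mp4")
print(file_id)  # my-interview-2017
print(working_dir("process-1", file_id))
```

Module inputs and outputs such as `raw/` or `transcript/google/` are then subfolders of the path returned by `working_dir`.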
