DPE transcript format

DPE stands for Digital Paper Edit, named after the digital-paper-edit project, also known as autoEdit3.

digital-paper-edit is an application that makes it faster, easier and more accessible to edit audio and video interviews using transcriptions automatically generated by a speech-to-text (STT) service. The current representation of a transcription is a list of timed word objects and a list of paragraphs, each labelled with a speaker.

{
  "words": [
    {
      "end": 0.46, // in seconds
      "start": 0,
      "text": "Hello"
    },
    {
      "end": 1.02,
      "start": 0.46,
      "text": "World"
    },
    ...
  ],
  "paragraphs": [
    {
      "speaker": "SPEAKER_A",
      "start": 0,
      "end": 3
    },
    {
      "speaker": "SPEAKER_B",
      "start": 3,
      "end": 19.2
    },
    ...
  ]
}

Modelling paragraphs and words as separate lists has proven extremely flexible for situations where you need to run alignment on the whole text or just parts of it.

Generating paragraphs

Paragraphs are generally generated from the speaker diarization information provided by the speech-to-text service. When this is not available, they can be generated from sentence-ending punctuation (., ? or !) that might be present in the words.
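As a rough illustration of the punctuation fallback described above, here is a minimal sketch. The function name paragraphsFromPunctuation and the "UNKNOWN" speaker label are assumptions made for this example, not part of any of the adapters. It closes a paragraph whenever a word's text ends in ., ? or !, and also at the last word.

```javascript
// Hypothetical helper (not taken from any DPE adapter): builds DPE paragraphs
// from sentence-ending punctuation when no speaker diarization is available.
const SENTENCE_END = /[.?!]$/;

// "UNKNOWN" is a placeholder speaker label chosen for this sketch.
const paragraphsFromPunctuation = (words, speaker = "UNKNOWN") => {
  const paragraphs = [];
  let paragraphStart = null;
  words.forEach((word, index) => {
    // The first word after a break opens a new paragraph.
    if (paragraphStart === null) paragraphStart = word.start;
    const isLastWord = index === words.length - 1;
    // Close the paragraph on sentence-ending punctuation or at the final word.
    if (SENTENCE_END.test(word.text) || isLastWord) {
      paragraphs.push({ speaker, start: paragraphStart, end: word.end });
      paragraphStart = null;
    }
  });
  return paragraphs;
};

const sampleWords = [
  { start: 0, end: 0.46, text: "Hello" },
  { start: 0.46, end: 1.02, text: "World." },
  { start: 1.5, end: 2, text: "Bye" },
];
console.log(paragraphsFromPunctuation(sampleWords));
```

Real adapters may split less aggressively, for example by grouping several sentences into one paragraph. This sketch opens a new paragraph after every sentence.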

See these STT adapters for examples of how paragraphs can be generated:

AssemblyAI assemblyai-to-dpe
AWS Transcribe aws-to-dpe
Google STT gcp-to-dpe
IBM Watson STT ibmwatson-to-dpe (module in PR pietrop/digital-paper-edit-electron#52, not extracted as a separate npm/GitHub repo)
~~Speechmatics~~ speechmatics-to-dpe (the module exists but was not extracted as a separate npm/GitHub repo, following the Speechmatics web portal API deprecation notice)

You can write helper functions, such as dpe-add-words-to-paragraphs.js, to interpolate the words back into the paragraphs, or use getWordsForParagraph, which is used in the slate-transcript-editor dpe-to-slate "import" adapter.

Interpolating paragraphs

/**
 * Returns the words that fall within the time range of a paragraph.
 * @param {*} currentParagraph a DPE paragraph object, with start and end attributes, e.g. in seconds
 * @param {*} words a list of word objects with start and end attributes
 * @returns a list of word objects that are included in the given paragraph
 */
const getWordsForParagraph = (currentParagraph, words) => {
  const { start, end } = currentParagraph;
  return words.filter((word) => {
    return word.start >= start && word.end <= end;
  });
};
export default getWordsForParagraph;
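To show how this filter might be applied to a whole transcript, here is a minimal sketch that attaches the matching words to every paragraph. The name addWordsToParagraphs and the added words property are assumptions made for this example and may differ from what dpe-add-words-to-paragraphs.js actually does. The filter is repeated locally so the snippet runs on its own.

```javascript
// Local copy of the filter shown above, so this sketch is self-contained.
const getWordsForParagraph = (currentParagraph, words) => {
  const { start, end } = currentParagraph;
  return words.filter((word) => word.start >= start && word.end <= end);
};

// Hypothetical helper: returns new paragraph objects, each with a `words`
// array holding the words that fall inside its time range.
const addWordsToParagraphs = ({ words, paragraphs }) =>
  paragraphs.map((paragraph) => ({
    ...paragraph,
    words: getWordsForParagraph(paragraph, words),
  }));

const transcript = {
  words: [
    { start: 0, end: 0.46, text: "Hello" },
    { start: 0.46, end: 1.02, text: "World" },
    { start: 3.1, end: 3.5, text: "Hi" },
  ],
  paragraphs: [
    { speaker: "SPEAKER_A", start: 0, end: 3 },
    { speaker: "SPEAKER_B", start: 3, end: 19.2 },
  ],
};

// Log each paragraph as "speaker: text".
addWordsToParagraphs(transcript).forEach((paragraph) => {
  const text = paragraph.words.map((word) => word.text).join(" ");
  console.log(`${paragraph.speaker}: ${text}`);
});
```

Because the filter requires a word to start and end inside the paragraph, a word that straddles a paragraph boundary is not matched by either paragraph. If your STT output can produce such words, matching on word.start alone is one possible alternative.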
