Redesign CLI around the new JSON file format #59
@aaron-meyers re: #12 Since you are currently turning substudy output into ebooks (which is a fantastic idea), I wanted to show you my current draft of the new file format.
Example book (with tags, notes and a paragraph implemented as a nested alignment):
{
"creators": [
"Miguel de Cervantes Saavedra"
],
"title": "El ingenioso hidalgo don Quijote de la Mancha",
"year": 1605,
"tracks": {
"es": {
"type": "html",
"origin": "original",
"lang": "es"
},
"en": {
"type": "html",
"origin": "ai_generated",
"generated_by": "gpt-4",
"derived_from_track_id": "es",
"lang": "en"
},
"notes": {
"type": "notes"
}
},
"tags": [
"classic"
],
"base_track_id": "es",
"alignments": [
{
"id": "2acdeaf4-7b0c-4f78-abf2-dc299ab362e9",
"heading": 1,
"tracks": {
"es": {
"html": "El ingenioso hidalgo don Quijote de la Mancha"
},
"en": {
"html": "The Ingenious Gentleman Don Quijote of La Mancha"
}
}
},
{
"id": "f4b3b3b4-4b3b-4b3b-4b3b-4b3b4b3b4b3b",
"heading": 2,
"tracks": {
"es": {
"html": "Capítulo I. Que trata de la condición y ejercicio del famoso hidalgo don Quijote de la Mancha"
},
"en": {
"html": "Chapter I. Which treats of the condition and exercise of the famous gentleman don Quijote of La Mancha"
}
}
},
{
"alignments": [
{
"id": "f5fb686f-b0ab-486c-9e7d-40c4abd51bc7",
"tracks": {
"es": {
"html": "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor."
},
"en": {
"html": "In a place of La Mancha, whose name I do not wish to recall, not long ago there lived a gentleman of the type with a lance in the rack, an ancient shield, a skinny steed, and a racing greyhound."
},
"notes": {
"html": "<ul><li>\"acordarme\" is a reflexive verb that means \"to remember\"; the reflexive pronoun \"me\" is used to indicate that the action is being done to oneself.</li></ul>"
}
},
"tags": [
"star"
]
},
{
"id": "f24a1744-45f9-4b00-98b1-c7a4c27a5a12",
"tracks": {
"es": {
"html": "Una olla de algo más vaca que carnero, salpicón las más noches, duelos y quebrantos los sábados, lantejas los viernes, algún palomino de añadidura los domingos, consumían las tres partes de su hacienda."
},
"en": {
"html": "A pot of stew more beef than mutton, minced meat most nights, grievous discomforts on Saturdays, lentils on Fridays, and an occasional pigeon as a treat on Sundays, consumed three parts of his estate."
}
}
}
]
}
]
}
Example episode:
{
"series": {
"series_title": "Les aventures de Jean & Luc",
"index_in_series": 1
},
"title": "Episode 01.01",
"tracks": {
"base": {
"origin": "original",
"type": "media",
"lang": "fr",
"file": "files/episode1.mp4"
},
"subs.fr": {
"origin": "ai_generated",
"generated_by": "whisper-1",
"derived_from_track_id": "base",
"type": "html",
"lang": "fr"
},
"subs.en": {
"origin": "ai_generated",
"generated_by": "gpt-3.5-turbo",
"derived_from_track_id": "subs.fr",
"type": "html",
"lang": "en"
}
},
"base_track_id": "base",
"alignments": [
{
"id": "56523fb0-b4c5-40d4-bb08-4c59fb027dbb",
"timeSpan": [
10,
15.5
],
"tracks": {
"subs.fr": {
"html": "<i>Jean & Luc:</i> On y va !"
},
"subs.en": {
"html": "<i>Jean & Luc:</i> Let's go!"
}
}
}
]
}
There's a commented Rust "schema" of the file format here.
Re: line breaks. Preserving line breaks by using
My basic goals here are to:
If a format like this existed, would you be interested in either:
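As a rough illustration of how the draft format hangs together, here is a minimal Python sketch that checks the track references in a document shaped like the examples above. The function name `check_track_references` and the rules it enforces (that `base_track_id`, `derived_from_track_id`, and every track used in an alignment must name a declared track) are assumptions made for illustration; the draft itself does not state them.

```python
def check_track_references(doc):
    """Return a list of human-readable problems found in `doc`.

    Hypothetical helper: the consistency rules checked here are
    assumptions, not part of the draft format.
    """
    problems = []
    declared = set(doc.get("tracks", {}))

    # Assumed rule: the base track must be one of the declared tracks.
    base = doc.get("base_track_id")
    if base not in declared:
        problems.append(f"base_track_id {base!r} is not a declared track")

    # Assumed rule: a derived track must point at a declared source track.
    for track_id, track in doc.get("tracks", {}).items():
        source = track.get("derived_from_track_id")
        if source is not None and source not in declared:
            problems.append(
                f"track {track_id!r} derives from undeclared track {source!r}"
            )

    def walk(alignments):
        # Alignments may nest (e.g. a paragraph holding sentences).
        for alignment in alignments:
            for track_id in alignment.get("tracks", {}):
                if track_id not in declared:
                    problems.append(f"alignment uses undeclared track {track_id!r}")
            walk(alignment.get("alignments", []))

    walk(doc.get("alignments", []))
    return problems


# Trimmed version of the book example above.
book = {
    "tracks": {
        "es": {"type": "html", "origin": "original", "lang": "es"},
        "en": {"type": "html", "origin": "ai_generated",
               "derived_from_track_id": "es", "lang": "en"},
        "notes": {"type": "notes"},
    },
    "base_track_id": "es",
    "alignments": [
        {"heading": 1, "tracks": {"es": {"html": "..."}, "en": {"html": "..."}}},
        {"alignments": [
            {"tracks": {"es": {"html": "..."}, "notes": {"html": "..."}}},
        ]},
    ],
}

problems = check_track_references(book)
```

For the trimmed book above, `problems` comes back empty; a document whose alignments mention a track missing from the top-level `tracks` map would produce one message per offending reference.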
Yes, this is fantastic - I was thinking about this but didn't explicitly mention it. If you looked at my script you probably saw it's currently parsing the HTML output from
Overall, your JSON format has a lot of similar goals to the YAML-based formats I've been working on in my tandoku project. The core file format in my project, the 'content' file, is basically an attempt at producing an aligned-media file format that can be used across a variety of media input and output types. I've specifically considered video+subtitles, ebooks (text, graphic novels, picture books), and video game scripts as sources which could all be imported into a common aligned media content format and then exported to output formats like EPUB or HTML slides.
The script I referenced earlier doesn't actually use my aligned media file format yet - it was a quick-and-dirty direct conversion from your HTML output to an EPUB. I did recently build some tooling to import Anki decks (with an image and native/reference text) into my file format, plus some tools to export that into EPUB or HTML slides. I used it to import a video game script deck from Anki and output it as an offline HTML site that I can use on Steam Deck.
It should be trivial for me to import your JSON format into my format, so I'm looking forward to this when you're ready!
Sounds fantastic! I am definitely interested in collaborating on formats for aligned media. Many years ago, I made a brief attempt to come up with a shared format for aligned media, but it didn't go anywhere. And in retrospect, that format has no way to represent headings or paragraphs, so it turned out to be pretty painful for ebooks.
I am imagining a workflow something like:
But I want
cat episode_1.substudy/metadata.json | \
  jq '.alignments[] | .. | select(.tracks?) | [.time_span, .tracks.subs_es.html, .tracks.subs_en.html, .tracks.image.file]'
And yes, other formats are interesting!
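For readers who would rather not use jq, the query above can be approximated in Python with a recursive walk. This is only a sketch: the function name `iter_rows` and the sample data are invented for illustration, the field names (`time_span`, `subs_es`, `subs_en`, `image`) are taken from the jq query rather than from the earlier example files, and the walk only descends through nested `alignments` lists, whereas jq's `..` visits every value.

```python
def iter_rows(doc, text_tracks=("subs_es", "subs_en"), image_track="image"):
    """Yield [time_span, html..., image_file] for every alignment with tracks.

    Hypothetical helper that roughly mirrors the jq query; missing
    fields become None, much like jq's null.
    """
    def walk(alignment):
        tracks = alignment.get("tracks")
        if tracks:
            yield [
                alignment.get("time_span"),
                *(tracks.get(t, {}).get("html") for t in text_tracks),
                tracks.get(image_track, {}).get("file"),
            ]
        # Simplification: only follow nested "alignments" lists.
        for child in alignment.get("alignments", []):
            yield from walk(child)

    for alignment in doc.get("alignments", []):
        yield from walk(alignment)


# Invented sample data using the field names from the jq query.
episode = {
    "alignments": [
        {
            "time_span": [10, 15.5],
            "tracks": {
                "subs_es": {"html": "¡Vamos!"},
                "subs_en": {"html": "Let's go!"},
                "image": {"file": "files/frame_0001.jpg"},
            },
        },
        {
            "alignments": [
                {"time_span": [16, 18], "tracks": {"subs_es": {"html": "Hola."}}},
            ]
        },
    ]
}

rows = list(iter_rows(episode))
```

Each row lines up with one array emitted by the jq query, so the result could be passed to `csv.writer` or similar to build card data.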
Anyway, I am definitely interested in feedback! Do you think it might be worthwhile setting up a Discord (or something similar) for discussing content file formats?
Sorry for the delay - work has been really busy the past couple weeks and then I've been on a trip. Setting up a Discord sounds like a good idea!
Your proposed workflow is very similar to some of the things I've implemented - take a look at tandoku/scripts for examples.
I tend to adopt terminology from others when discussing a topic, but I should call out that the 'aligned' aspect of the file formats I've been working on is technically optional. The core goal of my content format was to provide a standard way to represent media from a variety of sources so that I could build common tools and workflows rather than a bunch of media-specific ones. I wanted to be able to do things like extract word statistics, keep track of known words and estimate the % of known words in some target media, as well as aligning media and building ebooks or even an app for consuming media with built-in dictionary lookup and word tracking. There are some Japanese-specific things as well, like dealing with kanji and adding readings to unknown kanji. Most of these are just ideas; I've only managed to implement a few specific flows over the last several years. I should make some time to write down my goals somewhere in my repo (most of my notes are currently in a private OneNote notebook).
Anyway, happy to chat - seems like you've been doing quite a bit in this repo recently! I'm not sure how much GPT-4 costs to do OCR, but I've used Azure and Google cloud OCR and they do a very good job with pretty reasonable pricing ($1.50 for 1000 pages) - although I have $150 of Azure credit from Microsoft each month so I haven't actually paid anything 😉
Recently I've actually been using the Panels app on my iPad, which supports Apple's Live Text (on-device OCR) - with a dictionary app as a "slide over" from the side, it works pretty well. I would still like to at least align pages of graphic novels (e.g. interleave Japanese and English pages in a single CBZ file) - technically that doesn't require OCR, but at least some info on panels / text regions could help with automatic alignment. And using OCR to generate my content file format and be able to run word statistics and so on is something I'd still like to do at some point.
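To make the "estimate % of known words" idea concrete against the JSON draft from this thread, here is a small Python sketch. The names `iter_track_text` and `known_word_ratio` are hypothetical, the sample sentence is a shortened line from the book example, and the tokenizer is a plain regex that assumes words are separated by spaces or punctuation, so it would not work for Japanese without a proper morphological analyzer.

```python
import re

TAG = re.compile(r"<[^>]+>")   # crude HTML tag stripper (assumption: simple markup)
WORD = re.compile(r"\w+")      # naive tokenizer; unsuitable for Japanese


def iter_track_text(alignments, track_id):
    """Yield the tag-stripped text of `track_id` from nested alignments."""
    for alignment in alignments:
        html = alignment.get("tracks", {}).get(track_id, {}).get("html")
        if html:
            yield TAG.sub(" ", html)
        yield from iter_track_text(alignment.get("alignments", []), track_id)


def known_word_ratio(doc, track_id, known_words):
    """Return the fraction of word tokens in a track found in `known_words`."""
    words = [
        word
        for text in iter_track_text(doc.get("alignments", []), track_id)
        for word in WORD.findall(text.lower())
    ]
    if not words:
        return 0.0
    return sum(word in known_words for word in words) / len(words)


# Shortened sample based on the book example earlier in the thread.
sample = {
    "alignments": [
        {"alignments": [
            {"tracks": {"es": {"html": "<i>En</i> un lugar de la Mancha"}}},
        ]},
    ]
}

ratio = known_word_ratio(sample, "es", {"en", "un", "de", "la"})
```

Here four of the six tokens are in the known set, so `ratio` is 4/6. Counting tokens rather than distinct words is a design choice; a vocabulary-coverage variant would deduplicate `words` first.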
I have been a little busy with other stuff, but I will be back around to this project in a while with any luck!
We want a new JSON-based file format that can represent subtitles or books, and that contains all versions of a work, plus any metadata needed to support card creation.