Sanskrit text preparation
Here is an outline of the steps for preparing a Sanskrit text for translation on Bilara.
Let's first define some terms:
- properHTML is the HTML that ends up in the Bilara HTML files.
- inlineHTML is any HTML included inside segments, such as `<i>`, `<supplied>`, etc. These tags end up inside the Bilara root files.
- fakeHTML is any HTML that is not included in the final output but merely structures content.
  - Use these custom tags: `<segment>`, `<root>`, `<translation>`, `<reference>`, `<comment>`, `<variant>`.
Note: Keep the git repo clean. As a general rule, the only content that should be committed is the source files and the final product, nothing in between.
- Select a source text.
  - Let’s assume our text is the Candrasūtra.
  - If the text is already on SC, identify it by its project and UID.
    - project = `sf`, UID = `sf276`
  - If it is not on SC, assign a project and UID.
- Add the folder named with the SC UID to the appropriate project in `publication-sources`.
- Copy the source file or files to the folder.
  - Keep the original file name: `sa_candrasUtra.xml`
  - Any kind of content can be added to this folder.
- Make an HTML file from a local copy of the text.
  - Delete all front and end matter, including metadata etc.
  - Ensure the HTML file is well-structured with appropriate heading and `<p>` tags. Occasionally other semantic tags such as lists might be used. Ensure each text is wrapped in `<article id='uid'>`, and each `<h1>` is wrapped in `<header>`.
  - Niceties: add the following where appropriate.
    - wrap in span: `<span class='evam'>evam mayā śrutam</span>`
    - add class to paragraph for remarks at end of sutra, etc: `<p class='end'>śarabha iti sūtraṃ</p>`
    - likewise for verse of homage at start of sutra, etc: `<p class='namo'>namo buddhāya</p>`
  - Make sure all HTML uses `'single quotes'`.
  - Any text-critical markup or plain-text marks must be replaced with inlineHTML.
    - Where meaning of markup is unclear, refer back to original printed edition ideally, else consult old SC versions.
- Create segments.
  - Typically, use punctuation as the basis, then refine it by an initial reading of the text. It is much more efficient to get the segmenting right now than fix it later!
  - Wrap segments in `<segment>`.
    - All properHTML is outside `<segment>`.
  - Make sure all content inside `<segment>` is wrapped in fakeHTML tags as `<root>`, `<translation>`, `<reference>`, `<comment>`, or `<variant>`.
- Follow instructions here for running HTML tidy and eliminating overlapping markup.
Here is an example HTML file for sf276.
<article id='sf276'>
<header>
<h1>
<segment><root>Candrasūtra</root></segment>
</h1>
</header>
<p>
<segment><root>evaṃ mayā śrutam</root></segment>
<segment><root>ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied></root></segment>
</p>
<p>
<segment><root>tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied></root></segment>
</p>
<p>
<segment><root><supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //</root></segment>
</p>
<p>
<segment><root>buddhavīra namas te 'stu vipramuktāya sarvataḥ</root><comment>Ed. bhitā but MS reads bhītā</comment></segment>
<segment><root>saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :</root><comment>Ed. buddha vīra</comment></segment>
</p>
<blockquote class='gatha'>
<p>
<span class='verse-line'><segment><root>arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ</root></segment></span>
<span class='verse-line'><segment><root>rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //</root></segment></span>
</p>
</blockquote>
<p>
<segment><root>bhagavān āha //</root></segment>
</p>
<p>
<segment><root>tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*</root></segment>
<segment><root>rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //</root></segment>
</p>
<p>
<segment><root>atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩</root></segment>
<segment><root>tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied></root></segment>
<segment><root><supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //</root></segment>
</p>
<p>
<segment><root>adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //</root></segment>
</p>
<p>
<segment><root>ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·</root></segment>
<segment><root>saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //</comment></segment>
</p>
<p>
<segment><root><supplied>rāhur avocat* //</supplied></root></segment>
</p>
<p>
<segment><root><supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ</root></segment>
<segment><root>ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*</root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā</comment></segment>
</p>
<p>
<segment><root><supplied>baḍir vairocano 'vocat* /</supplied></root></segment>
<segment><root>x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied></root></segment>
<segment><root><supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce</comment></segment>
</p>
<p>
<segment><root><supplied>candrasūtraṃ samāptam* //</supplied></root></segment>
</p>
</article>
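A file like this can be checked against the document-level conventions listed earlier: single-quoted attributes, one `<article>` carrying the UID, and every `<h1>` inside a `<header>`. The sketch below is one possible way to do that with the Python standard library. The names `StructureChecker` and `check_structure` are hypothetical, and the double-quote test is a simple regular expression rather than a full parse.

```python
import re
from html.parser import HTMLParser


class StructureChecker(HTMLParser):
    """Hypothetical helper: records article ids and <h1> placement."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.article_ids = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_ids.append(dict(attrs).get("id"))
        if tag == "h1" and "header" not in self.stack:
            self.problems.append("<h1> is not wrapped in <header>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Drop everything up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def check_structure(html, uid):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Any tag containing ="..." is treated as using double quotes.
    if re.search(r'<[^<>]*=\s*"', html):
        problems.append("attribute uses double quotes")
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    problems.extend(checker.problems)
    if checker.article_ids != [uid]:
        problems.append(f"expected one <article id='{uid}'>, found {checker.article_ids}")
    return problems


if __name__ == "__main__":
    good = "<article id='sf276'><header><h1>Candrasūtra</h1></header></article>"
    print(check_structure(good, "sf276"))
```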
The next step is to create a TSV file. This will allow the data to be separated into its different types.
For this we use Karl's bilara-html-tsv script, which is currently found here:
https://github.com/sc-voice/bilara-html-tsv
- Get rid of document-level HTML.
- Run bilara-html-tsv; this creates a TSV file.
  - The first row has column headers:
    - the first column header is `segment_id`.
    - the second column header is `html`. This contains the properHTML with `{}` as placeholder for `<segment>` content. If there is no properHTML, still use `{}`.
    - remaining column headers are identical to the names of the fakeHTML custom tags.
That gives us something like:
segment_id html root comment
sf276:0.1 <article id='sf276'><header><h1>{}</h1></header> Candrasūtra
sf276:1.1 <p>{} evaṃ mayā śrutam
sf276:1.2 {}</p> ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>
sf276:2.1 <p>{}</p> tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam*<supplied>/</supplied>
sf276:3.1 <p>{}</p> <supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe // Ed. bhitā but MS reads bhītā
sf276:4.1 <p>{} buddhavīra namas te 'stu vipramuktāya sarvataḥ Ed. buddha vīra
sf276:4.2 {}</p> saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :
sf276:5.1 <blockquote class='gatha'><p><span class='verse-line'>{}</span> arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ
sf276:5.2 <span class='verse-line'>{}</span></p></blockquote> rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //
sf276:6.1 <p>{}</p> bhagavān āha //
sf276:7.1 <p>{} tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*
sf276:7.2 {}</p> rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //
sf276:8.1 <p>{} atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩
sf276:8.2 {} tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>
sf276:8.3 {}</p> <supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //
sf276:9.1 <p>{}</p> adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //
sf276:10.1 <p>{} ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi · Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //
sf276:10.2 {}</p> saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>
sf276:11.1 <p>{}</p> <supplied>rāhur avocat* //</supplied>
sf276:12.1 <p>{} <supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā
sf276:12.2 {}</p> ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*
sf276:13.1 <p>{} <supplied>baḍir vairocano 'vocat* /</supplied>
sf276:13.2 {} x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied> Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce
sf276:13.3 {}</p> <supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>
sf276:14.1 <p>{}</p></article> <supplied>candrasūtraṃ samāptam* //</supplied>
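Before handing the file on, it can help to confirm that it follows the column rules described above. This is a minimal sketch, assuming the file is tab-separated with no quoting; `check_tsv` is a hypothetical helper and not part of bilara-html-tsv.

```python
import csv
import io

# Column names allowed after segment_id and html (the fakeHTML content tags).
FAKE_CONTENT_TAGS = {"root", "translation", "reference", "comment", "variant"}


def check_tsv(tsv_text):
    """Return a list of problems; an empty list means the checks passed."""
    # QUOTE_NONE keeps apostrophes and quotes in the text as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    if header[:2] != ["segment_id", "html"]:
        problems.append("first two columns must be segment_id and html")
    for name in header[2:]:
        if name not in FAKE_CONTENT_TAGS:
            problems.append(f"unknown column {name!r}")

    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) < 2:
            problems.append(f"line {line_no}: missing html column")
            continue
        segment_id, html = row[0], row[1]
        if segment_id in seen:
            problems.append(f"line {line_no}: duplicate id {segment_id}")
        seen.add(segment_id)
        if html.count("{}") != 1:
            problems.append(f"line {line_no}: html needs exactly one {{}} placeholder")
    return problems


if __name__ == "__main__":
    sample = (
        "segment_id\thtml\troot\tcomment\n"
        "sf276:1.1\t<p>{}\tevaṃ mayā śrutam\t\n"
    )
    print(check_tsv(sample))
```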
From here, we hand it over to bilara i/o. If creating a new collection, we need to add the details to a config file.
https://github.com/suttacentral/bilara-data/tree/unpublished/.scripts/bilara-io/config
- Make sure the `tsv` file has a column for each header in the config, else it will throw an error!

Once config is defined, save the file in `.scripts/bilara-io/` and run:
./sheet_import.py sf276.tsv -c
This will separate the content types and place them in the correct folders. Ready to translate!
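To make the separation concrete, here is a rough illustration of the idea in Python. It is not the code of `sheet_import.py`: the function name `split_tsv` is hypothetical, and it only builds one dictionary per column, leaving out the config lookup and the writing of files into folders.

```python
import csv
import io
import json


def split_tsv(tsv_text):
    """Return {column_name: {segment_id: value}} for every non-id column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    columns = [name for name in reader.fieldnames if name != "segment_id"]
    files = {name: {} for name in columns}
    for row in reader:
        for name in columns:
            value = row.get(name)
            if value:  # empty or missing cells produce no entry
                files[name][row["segment_id"]] = value
    return files


if __name__ == "__main__":
    sample = (
        "segment_id\thtml\troot\tcomment\n"
        "sf276:0.1\t<article id='sf276'><header><h1>{}</h1></header>\tCandrasūtra\t\n"
        "sf276:4.1\t<p>{}\tbuddhavīra namas te 'stu vipramuktāya sarvataḥ\tEd. buddha vīra\n"
    )
    for name, content in split_tsv(sample).items():
        print(name)
        print(json.dumps(content, ensure_ascii=False, indent=2))
```

Because empty cells are skipped, the comment output holds only the segments that actually have a comment, while the html output has an entry for every segment.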
/html/sf276.json
{
"sf276:0.1": "<article id='sf276'><header><h1>{}</h1></header>",
"sf276:1.1": "<p>{}",
"sf276:1.2": "{}</p>",
"sf276:2.1": "<p>{}</p>",
"sf276:3.1": "<p>{}</p>",
"sf276:4.1": "<p>{}",
"sf276:4.2": "{}</p>",
"sf276:5.1": "<blockquote class='gatha'><p><span class='verse-line'>{}</span>",
"sf276:5.2": "<span class='verse-line'>{}</span></p></blockquote>",
"sf276:6.1": "<p>{}</p>",
"sf276:6.2": "<p>{}",
"sf276:6.3": "{}</p>",
"sf276:7.1": "<p>{}",
"sf276:7.2": "{}",
"sf276:7.3": "{}</p>",
"sf276:8.1": "<p>{}</p>",
"sf276:9.1": "<p>{}",
"sf276:9.2": "{}</p>",
"sf276:10.1": "<p>{}</p>",
"sf276:10.2": "<p>{}",
"sf276:10.3": "{}</p>",
"sf276:11.1": "<p>{}",
"sf276:11.2": "{}",
"sf276:11.3": "{}</p>",
"sf276:12.1": "<p>{}</p></article>"
}
/root/sf276.json
{
"sf276:0.1": "Candrasūtra",
"sf276:1.1": "evaṃ mayā śrutam",
"sf276:1.2": "ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>",
"sf276:2.1": "tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied>",
"sf276:3.1": "<supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //",
"sf276:4.1": "buddhavīra namas te 'stu vipramuktāya sarvataḥ",
"sf276:4.2": "saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :",
"sf276:5.1": "arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ ",
"sf276:5.2": "rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //",
"sf276:6.1": "bhagavān āha //",
"sf276:6.2": "tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*",
"sf276:6.3": "rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //",
"sf276:7.1": "atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* /",
"sf276:7.2": "tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>",
"sf276:7.3": "<supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //",
"sf276:8.1": "adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //",
"sf276:9.1": "ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·",
"sf276:9.2": "saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>",
"sf276:10.1": "<supplied>rāhur avocat* //</supplied>",
"sf276:10.2": "<supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ",
"sf276:10.3": "ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*",
"sf276:11.1": "<supplied>baḍir vairocano 'vocat* /</supplied>",
"sf276:11.2": "x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied>",
"sf276:11.3": "<supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>",
"sf276:12.1": "<supplied>candrasūtraṃ samāptam* //</supplied>"
}
/comment/sf276.json
{
"sf276:4.1": "Ed. bhitā but MS reads bhītā",
"sf276:4.2": "Ed. buddha vīra",
"sf276:9.2": "Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //",
"sf276:10.3": "Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā",
"sf276:11.3": "Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce"
}
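The relationship between the html and root files can be seen by putting them back together: each `{}` placeholder is filled with the text that has the same segment ID. The sketch below is a hypothetical helper for a quick sanity check of that round trip, not part of bilara i/o. It relies on the JSON key order and does nothing about spacing between adjacent segments.

```python
import json


def render(html_map, text_map):
    """Fill each {} placeholder with the text of the same segment ID."""
    parts = []
    for segment_id, template in html_map.items():
        # A segment with no text in text_map renders as empty.
        parts.append(template.replace("{}", text_map.get(segment_id, "")))
    return "".join(parts)


if __name__ == "__main__":
    html_map = json.loads(
        '{"sf276:0.1": "<article id=\'sf276\'><header><h1>{}</h1></header>",'
        ' "sf276:1.1": "<p>{}</p></article>"}'
    )
    root_map = {"sf276:0.1": "Candrasūtra", "sf276:1.1": "evaṃ mayā śrutam"}
    print(render(html_map, root_map))
```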