Sanskrit text preparation
Here is an outline of the steps for preparing a Sanskrit text for translation on Bilara.
Let's first define some terms:
- properHTML is the HTML that ends up in the Bilara HTML files.
- inlineHTML is any HTML included inside segments, such as <i>, <supplied>, etc.
- fakeHTML is any HTML that is not included in the final output but merely structures content.
  - Use these custom tags: <segment>, <root>, <translation>, <reference>, <comment>, <variant>.
- Select a source text.
  - Let’s assume our text is the Candrasūtra.
  - If the text is already on SC, identify it by its project and UID.
    - project = sf, UID = sf276
  - If it is not on SC, assign a project and UID.
- Add the folder named with the SC UID to the appropriate project in publication-sources.
  - bilara-data/.publication-sources/sf/sf276
- Copy the source file or files to the folder.
  - Keep the original file name: sa_candrasUtra.xml
  - Any kind of content can be added to this folder.
- Make an HTML file from a local copy of the text.
  - Delete all front and end matter, including metadata etc.
  - Ensure the HTML file is well-structured with appropriate heading and <p> tags. Occasionally other semantic tags such as lists might be used. Ensure each text is wrapped in <article id='uid'>, and each <h1> is wrapped in <header>.
  - Make sure all HTML uses 'single quotes'.
  - Any text-critical markup or plain-text marks must be replaced with inlineHTML.
    - Where meaning of markup is unclear, consult old SC versions.
- Create segments.
  - Typically, use punctuation as the basis, then refine it by an initial reading of the text. It is much more efficient to get the segmenting right now than fix it later!
  - Wrap segments in <segment>.
    - All properHTML is outside <segment>.
  - Make sure all content inside <segment> is wrapped in fakeHTML tags as <root>, <translation>, <reference>, <comment>, or <variant>.
What we end up with is the properHTML, most commonly <p>, as parent to <segment>, which is parent to sibling fakeHTML tags.
<p>
<segment>
<root>some root text</root>
<comment>what a nice root text</comment>
</segment>
</p>
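The nesting rule above (properHTML outside <segment>, only fakeHTML tags directly inside it) can be checked mechanically. The following is a minimal sketch, assuming the file is already well-formed enough for Python's standard XML parser; the function name check_segments is hypothetical and not part of bilara i/o.

```python
import xml.etree.ElementTree as ET

# fakeHTML tags allowed as direct children of <segment> (from the list above)
FAKE_TAGS = {"root", "translation", "reference", "comment", "variant"}

def check_segments(html):
    """Return a list of problems found in the <segment> structure."""
    problems = []
    tree = ET.fromstring(html)  # raises ParseError if the markup is not well-formed
    for number, segment in enumerate(tree.iter("segment"), start=1):
        # Text sitting directly inside <segment> is not wrapped in a fakeHTML tag.
        if (segment.text or "").strip():
            problems.append(f"segment {number}: unwrapped text")
        for child in segment:
            # Anything other than a fakeHTML tag does not belong at this level.
            if child.tag not in FAKE_TAGS:
                problems.append(f"segment {number}: unexpected <{child.tag}>")
            # Text between two child tags is also unwrapped.
            if (child.tail or "").strip():
                problems.append(f"segment {number}: unwrapped text")
    return problems

sample = (
    "<p><segment><root>some root text</root>"
    "<comment>what a nice root text</comment></segment></p>"
)
print(check_segments(sample))
```

An empty list means every segment contains only fakeHTML children. inlineHTML such as <supplied> inside <root> is not flagged, since only the direct children of <segment> are inspected.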
Now that our basic HTML is ready, let's finalize it. Follow instructions here for running HTML tidy and eliminating overlapping markup.
This will produce an HTML file something like the following.
<article id='sf276'>
<header>
<h1>
<segment><root>Candrasūtra</root></segment>
</h1>
</header>
<p>
<segment><root>evaṃ mayā śrutam</root></segment>
<segment><root>ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied></root></segment>
</p>
<p>
<segment><root>tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied></root></segment>
</p>
<p>
<segment><root><supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //</root><comment>Ed. bhitā but MS reads bhītā</comment></segment>
</p>
<p>
<segment><root>buddhavīra namas te 'stu vipramuktāya sarvataḥ</root><comment>Ed. buddha vīra</comment></segment>
<segment><root>saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :</root></segment>
</p>
<blockquote class='gatha'>
<p>
<segment><root><span class='verse-line'>arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ</span></root></segment>
<segment><root><span class='verse-line'>rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //</span></root></segment>
</p>
</blockquote>
<p>
<segment><root>bhagavān āha //</root></segment>
</p>
<p>
<segment><root>tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*</root></segment>
<segment><root>rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //</root></segment>
</p>
<p>
<segment><root>atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩</root></segment>
<segment><root>tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied></root></segment>
<segment><root><supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //</root></segment>
</p>
<p>
<segment><root>adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //</root></segment>
</p>
<p>
<segment><root>ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·</root></segment>
<segment><root>saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //</comment></segment>
</p>
<p>
<segment><root><supplied>rāhur avocat* //</supplied></root></segment>
</p>
<p>
<segment><root><supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ</root></segment>
<segment><root>ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*</root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā</comment></segment>
</p>
<p>
<segment><root><supplied>baḍir vairocano 'vocat* /</supplied></root></segment>
<segment><root>x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied></root></segment>
<segment><root><supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce</comment></segment>
</p>
<p>
<segment><root><supplied>candrasūtraṃ samāptam* //</supplied></root></segment>
</p>
</article>
The HTML file must now be split into separate files, each with a single type of data, and each numbered with the same segments. Our utility bilara i/o is designed to do just that. So let's put the file in a form bilara i/o can consume.
- Get rid of document-level HTML.
- Put each segment on a separate line.
  - <segment> tags and all content are wrapped in whatever properHTML there is: <properHTML><segment> … </segment></properHTML>
- Assign segment numbers. See spec for details.
  - (Alternatively, create segment numbers when creating TSV file.)
<article id='sf276'><header><h1><segment id='sf276:0.1'><root>Candrasūtra</root></segment></h1></header>
<p><segment id='sf276:1.1'><root>evaṃ mayā śrutam</root></segment>
<segment id='sf276:1.2'><root>ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied></root></segment></p>
<p><segment id='sf276:2.1'><root>tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam*<supplied>/</supplied></root></segment></p>
<p><segment id='sf276:3.1'><root><supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //</root><comment>Ed. bhitā but MS reads bhītā</comment></segment></p>
<p><segment id='sf276:4.1'><root>buddhavīra namas te 'stu vipramuktāya sarvataḥ</root><comment>Ed. buddha vīra</comment></segment>
<segment id='sf276:4.2'><root>saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :</root></segment></p>
<blockquote class='gatha'><p><segment id='sf276:5.1'><root><span class='verse-line'>arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ</span></root></segment>
<segment id='sf276:5.2'><root><span class='verse-line'>rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //</span></root></segment></p></blockquote>
<p><segment id='sf276:6.1'><root>bhagavān āha //</root></segment></p>
<p><segment id='sf276:7.1'><root>tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*</root></segment>
<segment id='sf276:7.2'><root>rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //</root></segment></p>
<p><segment id='sf276:8.1'><root>atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩</root></segment>
<segment id='sf276:8.2'><root>tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied></root></segment>
<segment id='sf276:8.3'><root><supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //</root></segment></p>
<p><segment id='sf276:9.1'><root>adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //</root></segment></p>
<p><segment id='sf276:10.1'><root>ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·</root></segment>
<segment id='sf276:10.2'><root>saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //</comment></segment></p>
<p><segment id='sf276:11.1'><root><supplied>rāhur avocat* //</supplied></root></segment></p>
<p><segment id='sf276:12.1'><root><supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ</root></segment>
<segment id='sf276:12.2'><root>ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*</root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā</comment></segment></p>
<p><segment id='sf276:13.1'><root><supplied>baḍir vairocano 'vocat* /</supplied></root></segment>
<segment id='sf276:13.2'><root>x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied></root></segment>
<segment id='sf276:13.3'><root><supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce</comment></segment></p>
<p><segment id='sf276:14.1'><root><supplied>candrasūtraṃ samāptam* //</supplied></root></segment></p></article>
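The numbering visible in the example above (heading segments as 0.n, then one block number per paragraph and a running number per segment within it) could be generated along the following lines. This is only a sketch of the pattern seen in the example, not of the spec itself. The function name number_segments is hypothetical, and the code assumes well-formed input in which no <segment> has an id yet.

```python
import xml.etree.ElementTree as ET

def number_segments(html, uid):
    """Add id='uid:block.segment' to every <segment> in an <article>."""
    article = ET.fromstring(html)

    # Segments in the <header> are numbered 0.1, 0.2, ...
    header = article.find("header")
    if header is not None:
        for index, segment in enumerate(header.iter("segment"), start=1):
            segment.set("id", f"{uid}:0.{index}")

    # Every other element that directly contains segments counts as one block.
    block = 0
    for parent in article.iter():
        segments = [s for s in parent.findall("segment") if "id" not in s.attrib]
        if not segments:
            continue
        block += 1
        for index, segment in enumerate(segments, start=1):
            segment.set("id", f"{uid}:{block}.{index}")
    return article

sample = (
    "<article id='sf276'><header><h1><segment><root>Candrasūtra</root></segment></h1></header>"
    "<p><segment><root>evaṃ mayā śrutam</root></segment>"
    "<segment><root>bhagavān āha //</root></segment></p>"
    "<p><segment><root>rāhur avocat</root></segment></p></article>"
)
article = number_segments(sample, "sf276")
ids = [segment.get("id") for segment in article.iter("segment")]
print(ids)
```

Note that ElementTree writes attributes with double quotes when serialising, so output produced this way would still need converting back to single quotes.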
Now let's make a TSV file!
- Put each data type in a separate tab-separated column.
- The first row has column headers.
  - The first column header is segment_id.
  - The second column header is html. This contains the properHTML with {} as placeholder for <segment> content. If there is no properHTML, still use {}.
  - Remaining column headers are identical to the names of the fakeHTML custom tags.
- Remove unneeded fakeHTML <segment>, <comment> etc. (because now data types are defined per column).
That gives us something like:
segment_id html root comment
sf276:0.1 <article id='sf276'><header><h1>{}</h1></header> Candrasūtra
sf276:1.1 <p>{} evaṃ mayā śrutam
sf276:1.2 {}</p> ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>
sf276:2.1 <p>{}</p> tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam*<supplied>/</supplied>
sf276:3.1 <p>{}</p> <supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe // Ed. bhitā but MS reads bhītā
sf276:4.1 <p>{} buddhavīra namas te 'stu vipramuktāya sarvataḥ Ed. buddha vīra
sf276:4.2 {}</p> saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :
sf276:5.1 <blockquote class='gatha'><p><span class='verse-line'>{}</span> arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ
sf276:5.2 <span class='verse-line'>{}</span></p></blockquote> rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //
sf276:6.1 <p>{}</p> bhagavān āha //
sf276:7.1 <p>{} tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*
sf276:7.2 {}</p> rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //
sf276:8.1 <p>{} atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩
sf276:8.2 {} tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>
sf276:8.3 {}</p> <supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //
sf276:9.1 <p>{}</p> adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //
sf276:10.1 <p>{} ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi · Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //
sf276:10.2 {}</p> saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>
sf276:11.1 <p>{}</p> <supplied>rāhur avocat* //</supplied>
sf276:12.1 <p>{} <supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā
sf276:12.2 {}</p> ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*
sf276:13.1 <p>{} <supplied>baḍir vairocano 'vocat* /</supplied>
sf276:13.2 {} x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied> Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce
sf276:13.3 {}</p> <supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>
sf276:14.1 <p>{}</p></article> <supplied>candrasūtraṃ samāptam* //</supplied>
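As an illustration of how one line of the numbered HTML maps to one TSV row, here is a minimal sketch using regular expressions. It assumes one <segment> per line with a single-quoted id, as in the example above. The names line_to_row and row_to_tsv are hypothetical and are not bilara i/o functions.

```python
import re

# Assumed patterns for one line of the numbered HTML shown above.
SEGMENT_RE = re.compile(r"<segment id='(?P<id>[^']+)'>(?P<body>.*?)</segment>")
FIELD_RE = re.compile(r"<(root|translation|reference|comment|variant)>(.*?)</\1>")

COLUMNS = ["segment_id", "html", "root", "comment"]

def line_to_row(line):
    """Split one numbered line into a dict keyed by TSV column name."""
    match = SEGMENT_RE.search(line)
    if match is None:
        raise ValueError(f"no <segment> with an id in: {line!r}")
    # Everything around the segment is properHTML; the segment itself becomes {}.
    html = line[:match.start()] + "{}" + line[match.end():]
    row = {"segment_id": match.group("id"), "html": html}
    # Each fakeHTML tag inside the segment supplies one column.
    for tag, text in FIELD_RE.findall(match.group("body")):
        row[tag] = text
    return row

def row_to_tsv(row, columns=COLUMNS):
    """Join the chosen columns with tabs, leaving missing cells empty."""
    return "\t".join(row.get(column, "") for column in columns)

line = "<p><segment id='sf276:1.1'><root>evaṃ mayā śrutam</root></segment>"
print(row_to_tsv(line_to_row(line)))
```

This sketch leaves markup such as <span class='verse-line'> inside the root text. In the TSV above that span sits in the html column, so verse lines would need an extra step.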
From here, we should be able to hand it over to bilara i/o. Save the file in .scripts and run:
./sheet_import.py sf276.tsv
Good to go.
Doing it the boring way, hopefully not needed!
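We can separate each data type into its own JSON file, keyed by segment ID. Note that the segment numbers in these files do not match the numbering used in the TSV above.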
/html/sf276.json
{
"sf276:0.1": "<article id='sf276'><header><h1>{}</h1></header>",
"sf276:1.1": "<p>{}",
"sf276:1.2": "{}</p>",
"sf276:2.1": "<p>{}</p>",
"sf276:3.1": "<p>{}</p>",
"sf276:4.1": "<p>{}",
"sf276:4.2": "{}</p>",
"sf276:5.1": "<blockquote class='gatha'><p><span class='verse-line'>{}</span>",
"sf276:5.2": "<span class='verse-line'>{}</span></p></blockquote>",
"sf276:6.1": "<p>{}</p>",
"sf276:6.2": "<p>{}",
"sf276:6.3": "{}</p>",
"sf276:7.1": "<p>{}",
"sf276:7.2": "{}",
"sf276:7.3": "{}</p>",
"sf276:8.1": "<p>{}</p>",
"sf276:9.1": "<p>{}",
"sf276:9.2": "{}</p>",
"sf276:10.1": "<p>{}</p>",
"sf276:10.2": "<p>{}",
"sf276:10.3": "{}</p>",
"sf276:11.1": "<p>{}",
"sf276:11.2": "{}",
"sf276:11.3": "{}</p>",
"sf276:12.1": "<p>{}</p></article>"
}
/root/sf276.json
{
"sf276:0.1": "Candrasūtra",
"sf276:1.1": "evaṃ mayā śrutam",
"sf276:1.2": "ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>",
"sf276:2.1": "tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied>",
"sf276:3.1": "<supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //",
"sf276:4.1": "buddhavīra namas te 'stu vipramuktāya sarvataḥ",
"sf276:4.2": "saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :",
"sf276:5.1": "arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ ",
"sf276:5.2": "rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //",
"sf276:6.1": "bhagavān āha //",
"sf276:6.2": "tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*",
"sf276:6.3": "rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //",
"sf276:7.1": "atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* /",
"sf276:7.2": "tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>",
"sf276:7.3": "<supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //",
"sf276:8.1": "adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //",
"sf276:9.1": "ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·",
"sf276:9.2": "saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>",
"sf276:10.1": "<supplied>rāhur avocat* //</supplied>",
"sf276:10.2": "<supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ",
"sf276:10.3": "ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*",
"sf276:11.1": "<supplied>baḍir vairocano 'vocat* /</supplied>",
"sf276:11.2": "x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied>",
"sf276:11.3": "<supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>",
"sf276:12.1": "<supplied>candrasūtraṃ samāptam* //</supplied>"
}
/comment/sf276.json
{
"sf276:3.1": "Ed. bhitā but MS reads bhītā",
"sf276:4.1": "Ed. buddha vīra",
"sf276:9.2": "Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //",
"sf276:10.3": "Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā",
"sf276:11.3": "Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce"
}
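The manual separation shown above amounts to turning each TSV column into its own mapping from segment ID to value. A minimal sketch of that step follows, using only the Python standard library. It is not the actual bilara i/o implementation, and the function name split_tsv is hypothetical.

```python
import csv
import io
import json

def split_tsv(tsv_text):
    """Return {column: {segment_id: value}} for every non-empty cell."""
    # QUOTE_NONE keeps quote characters in the text from being treated as CSV quoting.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    data = {}
    for row in reader:
        segment_id = row.pop("segment_id")
        for column, value in row.items():
            # Skip empty cells so each file only lists segments that have data.
            if column and value:
                data.setdefault(column, {})[segment_id] = value
    return data

tsv = (
    "segment_id\thtml\troot\tcomment\n"
    "sf276:1.1\t<p>{}\tevaṃ mayā śrutam\t\n"
    "sf276:4.1\t<p>{}\tbuddhavīra namas te 'stu vipramuktāya sarvataḥ\tEd. buddha vīra\n"
)
data = split_tsv(tsv)
# Each entry of `data` would be written to its own file, e.g. /comment/sf276.json.
print(json.dumps(data["comment"], ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False keeps the diacritics readable in the JSON output instead of escaping them.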