
BuddhaNexus/segmented-sanskrit


segmented-sanskrit

Segmented files for Sanskrit. These are the input files for the BuddhaNexus neural network.

Sources:

GRETIL (Göttingen Register of Electronic Texts in Indian Languages): http://gretil.sub.uni-goettingen.de/gretil.html

DSBC (Digital Sanskrit Buddhist Canon): http://www.dsbcproject.org/

Because of the large volume of material, some texts from the GRETIL database have been omitted (cumulative pāda indexes and duplicate texts from the same source). Moreover, BuddhaNexus has made no attempt to improve the quality of the texts (e.g. by removing typos, introducing identical conventions, and the like). Some minor changes have nonetheless been made for the sake of standardization.

For the calculation of the Sanskrit matches, a stemming algorithm has been used. This stemming algorithm is accessible as a standalone application.

NOTE: These files contain a lot of errors and need to be cleaned.

NOTE: Files in segmented_files are those currently used in BuddhaNexus. These are converted from HTML.

Further sources come from SuttaCentral.net (sf and uvs files).

The checked_jsons are manually checked GRETIL files. They use verse numbers as segment numbers and are therefore more accurate than the machine-created files.
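As a rough illustration of what using verse numbers as segment numbers can look like, here is a minimal Python sketch. The function name verse_segments, the "<text_name>:<verse_number>" key layout, and the placeholder input are all assumptions made for this example; they are not taken from the repository and may not match the actual layout of the checked_jsons files.

```python
import json


def verse_segments(text_name, verses):
    """Key each verse by '<text_name>:<verse_number>'.

    Hypothetical helper: the key layout is an assumption for illustration,
    not the documented format of the checked_jsons files.
    """
    segments = {}
    for verse_number, verse_text in verses:
        segment_id = f"{text_name}:{verse_number}"
        # A verse number must identify exactly one segment.
        if segment_id in segments:
            raise ValueError(f"duplicate verse number: {verse_number}")
        segments[segment_id] = verse_text.strip()
    return segments


if __name__ == "__main__":
    # Placeholder strings stand in for real verse text.
    example = verse_segments(
        "example_text",
        [("1.1", " first verse "), ("1.2", "second verse")],
    )
    print(json.dumps(example, ensure_ascii=False, indent=2))
```

Because the segment identifier comes straight from the verse number, a match reported for a segment can be traced back to a citable verse, which is not guaranteed when segment boundaries are produced automatically.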
