segmented-sanskrit

Segmented files for Sanskrit. These are the input files for the BuddhaNexus neural network.

Sources:

GRETIL (Göttingen Register of Electronic Texts in Indian Languages): http://gretil.sub.uni-goettingen.de/gretil.html

DSBC (Digital Sanskrit Buddhist Canon): http://www.dsbcproject.org/

Because of the large amount of material, some texts from the GRETIL database have been omitted (cumulative pāda indexes and duplicate texts from the same source). BuddhaNexus has made no attempt to improve the quality of the texts (e.g. by removing typos or harmonizing conventions). Some minor changes have nonetheless been made for the sake of standardization.

The Sanskrit matches were calculated using a stemming algorithm, which is also available as a standalone application.

NOTE: These files contain a lot of errors and need to be cleaned.

NOTE: The files in segmented_files are those currently used in BuddhaNexus. They were converted from HTML.

Further sources come from SuttaCentral.net (the sf and uvs files).

The checkedjson folder contains manually checked GRETIL files. They use verse numbers as segment numbers and are therefore more accurate than the machine-created files.
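
Since the files are known to contain errors, a quick sanity check before use can help. The following is a minimal Python sketch for reading one of the manually checked JSON files and listing segments with no text. The file layout is an assumption, not something this README confirms: the sketch assumes each file is a single JSON object mapping segment numbers to segment text. The function names `load_segments` and `find_empty_segments` are hypothetical helpers introduced here for illustration.

```python
import json
from pathlib import Path


def load_segments(path):
    """Read one JSON file and return a list of (segment_id, text) pairs.

    ASSUMPTION (hypothetical layout, not confirmed by this README):
    the file holds a single JSON object that maps segment numbers
    to segment text.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError(
            f"expected a JSON object in {path}, got {type(data).__name__}"
        )
    # Keep the order in which the segments appear in the file.
    return [(str(key), str(value)) for key, value in data.items()]


def find_empty_segments(segments):
    """Return the ids of segments whose text is empty or only whitespace."""
    return [seg_id for seg_id, text in segments if not text.strip()]
```

If the actual files use a different structure (for example a list of objects), only `load_segments` would need to change; the check itself works on plain pairs.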
