Segmented files for Sanskrit. These are the input files for the BuddhaNexus neural network.
The texts come from two main sources:
GRETIL (Göttingen Register of Electronic Texts in Indian Languages): http://gretil.sub.uni-goettingen.de/gretil.html
DSBC (Digital Sanskrit Buddhist Canon): http://www.dsbcproject.org/
Due to the huge amount of material, some texts from the GRETIL database have been omitted (cumulative pāda indexes and duplicate texts from the same source). Moreover, BuddhaNexus has made no attempt to improve the quality of the texts (e.g. by removing typos or introducing uniform conventions). Some minor changes have, nonetheless, been made for the sake of standardization.
For the calculation of the Sanskrit matches, a stemming algorithm has been used. This stemming algorithm is accessible as a standalone application.
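To illustrate why stemming helps when matching inflected Sanskrit, here is a minimal toy sketch. It is not the BuddhaNexus stemmer: the ending list `TOY_ENDINGS` and the functions `toy_stem` and `shared_stems` are hypothetical names invented for this example, and a real stemmer handles far more morphology and sandhi.

```python
import unicodedata

# Toy endings, checked in this order; illustrative only and an assumption,
# NOT the ending list of the actual BuddhaNexus stemming algorithm.
TOY_ENDINGS = tuple(unicodedata.normalize("NFC", e) for e in ("sya", "ḥ", "m"))


def toy_stem(word):
    """Strip one toy ending so that inflected forms compare equal."""
    word = unicodedata.normalize("NFC", word.lower())
    for ending in TOY_ENDINGS:
        # Keep at least three characters so short words are left alone.
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


def shared_stems(a, b):
    """Return the set of toy stems that two whitespace-separated strings share."""
    return {toy_stem(w) for w in a.split()} & {toy_stem(w) for w in b.split()}
```

With this sketch, `dharmaḥ`, `dharmam` and `dharmasya` all reduce to the same string, so two passages using different case forms of one word can still be counted as sharing it.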
NOTE: These files contain a lot of errors and need to be cleaned.
NOTE: Files in segmented_files are those currently used in BuddhaNexus. These are converted from HTML.
Further sources come from SuttaCentral.net (the sf and uvs files).
The files in checked_jsons are manually checked GRETIL files. They use verse numbers as segment numbers and are therefore more accurate than the machine-created files.
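As a rough sketch of how verse-numbered segments could be read, the snippet below parses a small JSON sample. The layout shown (a JSON object mapping `<text name>:<verse number>` to the verse text) is an assumption made for illustration, as are the names `SAMPLE` and `load_segments`; check the actual files before relying on it.

```python
import json

# Hypothetical layout (an assumption, not the documented format):
# a JSON object mapping "<text name>:<verse number>" to the verse text.
SAMPLE = '{"example_text:1.1": "first verse", "example_text:1.2": "second verse"}'


def load_segments(raw):
    """Return a list of (text_name, verse_number, text) tuples in file order."""
    segments = []
    for segment_id, text in json.loads(raw).items():
        # Split on the last colon so text names containing colons still work.
        name, verse = segment_id.rsplit(":", 1)
        segments.append((name, verse, text))
    return segments


segments = load_segments(SAMPLE)
```

Keeping the verse number in the segment identifier means a match can be cited by verse directly, instead of by an arbitrary running number.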