GitHub - sids/nerchuko: Machine Learning with Clojure.

Nerchuko is a library of Machine Learning algorithms written in Clojure. Nerchuko presently focuses on Machine Learning for textual data.

Apart from the core Machine Learning algorithms, Nerchuko includes several helper functions that are useful when working with those algorithms. For example, there are helper functions for preparing datasets, feature selection, cross-validation, etc.

Getting Started

Please note that Nerchuko is under active development. There may be bugs and the API may change without notice.

The API documentation can be found here: http://sids.github.com/nerchuko.

Use with leiningen or maven

Nerchuko is hosted on Clojars. You can find the instructions for adding it as a dependency to your projects here: http://clojars.org/nerchuko.

Use with ant, etc.

Simply add the Nerchuko jar along with the jars of all the dependencies to your classpath and you are good to go. See below for instructions on building the Nerchuko jar. Nerchuko's dependencies are:

Building from source

If you have git installed on your system, use the following command to get the Nerchuko source code:

git clone git://github.com/sids/nerchuko.git

Otherwise, you can download the source code from here: http://github.com/sids/nerchuko/tarball/master.

You will need lein installed to build Nerchuko from the source. Build the Nerchuko jar using the following command:

cd nerchuko
lein jar

Classification

Nerchuko's classification capabilities can be accessed through nerchuko.classification. Documentation for the namespace provides a simple example of how to use it. For a more elaborate example, look at the 20 Newsgroups example.

The nerchuko.classification namespace also includes other functions that might be useful when dealing with classification tasks: n-fold cross-validation, and producing, manipulating and printing confusion matrices. More helper functions can be found in the namespaces nerchuko.helpers.
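To make the evaluation ideas above concrete, here is a minimal sketch of n-fold splitting and a confusion matrix in plain Clojure. It deliberately does not call Nerchuko: the function names `n-folds`, `confusion-matrix` and `accuracy` are hypothetical and are not taken from the nerchuko.classification API, so consult the API documentation for the real functions and their signatures.

```clojure
;; Illustrative only: these are NOT Nerchuko functions.

(defn n-folds
  "Splits coll into n folds by index (item i goes to fold i mod n).
  Returns one {:train [...] :test [...]} map per fold."
  [n coll]
  (let [indexed (map-indexed vector coll)]
    (for [k (range n)]
      {:test  (mapv second (filter (fn [[i _]] (= k (mod i n))) indexed))
       :train (mapv second (remove (fn [[i _]] (= k (mod i n))) indexed))})))

(defn confusion-matrix
  "Counts [actual predicted] pairs into a nested map
  {actual-class {predicted-class count}}."
  [pairs]
  (reduce (fn [m [actual predicted]]
            (update-in m [actual predicted] (fnil inc 0)))
          {}
          pairs))

(defn accuracy
  "Fraction of pairs on the diagonal of the confusion matrix."
  [matrix]
  (let [total   (reduce + (mapcat vals (vals matrix)))
        correct (reduce + (map (fn [[actual row]] (get row actual 0)) matrix))]
    (if (zero? total) 0 (/ correct total))))
```

In a cross-validation run, a classifier would be trained on each `:train` portion and its predictions on the matching `:test` portion would be collected as `[actual predicted]` pairs before building the matrix.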

When working on text classification, functions in the nerchuko.text.helpers namespace might be useful.
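As a rough illustration of the kind of preparation text classification usually needs, the sketch below turns a string into a bag of words (a map from token to count). The function `bag-of-words` and its tokenization rule are assumptions made for this example; they are not taken from nerchuko.text.helpers, whose actual functions may tokenize differently.

```clojure
(require '[clojure.string :as str])

;; Illustrative only: NOT a Nerchuko function.
(defn bag-of-words
  "Lower-cases text, extracts runs of ASCII letters, digits and apostrophes
  as tokens, and returns a map of token -> number of occurrences."
  [text]
  (frequencies (re-seq #"[a-z0-9']+" (str/lower-case text))))
```

For instance, `(bag-of-words "Free money, FREE offers")` yields a map in which "free" has count 2 and "money" and "offers" have count 1.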

Nerchuko includes implementations for the following classifiers:

Feature Selection

Nerchuko's feature selection capabilities can be accessed through nerchuko.feature-selection. Documentation for the namespace provides a simple example of how to use it. For a more elaborate example, look at the 20 Newsgroups example.
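For readers new to the idea, feature selection means keeping only the most informative features before training. The sketch below shows one classical technique, selection by document frequency, in plain Clojure. It is a generic illustration under stated assumptions: documents are represented as sequences of tokens, the names `document-frequencies`, `top-k-features` and `select-features` are hypothetical, and nothing here is claimed to match how nerchuko.feature-selection works or which techniques it implements.

```clojure
;; Illustrative only: NOT Nerchuko functions.

(defn document-frequencies
  "Maps each feature to the number of documents that contain it."
  [docs]
  (frequencies (mapcat distinct docs)))

(defn top-k-features
  "Returns the set of the k features that occur in the most documents.
  Ties are broken alphabetically so the result is deterministic."
  [k docs]
  (->> (document-frequencies docs)
       (sort-by (fn [[feature df]] [(- df) (str feature)]))
       (take k)
       (map first)
       set))

(defn select-features
  "Keeps only the tokens of doc that are in the selected feature set."
  [features doc]
  (filterv features doc))
```

A typical use is to compute the feature set once on the training documents and then apply `select-features` to both training and test documents, so the classifier sees the same vocabulary in both.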

Nerchuko includes implementations for the following feature selection techniques:

Examples

Look in the examples/ directory for some examples demonstrating the usage of Nerchuko. These examples use Nerchuko to work with some standard machine learning datasets. This is currently the best way to learn to use Nerchuko.

You can run the examples using the command

lein run-example

This will print out a short help with instructions on running specific examples.

20 Newsgroups Data Set

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

This is a simple example that demonstrates using Nerchuko for text classification/categorization.

Download the data set from the above link and then run this using the command:

lein run-example newsgroups

Spambase Data Set

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

Although this might seem like another example of text classification, the text has been preprocessed and we are presented with a numeric data set.

Download the data set from the above link and then run this using the command:

lein run-example spambase

License

Distributed under the Apache License Version 2.0. See the file LICENSE.
