diff --git a/getting_help/index.html b/getting_help/index.html
index 38b76b7..4be9207 100644
--- a/getting_help/index.html
+++ b/getting_help/index.html
@@ -702,8 +702,8 @@

Forums offering helpBug reports#

IRC channel for discussing Weka#

diff --git a/search/search_index.json b/search/search_index.json
index 1b46aba..a41e677 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"New to Weka? # Have a look at the Frequently Asked Questions (FAQ), the Troubleshooting article or search the mailing list archives . Don't forget to check out the documentation and the online courses . You have questions regarding Weka? # You can post questions to the Weka mailing list . Please keep in mind that you cannot expect an immediate answer to your question(s). The questions are mainly answered by volunteers, Weka users just like you. You are looking for packages? # With Weka 3.7.2 and later, you can easily install packages through Weka's package manager interface, either official ones or unofficial ones. Have a look at the Packages article for more information on this topic. You want to contribute to the wiki? # The wiki is based on Markdown articles, which are turned into static HTML using MkDocs (see here for details on writing articles). The content of the wiki is available as repository on GitHub . Feel free to add/update and then do a pull request . You found a bug? # Please post the bug report to the Weka mailing list . The following information will help tracking things down: version of Weka (e.g., 3.9.6) operating system (e.g., Windows 11 or Ubuntu 20.04 64bit) Java version (e.g., 11.0.11+9) You can also run the following command in the SimpleCLI and attach the generated output as a text file to your post: java weka.core.SystemInfo","title":"Home"},{"location":"#new-to-weka","text":"Have a look at the Frequently Asked Questions (FAQ), the Troubleshooting article or search the mailing list archives . 
Don't forget to check out the documentation and the online courses .","title":"New to Weka?"},{"location":"#you-have-questions-regarding-weka","text":"You can post questions to the Weka mailing list . Please keep in mind that you cannot expect an immediate answer to your question(s). The questions are mainly answered by volunteers, Weka users just like you.","title":"You have questions regarding Weka?"},{"location":"#you-are-looking-for-packages","text":"With Weka 3.7.2 and later, you can easily install packages through Weka's package manager interface, either official ones or unofficial ones. Have a look at the Packages article for more information on this topic.","title":"You are looking for packages?"},{"location":"#you-want-to-contribute-to-the-wiki","text":"The wiki is based on Markdown articles, which are turned into static HTML using MkDocs (see here for details on writing articles). The content of the wiki is available as repository on GitHub . Feel free to add/update and then do a pull request .","title":"You want to contribute to the wiki?"},{"location":"#you-found-a-bug","text":"Please post the bug report to the Weka mailing list . The following information will help tracking things down: version of Weka (e.g., 3.9.6) operating system (e.g., Windows 11 or Ubuntu 20.04 64bit) Java version (e.g., 11.0.11+9) You can also run the following command in the SimpleCLI and attach the generated output as a text file to your post: java weka.core.SystemInfo","title":"You found a bug?"},{"location":"add_weights_to_dataset/","text":"The following examples show how to add weights to normal datasets and save them in the new XRFF data format. A version of Weka later than 3.5.3 (or the code from Git ) is necessary for this code to work. 
Add arbitrary weights # import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.File ; /** * Loads file \"args[0]\", sets class if necessary (in that case the last * attribute), adds some test weights and saves it as XRFF file * under \"args[1]\". E.g.:
* AddWeights anneal.arff anneal.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class AddWeights { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // set weights double factor = 0.5 / ( double ) data . numInstances (); for ( int i = 0 ; i < data . numInstances (); i ++ ) { data . instance ( i ). setWeight ( 0.5 + factor * i ); } // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 1 ] )); saver . setInstances ( data ); saver . writeBatch (); } } Add weights stored in an external file # import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.BufferedReader ; import java.io.File ; import java.io.FileReader ; /** * Loads file \"args[0]\" (can be ARFF, CSV, C4.5, etc.), sets class if necessary * (in that case the last attribute), adds weights from \"args[1]\" (one weight * per line) and saves it as XRFF file under \"args[2]\". E.g.:
* AddWeightsFromFile anneal.arff weights.txt anneal.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class AddWeightsFromFile { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // read and set weights BufferedReader reader = new BufferedReader ( new FileReader ( args [ 1 ] )); for ( int i = 0 ; i < data . numInstances (); i ++ ) { String line = reader . readLine (); double weight = Double . parseDouble ( line ); data . instance ( i ). setWeight ( weight ); } reader . close (); // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 2 ] )); saver . setInstances ( data ); saver . writeBatch (); } } Add weights stored in the attribute # import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.File ; /** * Loads file \"args[0]\", Adds weight given in attribute with * index \"args[1]\" - 1, deletes this attribute. * sets class if necessary (in that case the last * attribute) and saves it as XRFF file * under \"args[2]\". E.g.:
* AddWeightsFromAtt file.arff 2 file.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) * @author gabi (gs23 at waikato dot ac dot nz) */ public class AddWeightsFromAtt { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); // get weight index int wIndex = Integer . parseInt ( args [ 1 ] ) - 1 ; // set weights for ( int i = 0 ; i < data . numInstances (); i ++ ) { double weight = data . instance ( i ). value ( wIndex ); data . instance ( i ). setWeight ( weight ); } // delete weight attribute and set class index data . deleteAttributeAt ( wIndex ); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 2 ] )); saver . setInstances ( data ); saver . writeBatch (); } } Download # AddWeights.java AddWeightsFromFile.java AddWeightsFromAtt.java See also # git The unofficial Weka package dataset-weights allows you to modify attribute/instance weights using filters - no coding required","title":"Add weights to dataset"},{"location":"add_weights_to_dataset/#add-arbitrary-weights","text":"import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.File ; /** * Loads file \"args[0]\", sets class if necessary (in that case the last * attribute), adds some test weights and saves it as XRFF file * under \"args[1]\". E.g.:
* AddWeights anneal.arff anneal.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class AddWeights { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // set weights double factor = 0.5 / ( double ) data . numInstances (); for ( int i = 0 ; i < data . numInstances (); i ++ ) { data . instance ( i ). setWeight ( 0.5 + factor * i ); } // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 1 ] )); saver . setInstances ( data ); saver . writeBatch (); } }","title":"Add arbitrary weights"},{"location":"add_weights_to_dataset/#add-weights-stored-in-an-external-file","text":"import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.BufferedReader ; import java.io.File ; import java.io.FileReader ; /** * Loads file \"args[0]\" (can be ARFF, CSV, C4.5, etc.), sets class if necessary * (in that case the last attribute), adds weights from \"args[1]\" (one weight * per line) and saves it as XRFF file under \"args[2]\". E.g.:
* AddWeightsFromFile anneal.arff weights.txt anneal.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class AddWeightsFromFile { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // read and set weights BufferedReader reader = new BufferedReader ( new FileReader ( args [ 1 ] )); for ( int i = 0 ; i < data . numInstances (); i ++ ) { String line = reader . readLine (); double weight = Double . parseDouble ( line ); data . instance ( i ). setWeight ( weight ); } reader . close (); // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 2 ] )); saver . setInstances ( data ); saver . writeBatch (); } }","title":"Add weights stored in an external file"},{"location":"add_weights_to_dataset/#add-weights-stored-in-the-attribute","text":"import weka.core.converters.ConverterUtils.DataSource ; import weka.core.converters.XRFFSaver ; import weka.core.Instances ; import java.io.File ; /** * Loads file \"args[0]\", Adds weight given in attribute with * index \"args[1]\" - 1, deletes this attribute. * sets class if necessary (in that case the last * attribute) and saves it as XRFF file * under \"args[2]\". E.g.:
* AddWeightsFromAtt file.arff 2 file.xrff.gz * * @author FracPete (fracpete at waikato dot ac dot nz) * @author gabi (gs23 at waikato dot ac dot nz) */ public class AddWeightsFromAtt { public static void main ( String [] args ) throws Exception { // load data DataSource source = new DataSource ( args [ 0 ] ); Instances data = source . getDataSet (); // get weight index int wIndex = Integer . parseInt ( args [ 1 ] ) - 1 ; // set weights for ( int i = 0 ; i < data . numInstances (); i ++ ) { double weight = data . instance ( i ). value ( wIndex ); data . instance ( i ). setWeight ( weight ); } // delete weight attribute and set class index data . deleteAttributeAt ( wIndex ); if ( data . classIndex () == - 1 ) data . setClassIndex ( data . numAttributes () - 1 ); // save data XRFFSaver saver = new XRFFSaver (); saver . setFile ( new File ( args [ 2 ] )); saver . setInstances ( data ); saver . writeBatch (); } }","title":"Add weights stored in the attribute"},{"location":"add_weights_to_dataset/#download","text":"AddWeights.java AddWeightsFromFile.java AddWeightsFromAtt.java","title":"Download"},{"location":"add_weights_to_dataset/#see-also","text":"git The unofficial Weka package dataset-weights allows you to modify attribute/instance weights using filters - no coding required","title":"See also"},{"location":"adding_attributes_to_dataset/","text":"The following example class adds a nominal and a numeric attribute to the dataset identified by the filename given as first parameter. The second parameter defines whether the data is manipulated via the Add filter (= filter ) or through the Weka API directly (= java ). Usage: AddAttribute <file.arff> <filter|java> Source code: import weka.core.* ; import weka.filters.Filter ; import weka.filters.unsupervised.attribute.Add ; import java.io.* ; import java.util.* ; /** * Adds a nominal and a numeric attribute to the dataset provided as first * parameter (and fills it with random values) and outputs the result to * stdout. 
It's either done via the Add filter (first option \"filter\") * or manual with Java (second option \"java\"). * * Usage: AddAttribute <file.arff> <filter|java> * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class AddAttribute { /** * adds the attributes * * @param args the commandline arguments */ public static void main ( String [] args ) throws Exception { if ( args . length != 2 ) { System . out . println ( \"\\nUsage: AddAttribute <file.arff> <filter|java>\\n\" ); System . exit ( 1 ); } // load dataset Instances data = new Instances ( new BufferedReader ( new FileReader ( args [ 0 ] ))); Instances newData = null ; // filter or java? if ( args [ 1 ] . equals ( \"filter\" )) { Add filter ; newData = new Instances ( data ); // 1. nominal attribute filter = new Add (); filter . setAttributeIndex ( \"last\" ); filter . setNominalLabels ( \"A,B,C,D\" ); filter . setAttributeName ( \"NewNominal\" ); filter . setInputFormat ( newData ); newData = Filter . useFilter ( newData , filter ); // 2. numeric attribute filter = new Add (); filter . setAttributeIndex ( \"last\" ); filter . setAttributeName ( \"NewNumeric\" ); filter . setInputFormat ( newData ); newData = Filter . useFilter ( newData , filter ); } else if ( args [ 1 ] . equals ( \"java\" )) { newData = new Instances ( data ); // add new attributes // 1. nominal FastVector values = new FastVector (); /* FastVector is now deprecated. Users can use any java.util.List */ values . addElement ( \"A\" ); /* implementation now */ values . addElement ( \"B\" ); values . addElement ( \"C\" ); values . addElement ( \"D\" ); newData . insertAttributeAt ( new Attribute ( \"NewNominal\" , values ), newData . numAttributes ()); // 2. numeric newData . insertAttributeAt ( new Attribute ( \"NewNumeric\" ), newData . numAttributes ()); } else { System . out . println ( \"\\nUsage: AddAttribute <file.arff> <filter|java>\\n\" ); System . exit ( 2 ); } // random values Random rand = new Random ( 1 ); for ( int i = 0 ; i < newData . 
numInstances (); i ++ ) { // 1. nominal // index of labels A:0,B:1,C:2,D:3 newData . instance ( i ). setValue ( newData . numAttributes () - 2 , rand . nextInt ( 4 )); // 2. numeric newData . instance ( i ). setValue ( newData . numAttributes () - 1 , rand . nextDouble ()); } // output on stdout System . out . println ( newData ); } } See also # Creating an ARFF file - explains the creation of all the different attribute types Use Weka in your Java code - for general usage of the Weka API Save Instances to an ARFF File - if you want to save the output to a file instead of printing them to stdout Downloads # AddAttribute.java ( stable , developer )","title":"Adding attributes to dataset"},{"location":"adding_attributes_to_dataset/#see-also","text":"Creating an ARFF file - explains the creation of all the different attribute types Use Weka in your Java code - for general usage of the Weka API Save Instances to an ARFF File - if you want to save the output to a file instead of printing them to stdout","title":"See also"},{"location":"adding_attributes_to_dataset/#downloads","text":"AddAttribute.java ( stable , developer )","title":"Downloads"},{"location":"adding_tabs_in_the_explorer/","text":"Description # This article explains how to add extra tabs in the Explorer in order to add new functionality without the hassle of having to dig into the Explorer code oneself. With the new plugin-architecture of the Explorer it is fairly easy making your extensions available in the GUI. Note: This is also covered in chapter Extending WEKA of the WEKA manual in versions later than 3.6.1/3.7.0 of the stable-3.6/developer version later than 10/01/2010. 
Version # 3.5.5 Requirements # Here is roughly what is required in order to add a new tab (the examples go into more detail): your class must be derived from javax.swing.JPanel your class must implement the interface weka.gui.explorer.Explorer.ExplorerPanel optional interfaces weka.gui.explorer.Explorer.LogHandler in case you want to take advantage of the logging in the Explorer weka.gui.explorer.Explorer.CapabilitiesFilterChangeListener in case your class needs to be notified of changes in the Capabilities, e.g., if new data is loaded into the Explorer adding the classname of your class to the Tabs property in the Explorer.props file Examples # The following examples demonstrate the new plugin architecture (a bold term for such a simple extension mechanism). Only the necessary details are discussed, as the full source code is available for download as well. SQL worksheet # Purpose # Displaying the SqlViewer as a tab in the Explorer instead of using it either via the Open DB... button or as standalone application. Uses the existing components already available in Weka and just assembles them in a JPanel . Since this tab does not rely on a dataset being loaded into the Explorer, it will be used as a standalone one. Useful for people who are working a lot with databases and would like to have an SQL worksheet available all the time instead of clicking on a button every time to open up a database dialog. 
Implementation # class is derived from javax.swing.JPanel and implements the weka.gui.explorer.Explorer.ExplorerPanel interface (the full source code also imports the weka.gui.explorer.Explorer.LogHandler interface, but that is only additional functionality): public class SqlPanel extends JPanel implements ExplorerPanel { * some basic members that we need to have /** the parent frame */ protected Explorer m_Explorer = null ; /** sends notifications when the set of working instances gets changed*/ protected PropertyChangeSupport m_Support = new PropertyChangeSupport ( this ); * methods we need to implement due to the used interfaces /** Sets the Explorer to use as parent frame */ public void setExplorer ( Explorer parent ) { m_Explorer = parent ; } /** returns the parent Explorer frame */ public Explorer getExplorer () { return m_Explorer ; } /** Returns the title for the tab in the Explorer */ public String getTabTitle () { return \"SQL\" ; // what's displayed as tab-title, e.g., Classify } /** Returns the tooltip for the tab in the Explorer */ public String getTabTitleToolTip () { return \"Retrieving data from databases\" ; // the tooltip of the tab } /** ignored, since we \"generate\" data and not receive it */ public void setInstances ( Instances inst ) { } /** PropertyChangeListener who will be notified of value changes. */ public void addPropertyChangeListener ( PropertyChangeListener l ) { m_Support . addPropertyChangeListener ( l ); } /** Removes a PropertyChangeListener. */ public void removePropertyChangeListener ( PropertyChangeListener l ) { m_Support . 
removePropertyChangeListener ( l ); } * additional GUI elements /** the actual SQL worksheet */ protected SqlViewer m_Viewer ; /** the panel for the buttons */ protected JPanel m_PanelButtons ; /** the Load button - makes the data available in the Explorer */ protected JButton m_ButtonLoad = new JButton ( \"Load data\" ); /** displays the current query */ protected JLabel m_LabelQuery = new JLabel ( \"\" ); * loading the data into the Explorer by clicking on the Load button will fire a property change: m_ButtonLoad . addActionListener ( new ActionListener () { public void actionPerformed ( ActionEvent evt ){ m_Support . firePropertyChange ( \"\" , null , null ); } }); * the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel: addPropertyChangeListener ( new PropertyChangeListener () { public void propertyChange ( PropertyChangeEvent e ) { try { // load data InstanceQuery query = new InstanceQuery (); query . setDatabaseURL ( m_Viewer . getURL ()); query . setUsername ( m_Viewer . getUser ()); query . setPassword ( m_Viewer . getPassword ()); Instances data = query . retrieveInstances ( m_Viewer . getQuery ()); // set data in preprocess panel (will also notify of capabilities changes) getExplorer (). getPreprocessPanel (). setInstances ( data ); } catch ( Exception ex ) { ex . printStackTrace (); } } }); * In order to add our SqlPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this: Tabs=weka.gui.explorer.SqlPanel,\\ weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel Screenshot # Source # SqlPanel.java ( stable-3.8 , developer ) Artificial data generation # Purpose # Instead of only having a Generate... 
button in the PreprocessPanel or using it from commandline, this example creates a new panel to be displayed as extra tab in the Explorer. This tab will be available regardless whether a dataset is already loaded or not (= standalone ). Implementation # class is derived from javax.swing.JPanel and implements the weka.gui.explorer.Explorer.ExplorerPanel interface (the full source code also imports the weka.gui.explorer.Explorer.LogHandler interface, but that is only additional functionality): public class GeneratorPanel extends JPanel implements ExplorerPanel { * some basic members that we need to have (the same as for the SqlPanel class): /** the parent frame */ protected Explorer m_Explorer = null ; /** sends notifications when the set of working instances gets changed*/ protected PropertyChangeSupport m_Support = new PropertyChangeSupport ( this ); * methods we need to implement due to the used interfaces (almost identical to SqlPanel ): /** Sets the Explorer to use as parent frame */ public void setExplorer ( Explorer parent ) { m_Explorer = parent ; } /** returns the parent Explorer frame */ public Explorer getExplorer () { return m_Explorer ; } /** Returns the title for the tab in the Explorer */ public String getTabTitle () { return \"DataGeneration\" ; // what's displayed as tab-title, e.g., Classify } /** Returns the tooltip for the tab in the Explorer */ public String getTabTitleToolTip () { return \"Generating artificial datasets\" ; // the tooltip of the tab } /** ignored, since we \"generate\" data and not receive it */ public void setInstances ( Instances inst ) { } /** PropertyChangeListener who will be notified of value changes. */ public void addPropertyChangeListener ( PropertyChangeListener l ) { m_Support . addPropertyChangeListener ( l ); } /** Removes a PropertyChangeListener. */ public void removePropertyChangeListener ( PropertyChangeListener l ) { m_Support . 
removePropertyChangeListener ( l ); } * additional GUI elements: /** the GOE for the generators */ protected GenericObjectEditor m_GeneratorEditor = new GenericObjectEditor (); /** the text area for the output of the generated data */ protected JTextArea m_Output = new JTextArea (); /** the Generate button */ protected JButton m_ButtonGenerate = new JButton ( \"Generate\" ); /** the Use button */ protected JButton m_ButtonUse = new JButton ( \"Use\" ); * the Generate button doesn't load the generated data directly into the Explorer, but only outputs in the JTextArea (this is done with the Use button - see further down): m_ButtonGenerate . addActionListener ( new ActionListener (){ public void actionPerformed ( ActionEvent evt ){ DataGenerator generator = ( DataGenerator ) m_GeneratorEditor . getValue (); String relName = generator . getRelationName (); String cname = generator . getClass (). getName (). replaceAll ( \".*\\\\.\" , \"\" ); String cmd = generator . getClass (). getName (); if ( generator instanceof OptionHandler ) cmd += \" \" + Utils . joinOptions ((( OptionHandler ) generator ). getOptions ()); try { // generate data StringWriter output = new StringWriter (); generator . setOutput ( new PrintWriter ( output )); DataGenerator . makeData ( generator , generator . getOptions ()); m_Output . setText ( output . toString ()); } catch ( Exception ex ) { ex . printStackTrace (); JOptionPane . showMessageDialog ( getExplorer (), \"Error generating data:\\n\" + ex . getMessage (), \"Error\" , JOptionPane . ERROR_MESSAGE ); } generator . setRelationName ( relName ); } }); * the Use button finally fires a property change event that will load the data into the Explorer: m_ButtonUse . addActionListener ( new ActionListener (){ public void actionPerformed ( ActionEvent evt ){ m_Support . 
firePropertyChange ( \"\" , null , null ); } }); * the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel: addPropertyChangeListener ( new PropertyChangeListener () { public void propertyChange ( PropertyChangeEvent e ) { try { Instances data = new Instances ( new StringReader ( m_Output . getText ())); // set data in preprocess panel (will also notify of capabilities changes) getExplorer (). getPreprocessPanel (). setInstances ( data ); } catch ( Exception ex ) { ex . printStackTrace (); JOptionPane . showMessageDialog ( getExplorer (), \"Error generating data:\\n\" + ex . getMessage (), \"Error\" , JOptionPane . ERROR_MESSAGE ); } } }); * In order to add our GeneratorPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this: Tabs=weka.gui.explorer.GeneratorPanel:standalone,\\ weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel Note: the standalone option is used to make the tab available without requiring the preprocess panel to load a dataset first. Screenshot # Source # GeneratorPanel.java ( stable-3.8 , developer ) Experimenter \"light\" # Purpose # By default the Classify panel only performs 1 run of 10-fold cross-validation. Since most classifiers are rather sensitive to the order of the data being presented to them, those results can be too optimistic or pessimistic. Averaging the results over 10 runs with differently randomized train/test pairs returns more reliable results. And this is where this plugin comes in: it can be used to obtain statistically sound results for a specific classifier/dataset combination, without having to set up a whole experiment in the Experimenter. 
Implementation # Since this plugin is rather bulky, we omit the implementation details, but the following can be said: based on the weka.gui.explorer.ClassifierPanel the actual code doing the work follows the example in Using the Experiment API article * In order to add our ExperimentPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this: Tabs=weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ExperimentPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel Screenshot # Source # ExperimentPanel.java ( stable-3.6 , developer )","title":"Description"},{"location":"adding_tabs_in_the_explorer/#description","text":"This article explains how to add extra tabs in the Explorer in order to add new functionality without the hassle of having to dig into the Explorer code oneself. With the new plugin-architecture of the Explorer it is fairly easy making your extensions available in the GUI. 
Note: This is also covered in chapter Extending WEKA of the WEKA manual in versions later than 3.6.1/3.7.0 of the stable-3.6/developer version later than 10/01/2010.","title":"Description"},{"location":"adding_tabs_in_the_explorer/#version","text":"3.5.5","title":"Version"},{"location":"adding_tabs_in_the_explorer/#requirements","text":"Here is roughly what is required in order to add a new tab (the examples go into more detail): your class must be derived from javax.swing.JPanel your class must implement the interface weka.gui.explorer.Explorer.ExplorerPanel optional interfaces weka.gui.explorer.Explorer.LogHandler in case you want to take advantage of the logging in the Explorer weka.gui.explorer.Explorer.CapabilitiesFilterChangeListener in case your class needs to be notified of changes in the Capabilities, e.g., if new data is loaded into the Explorer adding the classname of your class to the Tabs property in the Explorer.props file","title":"Requirements"},{"location":"adding_tabs_in_the_explorer/#examples","text":"The following examples demonstrate the new plugin architecture (a bold term for such a simple extension mechanism). Only the necessary details are discussed, as the full source code is available for download as well.","title":"Examples"},{"location":"adding_tabs_in_the_explorer/#sql-worksheet","text":"","title":"SQL worksheet"},{"location":"adding_tabs_in_the_explorer/#purpose","text":"Displaying the SqlViewer as a tab in the Explorer instead of using it either via the Open DB... button or as standalone application. Uses the existing components already available in Weka and just assembles them in a JPanel . Since this tab does not rely on a dataset being loaded into the Explorer, it will be used as a standalone one. 
Useful for people who are working a lot with databases and would like to have an SQL worksheet available all the time instead of clicking on a button every time to open up a database dialog.","title":"Purpose"},{"location":"adding_tabs_in_the_explorer/#implementation","text":"class is derived from javax.swing.JPanel and implements the weka.gui.explorer.Explorer.ExplorerPanel interface (the full source code also imports the weka.gui.explorer.Explorer.LogHandler interface, but that is only additional functionality): public class SqlPanel extends JPanel implements ExplorerPanel { * some basic members that we need to have /** the parent frame */ protected Explorer m_Explorer = null ; /** sends notifications when the set of working instances gets changed*/ protected PropertyChangeSupport m_Support = new PropertyChangeSupport ( this ); * methods we need to implement due to the used interfaces /** Sets the Explorer to use as parent frame */ public void setExplorer ( Explorer parent ) { m_Explorer = parent ; } /** returns the parent Explorer frame */ public Explorer getExplorer () { return m_Explorer ; } /** Returns the title for the tab in the Explorer */ public String getTabTitle () { return \"SQL\" ; // what's displayed as tab-title, e.g., Classify } /** Returns the tooltip for the tab in the Explorer */ public String getTabTitleToolTip () { return \"Retrieving data from databases\" ; // the tooltip of the tab } /** ignored, since we \"generate\" data and not receive it */ public void setInstances ( Instances inst ) { } /** PropertyChangeListener who will be notified of value changes. */ public void addPropertyChangeListener ( PropertyChangeListener l ) { m_Support . addPropertyChangeListener ( l ); } /** Removes a PropertyChangeListener. */ public void removePropertyChangeListener ( PropertyChangeListener l ) { m_Support . 
removePropertyChangeListener ( l ); } * additional GUI elements /** the actual SQL worksheet */ protected SqlViewer m_Viewer ; /** the panel for the buttons */ protected JPanel m_PanelButtons ; /** the Load button - makes the data available in the Explorer */ protected JButton m_ButtonLoad = new JButton ( \"Load data\" ); /** displays the current query */ protected JLabel m_LabelQuery = new JLabel ( \"\" ); * loading the data into the Explorer by clicking on the Load button will fire a property change: m_ButtonLoad . addActionListener ( new ActionListener () { public void actionPerformed ( ActionEvent evt ){ m_Support . firePropertyChange ( \"\" , null , null ); } }); * the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel: addPropertyChangeListener ( new PropertyChangeListener () { public void propertyChange ( PropertyChangeEvent e ) { try { // load data InstanceQuery query = new InstanceQuery (); query . setDatabaseURL ( m_Viewer . getURL ()); query . setUsername ( m_Viewer . getUser ()); query . setPassword ( m_Viewer . getPassword ()); Instances data = query . retrieveInstances ( m_Viewer . getQuery ()); // set data in preprocess panel (will also notify of capabilities changes) getExplorer (). getPreprocessPanel (). setInstances ( data ); } catch ( Exception ex ) { ex . printStackTrace (); } } }); * In order to add our SqlPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). 
The Tabs property must look like this: Tabs=weka.gui.explorer.SqlPanel,\\ weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel","title":"Implementation"},{"location":"adding_tabs_in_the_explorer/#screenshot","text":"","title":"Screenshot"},{"location":"adding_tabs_in_the_explorer/#source","text":"SqlPanel.java ( stable-3.8 , developer )","title":"Source"},{"location":"adding_tabs_in_the_explorer/#artificial-data-generation","text":"","title":"Artificial data generation"},{"location":"adding_tabs_in_the_explorer/#purpose_1","text":"Instead of only having a Generate... button in the PreprocessPanel or using it from the commandline, this example creates a new panel to be displayed as an extra tab in the Explorer. This tab will be available regardless of whether a dataset is already loaded or not (= standalone ).","title":"Purpose"},{"location":"adding_tabs_in_the_explorer/#implementation_1","text":"class is derived from javax.swing.JPanel and implements the weka.gui.explorer.Explorer.ExplorerPanel interface (the full source code also imports the weka.gui.explorer.Explorer.LogHandler interface, but that is only additional functionality): public class GeneratorPanel extends JPanel implements ExplorerPanel { * some basic members that we need to have (the same as for the SqlPanel class): /** the parent frame */ protected Explorer m_Explorer = null ; /** sends notifications when the set of working instances gets changed*/ protected PropertyChangeSupport m_Support = new PropertyChangeSupport ( this ); * methods we need to implement due to the used interfaces (almost identical to SqlPanel ): /** Sets the Explorer to use as parent frame */ public void setExplorer ( Explorer parent ) { m_Explorer = parent ; } /** returns the parent Explorer frame */ public Explorer getExplorer () { return m_Explorer ; } /** Returns the title for the tab in the Explorer */ public String 
getTabTitle () { return \"DataGeneration\" ; // what's displayed as tab-title, e.g., Classify } /** Returns the tooltip for the tab in the Explorer */ public String getTabTitleToolTip () { return \"Generating artificial datasets\" ; // the tooltip of the tab } /** ignored, since we \"generate\" data and not receive it */ public void setInstances ( Instances inst ) { } /** PropertyChangeListener who will be notified of value changes. */ public void addPropertyChangeListener ( PropertyChangeListener l ) { m_Support . addPropertyChangeListener ( l ); } /** Removes a PropertyChangeListener. */ public void removePropertyChangeListener ( PropertyChangeListener l ) { m_Support . removePropertyChangeListener ( l ); } * additional GUI elements: /** the GOE for the generators */ protected GenericObjectEditor m_GeneratorEditor = new GenericObjectEditor (); /** the text area for the output of the generated data */ protected JTextArea m_Output = new JTextArea (); /** the Generate button */ protected JButton m_ButtonGenerate = new JButton ( \"Generate\" ); /** the Use button */ protected JButton m_ButtonUse = new JButton ( \"Use\" ); * the Generate button doesn't load the generated data directly into the Explorer, but only outputs in the JTextArea (this is done with the Use button - see further down): m_ButtonGenerate . addActionListener ( new ActionListener (){ public void actionPerformed ( ActionEvent evt ){ DataGenerator generator = ( DataGenerator ) m_GeneratorEditor . getValue (); String relName = generator . getRelationName (); String cname = generator . getClass (). getName (). replaceAll ( \".*\\\\.\" , \"\" ); String cmd = generator . getClass (). getName (); if ( generator instanceof OptionHandler ) cmd += \" \" + Utils . joinOptions ((( OptionHandler ) generator ). getOptions ()); try { * generate data StringWriter output = new StringWriter (); generator . setOutput ( new PrintWriter ( output )); DataGenerator . makeData ( generator , generator . 
getOptions ()); m_Output . setText ( output . toString ()); } catch ( Exception ex ) { ex . printStackTrace (); JOptionPane . showMessageDialog ( getExplorer (), \"Error generating data:\\n\" + ex . getMessage (), \"Error\" , JOptionPane . ERROR_MESSAGE ); } generator . setRelationName ( relName ); } }); * the Use button finally fires a property change event that will load the data into the Explorer: m_ButtonUse . addActionListener ( new ActionListener (){ public void actionPerformed ( ActionEvent evt ){ m_Support . firePropertyChange ( \"\" , null , null ); } }); * the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel: addPropertyChangeListener ( new PropertyChangeListener () { public void propertyChange ( PropertyChangeEvent e ) { try { Instances data = new Instances ( new StringReader ( m_Output . getText ())); * set data in preprocess panel ( will also notify of capabilities changes ) getExplorer (). getPreprocessPanel (). setInstances ( data ); } catch ( Exception ex ) { ex . printStackTrace (); JOptionPane . showMessageDialog ( getExplorer (), \"Error generating data:\\n\" + ex . getMessage (), \"Error\" , JOptionPane . ERROR_MESSAGE ); } } }); * In order to add our GeneratorPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). 
The Tabs property must look like this: Tabs=weka.gui.explorer.GeneratorPanel:standalone,\\ weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel Note: the standalone option is used to make the tab available without requiring the preprocess panel to load a dataset first.","title":"Implementation"},{"location":"adding_tabs_in_the_explorer/#screenshot_1","text":"","title":"Screenshot"},{"location":"adding_tabs_in_the_explorer/#source_1","text":"GeneratorPanel.java ( stable-3.8 , developer )","title":"Source"},{"location":"adding_tabs_in_the_explorer/#experimenter-light","text":"","title":"Experimenter \"light\""},{"location":"adding_tabs_in_the_explorer/#purpose_2","text":"By default the Classify panel only performs 1 run of 10-fold cross-validation. Since most classifiers are rather sensitive to the order of the data being presented to them, those results can be too optimistic or pessimistic. Averaging the results over 10 runs with differently randomized train/test pairs returns more reliable results. And this is where this plugin comes in: it can be used to obtain statistically sound results for a specific classifier/dataset combination, without having to set up a whole experiment in the Experimenter.","title":"Purpose"},{"location":"adding_tabs_in_the_explorer/#implementation_2","text":"Since this plugin is rather bulky, we omit the implementation details, but the following can be said: it is based on the weka.gui.explorer.ClassifierPanel, and the actual code doing the work follows the example in the Using the Experiment API article * In order to add our ExperimentPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). 
The Tabs property must look like this: Tabs=weka.gui.explorer.ClassifierPanel,\\ weka.gui.explorer.ExperimentPanel,\\ weka.gui.explorer.ClustererPanel,\\ weka.gui.explorer.AssociationsPanel,\\ weka.gui.explorer.AttributeSelectionPanel,\\ weka.gui.explorer.VisualizePanel","title":"Implementation"},{"location":"adding_tabs_in_the_explorer/#screenshot_2","text":"","title":"Screenshot"},{"location":"adding_tabs_in_the_explorer/#source_2","text":"ExperimentPanel.java ( stable-3.6 , developer )","title":"Source"},{"location":"ant/","text":"What is ANT? This is how the ANT homepage defines its tool: Apache Ant is a Java-based build tool. In theory, it is kind of like Make, but without Make's wrinkles. Basics # the ANT build file is based on XML the usual name for the build file is build.xml invocation - the usual build file need not be specified explicitly, if it's in the current directory; if no target is specified, the default one is used ant [-f ] [] displaying all the available targets of a build file ant [-f ] -projecthelp Weka and ANT # a build file for Weka is available from git (it has been included in the weka-src.jar since version 3.4.8 and 3.5.3) it is located in the weka directory some targets of interest clean - Removes the build, dist and reports directories; also any class files in the source tree compile - Compile weka and deposit class files in ${path_modifier}/build/classes docs - Make javadocs into ${path_modifier}/doc exejar - Create an executable jar file in ${path_modifier}/dist Links # ANT homepage XML","title":"Ant"},{"location":"ant/#basics","text":"the ANT build file is based on XML the usual name for the build file is build.xml invocation - the usual build file need not be specified explicitly, if it's in the current directory; if no target is specified, the default one is used ant [-f ] [] displaying all the available targets of a build file ant [-f ] -projecthelp","title":"Basics"},{"location":"ant/#weka-and-ant","text":"a build file 
for Weka is available from git (it has been included in the weka-src.jar since version 3.4.8 and 3.5.3) it is located in the weka directory some targets of interest clean - Removes the build, dist and reports directories; also any class files in the source tree compile - Compile weka and deposit class files in ${path_modifier}/build/classes docs - Make javadocs into ${path_modifier}/doc exejar - Create an executable jar file in ${path_modifier}/dist","title":"Weka and ANT"},{"location":"ant/#links","text":"ANT homepage XML","title":"Links"},{"location":"auc/","text":"AUC = the A rea U nder the ROC C urve. Weka uses the Mann Whitney statistic to calculate the AUC via the weka.classifiers.evaluation.ThresholdCurve class. Explorer # See ROC curves . KnowledgeFlow # See ROC curves . Commandline # Classifiers can output the AUC if the -i option is provided. The -i option provides detailed information per class. Running the J48 classifier on the iris UCI Dataset with the following commandline: java [CLASSPATH|-classpath ] weka.classifiers.trees.J48 -t /some/where/iris.arff -i produces this output: == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.98 0 1 0.98 0.99 0.99 Iris-setosa 0.94 0.03 0.94 0.94 0.94 0.952 Iris-versicolor 0.96 0.03 0.941 0.96 0.95 0.961 Iris-virginica See also # ROC curves Mann Whitney statistic on WikiPedia Links # University of Nebraska Medical Center, Interpreting Diagnostic Tests weka.classifiers.evaluation.ThresholdCurve","title":"Auc"},{"location":"auc/#explorer","text":"See ROC curves .","title":"Explorer"},{"location":"auc/#knowledgeflow","text":"See ROC curves .","title":"KnowledgeFlow"},{"location":"auc/#commandline","text":"Classifiers can output the AUC if the -i option is provided. The -i option provides detailed information per class. 
Running the J48 classifier on the iris UCI Dataset with the following commandline: java [CLASSPATH|-classpath ] weka.classifiers.trees.J48 -t /some/where/iris.arff -i produces this output: == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.98 0 1 0.98 0.99 0.99 Iris-setosa 0.94 0.03 0.94 0.94 0.94 0.952 Iris-versicolor 0.96 0.03 0.941 0.96 0.95 0.961 Iris-virginica","title":"Commandline"},{"location":"auc/#see-also","text":"ROC curves Mann Whitney statistic on WikiPedia","title":"See also"},{"location":"auc/#links","text":"University of Nebraska Medical Center, Interpreting Diagnostic Tests weka.classifiers.evaluation.ThresholdCurve","title":"Links"},{"location":"batch_filtering/","text":"Batch filtering is used if a second dataset, normally the test set, needs to be processed with the same statistics as the first dataset, normally the training set. For example, performing standardization with the Standardize filter on two datasets separately will most certainly create two differently standardized output files, since the mean and the standard deviation are based on the input data (and those will differ if the datasets are different). The same applies to the StringToWordVector : here the word dictionary will change, since word occurrences will differ in training and test set. The generated output will be two incompatible files. In order to create compatible train and test set, batch filtering is necessary. Here, the first input/output pair ( -i / -o ) initializes the filter's statistics and the second input/output pair ( -r / -s ) gets processed according to those statistics. To enable batch filtering, one has to provide the additional parameter -b on the commandline. 
Here is an example Java call: java weka.filters.unsupervised.attribute.Standardize \\ -b \\ -i train.arff \\ -o train_std.arff \\ -r test.arff \\ -s test_std.arff Note: The commandline outlined above is for a Linux/Unix bash (the backslash tells the shell that the command isn't finished yet and continues on the next line). In case of Windows or the SimpleCLI, just remove those backslashes and put everything on one line. See also # See section Batch filtering in the article Use Weka in your Java code , in case you need to perform batch filtering from within your own code","title":"Batch filtering"},{"location":"batch_filtering/#see-also","text":"See section Batch filtering in the article Use Weka in your Java code , in case you need to perform batch filtering from within your own code","title":"See also"},{"location":"binarize_attribute/","text":"Sometimes one wants to binarize a nominal attribute of a certain dataset by grouping all values except the one of interest together as a negation of this value. E.g., in the {{weather}} data the outlook attribute, where sunny is of interest and the other values, rainy and overcast , are grouped together as not-sunny . 
Original dataset: @relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature real @attribute humidity real @attribute windy {TRUE, FALSE} @attribute play {yes, no} @data sunny,85,85,FALSE,no sunny,80,90,TRUE,no overcast,83,86,FALSE,yes rainy,70,96,FALSE,yes rainy,68,80,FALSE,yes rainy,65,70,TRUE,no overcast,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes rainy,75,80,FALSE,yes sunny,75,70,TRUE,yes overcast,72,90,TRUE,yes overcast,81,75,FALSE,yes rainy,71,91,TRUE,no Desired output: @relation weather-sunny-and-not_sunny @attribute outlook {sunny,not_sunny} @attribute temperature numeric @attribute humidity numeric @attribute windy {TRUE,FALSE} @attribute play {yes,no} @data sunny,85,85,FALSE,no sunny,80,90,TRUE,no not_sunny,83,86,FALSE,yes not_sunny,70,96,FALSE,yes not_sunny,68,80,FALSE,yes not_sunny,65,70,TRUE,no not_sunny,64,65,TRUE,yes sunny,72,95,FALSE,no sunny,69,70,FALSE,yes not_sunny,75,80,FALSE,yes sunny,75,70,TRUE,yes not_sunny,72,90,TRUE,yes not_sunny,81,75,FALSE,yes not_sunny,71,91,TRUE,no The Weka filter NominalToBinary cannot be used directly, since it generates a new attribute for each value of the nominal attribute. As a postprocessing step one could delete all the attributes that are of no interest, but this is quite cumbersome. The Binarize.java class, on the other hand, directly generates several ARFF files in the desired format from a given one. Download # Binarize.java ( stable , developer )","title":"Binarize attribute"},{"location":"binarize_attribute/#download","text":"Binarize.java ( stable , developer )","title":"Download"},{"location":"citing_weka/","text":"The best reference for WEKA 3.8 and 3.9 is the online appendix on the WEKA workbench for the fourth edition of \"Data Mining: Practical Machine Learning Tools and Techniques\" by I.H. Witten, Eibe Frank, Mark A. Hall, and Chris J. Pal. The citation is Eibe Frank, Mark A. Hall, and Ian H. Witten (2016). The WEKA Workbench. 
Online Appendix for \"Data Mining: Practical Machine Learning Tools and Techniques\", Morgan Kaufmann, Fourth Edition, 2016. You may also want to consider the SIGKDD Explorations paper covering WEKA 3.6. The citation is Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1. The WEKA logo is available under the Creative Commons Attribution-ShareAlike 2.5 License .","title":"Citing Weka"},{"location":"classifying_large_datasets/","text":"Unless one has access to a 64-bit machine with lots of RAM, one can quite easily run into an OutOfMemoryError when running WEKA on large datasets. This article tries to present some solutions apart from buying new hardware. Sampling # The question is, does one really need to train with all the data, or is a subset of the data already sufficient? WEKA offers several filters for re-sampling a dataset and generating a new dataset reduced in size: weka.filters.supervised.instance.Resample This filter takes the class distribution into account for generating the sample, i.e., you can even adjust the distribution by adding a bias. weka.filters.unsupervised.instance.Resample The unsupervised filter does not take the class distribution into account for generating the output. weka.filters.supervised.instance.SpreadSubsample It allows you to specify the maximum \"spread\" between the rarest and most common class. See the respective Javadoc for more information ( book version , developer version ). Incremental classifiers # Most classifiers need to see all the data before they can be trained, e.g., J48 or SMO. But there are also schemes that can be trained in an incremental fashion, not just in batch mode. All classifiers implementing the weka.classifiers.UpdateableClassifier interface are able to process data in such a way. 
Running such a classifier from commandline will load the dataset incrementally (NB: not all data formats can be loaded incrementally; XRFF is one of them, ARFF on the other hand can be read incrementally) and feed the data instance by instance to the classifier. Check out the Javadoc of the UpdateableClassifier interface to see what schemes implement it ( book version , developer version ). Other tools # MOA - Massive Online Analysis A framework for learning from a continuous supply of examples, a data stream. Includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, also written in Java, while scaling to more demanding problems.","title":"Classifying large datasets"},{"location":"classifying_large_datasets/#sampling","text":"The question is, does one really need to train with all the data, or is a subset of the data already sufficient? WEKA offers several filters for re-sampling a dataset and generating a new dataset reduced in size: weka.filters.supervised.instance.Resample This filter takes the class distribution into account for generating the sample, i.e., you can even adjust the distribution by adding a bias. weka.filters.unsupervised.instance.Resample The unsupervised filter does not take the class distribution into account for generating the output. weka.filters.supervised.instance.SpreadSubsample It allows you to specify the maximum \"spread\" between the rarest and most common class. See the respective Javadoc for more information ( book version , developer version ).","title":"Sampling"},{"location":"classifying_large_datasets/#incremental-classifiers","text":"Most classifiers need to see all the data before they can be trained, e.g., J48 or SMO. But there are also schemes that can be trained in an incremental fashion, not just in batch mode. All classifiers implementing the weka.classifiers.UpdateableClassifier interface are able to process data in such a way. 
Running such a classifier from commandline will load the dataset incrementally (NB: not all data formats can be loaded incrementally; XRFF is one of them, ARFF on the other hand can be read incrementally) and feed the data instance by instance to the classifier. Check out the Javadoc of the UpdateableClassifier interface to see what schemes implement it ( book version , developer version ).","title":"Incremental classifiers"},{"location":"classifying_large_datasets/#other-tools","text":"MOA - Massive Online Analysis A framework for learning from a continuous supply of examples, a data stream. Includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, also written in Java, while scaling to more demanding problems.","title":"Other tools"},{"location":"classpath/","text":"The CLASSPATH environment variable tells Java where to look for classes. Since Java does the search in a ''first-come-first-serve'' kind of manner, you'll have to take care where and what to put in your CLASSPATH. I, personally, never use the environment variable, since I'm working often on a project in different versions in parallel. The CLASSPATH would just mess up things, if you're not careful (or just forget to remove an entry). ANT offers a nice way for building (and separating source code and class files) Java projects. But still, if you're only working on totally separate projects, it might be easiest for you to use the environment variable. Setting the CLASSPATH # In the following we add the mysql-connector-java-5.1.6-bin.jar to our CLASSPATH variable (this works for any other jar archive) to make it possible to access MySQL Databases via JDBC. Windows # We assume that the mysql-connector-java-5.1.6-bin.jar archive is located in the following directory: C:\\Program Files\\Weka-3-8 In the Control Panel click on System (or right click on This PC and select Properties ) and then go to the Advanced tab. 
There you will find a button called Environment Variables , click it. Depending on whether you're the only person using this computer or it is a lab computer shared by many, you can either create a new system-wide environment variable (if you are the only user) or a user-dependent one (recommended for multi-user machines). Enter the following name for the variable CLASSPATH and add this value C:\\Program Files\\Weka-3-8\\mysql-connector-java-5.1.6-bin.jar If you want to add additional jars, you'll have to separate them with the path separator, the semicolon ; (no spaces!). Unix/Linux # I assume that the mysql jar is located in the following directory: /home/johndoe/jars/ Open a shell and execute the following command, depending on the shell you're using: bash export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.6-bin.jar c shell setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.6-bin.jar Unix/Linux uses the colon : as path separator, in contrast to Windows, which uses the semicolon ; . Note: the prefixing with $CLASSPATH adds the mysql jar at the end of the currently existing CLASSPATH . Cygwin # The process is the same as on Unix/Linux systems, but since the host system is Win32 and the Java installation is therefore also a Windows application, you'll have to use the semicolon ; as separator for several jars.","title":"Classpath"},{"location":"classpath/#setting-the-classpath","text":"In the following we add the mysql-connector-java-5.1.6-bin.jar to our CLASSPATH variable (this works for any other jar archive) to make it possible to access MySQL Databases via JDBC.","title":"Setting the CLASSPATH"},{"location":"classpath/#windows","text":"We assume that the mysql-connector-java-5.1.6-bin.jar archive is located in the following directory: C:\\Program Files\\Weka-3-8 In the Control Panel click on System (or right click on This PC and select Properties ) and then go to the Advanced tab. 
There you will find a button called Environment Variables , click it. Depending on whether you're the only person using this computer or it is a lab computer shared by many, you can either create a new system-wide environment variable (if you are the only user) or a user-dependent one (recommended for multi-user machines). Enter the following name for the variable CLASSPATH and add this value C:\\Program Files\\Weka-3-8\\mysql-connector-java-5.1.6-bin.jar If you want to add additional jars, you'll have to separate them with the path separator, the semicolon ; (no spaces!).","title":"Windows"},{"location":"classpath/#unixlinux","text":"I assume that the mysql jar is located in the following directory: /home/johndoe/jars/ Open a shell and execute the following command, depending on the shell you're using: bash export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.6-bin.jar c shell setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.6-bin.jar Unix/Linux uses the colon : as path separator, in contrast to Windows, which uses the semicolon ; . Note: the prefixing with $CLASSPATH adds the mysql jar at the end of the currently existing CLASSPATH .","title":"Unix/Linux"},{"location":"classpath/#cygwin","text":"The process is the same as on Unix/Linux systems, but since the host system is Win32 and the Java installation is therefore also a Windows application, you'll have to use the semicolon ; as separator for several jars.","title":"Cygwin"},{"location":"classpath_problems/","text":"Having problems getting Weka to run from a DOS/UNIX command prompt? Getting java.lang.NoClassDefFoundError exceptions? Most likely your CLASSPATH environment variable is not set correctly - it needs to point to the weka.jar file that you downloaded with Weka (or the parent of the Weka directory if you have extracted the jar). 
Under DOS this can be achieved with: set CLASSPATH=c:\\weka-3-4\\weka.jar;%CLASSPATH% Under UNIX/Linux something like: export CLASSPATH=/home/weka/weka.jar:$CLASSPATH An easy way to avoid setting the variable is to specify the CLASSPATH when calling Java. For example, if the jar file is located at c:\\weka-3-4\\weka.jar you can use: java -cp c:\\weka-3-4\\weka.jar weka.classifiers... See also the CLASSPATH article.","title":"Classpath problems"},{"location":"command_redirection/","text":"Console # With command redirection one can redirect standard streams like stdin , stdout and stderr to user-specified locations. Quite often it is useful to redirect the output of a program to a text file. redirecting stdout to a file someProgram >/some/where/output.txt (Linux/Unix Bash) someProgram >c:\\some\\where\\output.txt (Windows command prompt) redirecting stderr to a file someProgram 2>/some/where/output.txt (Linux/Unix Bash) someProgram 2>c:\\some\\where\\output.txt (Windows command prompt) redirecting stdout and stderr to a file someProgram &>/some/where/output.txt (Linux/Unix Bash) someProgram >c:\\some\\where\\output.txt 2>&1 (Windows command prompt) Note: under Weka quite often the output is printed to stderr , e.g., if one is using the -p 0 option from the commandline to print the predicted values for a test file: java weka.classifiers.trees.J48 -t train.arff -T test.arff -p 0 2> j48.txt or if one already has a trained model: java weka.classifiers.trees.J48 -l j48.model -T test.arff -p 0 2> j48.txt SimpleCLI # One can perform a basic redirection also in the SimpleCLI, e.g.: java weka.classifiers.trees.J48 -t test.arff > j48.txt Note: the > must be preceded and followed by a space , otherwise it is not recognized as redirection, but part of another parameter. 
Links # Linux Command redirection under Bash I/O Redirection under Bash Redirection under Unix (WikiPedia) Windows Command redirection under MS Windows Command redirection under MS DOS","title":"Console"},{"location":"command_redirection/#console","text":"With command redirection one can redirect standard streams like stdin , stdout and stderr to user-specified locations. Quite often it is useful to redirect the output of a program to a text file. redirecting stdout to a file someProgram >/some/where/output.txt (Linux/Unix Bash) someProgram >c:\\some\\where\\output.txt (Windows command prompt) redirecting stderr to a file someProgram 2>/some/where/output.txt (Linux/Unix Bash) someProgram 2>c:\\some\\where\\output.txt (Windows command prompt) redirecting stdout and stderr to a file someProgram &>/some/where/output.txt (Linux/Unix Bash) someProgram >c:\\some\\where\\output.txt 2>&1 (Windows command prompt) Note: under Weka quite often the output is printed to stderr , e.g., if one is using the -p 0 option from the commandline to print the predicted values for a test file: java weka.classifiers.trees.J48 -t train.arff -T test.arff -p 0 2> j48.txt or if one already has a trained model: java weka.classifiers.trees.J48 -l j48.model -T test.arff -p 0 2> j48.txt","title":"Console"},{"location":"command_redirection/#simplecli","text":"One can perform a basic redirection also in the SimpleCLI, e.g.: java weka.classifiers.trees.J48 -t test.arff > j48.txt Note: the > must be preceded and followed by a space , otherwise it is not recognized as redirection, but part of another parameter.","title":"SimpleCLI"},{"location":"command_redirection/#links","text":"Linux Command redirection under Bash I/O Redirection under Bash Redirection under Unix (WikiPedia) Windows Command redirection under MS Windows Command redirection under MS DOS","title":"Links"},{"location":"compiling_weka/","text":"There are several ways of compiling the Weka source code: with ant takes care of compiling all 
the necessary classes and easily generates jar archives with maven similar to ant with an IDE, like IntelliJ IDEA, Eclipse or NetBeans can be very helpful for debugging tricky bugs","title":"Compiling weka"},{"location":"cost_matrix/","text":"Format # Format of the cost matrices: regular % Rows Columns 2 2 % Matrix elements 0.0 5.0 1.0 0.0 Matlab single-line format (see also the Matlab Primer ) [0.0 5.0; 1.0 0.0] Testing the format # The following code loads a cost matrix and prints its content to the console. Useful, if one wants to test whether the format is correct: import weka.classifiers.CostMatrix ; import java.io.BufferedReader ; import java.io.FileReader ; /** * Loads the cost matrix \"args[0]\" and prints its content to the console. * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class CostMatrixLoader { public static void main ( String [] args ) throws Exception { CostMatrix matrix = new CostMatrix ( new BufferedReader ( new FileReader ( args [ 0 ] ))); System . out . println ( matrix ); } } See also # CostSensitiveClassifier MetaCost Downloads # CostMatrixLoader.java","title":"Format"},{"location":"cost_matrix/#format","text":"Format of the cost matrices: regular % Rows Columns 2 2 % Matrix elements 0.0 5.0 1.0 0.0 Matlab single-line format (see also the Matlab Primer ) [0.0 5.0; 1.0 0.0]","title":"Format"},{"location":"cost_matrix/#testing-the-format","text":"The following code loads a cost matrix and prints its content to the console. Useful, if one wants to test whether the format is correct: import weka.classifiers.CostMatrix ; import java.io.BufferedReader ; import java.io.FileReader ; /** * Loads the cost matrix \"args[0]\" and prints its content to the console. * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class CostMatrixLoader { public static void main ( String [] args ) throws Exception { CostMatrix matrix = new CostMatrix ( new BufferedReader ( new FileReader ( args [ 0 ] ))); System . out . 
println ( matrix ); } }","title":"Testing the format"},{"location":"cost_matrix/#see-also","text":"CostSensitiveClassifier MetaCost","title":"See also"},{"location":"cost_matrix/#downloads","text":"CostMatrixLoader.java","title":"Downloads"},{"location":"cost_sensitive_classifier/","text":"A meta classifier that makes its base classifier cost-sensitive. Two methods can be used to introduce cost-sensitivity: reweighting training instances according to the total cost assigned to each class; or predicting the class with minimum expected misclassification cost (rather than the most likely class). Performance can often be improved by using a bagged classifier to improve the probability estimates of the base classifier. Since the classifier, in default mode (i.e., when using the reweighting method), normalizes the cost matrix before applying it, it can be hard coming up with a cost matrix, e.g., one that balances out imbalanced data. Here is an example: input cost matrix -3 1 1 1 -6 1 0 0 0 normalized cost matrix 0 7 1 4 0 1 3 6 0 The application of a cost matrix using the second, minimum-expected cost approach, which is also used by MetaCost , is more intuitive. See also # MetaCost CostMatrix","title":"Cost sensitive classifier"},{"location":"cost_sensitive_classifier/#see-also","text":"MetaCost CostMatrix","title":"See also"},{"location":"creating_instances/","text":"see Creating an ARFF file","title":"Creating instances"},{"location":"databases/","text":"CLASSPATH # See the CLASSPATH article for how to set up your CLASSPATH environment variable, in order to make the JDBC driver available for Weka. Configuration files # Thanks to JDBC it is easy to connect to Databases that provide a JDBC driver. Responsible for the setup is the following properties file, located in the weka.experiment package: DatabaseUtils.props You can get this properties file from the weka.jar or weka-src.jar jar-archive, both part of a normal Weka release. 
If you open up one of those files, you'll find the properties file in the sub-folder weka/experiment . Weka comes with example files for a wide range of databases: DatabaseUtils.props.hsql - HSQLDB DatabaseUtils.props.msaccess - MS Access (see the Windows Databases article for more information) DatabaseUtils.props.mssqlserver - MS SQL Server 2000 DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 DatabaseUtils.props.mysql - MySQL DatabaseUtils.props.odbc - ODBC access via Sun's ODBC/JDBC bridge, e.g., for MS Sql Server (see the Windows Databases article for more information) DatabaseUtils.props.oracle - Oracle 10g DatabaseUtils.props.postgresql - PostgreSQL 7.4 DatabaseUtils.props.sqlite3 - sqlite 3.x The easiest way is just to place the extracted properties file into your HOME directory. For more information on how property files are processed, check out this article. Note: Weka only looks for the DatabaseUtils.props file. If you take one of the example files listed above, you need to rename it first. Setup # Under normal circumstances you only have to edit the following two properties: jdbcDriver jdbcURL Driver # jdbcDriver is the classname of the JDBC driver, necessary to connect to your database, e.g.: HSQLDB - org.hsqldb.jdbcDriver MS SQL Server 2000 (Desktop Edition) - com.microsoft.jdbc.sqlserver.SQLServerDriver MS SQL Server 2005 - com.microsoft.sqlserver.jdbc.SQLServerDriver MySQL - org.gjt.mm.mysql.Driver (or com.mysql.jdbc.Driver ) ODBC - part of Sun's JDKs/JREs, no external driver necessary - sun.jdbc.odbc.JdbcOdbcDriver Oracle - oracle.jdbc.driver.OracleDriver PostgreSQL - org.postgresql.Driver sqlite 3.x - org.sqlite.JDBC URL # jdbcURL specifies the JDBC URL pointing to your database (can be still changed in the Experimenter/Explorer), e.g. 
for the database MyDatabase on the server server.my.domain : HSQLDB - jdbc:hsqldb:hsql://server.my.domain/MyDatabase MS SQL Server 2000 (Desktop Edition) - jdbc:microsoft:sqlserver://server.my.domain:1433 Note: if you add ;databasename=*db-name* you can connect to a different database than the default one, e.g., MyDatabase MS SQL Server 2005 - jdbc:sqlserver://server.my.domain:1433 MySQL - jdbc:mysql://server.my.domain:3306/MyDatabase ODBC - jdbc:odbc:DSN_name (replace DSN_name with the DSN that you want to use) Oracle (thin driver) - jdbc:oracle:thin:@server.my.domain:1526:orcl Note: @machineName:port:SID for the Express Edition you can use: jdbc:oracle:thin:@server.my.domain:1521:XE PostgreSQL - jdbc:postgresql://server.my.domain:5432/MyDatabase You can also specify user and password directly in the URL: jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...> where you have to replace the <...> with the correct values sqlite 3.x - jdbc:sqlite:/path/to/database.db (you can access only local files) Missing Datatypes # Sometimes (e.g. with MySQL) it can happen that a column type cannot be interpreted. In that case it is necessary to map the name of the column type to the Java type it should be interpreted as. E.g. the MySQL type TEXT is returned as BLOB from the JDBC driver and has to be mapped to String ( 0 represents String - the mappings can be found in the comments of the properties file): BLOB=0 The article weka/experiment/DatabaseUtils.props contains more details on this topic. Stored Procedures # Let's say you're tired of typing the same query over and over again. A good way to shorten that is to create a stored procedure. PostgreSQL 7.4.x # The following example creates a procedure called employee_name that returns the names of all the employees in table employee . Although it doesn't make much sense to create a stored procedure for this query, it shows how to create and call stored procedures in PostgreSQL. 
Create CREATE OR REPLACE FUNCTION public.employee_name() RETURNS SETOF text AS 'select name from employee' LANGUAGE 'sql' VOLATILE; SQL statement to call procedure SELECT * FROM employee_name() Retrieve data via InstanceQuery java weka.experiment.InstanceQuery -Q \"SELECT * FROM employee_name()\" -U -P Troubleshooting # In case you're experiencing problems connecting to your database, check out the mailing list . It is possible that somebody else encountered the same problem as you and you'll find a post containing the solution to your problem. Specific MS SQL Server 2000 Troubleshooting MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the server or port number specified is incorrect. Verify that SQL Server is listening with TCP/IP on the specified server and port. This might be reported with an exception similar to: \"The login has failed. The TCP/IP connection to the host has failed.\" This indicates one of the following: SQL Server is installed but TCP/IP has not been installed as a network protocol for SQL Server by using the SQL Server Network Utility for SQL Server 2000, or the SQL Server Configuration Manager for SQL Server 2005 TCP/IP is installed as a SQL Server protocol, but it is not listening on the port specified in the JDBC connection URL. The default port is 1433. The port that is used by the server has not been opened in the firewall The Added driver: ... output on the commandline does not mean that the actual class was found, but only that Weka will attempt to load the class later on in order to establish a database connection. The error message No suitable driver can be caused by the following: The JDBC driver you are attempting to load is not in the CLASSPATH (Note: using -jar in the java commandline overwrites the CLASSPATH environment variable!). Open the SimpleCLI, run the command java weka.core.SystemInfo and check whether the property java.class.path lists your database jar. 
If not, correct your CLASSPATH or the Java call you start Weka with. The JDBC driver class is misspelled in the jdbcDriver property or you have multiple entries of jdbcDriver ( properties files need unique keys!) The jdbcURL property has a spelling error and tries to use a non-existing protocol or you listed it multiple times, which doesn't work either (remember, properties files need unique keys!) See also # weka/experiment/DatabaseUtils.props properties file CLASSPATH Links # HSQLDB homepage IBM Cloudscape homepage Microsoft SQL Server SQL Server 2000 (Desktop Engine) SQL Server 2000 JDBC Driver SP 3 SQL Server 2005 JDBC Driver MySQL homepage JDBC driver Oracle homepage JDBC driver JDBC FAQ PostgreSQL homepage JDBC driver sqlite homepage JDBC driver Weka Mailing list","title":"CLASSPATH"},{"location":"databases/#classpath","text":"See the CLASSPATH article for how to set up your CLASSPATH environment variable, in order to make the JDBC driver available for Weka.","title":"CLASSPATH"},{"location":"databases/#configuration-files","text":"Thanks to JDBC it is easy to connect to Databases that provide a JDBC driver. Responsible for the setup is the following properties file, located in the weka.experiment package: DatabaseUtils.props You can get this properties file from the weka.jar or weka-src.jar jar-archive, both part of a normal Weka release. If you open up one of those files, you'll find the properties file in the sub-folder weka/experiment . 
Weka comes with example files for a wide range of databases: DatabaseUtils.props.hsql - HSQLDB DatabaseUtils.props.msaccess - MS Access (see the Windows Databases article for more information) DatabaseUtils.props.mssqlserver - MS SQL Server 2000 DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 DatabaseUtils.props.mysql - MySQL DatabaseUtils.props.odbc - ODBC access via Sun's ODBC/JDBC bridge, e.g., for MS Sql Server (see the Windows Databases article for more information) DatabaseUtils.props.oracle - Oracle 10g DatabaseUtils.props.postgresql - PostgreSQL 7.4 DatabaseUtils.props.sqlite3 - sqlite 3.x The easiest way is just to place the extracted properties file into your HOME directory. For more information on how property files are processed, check out this article. Note: Weka only looks for the DatabaseUtils.props file. If you take one of the example files listed above, you need to rename it first.","title":"Configuration files"},{"location":"databases/#setup","text":"Under normal circumstances you only have to edit the following two properties: jdbcDriver jdbcURL","title":"Setup"},{"location":"databases/#driver","text":"jdbcDriver is the classname of the JDBC driver, necessary to connect to your database, e.g.: HSQLDB - org.hsqldb.jdbcDriver MS SQL Server 2000 (Desktop Edition) - com.microsoft.jdbc.sqlserver.SQLServerDriver MS SQL Server 2005 - com.microsoft.sqlserver.jdbc.SQLServerDriver MySQL - org.gjt.mm.mysql.Driver (or com.mysql.jdbc.Driver ) ODBC - part of Sun's JDKs/JREs, no external driver necessary - sun.jdbc.odbc.JdbcOdbcDriver Oracle - oracle.jdbc.driver.OracleDriver PostgreSQL - org.postgresql.Driver sqlite 3.x - org.sqlite.JDBC","title":"Driver"},{"location":"databases/#url","text":"jdbcURL specifies the JDBC URL pointing to your database (can be still changed in the Experimenter/Explorer), e.g. 
for the database MyDatabase on the server server.my.domain : HSQLDB - jdbc:hsqldb:hsql://server.my.domain/MyDatabase MS SQL Server 2000 (Desktop Edition) - jdbc:microsoft:sqlserver://server.my.domain:1433 Note: if you add ;databasename=*db-name* you can connect to a different database than the default one, e.g., MyDatabase MS SQL Server 2005 - jdbc:sqlserver://server.my.domain:1433 MySQL - jdbc:mysql://server.my.domain:3306/MyDatabase ODBC - jdbc:odbc:DSN_name (replace DSN_name with the DSN that you want to use) Oracle (thin driver) - jdbc:oracle:thin:@server.my.domain:1526:orcl Note: @machineName:port:SID for the Express Edition you can use: jdbc:oracle:thin:@server.my.domain:1521:XE PostgreSQL - jdbc:postgresql://server.my.domain:5432/MyDatabase You can also specify user and password directly in the URL: jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...> where you have to replace the <...> with the correct values sqlite 3.x - jdbc:sqlite:/path/to/database.db (you can access only local files)","title":"URL"},{"location":"databases/#missing-datatypes","text":"Sometimes (e.g. with MySQL) it can happen that a column type cannot be interpreted. In that case it is necessary to map the name of the column type to the Java type it should be interpreted as. E.g. the MySQL type TEXT is returned as BLOB from the JDBC driver and has to be mapped to String ( 0 represents String - the mappings can be found in the comments of the properties file): BLOB=0 The article weka/experiment/DatabaseUtils.props contains more details on this topic.","title":"Missing Datatypes"},{"location":"databases/#stored-procedures","text":"Let's say you're tired of typing the same query over and over again. A good way to shorten that is to create a stored procedure.","title":"Stored Procedures"},{"location":"databases/#postgresql-74x","text":"The following example creates a procedure called employee_name that returns the names of all the employees in table employee . 
Although it doesn't make much sense to create a stored procedure for this query, it shows how to create and call stored procedures in PostgreSQL. Create CREATE OR REPLACE FUNCTION public.employee_name() RETURNS SETOF text AS 'select name from employee' LANGUAGE 'sql' VOLATILE; SQL statement to call procedure SELECT * FROM employee_name() Retrieve data via InstanceQuery java weka.experiment.InstanceQuery -Q \"SELECT * FROM employee_name()\" -U -P ","title":"PostgreSQL 7.4.x"},{"location":"databases/#troubleshooting","text":"In case you're experiencing problems connecting to your database, check out the mailing list . It is possible that somebody else encountered the same problem as you and you'll find a post containing the solution to your problem. Specific MS SQL Server 2000 Troubleshooting MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the server or port number specified is incorrect. Verify that SQL Server is listening with TCP/IP on the specified server and port. This might be reported with an exception similar to: \"The login has failed. The TCP/IP connection to the host has failed.\" This indicates one of the following: SQL Server is installed but TCP/IP has not been installed as a network protocol for SQL Server by using the SQL Server Network Utility for SQL Server 2000, or the SQL Server Configuration Manager for SQL Server 2005 TCP/IP is installed as a SQL Server protocol, but it is not listening on the port specified in the JDBC connection URL. The default port is 1433. The port that is used by the server has not been opened in the firewall The Added driver: ... output on the commandline does not mean that the actual class was found, but only that Weka will attempt to load the class later on in order to establish a database connection. 
The error message No suitable driver can be caused by the following: The JDBC driver you are attempting to load is not in the CLASSPATH (Note: using -jar in the java commandline overwrites the CLASSPATH environment variable!). Open the SimpleCLI, run the command java weka.core.SystemInfo and check whether the property java.class.path lists your database jar. If not, correct your CLASSPATH or the Java call you start Weka with. The JDBC driver class is misspelled in the jdbcDriver property or you have multiple entries of jdbcDriver ( properties files need unique keys!) The jdbcURL property has a spelling error and tries to use a non-existing protocol or you listed it multiple times, which doesn't work either (remember, properties files need unique keys!)","title":"Troubleshooting"},{"location":"databases/#see-also","text":"weka/experiment/DatabaseUtils.props properties file CLASSPATH","title":"See also"},{"location":"databases/#links","text":"HSQLDB homepage IBM Cloudscape homepage Microsoft SQL Server SQL Server 2000 (Desktop Engine) SQL Server 2000 JDBC Driver SP 3 SQL Server 2005 JDBC Driver MySQL homepage JDBC driver Oracle homepage JDBC driver JDBC FAQ PostgreSQL homepage JDBC driver sqlite homepage JDBC driver Weka Mailing list","title":"Links"},{"location":"datasets/","text":"Some example datasets for analysis with Weka are included in the Weka distribution and can be found in the data folder of the installed software. Miscellaneous collections of datasets # A jarfile containing 37 classification problems originally obtained from the UCI repository of machine learning datasets ( datasets-UCI.jar , 1,190,961 Bytes). A jarfile containing 37 regression problems obtained from various sources ( datasets-numeric.jar , 169,344 Bytes). A jarfile containing 6 agricultural datasets obtained from agricultural researchers in New Zealand ( agridatasets.jar , 31,200 Bytes). 
A jarfile containing 30 regression datasets collected by Professor Luis Torgo ( regression-datasets.jar , 10,090,266 Bytes). A gzip'ed tar containing UCI ML and UCI KDD datasets ( uci-20070111.tar.gz , 17,952,832 Bytes) A gzip'ed tar containing StatLib datasets ( statlib-20050214.tar.gz , 12,785,582 Bytes) A gzip'ed tar containing ordinal, real-world datasets donated by Professor Arie Ben David ( datasets-arie_ben_david.tar.gz , 11,348 Bytes) A zip file containing 19 multi-class (1-of-n) text datasets donated by Dr George Forman ( 19MclassTextWc.zip , 14,084,828 Bytes) A bzip'ed tar file containing the Reuters21578 dataset split into separate files according to the ModApte split ( reuters21578-ModApte.tar.bz2 , 81,745,032 Bytes) A zip file containing 41 drug design datasets formed using the Adriana.Code software donated by Dr Mehmet Fatih Amasyali ( Drug-datasets.zip , 11,376,153 Bytes) A zip file containing 80 artificial datasets generated from the Friedman function donated by Dr. M. Fatih Amasyali (Yildiz Technical University) ( Friedman-datasets.zip , 5,802,204 Bytes) A zip file containing a new, image-based version of the classic iris data, with 50 images for each of the three species of iris. The images have size 600x600. Please see the ARFF file for further information ( iris_reloaded.zip , 92,267,000 Bytes). After expanding into a directory using your jar utility (or an archive program that handles tar-archives/zip files in case of the gzip'ed tars/zip files), these datasets may be used with Weka. Bioinformatics datasets # Some bioinformatics datasets in Weka's ARFF format. These are quite old but still available thanks to the Internet Archive. Protein datasets made available by Associate Professor Shuiwang Ji when he was a PhD student at Louisiana State University . Kent Ridge Biomedical Data Set Repository , which was put together by Professor Jinyan Li and Dr Huiqing Liu while they were at the Institute for Infocomm Research, Singapore . 
Repository for Epitope Datasets (RED) , maintained by Professor Yasser El-Manzalawy when he was at Iowa State University .","title":"Datasets"},{"location":"datasets/#miscellaneous-collections-of-datasets","text":"A jarfile containing 37 classification problems originally obtained from the UCI repository of machine learning datasets ( datasets-UCI.jar , 1,190,961 Bytes). A jarfile containing 37 regression problems obtained from various sources ( datasets-numeric.jar , 169,344 Bytes). A jarfile containing 6 agricultural datasets obtained from agricultural researchers in New Zealand ( agridatasets.jar , 31,200 Bytes). A jarfile containing 30 regression datasets collected by Professor Luis Torgo ( regression-datasets.jar , 10,090,266 Bytes). A gzip'ed tar containing UCI ML and UCI KDD datasets ( uci-20070111.tar.gz , 17,952,832 Bytes) A gzip'ed tar containing StatLib datasets ( statlib-20050214.tar.gz , 12,785,582 Bytes) A gzip'ed tar containing ordinal, real-world datasets donated by Professor Arie Ben David ( datasets-arie_ben_david.tar.gz , 11,348 Bytes) A zip file containing 19 multi-class (1-of-n) text datasets donated by Dr George Forman ( 19MclassTextWc.zip , 14,084,828 Bytes) A bzip'ed tar file containing the Reuters21578 dataset split into separate files according to the ModApte split ( reuters21578-ModApte.tar.bz2 , 81,745,032 Bytes) A zip file containing 41 drug design datasets formed using the Adriana.Code software donated by Dr Mehmet Fatih Amasyali ( Drug-datasets.zip , 11,376,153 Bytes) A zip file containing 80 artificial datasets generated from the Friedman function donated by Dr. M. Fatih Amasyali (Yildiz Technical University) ( Friedman-datasets.zip , 5,802,204 Bytes) A zip file containing a new, image-based version of the classic iris data, with 50 images for each of the three species of iris. The images have size 600x600. Please see the ARFF file for further information ( iris_reloaded.zip , 92,267,000 Bytes). 
After expanding into a directory using your jar utility (or an archive program that handles tar-archives/zip files in case of the gzip'ed tars/zip files), these datasets may be used with Weka.","title":"Miscellaneous collections of datasets"},{"location":"datasets/#bioinformatics-datasets","text":"Some bioinformatics datasets in Weka's ARFF format. These are quite old but still available thanks to the Internet Archive. Protein datasets made available by Associate Professor Shuiwang Ji when he was a PhD student at Louisiana State University . Kent Ridge Biomedical Data Set Repository , which was put together by Professor Jinyan Li and Dr Huiqing Liu while they were at the Institute for Infocomm Research, Singapore . Repository for Epitope Datasets (RED) , maintained by Professor Yasser El-Manzalawy when he was at Iowa State University .","title":"Bioinformatics datasets"},{"location":"development/","text":"We are following the Linux model of releases, where an even second digit of a release number indicates a \"stable\" release and an odd second digit indicates a \"development\" release (e.g., 3.0.x is a stable release and 3.1.x is a developmental release). If you are using a developmental release, there may be new features, but it is entirely possible that these features will be transient and/or unstable, and backward compatibility of the API and/or models is not guaranteed. If you require stability for teaching or deployment in applications, it is best to use a stable release of Weka. Source code repository # Weka's source code for a particular release is included in the distribution when you download it, in a .jar file (a form of .zip file) called weka-src.jar . However, it is also possible to read source code directly from the git source code repository for Weka. 
Code credits # The Weka developers would like to thank The MathWorks and the National Institute of Standards and Technology (NIST) for developing the Jama Matrix package and releasing it to the public domain, and to CERN (European Organization for Nuclear Research) for statistics-related code from their Jet libraries (now part of COLT ). The core Weka distributions include third-party library code from the MTJ project for fast matrix algebra in Java, the Java CUP project for generating parsers, the authentication dialog from the Bounce project , and the Apache Commons Compress library. For more information, see the lib folder of the source code repository. Weka, including the early non-Java predecessors of Weka 3, was developed at the Department of Computer Science of the University of Waikato in Hamilton , New Zealand . Most of Weka 3 was written by Eibe Frank, Mark Hall, Peter Reutemann, and Len Trigg, but many others have made significant contributions, in particular, Remco Bouckaert, Richard Kirkby, Ashraf Kibriya, Xin Xu, and Malcolm Ware. For complete info on the contributors, check the Javadoc extracted from the source code of Weka, which is part of the available documentation . Weka's package manager provides access to a large collection of optional libraries, many of which have been contributed by developers from other institutions. For information on the authors of these packages and the third-party libraries used within those Weka packages, please consult the Javadoc for the relevant package and the corresponding package lib folder.","title":"Development"},{"location":"development/#source-code-repository","text":"Weka's source code for a particular release is included in the distribution when you download it, in a .jar file (a form of .zip file) called weka-src.jar . 
However, it is also possible to read source code directly from the git source code repository for Weka.","title":"Source code repository"},{"location":"development/#code-credits","text":"The Weka developers would like to thank The MathWorks and the National Institute of Standards and Technology (NIST) for developing the Jama Matrix package and releasing it to the public domain, and to CERN (European Organization for Nuclear Research) for statistics-related code from their Jet libraries (now part of COLT ). The core Weka distributions include third-party library code from the MTJ project for fast matrix algebra in Java, the Java CUP project for generating parsers, the authentication dialog from the Bounce project , and the Apache Commons Compress library. For more information, see the lib folder of the source code repository. Weka, including the early non-Java predecessors of Weka 3, was developed at the Department of Computer Science of the University of Waikato in Hamilton , New Zealand . Most of Weka 3 was written by Eibe Frank, Mark Hall, Peter Reutemann, and Len Trigg, but many others have made significant contributions, in particular, Remco Bouckaert, Richard Kirkby, Ashraf Kibriya, Xin Xu, and Malcolm Ware. For complete info on the contributors, check the Javadoc extracted from the source code of Weka, which is part of the available documentation . Weka's package manager provides access to a large collection of optional libraries, many of which have been contributed by developers from other institutions. For information on the authors of these packages and the third-party libraries used within those Weka packages, please consult the Javadoc for the relevant package and the corresponding package lib folder.","title":"Code credits"},{"location":"discretizing_datasets/","text":"Once in a while one has numeric data but wants to use a classifier that handles only nominal values. 
In that case one needs to discretize the data, which can be done with the following filters: weka.filters.supervised.attribute.Discretize uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion weka.filters.unsupervised.attribute.Discretize uses simple binning But, since discretization depends on the data which is presented to the discretization algorithm, one can easily end up with incompatible train and test files. The following shows how to generate compatible discretized files out of a training and a test file by using the supervised version of the filter. The class takes four files as arguments: input training file input test file output training file output test file import java.io.* ; import weka.core.* ; import weka.filters.Filter ; import weka.filters.supervised.attribute.Discretize ; /** * Shows how to generate compatible train/test sets using the Discretize * filter. * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class DiscretizeTest { /** * loads the given ARFF file and sets the class attribute as the last * attribute. * * @param filename the file to load * @throws Exception if something goes wrong */ protected static Instances load ( String filename ) throws Exception { Instances result ; BufferedReader reader ; reader = new BufferedReader ( new FileReader ( filename )); result = new Instances ( reader ); result . setClassIndex ( result . numAttributes () - 1 ); reader . close (); return result ; } /** * saves the data to the specified file * * @param data the data to save to a file * @param filename the file to save the data to * @throws Exception if something goes wrong */ protected static void save ( Instances data , String filename ) throws Exception { BufferedWriter writer ; writer = new BufferedWriter ( new FileWriter ( filename )); writer . write ( data . toString ()); writer . newLine (); writer . flush (); writer . close (); } /** * Takes four arguments: *
1. input train file * 2. input test file * 3. output train file * 4. output test file
* * @param args the commandline arguments * @throws Exception if something goes wrong */ public static void main ( String [] args ) throws Exception { Instances inputTrain ; Instances inputTest ; Instances outputTrain ; Instances outputTest ; Discretize filter ; /* load data (class attribute is assumed to be last attribute) */ inputTrain = load ( args [ 0 ] ); inputTest = load ( args [ 1 ] ); /* setup filter */ filter = new Discretize (); filter . setInputFormat ( inputTrain ); /* apply filter */ outputTrain = Filter . useFilter ( inputTrain , filter ); outputTest = Filter . useFilter ( inputTest , filter ); /* save output */ save ( outputTrain , args [ 2 ] ); save ( outputTest , args [ 3 ] ); } } The same can be achieved from the commandline with this command ( batch filtering ): java weka.filters.supervised.attribute.Discretize -b -i <in-train> -o -r <in-test> -s -c See also # Manual discretization (Using the MathExpression filter) Batch filtering Downloads # DiscretizeTest.java Links # Javadoc Discretize (supervised) Discretize (unsupervised)","title":"Discretizing datasets"},{"location":"discretizing_datasets/#see-also","text":"Manual discretization (Using the MathExpression filter) Batch filtering","title":"See also"},{"location":"discretizing_datasets/#downloads","text":"DiscretizeTest.java","title":"Downloads"},{"location":"discretizing_datasets/#links","text":"Javadoc Discretize (supervised) Discretize (unsupervised)","title":"Links"},{"location":"document_classification/","text":"See Text categorization with Weka","title":"Document classification"},{"location":"documentation/","text":"This wiki is not the only source of information on the Weka software. Weka comes with built-in help and includes a comprehensive manual. 
For an introduction to the machine learning techniques implemented in Weka, and the software itself, consider taking a look at the book Data Mining: Practical Machine Learning Tools and Techniques and its freely available online appendix on the Weka workbench , providing an overview of the software. Closely linked to the book, there are also free online courses on data mining with the machine learning techniques in Weka. A list of sources with information on Weka is provided below. General documentation # The online appendix The Weka Workbench , distributed as a free PDF, for the fourth edition of the book Data Mining: Practical Machine Learning Tools and Techniques . The manual for Weka 3.8 and the manual for Weka 3.9 , as included in the distribution of the software when you download it. The Javadoc for Weka 3.8 and the Javadoc for Weka 3.9 , extracted directly from the source code, providing information on the API and parameters for command-line usage of Weka. The videos and slides for the online courses on Data Mining with Weka , More Data Mining with Weka , and Advanced Data Mining with Weka . Weka packages # There is a list of packages for Weka that can be installed using the built-in package manager. Javadoc for a package is available at https://weka.sourceforge.io/doc.packages/ followed by the name of the package. Mailing list archive # The Weka mailing list is a very helpful source of information, spanning more than 15 years of questions and answers on Weka. Blogs # There is the official Weka blog that has Weka-related news items and the occasional article of interest to Weka users. There is also Mark Hall's blog with a lot of useful information on several important Weka packages in particular. Other sources of information # Weka can be used from several other software systems for data science, and there is a set of slides on WEKA in the Ecosystem for Scientific Computing covering Octave/Matlab, R, Python, and Hadoop. 
A page with news and documentation on Weka's support for importing PMML models . A short tutorial on connecting Weka to MongoDB using a JDBC driver .","title":"Documentation"},{"location":"documentation/#general-documentation","text":"The online appendix The Weka Workbench , distributed as a free PDF, for the fourth edition of the book Data Mining: Practical Machine Learning Tools and Techniques . The manual for Weka 3.8 and the manual for Weka 3.9 , as included in the distribution of the software when you download it. The Javadoc for Weka 3.8 and the Javadoc for Weka 3.9 , extracted directly from the source code, providing information on the API and parameters for command-line usage of Weka. The videos and slides for the online courses on Data Mining with Weka , More Data Mining with Weka , and Advanced Data Mining with Weka .","title":"General documentation"},{"location":"documentation/#weka-packages","text":"There is a list of packages for Weka that can be installed using the built-in package manager. Javadoc for a package is available at https://weka.sourceforge.io/doc.packages/ followed by the name of the package.","title":"Weka packages"},{"location":"documentation/#mailing-list-archive","text":"The Weka mailing list is a very helpful source of information, spanning more than 15 years of questions and answers on Weka.","title":"Mailing list archive"},{"location":"documentation/#blogs","text":"There is the official Weka blog that has Weka-related news items and the occasional article of interest to Weka users. There is also Mark Hall's blog with a lot of useful information on several important Weka packages in particular.","title":"Blogs"},{"location":"documentation/#other-sources-of-information","text":"Weka can be used from several other software systems for data science, and there is a set of slides on WEKA in the Ecosystem for Scientific Computing covering Octave/Matlab, R, Python, and Hadoop. 
A page with news and documentation on Weka's support for importing PMML models . A short tutorial on connecting Weka to MongoDB using a JDBC driver .","title":"Other sources of information"},{"location":"downloading_weka/","text":"There are two versions of Weka: Weka 3.8 is the latest stable version and Weka 3.9 is the development version. New releases of these two versions are normally made once or twice a year. The stable version receives only bug fixes and feature upgrades that do not break compatibility with its earlier releases, while the development version may receive new features that break compatibility with its earlier releases. Weka 3.8 and 3.9 feature a package management system that makes it easy for the Weka community to add new functionality to Weka. The package management system requires an internet connection in order to download and install packages. Stable version # Weka 3.8 is the latest stable version of Weka. This branch of Weka only receives bug fixes and upgrades that do not break compatibility with earlier 3.8 releases, although major new features may become available in packages. There are different options for downloading and installing it on your system: Windows # Click here to download a self-extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-8-6-azul-zulu-windows.exe; 133.2 MB) This executable will install Weka in your Program Menu. Launching via the Program Menu or shortcuts will automatically use the included JVM to run Weka. Mac OS - Intel processors # Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for Intel Macs. (weka-3-8-6-azul-zulu-osx.dmg; 180.2 MB) Mac OS - ARM processors # Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for ARM Macs. 
(weka-3-8-6-azul-zulu-arm-osx.dmg; 166.3 MB) Linux # Click here to download a zip archive for Linux that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-8-6-azul-zulu-linux.zip; 146.9 MB) First unzip the zip file. This will create a new directory called weka-3-8-6. To run Weka, change into that directory and type ./weka.sh Other platforms # Click here to download a zip archive containing Weka (weka-3-8-6.zip; 59.6 MB) First unzip the zip file. This will create a new directory called weka-3-8-6. To run Weka, change into that directory and type java -jar weka.jar Note that Java needs to be installed on your system for this to work. Also note that using -jar will override your current CLASSPATH variable and only use the weka.jar . Developer version # This is the main development trunk of Weka and continues from the stable Weka 3.8 code line. It may receive new features that break backwards compatibility. Windows # Click here to download a self-extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-9-6-azul-zulu-windows.exe; 133.0 MB) This executable will install Weka in your Program Menu. Launching via the Program Menu or shortcuts will automatically use the included JVM to run Weka. Mac OS - Intel processors # Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for Intel Macs. (weka-3-9-6-azul-zulu-osx.dmg; 180.0 MB) Mac OS - ARM processors # Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for ARM Macs. (weka-3-9-6-azul-zulu-arm-osx.dmg; 166.3 MB) Linux # Click here to download a zip archive for Linux that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-9-6-azul-zulu-linux.zip; 146.7 MB) First unzip the zip file. This will create a new directory called weka-3-9-6. 
To run Weka, change into that directory and type ./weka.sh Other platforms # Click here to download a zip archive containing Weka (weka-3-9-6.zip; 59.4 MB) First unzip the zip file. This will create a new directory called weka-3-9-6. To run Weka, change into that directory and type java -jar weka.jar Note that Java needs to be installed on your system for this to work. Also note that using -jar will override your current CLASSPATH variable and only use the weka.jar . Old versions # All old versions of Weka are available from the Sourceforge website . Upgrading from Weka 3.7 # In case you are upgrading an existing Weka 3.7 installation, if the Weka 3.8 package manager does not start up, please delete the file installedPackageCache.ser in the packages folder that resides in the wekafiles folder in your user home. Also, serialized Weka models created in 3.7 are incompatible with 3.8. The model migrator tool can migrate some models to 3.8 (a known exception is RandomForest). Usage is as follows: java -cp : weka.core.ModelMigrator -i -o ","title":"Downloading and installing Weka"},{"location":"downloading_weka/#stable-version","text":"Weka 3.8 is the latest stable version of Weka. This branch of Weka only receives bug fixes and upgrades that do not break compatibility with earlier 3.8 releases, although major new features may become available in packages. There are different options for downloading and installing it on your system:","title":"Stable version"},{"location":"downloading_weka/#windows","text":"Click here to download a self-extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-8-6-azul-zulu-windows.exe; 133.2 MB) This executable will install Weka in your Program Menu. 
Launching via the Program Menu or shortcuts will automatically use the included JVM to run Weka.","title":"Windows"},{"location":"downloading_weka/#mac-os-intel-processors","text":"Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for Intel Macs. (weka-3-8-6-azul-zulu-osx.dmg; 180.2 MB)","title":"Mac OS - Intel processors"},{"location":"downloading_weka/#mac-os-arm-processors","text":"Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for ARM Macs. (weka-3-8-6-azul-zulu-arm-osx.dmg; 166.3 MB)","title":"Mac OS - ARM processors"},{"location":"downloading_weka/#linux","text":"Click here to download a zip archive for Linux that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-8-6-azul-zulu-linux.zip; 146.9 MB) First unzip the zip file. This will create a new directory called weka-3-8-6. To run Weka, change into that directory and type ./weka.sh","title":"Linux"},{"location":"downloading_weka/#other-platforms","text":"Click here to download a zip archive containing Weka (weka-3-8-6.zip; 59.6 MB) First unzip the zip file. This will create a new directory called weka-3-8-6. To run Weka, change into that directory and type java -jar weka.jar Note that Java needs to be installed on your system for this to work. Also note that using -jar will override your current CLASSPATH variable and only use the weka.jar .","title":"Other platforms"},{"location":"downloading_weka/#developer-version","text":"This is the main development trunk of Weka and continues from the stable Weka 3.8 code line. 
It may receive new features that break backwards compatibility.","title":"Developer version"},{"location":"downloading_weka/#windows_1","text":"Click here to download a self-extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-9-6-azul-zulu-windows.exe; 133.0 MB) This executable will install Weka in your Program Menu. Launching via the Program Menu or shortcuts will automatically use the included JVM to run Weka.","title":"Windows"},{"location":"downloading_weka/#mac-os-intel-processors_1","text":"Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for Intel Macs. (weka-3-9-6-azul-zulu-osx.dmg; 180.0 MB)","title":"Mac OS - Intel processors"},{"location":"downloading_weka/#mac-os-arm-processors_1","text":"Click here to download a disk image for Mac OS that contains a Mac application including Azul's 64-bit OpenJDK Java VM 17 for ARM Macs. (weka-3-9-6-azul-zulu-arm-osx.dmg; 166.3 MB)","title":"Mac OS - ARM processors"},{"location":"downloading_weka/#linux_1","text":"Click here to download a zip archive for Linux that includes Azul's 64-bit OpenJDK Java VM 17 (weka-3-9-6-azul-zulu-linux.zip; 146.7 MB) First unzip the zip file. This will create a new directory called weka-3-9-6. To run Weka, change into that directory and type ./weka.sh","title":"Linux"},{"location":"downloading_weka/#other-platforms_1","text":"Click here to download a zip archive containing Weka (weka-3-9-6.zip; 59.4 MB) First unzip the zip file. This will create a new directory called weka-3-9-6. To run Weka, change into that directory and type java -jar weka.jar Note that Java needs to be installed on your system for this to work. 
Also note that using -jar will override your current CLASSPATH variable and only use the weka.jar .","title":"Other platforms"},{"location":"downloading_weka/#old-versions","text":"All old versions of Weka are available from the Sourceforge website .","title":"Old versions"},{"location":"downloading_weka/#upgrading-from-weka-37","text":"In case you are upgrading an existing Weka 3.7 installation, if the Weka 3.8 package manager does not start up, please delete the file installedPackageCache.ser in the packages folder that resides in the wekafiles folder in your user home. Also, serialized Weka models created in 3.7 are incompatible with 3.8. The model migrator tool can migrate some models to 3.8 (a known exception is RandomForest). Usage is as follows: java -cp : weka.core.ModelMigrator -i -o ","title":"Upgrading from Weka 3.7"},{"location":"ensemble_selection/","text":"Notes # This bug has now been fixed. (12/2014) There is a bug in the code to build a library -- trying to build any model specification with three layers (e.g., Bagging a REPTree) causes the form to freeze up and/or crash. The documentation on how to run from the command line is outdated. Some corrections: The \"-D\" option no longer exists. 
The command shown for training a library from the command line: java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -A library -X 5 -S 1 -O -t yourTrainingInstances.arff fails for me with an exception that \"Folds 1 and 5 are not equal.\" A command line that works is to set the folds to 1: java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -A library -X 1 -S 1 -O -t yourTrainingInstances.arff Links # Ensemble_selection.pdf - Documentation on how to use Ensemble Selection in Weka Ensemble Selection from Libraries of Models, ICML'04","title":"Notes"},{"location":"ensemble_selection/#notes","text":"This bug has now been fixed. (12/2014) There is a bug in the code to build a library -- trying to build any model specification with three layers (e.g., Bagging a REPTree) causes the form to freeze up and/or crash. The documentation on how to run from the command line is outdated. Some corrections: The \"-D\" option no longer exists. 
The command shown for training a library from the command line: java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -A library -X 5 -S 1 -O -t yourTrainingInstances.arff fails for me with an exception that \"Folds 1 and 5 are not equal.\" A command line that works is to set the folds to 1: java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -A library -X 1 -S 1 -O -t yourTrainingInstances.arff","title":"Notes"},{"location":"ensemble_selection/#links","text":"Ensemble_selection.pdf - Documentation on how to use Ensemble Selection in Weka Ensemble Selection from Libraries of Models, ICML'04","title":"Links"},{"location":"extending_weka/","text":"The following articles describe how you can extend Weka: Writing a new Filter Writing a new Classifier Writing your own Classifier Article","title":"Extending Weka"},{"location":"faq/","text":"General # What are the principal release branches of Weka? Where can I get old versions of WEKA? How do I get the latest bugfixes? Can I check my CLASSPATH from within WEKA? Where is my home directory located? Can I check how much memory is available for WEKA? Can I use WEKA in commercial applications? Basic usage # Can I use CSV files? How do I perform CSV file conversion? How do I divide a dataset into training and test set? How do I generate compatible train and test sets that get processed with a filter? How do I perform attribute selection? How do I perform clustering? Where do I find visualization of classifiers, etc.? How do I perform text classification? How can I perform multi-instance learning in WEKA? How do I perform cost-sensitive classification? How do I make predictions with a trained model? Why am I missing certain nominal or string values from sparse instances? Can I use WEKA for time series analysis? Does WEKA support multi-label classification? 
How do I perform one-class classification? Can I make a screenshot of a plot or graph directly in WEKA? How do I use the package manager? What do I do if the package manager does not start? Advanced usage # How can I track instances in WEKA? How do I use ID attributes? How do I connect to a database? How do I use WEKA from the command line? Can I tune the parameters of a classifier? How do I generate Learning curves? Where can I find information regarding ROC curves? I have unbalanced data - now what? Can I run an experiment using clusterers in the Experimenter? How can I use transactional data in Weka? How can I use Weka with Matlab or Octave? How can I speed up Weka? Can I use GPUs to speed up Weka? Customizing Weka # Can I change the colors (background, axes, etc.) of the plots in WEKA? How do I add a new classifier, filter, kernel, etc Using third-party tools # How do I use libsvm in WEKA? The snowball stemmers don't work, what am I doing wrong? Developing with WEKA # Where can I get WEKA's source code? How do I compile WEKA? What is Git and what do I need to do to access it? How do I use WEKA's classes in my own code? How do I write a new classifier or filter? Can I compile WEKA into native code? Can I use WEKA from C#? Can I use WEKA from Python? Can I use WEKA from Groovy? Serialization is nice, but what about generating actual Java code from WEKA classes? How are packages structured for the package management system? Pluggable evaluation metrics for classification/regression How can I contribute to WEKA? Windows # How do I modify the CLASSPATH? How do I modify the RunWeka.bat file? Can I process UTF-8 datasets or files? How do I run the Windows Weka installer in silent mode? Troubleshooting # I have Weka download problems - what's going wrong? My ARFF file doesn't load - why? What does nominal value not declared in header, read Token[X], line Y mean? How do I get rid of this OutOfMemoryException? How do I deal with a StackOverflowError? 
Why do I get the error message 'training and test set are not compatible'? Couldn't read from database: unknown data type Trying to add JDBC driver: ... - Error, not in CLASSPATH? I cannot process large datasets - any ideas? See Troubleshooting article for more troubleshooting.","title":"FAQ"},{"location":"faq/#general","text":"What are the principal release branches of Weka? Where can I get old versions of WEKA? How do I get the latest bugfixes? Can I check my CLASSPATH from within WEKA? Where is my home directory located? Can I check how much memory is available for WEKA? Can I use WEKA in commercial applications?","title":"General"},{"location":"faq/#basic-usage","text":"Can I use CSV files? How do I perform CSV file conversion? How do I divide a dataset into training and test set? How do I generate compatible train and test sets that get processed with a filter? How do I perform attribute selection? How do I perform clustering? Where do I find visualization of classifiers, etc.? How do I perform text classification? How can I perform multi-instance learning in WEKA? How do I perform cost-sensitive classification? How do I make predictions with a trained model? Why am I missing certain nominal or string values from sparse instances? Can I use WEKA for time series analysis? Does WEKA support multi-label classification? How do I perform one-class classification? Can I make a screenshot of a plot or graph directly in WEKA? How do I use the package manager? What do I do if the package manager does not start?","title":"Basic usage"},{"location":"faq/#advanced-usage","text":"How can I track instances in WEKA? How do I use ID attributes? How do I connect to a database? How do I use WEKA from the command line? Can I tune the parameters of a classifier? How do I generate Learning curves? Where can I find information regarding ROC curves? I have unbalanced data - now what? Can I run an experiment using clusterers in the Experimenter? 
How can I use transactional data in Weka? How can I use Weka with Matlab or Octave? How can I speed up Weka? Can I use GPUs to speed up Weka?","title":"Advanced usage"},{"location":"faq/#customizing-weka","text":"Can I change the colors (background, axes, etc.) of the plots in WEKA? How do I add a new classifier, filter, kernel, etc","title":"Customizing Weka"},{"location":"faq/#using-third-party-tools","text":"How do I use libsvm in WEKA? The snowball stemmers don't work, what am I doing wrong?","title":"Using third-party tools"},{"location":"faq/#developing-with-weka","text":"Where can I get WEKA's source code? How do I compile WEKA? What is Git and what do I need to do to access it? How do I use WEKA's classes in my own code? How do I write a new classifier or filter? Can I compile WEKA into native code? Can I use WEKA from C#? Can I use WEKA from Python? Can I use WEKA from Groovy? Serialization is nice, but what about generating actual Java code from WEKA classes? How are packages structured for the package management system? Pluggable evaluation metrics for classification/regression How can I contribute to WEKA?","title":"Developing with WEKA"},{"location":"faq/#windows","text":"How do I modify the CLASSPATH? How do I modify the RunWeka.bat file? Can I process UTF-8 datasets or files? How do I run the Windows Weka installer in silent mode?","title":"Windows"},{"location":"faq/#troubleshooting","text":"I have Weka download problems - what's going wrong? My ARFF file doesn't load - why? What does nominal value not declared in header, read Token[X], line Y mean? How do I get rid of this OutOfMemoryException? How do I deal with a StackOverflowError? Why do I get the error message 'training and test set are not compatible'? Couldn't read from database: unknown data type Trying to add JDBC driver: ... - Error, not in CLASSPATH? I cannot process large datasets - any ideas? 
See Troubleshooting article for more troubleshooting.","title":"Troubleshooting"},{"location":"feature_extraction_from_images/","text":"ImageJ can be used to extract features from images. ImageJ contains a macro language with which it is easy to extract features and then dump them into an ARFF file. Links # ImageJ homepage","title":"Feature extraction from images"},{"location":"feature_extraction_from_images/#links","text":"ImageJ homepage","title":"Links"},{"location":"filtered_classifier_updateable/","text":"Description # Incremental version of weka.classifiers.meta.FilteredClassifier , which takes only incremental base classifiers (i.e., classifiers implementing weka.classifiers.UpdateableClassifier ). Reference # -none- Package # weka.classifiers.meta Download # Source code: FilteredClassifierUpdateable.java Example class: FilteredUpdateableTest.java Additional Information # -none- Version # Tested with source code from git (= trunk/weka ) as of 10/11/2008.","title":"Description"},{"location":"filtered_classifier_updateable/#description","text":"Incremental version of weka.classifiers.meta.FilteredClassifier , which takes only incremental base classifiers (i.e., classifiers implementing weka.classifiers.UpdateableClassifier ).","title":"Description"},{"location":"filtered_classifier_updateable/#reference","text":"-none-","title":"Reference"},{"location":"filtered_classifier_updateable/#package","text":"weka.classifiers.meta","title":"Package"},{"location":"filtered_classifier_updateable/#download","text":"Source code: FilteredClassifierUpdateable.java Example class: FilteredUpdateableTest.java","title":"Download"},{"location":"filtered_classifier_updateable/#additional-information","text":"-none-","title":"Additional Information"},{"location":"filtered_classifier_updateable/#version","text":"Tested with source code from git (= trunk/weka ) as of 10/11/2008.","title":"Version"},{"location":"generating_and_saving_a_precision_recall_curve/","text":"The following 
Java class evaluates a NaiveBayes classifier using cross-validation with a dataset provided by the user and saves a precision-recall curve for the first class label as a JPEG file, based on a user-specified file name. Source code: import java.awt.* ; import java.io.* ; import java.util.* ; import javax.swing.* ; import weka.core.* ; import weka.classifiers.* ; import weka.classifiers.bayes.NaiveBayes ; import weka.classifiers.evaluation.Evaluation ; import weka.classifiers.evaluation.ThresholdCurve ; import weka.gui.visualize.* ; /** * Generates and saves a precision-recall curve. Uses a cross-validation * with NaiveBayes to make the curve. * * @author FracPete * @author Eibe Frank */ public class SavePrecisionRecallCurve { /** * takes two arguments: dataset in ARFF format (expects class to * be last attribute) and name of file with output */ public static void main ( String [] args ) throws Exception { // load data Instances data = new Instances ( new BufferedReader ( new FileReader ( args [ 0 ] ))); data . setClassIndex ( data . numAttributes () - 1 ); // train classifier Classifier cl = new NaiveBayes (); Evaluation eval = new Evaluation ( data ); eval . crossValidateModel ( cl , data , 10 , new Random ( 1 )); // generate curve ThresholdCurve tc = new ThresholdCurve (); int classIndex = 0 ; Instances result = tc . getCurve ( eval . predictions (), classIndex ); // plot curve ThresholdVisualizePanel vmc = new ThresholdVisualizePanel (); PlotData2D tempd = new PlotData2D ( result ); // specify which points are connected boolean [] cp = new boolean [ result . numInstances () ] ; for ( int n = 1 ; n < cp . length ; n ++ ) cp [ n ] = true ; tempd . setConnectPoints ( cp ); // add plot vmc . addPlot ( tempd ); // We want a precision-recall curve vmc . setXIndex ( result . attribute ( \"Recall\" ). index ()); vmc . setYIndex ( result . attribute ( \"Precision\" ). index ()); // Make window with plot but don't show it JFrame jf = new JFrame (); jf . 
setSize ( 500 , 400 ); jf . getContentPane (). add ( vmc ); jf . pack (); // Save to file specified as second argument (can use any of // BMPWriter, JPEGWriter, PNGWriter, PostscriptWriter for different formats) JComponentWriter jcw = new JPEGWriter ( vmc . getPlotPanel (), new File ( args [ 1 ] )); jcw . toOutput (); System . exit ( 0 ); } } See also # ROC curves Visualizing ROC curve Plotting multiple ROC curves Version # Needs the developer version >=3.5.1 or 3.6.x","title":"Generating and saving a precision recall curve"},{"location":"generating_and_saving_a_precision_recall_curve/#see-also","text":"ROC curves Visualizing ROC curve Plotting multiple ROC curves","title":"See also"},{"location":"generating_and_saving_a_precision_recall_curve/#version","text":"Needs the developer version >=3.5.1 or 3.6.x","title":"Version"},{"location":"generating_classifier_evaluation_output_manually/","text":"The following code snippets explain how to generate the output that Weka produces when a classifier is run from the command line. When referring to the Evaluation class, the weka.classifiers.Evaluation class is meant. This article provides only a quick overview; for more details, please see the Javadoc of the Evaluation class. Model # A classifier's model, if that classifier supports the output of it, can be simply output by using the toString() method after it got trained: Instances data = ... // from somewhere Classifier cls = new weka . classifiers . trees . J48 (); cls . buildClassifier ( data ); System . out . println ( cls ); NB: Weka always outputs the model based on the full training set (provided with the option -t ), no matter whether cross-validation is used or a designated test set (via -T ). The 10 models generated during a 10-fold cross-validation run are never output. If you want to output these models you have to simulate the crossValidateModel method yourself, or use the KnowledgeFlow (see article Displaying results of cross-validation folds ). 
Statistics # The statistics, also called the summary of an evaluation, can be generated via the toSummaryString methods. Here is an example of the summary from a cross-validated J48: Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toSummaryString ()); Detailed class statistics # In order to generate the detailed statistics per class (on the commandline via option -i ), one can use the toClassDetailsString methods. Once again a code snippet featuring a cross-validated J48: Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toClassDetailsString ()); Confusion matrix # The confusion matrix is simply output with the toMatrixString() or toMatrixString(String) method of the Evaluation class. The following example cross-validates J48 on a dataset and outputs the confusion matrix to stdout. Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toMatrixString ()); See also # Use Weka in your Java code - general overview of the Weka API","title":"Generating classifier evaluation output manually"},{"location":"generating_classifier_evaluation_output_manually/#model","text":"A classifier's model, if that classifier supports the output of it, can be simply output by using the toString() method after it got trained: Instances data = ... // from somewhere Classifier cls = new weka . classifiers . trees . J48 (); cls . buildClassifier ( data ); System . out . 
println ( cls ); NB: Weka always outputs the model based on the full training set (provided with the option -t ), no matter whether cross-validation is used or a designated test set (via -T ). The 10 models generated during a 10-fold cross-validation run are never output. If you want to output these models you have to simulate the crossValidateModel method yourself, or use the KnowledgeFlow (see article Displaying results of cross-validation folds ).","title":"Model"},{"location":"generating_classifier_evaluation_output_manually/#statistics","text":"The statistics, also called the summary of an evaluation, can be generated via the toSummaryString methods. Here is an example of the summary from a cross-validated J48: Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toSummaryString ());","title":"Statistics"},{"location":"generating_classifier_evaluation_output_manually/#detailed-class-statistics","text":"In order to generate the detailed statistics per class (on the commandline via option -i ), one can use the toClassDetailsString methods. Once again a code snippet featuring a cross-validated J48: Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toClassDetailsString ());","title":"Detailed class statistics"},{"location":"generating_classifier_evaluation_output_manually/#confusion-matrix","text":"The confusion matrix is simply output with the toMatrixString() or toMatrixString(String) method of the Evaluation class. The following example cross-validates J48 on a dataset and outputs the confusion matrix to stdout. 
Classifier cls = new J48 (); Evaluation eval = new Evaluation ( data ); Random rand = new Random ( 1 ); // using seed = 1 int folds = 10 ; eval . crossValidateModel ( cls , data , folds , rand ); System . out . println ( eval . toMatrixString ());","title":"Confusion matrix"},{"location":"generating_classifier_evaluation_output_manually/#see-also","text":"Use Weka in your Java code - general overview of the Weka API","title":"See also"},{"location":"generating_cv_folds/","text":"You have two choices of generating cross-validation folds: Filter approach - uses a bash script to generate the train/test pairs beforehand Java approach - to be used from within your own Java code, creates train/test pairs on the fly","title":"Generating cv folds"},{"location":"generating_cv_folds_filter/","text":"The filter RemoveFolds (package weka.filters.unsupervised.instance ) can be used to generate the train/test splits used in cross-validation (for stratified folds, use weka.filters.supervised.instance.StratifiedRemoveFolds ). The filter has to be used twice for each train/test split, first to generate the train set and then to obtain the test set. Since this is rather cumbersome by hand, one can also put this into a bash script: #!/bin/bash # # expects the weka.jar as first parameter and the datasets to work on as # second parameter. # # FracPete, 2007-04-10 if [ ! 
$# -eq 2 ] then echo echo \"usage: folds.sh \" echo exit 1 fi JAR = $1 DATASET = $2 FOLDS = 10 FILTER = weka.filters.unsupervised.instance.RemoveFolds SEED = 1 for (( i = 1 ; i <= $FOLDS ; i++ )) do echo \"Generating pair $i / $FOLDS ...\" OUTFILE = ` echo $DATASET | sed s/ \"\\.arff\" //g ` # train set java -cp $JAR $FILTER -V -N $FOLDS -F $i -S $SEED -i $DATASET -o \" $OUTFILE -train- $i -of- $FOLDS .arff\" # test set java -cp $JAR $FILTER -N $FOLDS -F $i -S $SEED -i $DATASET -o \" $OUTFILE -test- $i -of- $FOLDS .arff\" done The script expects two parameters: the weka.jar (or the path to the Weka classes) the dataset to generate the train/test pairs from Example: ./folds.sh /some/where/weka.jar /some/where/else/dataset.arff This example will create the train/test splits for a 10-fold cross-validation at the same location as the original dataset, i.e., in the directory /some/where/else/ . Downloads # folds.sh","title":"Generating cv folds filter"},{"location":"generating_cv_folds_filter/#downloads","text":"folds.sh","title":"Downloads"},{"location":"generating_cv_folds_java/","text":"This article describes how to generate train/test splits for cross-validation using the Weka API directly. The following variables are given: Instances data = ...; // contains the full dataset we want to create train/test sets from int seed = ...; // the seed for randomizing the data int folds = ...; // the number of folds to generate, >=2 Randomize the data # First, randomize your data: Random rand = new Random ( seed ); // create seeded number generator randData = new Instances ( data ); // create copy of original data randData . randomize ( rand ); // randomize data with number generator In case your data has a nominal class and you want to perform stratified cross-validation: randData . stratify ( folds ); Generate the folds # Single run # The next step is to create the train and test sets: for ( int n = 0 ; n < folds ; n ++ ) { Instances train = randData . 
trainCV ( folds , n , rand ); Instances test = randData . testCV ( folds , n ); // further processing, classification, etc. ... } Note: the above code is used by the weka.filters.supervised.instance.StratifiedRemoveFolds filter; the weka.classifiers.Evaluation class and the Explorer/Experimenter would use this method for obtaining the train set: Instances train = randData . trainCV ( folds , n , rand ); Multiple runs # The example above only performs one run of a cross-validation. In case you want to run 10 runs of 10-fold cross-validation, use the following loop: Instances data = ...; // our dataset again, obtained from somewhere int runs = 10 ; for ( int i = 0 ; i < runs ; i ++ ) { seed = i + 1 ; // every run gets a new, but defined seed value // see: randomize the data ... // see: generate the folds ... } See also # Use Weka in your Java code - for general use of the Weka API Downloads # CrossValidationSingleRun.java ( stable , developer ) - simulates a single run of 10-fold cross-validation CrossValidationSingleRunVariant.java ( stable , developer ) - simulates a single run of 10-fold cross-validation, but outputs the confusion matrices for each single train/test pair as well. CrossValidationMultipleRuns.java ( stable , developer ) - simulates 10 runs of 10-fold cross-validation CrossValidationAddPrediction.java ( stable , developer ) - simulates a single run of 10-fold cross-validation, but also adds the classification/distribution/error flag to the test data (uses the AddClassification filter)","title":"Generating cv folds java"},{"location":"generating_cv_folds_java/#randomize-the-data","text":"First, randomize your data: Random rand = new Random ( seed ); // create seeded number generator randData = new Instances ( data ); // create copy of original data randData . randomize ( rand ); // randomize data with number generator In case your data has a nominal class and you want to perform stratified cross-validation: randData . 
stratify ( folds );","title":"Randomize the data"},{"location":"generating_cv_folds_java/#generate-the-folds","text":"","title":"Generate the folds"},{"location":"generating_cv_folds_java/#single-run","text":"Next thing that we have to do is creating the train and the test set: for ( int n = 0 ; n < folds ; n ++ ) { Instances train = randData . trainCV ( folds , n , rand ); Instances test = randData . testCV ( folds , n ); // further processing, classification, etc. ... } Note: the above code is used by the weka.filters.supervised.instance.StratifiedRemoveFolds filter the weka.classifiers.Evaluation class and the Explorer/Experimenter would use this method for obtaining the train set: Instances train = randData . trainCV ( folds , n , rand );","title":"Single run"},{"location":"generating_cv_folds_java/#multiple-runs","text":"The example above only performs one run of a cross-validation. In case you want to run 10 runs of 10-fold cross-validation, use the following loop: Instances data = ...; // our dataset again, obtained from somewhere int runs = 10 ; for ( int i = 0 ; i < runs ; i ++ ) { seed = i + 1 ; // every run gets a new, but defined seed value // see: randomize the data ... // see: generate the folds ... }","title":"Multiple runs"},{"location":"generating_cv_folds_java/#see-also","text":"Use Weka in your Java code - for general use of the Weka API","title":"See also"},{"location":"generating_cv_folds_java/#downloads","text":"CrossValidationSingleRun.java ( stable , developer ) - simulates a single run of 10-fold cross-validation CrossValidationSingleRunVariant.java ( stable , developer ) - simulates a single run of 10-fold cross-validation, but outputs the confusion matrices for each single train/test pair as well. 
CrossValidationMultipleRuns.java ( stable , developer ) - simulates 10 runs of 10-fold cross-validation CrossValidationAddPrediction.java ( stable , developer ) - simulates a single run of 10-fold cross-validation, but also adds the classification/distribution/error flag to the test data (uses the AddClassification filter)","title":"Downloads"},{"location":"generating_roc_curve/","text":"The following little Java class trains a NaiveBayes classifier with a dataset provided by the user and displays the ROC curve for the first class label. Source code: import java.awt.* ; import java.io.* ; import java.util.* ; import javax.swing.* ; import weka.core.* ; import weka.classifiers.* ; import weka.classifiers.bayes.NaiveBayes ; import weka.classifiers.evaluation.Evaluation ; import weka.classifiers.evaluation.ThresholdCurve ; import weka.gui.visualize.* ; /** * Generates and displays a ROC curve from a dataset. Uses a default * NaiveBayes to generate the ROC data. * * @author FracPete */ public class GenerateROC { /** * takes one argument: dataset in ARFF format (expects class to * be last attribute) */ public static void main ( String [] args ) throws Exception { // load data Instances data = new Instances ( new BufferedReader ( new FileReader ( args [ 0 ] ))); data . setClassIndex ( data . numAttributes () - 1 ); // train classifier Classifier cl = new NaiveBayes (); Evaluation eval = new Evaluation ( data ); eval . crossValidateModel ( cl , data , 10 , new Random ( 1 )); // generate curve ThresholdCurve tc = new ThresholdCurve (); int classIndex = 0 ; Instances result = tc . getCurve ( eval . predictions (), classIndex ); // plot curve ThresholdVisualizePanel vmc = new ThresholdVisualizePanel (); vmc . setROCString ( \"(Area under ROC = \" + Utils . doubleToString ( tc . getROCArea ( result ), 4 ) + \")\" ); vmc . setName ( result . relationName ()); PlotData2D tempd = new PlotData2D ( result ); tempd . setPlotName ( result . relationName ()); tempd . 
addInstanceNumberAttribute (); // specify which points are connected boolean [] cp = new boolean [ result . numInstances () ] ; for ( int n = 1 ; n < cp . length ; n ++ ) cp [ n ] = true ; tempd . setConnectPoints ( cp ); // add plot vmc . addPlot ( tempd ); // display curve String plotName = vmc . getName (); final javax . swing . JFrame jf = new javax . swing . JFrame ( \"Weka Classifier Visualize: \" + plotName ); jf . setSize ( 500 , 400 ); jf . getContentPane (). setLayout ( new BorderLayout ()); jf . getContentPane (). add ( vmc , BorderLayout . CENTER ); jf . addWindowListener ( new java . awt . event . WindowAdapter () { public void windowClosing ( java . awt . event . WindowEvent e ) { jf . dispose (); } }); jf . setVisible ( true ); } } See also # ROC curves Visualizing ROC curve Plotting multiple ROC curves Downloads # GenerateROC.java ( stable , developer )","title":"Generating roc curve"},{"location":"generating_roc_curve/#see-also","text":"ROC curves Visualizing ROC curve Plotting multiple ROC curves","title":"See also"},{"location":"generating_roc_curve/#downloads","text":"GenerateROC.java ( stable , developer )","title":"Downloads"},{"location":"generating_source_code_from_weka_classes/","text":"Some of the schemes in Weka can generate Java source code that represents their current internal state. At the moment these are classifiers (book and developer version) and filters (>3.5.6). The generated code can be used within Weka as normal classifier/filter, since this code will be derived from the same superclass ( weka.classifiers.Classifier or weka.filters.Filter ) as the generating code. Note: The commands listed here are for a Linux/Unix bash (the backslash tells the shell that the command isn't finished yet and continues on the next line). In case of Windows or the SimpleCLI, just remove the backslashes and put everything on one line. 
Classifiers # Instead of using a serialized classifier to perform further classifications/predictions, one can also obtain source code from a trained classifier and use this instead. The advantage of this is being less dependent on version changes and incompatible serialized files. All classifiers implementing the weka.classifiers.Sourcable interface can turn their model into Java source code (check the Javadoc of this interface for all the classifiers implementing it). Here's an example of generating source code from a trained J48 (the source code is saved in a file called WekaWrapper.java ): java weka.classifiers.trees.J48 \\ -t /some/where/data.arff \\ -z SourcedJ48 \\ # name of the inner class, gets called by wrapper class WekaWrapper > /else/where/WekaWrapper.java # redirecting the output of the code into a file The package of the wrapper class is by default the weka.classifiers package. Make sure that you place the source code and/or class files in the correct location. The generated classifier can be used from the commandline or GUI like any other classifier within Weka, you only need to make sure that your GenericObjectEditor lists the package you place the classifier in ( weka.classifiers is not listed by default). The following command calls the generated classifier with a training set (training has no effect, of course) and outputs the predictions for this dataset to stdout : java weka.classifiers.WekaWrapper \\ -t /some/file.arff \\ -p 0 # output predictions for training set Note: the Explorer can output source code as well, you only have to check the Output source code option in the More options dialog. Filters # With versions of Weka later than 3.5.6 of the developer version, one can now also turn filters into source code. The process is basically the same as with classifiers outlined above. 
All filters that implement the weka.filters.Sourcable interface can be turned into Java code (again, check out the Javadoc for this interface, to see the filters implementing it). The following command turns an initialized ReplaceMissingValues filter into source code: java weka.filters.unsupervised.attribute.ReplaceMissingValues \\ -i /somewhere1/input.arff \\ -o /somewhere2/output.arff \\ -z SourcedRMV \\ # name of the inner class, gets called by wrapper class WekaWrapper > /some/place/WekaWrapper.java # redirecting the output of the code into a file The package of the wrapper class is by default the weka.filters package. Make sure that you place the source code and/or class files in the correct location. The generated filter can be used from the commandline or GUI like any other filter within Weka, you only need to make sure that your GenericObjectEditor lists the package you place the filter in. And again a little demonstration of how to call the generated source code: java weka.filters.WekaWrapper \\ -i /some/where/input.arff \\ # must have the same structure as **/somewhere1/input.arff**, of course -o /other/place/output.arff See also # Serialization - can be used for all classifiers and filters to save them in a persistent state.","title":"Generating source code from weka classes"},{"location":"generating_source_code_from_weka_classes/#classifiers","text":"Instead of using a serialized classifier to perform further classifications/predictions, one can also obtain source code from a trained classifier and use this instead. The advantage of this is being less dependent on version changes and incompatible serialized files. All classifiers implementing the weka.classifiers.Sourcable interface can turn their model into Java source code (check the Javadoc of this interface for all the classifiers implementing it). 
Here's an example of generating source code from a trained J48 (the source code is saved in a file called WekaWrapper.java ): java weka.classifiers.trees.J48 \\ -t /some/where/data.arff \\ -z SourcedJ48 \\ # name of the inner class, gets called by wrapper class WekaWrapper > /else/where/WekaWrapper.java # redirecting the output of the code into a file The package of the wrapper class is by default the weka.classifiers package. Make sure that you place the source code and/or class files in the correct location. The generated classifier can be used from the commandline or GUI like any other classifier within Weka, you only need to make sure that your GenericObjectEditor lists the package you place the classifier in ( weka.classifiers is not listed by default). The following command calls the generated classifier with a training set (training has no effect, of course) and outputs the predictions for this dataset to stdout : java weka.classifiers.WekaWrapper \\ -t /some/file.arff \\ -p 0 # output predictions for training set Note: the Explorer can output source code as well, you only have to check the Output source code option in the More options dialog.","title":"Classifiers"},{"location":"generating_source_code_from_weka_classes/#filters","text":"With versions of Weka later than 3.5.6 of the developer version, one can now also turn filters into source code. The process is basically the same as with classifiers outlined above. All filters that implement the weka.filters.Sourcable interface can be turned into Java code (again, check out the Javadoc for this interface, to see the filters implementing it). 
The following command turns an initialized ReplaceMissingValues filter into source code: java weka.filters.unsupervised.attribute.ReplaceMissingValues \\ -i /somewhere1/input.arff \\ -o /somewhere2/output.arff \\ -z SourcedRMV \\ # name of the inner class, gets called by wrapper class WekaWrapper > /some/place/WekaWrapper.java # redirecting the output of the code into a file The package of the wrapper class is by default the weka.filters package. Make sure that you place the source code and/or class files in the correct location. The generated filter can be used from the commandline or GUI like any other filter within Weka, you only need to make sure that your GenericObjectEditor lists the package you place the filter in. And again a little demonstration of how to call the generated source code: java weka.filters.WekaWrapper \\ -i /some/where/input.arff \\ # must have the same structure as **/somewhere1/input.arff**, of course -o /other/place/output.arff","title":"Filters"},{"location":"generating_source_code_from_weka_classes/#see-also","text":"Serialization - can be used for all classifiers and filters to save them in a persistent state.","title":"See also"},{"location":"generic_object_editor/","text":"The GenericObjectEditor is the core component in Weka for modifying schemes, like classifiers and filters in the GUI. It has to be configured correctly in order to show default and additional schemes. See the following articles for more details: GenericObjectEditor (book version) GenericObjectEditor (developer version)","title":"Generic object editor"},{"location":"generic_object_editor_book_version/","text":"Introduction # As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off. 
It is assumed that you already placed the GenericPropertiesCreator.props (GPC) file in your home directory (this file is located in directory weka/gui of either the weka.jar or weka-src.jar ZIP archive) and that the weka.jar jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option). For generating the GOE file, execute the following steps: generate a new GenericObjectEditor.props file using the following command: Linux/Unix java weka.gui.GenericPropertiesCreator \\ $HOME/GenericPropertiesCreator.props \\ $HOME/GenericObjectEditor.props Windows (command must be in one line) java weka.gui.GenericPropertiesCreator %USERPROFILE%\\GenericPropertiesCreator.props %USERPROFILE%\\GenericObjectEditor.props edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false . For disabling dynamic class discovery, you need to set the boolean constant USE_DYNAMIC of the weka.gui.GenericObjectEditor class to false . See article Compiling WEKA for more information on how to compile a modified version of WEKA. A limitation of the GOE prior to 3.4.4 was, that additional classifiers, filters, etc., had to fit into the same package structure as the already existing ones, i.e., all had to be located below weka . WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy as we will see later in an example (it is not restricted to classifiers only, but also works with all the other entries in the GPC file). File Structure # The structure of the GOE so far was a key-value-pair, separated by an equals -sign. The value is a comma separated list of classes that are all derived from the superclass/superinterface key . The GPC is slightly different, instead of declaring all the classes/interfaces one need only to specify all the packages descendants are located in (only non-abstract ones are then listed). 
E.g., the weka.classifiers.Classifier entry in the GOE file looks like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes.AODE, \\ weka.classifiers.bayes.BayesNet, \\ weka.classifiers.bayes.ComplementNaiveBayes, \\ weka.classifiers.bayes.NaiveBayes, \\ weka.classifiers.bayes.NaiveBayesMultinomial, \\ weka.classifiers.bayes.NaiveBayesSimple, \\ weka.classifiers.bayes.NaiveBayesUpdateable, \\ weka.classifiers.functions.LeastMedSq, \\ weka.classifiers.functions.LinearRegression, \\ weka.classifiers.functions.Logistic, \\ ... The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70!): weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules Class Discovery # Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH , and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you're starting the Java Virtual Machine (JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory ( /distribution/weka.jar ) and another one with your own classes ( /development/weka/... ), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. In a nutshell, your java call of the GUIChooser could look like this: java -classpath \"/development:/distribution/weka.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Multiple Class Hierarchies # In case you're developing your own framework, but still want to use your classifiers within WEKA that wasn't possible so far. 
With the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you've developed a modified version of J48, let's call it MyJ48 and it's located in the package dummy.classifiers then you'll have to add this package to the classifiers list in the GPC file like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules, \\ dummy.classifiers Your java call for the GUIChooser might look like this: java -classpath \"weka.jar:dummy.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Starting up the GUI you'll now have another root node in the tree view of the classifiers, called root , and below it the weka and the dummy package hierarchy as you can see here: Links # GenericObjectEditor (developer version) CLASSPATH Properties file GenericPropertiesCreator.props","title":"Introduction"},{"location":"generic_object_editor_book_version/#introduction","text":"As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off. It is assumed that you already placed the GenericPropertiesCreator.props (GPC) file in your home directory (this file is located in directory weka/gui of either the weka.jar or weka-src.jar ZIP archive) and that the weka.jar jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option). 
For generating the GOE file, execute the following steps: generate a new GenericObjectEditor.props file using the following command: Linux/Unix java weka.gui.GenericPropertiesCreator \\ $HOME/GenericPropertiesCreator.props \\ $HOME/GenericObjectEditor.props Windows (command must be in one line) java weka.gui.GenericPropertiesCreator %USERPROFILE%\\GenericPropertiesCreator.props %USERPROFILE%\\GenericObjectEditor.props edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false . For disabling dynamic class discovery, you need to set the boolean constant USE_DYNAMIC of the weka.gui.GenericObjectEditor class to false . See article Compiling WEKA for more information on how to compile a modified version of WEKA. A limitation of the GOE prior to 3.4.4 was, that additional classifiers, filters, etc., had to fit into the same package structure as the already existing ones, i.e., all had to be located below weka . WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy as we will see later in an example (it is not restricted to classifiers only, but also works with all the other entries in the GPC file).","title":"Introduction"},{"location":"generic_object_editor_book_version/#file-structure","text":"The structure of the GOE so far was a key-value-pair, separated by an equals -sign. The value is a comma separated list of classes that are all derived from the superclass/superinterface key . The GPC is slightly different, instead of declaring all the classes/interfaces one need only to specify all the packages descendants are located in (only non-abstract ones are then listed). 
E.g., the weka.classifiers.Classifier entry in the GOE file looks like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes.AODE, \\ weka.classifiers.bayes.BayesNet, \\ weka.classifiers.bayes.ComplementNaiveBayes, \\ weka.classifiers.bayes.NaiveBayes, \\ weka.classifiers.bayes.NaiveBayesMultinomial, \\ weka.classifiers.bayes.NaiveBayesSimple, \\ weka.classifiers.bayes.NaiveBayesUpdateable, \\ weka.classifiers.functions.LeastMedSq, \\ weka.classifiers.functions.LinearRegression, \\ weka.classifiers.functions.Logistic, \\ ... The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70!): weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules","title":"File Structure"},{"location":"generic_object_editor_book_version/#class-discovery","text":"Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH , and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you're starting the Java Virtual Machine (JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory ( /distribution/weka.jar ) and another one with your own classes ( /development/weka/... ), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. 
In a nutshell, your java call of the GUIChooser could look like this: java -classpath \"/development:/distribution/weka.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes.","title":"Class Discovery"},{"location":"generic_object_editor_book_version/#multiple-class-hierarchies","text":"In case you're developing your own framework, but still want to use your classifiers within WEKA that wasn't possible so far. With the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you've developed a modified version of J48, let's call it MyJ48 and it's located in the package dummy.classifiers then you'll have to add this package to the classifiers list in the GPC file like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules, \\ dummy.classifiers Your java call for the GUIChooser might look like this: java -classpath \"weka.jar:dummy.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Starting up the GUI you'll now have another root node in the tree view of the classifiers, called root , and below it the weka and the dummy package hierarchy as you can see here:","title":"Multiple Class Hierarchies"},{"location":"generic_object_editor_book_version/#links","text":"GenericObjectEditor (developer version) CLASSPATH Properties file GenericPropertiesCreator.props","title":"Links"},{"location":"generic_object_editor_developer_version/","text":"Introduction # As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). 
In some versions (3.5.8, 3.6.0) this facility was not enabled by default as it is a bit slower than the GOE file approach, and, furthermore, does not function in environments that do not have a CLASSPATH (e.g., application servers). Later versions (3.6.1, 3.7.0) enabled the dynamic discovery again, as WEKA can now distinguish between being a standalone Java application or being run in a non-CLASSPATH environment. If you wish to enable or disable dynamic class discovery, the relevant file to edit is GenericPropertiesCreator.props (GPC). You can obtain this file either from the weka.jar or weka-src.jar archive. Open one of these files with an archive manager that can handle ZIP files (for Windows users, you can use 7-Zip for this) and navigate to the weka/gui directory, where the GPC file is located. All that is required is to change the UseDynamic property in this file from false to true (for enabling it) or the other way round (for disabling it). After changing the file, you just place it in your home directory. In order to find out the location of your home directory, do the following: Linux/Unix Open a terminal and run the following command: echo $HOME Windows Open a command prompt and run the following command: echo %USERPROFILE% If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off again. It is assumed that you have already placed the GPC file in your home directory (see steps above) and that the weka.jar jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option). 
For generating the GOE file, execute the following steps: generate a new GenericObjectEditor.props file using the following command: Linux/Unix java weka.gui.GenericPropertiesCreator \\ $HOME/GenericPropertiesCreator.props \\ $HOME/GenericObjectEditor.props Windows (command must be in one line) java weka.gui.GenericPropertiesCreator %USERPROFILE%\\GenericPropertiesCreator.props %USERPROFILE%\\GenericObjectEditor.props edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false . A limitation of the GOE prior to 3.4.4 was, that additional classifiers, filters, etc., had to fit into the same package structure as the already existing ones, i.e., all had to be located below weka . WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy as we will see later in an example (it is not restricted to classifiers only, but also works with all the other entries in the GPC file). File Structure # The structure of the GOE so far was a key-value-pair, separated by an equals -sign. The value is a comma separated list of classes that are all derived from the superclass/superinterface key . The GPC is slightly different, instead of declaring all the classes/interfaces one need only to specify all the packages descendants are located in (only non-abstract ones are then listed). E.g., the weka.classifiers.Classifier entry in the GOE file looks like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes.AODE, \\ weka.classifiers.bayes.BayesNet, \\ weka.classifiers.bayes.ComplementNaiveBayes, \\ weka.classifiers.bayes.NaiveBayes, \\ weka.classifiers.bayes.NaiveBayesMultinomial, \\ weka.classifiers.bayes.NaiveBayesSimple, \\ weka.classifiers.bayes.NaiveBayesUpdateable, \\ weka.classifiers.functions.LeastMedSq, \\ weka.classifiers.functions.LinearRegression, \\ weka.classifiers.functions.Logistic, \\ ... 
The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70 in WEKA 3.4.4!): weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules Exclusion # It may not always be desired to list all the classes that can be found along the CLASSPATH . Sometimes, classes cannot be declared abstract but still shouldn't be listed in the GOE. For that reason one can list classes, interfaces, superclasses for certain packages to be excluded from display. This exclusion is done with the following file: weka/gui/GenericPropertiesCreator.excludes The format of this properties file is fairly easy: <key>=<prefix>:<class>[,<prefix>:<class>] Where the <key> corresponds to a key in the GenericPropertiesCreator.props file and the <prefix> can be one of the following: S - Superclass any class derived from this will be excluded I - Interface any class implementing this interface will be excluded C - Class exactly this class will be excluded Here are a few examples: # exclude all ResultListeners that also implement the ResultProducer interface # (all ResultProducers do that!) weka.experiment.ResultListener = \\ I:weka.experiment.ResultProducer # exclude J48 and all SingleClassifierEnhancers weka.classifiers.Classifier = \\ C:weka.classifiers.trees.J48, \\ S:weka.classifiers.SingleClassifierEnhancer Class Discovery # Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH , and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you're starting the Java Virtual Machine (JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory ( /distribution/weka.jar ) and another one with your own classes ( /development/weka/... 
), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. In a nutshell, your java call of the GUIChooser could look like this: java -classpath \"/development:/distribution/weka.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Multiple Class Hierarchies # In case you're developing your own framework, but still want to use your classifiers within WEKA that wasn't possible so far. With the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you've developed a modified version of J48, let's call it MyJ48 and it's located in the package dummy.classifiers then you'll have to add this package to the classifiers list in the GPC file like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules, \\ dummy.classifiers Your java call for the GUIChooser might look like this: java -classpath \"weka.jar:dummy.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Starting up the GUI you'll now have another root node in the tree view of the classifiers, called root , and below it the weka and the dummy package hierarchy as you can see here: Capabilities # Version 3.5.3 of Weka introduces the notion of Capabilities . Capabilities basically list what kind of data a certain object can handle, e.g., one classifier can handle numeric classes, but another cannot. In case a class supports capabilities the additional buttons Filter... and Remove filter will be available in the GOE. The Filter... button pops up a dialog which lists all available Capabilities: One can then choose those capabilities an object, e.g., a classifier, should have. 
If one is looking for classification problem, then the Nominal class Capability can be selected. On the other hand, if one needs a regression scheme, then the Capability Numeric class can be selected. This filtering mechanism makes the search for an appropriate learning scheme easier. After applying that filter, the tree with the objects will be displayed again and lists all objects that can handle all the selected Capabilities black , the ones that cannot red (starting with 3.5.8: silver ) and the ones that might be able to handle them blue (e.g., meta classifiers which depend on their base classifier(s)). Links # GenericObjectEditor (book version) CLASSPATH Properties file GenericPropertiesCreator.props GenericPropertiesCreator.excludes","title":"Introduction"},{"location":"generic_object_editor_developer_version/#introduction","text":"As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). In some versions (3.5.8, 3.6.0) this facility was not enabled by default as it is a bit slower than the GOE file approach, and, furthermore, does not function in environments that do not have a CLASSPATH (e.g., application servers). Later versions (3.6.1, 3.7.0) enabled the dynamic discovery again, as WEKA can now distinguish between being a standalone Java application or being run in a non-CLASSPATH environment. If you wish to enable or disable dynamic class discovery, the relevant file to edit is GenericPropertiesCreator.props (GPC). You can obtain this file either from the weka.jar or weka-src.jar archive. Open one of these files with an archive manager that can handle ZIP files (for Windows users, you can use 7-Zip for this) and navigate to the weka/gui directory, where the GPC file is located. All that is required, is to change the UseDynamic property in this file from false to true (for enabling it) or the other way round (for disabling it). 
After changing the file, you just place it in your home directory. In order to find out the location of your home directory, do the following: Linux/Unix Open a terminal and run the following command: echo $HOME Windows Open a command prompt and run the following command: echo %USERPROFILE% If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off again. It is assumed that you have already placed the GPC file in your home directory (see steps above) and that the weka.jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option). For generating the GOE file, execute the following steps: generate a new GenericObjectEditor.props file using the following command: Linux/Unix java weka.gui.GenericPropertiesCreator \\ $HOME/GenericPropertiesCreator.props \\ $HOME/GenericObjectEditor.props Windows (command must be in one line) java weka.gui.GenericPropertiesCreator %USERPROFILE%\\GenericPropertiesCreator.props %USERPROFILE%\\GenericObjectEditor.props edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false . A limitation of the GOE prior to 3.4.4 was that additional classifiers, filters, etc., had to fit into the same package structure as the already existing ones, i.e., all had to be located below weka . WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy as we will see later in an example (it is not restricted to classifiers only, but also works with all the other entries in the GPC file).","title":"Introduction"},{"location":"generic_object_editor_developer_version/#file-structure","text":"The structure of the GOE so far was a key-value-pair, separated by an equals -sign. The value is a comma separated list of classes that are all derived from the superclass/superinterface key . 
The GPC is slightly different, instead of declaring all the classes/interfaces one need only to specify all the packages descendants are located in (only non-abstract ones are then listed). E.g., the weka.classifiers.Classifier entry in the GOE file looks like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes.AODE, \\ weka.classifiers.bayes.BayesNet, \\ weka.classifiers.bayes.ComplementNaiveBayes, \\ weka.classifiers.bayes.NaiveBayes, \\ weka.classifiers.bayes.NaiveBayesMultinomial, \\ weka.classifiers.bayes.NaiveBayesSimple, \\ weka.classifiers.bayes.NaiveBayesUpdateable, \\ weka.classifiers.functions.LeastMedSq, \\ weka.classifiers.functions.LinearRegression, \\ weka.classifiers.functions.Logistic, \\ ... The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70 in WEKA 3.4.4!): weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules","title":"File Structure"},{"location":"generic_object_editor_developer_version/#exclusion","text":"It may not always be desired to list all the classes that can be found along the CLASSPATH . Sometimes, classes cannot be declared abstract but still shouldn't be listed in the GOE. For that reason one can list classes, interfaces, superclasses for certain packages to be excluded from display. 
This exclusion is done with the following file: weka/gui/GenericPropertiesCreator.excludes The format of this properties file is fairly easy: <key>=<prefix>:<class>[,<prefix>:<class>] Where the <key> corresponds to a key in the GenericPropertiesCreator.props file and the <prefix> can be one of the following: S - Superclass any class derived from this will be excluded I - Interface any class implementing this interface will be excluded C - Class exactly this class will be excluded Here are a few examples: # exclude all ResultListeners that also implement the ResultProducer interface # (all ResultProducers do that!) weka.experiment.ResultListener = \\ I:weka.experiment.ResultProducer # exclude J48 and all SingleClassifierEnhancers weka.classifiers.Classifier = \\ C:weka.classifiers.trees.J48, \\ S:weka.classifiers.SingleClassifierEnhancer","title":"Exclusion"},{"location":"generic_object_editor_developer_version/#class-discovery","text":"Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH , and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you're starting the Java Virtual Machine (JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory ( /distribution/weka.jar ) and another one with your own classes ( /development/weka/... ), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. 
In a nutshell, your java call of the GUIChooser could look like this: java -classpath \"/development:/distribution/weka.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes.","title":"Class Discovery"},{"location":"generic_object_editor_developer_version/#multiple-class-hierarchies","text":"In case you're developing your own framework, but still want to use your classifiers within WEKA that wasn't possible so far. With the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you've developed a modified version of J48, let's call it MyJ48 and it's located in the package dummy.classifiers then you'll have to add this package to the classifiers list in the GPC file like this: weka.classifiers.Classifier = \\ weka.classifiers.bayes, \\ weka.classifiers.functions, \\ weka.classifiers.lazy, \\ weka.classifiers.meta, \\ weka.classifiers.trees, \\ weka.classifiers.rules, \\ dummy.classifiers Your java call for the GUIChooser might look like this: java -classpath \"weka.jar:dummy.jar\" weka.gui.GUIChooser Note: Windows users have to replace the \":\" with \";\" and the forward slashes with backslashes. Starting up the GUI you'll now have another root node in the tree view of the classifiers, called root , and below it the weka and the dummy package hierarchy as you can see here:","title":"Multiple Class Hierarchies"},{"location":"generic_object_editor_developer_version/#capabilities","text":"Version 3.5.3 of Weka introduces the notion of Capabilities . Capabilities basically list what kind of data a certain object can handle, e.g., one classifier can handle numeric classes, but another cannot. In case a class supports capabilities the additional buttons Filter... and Remove filter will be available in the GOE. The Filter... 
button pops up a dialog which lists all available Capabilities: One can then choose those capabilities an object, e.g., a classifier, should have. If one is working on a classification problem, then the Nominal class Capability can be selected. On the other hand, if one needs a regression scheme, then the Capability Numeric class can be selected. This filtering mechanism makes the search for an appropriate learning scheme easier. After applying that filter, the tree with the objects will be displayed again and lists all objects that can handle all the selected Capabilities in black , the ones that cannot in red (starting with 3.5.8: silver ) and the ones that might be able to handle them in blue (e.g., meta classifiers which depend on their base classifier(s)).","title":"Capabilities"},{"location":"generic_object_editor_developer_version/#links","text":"GenericObjectEditor (book version) CLASSPATH Properties file GenericPropertiesCreator.props GenericPropertiesCreator.excludes","title":"Links"},{"location":"get_latest_bugfixes/","text":"Weka is actively developed, which means that bugs are fixed and new functionality is added (only to the developer version) all the time. Every now and then (about every 6-12 months), when there has been a sufficiently large number of improvements or fixes, a release is made and uploaded to SourceForge.net . If you don't want to wait that long, you can get the latest source code from Git and compile it yourself. See the following articles for more information: obtaining the source code from Git , either book or developer version compiling the source code","title":"Get latest bugfixes"},{"location":"getting_help/","text":"In addition to consulting the available documentation , try searching a mailing list archive or community forum to check whether a solution to your problem has already been posted there. Please consult these sources of information before posting a query on the Weka mailing list or elsewhere. 
And please never email individual Weka developers directly. When you do post a message regarding a problem you encountered with Weka, please include as much information as possible. In particular, consider running Weka with a console window open so that you can see the entire error output from Java (including the Java stack trace). This makes it much more likely that you will get useful help. When posting questions, comments, or bug reports to the Weka mailing list, consider the mailing list etiquette . Mailing list archive and mirrors # Consider searching the archive of the Weka mailing list (wekalist) or its mirror marc.info . Forums offering help # You should also consider looking for a solution at stackoverflow.com , the old forum for Weka at pentaho.com , or the newer forum at hitachivantara.com . Bug reports # Bug reports can be sent to the Weka mailing list or posted at JIRA . IRC channel for discussing Weka # ##weka on freenode.","title":"Getting help"},{"location":"getting_help/#mailing-list-archive-and-mirrors","text":"Consider searching the archive of the Weka mailing list (wekalist) or its mirror marc.info .","title":"Mailing list archive and mirrors"},{"location":"getting_help/#forums-offering-help","text":"You should also consider looking for a solution at stackoverflow.com , the old forum for Weka at pentaho.com , or the newer forum at hitachivantara.com .","title":"Forums offering help"},{"location":"getting_help/#bug-reports","text":"Bug reports can be sent to the Weka mailing list or posted at JIRA .","title":"Bug reports"},{"location":"getting_help/#irc-channel-for-discussing-weka","text":"##weka on freenode.","title":"IRC channel for discussing Weka"},{"location":"git/","text":"General # The main trunk of the Weka Git repository is accessible and browseable via the following URL: https://git.cms.waikato.ac.nz/weka/weka/-/tree/main/trunk Other branches can be accessed via https://git.cms.waikato.ac.nz/weka/weka For example, if you want to 
obtain the source code of the 3.8 version, use this URL: https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8 Specific version # Whenever a release of Weka is generated, the repository gets tagged . The tag for a development version has the form dev-X-Y-Z For example, WEKA 3.9.6 corresponds to the tag dev-3-9-6. The tag for a stable version is stable-X-Y-Z The WEKA 3.8 version is one of those stable versions, e.g., stable-3-8-6 will be the tag for Weka 3.8.6.","title":"General"},{"location":"git/#general","text":"The main trunk of the Weka Git repository is accessible and browseable via the following URL: https://git.cms.waikato.ac.nz/weka/weka/-/tree/main/trunk Other branches can be accessed via https://git.cms.waikato.ac.nz/weka/weka For example, if you want to obtain the source code of the 3.8 version, use this URL: https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8","title":"General"},{"location":"git/#specific-version","text":"Whenever a release of Weka is generated, the repository gets tagged . The tag for a development version has the form dev-X-Y-Z For example, WEKA 3.9.6 corresponds to the tag dev-3-9-6. The tag for a stable version is stable-X-Y-Z The WEKA 3.8 version is one of those stable versions, e.g., stable-3-8-6 will be the tag for Weka 3.8.6.","title":"Specific version"},{"location":"gpgpu/","text":"See : this post I am looking for input from WEKA users. Please leave a comment on the website and I'll respond back. The input/help I need from WEKA users is as follows: I need to know what algorithms would be desired to have optimized first. For now, I'm working on Bayes (for starters). 
I need willing volunteers to use the revised code that I create (I am not making any changes to the algorithms just diverting the mathematical calculations from the CPU to the GPU to increase speed) and let me know of any performance changes observed.","title":"Gpgpu"},{"location":"gui_chooser_starts_but_not_experimenter_or_explorer/","text":"The GUIChooser starts, but the Explorer and Experimenter do not start and output an Exception like this in the terminal: /usr/share/themes/Mist/gtk-2.0/gtkrc:48: Engine \"mist\" is unsupported, ignoring ---Registering Weka Editors--- java.lang.NullPointerException at weka.gui.explorer.PreprocessPanel.addPropertyChangeListener(PreprocessPanel.java:519) at javax.swing.plaf.synth.SynthPanelUI.installListeners(SynthPanelUI.java:49) at javax.swing.plaf.synth.SynthPanelUI.installUI(SynthPanelUI.java:38) at javax.swing.JComponent.setUI(JComponent.java:652) at javax.swing.JPanel.setUI(JPanel.java:131) ... This behavior happens only under Java 5/6 and Gnome/Linux, KDE doesn't produce this error. The reason for this is, that Weka tries to look more \"native\" and therefore sets a platform-specific Swing theme. Unfortunately, this doesn't seem to be working correctly in Java 5/6 together with Gnome. A workaround for this is to set the cross-platform Metal theme. In order to use another theme one only has to create the following properties file: LookAndFeel.props with this content: Theme = javax.swing.plaf.metal.MetalLookAndFeel","title":"Gui chooser starts but not experimenter or explorer"},{"location":"history/","text":"Book 1st ed. version (3.0) Old GUI version (3.2) Stable/Book 2nd ed. version (3.4) Stable/Book 3rd ed. version (3.6) Stable/Book 4th ed. 
version (3.8) Development version (3.9) 3.8.6 (pkgs) 3.9.6 (pkgs) 3.8.5 (pkgs) 3.9.5 (pkgs) 3.8.4 (pkgs) 3.9.4 (pkgs) 3.8.3 (pkgs) 3.9.3 (pkgs) 3.8.2 (pkgs) 3.9.2 (pkgs) 3.6.15 3.8.1 (pkgs) 3.9.1 (pkgs) 3.6.14 3.8.0 (pkgs) 3.9.0 (pkgs) 3.6.13 3.7.13 (pkgs) 3.6.12 3.7.12 (pkgs) 3.6.11 3.7.11 (pkgs) 3.6.10 3.7.10 (pkgs) 3.7.9 (pkgs) 3.6.9 3.7.8 (pkgs) 3.6.8 3.7.7 (pkgs) 3.6.7 3.7.6 (pkgs) 3.6.6 3.7.5 (pkgs) 3.4.19 3.6.5 3.7.4 (pkgs) 3.4.18 3.6.4 3.7.3 (pkgs) 3.4.17 3.6.3 3.7.2 (pkgs) 3.4.16 3.6.2 3.7.1 3.4.15 3.6.1 3.7.0 3.4.14 3.6.0 3.4.13 3.5.8 3.4.12 3.5.7 3.4.11 3.5.6 3.4.10 3.5.5 3.4.9 3.5.4 3.4.8 3.5.3 3.4.7 3.5.2 3.4.6 3.5.1 3.4.5 3.5.0 3.4.4 3.4.3 3.4.2 3.4.1 3.4 3.3.6 3.3.5 3.3.4 3.3.3 3.2.3 3.3.2 3.0.6 3.2.2 3.3.1 3.0.5 3.2.1 3.3 3.0.4 3.2 3.0.3 3.1.9 3.0.2 3.1.8 3.0.1 3.1.7 3.0 3.1.6 Prerelease 6 3.1.5 Prerelease 5 3.1.4 Prerelease 4","title":"History"},{"location":"how_do_i_modify_the_classpath/","text":"See the article CLASSPATH and check out this section for changing the environment variable. This article explains how to add a MySQL jar to the variable. With version 3.5.4 or later you can also just use the RunWEKA.ini file to modify your CLASSPATH.","title":"How do i modify the classpath"},{"location":"how_do_i_use_the_associator_generalized_sequential_patterns/","text":"The article GeneralizedSequentialPatterns contains more information on this associator.","title":"How do i use the associator generalized sequential patterns"},{"location":"how_to_run_weka_schemes_from_commandline/","text":"It is quite often the case that one has to run a classifier, filter, attribute selection, etc. from commandline, leaving the comfort of the GUI (most likely the Explorer). Due to the vast amount of options the Weka schemes offer, it can be quite tedious setting up a scheme on the commandline. 
In the following, a few different approaches are listed that can be used for running a scheme from the commandline: Hardcore approach (works for all versions of Weka) one just uses the -h option to display the commandline help with all available options and chooses the ones that apply, e.g.: java weka.classifiers.functions.SMO -h The drawback of this method is that one has to take care of escaping nested quotes oneself. As soon as one has to use meta-classifiers, this gets really messy. An introduction to the commandline use can be found in the Primer . copy/paste approach With this approach, one doesn't have to worry about correct nesting, since Weka takes care of that, returning correctly nested and escaped options. Since version 3.5.3, one can right-click (or + left-click for Mac users) any GenericObjectEditor panel and select the Copy configuration to clipboard option to copy the currently shown configuration to the clipboard and then just paste it into the commandline. One only needs to add the appropriate java call and other general options, like datasets, class index, etc. Another copy/paste approach is copying the configurations from the Explorer log, which is available since version 3.5.4. Every action in the Explorer, like applying a filter, running a classifier, attribute selection, etc. outputs the command to the log as well. This makes it fairly easy to copy the command to the clipboard and use it in the console; only the java call and other general options need to be added. 
See also # Primer - introduction to Weka from the commandline CLASSPATH - how to load all necessary libraries or welcome to the JAR hell Command redirection - shows how to redirect output in files","title":"How to run weka schemes from commandline"},{"location":"how_to_run_weka_schemes_from_commandline/#see-also","text":"Primer - introduction to Weka from the commandline CLASSPATH - how to load all necessary libraries or welcome to the JAR hell Command redirection - shows how to redirect output in files","title":"See also"},{"location":"ikvm_with_weka_tutorial/","text":"This tutorial walks you through the creation of a Microsoft C# program that uses Weka , and some Java API classes, via IKVM . The process will be similar for other .NET languages. Set up / Installation # You will first need to install IKVM, which can be found here . You will also need a C# compiler/VM - Mono is an excellent open source solution for both Linux and Windows, or you could just use Microsoft Visual Studio .NET. Conversion from Java to a .NET dll # With that out of the way, the first thing you will want to do is to convert the Weka .jar file into a .NET dll. To do this, we will use ikvmc , which is the IKVM static compiler. On the console, go to the directory which contains weka.jar, and type: > ikvmc -target:library weka.jar The -target:library option causes ikvmc to create a .dll library instead of an executable. Note that the IKVM tutorial tells you that you should add -reference:/usr/lib/IKVM.GNU.Classpath.dll (or appropriate path) to the above command, which tells IKVM where to find the GNU Classpath library. However, IKVM.GNU.Classpath.dll is no longer included in the download package, and is from very old versions of IKVM. When Sun open-sourced Java, it was replaced by the IKVM.OpenJDK.*.dll files. You should now have a file called \"weka.dll\", which is a .NET version of the entire Weka API. That's exactly what we want! 
Use the dll in a .NET application # To try it out, let's use a small C# program that I wrote. The program simply runs the J48 classifier on the Iris dataset with a 66% test/data split, and prints out the correctness percentage. It also uses a few Java classes, and is already about 95% legal Java code. The code is here: //start of file Main.cs using System ; class MainClass { public static void Main ( string [] args ) { Console . WriteLine ( \"Hello Java, from C#!\" ); classifyTest (); } const int percentSplit = 66 ; public static void classifyTest () { try { weka . core . Instances insts = new weka . core . Instances ( new java . io . FileReader ( \"iris.arff\" )); insts . setClassIndex ( insts . numAttributes () - 1 ); weka . classifiers . Classifier cl = new weka . classifiers . trees . J48 (); Console . WriteLine ( \"Performing \" + percentSplit + \"% split evaluation.\" ); //randomize the order of the instances in the dataset. weka . filters . Filter myRandom = new weka . filters . unsupervised . instance . Randomize (); myRandom . setInputFormat ( insts ); insts = weka . filters . Filter . useFilter ( insts , myRandom ); int trainSize = insts . numInstances () * percentSplit / 100 ; int testSize = insts . numInstances () - trainSize ; weka . core . Instances train = new weka . core . Instances ( insts , 0 , trainSize ); cl . buildClassifier ( train ); int numCorrect = 0 ; for ( int i = trainSize ; i < insts . numInstances (); i ++ ) { weka . core . Instance currentInst = insts . instance ( i ); double predictedClass = cl . classifyInstance ( currentInst ); if ( predictedClass == insts . instance ( i ). classValue ()) numCorrect ++ ; } Console . WriteLine ( numCorrect + \" out of \" + testSize + \" correct (\" + ( double )(( double ) numCorrect / ( double ) testSize * 100.0 ) + \"%)\" ); } catch ( java . lang . Exception ex ) { ex . printStackTrace (); } } } //end of file Main.cs Compile and run it # Now we just need to compile it. 
If you are using MonoDevelop or Visual Studio, you will need to add references to weka.dll, and all of the IKVM.OpenJDK.*.dll files, and lastly IKVM.Runtime.dll into your project. Otherwise, on the command line, you can type: NOTE: replace IKVM.OpenJDK.*.dll with the remaining IKVM.OpenJDK files. >mcs Main.cs -r:weka.dll,IKVM.Runtime.dll,IKVM.OpenJDK.core.dll, IKVM.OpenJDK.*.dll to run the Mono C# compiler with references to the appropriate dlls (according to the Mono documentation, the command line arguments for Visual Studio are the same). And there you go! Now you can run the program. But make sure that the iris.arff dataset is in the same directory first. For Mono: >mono Main.exe or if you are using Visual Studio, just: >Main.exe Hopefully you will get as output: Hello Java, from C#! Performing 66% split evaluation. 49 out of 51 correct (96.078431372549%) And there you have it. Now we have a working program that uses Weka classes, and some classes from the standard Java API, in a C# program for the .NET framework. Links # An Introduction to IKVM IKVM.NET Mono The official IKVM tutorial Use Weka with the Microsoft .NET Framework","title":"Ikvm with weka tutorial"},{"location":"ikvm_with_weka_tutorial/#set-up-installation","text":"You will first need to install IKVM, which can be found here . You will also need a C# compiler/VM - Mono is an excellent open source solution for both Linux and Windows, or you could just use Microsoft Visual Studio .NET.","title":"Set up / Installation"},{"location":"ikvm_with_weka_tutorial/#conversion-from-java-to-a-net-dll","text":"With that out of the way, the first thing you will want to do is to convert the Weka .jar file into a .NET dll. To do this, we will use ikvmc , which is the IKVM static compiler. On the console, go to the directory which contains weka.jar, and type: > ikvmc -target:library weka.jar The -target:library option causes ikvmc to create a .dll library instead of an executable. 
Note that the IKVM tutorial tells you that you should add -reference:/usr/lib/IKVM.GNU.Classpath.dll (or appropriate path) to the above command, which tells IKVM where to find the GNU Classpath library. However, IKVM.GNU.Classpath.dll is no longer included in the download package, and is from very old versions of IKVM. When Sun open-sourced Java, it was replaced by the IKVM.OpenJDK.*.dll files. You should now have a file called \"weka.dll\", which is a .NET version of the entire Weka API. That's exactly what we want!","title":"Conversion from Java to a .NET dll"},{"location":"ikvm_with_weka_tutorial/#use-the-dll-in-a-net-application","text":"To try it out, let's use a small C# program that I wrote. The program simply runs the J48 classifier on the Iris dataset with a 66% test/data split, and prints out the correctness percentage. It also uses a few Java classes, and is already about 95% legal Java code. The code is here: //start of file Main.cs using System ; class MainClass { public static void Main ( string [] args ) { Console . WriteLine ( \"Hello Java, from C#!\" ); classifyTest (); } const int percentSplit = 66 ; public static void classifyTest () { try { weka . core . Instances insts = new weka . core . Instances ( new java . io . FileReader ( \"iris.arff\" )); insts . setClassIndex ( insts . numAttributes () - 1 ); weka . classifiers . Classifier cl = new weka . classifiers . trees . J48 (); Console . WriteLine ( \"Performing \" + percentSplit + \"% split evaluation.\" ); //randomize the order of the instances in the dataset. weka . filters . Filter myRandom = new weka . filters . unsupervised . instance . Randomize (); myRandom . setInputFormat ( insts ); insts = weka . filters . Filter . useFilter ( insts , myRandom ); int trainSize = insts . numInstances () * percentSplit / 100 ; int testSize = insts . numInstances () - trainSize ; weka . core . Instances train = new weka . core . Instances ( insts , 0 , trainSize ); cl . 
buildClassifier ( train ); int numCorrect = 0 ; for ( int i = trainSize ; i < insts . numInstances (); i ++ ) { weka . core . Instance currentInst = insts . instance ( i ); double predictedClass = cl . classifyInstance ( currentInst ); if ( predictedClass == insts . instance ( i ). classValue ()) numCorrect ++ ; } Console . WriteLine ( numCorrect + \" out of \" + testSize + \" correct (\" + ( double )(( double ) numCorrect / ( double ) testSize * 100.0 ) + \"%)\" ); } catch ( java . lang . Exception ex ) { ex . printStackTrace (); } } } //end of file Main.cs","title":"Use the dll in a .NET application"},{"location":"ikvm_with_weka_tutorial/#compile-and-run-it","text":"Now we just need to compile it. If you are using MonoDevelop or Visual Studio, you will need to add references to weka.dll, and all of the IKVM.OpenJDK.*.dll files, and lastly IKVM.Runtime.dll into your project. Otherwise, on the command line, you can type: NOTE: replace IKVM.OpenJDK.*.dll with the remaining IKVM.OpenJDK files. >mcs Main.cs -r:weka.dll,IKVM.Runtime.dll,IKVM.OpenJDK.core.dll, IKVM.OpenJDK.*.dll to run the Mono C# compiler with references to the appropriate dlls (according to the Mono documentation, the command line arguments for Visual Studio are the same). And there you go! Now you can run the program. But make sure that the iris.arff dataset is in the same directory first. For Mono: >mono Main.exe or if you are using Visual Studio, just: >Main.exe Hopefully you will get as output: Hello Java, from C#! Performing 66% split evaluation. 49 out of 51 correct (96.078431372549%) And there you have it. 
Now we have a working program that uses Weka classes, and some classes from the standard Java API, in a C# program for the .NET framework.","title":"Compile and run it"},{"location":"ikvm_with_weka_tutorial/#links","text":"An Introduction to IKVM IKVM.NET Mono The official IKVM tutorial Use Weka with the Microsoft .NET Framework","title":"Links"},{"location":"instance_id/","text":"People often want to tag their instances with identifiers , so they can keep track of them and the predictions made on them. Adding the ID # Adding a new ID attribute is easy: one only needs to run the AddID filter over the dataset. Here's an example (at a DOS/Unix command prompt): java weka.filters.unsupervised.attribute.AddID -i data_without_id.arff -o data_with_id.arff (all on a single line) Note: the AddID filter adds a numeric attribute, not a String attribute to the dataset. If you want to remove this ID attribute for the classifier in a FilteredClassifier environment again, use the Remove filter instead of the RemoveType filter (same package). Removing the ID # If you run from the command line you can use the -p option to output predictions plus any other attributes you are interested in. So it is possible to have a string attribute in your data that acts as an identifier. A problem is that most classifiers don't like String attributes, but you can get around this by using the RemoveType filter (this removes String attributes by default). Here's an example. Let's say you have a training file named train.arff , a testing file named test.arff , and they have an identifier String attribute as their 5th attribute. 
You can get the predictions from J48 along with the identifier strings by issuing the following command (at a DOS/Unix command prompt): java weka.classifiers.meta.FilteredClassifier -F weka.filters.unsupervised.attribute.RemoveType -W weka.classifiers.trees.J48 -t train.arff -T test.arff -p 5 (all on a single line) If you want, you can redirect the output to a file by adding \" > output.txt \" to the end of the line. In the Explorer GUI you could try a similar trick of using the String attribute identifiers here as well. Choose the FilteredClassifier , with the RemoveType as the filter, and whatever classifier you prefer. When you visualize the results you will need to click through each instance to see the identifier listed for each.","title":"Instance id"},{"location":"instance_id/#adding-the-id","text":"Adding a new ID attribute is easy: one only needs to run the AddID filter over the dataset. Here's an example (at a DOS/Unix command prompt): java weka.filters.unsupervised.attribute.AddID -i data_without_id.arff -o data_with_id.arff (all on a single line) Note: the AddID filter adds a numeric attribute, not a String attribute to the dataset. If you want to remove this ID attribute for the classifier in a FilteredClassifier environment again, use the Remove filter instead of the RemoveType filter (same package).","title":"Adding the ID"},{"location":"instance_id/#removing-the-id","text":"If you run from the command line you can use the -p option to output predictions plus any other attributes you are interested in. So it is possible to have a string attribute in your data that acts as an identifier. A problem is that most classifiers don't like String attributes, but you can get around this by using the RemoveType filter (this removes String attributes by default). Here's an example. Let's say you have a training file named train.arff , a testing file named test.arff , and they have an identifier String attribute as their 5th attribute. 
You can get the predictions from J48 along with the identifier strings by issuing the following command (at a DOS/Unix command prompt): java weka.classifiers.meta.FilteredClassifier -F weka.filters.unsupervised.attribute.RemoveType -W weka.classifiers.trees.J48 -t train.arff -T test.arff -p 5 (all on a single line) If you want, you can redirect the output to a file by adding \" > output.txt \" to the end of the line. In the Explorer GUI you could try a similar trick of using the String attribute identifiers here as well. Choose the FilteredClassifier , with the RemoveType as the filter, and whatever classifier you prefer. When you visualize the results you will need to click through each instance to see the identifier listed for each.","title":"Removing the ID"},{"location":"j48_weighter_patch/","text":"Description # J48-Weighter patch: Modification of J48 for Weighted Data. Reference # -none- Package # Patches to: weka.classifiers.trees.j48 weka.core weka.filters.unsupervised.attribute Download # Patch for Weka 3.4.5: j48-weighter.patch Additional Information # This patch addresses two separate but related issues: The proposed filter \"Weighter\" allows one to specify a numeric attribute to be used as an instance weight. As mentioned on Wekalist, tests using weighted sample-survey data indicated possible problems in the J48 decision tree algorithm. The Weighter filter # Weighter is a general-purpose filter independent of J48 or other classifiers, but to preserve the weight assignment it initially had to be run under FilteredClassifier. To make weights persistent via .arff files, some changes were made in Instances and Instance, while retaining compatibility with the existing ARFF format. Briefly, if Weighter is applied to an attribute, e.g. \"fnlwgt\" in the \"adult\" dataset from the UCI repository, that attribute is removed and its value is used as instance weight. 
Upon Save, the weight is appended to each instance under the attribute name \"::weight::fnlwgt\"; reading the .arff file inverts the Save process, transparent to the user. Repeated application of Weighter multiplies the weight and extends its name. The special case of invoking Weighter without an attribute argument restores the unweighted dataset, with an appended attribute named as above. J48 with instance weights # The simple rescaling inserted in weka.classifiers.trees.j48.Stats is intended to: use the correct sample size in the normal approximation to the binomial, make the scale of the .5 continuity correction consistent with the data, base the minimum-leaf-count option (-M) on unweighted counts. These changes make pruning more effective with weighted data, and help to reduce apparent overfitting. This should be the case whether the weights reflect missing value imputation (as is common in Weka), or survey-sampling probabilities (e.g. \"fnlwgt\" in the UCI \"adult\" sample). The modification to j48.Stats would not have worked on its own. In particular, j48.Distribution had been written to maintain one set of counts only. To work on weighted data statistical algorithms often require both weighted and unweighted counts. A few other minor modifications were introduced to change the way \"-M\" works. 
One effect is that, for this purpose, instances with missing x-values are no longer counted; they are considered missing.","title":"Description"},{"location":"j48_weighter_patch/#description","text":"J48-Weighter patch: Modification of J48 for Weighted Data.","title":"Description"},{"location":"j48_weighter_patch/#reference","text":"-none-","title":"Reference"},{"location":"j48_weighter_patch/#package","text":"Patches to: weka.classifiers.trees.j48 weka.core weka.filters.unsupervised.attribute","title":"Package"},{"location":"j48_weighter_patch/#download","text":"Patch for Weka 3.4.5: j48-weighter.patch","title":"Download"},{"location":"j48_weighter_patch/#additional-information","text":"This patch addresses two separate but related issues: The proposed filter \"Weighter\" allows one to specify a numeric attribute to be used as an instance weight. As mentioned on Wekalist, tests using weighted sample-survey data indicated possible problems in the J48 decision tree algorithm.","title":"Additional Information"},{"location":"j48_weighter_patch/#the-weighter-filter","text":"Weighter is a general-purpose filter independent of J48 or other classifiers, but to preserve the weight assignment it initially had to be run under FilteredClassifier. To make weights persistent via .arff files, some changes were made in Instances and Instance, while retaining compatibility with the existing ARFF format. Briefly, if Weighter is applied to an attribute, e.g. \"fnlwgt\" in the \"adult\" dataset from the UCI repository, that attribute is removed and its value is used as instance weight. Upon Save, the weight is appended to each instance under the attribute name \"::weight::fnlwgt\"; reading the .arff file inverts the Save process, transparent to the user. Repeated application of Weighter multiplies the weight and extends its name. 
The special case of invoking Weighter without an attribute argument restores the unweighted dataset, with an appended attribute named as above.","title":"The Weighter filter"},{"location":"j48_weighter_patch/#j48-with-instance-weights","text":"The simple rescaling inserted in weka.classifiers.trees.j48.Stats is intended to: use the correct sample size in the normal approximation to the binomial, make the scale of the .5 continuity correction consistent with the data, base the minimum-leaf-count option (-M) on unweighted counts. These changes make pruning more effective with weighted data, and help to reduce apparent overfitting. This should be the case whether the weights reflect missing value imputation (as is common in Weka), or survey-sampling probabilities (e.g. \"fnlwgt\" in the UCI \"adult\" sample). The modification to j48.Stats would not have worked on its own. In particular, j48.Distribution had been written to maintain one set of counts only. To work on weighted data statistical algorithms often require both weighted and unweighted counts. A few other minor modifications were introduced to change the way \"-M\" works. One effect is that, for this purpose, instances with missing x-values are no longer counted; they are considered missing.","title":"J48 with instance weights"},{"location":"java_virtual_machine/","text":"The Java virtual machine (JVM) is the platform dependent interpreter of the Java bytecode (i.e., the classes ). It translates the bytecode into machine specific instructions. Amount of available memory # If you start the virtual machine without any parameters it takes default values for stack and heap. In case you run into OutOfMemory exceptions, try to start your JVM with a bigger maximum heap size. (However, there's a limit, depending on your OS. See the 32-Bit and 64-Bit sections.) 32-bit # With a 32-Bit machine you can address at most 4GB of virtual memory . 
Different operating systems divide up the memory further into system/kernel and user space. From experience, you can achieve the following maximum sizes for the heap on Windows and Linux: Windows: 1.4GB Linux: 1.7GB 64-bit # Larger heap sizes are available when using 64-bit Java in conjunction with a 64-bit operating system. There is more information available here .","title":"Java virtual machine"},{"location":"java_virtual_machine/#amount-of-available-memory","text":"If you start the virtual machine without any parameters it takes default values for stack and heap. In case you run into OutOfMemory exceptions, try to start your JVM with a bigger maximum heap size. (However, there's a limit, depending on your OS. See the 32-Bit and 64-Bit sections.)","title":"Amount of available memory"},{"location":"java_virtual_machine/#32-bit","text":"With a 32-Bit machine you can address at most 4GB of virtual memory . Different operating systems divide up the memory further into system/kernel and user space. From experience, you can achieve the following maximum sizes for the heap on Windows and Linux: Windows: 1.4GB Linux: 1.7GB","title":"32-bit"},{"location":"java_virtual_machine/#64-bit","text":"Larger heap sizes are available when using 64-bit Java in conjunction with a 64-bit operating system. There is more information available here .","title":"64-bit"},{"location":"jupyter_notebooks/","text":"Jupyter notebooks are extremely popular in the Python world, simply because it is great to combine documentation and code in a visually appealing way. Great tool for teaching! Thanks to the IJava kernel and the JDK 9+ JShell feature, it is possible to run Java within Notebooks without compiling the code now as well. 
Installation on Linux # The following worked on Linux Mint 18.2: create a directory called weka-notebooks mkdir weka-notebooks change into the directory and create a Python virtual environment: cd weka-notebooks virtualenv -p /usr/bin/python3.5 venv install Jupyter notebooks and its dependencies: venv/bin/pip install jupyter then download the latest IJava release (at time of writing, this was 1.2.0 ) into this directory unzip the IJava archive: unzip -q ijava*.zip install the Java kernel into the virtual environment, using the IJava installer: venv/bin/python install.py --sys-prefix after that, fire up Jupyter using: venv/bin/jupyter-notebook now you can create new (Java) notebooks! Installation on Windows (using anaconda) # open a command prompt create a new environment using anaconda (e.g., for Python 3.5) conda create -n py35-ijava python=3.5 activate environment activate py35-ijava install Jupyter pip install jupyter download the latest IJava release (at time of writing, this was 1.2.0 ) unzip the IJava release (e.g., with your File browser or 7-Zip) change into the directory where you extracted the release, containing the install.py , e.g.: cd C:\\Users\\fracpete\\Downloads\\ijava-1.2.0 install the kernel python install.py --sys-prefix start Jupyter jupyter-notebook now you can create new (Java) notebooks!","title":"Jupyter notebooks"},{"location":"jupyter_notebooks/#installation-on-linux","text":"The following worked on Linux Mint 18.2: create a directory called weka-notebooks mkdir weka-notebooks change into the directory and create a Python virtual environment: cd weka-notebooks virtualenv -p /usr/bin/python3.5 venv install Jupyter notebooks and its dependencies: venv/bin/pip install jupyter then download the latest IJava release (at time of writing, this was 1.2.0 ) into this directory unzip the IJava archive: unzip -q ijava*.zip install the Java kernel into the virtual environment, using the IJava installer: venv/bin/python install.py --sys-prefix after 
that, fire up Jupyter using: venv/bin/jupyter-notebook now you can create new (Java) notebooks!","title":"Installation on Linux"},{"location":"jupyter_notebooks/#installation-on-windows-using-anaconda","text":"open a command prompt create a new environment using anaconda (e.g., for Python 3.5) conda create -n py35-ijava python=3.5 activate environment activate py35-ijava install Jupyter pip install jupyter download the latest IJava release (at time of writing, this was 1.2.0 ) unzip the IJava release (e.g., with your File browser or 7-Zip) change into the directory where you extracted the release, containing the install.py , e.g.: cd C:\\Users\\fracpete\\Downloads\\ijava-1.2.0 install the kernel python install.py --sys-prefix start Jupyter jupyter-notebook now you can create new (Java) notebooks!","title":"Installation on Windows (using anaconda)"},{"location":"just_in_time_jit_compiler/","text":"For maximum enjoyment, use a virtual machine that incorporates a just-in-time compiler . This can speed things up quite significantly. Note also that there can be large differences in execution time between different virtual machines. 
The Sun JDK/JRE all include a JIT compiler (\"hotspot\").","title":"Just in time jit compiler"},{"location":"jvm/","text":"see Java Virtual Machine","title":"Jvm"},{"location":"knowledge_flow_toolbars_are_empty/","text":"In the terminal, you will most likely see this output as well: Failed to instantiate: weka.gui.beans.Loader This behavior can happen under Gnome with Java 5/6, see GUIChooser starts but not Experimenter or Explorer for a solution.","title":"Knowledge flow toolbars are empty"},{"location":"learning_resources/","text":"Videos # Youtube channel of Data Mining with Weka MOOCs Tutorials # Learn Data Science Online MOOCs # Data Mining with Weka More Data Mining with Weka Advanced Data Mining with Weka","title":"Videos"},{"location":"learning_resources/#videos","text":"Youtube channel of Data Mining with Weka MOOCs","title":"Videos"},{"location":"learning_resources/#tutorials","text":"Learn Data Science Online","title":"Tutorials"},{"location":"learning_resources/#moocs","text":"Data Mining with Weka More Data Mining with Weka Advanced Data Mining with Weka","title":"MOOCs"},{"location":"lib_svm/","text":"Description # Wrapper class for the LibSVM library by Chih-Chung Chang and Chih-Jen Lin. The original wrapper, named WLSVM, was developed by Yasser EL-Manzalawy. The current version is a complete rewrite of the wrapper, using Reflection in order to avoid compilation errors, in case the libsvm.jar is not in the CLASSPATH . Important note: From WEKA >= 3.7.2 installation and use of LibSVM in WEKA has been simplified by the creation of a LibSVM package that can be installed using either the graphical or command line package manager . Reference (Weka <= 3.6.8) # LibSVM WLSVM Package # weka.classifiers.functions Download # The wrapper class has been part of WEKA since version 3.5.2. But LibSVM , as a third-party tool, needs to be downloaded separately. 
It is recommended to upgrade to a post-3.5.3 version (or git ) for bug-fixes and extensions (contains now the distributionForInstance method). CLASSPATH # Add the libsvm.jar from the LibSVM distribution to your CLASSPATH to make it available. Note: Do NOT start WEKA then with java -jar weka.jar . The -jar option overwrites the CLASSPATH , not augments it (a very common trap to fall into). Instead use something like this on Linux: java -classpath $CLASSPATH :weka.jar:libsvm.jar weka.gui.GUIChooser or this on Win32 (if you're starting it from commandline): java -classpath \"%CLASSPATH%;weka.jar;libsvm.jar\" weka.gui.GUIChooser If you're starting WEKA from the Start Menu on Windows, you'll have to add the libsvm.jar to your CLASSPATH environment variable. The following steps are for Windows XP (unfortunately, the GUI changes among the different Windows versions): right-click on My Computer and select Properties from the menu choose the Advanced tab and click on Environment variables at the bottom either add or modify a variable called CLASSPATH and add the libsvm.jar with full path to it Troubleshooting # LibSVM classes not in CLASSPATH! Check whether the libsvm.jar is really in your CLASSPATH. Execute the following command in the SimpleCLI : java weka.core.SystemInfo The property java.class.path must list the libsvm.jar . If it is listed, check whether the path is correct. If you're on Windows and you find %CLASSPATH% there, see next bullet point to fix this. On Windows, if you added the libsvm.jar to your CLASSPATH environment variable, it can still happen that WEKA pops up the error message that the LibSVM classes are not in your CLASSPATH. This can happen where the %CLASSPATH% does not get expanded to its actual value in starting up WEKA. You can inspect your current CLASSPATH with which WEKA got started up with the SimpleCLI (see previous bullet point). If %CLASSPATH% is listed there, your system has the same problem. 
You can also explicitly add a .jar file to RunWeka.ini . Note: backslashes have to be escaped, not only once, but twice (they get interpreted by Java twice!). In other words, instead of one you have to use four : C:\\some\\where then turns into C:\\\\\\\\some\\\\\\\\where . Issues with libsvm.jar that were discussed on the Weka list in April 2007 (and may no longer be relevant) # The following changes were not incorporated in WEKA, since they would also require modifying the LibSVM Java code, which (I think) is autogenerated from the C code. The authors of LibSVM might have to consider that update. It's left to the reader to incorporate these changes. libsvm.svm uses Math.random # libsvm.svm calls Math.random so the model it returns is usually different for the same training set and svm parameters over time. Obviously, if you call libsvm.svm from weka.classifiers.functions.libsvm, and you call it again from libsvm.svm_train, the results are also different. You can use libsvm.svm_save_model to record the svms into files, and then compare the model file from WEKA LibSVM with the model file from libsvm.svm_predict. Then you can see that the ProbA values usually differ. The WEKA Experimenter relies on always using the same random sequences in order to repeat experiments with the same results. So, I'm afraid some important design changes are required in libsvm.jar and weka.classifiers.functions.libsvm.class to keep such behaviour. We made a quick fix by adding a static Random attribute to the libsvm.svm class: static java.util.Random ranGen = new java.util.Random(0); We then changed all Math.random() invocations to ranGen.nextDouble(). With that, we obtained the same svm from WEKA LibSVM as from LibSVM train_svm. However, WEKA accuracy results on primary_tumor data were still worse, so there's something wrong when WEKA uses the svm model at the testing step. Classes without instances # ARFF format provides some meta-information (i.e. 
attributes name and type, set of possible values for nominal attributes), but LibSVM format doesn't. So if there are classes in the dataset with zero occurrences through all the instances, LibSVM thinks that these classes don't exist whereas WEKA knows they exist. For example, there is a class in primary tumor dataset that never appears. When the WEKA Experimenter runs its tests, it calls: public static double svm_predict_probability ( svm_model model , svm_node [] x , double [] prob_estimates ) passing the array prob_estimates filled with zeros (array cells are initialized to zero). The size of the array is equal to the number of classes (= 22). On the other hand, if this method is invoked from libsvm.svm_predict, the class that never appears is ignored, so the array dimension is now equal to 21. So accuracy results are different depending on the origin of the svm_predict_probability method invocation. I think that better results are obtained if classes without instances are ignored, but I don't know if it is very fair. In fact, accuracies from weka.libsvm and from libsvm.predict_svm seem to be the same if the class that never appears is removed from the ARFF file. Note that this problem only appears when testing, because the training code always uses the svm_group_classes method to compute the number of classes, so the Instances.numClasses() value is never used for training. Moreover, maybe the mismatch between the training number of classes and the testing number of classes is the reason behind worse accuracy results when svm_predict_probability invocation is made from WEKA, but I haven't proved it yet. Note that this problem also happens when you have a class with fewer examples than the number of folds. For some folds, the class will not have training examples. 
We also made a quick fix for this problem: Add this public method to the libsvm.svm_model class: public int getNr_class(){return nr_class;} Make the following changes to the distributionForInstance method in weka.classifiers.functions.LibSVM . First line of the method: int[] labels = new int[instance.numClasses()]; could be changed to int[] labels = new int[((svm_model) m_Model).getNr_class()]; Last line in \"if(m_ProbablityEstimates)\" block: prob_estimates = new double[instance.numClasses()]; could be changed to prob_estimates = new double[((svm_model) m_Model).getNr_class()];","title":"Lib svm"},{"location":"lib_svm/#description","text":"Wrapper class for the LibSVM library by Chih-Chung Chang and Chih-Jen Lin. The original wrapper, named WLSVM, was developed by Yasser EL-Manzalawy. The current version is a complete rewrite of the wrapper, using Reflection in order to avoid compilation errors, in case the libsvm.jar is not in the CLASSPATH . Important note: From WEKA >= 3.7.2 installation and use of LibSVM in WEKA has been simplified by the creation of a LibSVM package that can be installed using either the graphical or command line package manager .","title":"Description"},{"location":"lib_svm/#reference-weka-368","text":"LibSVM WLSVM","title":"Reference (Weka <= 3.6.8)"},{"location":"lib_svm/#package","text":"weka.classifiers.functions","title":"Package"},{"location":"lib_svm/#download","text":"The wrapper class has been part of WEKA since version 3.5.2. But LibSVM , as a third-party tool, needs to be downloaded separately. It is recommended to upgrade to a post-3.5.3 version (or git ) for bug-fixes and extensions (it now contains the distributionForInstance method).","title":"Download"},{"location":"lib_svm/#classpath","text":"Add the libsvm.jar from the LibSVM distribution to your CLASSPATH to make it available. Note: Do NOT start WEKA then with java -jar weka.jar . 
The -jar option overwrites the CLASSPATH , not augments it (a very common trap to fall into). Instead use something like this on Linux: java -classpath $CLASSPATH :weka.jar:libsvm.jar weka.gui.GUIChooser or this on Win32 (if you're starting it from commandline): java -classpath \"%CLASSPATH%;weka.jar;libsvm.jar\" weka.gui.GUIChooser If you're starting WEKA from the Start Menu on Windows, you'll have to add the libsvm.jar to your CLASSPATH environment variable. The following steps are for Windows XP (unfortunately, the GUI changes among the different Windows versions): right-click on My Computer and select Properties from the menu choose the Advanced tab and click on Environment variables at the bottom either add or modify a variable called CLASSPATH and add the libsvm.jar with full path to it","title":"CLASSPATH"},{"location":"lib_svm/#troubleshooting","text":"LibSVM classes not in CLASSPATH! Check whether the libsvm.jar is really in your CLASSPATH. Execute the following command in the SimpleCLI : java weka.core.SystemInfo The property java.class.path must list the libsvm.jar . If it is listed, check whether the path is correct. If you're on Windows and you find %CLASSPATH% there, see next bullet point to fix this. On Windows, if you added the libsvm.jar to your CLASSPATH environment variable, it can still happen that WEKA pops up the error message that the LibSVM classes are not in your CLASSPATH. This can happen where the %CLASSPATH% does not get expanded to its actual value in starting up WEKA. You can inspect your current CLASSPATH with which WEKA got started up with the SimpleCLI (see previous bullet point). If %CLASSPATH% is listed there, your system has the same problem. You can also explicitly add a .jar file to RunWeka.ini . Note: backslashes have to be escaped, not only once, but twice (they get interpreted by Java twice!). 
In other words, instead of one you have to use four : C:\\some\\where then turns into C:\\\\\\\\some\\\\\\\\where .","title":"Troubleshooting"},{"location":"lib_svm/#issues-with-libsvmjar-that-were-discussed-on-the-weka-list-in-april-2007-and-may-no-longer-be-relevant","text":"The following changes were not incorporated in WEKA, since they would also require modifying the LibSVM Java code, which (I think) is autogenerated from the C code. The authors of LibSVM might have to consider that update. It's left to the reader to incorporate these changes.","title":"Issues with libsvm.jar that were discussed on the Weka list in April 2007 (and may no longer be relevant)"},{"location":"lib_svm/#libsvmsvm-uses-mathrandom","text":"libsvm.svm calls Math.random so the model it returns is usually different for the same training set and svm parameters over time. Obviously, if you call libsvm.svm from weka.classifiers.functions.libsvm, and you call it again from libsvm.svm_train, the results are also different. You can use libsvm.svm_save_model to record the svms into files, and then compare the model file from WEKA LibSVM with the model file from libsvm.svm_predict. Then you can see that the ProbA values usually differ. The WEKA Experimenter relies on always using the same random sequences in order to repeat experiments with the same results. So, I'm afraid some important design changes are required in libsvm.jar and weka.classifiers.functions.libsvm.class to keep such behaviour. We made a quick fix by adding a static Random attribute to the libsvm.svm class: static java.util.Random ranGen = new java.util.Random(0); We then changed all Math.random() invocations to ranGen.nextDouble(). With that, we obtained the same svm from WEKA LibSVM as from LibSVM train_svm. 
However, WEKA accuracy results on primary_tumor data were still worse, so there's something wrong when WEKA uses the svm model at the testing step.","title":"libsvm.svm uses Math.random"},{"location":"lib_svm/#classes-without-instances","text":"ARFF format provides some meta-information (i.e. attribute names and types, set of possible values for nominal attributes), but LibSVM format doesn't. So if there are classes in the dataset with zero occurrences through all the instances, LibSVM thinks that these classes don't exist whereas WEKA knows they exist. For example, there is a class in primary tumor dataset that never appears. When the WEKA Experimenter runs its tests, it calls: public static double svm_predict_probability ( svm_model model , svm_node [] x , double [] prob_estimates ) passing the array prob_estimates filled with zeros (array cells are initialized to zero). The size of the array is equal to the number of classes (= 22). On the other hand, if this method is invoked from libsvm.svm_predict, the class that never appears is ignored, so the array dimension is now equal to 21. So accuracy results are different depending on the origin of the svm_predict_probability method invocation. I think that better results are obtained if classes without instances are ignored, but I don't know if it is very fair. In fact, accuracies from weka.libsvm and from libsvm.predict_svm seem to be the same if the class that never appears is removed from the ARFF file. Note that this problem only appears when testing, because the training code always uses the svm_group_classes method to compute the number of classes, so the Instances.numClasses() value is never used for training. Moreover, maybe the mismatch between the training number of classes and the testing number of classes is the reason behind worse accuracy results when svm_predict_probability invocation is made from WEKA, but I haven't proved it yet. 
Note that this problem also happens when you have a class with fewer examples than the number of folds. For some folds, the class will not have training examples. We also made a quick fix for this problem: Add this public method to the libsvm.svm_model class: public int getNr_class(){return nr_class;} Make the following changes to the distributionForInstance method in weka.classifiers.functions.LibSVM . First line of the method: int[] labels = new int[instance.numClasses()]; could be changed to int[] labels = new int[((svm_model) m_Model).getNr_class()]; Last line in \"if(m_ProbablityEstimates)\" block: prob_estimates = new double[instance.numClasses()]; could be changed to prob_estimates = new double[((svm_model) m_Model).getNr_class()];","title":"Classes without instances"},{"location":"literature/","text":"Apart from Data Mining: Practical Machine Learning Tools and Techniques , there are several other books with material on Weka: Jason Bell (2020) Machine Learning: Hands-On for Developers and Technical Professionals, Second Edition , Wiley. Richard J. Roiger (2020) Just Enough R! An Interactive Approach to Machine Learning and Analytics , CRC Press. Parteek Bhatia (2019) Data Mining and Data Warehousing Principles and Practical Techniques , Cambridge University Press. Mark Wickham (2018) Practical Java Machine Learning Projects with Google Cloud Platform and Amazon Web Services , APress. AshishSingh Bhatia, Bostjan Kaluza (2018) Machine Learning in Java - Second Edition , Packt Publishing. Richard J. Roiger (2016) Data Mining: A Tutorial-Based Primer , CRC Press. Mei Yu Yuan (2016) Data Mining and Machine Learning: WEKA Technology and Practice , Tsinghua University Press (in Chinese). J\u00fcrgen Cleve, Uwe L\u00e4mmel (2016) Data Mining , De Gruyter (in German). Eric Rochester (2015) Clojure Data Analysis Cookbook - Second Edition , Packt Publishing. Bo\u0161tjan Kalu\u017ea (2013) Instant Weka How-to , Packt Publishing. 
Hongbo Du (2010) Data Mining Techniques and Applications , Cengage Learning. A book explaining why Weka won't learn (discovered by Stuart Inglis).","title":"Literature"},{"location":"mailing_list/","text":"The WEKA Mailing list can be found here: List for subscribing/unsubscribing to the list. Archives for searching previous posted messages. Before posting, please read the mailing list etiquette . Once you have subscribed to the list, you can send posts to the list using the following email address: weka-users@lists.sourceforge.net NB: The mailing list moved to Sourceforge.net in mid-December 2024, due to the university mailman server being decommissioned. You can find the old archives on this mirror .","title":"Mailing list"},{"location":"making_predictions/","text":"Command line # The following sections show how to obtain predictions/classifications without writing your own Java code via the command line. Classifiers # After a model has been saved , one can make predictions for a test set, whether that set contains valid class values or not. The output will contain both the actual and predicted class. (Note that if the test class contains simply '?' for the class label for each instance, the \"actual\" class label for each instance will not contain useful information, but the predicted class label will.) The -T command-line switch specifies the dataset of instances whose classes are to be predicted, while the -p switch allows the user to write out a range of attributes (examples: \"1-2\" for the first and second attributes, or \"0\" for no attributes). Sample command line: java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0 The format of the output is as follows: : : [+| ] where \"+\" occurs only for those items that were mispredicted. Note that if the actual class label is always \"?\" (i.e., the dataset does not include known class labels), the error column will always be empty. Sample output: inst# actual predicted error prediction 1 1:? 
1:0 0.757 2 1:? 1:0 0.824 3 1:? 1:0 0.807 4 1:? 1:0 0.807 5 1:? 1:0 0.79 6 1:? 2:1 0.661 ... In this case, taken directly from a test dataset where all class attributes were marked by \"?\", the \"actual\" column, which can be ignored, simply states that each instance belongs to an unknown class. The \"predicted\" column shows that instances 1 through 5 are predicted to be of class 1, whose value is 0, and instance 6 is predicted to be of class 2, whose value is 1. The error field is empty; if predictions were being performed on a labeled test set, each instance where the prediction failed to match the label would contain a \"+\". The probability that instance 1 actually belongs to class 0 is estimated at 0.757. Notes: Since Weka 3.5.4 you can also output the complete class distribution, not just the prediction, by using the parameter -distribution in conjunction with the -p option. In this case, \"*\" is placed beside the probability in the distribution that corresponds to the predicted class value. If you have an ID attribute in your dataset as first attribute (you can always add one with the AddID filter), you could output it with -p 1 instead of using -p 0 . This works only for explicit train/test sets, but you can use the Explorer for cross-validation. Using the -classifications option instead of -p ... you can also use different output formats, like CSV : -classifications \"weka.classifiers.evaluation.output.prediction.CSV -p ...\" (the -p option takes the indices of the additional attributes to output). 
Filters # The AddClassification filter (package weka.filters.supervised.attribute ) can either train a classifier on the input data and transform this or load a serialized model to transform the input data (even though the filter was introduced in 3.5.4, due to a bug in the commandline option handling, it is recommended to download a version >3.5.5 from the Weka homepage). 
This filter can add the classification, class distribution and the error per row as extra attributes to the dataset. training the classifier, e.g., J48, on the input data and replacing the class values with the ones of the trained classifier: java \\ weka.filters.supervised.attribute.AddClassification \\ -W \"weka.classifiers.trees.J48\" \\ -classification \\ -remove-old-class \\ -i train.arff \\ -o train_classified.arff \\ -c last using a serialized model, e.g., a J48 model, to replace the class values with the ones predicted by the serialized model: java \\ weka.filters.supervised.attribute.AddClassification \\ -serialized /some/where/j48.model \\ -classification \\ -remove-old-class \\ -i train.arff \\ -o train_classified.arff \\ -c last GUI # The Weka GUI also allows you to output predictions based on a previously saved model. Explorer # See the Explorer section of the Saving and loading models article to set up the Explorer. Additionally, you need to check the Output predictions options in the More options dialog. Right-clicking on the respective results history item and selecting Re-evaluate model on current test set will then output the predictions as well (the statistics will be useless due to missing class values in the test set, so just ignore them). The output is similar to the one produced by the commandline. Example output for the anneal UCI dataset: == Predictions on test set == inst#, actual, predicted, error, probability distribution 1 ? 3:3 + 0 0 *1 0 0 0 2 ? 3:3 + 0 0 *1 0 0 0 3 ? 3:3 + 0 0 *1 0 0 0 ... 17 ? 6:U + 0 0 0 0 0 *1 18 ? 6:U + 0 0 0 0 0 *1 19 ? 3:3 + 0 0 *1 0 0 0 20 ? 3:3 + 0 0 *1 0 0 0 ... Note: The developer version (>3.5.6) can also output additional attributes like the commandline with the -p option. In the More options... dialog you can specify those attribute indices with Output additional attributes , e.g., first or 1-7 . In contrast to the commandline, this output also works for cross-validation. 
KnowledgeFlow # Using the PredictionAppender # With the PredictionAppender (from the Evaluation toolbar) you cannot use an already saved model, but you can train a classifier on a dataset and output an ARFF file with the predictions appended as additional attribute. Here's an example setup: /---dataSet--> TrainingSetMaker ---trainingSet--\\ ArffLoader --< >--> J48... \\---dataSet--> TestSetMaker -------testSet------/ ...J48 --batchClassifier--> PredictionAppender --testSet--> ArffSaver Using the AddClassification filter # The AddClassification filter can be used in the KnowledgeFlow as well, either for training a model, or for using a serialized model to perform the predictions. An example setup could look like this: ArffLoader --dataSet--> ClassAssigner --dataSet--> AddClassification --dataSet--> ArffSaver Java # If you want to perform the classification within your own code, see the classifying instances section of this article , explaining the Weka API in general. See also # Saving and loading models Use Weka in your Java code - general information about using the Weka API Using ID attributes Version # The developer version shortly before the release of 3.5.6 was used as basis for this article.","title":"Command line"},{"location":"making_predictions/#command-line","text":"The following sections show how to obtain predictions/classifications without writing your own Java code via the command line.","title":"Command line"},{"location":"making_predictions/#classifiers","text":"After a model has been saved , one can make predictions for a test set, whether that set contains valid class values or not. The output will contain both the actual and predicted class. (Note that if the test class contains simply '?' for the class label for each instance, the \"actual\" class label for each instance will not contain useful information, but the predicted class label will.) 
The -T command-line switch specifies the dataset of instances whose classes are to be predicted, while the -p switch allows the user to write out a range of attributes (examples: \"1-2\" for the first and second attributes, or \"0\" for no attributes). Sample command line: java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0 The format of the output is as follows: : : [+| ] where \"+\" occurs only for those items that were mispredicted. Note that if the actual class label is always \"?\" (i.e., the dataset does not include known class labels), the error column will always be empty. Sample output: inst# actual predicted error prediction 1 1:? 1:0 0.757 2 1:? 1:0 0.824 3 1:? 1:0 0.807 4 1:? 1:0 0.807 5 1:? 1:0 0.79 6 1:? 2:1 0.661 ... In this case, taken directly from a test dataset where all class attributes were marked by \"?\", the \"actual\" column, which can be ignored, simply states that each class belongs to an unknown class. The \"predicted\" column shows that instances 1 through 5 are predicted to be of class 1, whose value is 0, and instance 6 is predicted to be of class 2, whose value is 1. The error field is empty; if predictions were being performed on a labeled test set, each instance where the prediction failed to match the label would contain a \"+\". The probability that instance 1 actually belongs to class 0 is estimated at 0.757. Notes: Since Weka 3.5.4 you can also output the complete class distribution, not just the prediction, by using the parameter -distribution in conjunction with the -p option. In this case, \"*\" is placed beside the probability in the distribution that corresponds to the predicted class value. If you have an ID attribute in your dataset as first attribute (you can always add one with the AddID filter), you could output it with -p 1 instead of using -p 0 . This works only for explicit train/test sets, but you can use the Explorer for cross-validation. Using the -classifications option instead of -p ... 
you can also use different output formats, like CSV : -classifications \"weka.classifiers.evaluation.output.prediction.CSV -p ...\" (the -p option takes the indices of the additional attributes to output).","title":"Classifiers"},{"location":"making_predictions/#filters","text":"The AddClassification filter (package weka.filters.supervised.attribute ) can either train a classifier on the input data and transform this or load a serialized model to transform the input data (even though the filter was introduced in 3.5.4, due to a bug in the commandline option handling, it is recommended to download a version >3.5.5 from the Weka homepage). This filter can add the classification, class distribution and the error per row as extra attributes to the dataset. training the classifier, e.g., J48, on the input data and replacing the class values with the ones of the trained classifier: java \\ weka.filters.supervised.attribute.AddClassification \\ -W \"weka.classifiers.trees.J48\" \\ -classification \\ -remove-old-class \\ -i train.arff \\ -o train_classified.arff \\ -c last * using a serialized model, e.g., a J48 model, to replace the class values with the ones predicted by the serialized model: java \\ weka.filters.supervised.attribute.AddClassification \\ -serialized /some/where/j48.model \\ -classification \\ -remove-old-class \\ -i train.arff \\ -o train_classified.arff \\ -c last","title":"Filters"},{"location":"making_predictions/#gui","text":"The Weka GUI allows you as well to output predictions based on a previously saved model.","title":"GUI"},{"location":"making_predictions/#explorer","text":"See the Explorer section of the Saving and loading models article to setup the Explorer. Additionally, you need to check the Output predictions options in the More options dialog. 
Right-clicking on the respective results history item and selecting Re-evaluate model on current test set will output then the predictions as well (the statistics will be useless due to missing class values in the test set, so just ignore them). The output is similar to the one produced by the commandline. Example output for the anneal UCI dataset: == Predictions on test set == inst#, actual, predicted, error, probability distribution 1 ? 3:3 + 0 0 *1 0 0 0 2 ? 3:3 + 0 0 *1 0 0 0 3 ? 3:3 + 0 0 *1 0 0 0 ... 17 ? 6:U + 0 0 0 0 0 *1 18 ? 6:U + 0 0 0 0 0 *1 19 ? 3:3 + 0 0 *1 0 0 0 20 ? 3:3 + 0 0 *1 0 0 0 ... Note: The developer version (>3.5.6) can also output additional attributes like the commandline with the -p option. In the More options... dialog you can specify those attribute indices with Output additional attributes , e.g., first or 1-7 . In contrast to the commandline, this output also works for cross-validation.","title":"Explorer"},{"location":"making_predictions/#knowledgeflow","text":"","title":"KnowledgeFlow"},{"location":"making_predictions/#using-the-predictionappender","text":"With the PredictionAppender (from the Evaluation toolbar) you cannot use an already saved model, but you can train a classifier on a dataset and output an ARFF file with the predictions appended as additional attribute. Here's an example setup: /---dataSet--> TrainingSetMaker ---trainingSet--\\ ArffLoader --< >--> J48... \\---dataSet--> TestSetMaker -------testSet------/ ...J48 --batchClassifier--> PredictionAppender --testSet--> ArffSaver","title":"Using the PredictionAppender"},{"location":"making_predictions/#using-the-addclassification-filter","text":"The AddClassification filter can be used in the KnowledgeFlow as well, either for training a model, or for using a serialized model to perform the predictions. 
An example setup could look like this: ArffLoader --dataSet--> ClassAssigner --dataSet--> AddClassification --dataSet--> ArffSaver","title":"Using the AddClassification filter"},{"location":"making_predictions/#java","text":"If you want to perform the classification within your own code, see the classifying instances section of this article , explaining the Weka API in general.","title":"Java"},{"location":"making_predictions/#see-also","text":"Saving and loading models Use Weka in your Java code - general information about using the Weka API Using ID attributes","title":"See also"},{"location":"making_predictions/#version","text":"The developer version shortly before the release of 3.5.6 was used as basis for this article.","title":"Version"},{"location":"mathematical_functions/","text":"Mathematical functions applied to dataset instances, like tan, cos, exp, log, and so on, can be achieved using one of the following filters: AddExpression (Stable version) MathExpression (Stable version)","title":"Mathematical functions"},{"location":"maven/","text":"Maven is another build tool. But unlike Ant , it is a more high-level tool. Though its configuration file, pom.xml , is written in XML as well, Maven uses a different approach to the build process. In Ant, you tell it where to find Java classes for compilation, what libraries to compile against, where to put the compiled ones and then how to combine them into a jar. With Maven, you only specify dependent libraries, a compile and a jar plugin and maybe tweak the options a bit. For this to work, Maven enforces a strict directory structure (though you can tweak that, if you need to). So why another build tool? # Ant scripts quite often create a fat jar , i.e., a jar that contains not only the project's code, but also the contents of the libraries the code was compiled against. This is handy if you only want to have a single jar. 
However, this is a nightmare if you need to update a single library, but all you have is a single, enormous jar. Maven handles dependencies automatically , relying on libraries (they call them artifacts) to be publicly available, e.g., on Maven Central . It allows you to use newer versions of libraries than defined by the dependent libraries (e.g., critical bug fixes), without having to modify any jars manually. Though Maven can also generate fat jar files, it is not considered good practice, as it defeats Maven's automatic version resolution. In order to make Weka, and most of its packages, available to a wider audience (e.g., other software developers), we also publish on Maven Central. Compiling # For compiling Weka, you would issue a command like this (in the same directory as pom.xml ): mvn clean install If you don't want the tests to run, use this: mvn clean install -DskipTests=true","title":"Maven"},{"location":"maven/#so-why-another-build-tool","text":"Ant scripts quite often create a fat jar , i.e., a jar that contains not only the project's code, but also the contents of the libraries the code was compiled against. This is handy if you only want to have a single jar. However, this is a nightmare if you need to update a single library, but all you have is a single, enormous jar. Maven handles dependencies automatically , relying on libraries (they call them artifacts) to be publicly available, e.g., on Maven Central . It allows you to use newer versions of libraries than defined by the dependent libraries (e.g., critical bug fixes), without having to modify any jars manually. Though Maven can also generate fat jar files, it is not considered good practice, as it defeats Maven's automatic version resolution. 
In order to make Weka, and most of its packages, available to a wider audience (e.g., other software developers), we also publish on Maven Central.","title":"So why another build tool?"},{"location":"maven/#compiling","text":"For compiling Weka, you would issue a command like this (in the same directory as pom.xml ): mvn clean install If you don't want the tests to run, use this: mvn clean install -DskipTests=true","title":"Compiling"},{"location":"memory_consumption_and_garbage_collector/","text":"The Explorer and Experimenter can print how much memory is available and run the garbage collector. Just right-click on the Status area in the Explorer/Experimenter.","title":"Memory consumption and garbage collector"},{"location":"message_classifier/","text":"In the following you'll find some information about the MessageClassifier from the 2nd edition of the Data Mining book by Witten and Frank. Source code # Depending on the version of the book, download the corresponding version (this article is based on the 2nd edition): 1st Edition: MessageClassifier 2nd Edition: MessageClassifier ( book , stable-3.8 , developer ) Compiling # compile the source code like this, if the weka.jar is already in your CLASSPATH environment variable: javac MessageClassifier.java * otherwise, use this command line (of course, replace /path/to/ with the correct path on your system): javac -classpath /path/to/weka.jar MessageClassifier.java Note: The classpath handling is omitted from here on. Training # If you run the MessageClassifier for the first time, you need to provide labeled examples to build a classifier from, i.e., messages (\" -m \") and the corresponding classes (\" -c \"). Since the data and the model are kept for future use, one has to specify a filename, where the MessageClassifier is serialized to (\" -t \"). 
Here's an example that labels the message email1.txt as miss : java MessageClassifier -m email1.txt -c miss -t messageclassifier.model Repeat this for all the messages you want to have classified. Classifying # Classifying an unseen message is quite straightforward: one just omits the class option (\" -c \"). The following call java MessageClassifier -m email1023.txt -t messageclassifier.model will produce something like this: Message classified as : miss","title":"Message classifier"},{"location":"message_classifier/#source-code","text":"Depending on the version of the book, download the corresponding version (this article is based on the 2nd edition): 1st Edition: MessageClassifier 2nd Edition: MessageClassifier ( book , stable-3.8 , developer )","title":"Source code"},{"location":"message_classifier/#compiling","text":"compile the source code like this, if the weka.jar is already in your CLASSPATH environment variable: javac MessageClassifier.java * otherwise, use this command line (of course, replace /path/to/ with the correct path on your system): javac -classpath /path/to/weka.jar MessageClassifier.java Note: The classpath handling is omitted from here on.","title":"Compiling"},{"location":"message_classifier/#training","text":"If you run the MessageClassifier for the first time, you need to provide labeled examples to build a classifier from, i.e., messages (\" -m \") and the corresponding classes (\" -c \"). Since the data and the model are kept for future use, one has to specify a filename, where the MessageClassifier is serialized to (\" -t \"). Here's an example that labels the message email1.txt as miss : java MessageClassifier -m email1.txt -c miss -t messageclassifier.model Repeat this for all the messages you want to have classified.","title":"Training"},{"location":"message_classifier/#classifying","text":"Classifying an unseen message is quite straightforward: one just omits the class option (\" -c \"). 
The following call java MessageClassifier -m email1023.txt -t messageclassifier.model will produce something like this: Message classified as : miss","title":"Classifying"},{"location":"metacost/","text":"This metaclassifier makes its base classifier cost-sensitive using the method specified in: Pedro Domingos: MetaCost: A general method for making classifiers cost-sensitive. In: Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, 1999. This classifier should produce similar results to one created by passing the base learner to Bagging, which is in turn passed to a CostSensitiveClassifier operating on minimum expected cost. The difference is that MetaCost produces a single cost-sensitive classifier of the base learner, giving the benefits of fast classification and interpretable output (if the base learner itself is interpretable). This implementation uses all bagging iterations when reclassifying training data (the MetaCost paper reports a marginal improvement when only those iterations containing each training instance are used in reclassifying that instance). Examples # The following cost matrix is used for a 3-class problem: -3 1 1 1 -6 1 0 0 0 MetaCost will compute the costs ( Costs ) based on the class distribution the bagged base learner returns ( Class probs ) and select the class with the lowest cost ( Chosen class ): +---------------+-----------------+--------------+ | Class probs | Costs | Chosen class | +---------------+-----------------+--------------+ | 1.0, 0.0, 0.0 | -3.0, 1.0, 1.0 | 1 | | 0.0, 1.0, 0.0 | 1.0, -6.0, 1.0 | 2 | | 0.0, 0.0, 1.0 | 0.0, 0.0, 0.0 | 1 * | | 0.7, 0.1, 0.2 | -2.0, 0.1, 0.8 | 1 | | 0.2, 0.7, 0.1 | 0.1, -4.0, 0.9 | 2 | | 0.1, 0.2, 0.7 | -0.1, -1.1, 0.3 | 2 | +---------------+-----------------+--------------+ * in case of a tie, the first one will be picked. 
See also # CostSensitiveClassifier CostMatrix Links # Publication on CiteSeer","title":"Metacost"},{"location":"metacost/#examples","text":"The following cost matrix is used for a 3-class problem: -3 1 1 1 -6 1 0 0 0 MetaCost will compute the costs ( Costs ) based on the class distribution the bagged base learner returns ( Class probs ) and select the class with the lowest cost ( Chosen class ): +---------------+-----------------+--------------+ | Class probs | Costs | Chosen class | +---------------+-----------------+--------------+ | 1.0, 0.0, 0.0 | -3.0, 1.0, 1.0 | 1 | | 0.0, 1.0, 0.0 | 1.0, -6.0, 1.0 | 2 | | 0.0, 0.0, 1.0 | 0.0, 0.0, 0.0 | 1 * | | 0.7, 0.1, 0.2 | -2.0, 0.1, 0.8 | 1 | | 0.2, 0.7, 0.1 | 0.1, -4.0, 0.9 | 2 | | 0.1, 0.2, 0.7 | -0.1, -1.1, 0.3 | 2 | +---------------+-----------------+--------------+ * in case of a tie, the first one will be picked.","title":"Examples"},{"location":"metacost/#see-also","text":"CostSensitiveClassifier CostMatrix","title":"See also"},{"location":"metacost/#links","text":"Publication on CiteSeer","title":"Links"},{"location":"ms_sql_server_2000_desktop_engine/","text":"Installation # Download the Desktop Engine (see Links ) Extract the files by running the downloaded executable Edit the setup.ini file and add a strong password for the sa account: SAPWD=*password* Note: the default password is empty, which can prevent the setup from continuing the installation Run the setup Testing # This article lists Java code for testing the connection Troubleshooting # Error Establishing Socket with JDBC Driver Add TCP/IP to the list of protocols as stated in this article Login failed for user 'sa'. Reason: Not associated with a trusted SQL Server connection. 
For changing the authentication to mixed mode see this article Links # Microsoft SQL Server 2000 (Desktop Engine) Microsoft SQL Server 2000 JDBC Driver SP 3","title":"Installation"},{"location":"ms_sql_server_2000_desktop_engine/#installation","text":"Download the Desktop Engine (see Links ) Extract the files by running the downloaded executable Edit the setup.ini file and add a strong password for the sa account: SAPWD=*password* Note: the default password is empty, which can prevent the setup from continuing the installation Run the setup","title":"Installation"},{"location":"ms_sql_server_2000_desktop_engine/#testing","text":"This article lists Java code for testing the connection","title":"Testing"},{"location":"ms_sql_server_2000_desktop_engine/#troubleshooting","text":"Error Establishing Socket with JDBC Driver Add TCP/IP to the list of protocols as stated in this article Login failed for user 'sa'. Reason: Not associated with a trusted SQL Server connection. For changing the authentication to mixed mode see this article","title":"Troubleshooting"},{"location":"ms_sql_server_2000_desktop_engine/#links","text":"Microsoft SQL Server 2000 (Desktop Engine) Microsoft SQL Server 2000 JDBC Driver SP 3","title":"Links"},{"location":"mtj_with_nvblas/","text":"(The following is based on a post from Eibe Frank on the Weka mailing list.) 
Here is an example of running MTJ with NVBLAS (NVIDIA's BLAS wrapper) on Ubuntu: Installed https://prdownloads.sourceforge.net/weka/weka-3-8-6-azul-zulu-linux.zip in /home/eibe/Desktop Ran ~/Desktop/weka-3-8-6/weka.sh -main weka.core.WekaPackageManager -install-package netlibNativeLinux To install CPU-based system BLAS/LAPACK, ran sudo apt-get install libblas-dev liblapack-dev sudo ln -s /usr/lib/x86_64-linux-gnu/libblas.so.3 /usr/lib/libblas.so.3 sudo ln -s /usr/lib/x86_64-linux-gnu/liblapack.so.3 /usr/lib/liblapack.so.3 Downloaded and installed CUDA 11.6 from https://developer.nvidia.com/cuda-downloads Copied example nvblas.conf from https://docs.nvidia.com/cuda/nvblas/ into local directory using cat > nvblas.conf Edited nvblas.conf to have NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/blas/libblas.so.3 Now, by adapting what's given at https://github.com/fommil/netlib-java/wiki/NVBLAS , issued export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:/usr/lib/x86_64-linux-gnu/blas/libblas.so.3 Then, ~/Desktop/weka-3-8-6/weka.sh -main weka.Run .RandomRBF -a 5000 > RandomRBF.a5000.arff LD_PRELOAD=libnvblas.so ~/Desktop/weka-3-8-6/weka.sh -main weka.Run .attributeSelection.PrincipalComponents -i RandomRBF.a5000.arff Observation: Memory is being allocated on the GPU. Looking at nvblas.log , the GPU is used, but only for some dgemm operations. However, according to https://docs.nvidia.com/cuda/nvblas/ , the tremm operation (which is executed on the CPU) should also be supported by the GPU.","title":"Mtj with nvblas"},{"location":"multi_instance_classification/","text":"Multi-instance (MI) classification is a supervised learning technique, but differs from normal supervised learning: it has multiple instances in an example only one class label is observable for all the instances in an example Classifiers # Multi-instance classifiers were originally available through a separate software package, Multi-Instance Learning Kit (= MILK). 
Weka handles relational attributes now natively since 3.5.3 and the multi-instance classifiers are available through the multiInstanceLearning package and filters through the multiInstanceFilters . Once the packages have been installed, the classifiers can be found in the following package: weka.classifiers.mi Data format # The data format for multi-instance classifiers is fairly simple: bag-id - nominal attribute; unique identifier for each bag bag - relational attribute; contains the instances of an example class - the class label for the examples Weka offers two filters to convert from flat file format (or propositional format), which is normally used in supervised classification, to multi-instance format and vice versa: weka.filters.unsupervised.attribute.PropositionalToMultiInstance weka.filters.unsupervised.attribute.MultiInstanceToPropositional Here is an example of the musk1 UCI dataset, used quite often in publications covering MI learning (Note: ... denotes omission): propositional format: This ARFF file lists all the attributes, molecule_name (which is the bag-id), f1 to f166 (containing the actual data of the instances) and the class attribute. @relation musk1 @attribute molecule_name {MUSK-jf78,MUSK-jf67,MUSK-jf59,...,NON-MUSK-199} @attribute f1 numeric @attribute f2 numeric @attribute f3 numeric @attribute f4 numeric @attribute f5 numeric ... @attribute f166 numeric @attribute class {0,1} @data MUSK-188,42,-198,-109,-75,-117,11,23,-88,-28,-27,...,48,-37,6,30,1 MUSK-188,42,-191,-142,-65,-117,55,49,-170,-45,5,...,48,-37,5,30,1 ... multi-instance format: Using the relational attribute, one only has three attributes on the first level: molecule_name , bag and class . The relational attribute contains the instances for each example, consisting of the attributes f1 to f166 . The data of the relational attribute is surrounded by quotes and the single instances inside the bag are separated by line-feeds (= \\n ). 
@relation musk1 @attribute molecule_name {MUSK-jf78,MUSK-jf67,MUSK-jf59,...,NON-MUSK-199} @attribute bag relational @attribute f1 numeric @attribute f2 numeric @attribute f3 numeric @attribute f4 numeric @attribute f5 numeric ... @attribute f166 numeric @end bag @attribute class {0,1} @data MUSK-188,\"42,-198,-109,-75,-117,11,23,-88,-28,-27,...,48,-37,6,30\\n42,-191,-142,-65,-117,55,49,-170,-45,5,...,48,-37,5,30\\n...\",1 ... See also # Use Weka in your Java code - general article about using the Weka API Creating an ARFF file - explains how to create an ARFF file from within Java, incl. relational attributes Links # Xin Xu. Statistical learning in multiple instance problem. Master's thesis, University of Waikato, Hamilton, NZ, 2003. 0657.594. Download MILK homepage multiInstanceLearning Javadoc multiInstanceFilters Javadoc","title":"Multi instance classification"},{"location":"multi_instance_classification/#classifiers","text":"Multi-instance classifiers were originally available through a separate software package, Multi-Instance Learning Kit (= MILK). Weka handles relational attributes now natively since 3.5.3 and the multi-instance classifiers are available through the multiInstanceLearning package and filters through the multiInstanceFilters . 
Once the packages have been installed, the classifiers can be found in the following package: weka.classifiers.mi","title":"Classifiers"},{"location":"multi_instance_classification/#data-format","text":"The data format for multi-instance classifiers is fairly simple: bag-id - nominal attribute; unique identifier for each bag bag - relational attribute; contains the instances of an example class - the class label for the examples Weka offers two filters to convert from flat file format (or propositional format), which is normally used in supervised classification, to multi-instance format and vice versa: weka.filters.unsupervised.attribute.PropositionalToMultiInstance weka.filters.unsupervised.attribute.MultiInstanceToPropositional Here is an example of the musk1 UCI dataset, used quite often in publications covering MI learning (Note: ... denotes omission): propositional format: This ARFF file lists all the attributes, molecule_name (which is the bag-id), f1 to f166 (containing the actual data of the instances) and the class attribute. @relation musk1 @attribute molecule_name {MUSK-jf78,MUSK-jf67,MUSK-jf59,...,NON-MUSK-199} @attribute f1 numeric @attribute f2 numeric @attribute f3 numeric @attribute f4 numeric @attribute f5 numeric ... @attribute f166 numeric @attribute class {0,1} @data MUSK-188,42,-198,-109,-75,-117,11,23,-88,-28,-27,...,48,-37,6,30,1 MUSK-188,42,-191,-142,-65,-117,55,49,-170,-45,5,...,48,-37,5,30,1 ... multi-instance format: Using the relational attribute, one only has three attributes on the first level: molecule_name , bag and class . The relational attribute contains the instances for each example, consisting of the attributes f1 to f166 . The data of the relational attribute is surrounded by quotes and the single instances inside the bag are separated by line-feeds (= \\n ). 
@relation musk1 @attribute molecule_name {MUSK-jf78,MUSK-jf67,MUSK-jf59,...,NON-MUSK-199} @attribute bag relational @attribute f1 numeric @attribute f2 numeric @attribute f3 numeric @attribute f4 numeric @attribute f5 numeric ... @attribute f166 numeric @end bag @attribute class {0,1} @data MUSK-188,\"42,-198,-109,-75,-117,11,23,-88,-28,-27,...,48,-37,6,30\\n42,-191,-142,-65,-117,55,49,-170,-45,5,...,48,-37,5,30\\n...\",1 ...","title":"Data format"},{"location":"multi_instance_classification/#see-also","text":"Use Weka in your Java code - general article about using the Weka API Creating an ARFF file - explains how to create an ARFF file from within Java, incl. relational attributes","title":"See also"},{"location":"multi_instance_classification/#links","text":"Xin Xu. Statistical learning in multiple instance problem. Master's thesis, University of Waikato, Hamilton, NZ, 2003. 0657.594. Download MILK homepage multiInstanceLearning Javadoc multiInstanceFilters Javadoc","title":"Links"},{"location":"not_so_faq/","text":"Associators # How do I use the associator GeneralizedSequentialPatterns? Classifiers # What do those numbers mean in a J48 tree?","title":"Not so FAQ"},{"location":"not_so_faq/#associators","text":"How do I use the associator GeneralizedSequentialPatterns?","title":"Associators"},{"location":"not_so_faq/#classifiers","text":"What do those numbers mean in a J48 tree?","title":"Classifiers"},{"location":"optimizing_parameters/","text":"Since finding the optimal parameters for a classifier can be a rather tedious process, Weka offers some ways of automating this process a bit. 
The following meta-classifiers allow you to optimize some parameters of your base classifier: weka.classifiers.meta.CVParameterSelection weka.classifiers.meta.GridSearch (only developer version) weka.classifiers.meta.MultiSearch ( external package for 3.7.11+) Auto-WEKA ( external package for 3.7.13+) After finding the best possible setup, the meta-classifiers then train an instance of the base classifier with these parameters and use it for subsequent predictions. CVParameterSelection # This meta-classifier can optimize over an arbitrary number of parameters, with only one drawback (apart from the obvious explosion of possible parameter combinations): one cannot optimize on nested options, only direct options of the base classifier. What does that mean? It means that you can optimize the C parameter of weka.classifiers.functions.SMO , but not the C of a weka.classifiers.functions.SMO within a weka.classifiers.meta.FilteredClassifier . Here are a few examples: J48 and its confidence factor (\"-C\") load your dataset in the Explorer choose weka.classifiers.meta.CVParameterSelection as classifier select weka.classifiers.trees.J48 as base classifier within CVParameterSelection open the ArrayEditor for CVParameters and enter the following string (and click on Add ): C 0.1 0.5 5 - This will test the confidence parameter from 0.1 to 0.5 with step size 0.1 (= 5 steps) close dialogs and start the classifier you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. 
Classifier: weka.classifiers.trees.J48 Cross-validation Parameter: '-C' ranged from 0.1 to 0.5 with 5.0 steps Classifier Options: **-C 0.1** -M 2 SMO and its complexity parameter (\"-C\") load your dataset in the Explorer choose weka.classifiers.meta.CVParameterSelection as classifier select weka.classifiers.functions.SMO as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel open the ArrayEditor for CVParameters and enter the following string (and click on Add ): C 2 8 4 This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps) * close dialogs and start the classifier * you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. Classifier: weka.classifiers.functions.SMO Cross-validation Parameter: '-C' ranged from 2.0 to 8.0 with 4.0 steps Classifier Options: **-C 8** -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01\" * LibSVM and the gamma parameter of the RBF kernel (\"-G\") * load your dataset in the Explorer * choose weka.classifiers.meta.CVParameterSelection as classifier * select [weka.classifiers.functions.LibSVM](lib_svm.md) as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel * open the ArrayEditor for CVParameters and enter the following string (and click on Add ): G 0.01 0.1 10 This will iterate over the gamma parameter, using values from 0.01 to 0.1 (= 10 steps) * close dialogs and start the classifier * you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. 
Classifier: weka.classifiers.functions.LibSVM Cross-validation Parameter: '-G' ranged from 0.01 to 0.1 with 10.0 steps Classifier Options: **-G 0.09** -S 0 -K 2 -D 3 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1 GridSearch # weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the grid in the name. If one turns the log on, the classifier will create output suitable for gnuplot , i.e., sections of the log will contain script and data sections. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each). In contrast to CVParameterSelection , GridSearch is not limited to first-level parameters of the base classifier, since it uses Java Beans Introspection and one can specify paths to the properties one wants to optimize. A property here is the string of the parameter displayed in the GenericObjectEditor (generated through Introspection), e.g., bagSizePercent or classifier of weka.classifiers.meta.Bagging . Due to some important bugfixes, one should obtain a version of Weka >3.5.6 later than 11 Sept 2007. For each of the two axes, X and Y, one can specify the following parameters: property The dot-separated path pointing to the property to be optimized. In order to distinguish between paths for the filter or the classifier, one needs to prefix the path either with filter. or classifier. for filter or classifier path respectively. expression The mathematical expression to generate the value for the property, processed with the weka.core.MathematicalExpression class, which supports the following functions: abs , sqrt , log , exp , sin , cos , tan , rint , floor , pow , ceil . These variables are available in the expression: BASE , FROM , TO , STEP , I ; with I ranging from FROM to TO . min The minimum value to start from. max The maximum value. step The step size used to get from min to max . base Used in pow() calculations. 
GridSearch can also optimize based on the following measures: Correlation coefficient (= CC) Root mean squared error (= RMSE) Root relative squared error (= RRSE) Mean absolute error (= MAE) Root absolute error (= RAE) Combined: (1-abs(CC)) + RRSE + RAE Accuracy (= ACC) Kappa (= KAP) [only when using Weka packages] Note: Correlation coefficient is only available for numeric classes and Accuracy only for nominal ones. Here are some examples (taken from the Javadoc of the classifier): Optimizing SMO with RBFKernel (C and gamma) Start the Explorer and load your dataset with nominal class. Set the evaluation to Accuracy . Set the filter to weka.filters.AllFilter since we don't need any special data processing and we don't optimize the filter in this case (data always gets passed through the filter!). Set weka.classifiers.functions.SMO as classifier with weka.classifiers.functions.supportVector.RBFKernel as kernel. Set the XProperty to \"classifier.c\", XMin to \"1\", XMax to \"16\", XStep to \"1\" and the XExpression to \"I\". This will test the \"C\" parameter of SMO for the values from 1 to 16. Set the YProperty to \"classifier.kernel.gamma\", YMin to \"-5\", YMax to \"2\", YStep to \"1\", YBase to \"10\" and YExpression to \"pow(BASE,I)\". This will test the gamma of the RBFKernel with the values 10^-5, 10^-4, ..., 10^2. Output will be similar to this one here: Filter: weka.filters.AllFilter Classifier: weka.classifiers.functions.SMO -C 2.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.0\" X property: classifier.c Y property: classifier.kernel.gamma Evaluation: Accuracy Coordinates: [2.0, 0.0] Values: **2.0** (X coordinate), **1.0** (Y coordinate) * Optimizing PLSFilter with LinearRegression (# of components and ridge) - default setup * Start the Explorer and load your dataset with numeric class. * Set the evaluation to Correlation coefficient. 
* Set the filter to weka.filters.supervised.attribute.PLSFilter . * Set weka.classifiers.functions.LinearRegression as classifier and use no attribute selection and no elimination of collinear attributes (speeds up LinearRegression significantly!). * Set the XProperty to \"filter.numComponents\", XMin to \"5\", XMax to \"20\" (this depends heavily on your dataset, should be no more than the number of attributes!), XStep to \"1\" and XExpression to \"I\". This will test the number of components the PLSFilter will produce from 5 to 20. * Set the YProperty to \"classifier.ridge\", YMin to \"-10\", YMax to \"5\", YStep to \"1\" and YExpression to \"pow(BASE,I)\". This will try ridge parameters from 10^-10 to 10^5. * Output will be similar to this one: Filter: weka.filters.supervised.attribute.PLSFilter -C 5 -M -A PLS1 -P center Classifier: weka.classifiers.functions.LinearRegression -S 1 -C -R 5.0 X property: filter.numComponents Y property: classifier.ridge Evaluation: Correlation coefficient Coordinates: [5.0, 5.0] Values: **5.0** (X coordinate), **100000.0** (Y coordinate) Notes: a property for the classifier starts with classifier. a property for the filter starts with filter. Arrays of objects are addressed with [ ] , with the index being 0-based. E.g., using a weka.filters.MultiFilter in GridSearch consisting of a ReplaceMissingValues and a PLSFilter filter one can address the numComponents property of the PLSFilter with filter.filter[1].numComponents MultiSearch # weka.classifiers.meta.MultiSearch is available through this Weka package (requires Weka 3.7.11 or later; for downloads see the Releases section). MultiSearch is similar to GridSearch, more general and simpler at the same time. More general, because it allows the optimization of an arbitrary number of parameters, not just two. Simpler, because it does not offer any search space expansions or gnuplot output and has fewer options. For each parameter to optimize, the user has to define a search parameter . 
There are two types of parameters available: MathParameter - basically what GridSearch uses, with an expression to calculate the actual value using the min, max and step parameters ListParameter - the blank-separated list of values is used as input for the optimization (useful, if values cannot be described by a mathematical function) Here is a setup for finding the best ridge parameter (property classifier.ridge ) using the MathParameter search parameter using values from 10^-10 to 10^5: weka.classifiers.meta.MultiSearch \\ -E CC \\ -search \"weka.core.setupgenerator.MathParameter -property classifier.ridge -min -10.0 -max 5.0 -step 1.0 -base 10.0 -expression pow(BASE,I)\" \\ -sample-size 100.0 -initial-folds 2 -subsequent-folds 10 -num-slots 1 -S 1 \\ -W weka.classifiers.functions.LinearRegression -- -S 1 -C -R 1.0E-8 And here using the ListParameter search parameter for evaluating values 0.001, 0.05, 0.1, 0.5, 0.75 and 1.0 for the ridge parameter (property classifier.ridge ): weka.classifiers.meta.MultiSearch \\ -E CC \\ -search \"weka.core.setupgenerator.ListParameter -property classifier.ridge -list \\\"0.001 0.05 0.1 0.5 0.75 1.0\\\"\" \\ -sample-size 100.0 -initial-folds 2 -subsequent-folds 10 -num-slots 1 -S 1 \\ -W weka.classifiers.functions.LinearRegression -- -S 1 -C -R 1.0E-8 MultiSearch can be optimized based on the following measures: Correlation coefficient (= CC) Root mean squared error (= RMSE) Root relative squared error (= RRSE) Mean absolute error (= MAE) Root absolute error (= RAE) Combined: (1-abs(CC)) + RRSE + RAE Accuracy (= ACC) Kappa (= KAP) Auto-WEKA # Auto-WEKA is available as a package through the WEKA package manager. It provides the class weka.classifiers.meta.AutoWEKAClassifier and optimizes all parameters of all learners. It also automatically determines the best learner to use and the best attribute selection method for a given dataset. More information is available on the project website and the manual . 
Downloads # CVParam.java - optimizes J48's -C parameter See also # LibSVM - you need additional jars in your CLASSPATH to be able to use LibSVM Links # gnuplot homepage Java Beans Introspection","title":"Optimizing parameters"},{"location":"optimizing_parameters/#cvparameterselection","text":"This meta-classifier can optimize over an arbitrary number of parameters, with only one drawback (apart from the obvious explosion of possible parameter combinations): one cannot optimize on nested options, only direct options of the base classifier. What does that mean? It means that you can optimize the C parameter of weka.classifiers.functions.SMO , but not the C of a weka.classifiers.functions.SMO within a weka.classifiers.meta.FilteredClassifier . Here are a few examples: J48 and its confidence interval (\"-C\") load your dataset in the Explorer choose weka.classifiers.meta.CVParameterSelection as classifier select weka.classifiers.trees.J48 as base classifier within CVParameterSelection open the ArrayEditor for CVParameters and enter the following string (and click on Add ): C 0.1 0.5 5 - This will test the confidence parameter from 0.1 to 0.5 with step size 0.1 (= 5 steps) close dialogs and start the classifier you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. 
Classifier: weka.classifiers.trees.J48 Cross-validation Parameter: '-C' ranged from 0.1 to 0.5 with 5.0 steps Classifier Options: **-C 0.1** -M 2 SMO and its complexity parameter (\"-C\") load your dataset in the Explorer choose weka.classifiers.meta.CVParameterSelection as classifier select weka.classifiers.functions.SMO as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel open the ArrayEditor for CVParameters and enter the following string (and click on Add ): C 2 8 4 This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps) * close dialogs and start the classifier * you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. Classifier: weka.classifiers.functions.SMO Cross-validation Parameter: '-C' ranged from 2.0 to 8.0 with 4.0 steps Classifier Options: **-C 8** -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01\" * LibSVM and the gamma parameter of the RBF kernel (\"-G\") * load your dataset in the Explorer * choose weka.classifiers.meta.CVParameterSelection as classifier * select [weka.classifiers.functions.LibSVM](lib_svm.md) as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel * open the ArrayEditor for CVParameters and enter the following string (and click on Add ): G 0.01 0.1 10 This will iterate over the gamma parameter, using values from 0.01 to 0.1 (= 10 steps) * close dialogs and start the classifier * you will get output similar to this one, with the best parameters found in bold: Cross-validated Parameter selection. 
Classifier: weka.classifiers.functions.LibSVM Cross-validation Parameter: '-G' ranged from 0.01 to 0.1 with 10.0 steps Classifier Options: **-G 0.09** -S 0 -K 2 -D 3 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1","title":"CVParameterSelection"},{"location":"optimizing_parameters/#gridsearch","text":"weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the grid in the name. If one turns the log on, the classifier will create output suitable for gnuplot , i.e., sections of the log will contain script and data sections. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each). In contrast to CVParameterSelection , GridSearch is not limited to first-level parameters of the base classifier, since it's using Java Beans Introspection and one can specify paths to the properties one wants to optimize. A property here is the string of the parameter displayed in the GenericObjectEditor (generated through Introspection), e.g., bagSizePercent or classifier of weka.classifiers.meta.Bagging . Due to some important bugfixes, one should obtain a version of Weka >3.5.6 later than 11 Sept 2007. For each of the two axes, X and Y, one can specify the following parameters: property The dot-separated path pointing to the property to be optimized. In order to distinguish between paths for the filter or the classifier, one needs to prefix the path either with filter. or classifier. for filter or classifier path respectively. expression The mathematical expression to generate the value for the property, processed with the weka.core.MathematicalExpression class, which supports the following functions: abs , sqrt , log , exp , sin , cos , tan , rint , floor , pow , ceil . These variables are available in the expression: BASE , FROM , TO , STEP , I ; with I ranging from FROM to TO . min The minimum value to start from. max The maximum value. 
step The step size used to get from min to max . base Used in pow() calculations. GridSearch can also optimize based on the following measures: Correlation coefficient (= CC) Root mean squared error (= RMSE) Root relative squared error (= RRSE) Mean absolute error (= MAE) Root absolute error (= RAE) Combined: (1-abs(CC)) + RRSE + RAE Accuracy (= ACC) Kappa (= KAP) [only when using Weka packages] Note: Correlation coefficient is only available for numeric classes and Accuracy only for nominal ones. Here are some examples (taken from the Javadoc of the classifier): Optimizing SMO with RBFKernel (C and gamma) Start the Explorer and load your dataset with nominal class. Set the evaluation to Accuracy . Set the filter to weka.filters.AllFilter since we don't need any special data processing and we don't optimize the filter in this case (data always gets passed through the filter!). Set weka.classifiers.functions.SMO as classifier with weka.classifiers.functions.supportVector.RBFKernel as kernel. Set the XProperty to \"classifier.c\", XMin to \"1\", XMax to \"16\", XStep to \"1\" and the XExpression to \"I\". This will test the \"C\" parameter of SMO for the values from 1 to 16. Set the YProperty to \"classifier.kernel.gamma\", YMin to \"-5\", YMax to \"2\", YStep to \"1\", YBase to \"10\" and YExpression to \"pow(BASE,I)\". This will test the gamma of the RBFKernel with the values 10^-5, 10^-4, ..., 10^2. Output will be similar to this one here: Filter: weka.filters.AllFilter Classifier: weka.classifiers.functions.SMO -C 2.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.0\" X property: classifier.c Y property: classifier.kernel.gamma Evaluation: Accuracy Coordinates: [2.0, 0.0] Values: **2.0** (X coordinate), **1.0** (Y coordinate) * Optimizing PLSFilter with LinearRegression (# of components and ridge) - default setup * Start the Explorer and load your dataset with numeric class. 
* Set the evaluation to Correlation coefficient. * Set the filter to weka.filters.supervised.attribute.PLSFilter . * Set weka.classifiers.functions.LinearRegression as classifier and use no attribute selection and no elimination of collinear attributes (speeds up LinearRegression significantly!). * Set the XProperty to \"filter.numComponents\", XMin to \"5\", XMax to \"20\" (this depends heavily on your dataset, should be no more than the number of attributes!), XStep to \"1\" and XExpression to \"I\". This will test the number of components the PLSFilter will produce from 5 to 20. * Set the YProperty to \"classifier.ridge\", YMin to \"-10\", YMax to \"5\", YStep to \"1\" and YExpression to \"pow(BASE,I)\". This will try ridge parameters from 10^-10 to 10^5. * Output will be similar to this one: Filter: weka.filters.supervised.attribute.PLSFilter -C 5 -M -A PLS1 -P center Classifier: weka.classifiers.functions.LinearRegression -S 1 -C -R 5.0 X property: filter.numComponents Y property: classifier.ridge Evaluation: Correlation coefficient Coordinates: [5.0, 5.0] Values: **5.0** (X coordinate), **100000.0** (Y coordinate) Notes: a property for the classifier starts with classifier. a property for the filter starts with filter. Arrays of objects are addressed with [ ] , with the index being 0-based. E.g., using a weka.filters.MultiFilter in GridSearch consisting of a ReplaceMissingValues and a PLSFilter filter one can address the numComponents property of the PLSFilter with filter.filter[1].numComponents","title":"GridSearch"},{"location":"optimizing_parameters/#multisearch","text":"weka.classifiers.meta.MultiSearch is available through this Weka package (requires Weka 3.7.11 or later; for downloads see the Releases section). MultiSearch is similar to GridSearch, more general and simpler at the same time. More general, because it allows the optimization of an arbitrary number of parameters, not just two. 
Simpler, because it does not offer any search space expansions or gnuplot output and has fewer options. For each parameter to optimize, the user has to define a search parameter . There are two types of parameters available: MathParameter - basically what GridSearch uses, with an expression to calculate the actual value using the min, max and step parameters ListParameter - the blank-separated list of values is used as input for the optimization (useful, if values cannot be described by a mathematical function) Here is a setup for finding the best ridge parameter (property classifier.ridge ) using the MathParameter search parameter using values from 10^-10 to 10^5: weka.classifiers.meta.MultiSearch \\ -E CC \\ -search \"weka.core.setupgenerator.MathParameter -property classifier.ridge -min -10.0 -max 5.0 -step 1.0 -base 10.0 -expression pow(BASE,I)\" \\ -sample-size 100.0 -initial-folds 2 -subsequent-folds 10 -num-slots 1 -S 1 \\ -W weka.classifiers.functions.LinearRegression -- -S 1 -C -R 1.0E-8 And here using the ListParameter search parameter for evaluating values 0.001, 0.05, 0.1, 0.5, 0.75 and 1.0 for the ridge parameter (property classifier.ridge ): weka.classifiers.meta.MultiSearch \\ -E CC \\ -search \"weka.core.setupgenerator.ListParameter -property classifier.ridge -list \\\"0.001 0.05 0.1 0.5 0.75 1.0\\\"\" \\ -sample-size 100.0 -initial-folds 2 -subsequent-folds 10 -num-slots 1 -S 1 \\ -W weka.classifiers.functions.LinearRegression -- -S 1 -C -R 1.0E-8 MultiSearch can be optimized based on the following measures: Correlation coefficient (= CC) Root mean squared error (= RMSE) Root relative squared error (= RRSE) Mean absolute error (= MAE) Root absolute error (= RAE) Combined: (1-abs(CC)) + RRSE + RAE Accuracy (= ACC) Kappa (= KAP)","title":"MultiSearch"},{"location":"optimizing_parameters/#auto-weka","text":"Auto-WEKA is available as a package through the WEKA package manager. 
It provides the class weka.classifiers.meta.AutoWEKAClassifier and optimizes all parameters of all learners. It also automatically determines the best learner to use and the best attribute selection method for a given dataset. More information is available on the project website and the manual .","title":"Auto-WEKA"},{"location":"optimizing_parameters/#downloads","text":"CVParam.java - optimizes J48's -C parameter","title":"Downloads"},{"location":"optimizing_parameters/#see-also","text":"LibSVM - you need additional jars in your CLASSPATH to be able to use LibSVM","title":"See also"},{"location":"optimizing_parameters/#links","text":"gnuplot homepage Java Beans Introspection","title":"Links"},{"location":"osx_mountain_lion_weka_x_y_z_is_damaged_and_cant_be_installed_you_should_eject_the_disk_image/","text":"Mac OS X 10.8 (Mountain Lion) introduced a new security feature that, by default, limits \"acceptable\" applications to only those downloaded from the Mac App store. Thankfully, you can alter this in the system preferences. Go to \"Security & Privacy\" and change the \"Allow applications downloaded from:\" to \"Anywhere\". Weka will launch successfully after this change.","title":"Osx mountain lion weka x y z is damaged and cant be installed you should eject the disk image"},{"location":"performing_attribute_selection/","text":"In Weka, you have three options of performing attribute selection from commandline (not everything is possible from the GUI): the native approach, using the attribute selection classes directly using a meta-classifier the filter approach Notes: The commandlines outlined in this article are for a Linux/Unix bash (the backslash tells the shell that the command isn't finished yet and continues on the next line). In case of Windows or the SimpleCLI, just remove those backslashes and put everything on one line. The Explorer in the developer version (>= 3.5.4) also outputs the commandline setups to its log. 
Just click on the Log button to display the log and copy/paste the commandlines (you will need to add the appropriate java call and dataset files, of course). Native # Using the attribute selection classes directly outputs some additional useful information, like number of subsets evaluated/best merit (for subset evaluators), ranked output with merit per attribute (for ranking based setups). The attribute selection classes are located in the following package: weka.attributeSelection Example using CfsSubsetEval and BestFirst : java weka.attributeSelection.CfsSubsetEval \\ -M \\ -s \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -i Meta-classifier # Weka also offers a meta-classifier that takes a search algorithm and evaluator next to the base classifier. This makes the attribute selection process completely transparent and the base classifier receives only the reduced dataset. This is the full classname of the meta-classifier: weka.classifiers.meta.AttributeSelectedClassifier Example using CfsSubsetEval and BestFirst : java weka.classifiers.meta.AttributeSelectedClassifier \\ -t \\ -E \"weka.attributeSelection.CfsSubsetEval -M\" \\ -S \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -W weka.classifiers.trees.J48 \\ -- \\ -C 0.25 -M 2 Filter # In case you want to obtain the reduced/ranked data and not just output the selected/ranked attributes or use it internally in a classifier, you can use the filter approach. The following filter offers attribute selection: weka.filters.supervised.attribute.AttributeSelection Example using CfsSubsetEval and BestFirst in batch mode : java weka.filters.supervised.attribute.AttributeSelection \\ -E \"weka.attributeSelection.CfsSubsetEval -M\" \\ -S \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -b \\ -i \\ -o \\ -r \\ -s Note: batch mode is not available from the Explorer. 
See also # Batch filtering - general information about batch filtering Use Weka in your Java code , section Attribute selection - if you want to use attribute selection from your own code.","title":"Performing attribute selection"},{"location":"performing_attribute_selection/#native","text":"Using the attribute selection classes directly outputs some additional useful information, like number of subsets evaluated/best merit (for subset evaluators), ranked output with merit per attribute (for ranking based setups). The attribute selection classes are located in the following package: weka.attributeSelection Example using CfsSubsetEval and BestFirst : java weka.attributeSelection.CfsSubsetEval \\ -M \\ -s \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -i ","title":"Native"},{"location":"performing_attribute_selection/#meta-classifier","text":"Weka also offers a meta-classifier that takes a search algorithm and evaluator next to the base classifier. This makes the attribute selection process completely transparent and the base classifier receives only the reduced dataset. This is the full classname of the meta-classifier: weka.classifiers.meta.AttributeSelectedClassifier Example using CfsSubsetEval and BestFirst : java weka.classifiers.meta.AttributeSelectedClassifier \\ -t \\ -E \"weka.attributeSelection.CfsSubsetEval -M\" \\ -S \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -W weka.classifiers.trees.J48 \\ -- \\ -C 0.25 -M 2","title":"Meta-classifier"},{"location":"performing_attribute_selection/#filter","text":"In case you want to obtain the reduced/ranked data and not just output the selected/ranked attributes or use it internally in a classifier, you can use the filter approach. 
The following filter offers attribute selection: weka.filters.supervised.attribute.AttributeSelection Example using CfsSubsetEval and BestFirst in batch mode : java weka.filters.supervised.attribute.AttributeSelection \\ -E \"weka.attributeSelection.CfsSubsetEval -M\" \\ -S \"weka.attributeSelection.BestFirst -D 1 -N 5\" \\ -b \\ -i \\ -o \\ -r \\ -s Note: batch mode is not available from the Explorer.","title":"Filter"},{"location":"performing_attribute_selection/#see-also","text":"Batch filtering - general information about batch filtering Use Weka in your Java code , section Attribute selection - if you want to use attribute selection from your own code.","title":"See also"},{"location":"plotting_multiple_roc_curves/","text":"KnowledgeFlow # Description # Comparing different classifiers on one dataset can also be done via ROC curves , not just via Accuracy, Correlation coefficient etc. In the Explorer it is not possible to do that for several classifiers; this is only possible in the KnowledgeFlow . This is the basic setup (based on a Wekalist post): ArffLoader ---dataSet---> ClassAssigner ---dataSet---> ClassValuePicker (the class label you want the plot for) ---dataSet---> CrossValidationFoldMaker ---trainingSet/testSet (i.e. BOTH connections)---> Classifier of your choice ---batchClassifier---> ClassifierPerformanceEvaluator ---thresholdData---> ModelPerformanceChart This setup can be easily extended to host several classifiers, as illustrated by the Plotting_multiple_roc.kfml example, which contains J48 and RandomForest as classifiers. Java # Description # The VisualizeMultipleROC.java class lets you display several ROC curves in a single plot. The data it is using for display is from previously saved ROC curves. This example class is just a modified version of the VisualizeROC.java class, which displays only a single ROC curve (see Visualizing ROC curve article). 
See also # Wikipedia article on ROC curve Visualizing ROC curve ROC curves Downloads # Plotting_multiple_roc.kfml - Example KnowledgeFlow layout file VisualizeMultipleROC.java ( stable , developer )","title":"KnowledgeFlow"},{"location":"plotting_multiple_roc_curves/#knowledgeflow","text":"","title":"KnowledgeFlow"},{"location":"plotting_multiple_roc_curves/#description","text":"Comparing different classifiers on one dataset can also be done via ROC curves , not just via Accuracy, Correlation coefficient etc. In the Explorer it is not possible to do that for several classifiers; this is only possible in the KnowledgeFlow . This is the basic setup (based on a Wekalist post): ArffLoader ---dataSet---> ClassAssigner ---dataSet---> ClassValuePicker (the class label you want the plot for) ---dataSet---> CrossValidationFoldMaker ---trainingSet/testSet (i.e. BOTH connections)---> Classifier of your choice ---batchClassifier---> ClassifierPerformanceEvaluator ---thresholdData---> ModelPerformanceChart This setup can be easily extended to host several classifiers, as illustrated by the Plotting_multiple_roc.kfml example, which contains J48 and RandomForest as classifiers.","title":"Description"},{"location":"plotting_multiple_roc_curves/#java","text":"","title":"Java"},{"location":"plotting_multiple_roc_curves/#description_1","text":"The VisualizeMultipleROC.java class lets you display several ROC curves in a single plot. The data it is using for display is from previously saved ROC curves. 
This example class is just a modified version of the VisualizeROC.java class, which displays only a single ROC curve (see Visualizing ROC curve article).","title":"Description"},{"location":"plotting_multiple_roc_curves/#see-also","text":"Wikipedia article on ROC curve Visualizing ROC curve ROC curves","title":"See also"},{"location":"plotting_multiple_roc_curves/#downloads","text":"Plotting_multiple_roc.kfml - Example KnowledgeFlow layout file VisualizeMultipleROC.java ( stable , developer )","title":"Downloads"},{"location":"primer/","text":"WEKA is a comprehensive workbench for machine learning and data mining. Its main strengths lie in the classification area, where many of the main machine learning approaches have been implemented within a clean, object-oriented Java class hierarchy. Regression, association rule mining, time series prediction, and clustering algorithms have also been implemented. This document serves as a brief introduction to using WEKA from the command line interface. We will begin by describing basic concepts and ideas. Then, we will describe the weka.filters package, which is used to transform input data, e.g., for preprocessing, transformation, feature generation and so on. Following that, we will consider some machine learning algorithms that generate classification models. Afterwards, some practical examples are given. Note that, in the doc directory of the WEKA installation directory, you can find documentation of all Java classes in WEKA. Prepare to use it since this introduction is not intended to be complete. If you want to know exactly what is going on, take a look at the source code, which can be found in weka-src.jar and can be extracted via the jar utility from the Java Development Kit. Basic concepts # Dataset # A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by the Instances class. 
A dataset is a collection of examples, each one of class Instance . Each Instance consists of a number of attributes, any of which can be nominal (= one of a predefined list of values), numeric (= a real or integer number) or a string (= an arbitrarily long list of characters, enclosed in \"double quotes\"). WEKA also supports date attributes and relational attributes. The external representation of an Instances class is an ARFF file, which consists of a header describing the attribute types and the data as comma-separated list. Here is a short, commented example. A complete description of the ARFF file format can be found here . % This is a toy example, the UCI weather dataset. % Any relation to real weather is purely coincidental. Comment lines at the beginning of the dataset should give an indication of its source, context and meaning. @relation golfWeatherMichigan_1988/02/10_14days Here we state the internal name of the dataset. Try to be as descriptive as possible. @attribute outlook {sunny, overcast, rainy} @attribute windy {TRUE, FALSE} Here we define two nominal attributes, outlook and windy . The former has three values: sunny , overcast and rainy ; the latter two: TRUE and FALSE . Nominal values with special characters, commas or spaces are enclosed in 'single quotes'. @attribute temperature numeric @attribute humidity numeric These lines define two numeric attributes. @attribute play {yes, no} The last attribute is the default target or class variable used for prediction. In our case it is a nominal attribute with two values, making this a binary classification problem. @data sunny,FALSE,85,85,no sunny,TRUE,80,90,no overcast,FALSE,83,86,yes rainy,FALSE,70,96,yes rainy,FALSE,68,80,yes The rest of the dataset consists of the token @data, followed by comma-separated values for the attributes -- one line per example. In our case there are five examples. 
Some basic statistics and validation of given ARFF files can be obtained via the main() routine of weka.core.Instances : java weka.core.Instances data/soybean.arff weka.core offers some other useful routines, e.g., converters.C45Loader and converters.CSVLoader , which can be used to convert C45 datasets and comma/tab-separated datasets respectively, e.g.: java weka.core.converters.CSVLoader data.csv > data.arff java weka.core.converters.C45Loader c45_filestem > data.arff Classifier # Any classification or regression algorithm in WEKA is derived from the abstract Classifier class. Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier ) and another routine which produces a classification for a given instance (= classifyInstance ), or generates a probability distribution for all classes of the instance (= distributionForInstance ). A classifier model is an arbitrarily complex mapping from predictor attributes to the class attribute. The specific form and creation of this mapping, or model, differs from classifier to classifier. For example, ZeroR's model just consists of a single value: the most common class in the case of classification problems, or the mean of all numeric values in case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance of a given dataset that should be significantly improved by more complex classifiers. As such it is a reasonable test of how well the class can be predicted without considering the other attributes. 
Later , we will explain how to interpret the output from classifiers in detail -- for now just focus on the Correctly Classified Instances in the section Stratified cross-validation and notice how it improves from ZeroR to J48 when we use the soybean data: java weka.classifiers.rules.ZeroR -t soybean.arff java weka.classifiers.trees.J48 -t soybean.arff There are various approaches to determine the performance of classifiers. It can most simply be measured by counting the proportion of correctly predicted examples in a test dataset. This value is the classification accuracy , which is also 1-ErrorRate . Both terms are used in literature. The simplest case for evaluation is when we use a training set and a test set which are mutually independent. This is referred to as hold-out estimate. To estimate variance in these performance estimates, hold-out estimates may be computed by repeatedly resampling the same dataset -- i.e., randomly shuffling it and then splitting it into training and test sets with a specific proportion of the examples, collecting all estimates on the test sets and computing average and standard deviation of accuracy. A more elaborate method is k -fold cross-validation. Here, a number of folds k is specified. The dataset is randomly shuffled and then split into k folds of equal size. In each iteration, one fold is used for testing and the other k-1 folds are used for training the classifier. The test results are collected and pooled (or averaged) over all folds. This gives the cross-validation estimate of accuracy. The folds can be purely random or slightly modified to create the same class distributions in each fold as in the complete dataset. In the latter case the cross-validation is called stratified . Leave-one-out (loo) cross-validation signifies that k is equal to the number of examples. Out of necessity, loo cv has to be non-stratified, i.e., the class distributions in the test sets are not the same as those in the training data. 
Therefore loo CV can produce misleading results in rare cases. However it is still quite useful in dealing with small datasets since it utilizes the greatest amount of training data from the dataset. weka filters # The weka.filters package contains Java classes that transform datasets -- by removing or adding attributes, resampling the dataset, removing examples and so on. This package offers useful support for data preprocessing, which is an important step in machine learning. All filters offer the command-line option -i for specifying the input dataset, and the option -o for specifying the output dataset. If either of these parameters is not given, standard input or standard output is used instead, which allows use within pipes. Other parameters are specific to each filter and can be found out via -h , as with any other class. The weka.filters package is organized into supervised and unsupervised filtering, both of which are again subdivided into instance and attribute filtering. We will discuss each of the four subsections separately. weka.filters.supervised # Classes below weka.filters.supervised in WEKA's Java class hierarchy are for supervised filtering, i.e., taking advantage of the class information. For those filters, a class must be assigned by providing the index of the class attribute via -c . attribute # Discretize is used to discretize numeric attributes into nominal ones, based on the class information, via Fayyad & Irani's MDL method, or optionally with Kononenko's MDL method. Some learning schemes or classifiers can only process nominal data, e.g., rules.Prism ; and in some cases discretization may also reduce learning time and help combat overfitting.
java weka.filters.supervised.attribute.Discretize -i data/iris.arff -o iris-nom.arff -c last java weka.filters.supervised.attribute.Discretize -i data/cpu.arff -o cpu-classvendor-nom.arff -c first NominalToBinary encodes all nominal attributes into binary (two-valued) attributes, which can be used to transform the dataset into a purely numeric representation, e.g., for visualization via multi-dimensional scaling. java weka.filters.supervised.attribute.NominalToBinary -i data/contact-lenses.arff -o contact-lenses-bin.arff -c last Note that most classifiers in WEKA utilize transformation filters internally, e.g., Logistic and SMO, so you may not have to use these filters explicitly. instance # Resample creates a stratified subsample of the given dataset. This means that overall class distributions are approximately retained within the sample. A bias towards uniform class distribution can be specified via -B . java weka.filters.supervised.instance.Resample -i data/soybean.arff -o soybean-5%.arff -c last -Z 5 java weka.filters.supervised.instance.Resample -i data/soybean.arff -o soybean-uniform-5%.arff -c last -Z 5 -B 1 StratifiedRemoveFolds creates stratified cross-validation folds of the given dataset. This means that by default the class distributions are approximately retained within each fold. The following example splits soybean.arff into stratified training and test datasets, the latter consisting of 25% (=1/4) of the data. java weka.filters.supervised.instance.StratifiedRemoveFolds -i data/soybean.arff -o soybean-train.arff \\ -c last -N 4 -F 1 -V java weka.filters.supervised.instance.StratifiedRemoveFolds -i data/soybean.arff -o soybean-test.arff \\ -c last -N 4 -F 1 weka.filters.unsupervised # Classes below weka.filters.unsupervised in WEKA's Java class hierarchy are for unsupervised filtering, e.g., the non-stratified version of Resample. A class should not be assigned here.
attribute # StringToWordVector transforms string attributes into word vectors, e.g., creating one attribute for each word that either encodes presence or word count ( -C ) within the string. -W can be used to set an approximate limit on the number of words. When a class is assigned, the limit applies to each class separately. This filter is useful for text mining. Obfuscate renames the dataset name, all attribute names and nominal attribute values. This is intended for exchanging sensitive datasets without giving away restricted information. Remove is intended for explicit deletion of attributes from a dataset, e.g. for removing attributes of the iris dataset: java weka.filters.unsupervised.attribute.Remove -R 1-2 -i data/iris.arff -o iris-simplified.arff java weka.filters.unsupervised.attribute.Remove -V -R 3-last -i data/iris.arff -o iris-simplified.arff instance # Resample creates a non-stratified subsample of the given dataset. It performs random sampling without regard to the class information. Otherwise it is equivalent to its supervised variant. java weka.filters.unsupervised.instance.Resample -i data/soybean.arff -o soybean-5%.arff -Z 5 RemoveFolds creates cross-validation folds of the given dataset. The class distributions are not retained. The following example splits soybean.arff into training and test datasets, the latter consisting of 25% (=1/4) of the data. java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff -o soybean-train.arff -c last -N 4 -F 1 -V java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff -o soybean-test.arff -c last -N 4 -F 1 RemoveWithValues filters instances according to the value of an attribute. java weka.filters.unsupervised.instance.RemoveWithValues -i data/soybean.arff \\ -o soybean-without_herbicide_injury.arff -V -C last -L 19 weka.classifiers # Classifiers are at the core of WEKA. There are a lot of common options for classifiers, most of which are related to evaluation purposes.
We will focus on the most important ones. All others including classifier-specific parameters can be found via -h , as usual. Parameter Description -t specifies the training file (ARFF format) -T specifies the test file (ARFF format). If this parameter is missing, a cross-validation will be performed (default: 10-fold cv) -x This parameter determines the number of folds for the cross-validation. A cv will only be performed if -T is missing. -c As we already know from the weka.filters section, this parameter sets the class variable with a one-based index. -d The model after training can be saved via this parameter. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation. -l Loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. the same attributes in the same order. -p If a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances. -o This parameter switches the human-readable output of the model description off. In case of support vector machines or NaiveBayes, this makes some sense unless you want to parse and visualize a lot of information. We now give a short list of selected classifiers in WEKA: trees.J48 A clone of the C4.5 decision tree learner bayes.NaiveBayes A Naive Bayesian learner. -K switches on kernel density estimation for numerical attributes which often improves performance. meta.ClassificationViaRegression -W functions.LinearRegression Multi-response linear regression. functions.Logistic Logistic Regression. functions.SMO Support Vector Machine (linear, polynomial and RBF kernel) with Sequential Minimal Optimization Algorithm due to [Platt, 1998].
Defaults to an SVM with linear kernel; -E 5 -C 10 gives an SVM with polynomial kernel of degree 5 and lambda=10. lazy.KStar Instance-Based learner. -E sets the blend entropy automatically, which is usually preferable. lazy.IBk Instance-Based learner with fixed neighborhood. -K sets the number of neighbors to use. IB1 is equivalent to IBk -K 1 rules.JRip A clone of the RIPPER rule learner. Based on a simple example, we will now explain the output of a typical classifier, weka.classifiers.trees.J48 . Consider the following call from the command line, or start the WEKA explorer and train J48 on weather.numeric.arff: java weka.classifiers.trees.J48 -t data/weather.numeric.arff J48 pruned tree ------------------ outlook = sunny | humidity <= 75: yes (2.0) | humidity > 75: no (3.0) outlook = overcast: yes (4.0) outlook = rainy | windy = TRUE: no (2.0) | windy = FALSE: yes (3.0) Number of Leaves : 5 Size of the tree : 8 The first part, unless you specify -o , is a human-readable form of the training set model. In this case, it is a decision tree. outlook is at the root of the tree and determines the first decision. In case it is overcast, we'll always play golf. The numbers in (parentheses) at the end of each leaf tell us the number of examples in this leaf. If one or more leaves were not pure (= all of the same class), the number of misclassified examples would also be given, after a /slash/. Time taken to build model: 0.05 seconds Time taken to test model on training data: 0 seconds As you can see, a decision tree learns quite fast and is evaluated even faster.
== Error on training data == Correctly Classified Instances 14 100 % Incorrectly Classified Instances 0 0 % Kappa statistic 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 14 == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure Class 1 0 1 1 1 yes 1 0 1 1 1 no == Confusion Matrix == a b <-- classified as 9 0 | a = yes 0 5 | b = no This is quite boring: our classifier is perfect, at least on the training data -- all instances were classified correctly and all errors are zero. As is usually the case, the training set accuracy is too optimistic. The detailed accuracy by class and the confusion matrix are similarly trivial. == Stratified cross-validation == Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0.186 Mean absolute error 0.2857 Root mean squared error 0.4818 Relative absolute error 60 % Root relative squared error 97.6586 % Total Number of Instances 14 == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure Class 0.778 0.6 0.7 0.778 0.737 yes 0.4 0.222 0.5 0.4 0.444 no == Confusion Matrix == a b <-- classified as 7 2 | a = yes 3 2 | b = no The stratified cross-validation paints a more realistic picture. The accuracy is around 64%. The kappa statistic measures the agreement of prediction with the true class -- 1.0 signifies complete agreement. The error values that are shown, e.g., the root of the mean squared error, indicate the accuracy of the probability estimates that are generated by the classification model. The confusion matrix is more commonly named contingency table . In our case we have two classes, and therefore a 2x2 confusion matrix, but in general the matrix could be arbitrarily large.
The number of correctly classified instances is the sum of diagonals in the matrix; all others are incorrectly classified (class \"a\" gets misclassified as \"b\" exactly twice, and class \"b\" gets misclassified as \"a\" three times). The True Positive (TP) rate is the proportion of examples which were classified as class x , among all examples which truly have class x , i.e., how much of the class was captured correctly. It is equivalent to Recall . In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e., 7/(7+2)=0.778 for class yes and 2/(3+2)=0.4 for class no in our example. The False Positive (FP) rate is the proportion of examples which were classified as class x , but belong to a different class, among all examples which are not of class x . In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes; i.e. 3/5=0.6 for class yes and 2/9=0.222 for class no . The Precision is the proportion of the examples which truly have class x among all those which were classified as class x . In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3)=0.7 for class yes and 2/(2+2)=0.5 for class no . The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision and recall. These measures are useful for comparing classifiers. However, if more detailed information about the classifier's predictions is necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none). Let's look at the following example. We shall assume soybean-train.arff and soybean-test.arff have been constructed via weka.filters.supervised.instance.StratifiedRemoveFolds as in a previous example.
java weka.classifiers.bayes.NaiveBayes -K -t soybean-train.arff -T soybean-test.arff -p 0 0 diaporthe-stem-canker 0.9999672587892333 diaporthe-stem-canker 1 diaporthe-stem-canker 0.9999992614503429 diaporthe-stem-canker 2 diaporthe-stem-canker 0.999998948559035 diaporthe-stem-canker 3 diaporthe-stem-canker 0.9999998441238833 diaporthe-stem-canker 4 diaporthe-stem-canker 0.9999989997681132 diaporthe-stem-canker 5 rhizoctonia-root-rot 0.9999999395928124 rhizoctonia-root-rot 6 rhizoctonia-root-rot 0.999998912860593 rhizoctonia-root-rot 7 rhizoctonia-root-rot 0.9999994386283236 rhizoctonia-root-rot ... The values in each line are separated by a single space. The fields are the zero-based test instance id, followed by the predicted class value, the confidence for the prediction (estimated probability of predicted class), and the true class. All these are correctly classified, so let's look at a few erroneous ones. 32 phyllosticta-leaf-spot 0.7789710144361445 brown-spot ... 39 alternarialeaf-spot 0.6403333824349896 brown-spot ... 44 phyllosticta-leaf-spot 0.893568420641914 brown-spot ... 46 alternarialeaf-spot 0.5788190397739439 brown-spot ... 73 brown-spot 0.4943768155314637 alternarialeaf-spot ... In each of these cases, a misclassification occurred, mostly between classes alternarialeaf-spot and brown-spot . The confidences seem to be lower than for correct classification, so for a real-life application it may make sense to output don't know below a certain threshold. WEKA also outputs a trailing newline. If we had chosen a range of attributes via -p , e.g., -p first-last , the mentioned attributes would have been output afterwards as comma-separated values, in parentheses. However, the zero-based instance id in the first column offers a safer way to determine the test instances. Usually, if you evaluate a classifier for a longer experiment, you will do something like this (for csh): java -Xmx1024m weka.classifiers.trees.J48 -t data.arff -k -d J48-data.model >&! J48-data.out &
The -Xmx1024m parameter for maximum heap size enables the Java heap, where Java stores objects, to grow to a maximum size of 1024 Megabytes. There is no overhead involved; it just leaves more room for the heap to grow. The -k flag gives you some additional performance statistics. In case your model performs well, it makes sense to save it via -d - you can always delete it later! The implicit cross-validation gives a more reasonable estimate of the expected accuracy on unseen data than the training set accuracy. Both standard error and standard output should be redirected, so you get both errors and the normal output of your classifier. The last & starts the task in the background. Keep an eye on your task via top and if you notice the hard disk works hard all the time (for linux), this probably means your task needs too much memory and will not finish in time for the exam. ;-) In that case, switch to a faster classifier or use filters , e.g., Resample to reduce the size of your dataset or StratifiedRemoveFolds to create training and test sets - for most classifiers, training takes more time than testing. So, now you have run a lot of experiments -- which classifier is best? Try cat *.out | grep -A 3 \"Stratified\" | grep \"^Correctly\" ...this should give you all cross-validated accuracies. If the cross-validated accuracy is roughly the same as the training set accuracy, this indicates that your classifier is presumably not overfitting the training set. Assume you have found the best classifier. To apply it on a new dataset, use something like java weka.classifiers.trees.J48 -l J48-data.model -T new-data.arff You will have to use the same classifier to load the model, but you need not set any options. Just add the new test file via -T . If you want, -p first-last will output all test instances with classifications and confidence scores, followed by all attribute values, so you can look at each error separately.
The following more complex csh script creates datasets for learning curves, creating a 75% training set and 25% test set from a given dataset, then successively reducing the training set by factor 1.2 (83%), until it is also 25% in size. All this is repeated thirty times, with different random reorderings ( -S ) and the results are written to different directories. The Experimenter GUI in WEKA can be used to design and run similar experiments. #!/bin/csh foreach f ( $* ) set run = 1 while ( $run <= 30 ) mkdir $run >&! /dev/null java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -c last -i ../$f -o $run/t_$f java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -V -c last -i ../$f -o $run/t0$f foreach nr ( 0 1 2 3 4 5 ) set nrp1 = $nr @ nrp1++ java weka.filters.supervised.instance.Resample -S 0 -Z 83 -c last -i $run/t$nr$f -o $run/t$nrp1$f end echo Run $run of $f done. @ run++ end end If meta classifiers are used, i.e. classifiers whose options include classifier specifications - for example, StackingC or ClassificationViaRegression - care must be taken not to mix the parameters. For example, java weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -S 1 \\ -t data/iris.arff -x 2 gives us an illegal options exception for -S 1 . This parameter is meant for LinearRegression, not for ClassificationViaRegression, but WEKA does not know this by itself. One way to clarify this situation is to enclose the classifier specification, including all parameters, in \"double\" quotes, like this: java weka.classifiers.meta.ClassificationViaRegression -W \"weka.classifiers.functions.LinearRegression -S 1\" \\ -t data/iris.arff -x 2 However this does not always work, depending on how the option handling was implemented in the top-level classifier. While for Stacking this approach would work quite well, for ClassificationViaRegression it does not.
We get the dubious error message that the class weka.classifiers.functions.LinearRegression -S 1 cannot be found. Fortunately, there is another approach: All parameters given after -- are processed by the first sub-classifier; another -- lets us specify parameters for the second sub-classifier and so on. java weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression \\ -t data/iris.arff -x 2 -- -S 1 In some cases, both approaches have to be mixed, for example: java weka.classifiers.meta.Stacking -B \"weka.classifiers.lazy.IBk -K 10\" \\ -M \"weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -- -S 1\" \\ -t data/iris.arff -x 2 Notice that while ClassificationViaRegression honors the -- parameter, Stacking itself does not.","title":"Primer"},{"location":"primer/#basic-concepts","text":"","title":"Basic concepts"},{"location":"primer/#dataset","text":"A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by the Instances class. A dataset is a collection of examples, each one of class Instance . Each Instance consists of a number of attributes, any of which can be nominal (= one of a predefined list of values), numeric (= a real or integer number) or a string (= an arbitrarily long list of characters, enclosed in \"double quotes\"). WEKA also supports date attributes and relational attributes. The external representation of an Instances class is an ARFF file, which consists of a header describing the attribute types and the data as a comma-separated list. Here is a short, commented example. A complete description of the ARFF file format can be found here . % This is a toy example, the UCI weather dataset. % Any relation to real weather is purely coincidental.
Comment lines at the beginning of the dataset should give an indication of its source, context and meaning. @relation golfWeatherMichigan_1988/02/10_14days Here we state the internal name of the dataset. Try to be as descriptive as possible. @attribute outlook {sunny, overcast, rainy} @attribute windy {TRUE, FALSE} Here we define two nominal attributes, outlook and windy . The former has three values: sunny , overcast and rainy ; the latter two: TRUE and FALSE . Nominal values with special characters, commas or spaces are enclosed in 'single quotes'. @attribute temperature numeric @attribute humidity numeric These lines define two numeric attributes. @attribute play {yes, no} The last attribute is the default target or class variable used for prediction. In our case it is a nominal attribute with two values, making this a binary classification problem. @data sunny,FALSE,85,85,no sunny,TRUE,80,90,no overcast,FALSE,83,86,yes rainy,FALSE,70,96,yes rainy,FALSE,68,80,yes The rest of the dataset consists of the token @data, followed by comma-separated values for the attributes -- one line per example. In our case there are five examples. Some basic statistics and validation of given ARFF files can be obtained via the main() routine of weka.core.Instances : java weka.core.Instances data/soybean.arff weka.core offers some other useful routines, e.g., converters.C45Loader and converters.CSVLoader , which can be used to convert C45 datasets and comma/tab-separated datasets respectively, e.g.: java weka.core.converters.CSVLoader data.csv > data.arff java weka.core.converters.C45Loader c45_filestem > data.arff","title":"Dataset"},{"location":"primer/#classifier","text":"Any classification or regression algorithm in WEKA is derived from the abstract Classifier class.
Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier ) and another routine which produces a classification for a given instance (= classifyInstance ), or generates a probability distribution for all classes of the instance (= distributionForInstance ). A classifier model is an arbitrarily complex mapping from predictor attributes to the class attribute. The specific form and creation of this mapping, or model, differ from classifier to classifier. For example, ZeroR's model just consists of a single value: the most common class in the case of classification problems, or the mean of all numeric values in case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance of a given dataset that should be significantly improved by more complex classifiers. As such it is a reasonable test of how well the class can be predicted without considering the other attributes. Later , we will explain how to interpret the output from classifiers in detail -- for now just focus on the Correctly Classified Instances in the section Stratified cross-validation and notice how it improves from ZeroR to J48 when we use the soybean data: java weka.classifiers.rules.ZeroR -t soybean.arff java weka.classifiers.trees.J48 -t soybean.arff There are various approaches to determine the performance of classifiers. It can most simply be measured by counting the proportion of correctly predicted examples in a test dataset. This value is the classification accuracy , which is also 1-ErrorRate . Both terms are used in the literature. The simplest case for evaluation is when we use a training set and a test set which are mutually independent. This is referred to as a hold-out estimate.
To estimate variance in these performance estimates, hold-out estimates may be computed by repeatedly resampling the same dataset -- i.e., randomly shuffling it and then splitting it into training and test sets with a specific proportion of the examples, collecting all estimates on the test sets and computing average and standard deviation of accuracy. A more elaborate method is k -fold cross-validation. Here, a number of folds k is specified. The dataset is randomly shuffled and then split into k folds of equal size. In each iteration, one fold is used for testing and the other k-1 folds are used for training the classifier. The test results are collected and pooled (or averaged) over all folds. This gives the cross-validation estimate of accuracy. The folds can be purely random or slightly modified to create the same class distributions in each fold as in the complete dataset. In the latter case the cross-validation is called stratified . Leave-one-out (loo) cross-validation signifies that k is equal to the number of examples. Out of necessity, loo CV has to be non-stratified, i.e., the class distributions in the test sets are not the same as those in the training data. Therefore loo CV can produce misleading results in rare cases. However it is still quite useful in dealing with small datasets since it utilizes the greatest amount of training data from the dataset.","title":"Classifier"},{"location":"primer/#weka-filters","text":"The weka.filters package contains Java classes that transform datasets -- by removing or adding attributes, resampling the dataset, removing examples and so on. This package offers useful support for data preprocessing, which is an important step in machine learning. All filters offer the command-line option -i for specifying the input dataset, and the option -o for specifying the output dataset. If either of these parameters is not given, standard input or standard output is used instead, which allows use within pipes.
Other parameters are specific to each filter and can be found out via -h , as with any other class. The weka.filters package is organized into supervised and unsupervised filtering, both of which are again subdivided into instance and attribute filtering. We will discuss each of the four subsections separately.","title":"weka filters"},{"location":"primer/#wekafilterssupervised","text":"Classes below weka.filters.supervised in WEKA's Java class hierarchy are for supervised filtering, i.e., taking advantage of the class information. For those filters, a class must be assigned by providing the index of the class attribute via -c .","title":"weka.filters.supervised"},{"location":"primer/#attribute","text":"Discretize is used to discretize numeric attributes into nominal ones, based on the class information, via Fayyad & Irani's MDL method, or optionally with Kononenko's MDL method. Some learning schemes or classifiers can only process nominal data, e.g., rules.Prism ; and in some cases discretization may also reduce learning time and help combat overfitting. java weka.filters.supervised.attribute.Discretize -i data/iris.arff -o iris-nom.arff -c last java weka.filters.supervised.attribute.Discretize -i data/cpu.arff -o cpu-classvendor-nom.arff -c first NominalToBinary encodes all nominal attributes into binary (two-valued) attributes, which can be used to transform the dataset into a purely numeric representation, e.g., for visualization via multi-dimensional scaling. java weka.filters.supervised.attribute.NominalToBinary -i data/contact-lenses.arff -o contact-lenses-bin.arff -c last Note that most classifiers in WEKA utilize transformation filters internally, e.g., Logistic and SMO, so you may not have to use these filters explicitly.","title":"attribute"},{"location":"primer/#instance","text":"Resample creates a stratified subsample of the given dataset. This means that overall class distributions are approximately retained within the sample.
A bias towards uniform class distribution can be specified via -B . java weka.filters.supervised.instance.Resample -i data/soybean.arff -o soybean-5%.arff -c last -Z 5 java weka.filters.supervised.instance.Resample -i data/soybean.arff -o soybean-uniform-5%.arff -c last -Z 5 -B 1 StratifiedRemoveFolds creates stratified cross-validation folds of the given dataset. This means that by default the class distributions are approximately retained within each fold. The following example splits soybean.arff into stratified training and test datasets, the latter consisting of 25% (=1/4) of the data. java weka.filters.supervised.instance.StratifiedRemoveFolds -i data/soybean.arff -o soybean-train.arff \\ -c last -N 4 -F 1 -V java weka.filters.supervised.instance.StratifiedRemoveFolds -i data/soybean.arff -o soybean-test.arff \\ -c last -N 4 -F 1","title":"instance"},{"location":"primer/#wekafiltersunsupervised","text":"Classes below weka.filters.unsupervised in WEKA's Java class hierarchy are for unsupervised filtering, e.g., the non-stratified version of Resample. A class should not be assigned here.","title":"weka.filters.unsupervised"},{"location":"primer/#attribute_1","text":"StringToWordVector transforms string attributes into word vectors, e.g., creating one attribute for each word that either encodes presence or word count ( -C ) within the string. -W can be used to set an approximate limit on the number of words. When a class is assigned, the limit applies to each class separately. This filter is useful for text mining. Obfuscate renames the dataset name, all attribute names and nominal attribute values. This is intended for exchanging sensitive datasets without giving away restricted information. Remove is intended for explicit deletion of attributes from a dataset, e.g.
for removing attributes of the iris dataset: java weka.filters.unsupervised.attribute.Remove -R 1-2 -i data/iris.arff -o iris-simplified.arff java weka.filters.unsupervised.attribute.Remove -V -R 3-last -i data/iris.arff -o iris-simplified.arff","title":"attribute"},{"location":"primer/#instance_1","text":"Resample creates a non-stratified subsample of the given dataset. It performs random sampling without regard to the class information. Otherwise it is equivalent to its supervised variant. java weka.filters.unsupervised.instance.Resample -i data/soybean.arff -o soybean-5%.arff -Z 5 RemoveFolds creates cross-validation folds of the given dataset. The class distributions are not retained. The following example splits soybean.arff into training and test datasets, the latter consisting of 25% (=1/4) of the data. java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff -o soybean-train.arff -c last -N 4 -F 1 -V java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff -o soybean-test.arff -c last -N 4 -F 1 RemoveWithValues filters instances according to the value of an attribute. java weka.filters.unsupervised.instance.RemoveWithValues -i data/soybean.arff \\ -o soybean-without_herbicide_injury.arff -V -C last -L 19","title":"instance"},{"location":"primer/#wekaclassifiers","text":"Classifiers are at the core of WEKA. There are a lot of common options for classifiers, most of which are related to evaluation purposes. We will focus on the most important ones. All others including classifier-specific parameters can be found via -h , as usual. Parameter Description -t specifies the training file (ARFF format) -T specifies the test file (ARFF format). If this parameter is missing, a cross-validation will be performed (default: 10-fold cv) -x This parameter determines the number of folds for the cross-validation. A cv will only be performed if -T is missing.
-c As we already know from the weka.filters section, this parameter sets the class variable with a one-based index. -d The model after training can be saved via this parameter. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation. -l Loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. the same attributes in the same order. -p If a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances. -o This parameter switches the human-readable output of the model description off. In case of support vector machines or NaiveBayes, this makes some sense unless you want to parse and visualize a lot of information. We now give a short list of selected classifiers in WEKA: trees.J48 A clone of the C4.5 decision tree learner bayes.NaiveBayes A Naive Bayesian learner. -K switches on kernel density estimation for numerical attributes which often improves performance. meta.ClassificationViaRegression -W functions.LinearRegression Multi-response linear regression. functions.Logistic Logistic Regression. functions.SMO Support Vector Machine (linear, polynomial and RBF kernel) with Sequential Minimal Optimization Algorithm due to [Platt, 1998]. Defaults to an SVM with linear kernel; -E 5 -C 10 gives an SVM with polynomial kernel of degree 5 and lambda=10. lazy.KStar Instance-Based learner. -E sets the blend entropy automatically, which is usually preferable. lazy.IBk Instance-Based learner with fixed neighborhood. -K sets the number of neighbors to use. IB1 is equivalent to IBk -K 1 rules.JRip A clone of the RIPPER rule learner. Based on a simple example, we will now explain the output of a typical classifier, weka.classifiers.trees.J48 .
Consider the following call from the command line, or start the WEKA explorer and train J48 on weather.numeric.arff: java weka.classifiers.trees.J48 -t data/weather.numeric.arff J48 pruned tree ------------------ outlook = sunny | humidity <= 75: yes (2.0) | humidity > 75: no (3.0) outlook = overcast: yes (4.0) outlook = rainy | windy = TRUE: no (2.0) | windy = FALSE: yes (3.0) Number of Leaves : 5 Size of the tree : 8 The first part, unless you specify -o , is a human-readable form of the training set model. In this case, it is a decision tree. outlook is at the root of the tree and determines the first decision. In case it is overcast, we'll always play golf. The numbers in (parentheses) at the end of each leaf tell us the number of examples in this leaf. If one or more leaves were not pure (= all of the same class), the number of misclassified examples would also be given, after a /slash/. Time taken to build model: 0.05 seconds Time taken to test model on training data: 0 seconds As you can see, a decision tree learns quite fast and is evaluated even faster. == Error on training data == Correctly Classified Instances 14 100 % Incorrectly Classified Instances 0 0 % Kappa statistic 1 Mean absolute error 0 Root mean squared error 0 Relative absolute error 0 % Root relative squared error 0 % Total Number of Instances 14 == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure Class 1 0 1 1 1 yes 1 0 1 1 1 no == Confusion Matrix == a b <-- classified as 9 0 | a = yes 0 5 | b = no This is quite boring: our classifier is perfect, at least on the training data -- all instances were classified correctly and all errors are zero. As is usually the case, the training set accuracy is too optimistic. The detailed accuracy by class and the confusion matrix are similarly trivial.
== Stratified cross-validation == Correctly Classified Instances 9 64.2857 % Incorrectly Classified Instances 5 35.7143 % Kappa statistic 0.186 Mean absolute error 0.2857 Root mean squared error 0.4818 Relative absolute error 60 % Root relative squared error 97.6586 % Total Number of Instances 14 == Detailed Accuracy By Class == TP Rate FP Rate Precision Recall F-Measure Class 0.778 0.6 0.7 0.778 0.737 yes 0.4 0.222 0.5 0.4 0.444 no == Confusion Matrix == a b <-- classified as 7 2 | a = yes 3 2 | b = no The stratified cross-validation paints a more realistic picture. The accuracy is around 64%. The kappa statistic measures the agreement of prediction with the true class -- 1.0 signifies complete agreement. The error values that are shown, e.g., the root of the mean squared error, indicate the accuracy of the probability estimates that are generated by the classification model. The confusion matrix is more commonly named contingency table . In our case we have two classes, and therefore a 2x2 confusion matrix, the matrix could be arbitrarily large. The number of correctly classified instances is the sum of diagonals in the matrix; all others are incorrectly classified (class \"a\" gets misclassified as \"b\" exactly twice, and class \"b\" gets misclassified as \"a\" three times). The True Positive (TP) rate is the proportion of examples which were classified as class x , among all examples which truly have class x , i.e., how much of the class was captured correctly. It is equivalent to Recall . In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e., 7/(7+2)=0.778 for class yes and 2/(3+2)=0.4 for class no in our example. The False Positive (FP) rate is the proportion of examples which were classified as class x , but belong to a different class, among all examples which are not of class x . In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes; i.e. 
3/5=0.6 for class yes and 2/9=0.222 for class no . The Precision is the proportion of the examples which truly have class x among all those which were classified as class x . In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3)=0.7 for class yes and 2/(2+2)=0.5 for class no . The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision and recall. These measures are useful for comparing classifiers. However, if more detailed information about the classifier's predictions is necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none). Let's look at the following example. We shall assume soybean-train.arff and soybean-test.arff have been constructed via weka.filters.supervised.instance.StratifiedRemoveFolds as in a previous example. java weka.classifiers.bayes.NaiveBayes -K -t soybean-train.arff -T soybean-test.arff -p 0 0 diaporthe-stem-canker 0.9999672587892333 diaporthe-stem-canker 1 diaporthe-stem-canker 0.9999992614503429 diaporthe-stem-canker 2 diaporthe-stem-canker 0.999998948559035 diaporthe-stem-canker 3 diaporthe-stem-canker 0.9999998441238833 diaporthe-stem-canker 4 diaporthe-stem-canker 0.9999989997681132 diaporthe-stem-canker 5 rhizoctonia-root-rot 0.9999999395928124 rhizoctonia-root-rot 6 rhizoctonia-root-rot 0.999998912860593 rhizoctonia-root-rot 7 rhizoctonia-root-rot 0.9999994386283236 rhizoctonia-root-rot ... The values in each line are separated by a single space. The fields are the zero-based test instance id, followed by the predicted class value, the confidence for the prediction (estimated probability of predicted class), and the true class. All these are correctly classified, so let's look at a few erroneous ones. 32 phyllosticta-leaf-spot 0.7789710144361445 brown-spot ... 39 alternarialeaf-spot 0.6403333824349896 brown-spot ...
44 phyllosticta-leaf-spot 0.893568420641914 brown-spot ... 46 alternarialeaf-spot 0.5788190397739439 brown-spot ... 73 brown-spot 0.4943768155314637 alternarialeaf-spot ... In each of these cases, a misclassification occurred, mostly between classes alternarialeaf-spot and brown-spot . The confidences seem to be lower than for correct classification, so for a real-life application it may make sense to output don't know below a certain threshold. WEKA also outputs a trailing newline. If we had chosen a range of attributes via -p , e.g., -p first-last , the mentioned attributes would have been output afterwards as comma-separated values, in parentheses. However, the zero-based instance id in the first column offers a safer way to determine the test instances. Usually, if you evaluate a classifier for a longer experiment, you will do something like this (for csh): java -Xmx1024m weka.classifiers.trees.J48 -t data.arff -k -d J48-data.model >&! J48-data.out & The -Xmx1024m parameter for maximum heap size enables the Java heap, where Java stores objects, to grow to a maximum size of 1024 Megabytes. There is no overhead involved; it just leaves more room for the heap to grow. The -k flag gives you some additional performance statistics. In case your model performs well, it makes sense to save it via -d - you can always delete it later! The implicit cross-validation gives a more reasonable estimate of the expected accuracy on unseen data than the training set accuracy. Both standard error and standard output should be redirected, so you get both errors and the normal output of your classifier. The last & starts the task in the background. Keep an eye on your task via top and if you notice the hard disk works hard all the time (for linux), this probably means your task needs too much memory and will not finish in time for the exam.
;-) In that case, switch to a faster classifier or use filters, e.g., Resample to reduce the size of your dataset or StratifiedRemoveFolds to create training and test sets - for most classifiers, training takes more time than testing. So, now you have run a lot of experiments -- which classifier is best? Try cat *.out | grep -A 3 \"Stratified\" | grep \"^Correctly\" ...this should give you all cross-validated accuracies. If the cross-validated accuracy is roughly the same as the training set accuracy, this indicates that your classifier is presumably not overfitting the training set. Assume you have found the best classifier. To apply it on a new dataset, use something like java weka.classifiers.trees.J48 -l J48-data.model -T new-data.arff You will have to use the same classifier to load the model, but you need not set any options. Just add the new test file via -T . If you want, -p first-last will output all test instances with classifications and confidence scores, followed by all attribute values, so you can look at each error separately. The following more complex csh script creates datasets for learning curves, creating a 75% training set and 25% test set from a given dataset, then successively reducing the test set by factor 1.2 (83%), until it is also 25% in size. All this is repeated thirty times, with different random reorderings (-S) and the results are written to different directories. The Experimenter GUI in WEKA can be used to design and run similar experiments. #!/bin/csh foreach f ( $* ) set run = 1 while ( $run <= 30 ) mkdir $run >&!
/dev/null java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -c last -i ../$f -o $run/t_$f java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -V -c last -i ../$f -o $run/t0$f foreach nr ( 0 1 2 3 4 5 ) set nrp1 = $nr @ nrp1++ java weka.filters.supervised.instance.Resample -S 0 -Z 83 -c last -i $run/t$nr$f -o $run/t$nrp1$f end echo Run $run of $f done. @ run++ end end If meta classifiers are used, i.e. classifiers whose options include classifier specifications - for example, StackingC or ClassificationViaRegression , care must be taken not to mix the parameters. For example, java weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -S 1 \\ -t data/iris.arff -x 2 gives us an illegal options exception for -S 1 . This parameter is meant for LinearRegression, not for ClassificationViaRegression, but WEKA does not know this by itself. One way to clarify this situation is to enclose the classifier specification, including all parameters, in \"double\" quotes, like this: java weka.classifiers.meta.ClassificationViaRegression -W \"weka.classifiers.functions.LinearRegression -S 1\" \\ -t data/iris.arff -x 2 However this does not always work, depending on how the option handling was implemented in the top-level classifier. While for Stacking this approach would work quite well, for ClassificationViaRegression it does not. We get the dubious error message that the class weka.classifiers.functions.LinearRegression -S 1 cannot be found. Fortunately, there is another approach: All parameters given after -- are processed by the first sub-classifier; another -- lets us specify parameters for the second sub-classifier and so on.
java weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression \\ -t data/iris.arff -x 2 -- -S 1 In some cases, both approaches have to be mixed, for example: java weka.classifiers.meta.Stacking -B \"weka.classifiers.lazy.IBk -K 10\" \\ -M \"weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -- -S 1\" \\ -t data/iris.arff -x 2 Notice that while ClassificationViaRegression honors the -- parameter, Stacking itself does not.","title":"weka.classifiers"},{"location":"properties_file/","text":"General # A properties file is a simple text file with this structure: = Notes: Comments start with the hash sign # . Backslashes within values need to be doubled (the backslashes get interpreted already when a property is read). To make a rather long property line more readable, one can use a backslash to continue on the next line. The Filter property, e.g., looks like this: weka.filters.Filter = \\ > weka.filters.supervised.attribute, \\ > weka.filters.supervised.instance, \\ > weka.filters.unsupervised.attribute, \\ > weka.filters.unsupervised.instance Precedence # The Weka property files (extension .props ) are searched for in the following order: current directory (< Weka 3.7.2) the user's home directory (see FAQ Where is my home directory located? for more information) (>= Weka 3.7.2) $WEKA_HOME/props (the default value for WEKA_HOME is user's home directory/wekafiles). the class path (normally the weka.jar file) If WEKA encounters those files it only supplements the properties, never overrides them. In other words, a property in the property file of the current directory has a higher precedence than the one in the user's home directory. Note: Under Cygwin , the home directory is still the Windows one, since the Java installation will still be one for Windows. How to modify a .props file?
# It is quite possible that the default setup of WEKA is not to your liking and that you want to tweak it a little bit. The use of .props files instead of hard-coding makes it quite easy to modify WEKA's behavior. As an example, we are modifying the background color of the 2D plots in the Explorer, changing it to dark gray . The responsible .props file is weka/gui/visualize/Visualize.props . These are the necessary steps: close WEKA extract the .props file from the weka.jar , using an archive manager that can handle ZIP files (e.g., 7-Zip under Windows) place this .props file in your home directory (see FAQ Where is my home directory located? on how to determine your home directory), or for Weka 3.7.2 or higher place this .props file in $WEKA_HOME/props (the default value of WEKA_HOME is user's home directory/wekafiles) open this .props with a text editor ( NB: Notepad under Windows might not handle the Unix line-endings correctly!) navigate to the property weka.gui.visualize.Plot2D.backgroundColour and change the color after the equal sign (\"=\") to darkGray (the article about weka/gui/visualize/Visualize.props lists all possible colors) save the file and restart WEKA Notes # Escaping Backslashes in values need to be escaped (i.e., doubled), otherwise they get interpreted as the start of an escape sequence. E.g., \"is\\this\" will be interpreted as \"is his\", because \\t is read as a tab character. Correctly escaped, this would read as \"is\\\\this\".
See also # Further information about specific props files: weka/core/Capabilities.props weka/core/logging/Logging.props weka/experiment/DatabaseUtils.props weka/gui/GenericObjectEditor.props weka/gui/GUIEditors.props weka/gui/GenericPropertiesCreator.props weka/gui/GenericPropertiesCreator.excludes weka/gui/LookAndFeel.props weka/gui/MemoryUsage.props weka/gui/SimpleCLI.props weka/gui/beans/Beans.props weka/gui/experiment/Experimenter.props weka/gui/explorer/Explorer.props weka/gui/scripting/Groovy.props weka/gui/scripting/Jython.props weka/gui/treevisualizer/TreeVisualizer.props weka/gui/visualize/Visualize.props","title":"Properties File"},{"location":"properties_file/#general","text":"A properties file is a simple text file with this structure: = Notes: Comments start with the hash sign # . Backslashes within values need to be doubled (the backslashes get interpreted already when a property is read). To make a rather long property line more readable, one can use a backslash to continue on the next line. The Filter property, e.g., looks like this: weka.filters.Filter = \\ > weka.filters.supervised.attribute, \\ > weka.filters.supervised.instance, \\ > weka.filters.unsupervised.attribute, \\ > weka.filters.unsupervised.instance","title":"General"},{"location":"properties_file/#precedence","text":"The Weka property files (extension .props ) are searched for in the following order: current directory (< Weka 3.7.2) the user's home directory (see FAQ Where is my home directory located? for more information) (>= Weka 3.7.2) $WEKA_HOME/props (the default value for WEKA_HOME is user's home directory/wekafiles). the class path (normally the weka.jar file) If WEKA encounters those files it only supplements the properties, never overrides them. In other words, a property in the property file of the current directory has a higher precedence than the one in the user's home directory. 
Note: Under Cygwin , the home directory is still the Windows one, since the Java installation will still be one for Windows.","title":"Precedence"},{"location":"properties_file/#how-to-modify-a-props-file","text":"It is quite possible that the default setup of WEKA is not to your liking and that you want to tweak it a little bit. The use of .props files instead of hard-coding makes it quite easy to modify WEKA's behavior. As an example, we are modifying the background color of the 2D plots in the Explorer, changing it to dark gray . The responsible .props file is weka/gui/visualize/Visualize.props . These are the necessary steps: close WEKA extract the .props file from the weka.jar , using an archive manager that can handle ZIP files (e.g., 7-Zip under Windows) place this .props file in your home directory (see FAQ Where is my home directory located? on how to determine your home directory), or for Weka 3.7.2 or higher place this .props file in $WEKA_HOME/props (the default value of WEKA_HOME is user's home directory/wekafiles) open this .props with a text editor ( NB: Notepad under Windows might not handle the Unix line-endings correctly!) navigate to the property weka.gui.visualize.Plot2D.backgroundColour and change the color after the equal sign (\"=\") to darkGray (the article about weka/gui/visualize/Visualize.props lists all possible colors) save the file and restart WEKA","title":"How to modify a .props file?"},{"location":"properties_file/#notes","text":"Escaping Backslashes in values need to be escaped (i.e., doubled), otherwise they get interpreted as the start of an escape sequence. E.g., \"is\\this\" will be interpreted as \"is his\", because \\t is read as a tab character.
Correctly escaped, this would read as \"is\\\\this\".","title":"Notes"},{"location":"properties_file/#see-also","text":"Further information about specific props files: weka/core/Capabilities.props weka/core/logging/Logging.props weka/experiment/DatabaseUtils.props weka/gui/GenericObjectEditor.props weka/gui/GUIEditors.props weka/gui/GenericPropertiesCreator.props weka/gui/GenericPropertiesCreator.excludes weka/gui/LookAndFeel.props weka/gui/MemoryUsage.props weka/gui/SimpleCLI.props weka/gui/beans/Beans.props weka/gui/experiment/Experimenter.props weka/gui/explorer/Explorer.props weka/gui/scripting/Groovy.props weka/gui/scripting/Jython.props weka/gui/treevisualizer/TreeVisualizer.props weka/gui/visualize/Visualize.props","title":"See also"},{"location":"props_file/","text":"see Properties file","title":"Props file"},{"location":"removing_misclassified_instances_from_dataset/","text":"Sometimes it is necessary to clean out the instances misclassified by a classifier from a dataset. The following example loads a dataset, runs the RemoveMisclassified filter and saves the resulting dataset in another file again: RemoveMisclassifiedTest Source code: import weka.classifiers.Classifier ; import weka.core.Instances ; import weka.filters.Filter ; import weka.filters.unsupervised.instance.RemoveMisclassified ; import java.io.BufferedReader ; import java.io.BufferedWriter ; import java.io.FileReader ; import java.io.FileWriter ; /** * Runs the RemoveMisclassified filter over a given ARFF file. * First parameter is the input file, the second the classifier * to use and the third one is the output file. * * Usage: RemoveMisclassifiedTest input.arff classname output.arff * * @author FracPete (fracpete at waikato dot ac dot nz) */ public class RemoveMisclassifiedTest { public static void main ( String [] args ) throws Exception { if ( args . length != 3 ) { System . out . println ( \"\\nUsage: RemoveMisclassifiedTest input.arff classname output.arff\\n\" ); System .
exit ( 1 ); } // get data Instances input = new Instances ( new BufferedReader ( new FileReader ( args [ 0 ] ))); input . setClassIndex ( input . numAttributes () - 1 ); // get classifier Classifier c = Classifier . forName ( args [ 1 ] , new String [ 0 ] ); // setup and run filter RemoveMisclassified filter = new RemoveMisclassified (); filter . setClassifier ( c ); filter . setClassIndex ( - 1 ); filter . setNumFolds ( 0 ); filter . setThreshold ( 0.1 ); filter . setMaxIterations ( 0 ); filter . setInputFormat ( input ); Instances output = Filter . useFilter ( input , filter ); // output file BufferedWriter writer = new BufferedWriter ( new FileWriter ( args [ 2 ] )); writer . write ( output . toString ()); writer . newLine (); writer . flush (); writer . close (); } } See also # Use Weka in your Java code - for general use of the Weka API Save Instances to an ARFF File - for saving an Instances object to a file Downloads # RemoveMisclassifiedTest.java","title":"Removing misclassified instances from dataset"},{"location":"removing_misclassified_instances_from_dataset/#see-also","text":"Use Weka in your Java code - for general use of the Weka API Save Instances to an ARFF File - for saving an Instances object to a file","title":"See also"},{"location":"removing_misclassified_instances_from_dataset/#downloads","text":"RemoveMisclassifiedTest.java","title":"Downloads"},{"location":"requirements/","text":"The following matrix shows which minimum version of Java is necessary to run a specific Weka version. The latest official releases of Weka require Java 8 or later. Note that if you are using Windows and your computer has a display with high pixel density (HiDPI), you may need to use Java 9 or later to avoid problems with inappropriate scaling of Weka's graphical user interfaces. 
Weka Java 1.4 Java 5 Java 6 Java 7 Java 8 or later <3.4.0 \u2611 \u2611 \u2611 \u2611 \u2611 3.4.x \u2611 \u2611 \u2611 \u2611 \u2611 3.5.x <3.5.3 \u2611 \u2611 \u2611 \u2611 3.6.x \u2611 \u2611 \u2611 \u2611 3.7.x 3.7.0 <3.7.14 \u2611 \u2611 3.8.x <3.8.2 \u2611 3.9.x <3.9.2 \u2611","title":"Requirements"},{"location":"roc_curves/","text":"General # Weka just varies the threshold on the class probability estimates in each case. What does that mean? In case of a classifier that does not return proper class probabilities (like SMO without the -M option, or IB1), you will end up with only two points in the curve. Using a classifier that returns proper distributions, like BayesNet, J48 or SMO with the -M option for building logistic models, you will get nice curves. The class used for calculating the ROC and also the AUC (= area under the curve) is weka.classifiers.evaluation.ThresholdCurve . Commandline # You can output the data for the ROC curves with the following options: -threshold-file The file to save the threshold data to. The format is determined by the extension, e.g., '.arff' for ARFF format or '.csv' for CSV. -threshold-label