
Releases: adaa-polsl/RuleKit-python

v.1.7.1

26 Jan 12:13

What's new in RuleKit version 1.7?

1. Manually initializing RuleKit is no longer necessary.

Prior to this version, RuleKit had to be manually initialised using the rulekit.RuleKit.init method.

from rulekit import RuleKit
from rulekit.classification import RuleClassifier

RuleKit.init()

clf = RuleClassifier()
clf.fit(X, y)

This step is no longer required; you can use any of the RuleKit operators directly.

from rulekit.classification import RuleClassifier

clf = RuleClassifier()
clf.fit(X, y)

2. Introducing negated conditions for nominal attributes in rules.

The new complementary_conditions parameter enables the induction of negated conditions for nominal attributes. Such conditions have the form attribute = !{value}. The parameter has been added to all operator classes.

import pandas as pd
from rulekit.classification import RuleClassifier

df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/mushrooms.csv')
X = df.drop('type', axis=1)
y = df['type']

clf = RuleClassifier(complementary_conditions=True)
clf.fit(X, y)

for rule in clf.model.rules:
    print(rule)

IF stalk_surface_below_ring = !{y} AND spore_print_color = !{u} AND odor = !{n} AND stalk_root = !{c} THEN type = {p}
IF bruises = {f} AND odor = !{n} THEN type = {p}
IF stalk_surface_above_ring = {k} AND gill_spacing = {c} THEN type = {p}
IF bruises = {f} AND stalk_surface_above_ring = !{f} AND stalk_surface_below_ring = !{f} AND ring_number = !{t} AND stalk_root = !{e} AND gill_attachment = {f} THEN type = {p}
IF stalk_surface_below_ring = !{f} AND stalk_color_below_ring = !{n} AND spore_print_color = !{u} AND odor = !{a} AND gill_size = {n} AND cap_surface = !{f} THEN type = {p}
IF cap_shape = !{s} AND cap_color = !{c} AND habitat = !{w} AND stalk_color_below_ring = !{g} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{n} AND gill_spacing = {c} AND gill_color = !{u} AND stalk_root = !{c} AND stalk_color_above_ring = !{g} AND ring_type = !{f} AND veil_color = {w} THEN type = {p}
IF cap_shape = !{c} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF cap_color = !{y} AND cap_shape = !{c} AND stalk_color_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF spore_print_color = !{r} AND odor = !{f} AND stalk_color_above_ring = !{c} AND gill_size = {b} THEN type = {e}
IF cap_color = !{p} AND cap_shape = !{c} AND habitat = !{u} AND stalk_color_below_ring = !{y} AND gill_color = !{b} AND spore_print_color = !{r} AND ring_number = !{n} AND odor = !{f} AND cap_surface = !{g} THEN type = {e}

3. Approximate induction for classification rulesets.

To reduce training time on classification datasets, so-called approximate induction can now be used. It prevents the algorithm from checking every possible numerical condition during the rule induction phase; instead, you configure the number of bins used as candidate splits to limit the computation.

To enable approximate induction, use the approximate_induction parameter. To configure the maximum number of bins, use the approximate_bins_count parameter. At the moment, approximate induction is only available for classification rule sets.

The following example shows how using this function can reduce training time without sacrificing predictive accuracy.

import pandas as pd
from rulekit.classification import RuleClassifier
from sklearn.metrics import balanced_accuracy_score

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
df_test = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/test.parquet')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']
X_test = df_test.drop('class', axis=1)
y_test = df_test['class']

clf1 = RuleClassifier()
clf2 = RuleClassifier(approximate_induction=True, approximate_bins_count=20)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

pd.DataFrame([
    {
        'Variant': 'Without approximate induction',
        'Training time [s]': clf1.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf1.predict(X_test)),
    },
    {
        'Variant': 'With approximate induction',
        'Training time [s]': clf2.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf2.predict(X_test)),
    }
])
| Variant | Training time [s] | BAcc on test dataset |
| --- | --- | --- |
| Without approximate induction | 5.730046 | 0.688744 |
| With approximate induction | 0.142259 | 0.703959 |

4. Observing and stopping the training process

You can now observe the progress of the training process and stop it at a chosen point. To do this, create a class extending the events.RuleInductionProgressListener class and override one or more of the following methods:

  • on_new_rule(self, rule): Called whenever a new rule is induced.
  • on_progress(self, total_examples_count: int, uncovered_examples_count: int): Observes training progress, i.e. how many examples have been covered relative to the total number of training examples. The ratio uncovered_examples_count / total_examples_count can be taken as a rough approximation of progress. Keep in mind, however, that the ruleset will not cover all training examples in every scenario, and the progress is unlikely to increase linearly.
  • should_stop(self) -> bool: Stops the training process at a given point. If it returns True, training is stopped; you can then use the partially trained model.

Then register your listener with the operator instance using the add_event_listener method. All operators support this method.

An example of the use of this mechanism is shown below.

import pandas as pd
from rulekit.events import RuleInductionProgressListener
from rulekit.rules import Rule
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']


class MyProgressListener(RuleInductionProgressListener):
    _uncovered_examples_count: int = None
    _should_stop = False

    def on_new_rule(self, rule: Rule):
        pass

    def on_progress(
        self,
        total_examples_count: int,
        uncovered_examples_count: int
    ):
        if uncovered_examples_count < total_examples_count * 0.1:
            self._should_stop = True

    def should_stop(self) -> bool:
        return self._should_stop
    
clf = RuleClassifier()
clf.add_event_listener(MyProgressListener())
clf.fit(X_train, y_train)

for rule in clf.model.rules:
    print(rule)

IF number_customer_service_calls = (-inf, 3.50) AND account_length = (-inf, 224.50) AND total_day_minutes = (-inf, 239.95) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 254.05) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 255) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 263.25) AND international_plan = {no} THEN class = {no}
IF total_intl_calls = (-inf, 19.50) AND number_customer_service_calls = (-inf, 3.50) AND total_eve_minutes = (-inf, 346.20) AND total_intl_minutes = (-inf, 19.85) AND total_day_calls = (-inf, 154) AND total_day_minutes = (-inf, 263.25) THEN class = {no}
IF number_customer_service_calls = (-inf, 4.50) AND total_day_minutes = <1.30, 254.05) AND international_plan = {no} THEN class = {no}
IF voice_mail_plan = {no} AND total_eve_minutes = <175.35, inf) AND total_day_calls = (-inf, 149) AND total_day_minutes = <263.25, inf) AND total_night_minutes = <115.85, inf) THEN class = {yes}

To simplify stopping the learning process after a certain number of rules have been induced, the max_rule_count parameter has been added to the operators.

In classification, this parameter denotes the maximum number of rules per class in the training dataset.

5. Faster regression rules induction

Prior to version 1.7.0, regression rule induction in this package was inherently slow due to the calculation of median values. The new mean_based_regression parameter enables faster regression using means instead of medians. See the example below.

import pandas as pd
from rulekit.regression import RuleRegressor

df_train = pd.read_csv('./housing.csv')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

reg = RuleRegressor(mean_based_regression=True)
reg.fit(X_train, y_train)

6. Contrast set mining

The package now includes an algorithm for contrast set (CS) identification (Gudyś et al., 2022).

The following operators were introduced:

  • classification.ContrastSetRuleClassifier
  • regression.ContrastSetRuleRegressor
  • survival.ContrastSetSurvivalRules

Other changes

  • New parameter control_apriori_precision added to the operators classification.RuleClassifier and classification.ExpertRuleClassifier. If enabled, it checks whether the candidate precision is higher than the apriori precision of the class under test...

v1.6.0

15 Jun 13:23

Added new hyperparameters: select_best_candidate and max_uncovered_fraction for models.

v1.5.7

12 Jun 18:49
fix: add missing dependency