Releases: adaa-polsl/RuleKit-python
v.1.7.1
What's new in RuleKit version 1.7?
1. Manually initializing RuleKit is no longer necessary.
Prior to this version, RuleKit had to be manually initialized using the `rulekit.RuleKit.init` method:
```python
from rulekit import RuleKit
from rulekit.classification import RuleClassifier

RuleKit.init()

clf = RuleClassifier()
clf.fit(X, y)
```
This step is no longer necessary; you can now use any of the RuleKit operators directly:
```python
from rulekit.classification import RuleClassifier

clf = RuleClassifier()
clf.fit(X, y)
```
2. Introducing negated conditions for nominal attributes in rules.
Using the new `complementary_conditions` parameter, you can enable the induction of negated conditions for nominal attributes. Such conditions have the form `attribute = !{value}`. This parameter has been added to all operator classes.
```python
import pandas as pd
from rulekit.classification import RuleClassifier

df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/mushrooms.csv')
X = df.drop('type', axis=1)
y = df['type']

clf = RuleClassifier(complementary_conditions=True)
clf.fit(X, y)

for rule in clf.model.rules:
    print(rule)
```
```
IF stalk_surface_below_ring = !{y} AND spore_print_color = !{u} AND odor = !{n} AND stalk_root = !{c} THEN type = {p}
IF bruises = {f} AND odor = !{n} THEN type = {p}
IF stalk_surface_above_ring = {k} AND gill_spacing = {c} THEN type = {p}
IF bruises = {f} AND stalk_surface_above_ring = !{f} AND stalk_surface_below_ring = !{f} AND ring_number = !{t} AND stalk_root = !{e} AND gill_attachment = {f} THEN type = {p}
IF stalk_surface_below_ring = !{f} AND stalk_color_below_ring = !{n} AND spore_print_color = !{u} AND odor = !{a} AND gill_size = {n} AND cap_surface = !{f} THEN type = {p}
IF cap_shape = !{s} AND cap_color = !{c} AND habitat = !{w} AND stalk_color_below_ring = !{g} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{n} AND gill_spacing = {c} AND gill_color = !{u} AND stalk_root = !{c} AND stalk_color_above_ring = !{g} AND ring_type = !{f} AND veil_color = {w} THEN type = {p}
IF cap_shape = !{c} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF cap_color = !{y} AND cap_shape = !{c} AND stalk_color_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF spore_print_color = !{r} AND odor = !{f} AND stalk_color_above_ring = !{c} AND gill_size = {b} THEN type = {e}
IF cap_color = !{p} AND cap_shape = !{c} AND habitat = !{u} AND stalk_color_below_ring = !{y} AND gill_color = !{b} AND spore_print_color = !{r} AND ring_number = !{n} AND odor = !{f} AND cap_surface = !{g} THEN type = {e}
```
3. Approximate induction for classification rulesets.
To reduce training time on classification datasets, you can now use so-called approximate induction. Instead of evaluating every possible numerical condition during the rule induction phase, the algorithm considers only a limited number of candidate splits (bins), and you can configure how many bins are used.
To enable approximate induction, use the `approximate_induction` parameter; to configure the maximum number of bins, use the `approximate_bins_count` parameter. At the moment, approximate induction is only available for classification rule sets.
The following example shows how this feature can reduce training time without sacrificing predictive accuracy.
```python
import pandas as pd
from rulekit.classification import RuleClassifier
from sklearn.metrics import balanced_accuracy_score

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
df_test = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/test.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']
X_test = df_test.drop('class', axis=1)
y_test = df_test['class']

clf1 = RuleClassifier()
clf2 = RuleClassifier(approximate_induction=True, approximate_bins_count=20)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

pd.DataFrame([
    {
        'Variant': 'Without approximate induction',
        'Training time [s]': clf1.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf1.predict(X_test)),
    },
    {
        'Variant': 'With approximate induction',
        'Training time [s]': clf2.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf2.predict(X_test)),
    }
])
```
| Variant | Training time [s] | BAcc on test dataset |
|---|---|---|
| Without approximate induction | 5.730046 | 0.688744 |
| With approximate induction | 0.142259 | 0.703959 |
4. Observing and stopping the training process
You can now observe the progress of the training process and stop it at a chosen point. To do this, create a class extending the `events.RuleInductionProgressListener` class. Such a class can implement any of the following methods:

- `on_new_rule(self, rule)`: called when a new rule has been induced.
- `on_progress(self, total_examples_count: int, uncovered_examples_count: int)`: observes the training progress, i.e. how many examples remain uncovered relative to the total number of training examples. The ratio `uncovered_examples_count / total_examples_count` can be taken as a rough approximation of training progress. Keep in mind, however, that the ruleset will not cover all of the training examples in every scenario, and the progress will most likely not increase linearly.
- `should_stop(self) -> bool`: decides whether to stop the training process at a given point. If it returns `True`, training is stopped and you can proceed to use the partially trained model.

Then register your listener on the operator instance using the `add_event_listener` method. All operators support this method.
An example of the use of this mechanism is shown below.
```python
import pandas as pd
from rulekit.events import RuleInductionProgressListener
from rulekit.rules import Rule
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

class MyProgressListener(RuleInductionProgressListener):
    _uncovered_examples_count: int = None
    _should_stop = False

    def on_new_rule(self, rule: Rule):
        pass

    def on_progress(
        self,
        total_examples_count: int,
        uncovered_examples_count: int
    ):
        # Stop once fewer than 10% of the training examples remain uncovered.
        if uncovered_examples_count < total_examples_count * 0.1:
            self._should_stop = True

    def should_stop(self) -> bool:
        return self._should_stop

clf = RuleClassifier()
clf.add_event_listener(MyProgressListener())
clf.fit(X_train, y_train)

for rule in clf.model.rules:
    print(rule)
```
```
IF number_customer_service_calls = (-inf, 3.50) AND account_length = (-inf, 224.50) AND total_day_minutes = (-inf, 239.95) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 254.05) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 255) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 263.25) AND international_plan = {no} THEN class = {no}
IF total_intl_calls = (-inf, 19.50) AND number_customer_service_calls = (-inf, 3.50) AND total_eve_minutes = (-inf, 346.20) AND total_intl_minutes = (-inf, 19.85) AND total_day_calls = (-inf, 154) AND total_day_minutes = (-inf, 263.25) THEN class = {no}
IF number_customer_service_calls = (-inf, 4.50) AND total_day_minutes = <1.30, 254.05) AND international_plan = {no} THEN class = {no}
IF voice_mail_plan = {no} AND total_eve_minutes = <175.35, inf) AND total_day_calls = (-inf, 149) AND total_day_minutes = <263.25, inf) AND total_night_minutes = <115.85, inf) THEN class = {yes}
```
To simplify stopping the training after a certain number of rules have been induced, the `max_rule_count` parameter has been added to the operators. For classification, this parameter denotes the maximum number of rules for each class in the training dataset, as shown in the sketch below.
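The following is a minimal sketch of its usage, reusing the churn dataset from the earlier examples; the limit of 5 rules per class is an arbitrary choice for illustration.

```python
import pandas as pd
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

# For classification, max_rule_count is applied per class, so this ruleset
# will contain at most 5 rules for each class.
clf = RuleClassifier(max_rule_count=5)
clf.fit(X_train, y_train)
```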
5. Faster regression rule induction
Prior to version 1.7.0, regression rule induction using this package was inherently slow due to the calculation of median values. It is now possible to use the new `mean_based_regression` parameter to enable faster regression that uses mean values instead of medians. See the example below.
```python
import pandas as pd
from rulekit.regression import RuleRegressor

df_train = pd.read_csv('./housing.csv')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

reg = RuleRegressor(mean_based_regression=True)
reg.fit(X_train, y_train)
```
6. Contrast set mining
The package now includes an algorithm for contrast set (CS) identification (Gudyś et al., 2022). The following operators were introduced (a usage sketch follows the list):

- `classification.ContrastSetRuleClassifier`
- `regression.ContrastSetRuleRegressor`
- `survival.ContrastSetSurvivalRules`
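Below is a minimal sketch of how the classification variant might be used. The dataset path, the nominal `group` column, and the `contrast_attribute` keyword naming it are assumptions made for illustration; the exact `fit` signature is not shown in these notes.

```python
import pandas as pd
from rulekit.classification import ContrastSetRuleClassifier

# Hypothetical dataset with a nominal 'group' column defining the contrast groups.
df = pd.read_csv('my_dataset.csv')
X = df.drop('class', axis=1)
y = df['class']

cs_clf = ContrastSetRuleClassifier()
# 'contrast_attribute' (the column splitting examples into contrast groups)
# is an assumed argument name.
cs_clf.fit(X, y, contrast_attribute='group')

for rule in cs_clf.model.rules:
    print(rule)
```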
Other changes
- New parameter `control_apriori_precision` added to the `classification.RuleClassifier` and `classification.ExpertRuleClassifier` operators. If enabled, it checks whether the candidate precision is higher than the apriori precision of the class under test... A minimal sketch of enabling it is shown below.
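The sketch below assumes the flag is passed through the constructor, mirroring how the other parameters in this release are configured, and reuses the churn dataset from the earlier examples.

```python
import pandas as pd
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

# Candidate conditions whose precision does not exceed the apriori precision
# of the class under test are rejected during induction.
clf = RuleClassifier(control_apriori_precision=True)
clf.fit(X_train, y_train)
```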