
In the rapidly evolving landscape of AI and ML, the interpretability of models has become a vital consideration. Complex models can be highly accurate, yet understanding how they arrive at their decisions is often a challenge. Interpretability is critically important in scenarios like fraud detection, where clarity and transparency in decision-making are required for legal and ethical reasons. In this article, I will explore how decision trees, one of the most interpretable ML models, can be used to auto-generate rules.
The Power of Decision Trees
Decision trees are hierarchical models that partition data by making decisions based on feature values. These models are excellent for rule generation because each path from the root of the tree to a leaf node represents a rule. The ability to handle both categorical and numerical data makes decision trees versatile for a wide variety of datasets, making them a beneficial tool in the rule generation process.

To illustrate this, I’ll walk you through a Python script which generates a dataset of financial transactions, trains a decision tree on this data, and then converts the decision tree into a set of human-readable rules. The code snippet below shows the complete Python script.
from sklearn import tree
from sklearn.tree import _tree
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# A function that converts the tree to human readable rules
def tree_to_human_readable_rules(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    rules_list = []

    def recurse(node, depth, rule):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: branch on the feature threshold and recurse into both children
            name = feature_name[node]
            threshold = tree_.threshold[node]
            rule_left = rule.copy()
            rule_left.append(f"{name} <= {threshold}")
            recurse(tree_.children_left[node], depth + 1, rule_left)
            rule_right = rule.copy()
            rule_right.append(f"{name} > {threshold}")
            recurse(tree_.children_right[node], depth + 1, rule_right)
        else:
            # Leaf node: join the accumulated conditions and label with the majority class
            rule_str = " AND ".join(rule)
            if tree_.value[node][0][0] > tree_.value[node][0][1]:
                rule_str += ": Not Fraud"
            else:
                rule_str += ": Fraud"
            rules_list.append(rule_str)

    recurse(0, 1, [])
    return rules_list
# Set sample size to generate
sample_size = 1000
# Set fraud rate
fraud_rate = 0.05
# Create a DataFrame with samples
df = pd.DataFrame({
    'transaction_amount': np.random.randint(1, 10000, size=sample_size),
    'transaction_location': np.random.choice(['online', 'in_store', 'atm'], size=sample_size),
    'customer_location': np.random.choice(['local', 'international'], size=sample_size),
    'is_fraud': np.random.choice([0, 1], size=sample_size, p=[1-fraud_rate, fraud_rate]),
    'payment_method_type': np.random.choice(['credit_card', 'bank_account'], size=sample_size),
    'num_transactions_last_6_months': np.random.randint(0, 200, size=sample_size),
    'sum_transactions_last_6_months': np.random.randint(0, 200000, size=sample_size),
    'num_transactions_last_3_months': np.random.randint(0, 100, size=sample_size),
    'sum_transactions_last_3_months': np.random.randint(0, 100000, size=sample_size),
    'num_transactions_last_month': np.random.randint(0, 50, size=sample_size),
    'sum_transactions_last_month': np.random.randint(0, 50000, size=sample_size),
    'num_transactions_last_week': np.random.randint(0, 10, size=sample_size),
    'sum_transactions_last_week': np.random.randint(0, 10000, size=sample_size),
    'num_transactions_last_day': np.random.randint(0, 5, size=sample_size),
    'sum_transactions_last_day': np.random.randint(0, 2000, size=sample_size),
    'account_registration_date': pd.date_range(start='1/1/2010', end='1/1/2022', periods=sample_size),
    'user_age': np.random.randint(18, 120, size=sample_size)
})
# Current date
current_date = pd.to_datetime(datetime.now())
# Calculate the age of the account in days
df['account_age_days'] = (current_date - df['account_registration_date']).dt.days
# Convert categorical variables to one-hot encodings
df = pd.get_dummies(df, columns=['transaction_location', 'customer_location', 'payment_method_type'])
# Drop the original 'account_registration_date' column
df = df.drop('account_registration_date', axis=1)
# Split the data into features and target
dataset = df.drop('is_fraud', axis=1)
labels = df['is_fraud']
# Create a decision tree classifier object
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=1234)
# Train the decision tree classifier
clf = clf.fit(dataset, labels)
# Visualize the decision tree
plt.figure(figsize=(20, 10)) # Adjust as needed
_ = tree.plot_tree(clf,
                   feature_names=dataset.columns,
                   class_names=['Not Fraud', 'Fraud'],
                   filled=True)
plt.show()
# Visualize the decision tree as text
tree_rules = tree.export_text(clf, feature_names=list(dataset.columns))
print(tree_rules)
# Convert the decision tree to human readable rules
rules = tree_to_human_readable_rules(clf, dataset.columns)
# Print the rules
for r in rules:
    print(r)
Generating Synthetic Data
The first part of the script generates a synthetic dataset mimicking financial transactions, including features such as transaction amount, location, customer location, payment method type, and transaction history. Additionally, demographic data such as account_registration_date and user_age are included, with the former being used to calculate the number of days the account has been active (account_age_days).
Each transaction has an is_fraud label, which is randomly assigned based on a predefined fraud rate. This label is what our decision tree model will attempt to predict.
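Before training, it can be worth sanity-checking the generated frame, for example to confirm the fraud rate came out close to the configured 5%. A quick check along these lines (assuming the script above has already run; it is not part of the original script) would work:
# Quick sanity check on the synthetic data (illustrative only)
print(df.shape)                  # one row per synthetic transaction
print(df['is_fraud'].mean())     # should be roughly equal to fraud_rate (0.05)
print(df.head())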
Training the Decision Tree
After generating the dataset, the script trains a decision tree classifier. In this example, I set the maximum depth of the tree to 3 to ensure the resulting tree, and the rules derived from it, remain readable and interpretable.
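One thing worth noting is that the script imports train_test_split but never uses it, and the tree is fit on the full dataset. If you want a rough sense of how such a tree generalizes, a minimal extension like the sketch below would work (it assumes the script above has already run; holdout_clf is just an illustrative name, and with purely random labels the accuracy will simply hover around the majority-class rate):
# Hypothetical extension, not part of the original script: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    dataset, labels, test_size=0.2, random_state=1234, stratify=labels
)
holdout_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=1234)
holdout_clf.fit(X_train, y_train)
print("Held-out accuracy:", holdout_clf.score(X_test, y_test))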
Visualizing the Decision Tree
Once the model is trained, I visualize the decision tree using matplotlib and sklearn’s plot_tree function. This visualization helps us understand the decision made at each node of the tree and how those decisions lead to the final prediction at the leaf nodes.
In the visualization, each node represents a decision based on a feature value, and each path from the root to a leaf node represents a rule. The leaf nodes display the predicted outcome, either ‘Fraud’ or ‘Not Fraud.’
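If you want to keep the diagram alongside the generated rules, for example to attach to a review or documentation, you can save the figure instead of only displaying it. A minimal variant of the plotting code above (the filename is just an example) might look like this:
# Save the rendered tree to a file in addition to displaying it
plt.figure(figsize=(20, 10))
tree.plot_tree(clf,
               feature_names=dataset.columns,
               class_names=['Not Fraud', 'Fraud'],
               filled=True)
plt.savefig('fraud_decision_tree.png', dpi=150, bbox_inches='tight')
plt.show()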
Converting the Decision Tree to Rules
The final and most crucial part of the script is the conversion of the decision tree into a set of human-readable rules. I accomplish this with a function named tree_to_human_readable_rules. This function traverses the decision tree and formulates a rule based on the decisions at each node.
This function initiates recursion at the root. Each time it encounters a non-leaf node, it appends the appropriate comparison (<= or >) to the rule being built and descends into both children (left and right); when it reaches a leaf node, it appends the prediction to the rule and adds the completed rule to a list.
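For intuition, the rules this produces look something like the lines below. The exact features and thresholds depend entirely on the tree learned from your randomly generated data, so treat these as an illustration of the output format rather than actual results:
transaction_amount <= 5123.5 AND num_transactions_last_week > 7.5: Fraud
transaction_amount <= 5123.5 AND num_transactions_last_week <= 7.5: Not Fraud
transaction_amount > 5123.5 AND customer_location_international <= 0.5: Not Fraud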
Conclusion
While decision trees may not always provide the highest predictive accuracy compared to other more complex machine learning models, they do offer excellent interpretability. As demonstrated in this Python script, decision trees can be a powerful tool for generating understandable rules. This is especially useful in areas like fraud detection, where the reasoning behind a prediction can be just as important as the prediction itself.
Advantages of Decision Trees
- Interpretability: As discussed throughout this article, the ability of decision trees to provide clear, understandable rules is one of their major advantages.
- Handling of both numerical and categorical data: Decision trees can handle a variety of data types, which can simplify preprocessing.
- Non-parametric: Decision trees make no assumptions about the distribution of data and can therefore be useful for data that does not conform to typical statistical distributions.
Disadvantages of Decision Trees
- Prone to overfitting: Without proper tuning, decision trees can create overly complex trees that do not generalize well to unseen data.
- Unstable: Small changes in the data can lead to a drastically different tree. This can be mitigated by using ensemble methods, like Random Forests.
- Biased with imbalanced datasets: Decision trees tend to be biased towards the dominant class, so it’s important to balance the dataset, for example by resampling or using class weights, before training.
When using decision trees, it’s crucial to monitor for signs of overfitting. Techniques such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree can help prevent overfitting. Moreover, cross-validation can be used to achieve more robust performance and stability.
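As a rough illustration of those knobs, the sketch below constrains the tree and estimates its performance with 5-fold cross-validation. It assumes dataset and labels from the script above, and the specific parameter values are arbitrary rather than tuned:
# Illustrative only: constrain tree growth and estimate stability with cross-validation
from sklearn.model_selection import cross_val_score

constrained_clf = tree.DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=50,       # every rule must cover at least 50 transactions
    ccp_alpha=0.001,           # cost-complexity pruning
    class_weight='balanced',   # counteract the 5% fraud-rate imbalance
    random_state=1234,
)
scores = cross_val_score(constrained_clf, dataset, labels, cv=5)
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
Constraining min_samples_leaf is particularly relevant for rule generation, because it guarantees that every generated rule is supported by a reasonable number of historical transactions.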
Finally, it’s important to note that the Python script shared in this article is a small proof of concept demonstrating the idea of rule generation with decision trees. It was created as part of my personal machine learning journey and is not intended as a production-ready system. Developing a system for real-world use would require additional work, including rigorous testing, validation on unseen data, feature engineering, hyperparameter tuning, and potentially exploring other, more complex models to improve accuracy and robustness.
Nevertheless, this prototype offers a tangible starting point from which to understand the potential of rule generation in machine learning, particularly in domains where interpretability is key, such as fraud detection. As always, the journey of learning and applying machine learning is iterative, and each step presents new challenges and opportunities for growth and innovation.
As a final note, it’s important to remember that each ML task has unique requirements and constraints, and the choice of model should be carefully considered based on these factors. In some cases, a simple, interpretable model like a decision tree may be the best choice, while in others, a more complex but less interpretable model may be more suitable.