Misc Features - Decision Tree Machine Learning

The SpiceLogic Decision Tree software does two different jobs. You can build your own decision tree to make a sound, rational choice. You can also go the other way and learn how other people made their choices, by looking at the data they left behind. That second job is called Decision Tree Machine Learning.

It helps to know the difference between two ways of thinking about decisions. Normative decision theory is about how a rational person should decide. Descriptive decision theory is about how people actually decide, whether those choices were smart or not. Machine learning sits on the descriptive side. You feed it real records of what people did, and it works backward to find the pattern behind their behavior. In plain terms, it is a way to read people's minds from their actions.

Here is a quick example of the difference. Say you want to know which loan applications a bank should approve. The normative way is to sit down and reason out the rules a careful lender would follow. The descriptive way is to take a stack of past applications, look at which ones the bank actually approved, and let the software find the pattern. You end up with the bank's real policy, even the parts nobody wrote down.

The SpiceLogic Decision Tree software supports both ways of working. With machine learning, it can study your data and build two kinds of trees for you: a Classification Decision Tree and a Regression Tree. The pages below walk through each one with a worked example.

Classification Decision Tree

Confusion Matrix

Regression Tree

Breast Cancer Data set - example

Classification Decision Tree

A classification tree predicts a category. The answer it gives back is a label, like Yes or No, or Low, Medium and High, rather than a number.

Here is a simple example. Imagine you want to sell hot dogs next to a tennis court, where the players are your main customers. The trouble is that the players do not show up every day, so some days you sell nothing. To figure out what brings them out, you start writing down the weather for each day: the outlook (sunny, overcast, rainy), the temperature, the wind, the humidity, and whether the players actually came out and played. After a few weeks your notes look like the table below.

Play Tennis classification dataset used as the input data source for the decision tree machine learning walkthrough, showing weather columns and the decision column.

The Decision Tree software can read this kind of data from a CSV file, an Excel file, or a SQL Server database. Say you have it in an Excel file. To turn that data into a decision tree, open the Decision Tree software and click the Machine Learning button shown below.

Decision Tree Analyzer start page with the Machine Learning option highlighted, used to launch the data-driven tree induction wizard.

Next, pick the type of data source you are using. For this example, choose Excel.

Data source type picker offering CSV, Excel, and SQL Server as inputs to the Decision Tree Analyzer machine learning module.

Now point it at your Excel file and choose the sheet that holds your data. If your workbook has more than one sheet, this is where you tell the software which one to read.

Excel data source step in the machine learning wizard, with a workbook chosen and its sheet selected as the source of training rows.

After you click Proceed, the software lists every column it found in your data. Tick the columns that you think actually affect the decision. The Day number, for example, has nothing to do with whether people play tennis, so leave the Day column unchecked. Then mark PlayTennis as the Decision Column, since that is the outcome you want the tree to predict.

Column chooser for the Play Tennis dataset, with the Day column unchecked and the PlayTennis column flagged as the decision column for classification.

Now click the Learn from Data button.

Learn from Data button in the machine learning wizard, which runs the induction algorithm to build a decision tree from the selected columns.

The software runs a machine learning algorithm over your data and builds a decision tree from it, like the one shown below. Read it from the top down. Each branch is a question about the weather, and following the answers leads you to a prediction of whether the players will show up. So on a sunny day with high humidity, you follow those branches and the tree tells you what to expect.

Generated classification decision tree for the Play Tennis dataset, with the root split on Outlook and branches resolving to Yes/No leaf decisions.

Click the Rule Set button to see the same tree written out as plain if-then rules. Some people find the rules easier to read and share than the tree diagram. A rule reads like "if the outlook is sunny and the humidity is high, then the players do not play", which is easy to hand to someone who never saw the tree.

Rule set view derived from the generated Play Tennis decision tree, listing one IF/THEN rule per root-to-leaf path.

If you tick the Compact checkbox, the software produces a shorter rule set. It joins conditions that lead to the same outcome with "OR", so you get fewer, tidier rules that say the same thing. For example, two separate rules that both end in "do not play" get folded into one, which is much easier to scan.

Compact rule set view of the Play Tennis tree, with the Compact checkbox enabled to combine equivalent conditions using OR.

A few buttons let you adjust things without starting over:

Rebuild builds the tree again from the same data source. This is handy after your data has changed, for example once you have added a few more weeks of records.
Columns lets you go back and change which columns count as deciding factors, so you can add or drop a factor and see how the tree changes. Drop the humidity column, rebuild, and you can see right away how much it mattered.
Data Source lets you switch to a different source, such as another SQL Server database, or a different Excel or CSV file path.

Test against a data source and get the Confusion Matrix

Once a classification tree is built, it is worth checking how well it actually predicts. The software lets you run the tree against a set of data and reports back with a Confusion Matrix and other performance numbers. A confusion matrix simply counts how often the tree got each answer right and how often it got it wrong. That way you can trust the tree before you rely on it. To start, click the Test button shown here.

Test button on a generated classification tree, used to run the tree against a labelled data source and produce a confusion matrix and performance metrics.

You will be asked to pick a data source, the same way you did when you first built the tree. You can test against the original data, or against a fresh batch you held back. Testing on a fresh batch is the better check, because it shows how the tree does on records it has never seen. Once it finishes, you get a Confusion Matrix view like the one below.

Confusion matrix view summarizing true positives, false positives, true negatives, and false negatives for the trained Play Tennis classifier.

To look closer at how the tree did on one particular class, click that class in the menu on the left. For example, clicking the Yes menu button shows the detailed view below, with the metrics for that single class.

Per-class performance details for the Yes class, showing precision, recall, F1 score, and other classification metrics derived from the confusion matrix.

If you want the full meaning behind each of these numbers, the Wikipedia Confusion matrix page explains every metric clearly.

Regression Tree

A regression tree predicts a number instead of a category. The answer it gives back is a real value, like the price of a house or an expected salary.

Building one works just like building a classification tree, with one difference: the Decision Column has to be a continuous, number-type variable. Sticking with the tennis example, suppose you no longer just want to know whether people play. You want to predict how many minutes they will play in a day, based on the outlook, temperature, wind, humidity, and so on. Your data set for that might look like the table below.

Variant of the Play Tennis dataset adapted for regression, with the decision column changed from a class to a continuous numeric value.

Follow the same steps shown for the Classification Tree analysis and build the tree. The result will look like the one below. The difference is at the leaves: instead of a Yes or No label, each ending gives you a predicted number. So one branch might end in 45 minutes and another in 90 minutes.

Generated regression decision tree, with leaf nodes carrying continuous predicted values instead of class labels.

One thing to note: for a regression tree, the Test button does not appear. That is on purpose. A confusion matrix only makes sense for a classification tree, where the answers are categories you can count as right or wrong. It does not apply when the prediction is a number.

Breast Cancer Data set - example

Let's try something real and useful. We can take a breast cancer data set and build a decision tree that classifies whether a recurrence event is likely.

You can download the CSV data set here: https://datahub.io/machine-learning/breast-cancer

The data set looks like this:

Breast cancer dataset preview used as a real-world classification example for decision tree machine learning.

Start the Decision Tree software and open the Machine Learning wizard the same way as before. Choose the CSV file data source and select the columns shown below. In this example I unchecked two columns on purpose, just to keep the resulting tree smaller and easier to read.

Column selection step for the breast cancer dataset with two non-informative columns unchecked to keep the resulting tree small.

Now click the Learn from Data button. The software runs its machine learning algorithm over the data set and produces a decision tree like the one below. From here you can read off which factors point toward a recurrence event and which point away from it.

Decision tree induced from the breast cancer dataset, classifying cases as recurrence-events or no-recurrence-events based on splits over deg-malig, node-caps, breast-quad, breast, and age columns.