Week 4 - Preventing Overfitting in Decision Trees

Overfitting in Decision Trees

Review of overfitting

  • Overfit = low training error, high true error.
  • Occurs in overly complex models.

Overfitting in decision trees

  • As depth increases, training error moves to 0.
  • Why? Each split is chosen to minimize classification error, so with enough splits the training error will eventually reach 0.
  • 0 training error is a big warning sign of overfitting.
  • Want to find the depth with the lowest test error (see the sketch below).
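
  • A minimal sketch of this, assuming scikit-learn and a synthetic dataset (neither is from the course), comparing training and validation error as the max depth grows:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    for depth in range(1, 16):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        train_err = 1 - tree.score(X_train, y_train)  # classification error = 1 - accuracy
        val_err = 1 - tree.score(X_val, y_val)
        print(f"depth={depth:2d}  train={train_err:.3f}  validation={val_err:.3f}")

  • Training error keeps falling toward 0 as depth grows; the depth where validation error bottoms out is the one to keep.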

Early stopping to avoid overfitting

Principles of Occam's razor: Learning decision trees

  • "Among competing hypotheses, the one with fewest assumptions should be selected", William of Occam, 13th Century.
  • Example: a patient presents 2 symptoms to a doctor; the doctor could either diagnose 2 separate diseases, one for each symptom, or 1 disease that explains both symptoms.
  • In decision trees: with 2 trees that have the same classification error, pick the simplest one.
  • How to pick simpler trees:
  • Early stopping: stop learning when tree becomes too complex (eg the depth is above some threshold).
  • Pruning: simplify tree after learning algorithm terminates (generally the preferred approach).

Early stopping in learning decision trees

  • Condition 1: stop splitting once you reach a max depth.
  • Use a validation set to find the ideal max-depth parameter.
  • Condition 2: use classification error to limit the depth of the tree.
  • Stop recursing if the classification error doesn't improve, eg no choice of split at a leaf node gives a lower classification error.
  • Generally add a small "magic" tolerance parameter and stop if the error doesn't decrease by at least that amount.
  • Very useful in practice.
  • Condition 3: stop if the number of data points is low (below some threshold N_min).
  • Should always implement this. (All three conditions are sketched in code after this list.)
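
  • The three conditions map roughly onto scikit-learn's DecisionTreeClassifier parameters (an assumption: the course builds its own tree, and sklearn measures improvement with impurity rather than classification error):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        max_depth=6,                 # Condition 1: cap the depth (tune on a validation set)
        min_impurity_decrease=0.01,  # Condition 2: require each split to improve by a threshold
        min_samples_split=20,        # Condition 3: don't split nodes with fewer than N_min points
    )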

Pruning Decision Trees

Motivating pruning

  • Problem with stopping condition 1:
  • How do you know the ideal max depth?
  • Can be fiddly and imperfect trying to find it.
  • Problem with stopping condition 2:
  • Short-sighted: splits that seem useless (eg don't reduce classification error by much) can be followed by excellent splits.

Pruning decision trees to avoid overfitting

  • Firstly, you need a measure of tree complexity. Generally use the number of leaves: L(T) = number of leaf nodes in tree T.
  • Want to find good balance between low classification error and low complexity:
  • Total cost: C(T) = Error(T) + tuning_param * L(T)
  • if tuning_param == 0:
    • Standard decision tree learning (complexity isn't penalized).
  • if tuning_param == float('inf'):
    • Only the lowest-complexity tree will be chosen (classification error won't matter; the tree collapses to predicting the majority class).
  • if 0 < tuning_param < float('inf'):
    • Find a balance between the two (see the cost sketch after this list).
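
  • A toy illustration of that total cost (the helper name and numbers are made up):

    def total_cost(num_mistakes, num_points, num_leaves, tuning_param):
        classification_error = num_mistakes / num_points          # Error(T)
        return classification_error + tuning_param * num_leaves   # + tuning_param * L(T)

    # eg a tree making 25 mistakes on 1000 points with 12 leaves, tuning_param = 0.005:
    print(total_cost(25, 1000, 12, 0.005))   # 0.025 + 0.06 = 0.085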

Tree pruning algorithm

  • Roughly: walk through each decision node (bottom-up) and replace the split with a leaf whenever the total cost C(T) is lower without it, as in the sketch below.
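
  • A minimal sketch of that bottom-up pruning idea on a toy tree structure (the Node class and helpers are illustrative, not the course's code):

    class Node:
        def __init__(self, prediction, mistakes, points, left=None, right=None):
            self.prediction = prediction  # majority class at this node
            self.mistakes = mistakes      # mistakes made if this node were a leaf
            self.points = points          # number of data points reaching this node
            self.left, self.right = left, right

        def is_leaf(self):
            return self.left is None and self.right is None

    def num_leaves(node):
        return 1 if node.is_leaf() else num_leaves(node.left) + num_leaves(node.right)

    def subtree_mistakes(node):
        if node.is_leaf():
            return node.mistakes
        return subtree_mistakes(node.left) + subtree_mistakes(node.right)

    def prune(node, total_points, tuning_param):
        if node.is_leaf():
            return node
        node.left = prune(node.left, total_points, tuning_param)
        node.right = prune(node.right, total_points, tuning_param)
        # Compare cost with the split kept vs this node collapsed into a single leaf
        cost_split = subtree_mistakes(node) / total_points + tuning_param * num_leaves(node)
        cost_leaf = node.mistakes / total_points + tuning_param * 1
        if cost_leaf <= cost_split:
            node.left = node.right = None  # throw the split away
        return node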

Week 4 - Handling Missing Data

Basic strategies for handling missing data

Challenges of missing data

  • Missing data can cause problems at training time (the dataset contains null values) and at prediction time (the input to predict contains missing values).

Strategy 1: Purification by skipping missing values

  • Most common method.
  • Just skip data points with missing values (sketched after this list).
  • Works if you have a "huge" dataset.
  • Can be problematic with small datasets where you need all the data.
  • Could also skip an entire feature (column) if it has a lot of missing values.
  • Doesn't help if data is missing at prediction time.
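
  • A sketch of skipping with pandas (the library and column names are assumptions, not from the course):

    import pandas as pd

    df = pd.DataFrame({"income": [50.0, None, 80.0, 65.0],
                       "credit": ["good", "bad", None, "good"],
                       "safe_loan": [1, 0, 1, 1]})

    rows_dropped = df.dropna()                  # skip every row with a missing value
    cols_dropped = df.dropna(axis="columns")    # or skip a feature with many missing values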

Strategy 2: Purification by imputing missing data

  • Fill in missing values with a "calculated guess":
  • Categorical features: use the mode of the dataset (the most commonly occurring value).
  • Numerical features: use the median (the middle value) or the mean.
  • More advanced methods: expectation-maximization (EM) algorithm.
  • Can introduce systematic errors; a minimal imputation sketch follows.
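
  • A sketch of simple imputation with pandas (column names are made up):

    import pandas as pd

    df = pd.DataFrame({"income": [50.0, None, 80.0, 65.0],
                       "credit": ["good", "bad", None, "good"]})

    df["income"] = df["income"].fillna(df["income"].median())   # numerical -> median
    df["credit"] = df["credit"].fillna(df["credit"].mode()[0])  # categorical -> mode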

Strategy 3: Modifying decision trees to handle missing data

Modifying decision trees to handle missing data

  • Add a branch to each split in the decision tree for missing (null) values.
  • Creates more accurate predictions.
  • Works for predictions and training.
  • Need to modify the training algorithm (a prediction sketch follows this list).
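
  • An illustrative prediction routine for a tree whose internal nodes record where missing values should go (the dict keys and example tree are made up, not the course's representation):

    def predict(node, x):
        if node["is_leaf"]:
            return node["prediction"]
        value = x.get(node["feature"])        # None when the feature is missing
        if value is None:
            branch = node["missing_branch"]   # branch chosen during training
        elif value < node["threshold"]:
            branch = "left"
        else:
            branch = "right"
        return predict(node[branch], x)

    tree = {"is_leaf": False, "feature": "income", "threshold": 60,
            "missing_branch": "left",
            "left": {"is_leaf": True, "prediction": 0},
            "right": {"is_leaf": True, "prediction": 1}}

    print(predict(tree, {"credit": "good"}))  # income missing -> follows "left" -> 0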

Feature split selection with missing data

  • When selecting a feature to split on, also decide which branch to send missing values down.
  • Same principle as choosing the split itself: assign missing values to the branch that gives the lowest classification error, as in the sketch below.
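
  • A sketch of that choice for one candidate split with binary (0/1) labels; the helper names are illustrative:

    def count_mistakes(labels):
        # mistakes made by predicting the majority class (binary 0/1 labels)
        return len(labels) - max(labels.count(0), labels.count(1)) if labels else 0

    def choose_missing_branch(left_labels, right_labels, missing_labels):
        err_if_left = count_mistakes(left_labels + missing_labels) + count_mistakes(right_labels)
        err_if_right = count_mistakes(left_labels) + count_mistakes(right_labels + missing_labels)
        return "left" if err_if_left <= err_if_right else "right"

    # Missing-value points that are mostly class 1 get routed to the branch
    # where class 1 is already the majority:
    print(choose_missing_branch([0, 0, 1], [1, 1, 1, 0], [1, 1]))  # -> "right"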