## Overfitting in Decision Trees

### Review of overfitting

• Overfit = low training error, high true error.
• Occurs in overly complex models.

### Overfitting in decision trees

• As depth increases, training error moves toward 0.
• Why? Each split is chosen to minimize classification error on the training data, so with enough depth the training error is eventually driven to 0.
• 0 training error is a big warning sign of overfitting.
• Want to find the depth with the lowest test error (see the sketch below).
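A minimal sketch of this effect, assuming scikit-learn and a synthetic dataset (neither is from the notes): training error falls to 0 as max_depth grows, while test error bottoms out and then rises.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 5, 10, 20, None]:  # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)  # falls toward 0
    test_err = 1 - tree.score(X_test, y_test)     # bottoms out, then rises
    print(f"depth={depth}: train={train_err:.3f}, test={test_err:.3f}")
```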

## Early stopping to avoid overfitting

### Principles of Occam's razor: Learning decision trees

• "Among competing hypotheses, the one with fewest assumptions should be selected", William of Occam, 13th Century.
• Example: a patient presents 2 symptoms to a doctor, who could either diagnose 2 separate diseases (one per symptom) or 1 disease that explains both; Occam's razor favors the single disease.
• In decision trees: with 2 trees that have the same classification error, pick the simplest one.
• How to pick simpler trees:
• Early stopping: stop learning when tree becomes too complex (eg the depth is above some threshold).
• Pruning: simplify tree after learning algorithm terminates (generally the preferred approach).

### Early stopping in learning decision trees

• Condition 1: stop splitting once a max depth is reached.
• Use validation set for finding ideal max depth param.
• Condition 2: use classification error to decide when to stop splitting.
• Stop recursing if classification error doesn't improve, eg no choice of split at a node gives a lower classification error.
• Generally add a tolerance param and stop if the error doesn't decrease by at least that amount.
• Very useful in practice.
• Condition 3: stop if number of data points is low (below some threshold N_min).
• Should always implement this. (All three conditions are sketched in the code below.)
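A self-contained sketch of the three stopping conditions inside a recursive tree builder, assuming binary 0/1 features and 0/1 labels (the data layout and parameter defaults are illustrative, not from the course):

```python
def majority_class(labels):
    return max(set(labels), key=labels.count)

def node_error(labels):
    # Classification error if this node predicts its majority class.
    return 1.0 - labels.count(majority_class(labels)) / len(labels)

def best_split(data, labels):
    # Try each binary feature; return (feature, weighted error after split).
    best_f, best_err = None, node_error(labels)
    for f in range(len(data[0])):
        left = [y for x, y in zip(data, labels) if x[f] == 0]
        right = [y for x, y in zip(data, labels) if x[f] == 1]
        if not left or not right:
            continue  # this split doesn't separate anything
        err = (len(left) * node_error(left)
               + len(right) * node_error(right)) / len(labels)
        if err < best_err:
            best_f, best_err = f, err
    return best_f, best_err

def build_tree(data, labels, depth=0, max_depth=5, min_gain=1e-3, n_min=10):
    leaf = {"predict": majority_class(labels)}
    if depth >= max_depth:        # Condition 1: max depth reached
        return leaf
    if len(labels) < n_min:       # Condition 3: too few data points
        return leaf
    f, split_err = best_split(data, labels)
    if f is None or node_error(labels) - split_err < min_gain:
        return leaf               # Condition 2: error barely improves
    left = [(x, y) for x, y in zip(data, labels) if x[f] == 0]
    right = [(x, y) for x, y in zip(data, labels) if x[f] == 1]
    return {
        "feature": f,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           depth + 1, max_depth, min_gain, n_min),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            depth + 1, max_depth, min_gain, n_min),
    }
```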

## Pruning Decision Trees

### Motivating pruning

• Problem with stopping condition 1:
• How do you know the ideal max depth?
• Finding it can be fiddly and imperfect.
• Problem with stopping condition 2:
• Short-sighted: splits that seem useless (eg barely reduce classification error) can be followed by excellent splits. Classic example: with y = x1 XOR x2, splitting on x1 or x2 alone doesn't reduce error at all, but the two splits together classify perfectly.

### Pruning decision trees to avoid overfitting

• First, you need a measure of tree complexity; the number of leaves is generally used: L(T) = number of leaf nodes in tree T.
• Want to find a good balance between low classification error and low complexity:
• Total cost C(T) = Error(T) + tuning_param * L(T)
• If tuning_param = 0: standard decision tree learning (complexity is ignored).
• If tuning_param = infinity: only the lowest-complexity tree is chosen (classification error doesn't matter, so the tree just predicts the majority class).
• If 0 < tuning_param < infinity: a balance between the 2 (see the worked example below).

### Tree pruning algorithm

• Roughly: walk through each node of the learned tree bottom-up and throw away a split if the total cost is lower without it.
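A rough sketch of that walk, reusing the dict-based tree format from the early-stopping sketch above. The `subtree_error`, `leaf_error`, and `majority` fields are assumptions about what was recorded at training time, not part of the course material:

```python
def num_leaves(tree):
    if "predict" in tree:
        return 1
    return num_leaves(tree["left"]) + num_leaves(tree["right"])

def prune(tree, tuning_param):
    if "predict" in tree:  # already a leaf: nothing to prune
        return tree
    # Prune children first (bottom-up), then consider this node.
    tree["left"] = prune(tree["left"], tuning_param)
    tree["right"] = prune(tree["right"], tuning_param)
    # C(T) = Error(T) + tuning_param * L(T), with and without the split.
    # (Assumes subtree_error/leaf_error were measured on the training data;
    # a full implementation would recompute subtree_error after pruning children.)
    keep_cost = tree["subtree_error"] + tuning_param * num_leaves(tree)
    leaf_cost = tree["leaf_error"] + tuning_param * 1
    if leaf_cost <= keep_cost:
        return {"predict": tree["majority"]}  # cheaper without the split
    return tree
```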

# Week 4: Handling Missing Data

## Basic strategies for handling missing data

### Challenges of missing data

• Missing data can cause problems at training time (the dataset contains null values) and at prediction time (the input to predict on contains missing values).

### Strategy 1: Purification by skipping missing values

• Most common method.
• Just skip data points with missing values (see the pandas sketch after this list).
• Works if you have a "huge" dataset.
• Can be problematic with small datasets where you need all the data.
• Could also skip an entire feature if many of its values are missing.
• Doesn't help if data is missing at prediction time.
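A minimal pandas sketch of strategy 1 (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 80.0, 65.0],
                   "credit": ["good", "bad", None, "good"],
                   "default": [0, 1, 0, 1]})

clean_rows = df.dropna()        # skip data points with any missing value
clean_cols = df.dropna(axis=1)  # or drop features that have missing values
```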

### Strategy 2: Purification by imputing missing data

• Fill in missing values with a "calculated guess":
• Categorical features: use the mode of the dataset (eg the most commonly occurring value).
• Numerical features: use the median (the middle value, which is more robust to outliers than the mean).
• More advanced methods exist, eg the expectation-maximization (EM) algorithm.
• Can result in systematic errors. (See the pandas sketch below.)
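A minimal pandas sketch of strategy 2, reusing the made-up columns from the previous block:

```python
import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 80.0, 65.0],
                   "credit": ["good", "bad", None, "good"]})

df["income"] = df["income"].fillna(df["income"].median())   # numerical: median
df["credit"] = df["credit"].fillna(df["credit"].mode()[0])  # categorical: mode
```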

## Strategy 3: Modifying decision trees to handle missing data

### Modifying decision trees to handle missing data

• Add a branch to the decision tree for when data is null.
• Leads to more accurate predictions.
• Works at both prediction and training time.
• Requires modifying the training algorithm.

### Feature split selection with missing data

• When selecting a feature to split on, also decide which branch to send missing values down.
• Same principle as choosing the split itself: pick the branch that gives the lowest classification error (sketched below).
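A rough sketch of that choice, assuming each branch's non-empty 0/1 training labels are already collected in lists (the data layout and helper are illustrative):

```python
def node_error(labels):
    # Classification error if a node predicts its majority class.
    majority = max(set(labels), key=labels.count)
    return 1.0 - labels.count(majority) / len(labels)

def choose_branch_for_missing(left_labels, right_labels, missing_labels):
    n = len(left_labels) + len(right_labels) + len(missing_labels)
    # Weighted error if the missing points are routed left...
    err_left = (len(left_labels + missing_labels) * node_error(left_labels + missing_labels)
                + len(right_labels) * node_error(right_labels)) / n
    # ...and if they are routed right.
    err_right = (len(left_labels) * node_error(left_labels)
                 + len(right_labels + missing_labels) * node_error(right_labels + missing_labels)) / n
    return ("left", err_left) if err_left <= err_right else ("right", err_right)

print(choose_branch_for_missing([0, 0, 0], [1, 1, 0], [0, 0]))  # ('left', 0.125)
```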