ML Regression

Notes taken during ML Regression by Coursera.

Linear Algebra Refresher

Matrices and Vectors

  • Denoted by n x m (n rows, m columns).

Special Matrices

  • Square matrix: a matrix whose dimensions are n x n.
  • Zero matrix: denoted by 0n x m, a matrix where all the values are 0. (acts like 0 in scalar land)
  • Identity matrix: main diagonals are 1 and others 0. When multipling with another matrix, the result is the other matrix. (acts like 1 in scalar land)
  • Column matrix: n x 1 and row matrix: 1 x n. Aka vector.


  • Adding or subtract matrices: add or subtract each field. Must be the same size. (see Vector Addition)
  • Scalar multiplication: multiply each value in matrix by scalar (see Vector Scaling).
  • Matrix multiplication: A x B
    • A must have same number of columns as B does rows eg A(2 x 3) * B(3 x 2) is valid. The resulting size will be 2 x 2.
  • Example:

    A = [1, 2, 4]   B = [4, 2]
      [4, 3, 5]       [5, 8]
                      [7, 1]
       (2 x 3)        (3 x 2)
    Outcome = [(1 * 4) + (2 * 5) + (4 * 7)]
            [(1 * 2) + (2 * 8) + (4 * 1)]
            [(4 * 4) + (3 * 5) + (5 * 8)]
            [(4 * 2) + (3 * 8) + (5 * 1)]
  • Commutative property does not apply in matrix multiplication: order matters.


  • Function that takes a square matric and convert it to a number. Formula looks like:
[a c]
[b d] == a*d - c*b

Matrix Inverse

  • Given a square matrix A of size n x n we want to find another matrix such that:

AB = BA = I n

Matrix calculus

  • Each of the points in a matrix can be function, can determine the derivative of each point.

Week 1 - Simple Regression

  • Course Outline:
    • Module 1: Simple regression
      • Low number of inputs
      • Fit a simple line through the data
      • Need to define "goodness-of-fit" metric for each possible line.
      • Gradient descent algorithm:
        • Get estimated parameters
        • Interpret
        • Use to form predictions
    • Module 2: Multiple relationships
      • More complicated relationship than just a line.
      • Incorporate more inputs when training the model.
    • Module 3: Assessing performance
      • Determine when "overfitting" the data.
      • Bias-Variance Tradeoff
        • Simple models are well behave but can be too simple to describe a behaviour accurately.
        • Complex models can have odd behaviour.
    • Module 4: Ridge Regression Ridge total cost = measure of fit + measure of model complexity
      • Cross validation (??)
    • Module 5: Feature Selection & Lasso Regression
      • What are the most important features for model? Lasso total cost = measure of fit + (different) measure of model complexity
      • Will learn: coordinate descent algorithm.
    • Module 6: Nearest neighbor & kernel regression
      • Methods useful when you have a lot of data:
        • Nearest neighbor: find closest piece of data to what you're looking at and use for predictions
      • Kernel regression
  • Assumed background
  • Basic calculus (LOL)
  • Basic linear algebra:
    • Vectors
    • Matrices
    • Matrix multiply

Regression fundamentals

  • Housing prices
    • Look at recent sales in the neighbour hood to help inform your house price.
  • Data:
  • "For each house that sold recently, record square feet the house had and what the price was."
    • x1 = sq_ft, y2 = price````x2 = sq_ft, y2 = price, x3 = sq_ft, y3 = price etc
  • Regression model: y[i] = f(x[i]) + E[i] - Outcome is the function applied to the input plus some error.
    • E(e[i]) = 0 - expected value of error is 0 - will be equally likely to have positive or negative values.
  • Tasks in regression:
  • Task 1: Determining options for data model.
  • Task 2: Estimate the specific fit from the data.
  • Block diagram for regression (diagram from memory)

Regression block diagram

ML Model

  • Equation of a line: "intercept + slope * variable of interest" (f(x) = W₀ + W₁ * Xᵢ)
  • Single point includes the error (the distance from point back to line): (f(x) = W₀ + W₁ * Xᵢ + εᵢ).
  • Parameters of model = W₀ (intercept) and W₁ (slope)
    • Aka: regression coefficients.

Quality metric

  • "Cost" of using a given line.
  • Can be calculated with "Residual sum of squares (RSS)"

    • Take each error (εᵢ) - calculated as (actual_value - predicted value)²
    • Add them for each point.
    • In pseudo:

    RSS(W₀, W₁) = sum([(y - (W₀ + W₁ * Xᵢ))² for y in actual_prices])

  • Goal is to search over all possible lines to find the lowest RSS.

Using fitted line

  • Estimated parameters:
  • Ŵ₀
    • Intercept: value of y when x == 0
    • In the context of square feet features, it'd be houses that are just land.
    • Usually not very meaningful (aka intercept is negative, does that mean land only sales mean the seller gets money??)
  • Ŵ₁
    • Estimated slope
    • "Per unit change in input" - "when you go up 1 square foot, how much does the price go up?"

Finding the best fit (minimizing the cost)

  • min([RSS(W₀, W₁) for (W₀, W₁) in ALL_POSSIBLE_PARAMS])
  • Convex / concave functions:
  • Concave
    • Looks like arc of cave.
    • Line between 2 points lie below curve of function
    • Max is where derivative = 0
  • Convex

    • Opposite of concave
    • Any two points will be above the curve of the function
    • Min is where derivative = 0

    Concave Convex

  • Other functions:

    • Can be multiple solutions to derivative = 0 or no solutions
  • Finding the max or min analytically:
  • Get derivate of function eg (-2w + 20).
  • Set derivate to 0 then solve for w: (-2w + 20 = 0 == w = 10):
  • Finding the max via hill climbing:
  • Idea:

    • Keep increasing w via some "step size" until "converged" eg: derivate = 0.
    • Divide space:
    • On one side, the derivative will be more than 0, on the other less.
    • In pseudo:

    ``` # g = func # w = value

    def has_converged(g, w): return derivate(g, w) ~ 0

    while not has_converged(g, w): w = w + STEP_SIZE ```

    • While not converged:
    • Take previous w
    • Move in direction specified by derivated via some step size
    • Choosing stepsize and convergence criteria:
    • Fixed step size = set to constant size.
    • Can result in jumping over optimal, resulting in slow convergence.
    • Decrease the stepsize as you get closer to convergence.
    • Convergence will never get exactly to 0, but you set a threshold for convergence success.
    • Threshold depends on dataset.
    • Commons choices:
    • Algorithms that decrease with number of iterations:
      • η[t] = α / t (step size at position t is some alpha over t)
      • η[t] = α / sqrt(t) (step size at position t is some alpha over square root of t)

Multiple dimensions: gradients

  • When you have functions in higher dimensions, don't talk about "derivates" talk about "gradients".
  • Notation example:

Gradient Notation

  • Gradient definition:
  • A vector (array) with partial derivatives.
  • Partial derivatives:
  • Same as derivate but involves other derivates. Treats them like constants.
  • Worked example:

Gradient Example

  • Gradient descent == hill descent but compute the gradient instead of derivate for each iteration.

Asymmetric cost functions

  • "Prefer to under estimate than over"
  • Eg: don't want to predict my house price as too high.

Week 2 - Multiple Regression

  • Polynomial regression:
  • Even with only a single input variable, a linear equation model may not represent the relationship between the output variable.
  • Could use a higher-order function (quadratic, polynomial etc):
    • yᵢ = W₀ + W₁Xᵢ + W₂Xᵢ² ... + WpXᵢ^p + εi
    • Treat different powers of x as "features":
    • feature 1 = 1 (constant)
    • feature 2 = x
    • ..
    • feature p + 1 = x^p
    • Different coefficients = parameters.
  • More inputs added to regression model lets you factor in other relationships between your output variable.

General notation More notation

  • |Question| What letter will be used to represent number of observations?
    • N
  • |Question| What letter will be used to represent no of inputs? (x[i])
    • d
  • |Question| What letter will be used to represent no of features? (hⱼ(x))
    • D
  • Commonly used equation:

Commonly used equation

  • Interpreting coefficients of fitted function
  • Co-efficient's should be considered in the context of the entire model.
    • Eg: Number of bedrooms might have a negative coefficient if the square feet of the house is low.
  • If in a situation where you can't "fix" an input (eg if all features are a power of one input), then you can't interpret the coefficient.
  • Linear Algebra Review
  • Stages for computing the least squares fit:
  • Step 1: Rewrite regression model using linear algebra

    • Rewrite the regression model for a single observation in matrix notation:

    Rewrite Matrix Notation

    • Multiple two vectors:

      1. A row vector (aka transposed column vector) of parameters / coefficients
      2. A column vector of features.
    • Then add the error term.

    • Rewrite model for all observations:

    Matrix Notation All Observations

    * Rows of middle matrix (matrix ``H``) = vector of features from previous section.
    • Step 2: Compute the cost
    • Algorithm = search all different fits to find the smallest cost (RSS).
    • RSS in matrix notation:

    RSS Matrix Notation

    • y - Hw is a residual vector.
    • Using vector multiplication of that vector times the transpose, you end up with a scalar of the residual sum of squares.
    • Step 3: Take the gradient of the RSS:
    • -2 * H_vector_transposed * (y_vector - H_vector * w_vector)

    Gradient of RSS Notes

  • Step 4: Approach 1, closed-form solution: solve for W.

    • Set result to 0 and solve.
    • Matrix inverse * matrix = identity matrix (??)

Week 3 - Assessing Performance

  • Goal: figure out how much you are "losing" using your model compare to perfection.
    • Example: low predictions causing house to be listed too cheap.
  • Loss can be measured with a loss function: L(y, f_\hat{w}(\mathbf{x}))
    • Examples:
      • Absolute error: L(y, f_\hat{w}(\mathbf{x})) = |y - f_\hat{w}(\mathbf{x})|
      • Squared error: L(y, f_\hat{w}(\mathbf{x})) = (y - f_\hat{w}(\mathbf{x}))^2
  • Compute training error:
  • Define some loss function (as above).
  • Computing training error. * Example: Average loss on training set using squared error: * = 1/N \sum\limits_{i=1}^{N} L(y, f_\hat{w}(\mathbf{x}))
  • average of loss function

        * = $$1/N \sum\limits_{i=1}^{N} (y - f_\hat{w}(\mathbf{x}))^2$$
  • average of squared error

        * = $$\sqrt{1/N \sum\limits_{i=1}^{N} (y - f _\hat{w}(\mathbf{x}))^2}$$

(convert to root mean squared error (RMSE)) * provides more intuitive format (dollars, instead of squared dollars) * Training error vs model complexity: * Training error obviously is lowers as complexity of model increases (can get almost perfect fit with high complexity models): * Doesn't necessarily mean that predictions will be good, in fact they can get extremely bad with overfit models. * Summary: low training error != good predictions. * Generalisation error: theoretic idea for figuring out ideal model. * Weight house price pairs (house, price) by how they are to occur in dataset and use to evaluate predictions, use to evaluate predictions. * For a given square footage, how likely is it to occur in the dataset? * For houses with a given square footage, what house prices are we likely to see? * Theoretical idea: impossible to compute generation error: requires every possible dataset in existence. * Test error: like generalisation error but actually computable. * Basically, use test error* to roughly approximate generation error. * Average loss on houses in test set: 1/Ntest \sum_\limits{i=1}^{Ntest} L(y, f_\hat{w}(\mathbf{x})) * Note: f_\hat{w}(\mathbf{x})

was fit with training data.

  • Defining overfitting:
    • Overfitting if there exists a model with estimated params ww'

such that:

    1. Training error ($$\hat{w}$$

) < Training error ($$w'$$


    2. True error ($$\hat{w}$$

) > True error ($$w'$$

) * In other words: overfit if training error is low compared to another model but "true error" is high. Better to have high training error but low "true error". * Training/test split: how to think about dividing data between training and test split. * General rule of thumb: just enough points in test set to approximate generalisation error well. * To do: figure out how to approximate generalisation error. * If you don't have enough for training data, then need to use other methods like "cross validation". * 3 sources of error: 1. Noise 2. Bias 3. Variance

  • Noise:
    • Data is inherently noisy; there are heaps of variables you can't factor into model.
      • Variance of noise term: spread of ϵi\epsilon_i
    • Can't really control noise, there will be factors that influence observations outside of feature set.
  • Bias:
    • Difference between fit and "true function".
      • "Is our model flexible enough to capture true relationship between data and model."
    • Bias(x) = f_{w(true)}(\mathbf{x}) - f_\hat{w}(\mathbf{x})
    • Bias is basically how good our fit is; more parameters generally means less bias.
  • Variance:
    • How much variance does model have?
    • High complexity models == high variance.
    • Variance is basically how wild our model is; more parameters means higher variance.
  • Bias-variance tradeoff:
    • As error decreases, bias decreases and variance increases.
    • MSE=bias2+varianceMSE = bias^2 + variance

(mean-squared error) * Want to find sweet spot in model where bias and variance are together as low as possible. * We can't compute "true function" but there are ways to optimise an approximation a practical way. * Error vs amount of data: * True error decreases as data points in training set increase to limit of bias + noise. * Training error goes up as data points increase to same limit. * Validation set: * Choosing tuning parameters, λ\lambda

(eg degree of polynomial):

    * Naive approach: For each model complexity $$\lambda $$

: 1. Estimate params on training data. 2. Assess performance on test data.

        3. Choose $$\lambda $$

with lowest test

        * Problem with this the test data was already used for selecting $$ \lambda $$
  • will be overfit. * Better approach: split data 3 ways: training, validation and test set
        * Select $$\lambda $$

    that minimizes error on validation set.

        * Get approximate generalisation error of $$ \hat{w}_{\lambda\star} $$

    using test set. * Typical splits: * 80% / 10% / 10% * 50% / 25% / 25%

Week 4 - Ridge Regression

i* Symptom of overfitting: * When model is extremely overfit, magnitude of coefficients can be extremely large. * Overfitting not unique to polynomial regression: can also occur if you have lots of inputs. * Number of observations can influence overfitting: * Few observations (N small) can cause model to be quickly overfit as complexity grows. * Many observations (N very large) can be harder to overfit (but harder to find datasets).

    ![Number of Observations](./images/number-of-observations.png)
  • The larger the inputs, the more change of data not including inputs for all data points causing overfitting.
  • Balancing fits and magnitude of coefficients:
    • Can improve quality metric by factoring in coefficient magnitude to avoid complex models.
    • Total cost = measure of fit (RSS) + measure of coefficient magnitudes.
    • Want to find the ideal balance between the two.
    • Ways to measure coefficient magnitude:
      • Add coefficients: w^0+w^1...w^d\hat{w}_0 + \hat{w}_1 ... \hat{w}_d
        • the negative coefficients would cancel out the others.
      • Add the abs value of coefficients: i=0Dw^i\sum\limits_{i=0}^{D} | \hat{w}_i |
        • Aka "L1 norm"
      • Sum squared values of coefficients: i=0D(w^i)2\sum\limits_{i=0}^{D} (\hat{w}_i)^2

(sum of squared values of coefficients) * Aka "L2 norm" * Resulting ridge objective and its extreme solution. * Can use an extra parameter to tune how much weight is applied to the measure of coefficient magnitude: RSS(w)+λw22RSS(\mathbf{w}) + \lambda ||\mathbf{w}||_2^2 * Set the param to 0 = only RSS would weight in quality metric. * Set the param to \infty , w=0\mathbf{w} = 0

would be the solution.

* Need to find a balance between the two: enter "ridge regression" (aka $$L_2 $$


  • How ridge regression balances bias and variance.
    • High λ\lambda

= high bias, low variance.

    * As $$\lambda $$

get closer to infinitely, coefficients get smaller and less general.

* Low $$\lambda $$

= low bias, high variance.

    * As $$\lambda $$

get closer to 0, coefficients get larger (RSS takes over) and shit gets crazy.

  • Ridge regression demo
    • "Leave one out" (LOO) cross validation can show approximate average mean squared error (MSE) and is a useful technique for choosing λ\lambda


  • The ridge coefficient path:
  • Graphical representation of how the value of lambda affects the coefficient:

    Again, at 0 it's just the RSS ($$\hat{w}_{LS} $$

), as it gets larger, they approach 0.

  • Some sweet spot between smallish coefficients and a well fit model.
  • Computing the gradient of the ridge objective:
  • Can rewrite the L2 norm ($$|| \mathbf{w} ||_2^2 )invectornotationasfollows:) in vector notation as follows: w^tw $$


  • Then, the entire ridge regression total cost can be rewritten like: y(Hw)2(Hw)+w22y - (\mathbf{H}\mathbf{w})^2(\mathbf{H}\mathbf{w}) + || \mathbf{w} ||_2^2

, which is just the RSS + the L2 norm.

  • Can take the gradient of both terms and you get: 2HT(y(Hw))+2w-2\mathbf{H}^T( y - (\mathbf{H}\mathbf{w})) + 2\mathbf{w}


  • The gradient of the L2 norm is analygous to the 1d case: derivate of w2w^2

= 2w2w

  • Approach 1: closed-form solution:
    • Summary: set the gradient = 0 and solve for w^\hat{w}

. * Steps:

    1. Multiple the $$\mathbf{w} $$

vector by the identity matrix to make the derivation easier.

      $$\triangle cost(\mathbf{w})) = -2\mathbf{H}^T(\mathbf{y} - \mathbf{H}\mathbf{w}) + 2\lambda\mathbf{I}\mathbf{w} = 0 $$

    2. Divide both sides by 2.

        $$ =-\mathbf{H}^T(\mathbf{y} - \mathbf{H}\mathbf{w}) + \lambda\mathbf{I}\mathbf{w} = 0 $$

    3. Multiple out.

        $$-\mathbf{H}^T\mathbf{y} + \mathbf{H}^T \mathbf{H}\mathbf{w} + \lambda\mathbf{I}\mathbf{w} = 0 $$

    4. Add $$\mathbf{H}^T\mathbf{y} $$

to both sides.

        $$\mathbf{H}^T \mathbf{H}\mathbf{w} + \lambda\mathbf{I}\mathbf{w} = \mathbf{H}^T\mathbf{y}  $$

    5. Since the $$ \hat{w} $$

appears in both expressions, can factor it out.

        $$(\mathbf{H}^T \mathbf{H} + \lambda\mathbf{I})\mathbf{w} = \mathbf{H}^T\mathbf{y} $$

    6. Multiple both sides by the inverse of $$\mathbf{H}^T \mathbf{H} + \lambda\mathbf{I} $$


        $$\mathbf{w} = (\mathbf{H}^T \mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^T\mathbf{y} $$
  • Discussing the closed-form solution

    * Can prove the closed-form solution is congruent with the above notes, by setting $$\lambda = 0 $$

    and seeing that results are equal to the least squares closed form solution:

    $$\hat{w}^{ridge} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y} = \hat{w}^{LS} $$-results
    • Setting lambda to infinity results in 0 because the inverse of infinity matrix is like dividing by infinity.

    • Recall previous solution for w^LS=(HTH)1HTy\hat{w}^{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y} :

      • Invertible if number of linear independant observations is more than number of features ($$ N > D $$ ) - an added bonus of using ridge regression.

      • Complexity of the inverse: O(D3)O(D^3)

    • Properties for ridge regression solution:

      • Inverible always if λ>0\lambda > 0 even if N<DN < D .
      • Complexity of inverse is the same. May be prohibitive to do this solution with lots of features.

      • Called "regularised" because adding the λI\lambda\mathbf{I} is "regularising" the HTH\mathbf{H}^T \mathbf{H} , making it invertable ever with lots of features.

  • Approach 2: gradient descent

    • Same as RSS gradient descent but factors in the L2 norm cost component gradient.
  • Selecting tuning parameters via cross validation.
    • If you don't have enough data to build a validation set, for testing the performance of λ\lambda

, you can grab the validation set as a subset of the training data, but instead of grabbing a single slice, can grab all slices and average the results.

  • K-fold cross validation.
    • Algorithm:
      • For each lambda you're considering:
        • For each data slice of size N/KN/K

: * Go through each validation block on training data compute predictions.

            * Compute error for predictions using the $$\lambda $$


        * Calculate average error: $$CV(\lambda) = 1/K \sum\limits_{k=1}^K error_K(\lambda) $$

* Choosing ideal value of K:
    * Best approximation occurs for validation sets of size 1 (K = N) aka "leave-one-out" (LOO) cross validation.

    * Computationally intensive: requires computing N fits of model per $$\lambda $$


    * Common to use $$K=5 $$

(5-fold CV) or K=10K=10

(10-fold CV) if K=NK=N

is computational infeasible to run.

  • How to handle intercept term.
    • The goal of the ridge regression is to lower the coefficients values $ but that doesn't necessarily make sense for w^0\hat{w}_0

. * Solutions:

    * Option 1: leave out the $$\hat{w}_0 $$

coefficient when computing the optimal fit. * In closed-form solution, can use a modified identity matrix with the first position set to 0.

        * In gradient descent, can just skip the ridge component when computing the j value for $$\hat{w}_0 $$

. * Option 2: transform data so the expected intercept value is around 0 (called "centring").

Week 5 - Feature Selection & Lasso

  • Feature selection task motivation
    • Efficiency
      • when faced with large feature sets, predictions can get expensive.
      • When w^\hat{w}

is sparse (eg has a lot of 0s in the dataset) can eliminate all 0 features. * Interpretability * which features are relevant to housing predictions? * All subsets algorithm * Search over every combination of features computing the RSS: * Iterate through every N = 1 sized combination, then N = 2 and so on up to feature D. * Slow to compute with a lot of features: complexity = 2D2^{D} * With 15 features 215=327682^{15} = 32768 * Greedy algorithms * "Forward stepwise algorithm" * Find best N = 1 feature subset (maybe based on training error) * Next, find best N = 2 subsets that includes previous feature. * Use validation set or cross validation to choose when to stop greedy procedure - training error alone is insufficient. * Quicker to compute than all subsets, but may not find optimal combination. * Complexity: * 1st step: D models, 2nd step: D - 1 models, 3rd step: D - 2 models etc * Worst case complexity: D2D^2 * Other greedy algorithms * Backward stepwise: start with full model and work backwards removing one. * Combining forward and backward steps: add steps to remove features not considered relevant. * Regularised regression for feature selection * Ridge regression (L2 penalty) encourages small weights but not exactly 0. * Core idea: can we use a similar idea but actually get weights to 0 and effectively remove them from the model? * Potential method 1: Thresholding. * Remove features that falls below some threshold for ww

. * Problem: redundant features * If you have bathroom and number of showers, the weights would be distributed over both features. If you remove one, the coefficient should double on the other. * This means you could potentially end up discarding all redundant features. * Potential method 2: use L1 norm for regularized regression aka "Lasso regression" aka "L1 regularised regression".

    * $$ RSS(\mathbf{w}) + \lambda * ||\mathbf{w}||_1 $$

        * When tuning param, $$\lambda $$is 0, $$\hat{w}^{lasso} = \hat{w}^{LS} $$

(ie result is just least squares.

        * When $$\lambda = \infty $$

, w^lasso=0\hat{w}^{lasso} = 0

(ie coefficients shrink to 0). * Coefficient path for Lasso:

        ![Coefficient path for Lasso](coefficient-path-for-lasso.png)

        * For certain lambda values, features jump out of the model and other fall to 0. Individual impact of features becomes clearer.
  • Optimising the lasso objective
    • The derivate of wj|w_j| can't be calculated when wj=0|w_j| = 0

. Therefore, can't calculate the derivate. * Use "subgradients" over gradients. * Coordinate descent * Goal: minimise a function g(w0,w1,...,WD)g(w_0, w_1, ..., W_D)

, a function with multivariables. * Hard to find minimum for all coordinates, be easy for a single coordinate with all the other fixed.. * Algorithm: * While not converged: * Pick a coordinate (w coefficient) and fix the other. * Find the minimum of that function. * No stepsize required * Converges for lasso objective * Converges to optimum in some cases. * Can pick next coordinate at random, round robin or something smarter... * Normalising features * Idea: put all features into same numeric range (eg number of bathrooms, house square feet) * Apply to a column of the feature matrix. * Need to apply to training and test data. * Formula:

    $$\underline{h}_j(\mathbf{x}_k) = \frac{h_j(\mathbf{x}_k)}{ \sqrt{\sum\limits_{i=1}^{N} h_j(\mathbf{x}_i)^2}} $$
  • Coordinate descent for least squares regression
    • For each j feature, fix all other coordinates $ \mathbf{w}_{-j} $and take the partial with respect to wj\mathbf{w}_j
    • Gradient derived to 2Pj+2wj-2P_j + 2w_j ($$P_j $$

== residual without jth feature)

* Setting to 0 and solving for $$\hat{w}_j $$

results in w^j=Pj\hat{w}_j = P_j

* Intuition: if the residual between the predictions without j and the actual prediction is large, then the weight of j will be large and vice versa (to do: check this as your understanding improves).
  • Coordinate descent for lasso (for normalised features)
    • Same as cord descent for least squares, except we set w^j\hat{w}_j

using a "thresholding" function (in code):

    def set_w_j(p_j, lambda):
      if p_j < -lambda / 2:
          return p_j + lambda / 2
      if -lambda / 2 < p_j < lambda / 2:
          return 0
      if p_j > lambda / 2:
          return p_j - lambda / 2

* General idea here is "soft thresholding", aiming to get values to 0 that fit within some range:

    ![Soft thresholding]("./images/cord-descent-soft-thresholding.png")
  • Coordinate descent for lasso (for unnormalised features)
    • Normalisation factor is used during the set wjw_j

portion of the regression.

  • Choosing the penalty strength and other practical issues with lasso.
    • Same as ridge regression:
      • If enough data, validation set.
      • Compute average error
  • Summary
    • Searching for best features
      • All subsets
      • Greedy algorithms
      • Lasso regularised regression approach
    • Contrast greedy and optimal algorithms
    • Describe geometrically why L1 penalty leads to sparsity
    • Estimate lasso regression parameters using an iterative coordinate descent algorithm
    • Implement K-fold cross validation to select lasso tuning parameter λ\lambda
  • Note: be careful about interpreting features, need to consider in the context of the entire model.

Week 6 - Nearest Neighbors & Kernel Regression

  • Limitations of parametric regression
  • A polynomial fit might work well in certain regions of input space and not well in others.
    • Eg cubic fit might work well for higher square feet houses, but not so well for lower.
  • Ideal method:
    • Flexible enough to support "local structure" ie different fit at certain input space regions.
    • Doesn't require us to infer "structure breaks" ie the places where we want a different fit.
  • Nearest neighbour regression approach
  • For some input, find the closest observation. That's it.
  • What people do naturally when predicting house prices.
  • Formally:

    1. For some input, find closest xi\mathbf{x}_i

using some distance metric.

2. Predict: $$\hat{y}_q = y_{nn} $$
  • Distance metrics
    • 1D feature sets: Euclidian distance
      • distance(xj,xq)=xjxqdistance(x_j, x_q) = |x_j - x_q|
    • Multi dimensions:
      • Can use a weight on each feature to determine importance in computing distance ie sqft closeness is important, but year renovated may not be.
      • $ distance(\mathbf{x_j}, \mathbf{x_q}) = \sqrt{a_1(x_i[1] - x_q[1])^2) + .. a_D(x_i[D] - x_q[D])^2)} $
        • Note: aDa_D

is the "weight" in this equation.

  • 1-Nearest Neighbour Algorithm

    • Need to define your distance function.
    • In pseudo:

      ``` closest_distance = float('inf')

      for i in all_data_points: dist = calculate_distance(i, input) if dist < closest_distance: closest_distance = dist

      return closest_distance


    • Notes:

      • Sensitive to regions with missing data; need heaps of data for it to be good.
      • Sensitive to noisy datasets.
    • k-Nearest neighbours regression
    • Same as 1-Nearest Neighbour but average over values of "k"-nearest neighbours.
    • k-Nearest neighbours in practise
    • Usually has a much better fit than 1-nearest.
    • Issues in regions where there's sparse data and at boundaries.
    • Weighted k-nearest neighbours
    • Idea: weight neighbours by how close or far they are to data point.
    • Formula (where CqNN1C_{qNN1}

refers to some weight for NN1:

    $$\hat{y}_q = \dfrac{C_{qNN1}Y_{qNN1} + C_{qNN2}Y_{qNN2} .. C_{qNNK}Y_{qNNK}}{\sum\limits_{j=1}^{K}C_{qNNj}} $$

* Weighting data points:

    * Could just use inverse of distance: $$C_{qNN1} = \dfrac{1}{distance(\mathbf{X_j}, \mathbf{X_q})} $$

    * Can use "kernel" functions:
        * Gaussian kernel
        * Uniform kernel etc
  • Kernel regression

    • Instead of weighing n-neighbours, weigh all points with some "kernel":

      y^qy=1NCqiYqiy=1NCqi=y=1Nkernelλ(distance(xi,xq))yikernelλ(distance(xi,xq))\hat{y}_q \frac{\sum\limits_{y=1}^{N} C_{qi}Y_{qi}}{\sum\limits_{y=1}^{N} C_{qi}} = \frac{\sum\limits_{y=1}^{N} kernel_{\lambda}(distance(\mathbf{x_i},\mathbf{x_q})) * y_i}{kernel_{\lambda}(distance(\mathbf{x_i},\mathbf{x_q}))}

    • In stats called "Nadaya-Watson" kernel weighted averages.

    • Need to choice a good "bandwidth" (lambda value)
      • Too high = over smoothing; low variance, high bias.
      • Too low = over fitting; high variance, high bias.
    • Need to choose kernel but bandwidth more important.
    • Use validation set (if enough data) or cross-validation to choose λ\lambda


  • Global fits of parametric models vs local fits of kernel regression
  • If you were to predict datapoint by averaging all observations, you'd end up with a constant fit.
  • Kernel gives constant fit at a single point; a "locally constant fit".