Week 1 - Linear Classifiers & Logistic Regression

Assumed background

  • Course 1 & 2 in ML Specialisation.
  • Maths
  • Basic calculus
  • Basic vectors
  • Basic functions
    • Exponentiation e^x
    • Logarithms

Linear classifiers

  • Motivating example: Capturing sentiment of restaurant reviews

Intuition behind linear classifiers

  • Use training data to figure out coefficients of words like:
    good     |    1.0
    great    |    1.5
    bad      |   -1.0
    terrible |   -2.0

Then use those coefficients to predict sentiment of other texts.

  • Use validation set for verifying models.

Decision boundaries

  • Suppose you have a model with two coefficients:

    good  |  1.0
    bad   | -2

    The decision boundary between positive and negative sentiment sits where the score is zero: 1 * #good - 2 * #bad = 0. Anything above the line is positive and below is negative.
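A quick check of that rule using the two coefficients above (the word counts are made up):

```python
def score(n_good, n_bad):
    # Score with the two coefficients above: 1.0 * #good - 2.0 * #bad.
    return 1.0 * n_good - 2.0 * n_bad

# On the boundary the score is exactly 0; either side flips the sign.
print(score(2, 1))  # 0.0 -> on the decision boundary
print(score(3, 1))  # 1.0 -> positive side
print(score(1, 1))  # -1.0 -> negative side
```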

  • Decision Boundary types:

    • When 2 coefficients are non-zero: line
    • When 3 coefficients are non-zero: plane
    • When many coefficients are non-zero: hyperplane
    • More general classifiers: more complicated (squiggly?) lines

Linear classifier model

  • Once you have your coefficients for each word, multiply each coefficient by its word's count and add them together to get the score.

    word_count = get_word_count(sentence)
    W = get_coefficients_for_model()
    Score = W[0] + W[1] * word_count["awesome"] + W[2] * word_count["awful"]
  • Wrap Score() in a sign() function for producing a binary result (1 if above 0 or -1 if below).
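The pseudocode above can be made runnable; the coefficients, tokenisation, and helper names here are made-up illustrations, not values from the course:

```python
from collections import Counter

def get_word_counts(sentence):
    # Naive tokenisation: lowercase, split on whitespace.
    return Counter(sentence.lower().split())

# Hypothetical coefficients: an intercept plus one weight per known word.
coefficients = {"intercept": 0.0, "awesome": 1.0, "awful": -1.5}

def score(sentence):
    counts = get_word_counts(sentence)
    total = coefficients["intercept"]
    for word, weight in coefficients.items():
        if word != "intercept":
            total += weight * counts[word]  # Counter returns 0 for absent words
    return total

def sign(s):
    # Binary prediction: +1 for a positive score, -1 otherwise.
    return 1 if s > 0 else -1

print(score("awesome awesome awful"))  # 2 * 1.0 - 1.5 = 0.5
print(sign(score("awful food")))       # -1
```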

  • Course notation is the same as used in ML Regression.
  • Can rewrite expression using course notation like:

    $$Score(\mathbf{x}_i) = w_0 + w_1 \mathbf{x}_i[1] + \dots + w_d \mathbf{x}_i[d] = \mathbf{w}^T\mathbf{x}_i$$

Effect of coefficient values on decision boundaries

  • Increasing the intercept value shifts the decision boundary up.
  • Increasing the values of the other coefficients changes the slope (orientation) of the decision boundary.

Using features

  • Features can modify impact of words in a sentence.
  • With feature added, equation looks like:

$$Score(\mathbf{x}_i) = w_0 h_0(\mathbf{x}_i) + \dots + w_D h_D(\mathbf{x}_i) = \mathbf{w}^T h(\mathbf{x}_i)$$

  • Flow chart summary:
    1. Get some training data.
    2. Extract features (word counts, tf-idf, etc).
    3. Generate a model with said features (create coefficients).
    4. Validate model (test data etc).
    5. Make predictions using model.

Class probabilities

  • So far prediction = +1 or -1
  • How do you capture confidence that something is definitely positive or maybe negative etc? Enter probability.

Basics of probability

  • Probability a review is positive is 0.7
    • Interpretation: 70% of rows have y = +1
  • Can interpret probabilities as "degrees of belief" or "degrees of sureness".
  • Fundamental properties:
    • Probabilities should always be between 0 and 1 (no negative probabilities).
    • Probabilities of all classes should always add up to 1:

      $$P(y=+1) + P(y=-1) = 1$$

Basics of conditional probability

  • Probability that y is +1 given an input can be represented as $P(y=+1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is some sentence.
  • Conditional probability should always be between 0 & 1.
  • Classes of conditional probabilities sum up to 1 over y:

$$P(y=+1|\mathbf{x}_i) + P(y=-1|\mathbf{x}_i) = 1$$

However, they don't add up to 1 across all $\mathbf{x}$:

$$\sum\limits_{i=1}^{N} P(y=+1|\mathbf{x}_i) \neq 1$$
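A quick numeric check of both facts, using made-up predicted probabilities for three inputs:

```python
# Hypothetical predicted probabilities P(y=+1 | x_i) for three inputs.
p_positive = [0.7, 0.9, 0.2]

# For each input, the two class probabilities sum to 1...
for p in p_positive:
    assert abs(p + (1 - p) - 1.0) < 1e-9

# ...but summing P(y=+1 | x_i) across different inputs gives no such guarantee.
print(sum(p_positive))  # 1.8, not 1
```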

Using probabilities in classification

  • A lot of classifiers output degree of confidence.
  • We generally train a classifier to output an estimate $\hat{P}$, which uses learned coefficients $\hat{\mathbf{w}}$ to make predictions. It can then use the outputted probability to return +1 if > 0.5 (positive) or -1 if < 0.5 (negative), along with how confident we are in the answer.

Logistic Regression

Predicting class probabilities with (generalized) linear models

  • Goal: get probability from an outputted score.
  • Since the scores range from positive infinity to negative infinity (and probability from 0 to 1):
    • a Score of positive infinity would have a probability of 1.
    • a score of neg inf would have a probability of 0.
    • a score of 0 would have a probability of 0.5.
  • Need to use a "link" function to convert score range to a probability:

$$\hat{P}(y=+1|\mathbf{x}_i) = g(\mathbf{w}^T h(\mathbf{x}_i))$$

  • Doing this is called a "generalised linear model".

The sigmoid (or logistic) link function

  • For logistic regression, the link function is called "logistic function" or Sigmoid Activation Function:

    $$sigmoid(Score) = \dfrac{1}{1 + e^{-Score}}$$

    Code examples:

    >>> import math
    >>> sigmoid = lambda score: 1 / (1 + math.e**(-score))
    >>> sigmoid(float('inf'))
    1.0
    >>> sigmoid(float('-inf'))
    0.0
    >>> round(sigmoid(-2), 4)
    0.1192
    >>> sigmoid(0)
    0.5
  • This works because $e^{-\infty} = 0$ (sigmoid becomes $\frac{1}{1} = 1$) and $e^{\infty} = \infty$ (sigmoid becomes $\frac{1}{\infty} = 0$).

Logistic regression model

  • Calculate the score with $\mathbf{w}^T h(\mathbf{x}_i)$, then run it through the sigmoid, and you have your outputted probability.
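The full model in one pass, with made-up weights and feature values:

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights w and features h(x): compute w^T h(x), then sigmoid.
w = [0.0, 1.0, -1.5]  # intercept, weight for #awesome, weight for #awful
h = [1.0, 2.0, 1.0]   # constant feature, then word counts

score = sum(wi * hi for wi, hi in zip(w, h))  # 0.5
probability = sigmoid(score)                  # P(y=+1 | x)
print(round(probability, 3))
```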

Effect of coefficient values on predicted probabilities

  • Changing the constant (intercept) shifts the decision boundary; with a more negative intercept, more inputs fall on the negative side.
  • The bigger the coefficients, the more "sure" the model is: single words can affect the probability a lot (the sigmoid becomes steeper).

Overview of learning logistic regression model

  • Training a classifier = learning coefficients
    • Test learned model on validation set
  • To find best classifier need to define a quality metric.
    • Going to use the "likelihood" function $l(\mathbf{w})$.
    • Best model = highest likelihood.
    • Use a gradient ascent algorithm to find the coefficients with the highest likelihood.
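A minimal sketch of that idea: gradient ascent on the log-likelihood of a toy one-feature dataset. The data, step size, and iteration count are made up; the course covers the derivation of this gradient later.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Toy data: single feature x, label y in {+1, -1}.
data = [(2.0, 1), (1.5, 1), (-1.0, -1), (-2.5, -1)]

w0, w1 = 0.0, 0.0  # intercept and feature weight
step_size = 0.1
for _ in range(1000):
    # Gradient of the log-likelihood: for each point,
    # (indicator(y = +1) - P(y=+1 | x)) times the feature value.
    g0 = g1 = 0.0
    for x, y in data:
        error = (1 if y == 1 else 0) - sigmoid(w0 + w1 * x)
        g0 += error        # constant feature h_0(x) = 1
        g1 += error * x
    w0 += step_size * g0   # ascent: step *up* the gradient
    w1 += step_size * g1

# The learned model should separate the toy data.
predictions = [1 if sigmoid(w0 + w1 * x) > 0.5 else -1 for x, _ in data]
print(predictions)
```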

Practical issues for classification

Encoding categorical inputs

  • Numeric inputs usually make sense to multiply by a coefficient directly: age, number of bedrooms, etc.
  • Categorical inputs (gender, country of birth, postcode) need to be encoded before they can be multiplied by a coefficient.
  • One way to encode categorical inputs: 1-hot encoding. Basically, for a table of features, all are 0 except 1 (hence 1-hot):

     x       h[1](x)    h[2](x)
    Male       0          1
    Female     1          0
  • Another way: bag of words encoding. Summary: Take a bunch of text and count words.
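Both encodings can be sketched with plain Python (the category list and text are made up):

```python
from collections import Counter

def one_hot(value, categories):
    # One feature per category; exactly one of them is 1.
    return [1 if value == c else 0 for c in categories]

print(one_hot("Female", ["Female", "Male"]))  # [1, 0]

def bag_of_words(text):
    # Count each word's occurrences.
    return Counter(text.lower().split())

print(bag_of_words("good food good service"))  # good: 2, food: 1, service: 1
```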

Multi-class classification with 1-versus-all

  • To classify more than 2 classes, can use "1 versus all"
  • Train a classifier for each category, comparing one class to the others.
  • Figure out which $\hat{P}$ value has the highest probability.
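The 1-versus-all decision rule, assuming the per-class classifiers have already been trained (the classes and probabilities here are made up):

```python
# Hypothetical outputs of three separately trained 1-versus-all classifiers:
# each value is P(this class vs. the rest | x).
p_hat = {"triangle": 0.2, "circle": 0.7, "square": 0.1}

def predict(probabilities):
    # Pick the class whose "this class vs. the rest" probability is highest.
    return max(probabilities, key=probabilities.get)

print(predict(p_hat))  # circle
```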