Week 1 - Linear Classifiers & Logistic Regression
Assumed background
- Course 1 & 2 in ML Specialisation.
- Maths
- Basic calculus
- Basic vectors
- Basic functions
- Exponentiation e^x
- Logarithms
Linear classifiers
- Motivating example: Capturing sentiment of restaurant reviews
Intuition behind linear classifiers
- Use training data to figure out coefficients of words like:

  good | 1.0
  great | 1.5
  bad | -1.0
  terrible | -2.0

  then use that data to predict sentiment on other texts.
- Use validation set for verifying models.
Decision boundaries
- Suppose you have a model with two coefficients:

  good | 1.0
  bad | -2.0

  The line between positive and negative sentiment would sit at:

  1.0 * #good - 2.0 * #bad = 0

  Anything above the line is positive and anything below is negative. For example, a review with three "good"s and one "bad" scores 3.0 - 2.0 = 1.0 > 0, so it falls on the positive side.
- Decision Boundary types:
- When 2 coefficients are non-zero: line
- When 3 coefficients are non-zero: plane
- When many coefficients are non-zero: hyperplane
- More general classifiers: more complicated (squiggly?) lines
Linear classifier model
- Once you have your coefficients for each word, multiply each coefficient by the count of its word and add them together to get the score:

  word_count = get_word_count(sentence)
  W = get_coefficients_for_model()
  Score = W[0] + W[1] * word_count["awesome"] + W[2] * word_count["awful"]
- Wrap Score() in a sign() function to produce a binary result: +1 if the score is above 0, -1 if below (see the sketch after this list).
- Course notation is the same as used in ML Regression.
- Can rewrite the expression using course notation:

  $$\text{Score}(\mathbf{x}_i) = w_0 + w_1 x_i[1] + \dots + w_d x_i[d] = \mathbf{w}^T \mathbf{x}_i$$
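A runnable sketch of the score-then-sign idea, using made-up coefficients and a plain word-count dict (the weights and the get_word_count helper are illustrative assumptions, not the course's code):

```python
from collections import Counter

# Illustrative coefficients: intercept plus one weight per known word.
coefficients = {"intercept": 0.0, "awesome": 1.0, "awful": -1.5}

def get_word_count(sentence):
    return Counter(sentence.lower().split())

def score(sentence):
    word_count = get_word_count(sentence)
    return coefficients["intercept"] + sum(
        weight * word_count[word]
        for word, weight in coefficients.items()
        if word != "intercept"
    )

def sign(value):
    return +1 if value > 0 else -1

sentence = "awesome food but awful awful service"
print(score(sentence), sign(score(sentence)))  # 1.0*1 - 1.5*2 = -2.0 -> -1
```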
Effect of coefficient values on decision boundaries
- Increasing the intercept value shifts the decision boundary up.
- Increasing the coefficient values changes the slope/curve of the decision boundary.
Using features
- Features can modify the impact of words in a sentence.
- With features added, the equation looks like:

  $$\text{Score}(\mathbf{x}_i) = w_0 h_0(\mathbf{x}_i) + w_1 h_1(\mathbf{x}_i) + \dots + w_D h_D(\mathbf{x}_i) = \mathbf{w}^T h(\mathbf{x}_i)$$
- Flow chart summary (see the sketch after this list):
- Get some training data.
- Extract features (word counts, tf-idf etc).
- Generate a model with said features (learn the coefficients).
- Validate the model (test data etc).
- Make predictions using the model.
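A minimal sketch of that flow, assuming scikit-learn is available; the tiny review set, labels, and split are all made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer  # word counts as features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Some (made-up) training data.
reviews = ["good great food", "great place", "terrible bad service", "bad terrible"]
labels = [1, 1, -1, -1]  # +1 = positive sentiment, -1 = negative

# Hold out a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    reviews, labels, test_size=0.5, stratify=labels, random_state=0
)

# 2. Extract features (word counts).
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(X_train)

# 3. Learn the coefficients.
model = LogisticRegression()
model.fit(train_counts, y_train)

# 4. Validate the model, then 5. predict on new text.
print(model.score(vectorizer.transform(X_val), y_val))
print(model.predict(vectorizer.transform(["good service"])))
```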
Class probabilities
- So far prediction = +1 or -1
- How do you capture the confidence that something is definitely positive versus only maybe positive (or negative)? Enter probability.
Basics of probability
- Probability a review is positive is 0.7
- Interpretation: 70% of rows have y = +1
- Can interpret probabilities as "degrees of belief" or "degrees of sureness".
- Fundamental properties:
- Probabilities should always be between 0 and 1 (no negative probabilities).
- The probabilities of a set of classes should always add up to 1:

  $$P(y=+1) + P(y=-1) = 1$$
Basics of conditional probability
- The probability that y is +1 given an input can be represented as $P(y=+1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is some sentence.
- Conditional probability should always be between 0 & 1.
- Classes of conditional probabilities sum up to 1 over y:

  $$P(y=+1|\mathbf{x}_i) + P(y=-1|\mathbf{x}_i) = 1$$

  However, they don't add up to 1 over all $\mathbf{x}_i$:

  $$\sum\limits_{i=1}^{N} P(y=+1|\mathbf{x}_i) \neq 1$$
Using probabilities in classification
- A lot of classifiers output degree of confidence.
- We generally train a classifier to output an estimated probability $\hat{P}(y|\mathbf{x})$ and use that value to make predictions: return +1 if $\hat{P}(y=+1|\mathbf{x}) > 0.5$ (positive) or -1 if it is < 0.5 (negative). The probability also tells us how confident we are in the answer.
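A tiny sketch of that decision rule (the function and variable names are just for illustration):

```python
def predict_from_probability(p_positive):
    """Turn an estimated P(y=+1|x) into a label and a confidence."""
    label = +1 if p_positive > 0.5 else -1
    confidence = max(p_positive, 1 - p_positive)
    return label, confidence

print(predict_from_probability(0.7))  # (1, 0.7)
print(predict_from_probability(0.2))  # (-1, 0.8)
```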
Logistic Regression
Predicting class probabilities with (generalized) linear models
- Goal: get probability from an outputted score.
- Since the scores range from positive infinity to negative infinity (and probability from 0 to 1):
- a score of positive infinity would have a probability of 1.
- a score of negative infinity would have a probability of 0.
- a score of 0 would have a probability of 0.5.
- Need to use a "link" function to convert score range to a probability:
- Doing this is called a "generalised linear model".
The sigmoid (or logistic) link function
- For logistic regression, the link function is called the "logistic function" or sigmoid activation function:

  $$\text{sigmoid}(\text{Score}) = \frac{1}{1 + e^{-\text{Score}}}$$
Code examples:
>>> import math
>>> sigmoid = lambda score: 1 / (1 + math.e**(-score))
>>> sigmoid(float('inf'))
1.0
>>> sigmoid(float('-inf'))
0.0
>>> sigmoid(-2)
0.11920292202211757
>>> sigmoid(0)
0.5
- This works because as $\text{Score} \to -\infty$, $e^{-\text{Score}} \to \infty$ and the sigmoid becomes 0, and as $\text{Score} \to +\infty$, $e^{-\text{Score}} \to 0$ and the sigmoid becomes 1.
Logistic regression model
- Calculate the score with $\mathbf{w}^T h(\mathbf{x}_i)$, then run it through the sigmoid and you have your output probability:

  $$\hat{P}(y=+1|\mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T h(\mathbf{x}_i)}}$$
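A minimal sketch putting the two pieces together, with made-up coefficients and word counts (none of these values are learned, they are just illustrative):

```python
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

# Illustrative coefficients: intercept, then a weight per word.
w = {"intercept": 0.0, "good": 1.0, "bad": -2.0}
word_count = {"good": 3, "bad": 1}

score = w["intercept"] + sum(w[word] * count for word, count in word_count.items())
p_positive = sigmoid(score)   # estimated P(y=+1 | x, w)
print(score, p_positive)      # 1.0 0.731...
```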
Effect of coefficient values on predicted probabilities
- Changing the constant (intercept) shifts the decision boundary. When it is shifted in the negative direction, negative words count for a lot more than positive words.
- The bigger the weights, the more "sure" the model is: single words can affect the probability a lot, pushing it close to 0 or 1.
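A quick illustration of that effect, reusing the sigmoid above on a single "good" with increasingly large (arbitrary) coefficients:

```python
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

# Same input (one "good"), increasingly large coefficient for "good".
for w_good in (0.5, 1.0, 3.0, 6.0):
    print(w_good, round(sigmoid(w_good * 1), 3))
# 0.5 0.622, 1.0 0.731, 3.0 0.953, 6.0 0.998 -> larger weights push P toward 1
```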
Overview of learning logistic regression model
- Training a classifier = learning coefficients
- Test learned model on validation set
- To find best classifier need to define a quality metric.
- Going to use a "likelihood" function as the quality metric.
- Best model = highest likelihood.
- Use the gradient ascent algorithm to find the coefficients with the highest likelihood.
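As a rough sketch of what gradient ascent does here, each coefficient is repeatedly nudged in the direction that increases the likelihood; this implementation and its tiny feature matrix are illustrative assumptions, not the course's code:

```python
import numpy as np

def sigmoid(scores):
    return 1 / (1 + np.exp(-scores))

def fit_logistic_regression(H, y, step_size=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood.

    H: (N, D) feature matrix with entries h_j(x_i); y: labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)      # 1[y_i = +1]
    for _ in range(n_iters):
        p = sigmoid(H @ w)                   # P(y=+1 | x_i, w)
        gradient = H.T @ (indicator - p)     # sum_i h_j(x_i) * (1[y_i=+1] - p_i)
        w += step_size * gradient            # ascend: move uphill on the likelihood
    return w

# Tiny made-up example: feature columns = [intercept, #good, #bad]
H = np.array([[1, 3, 0], [1, 2, 1], [1, 0, 2], [1, 1, 3]], dtype=float)
y = np.array([+1, +1, -1, -1])
print(fit_logistic_regression(H, y))
```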
Practical issues for classification
Encoding categorical inputs
- Numeric inputs usually make sense to multiply by a coefficient directly: age, number of bedrooms etc.
- Categorical inputs (gender, country of birth, postcode) need to be encoded before they can be multiplied by a coefficient.
- One way to encode categorical inputs: 1-hot encoding. Basically, for a table of features, all are 0 except 1 (hence 1-hot):

  x      | h[1](x) | h[2](x)
  -------|---------|--------
  Male   | 0       | 1
  Female | 1       | 0
- Another way: bag of words encoding. Summary: take a bunch of text and count the words (see the sketch below).
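A minimal sketch of both encodings in plain Python (the category list and the sentence are made up):

```python
from collections import Counter

# 1-hot encoding: one indicator feature per category, exactly one is 1.
categories = ["Female", "Male"]

def one_hot(value):
    return [1 if value == category else 0 for category in categories]

print(one_hot("Male"))    # [0, 1]
print(one_hot("Female"))  # [1, 0]

# Bag of words encoding: count how often each word appears.
print(Counter("the food was good good".split()))
# Counter({'good': 2, 'the': 1, 'food': 1, 'was': 1})
```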
Multi-class classification with 1-versus-all
- To classify more than 2 classes, can use "1 versus all"
- Train a classifier for each category, comparing one class to the others.
- Predict the class whose classifier outputs the highest probability (see the sketch below).
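A minimal sketch of 1-versus-all, assuming scikit-learn's LogisticRegression as the per-class binary classifier (the points and class labels are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2D points with three classes: 0, 1, 2.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [0, 5], [1, 6]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

# Train one binary classifier per class: "this class" versus "all the rest".
classifiers = {}
for cls in np.unique(y):
    clf = LogisticRegression()
    clf.fit(X, (y == cls).astype(int))
    classifiers[cls] = clf

# Predict by picking the class whose classifier is most confident.
def predict(x):
    probs = {cls: clf.predict_proba([x])[0][1] for cls, clf in classifiers.items()}
    return max(probs, key=probs.get)

print(predict([0.2, 0.5]))
print(predict([5.0, 5.5]))
```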