Week 3 - Object detection

Nov 29, 2017 ../convolutional-neural-networks

Object localization

Model outputs x and y coords of important places: corner of eye, centre of nose etc.
Building blocks of recognising emotions and things like snapchat filters.
Requires big annotated training set.

Creating car detection algorithm:
Create dataset with closely cropped cars (either 1 (is car) or 0 (isn't car)) and train a model.
Create a rectangle that slides across the image, for each rectangle feed to convnet to determine if it is or isn't a car.
Run multiple times with various different rectangle sizes.
Bigger stride == lower computational cost, worse performance.
Lower stride == high computational cost, better performance (more chance of sliding across target object).

Can convert the fully connected final layers of a standard conv model to conv layers as follows:
- Replace the second last fully connected layer, with 400 5x5 conv filters to return a 1x1x400 volume.
- Replace the last fc layer with a 1x1x400 conv layer.
- Replace the softmax output with a 1x1x4 conv layer.

Turning Convolution Layers to FC

This conversion lets you implement a convolutional implementation of sliding windows detection.
Instead of running forward prop on a bunch of different sections of the image, can run it just once with the filters sharing a lot of data.
Bounding boxes are generally not that accurate.

Output accurate bounding boxes: YOLO algorithm.
YOLO: You only look once.
1. Place grid over image, like 19x19.
2. For any object in the image, find the midpoint and assign the cell it falls in as the cell containing the object.
3. Create labels y = [pc, bx, by, bh, bw, c1, c2, c3] for each of the grid cells as labels for training.
Total volume of output: 19x19x8 (19x19 grid cells with 8 output elements in each).
Bounding box predictions will be relative to the grid it's in.
- bx and by should be between 0 and 1.
- bh and bw can be greater than 1, since an object bounding box can fall outside the box.
Problem of multiple images in a cell to be addressed later.

Intersection over union

Because you are running the object classification and localization algo for every grid cell, it's possible you may end up with multiple bounding boxes.
Non-max suppression basically finds the highest probability detection and removes (or "suppresses") the others. Details:
Discard all bounding boxes with Pc < 0.6 (low probability).
Pick the box with the highest Pc and output as a prediction.
You figure out which bounding boxes are intended for the same objects by looking at all bounding boxes with IoU >= 0.5 then discard.
When you have multiple objects in an image, you'd do non-max suppresses for each class detected in the image.

Deals with the problems of multiple object centre points in an image.
The y outputs now include 8 x number of possible objects in a cell (sometimes up to 5).
Change previous algo to find an object midpoint and anchor box.

Training set construction:
- If you have 3 classes, 2 anchor boxes per grid and 3x3 grid, y is 3 x 3 x 2 x 8.
  - 8 outputs = 1 probability of object + 4 bounding boxes + 3 class labels.
- For any grid cells with no object midpoint, probability would be 0 for both anchor boxes.
- For grid cells with an object midpoint, you'd have one of the anchor boxes probabilities set to 1.
Outputting non-max supressed outputs:
- For each grid cell, get 2 predicted bounding boxes.
- Get rid of low probability predictions.
- For each class (pedestrian, car, motorcycle), use non-max supression to output final predictions.

Run a "segmentation algorithm" to find "blobs" that you can potentially run your classifier on.
- Should reduce the amount of positions you have to run your convnet on.
R-CNN:
1. Propose regions.
2. Classify proposed regions outputting label + bounding box.
Fast R-CNN:
1. Propose regions.
2. Use conv implementation of sliding windows to classify proposed regions.
Faster R-CNN:
- Use conv network to propose regions.
Andrew thinks region proposal is interesting, but believes YOLO is a more promising direction for computer vision.