Feature Vector

A feature vector is a Vector of numbers that represents some object, with each number measuring one feature of that object.

Picture a dog adoption centre that helps you find a dog breed based on your preferences and circumstances.

They might first create a spreadsheet with a row for each dog breed and a rating for each of its characteristics: for example, how active it is, how big it is when fully grown, and how good it is around kids.

import pandas as pd

dogs = pd.DataFrame({
    'breed':        ['Labrador', 'Chihuahua', 'Border Collie', 'Basset Hound', 'Golden Retriever'],
    'activity':     [4,          3,            5,               2,              4],
    'size':         [4,          1,            3,               3,              4],
    'kid_friendly': [5,          2,            3,               4,              5],
})

dogs
              breed  activity  size  kid_friendly
0          Labrador         4     4             5
1         Chihuahua         3     1             2
2     Border Collie         5     3             3
3      Basset Hound         2     3             4
4  Golden Retriever         4     4             5

If we have our resident breed expert fill out the spreadsheet for us, with a rating between 1 and 5 for each feature, we now have a feature vector for each dog, based on the features we defined.
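To make "a feature vector for each dog" concrete, here is a minimal sketch that pulls one row out of the spreadsheet as a plain array of numbers. The `features` list and the `labrador` and `all_dogs` names are introduced here for illustration only.

```python
import pandas as pd

dogs = pd.DataFrame({
    'breed':        ['Labrador', 'Chihuahua', 'Border Collie', 'Basset Hound', 'Golden Retriever'],
    'activity':     [4,          3,            5,               2,              4],
    'size':         [4,          1,            3,               3,              4],
    'kid_friendly': [5,          2,            3,               4,              5],
})

# The order of the features fixes what each position in the vector means.
features = ['activity', 'size', 'kid_friendly']

# One breed's feature vector: the numbers in its row, in feature order.
labrador = dogs.set_index('breed').loc['Labrador', features].to_numpy()
print(labrador)  # [4 4 5]

# All breeds at once: one row per breed, one column per feature.
all_dogs = dogs[features].to_numpy()
print(all_dogs.shape)  # (5, 3)
```

The breed name is only a label; the vector itself is just the three ratings.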

How can we use this?

Now, what if we gave each adoptee a survey where they could describe their preferences for dogs using the same features:

* How active are you? (1–5)
* How much space do you have? (1–5)
* Do you have, or intend to have, young kids? (1–5)

adoptees = pd.DataFrame({
    'name':         ['Alice', 'Bob', 'Carol'],
    'activity':     [4,       2,     5],
    'size':         [3,       2,     4],
    'kid_friendly': [5,       3,     2],
})

adoptees
    name  activity  size  kid_friendly
0  Alice         4     3             5
1    Bob         2     2             3
2  Carol         5     4             2

Now each adoptee has a feature vector built from the same features as the dogs, so we can compute the distance between an adoptee's vector and each breed's vector.

We can plot both in the same space to see which breeds land closest to each adoptee:

![[../../../_media/feature-vector-scatter.png]]

We could simply subtract each feature in the dog vector from the matching feature in the person's preference vector, drop the sign of each difference (aka take the Absolute Value), add the results up, and pick the dog with the smallest total. That distance measure is called Manhattan Distance.
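Written as a formula, with $x$ standing for an adoptee's vector, $y$ for a breed's vector and $n$ for the number of features (notation introduced here for illustration), the calculation and one worked case, Alice $(4, 3, 5)$ against Border Collie $(5, 3, 3)$, look like this:

$$
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|, \qquad d(\text{Alice}, \text{Border Collie}) = |4 - 5| + |3 - 3| + |5 - 3| = 3
$$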

dogs_indexed = dogs.set_index('breed')
adoptees_indexed = adoptees.set_index('name')

distances = pd.DataFrame(
    {name: (dogs_indexed - row).abs().sum(axis=1) for name, row in adoptees_indexed.iterrows()},
    index=dogs_indexed.index
)

print(distances)
print()
print("Best match per adoptee:")
print(distances.idxmin())
                  Alice  Bob  Carol
breed                              
Labrador              1    6      4
Chihuahua             6    3      5
Border Collie         3    4      2
Basset Hound          3    2      6
Golden Retriever      1    6      4

Best match per adoptee:
Alice         Labrador
Bob       Basset Hound
Carol    Border Collie
dtype: object
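One thing the output above hides is that `idxmin` returns only the first row holding the minimum, so ties go unreported: Alice is equally close to Labrador and Golden Retriever. A small sketch of one way to list every tied breed, with the distances copied from the table above and `is_best` and `best_matches` as names made up for this example:

```python
import pandas as pd

# Manhattan distances copied from the printed table.
distances = pd.DataFrame(
    {'Alice': [1, 6, 3, 3, 1], 'Bob': [6, 3, 4, 2, 6], 'Carol': [4, 5, 2, 6, 4]},
    index=pd.Index(
        ['Labrador', 'Chihuahua', 'Border Collie', 'Basset Hound', 'Golden Retriever'],
        name='breed',
    ),
)

# True wherever a breed's distance equals that adoptee's smallest distance.
is_best = distances.eq(distances.min(), axis='columns')

# Collect every breed that ties for the minimum, per adoptee.
best_matches = {
    name: distances.index[is_best[name].to_numpy()].tolist()
    for name in distances.columns
}
print(best_matches)
# {'Alice': ['Labrador', 'Golden Retriever'], 'Bob': ['Basset Hound'], 'Carol': ['Border Collie']}
```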

Or we could use the Dot Product, divided by the lengths of the two vectors (cosine similarity), to measure the angle between them.
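The dot product on its own grows with the length of the vectors, so to get at the angle it is usually divided by both lengths, which gives cosine similarity. Here is a rough sketch on the same data. The `to_unit_length` helper and the `similarity` name are assumptions made for this example, not something defined earlier.

```python
import numpy as np
import pandas as pd

dogs_indexed = pd.DataFrame({
    'breed':        ['Labrador', 'Chihuahua', 'Border Collie', 'Basset Hound', 'Golden Retriever'],
    'activity':     [4, 3, 5, 2, 4],
    'size':         [4, 1, 3, 3, 4],
    'kid_friendly': [5, 2, 3, 4, 5],
}).set_index('breed')

adoptees_indexed = pd.DataFrame({
    'name':         ['Alice', 'Bob', 'Carol'],
    'activity':     [4, 2, 5],
    'size':         [3, 2, 4],
    'kid_friendly': [5, 3, 2],
}).set_index('name')


def to_unit_length(frame):
    """Hypothetical helper: scale every row so its Euclidean length is 1."""
    lengths = np.sqrt((frame ** 2).sum(axis=1))
    return frame.div(lengths, axis=0)


# The dot product of two unit vectors is the cosine of the angle between them:
# 1.0 means they point the same way, and smaller values mean a wider angle.
similarity = to_unit_length(dogs_indexed).dot(to_unit_length(adoptees_indexed).T)

print(similarity.round(3))
print()
print("Best match per adoptee (largest cosine):")
print(similarity.idxmax())
```

Here a larger number is better, so the best match is `idxmax` rather than `idxmin`. The ranking by angle need not agree with Manhattan distance. On this data Bob's top breed comes out as Labrador rather than Basset Hound, because his vector (2, 2, 3) points in nearly the same direction as (4, 4, 5) even though the ratings themselves are far apart.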

Now, that's all well and good, but are we sure we've picked all the right features? What if the amount the dog sheds turns out to be very important? Or maybe people want a relatively quiet dog? Or maybe they prefer ball-driven dogs? There could be a lot of different features.

Also, maybe our users find the questions hard to answer. Instead, they just tell us some dogs they have previously liked and ask for a similar breed.

In other words, can we learn feature vectors from users' past behaviour?

That's where an Embedding comes in.