One-Hot Encoding, Explained
A simple guide on the what, why, and how of One-Hot Encoding.
One-Hot Encoding takes a single integer and produces a vector in which exactly one element is 1 and all other elements are 0, like [0, 1, 0].
For example, imagine we’re working with categorical data, where only a limited number of colors are possible: red, green, or blue. One way we could represent this numerically is by assigning each color a number:
Color | Value |
---|---|
Red | 0 |
Green | 1 |
Blue | 2 |
This is known as integer encoding. For Machine Learning, this encoding can be problematic: it implies an ordering (red < green < blue), and in this example we’re essentially saying “green” is the average of “red” and “blue”. A model can treat those relationships as real, which can lead to unexpected results.
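To make the problem concrete, here is a minimal sketch using the integer codes from the table above. The dictionary name `color_to_int` is just an illustrative choice.

```python
# Integer codes taken from the table above.
color_to_int = {"red": 0, "green": 1, "blue": 2}

# Averaging the codes for red and blue lands exactly on the code for green,
# even though nothing about the colors themselves justifies that.
midpoint = (color_to_int["red"] + color_to_int["blue"]) / 2
print(midpoint == color_to_int["green"])  # True

# The codes also imply that red is "closer" to green than to blue.
dist_red_green = abs(color_to_int["red"] - color_to_int["green"])
dist_red_blue = abs(color_to_int["red"] - color_to_int["blue"])
print(dist_red_green, dist_red_blue)  # 1 2
```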
It’s often more useful to use the one-hot encoding instead:
Color | Integer Encoding | One-Hot Encoding |
---|---|---|
Red | 0 | [1, 0, 0] |
Green | 1 | [0, 1, 0] |
Blue | 2 | [0, 0, 1] |
This is much more useful to pass into something like a neural network, since every color gets its own input dimension and no ordering between colors is implied.
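As a rough illustration of the idea, one-hot encoding can be written in a few lines of plain Python. The helper names `one_hot` and `squared_distance` are hypothetical, introduced only for this sketch. It also shows that, unlike integer codes, every pair of one-hot vectors is the same distance apart.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with a 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index must be in the range [0, num_classes)")
    return [1 if i == index else 0 for i in range(num_classes)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


red, green, blue = (one_hot(i, 3) for i in range(3))
print(red, green, blue)  # [1, 0, 0] [0, 1, 0] [0, 0, 1]

# Every pair of colors is equally far apart, so no ordering is implied.
print(squared_distance(red, green))   # 2
print(squared_distance(red, blue))    # 2
print(squared_distance(green, blue))  # 2
```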
One-Hot Encoding in Python
Below are several different ways to implement one-hot encoding in Python.
scikit-learn
Using scikit-learn’s OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older sparse argument (scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform([['red'], ['green'], ['blue']]))
'''
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
'''
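Note that in the output above, red maps to the last column rather than the first. That is because OneHotEncoder sorts the categories it learns, so the columns are blue, green, red. The sketch below inspects that order through the `categories_` attribute and then fixes it with the `categories` argument. The variable names are illustrative, and `toarray()` is used to get a dense result so the example does not depend on which dense-output argument your scikit-learn version accepts.

```python
from sklearn.preprocessing import OneHotEncoder

colors = [["red"], ["green"], ["blue"]]

# By default the categories are learned from the data and sorted,
# so the columns are blue, green, red.
default_encoder = OneHotEncoder()
default_encoder.fit(colors)
print(default_encoder.categories_[0].tolist())  # ['blue', 'green', 'red']

# Passing categories explicitly fixes the column order to red, green, blue.
ordered_encoder = OneHotEncoder(categories=[["red", "green", "blue"]])
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense.
encoded = ordered_encoder.fit_transform(colors).toarray()
print(encoded.tolist())  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```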
Keras
Using Keras’s to_categorical:
from keras.utils import to_categorical
print(to_categorical([0, 1, 2]))
'''
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
'''
NumPy
Using NumPy:
import numpy as np
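labels = [2, 1, 0]
# Number of classes, inferred as the largest label plus one
num_classes = np.max(labels) + 1
# Row i of the identity matrix is the one-hot vector for label i
print(np.eye(num_classes)[labels])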
'''
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
'''
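Inferring the number of classes from the largest label works only when the data happens to contain that label. Below is a small sketch that instead assumes the class count is known ahead of time (the value 4 is made up for illustration), and that also shows one way to go back from one-hot vectors to integers with `argmax`.

```python
import numpy as np

labels = np.array([2, 0, 0])
num_classes = 4  # assumed known ahead of time, not inferred from labels

# np.eye(num_classes) is the identity matrix; row i is the one-hot vector for i.
# Indexing it with an integer array picks one row per label.
encoded = np.eye(num_classes)[labels]
print(encoded.shape)  # (3, 4)

# argmax along each row recovers the original integer labels.
decoded = encoded.argmax(axis=1)
print(decoded.tolist())  # [2, 0, 0]
```

With a fixed `num_classes`, every batch produces vectors of the same width, even if some classes are missing from it.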