ML | One Hot Encoding to treat Categorical data parameters
Datasets often contain columns with categorical features stored as string values. For example, a Gender column may hold the labels Male and Female. These labels have no inherent order, and because they are strings, most machine learning models cannot work on them directly.
One approach is label encoding, where each label is assigned a numerical value, for example Male and Female mapped to 0 and 1. However, this can introduce bias: because 1 > 0, the model may treat Female as ranking higher than Male, even though both labels are equally important in the dataset. One Hot Encoding avoids this issue.
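As a minimal sketch of label encoding, the snippet below maps each label to an integer with a plain dictionary. The sample list `genders` and the mapping `label_map` are illustrative assumptions and are not taken from any dataset used in this article.

```python
# Hypothetical sample column; not taken from the article's dataset.
genders = ["Male", "Female", "Female", "Male"]

# Label encoding: assign an arbitrary integer to each distinct label.
label_map = {"Male": 0, "Female": 1}
encoded = [label_map[g] for g in genders]

print(encoded)  # [0, 1, 1, 0]

# The integers are ordered (1 > 0) although the labels are not,
# which is the artificial ordering a model may pick up on.
```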
One Hot Encoding:
In this technique, a separate column is created for each label of a categorical parameter. For Gender, this means one Male column and one Female column. Whenever a row has Male in Gender, it gets 1 in the Male column and 0 in the Female column, and vice versa.
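The idea can be sketched in plain Python before reaching for a library. The list `genders` is a hypothetical sample, and the sorted column order is just a choice made here for reproducibility.

```python
# Hypothetical sample column; not taken from the article's dataset.
genders = ["Male", "Female", "Female", "Male"]

# One column per distinct label, in sorted order: ['Female', 'Male'].
labels = sorted(set(genders))

# Each row gets a 1 in the column matching its label and 0 elsewhere.
one_hot = [[1 if g == label else 0 for label in labels] for g in genders]

print(labels)
for row in one_hot:
    print(row)
```

Every row contains exactly one 1, which is where the name "one hot" comes from.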
Let’s understand with an example:
Consider the following data, which lists fruits with their corresponding categorical values and prices.
Fruit | Categorical value of fruit | Price |
---|---|---|
apple | 1 | 5 |
mango | 2 | 10 |
apple | 1 | 15 |
orange | 3 | 20 |
The output after one hot encoding the data is as follows:
apple | mango | orange | price |
---|---|---|---|
1 | 0 | 0 | 5 |
0 | 1 | 0 | 10 |
1 | 0 | 0 | 15 |
0 | 0 | 1 | 20 |
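The fruit example can be reproduced with pandas. This is a sketch under a few assumptions: the DataFrame is typed in by hand from the table, the integer "categorical value" column is left out because it is not needed for the encoding, and `dtype=int` is passed so the result shows 1/0 rather than booleans.

```python
import pandas as pd

# The fruit table from the example, entered by hand.
fruits = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# Encoding a single Series yields one column per label, named after the label.
dummies = pd.get_dummies(fruits["Fruit"], dtype=int)

# Put the price back next to the encoded columns.
encoded = pd.concat([dummies, fruits["Price"]], axis=1)

print(encoded)
```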
Code: Python code implementation of One-Hot Encoding Technique
Loading the data
    # Program for demonstration of one hot encoding

    # import libraries
    import numpy as np
    import pandas as pd

    # import the data required
    data = pd.read_csv("employee_data.csv")
    print(data.head())
Checking for the labels in the categorical parameters
    print(data['Gender'].unique())
    print(data['Remarks'].unique())
Output:
    array(['Male', 'Female'], dtype=object)
    array(['Nice', 'Good', 'Great'], dtype=object)
Checking for the label counts in the categorical parameters
    # print() is needed so that both counts are shown when run as a script
    print(data['Gender'].value_counts())
    print(data['Remarks'].value_counts())
Output:
    Female    7
    Male      5
    Name: Gender, dtype: int64
    Nice     5
    Great    4
    Good     3
    Name: Remarks, dtype: int64
One-Hot encoding the categorical parameters using get_dummies()
    one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'])
    print(one_hot_encoded_data)
The encoded data contains 3 Remarks columns and 2 Gender columns, one per label. However, a parameter with n unique labels can be fully described by n-1 columns. For example, if we keep only the Gender_Female column and drop the Gender_Male column, no information is lost: a value of 1 means Female and 0 means Male. This way we can encode the categorical data and reduce the number of parameters as well.
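The n-1 column idea can be sketched as follows. The small DataFrame is a hypothetical stand-in for the Gender column of `employee_data.csv`, which is not reproduced here. Two options are shown: dropping a column by name, and the `drop_first` parameter of `get_dummies()`.

```python
import pandas as pd

# Hypothetical stand-in for the Gender column of employee_data.csv.
data = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Full encoding: one column per label (Gender_Female, Gender_Male).
full = pd.get_dummies(data, columns=["Gender"], dtype=int)

# Keep only Gender_Female, as described: 1 means Female, 0 means Male.
reduced = full.drop(columns=["Gender_Male"])

# drop_first=True also yields n-1 columns, but it drops the first label
# in sorted order (here Female), so the remaining column is Gender_Male.
reduced_auto = pd.get_dummies(data, columns=["Gender"], drop_first=True, dtype=int)

print(reduced)
print(reduced_auto)
```

Either result carries the same information; which column is kept only changes what 1 and 0 stand for.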
Last Updated on October 29, 2021 by admin