One Hot Encoding to treat Categorical data parameters




Datasets often contain columns with categorical features stored as string values. For example, a Gender parameter may hold the labels Male and Female. These labels have no inherent order, and because they are strings, most machine learning models cannot work with them directly.

One approach to this problem is label encoding, where each label is assigned a numerical value, for example Male and Female mapped to 0 and 1. But this can add bias to the model: it may treat Female as larger than Male because 1 > 0, even though both labels are equally important in the dataset. To deal with this issue we will use the One Hot Encoding technique.
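As a rough illustration of the label encoding described above, the sketch below maps the two labels to integers with pandas. The toy Series and the name label_map are assumptions made for this example and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy column, not taken from employee_data.csv
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# Label encoding: each label gets an arbitrary integer code
label_map = {"Male": 0, "Female": 1}
gender_encoded = gender.map(label_map)

# The codes now carry an artificial order (1 > 0) that the labels never had
print(gender_encoded.tolist())
```

Any model that compares magnitudes would read the resulting column as ordered, which is the bias the paragraph above warns about.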

One Hot Encoding:

In this technique, a separate column is created for each label of a categorical parameter. For Gender, this means one column for the Male label and one for the Female label. So, whenever Gender is Male, the Male column holds 1 and the Female column holds 0, and vice versa.

Let’s understand with an example:

Consider the following data, which lists fruits, their corresponding categorical values, and their prices.

Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

The output after one hot encoding the data is given as follows,

apple     mango     orange     price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20
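The fruit table above can be reproduced with a few lines of pandas. This is a minimal sketch: the DataFrame name fruits and the lowercase column names are assumptions chosen for the example.

```python
import pandas as pd

# The fruit data from the table above (column names are illustrative)
fruits = pd.DataFrame({
    "fruit": ["apple", "mango", "apple", "orange"],
    "price": [5, 10, 15, 20],
})

# One column per fruit label, holding 1 where the row has that label and 0 elsewhere;
# dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(fruits["fruit"], dtype=int).join(fruits["price"])

print(encoded)
```

Calling get_dummies() on the single fruit column yields columns named after the labels themselves, so the result lines up with the apple, mango, orange, and price columns shown in the table.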


Code: Python code implementation of One-Hot Encoding Technique

Loading the data

# Program for demonstration of one hot encoding
 
# import libraries
import numpy as np
import pandas as pd
 
# import the data required
data = pd.read_csv("employee_data.csv")
print(data.head())

Output:

Checking for the labels in the categorical parameters

print(data['Gender'].unique())
print(data['Remarks'].unique())

Output:

['Male' 'Female']
['Nice' 'Good' 'Great']

Checking for the label counts in the categorical parameters

# print() is needed so that both counts are shown when run as a script
print(data['Gender'].value_counts())
print(data['Remarks'].value_counts())

Output:

Female    7
Male      5
Name: Gender, dtype: int64

Nice     5
Great    4
Good     3
Name: Remarks, dtype: int64

One-Hot encoding the categorical parameters using get_dummies()

one_hot_encoded_data = pd.get_dummies(data, columns = ['Remarks', 'Gender'])
print(one_hot_encoded_data)

Output:

We can observe that the encoded data has 3 Remarks columns and 2 Gender columns. However, a parameter with n unique labels can be fully described with just n-1 columns. For example, if we keep only the Gender_Female column and drop the Gender_Male column, no information is lost: a value of 1 means female and a value of 0 means male. This way we can encode the categorical data and reduce the number of parameters as well.
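One way to get n-1 columns is the drop_first option of get_dummies(). The sketch below uses a small hypothetical DataFrame as a stand-in for employee_data.csv, since the original file is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the article's employee data
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Remarks": ["Nice", "Good", "Great", "Nice"],
})

# drop_first=True drops the first label (in sorted order) of each encoded column,
# leaving n-1 columns for a parameter with n unique labels
reduced = pd.get_dummies(
    data, columns=["Remarks", "Gender"], drop_first=True, dtype=int
)

print(reduced.columns.tolist())
```

Note that drop_first removes the alphabetically first label, so this sketch keeps Gender_Male rather than Gender_Female; the information carried is the same. To keep Gender_Female instead, one could encode all columns and then drop Gender_Male explicitly with drop(columns=['Gender_Male']).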

 

Last Updated on October 29, 2021 by admin
