# Pandas – Filling NaN in Categorical data

Real-world data is full of missing values. Before we can analyze such data and draw meaningful conclusions from it, we need to impute those missing values. In this article, we discuss how to fill NaN values in categorical data. For categorical features, imputation methods that rely on arithmetic, such as the mean or median, cannot be used because the values have no numeric meaning.

Let’s first create a sample dataset to understand methods of filling missing values:

```python
# import modules
import numpy as np
import pandas as pd

# create dataset
data = {'Id': [1, 2, 3, 4, 5, 6, 7, 8],
        'Gender': ['M', 'M', 'F', np.nan,
                   np.nan, 'F', 'M', 'F'],
        'Color': [np.nan, "Red", "Blue",
                  "Red", np.nan, "Red",
                  "Green", np.nan]}

# convert to data frame
df = pd.DataFrame(data)

# display() is available in Jupyter/IPython; use print(df) in a plain script
display(df)
```

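Before choosing an imputation method, it can help to check how many values are missing in each column. The following is a minimal sketch that rebuilds the sample dataset and counts NaN values with `isna().sum()`; the variable name `missing_per_column` is only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the article
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Gender": ["M", "M", "F", np.nan, np.nan, "F", "M", "F"],
    "Color": [np.nan, "Red", "Blue", "Red", np.nan, "Red", "Green", np.nan],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

For this dataset, `Id` has no missing values, `Gender` has two, and `Color` has three.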

To fill missing values in categorical features, we can use any of the approaches below.

### Method 1: Filling with the most frequent class

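One approach is to replace the missing values in a column with that column's most frequent class. The `value_counts()` method ignores NaN by default and sorts the classes by descending frequency, so the first entry of its index is the most common class. Note that in the sample data `Gender` contains three `M` and three `F` values, so which of the two is chosen depends on how the tie is ordered. Let’s see how it works: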

```python
# filling with most common class
df_clean = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
df_clean
```

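An alternative way to express the same idea is to use `mode()` and to restrict the fill to the categorical columns. This is a sketch rather than the article's original method; the list `cat_cols` is an assumption about which columns should be treated as categorical.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the article
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Gender": ["M", "M", "F", np.nan, np.nan, "F", "M", "F"],
    "Color": [np.nan, "Red", "Blue", "Red", np.nan, "Red", "Green", np.nan],
})

# Assumed list of categorical columns (Id is left untouched)
cat_cols = ["Gender", "Color"]

# mode() returns the most frequent value(s) per column in sorted order;
# the first row therefore holds one mode for each column
modes = df[cat_cols].mode().iloc[0]

# Fill each categorical column with its own mode
df_clean = df.copy()
df_clean[cat_cols] = df[cat_cols].fillna(modes)
print(df_clean)
```

Because `mode()` returns tied values in sorted order, the tie in `Gender` is resolved to `F` here, which may differ from what the `value_counts()` approach picks.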

### Method 2: Filling with unknown class

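Sometimes the fact that a value is missing is itself informative, and imputing the most common class would hide that. In such cases, we can replace the missing values with a placeholder such as “Unknown” or “Missing” using the `fillna()` method. Note that calling `fillna()` on the whole DataFrame fills every column, so on real data you may want to restrict it to the categorical columns. Let’s look at an example: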

```python
# filling with Unknown class
df_clean = df.fillna("Unknown")
df_clean
```

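Two variations on this method may be useful in practice. The sketch below first limits the fill to chosen columns by passing a dictionary to `fillna()`, and then handles a column stored with the pandas `category` dtype (an assumption, since the sample data uses plain strings). For a `category` column, the placeholder has to be registered as a category first; otherwise `fillna()` raises an error.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from the article
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Gender": ["M", "M", "F", np.nan, np.nan, "F", "M", "F"],
    "Color": [np.nan, "Red", "Blue", "Red", np.nan, "Red", "Green", np.nan],
})

# Variation 1: fill only the listed columns by passing a column -> value mapping
df_filled = df.fillna({"Gender": "Unknown", "Color": "Unknown"})

# Variation 2: a column with category dtype (assumed here for illustration)
colors = df["Color"].astype("category")

# "Unknown" is not yet a category, so add it before filling
colors_filled = colors.cat.add_categories("Unknown").fillna("Unknown")

print(df_filled)
print(colors_filled)
```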

### Method 3: Using Categorical Imputer of sklearn-pandas library

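The original scikit-learn `Imputer` worked only with numerical data, so the sklearn-pandas library provided an equivalent transformer, `CategoricalImputer`, that also works with string data. It replaces missing values with the most frequent value in the column. Note that `CategoricalImputer` is only available in older sklearn-pandas releases: it was removed in version 2.0 in favor of scikit-learn's `SimpleImputer`. Let’s see an example of replacing the NaN values of the “Color” column: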

```python
# using sklearn-pandas package
from sklearn_pandas import CategoricalImputer

# handling NaN values
imputer = CategoricalImputer()
data = np.array(df['Color'], dtype=object)
imputer.fit_transform(data)
```

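With a current scikit-learn installation, a similar result can be obtained with `SimpleImputer` and `strategy="most_frequent"`, which accepts string data. This is a sketch of a swapped-in alternative, not the sklearn-pandas API shown above; it assumes scikit-learn is installed and converts the column to a 2D object array, since scikit-learn transformers expect 2D input.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the sample dataset from the article
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "Gender": ["M", "M", "F", np.nan, np.nan, "F", "M", "F"],
    "Color": [np.nan, "Red", "Blue", "Red", np.nan, "Red", "Green", np.nan],
})

# Replace NaN with the most frequent value of the column
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Select the column as a 2D object array of shape (n_rows, 1)
color_2d = df[["Color"]].to_numpy(dtype=object)

# fit_transform returns a 2D array; flatten it back into a single column
df_clean = df.copy()
df_clean["Color"] = imputer.fit_transform(color_2d).ravel()
print(df_clean)
```

Here the three missing `Color` entries are filled with `Red`, the most frequent color in the sample data.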

Last Updated on October 23, 2021 by admin
