When working with large datasets, it is common to encounter duplicate rows. Duplicates can cause problems during analysis or modeling, skewing results and leading to inaccurate conclusions. In this article, we will explore how to identify and remove duplicate rows in a Pandas dataframe with the help of different code examples, primarily using the DataFrame.duplicated() method.
Syntax: DataFrame.duplicated(subset=None, keep='first')
Parameters:
subset: A column label or list of column labels. Defaults to None, which checks all columns; if columns are passed, only they are considered when identifying duplicates.
keep: Controls how duplicate values are marked. It accepts three values, with 'first' as the default:
- If 'first', the first occurrence is considered unique and all later occurrences are marked as duplicates.
- If 'last', the last occurrence is considered unique and all earlier occurrences are marked as duplicates.
- If False, all occurrences of the same value are marked as duplicates.
Returns: Boolean Series denoting duplicate rows.
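To make the keep parameter concrete, here is a minimal sketch; the column name 'x' and its values are invented purely for illustration:

```python
import pandas as pd

# 'a' appears at positions 0, 2, and 3; 'b' appears once
df = pd.DataFrame({'x': ['a', 'b', 'a', 'a']})

print(df.duplicated(keep='first').tolist())  # [False, False, True, True]
print(df.duplicated(keep='last').tolist())   # [True, False, True, False]
print(df.duplicated(keep=False).tolist())    # [True, False, True, True]
```

Note that keep=False marks every copy of a repeated value, including the first.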
Method 1: Find Duplicate Rows Based on All Columns
The first method to identify duplicate rows in a Pandas dataframe is to call the duplicated() method, which returns a boolean series indicating which rows are duplicates. We can then use this boolean series to filter the dataframe and return only the duplicate rows.
Let’s first create a sample dataframe:
import pandas as pd

data = {'name': ['John', 'Sarah', 'John', 'David', 'Sarah'],
        'age': [25, 28, 25, 32, 28],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
print(df)
Output:
    name  age  gender
0   John   25    Male
1  Sarah   28  Female
2   John   25    Male
3  David   32    Male
4  Sarah   28  Female
Next, we call the duplicated() method to identify the duplicate rows:
duplicates = df.duplicated()
print(duplicates)
Output:
0    False
1    False
2     True
3    False
4     True
dtype: bool
Now, let's filter the dataframe based on the boolean series:
duplicate_rows = df[duplicates]
print(duplicate_rows)
Output:
    name  age  gender
2   John   25    Male
4  Sarah   28  Female
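Note that with the default keep='first', only the later copies appear in the result. If we want to see every copy of each duplicated row, including the first occurrence, we can pass keep=False; a short sketch using the same sample data:

```python
import pandas as pd

data = {'name': ['John', 'Sarah', 'John', 'David', 'Sarah'],
        'age': [25, 28, 25, 32, 28],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)

# keep=False flags every copy, so first occurrences appear as well
all_copies = df[df.duplicated(keep=False)]
print(all_copies)  # rows 0, 1, 2, and 4 (only David is unique)
```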
Method 2: Find Duplicate Rows Based on Selected Columns
If we want to find duplicate rows based on only selected columns, we can pass a list of column names to the subset parameter of the duplicated() method. For example, to find duplicate rows based on the Employee ID and Last Name columns, we can do the following:
import pandas as pd

# creating dataframe
employees = pd.DataFrame({
    'Employee ID': ['E001', 'E002', 'E003', 'E003', 'E004', 'E005', 'E006', 'E007', 'E007'],
    'First Name': ['John', 'Nick', 'John', 'John', 'Lucy', 'Alice', 'Alice', 'Alice', 'Alice'],
    'Last Name': ['Smith', 'Doe', 'Smith', 'Doe', 'Wang', 'Garcia', 'Garcia', 'Garcia', 'Garcia'],
    'Department': ['HR', 'Marketing', 'IT', 'IT', 'Finance', 'HR', 'HR', 'HR', 'HR'],
    'Salary': [50000, 60000, 55000, 55000, 70000, 45000, 45000, 45000, 45000],
    'Joining Date': ['2017-01-01', '2018-01-01', '2017-01-01', '2017-02-01', '2017-01-15',
                     '2017-05-01', '2017-05-01', '2017-05-01', '2017-05-01'],
    'Age': [32, 26, 35, 35, 29, 25, 25, 25, 25]
})

# find duplicate rows based on selected columns
duplicate_rows = employees[employees.duplicated(subset=['Employee ID', 'Last Name'])]
print(duplicate_rows)
This returns a dataframe containing the rows that are duplicates with respect to the Employee ID and Last Name columns.
Similarly, we can use the drop_duplicates() method to drop the duplicate rows based on selected columns:
# drop duplicate rows based on selected columns
employees.drop_duplicates(subset=['Employee ID', 'Last Name'], inplace=True)
print(employees)
The above code modifies the employees dataframe in place, dropping the duplicate rows based on the Employee ID and Last Name columns.
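By default, drop_duplicates() keeps the first occurrence of each duplicate group; passing keep='last' retains the final occurrence instead. A minimal sketch, using a trimmed-down frame with columns invented for illustration:

```python
import pandas as pd

employees = pd.DataFrame({
    'Employee ID': ['E001', 'E003', 'E003', 'E007', 'E007'],
    'Last Name':   ['Smith', 'Smith', 'Doe', 'Garcia', 'Garcia'],
    'Salary':      [50000, 55000, 55000, 45000, 45000],
})

# keep='last' retains the final occurrence of each duplicate group
deduped = employees.drop_duplicates(subset=['Employee ID', 'Last Name'], keep='last')
print(deduped)  # the first E007/Garcia row (index 3) is dropped
```

Without inplace=True, the method returns a new DataFrame and leaves the original untouched.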
Method 3: Using groupby() and size()
The groupby() method groups the rows in the dataframe based on the specified columns and applies an aggregation function to each group. Here, we will group the rows based on all the columns and count the number of rows in each group using the size() method. If the size of any group is greater than 1, it means that the group has duplicate rows.
# group the rows based on all columns and
# count the number of rows in each group
df_grouped = df.groupby(list(df.columns)).size().reset_index(name='count')

# filter the groups with count greater than 1
df_grouped[df_grouped['count'] > 1]
The above code groups the rows based on all columns and counts the number of rows in each group using the size() method. The reset_index() method is used to reset the index of the resulting dataframe. Finally, we filter the groups with count greater than 1 to get the duplicate rows.
# create a new dataframe with duplicate rows
data = {'Name': ['John', 'Sarah', 'John', 'Mark', 'Sarah'],
        'Age': [28, 22, 28, 25, 22],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)

# group the rows based on all columns and
# count the number of rows in each group
df_grouped = df.groupby(list(df.columns)).size().reset_index(name='count')

# filter the groups with count greater than 1
df_grouped[df_grouped['count'] > 1]
Output:
    Name  Age  Gender  count
0   John   28    Male      2
2  Sarah   22  Female      2
As we can see, this method has correctly identified the duplicate rows in our sample dataframe.
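A more compact alternative, assuming pandas 1.1 or later, is DataFrame.value_counts(), which counts each distinct row directly and spares us the explicit groupby; a sketch with the same sample data:

```python
import pandas as pd

data = {'Name': ['John', 'Sarah', 'John', 'Mark', 'Sarah'],
        'Age': [28, 22, 28, 25, 22],
        'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)

# value_counts() on the whole frame counts each distinct row (pandas >= 1.1)
counts = df.value_counts()
duplicates = counts[counts > 1]
print(duplicates)
```

The result is a Series indexed by the row values, so the duplicated rows and their counts are read off directly.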
Method 4: The drop_duplicates() Method
The drop_duplicates() method removes duplicate rows from a DataFrame, optionally based on specific columns, and returns a new DataFrame with the duplicate rows removed.
Let’s use this method to remove duplicate rows based on the “name” and “age” columns:
# create a new DataFrame
data = {'name': ['John', 'Sarah', 'John', 'Mike', 'Mike'],
        'age': [25, 33, 25, 40, 40],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)

# remove duplicate rows based on 'name' and 'age' columns
df = df.drop_duplicates(['name', 'age'])

# print the new DataFrame
print(df)
Output:
    name  age  gender
0   John   25    Male
1  Sarah   33  Female
3   Mike   40    Male
As we can see from the output, the duplicate rows based on the "name" and "age" columns have been removed.
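Notice that the surviving rows keep their original index labels (0, 1, 3). If a clean 0-based index is preferred, drop_duplicates() accepts ignore_index=True (assuming pandas 1.0 or later), which renumbers the result; a short sketch:

```python
import pandas as pd

data = {'name': ['John', 'Sarah', 'John', 'Mike', 'Mike'],
        'age': [25, 33, 25, 40, 40],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)

# ignore_index=True (pandas >= 1.0) renumbers the surviving rows 0..n-1
clean = df.drop_duplicates(['name', 'age'], ignore_index=True)
print(clean)  # index is 0, 1, 2 instead of 0, 1, 3
```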
Last Updated on May 16, 2023 by admin