Find Duplicate Rows in a Pandas DataFrame [With Examples]



When working with large datasets, it is not uncommon to encounter duplicate rows. Duplicates can cause problems during analysis or modeling, as they can skew results and lead to inaccurate conclusions.

In this article, we will explore how to find and remove duplicate rows in a Pandas dataframe with the help of different code examples, primarily using the DataFrame.duplicated() method of Pandas.

Syntax: DataFrame.duplicated(subset=None, keep='first')

Parameters:
subset: A column label or list of column labels. The default value is None, which means all columns are considered. If columns are passed, only those columns are checked for duplicates.

keep: Controls which occurrences are marked as duplicates. It accepts three distinct values, and the default is 'first'.

  • If 'first', the first occurrence is treated as unique and the rest of the identical rows are marked as duplicates.
  • If 'last', the last occurrence is treated as unique and the rest of the identical rows are marked as duplicates.
  • If False (the boolean, not the string 'False'), all identical rows are marked as duplicates.

Returns: Boolean Series denoting duplicate rows.
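To make the three keep values concrete, here is a minimal sketch on a tiny one-column dataframe (the column name 'x' is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# 'first': the second 1 is flagged as the duplicate
print(df.duplicated(keep='first').tolist())  # [False, True, False]

# 'last': the first 1 is flagged as the duplicate
print(df.duplicated(keep='last').tolist())   # [True, False, False]

# False: both 1s are flagged
print(df.duplicated(keep=False).tolist())    # [True, True, False]
```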

Method 1: Find Duplicate Rows Based on All Columns

The first method to identify duplicate rows in a Pandas dataframe is to use the duplicated() method. This method returns a boolean series indicating which rows are duplicates. We can then use this boolean series to filter the dataframe and return only the duplicate rows.

Let’s first create a sample dataframe:

  import pandas as pd

  data = {'name': ['John', 'Sarah', 'John', 'David', 'Sarah'],
          'age': [25, 28, 25, 32, 28],
          'gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
  df = pd.DataFrame(data)

  print(df)
  

Output:

     name   age  gender
  0   John   25    Male
  1  Sarah   28  Female
  2   John   25    Male
  3  David   32    Male
  4  Sarah   28  Female

Now, let's use the duplicated() method to identify the duplicate rows:

  duplicates = df.duplicated()

  print(duplicates)
  

Output:

  0    False
  1    False
  2     True
  3    False
  4     True
  dtype: bool

Now, let's filter the dataframe based on the boolean series:

  duplicate_rows = df[duplicates]

  print(duplicate_rows)
  

Output:

     name   age  gender
  2   John   25    Male
  4  Sarah   28  Female
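Note that with the default keep='first', the first occurrence of each duplicated row is not flagged. If we instead want every row that has a duplicate anywhere in the dataframe, including the first occurrence, we can pass keep=False, as sketched below with the same sample data:

```python
import pandas as pd

data = {'name': ['John', 'Sarah', 'John', 'David', 'Sarah'],
        'age': [25, 28, 25, 32, 28],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)

# keep=False flags every member of a duplicate group,
# not just the later copies
all_duplicates = df[df.duplicated(keep=False)]
print(all_duplicates)
```

This returns rows 0, 1, 2, and 4, since only David's row is unique.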

Method 2: Find Duplicate Rows Based on Selected Columns

If we want to find duplicate rows based on only selected columns, we can pass a list of column names to the subset parameter of the duplicated() method. For example, if we want to find duplicate rows based on the Employee ID and Last Name columns, we can do the following:

  import pandas as pd

  # creating the dataframe
  employees = pd.DataFrame({
      'Employee ID': ['E001', 'E002', 'E003', 'E003', 'E004', 'E005', 'E006', 'E007', 'E007'],
      'First Name': ['John', 'Nick', 'John', 'John', 'Lucy', 'Alice', 'Alice', 'Alice', 'Alice'],
      'Last Name': ['Smith', 'Doe', 'Smith', 'Doe', 'Wang', 'Garcia', 'Garcia', 'Garcia', 'Garcia'],
      'Department': ['HR', 'Marketing', 'IT', 'IT', 'Finance', 'HR', 'HR', 'HR', 'HR'],
      'Salary': [50000, 60000, 55000, 55000, 70000, 45000, 45000, 45000, 45000],
      'Joining Date': ['2017-01-01', '2018-01-01', '2017-01-01', '2017-02-01',
                       '2017-01-15', '2017-05-01', '2017-05-01', '2017-05-01', '2017-05-01'],
      'Age': [32, 26, 35, 35, 29, 25, 25, 25, 25]
  })

  # find duplicate rows based on selected columns
  duplicate_rows = employees[employees.duplicated(subset=['Employee ID', 'Last Name'])]
  print(duplicate_rows)

This will return a dataframe with the duplicate rows based on the Employee ID and Last Name columns.
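As in Method 1, keep=False can be combined with subset to surface every occurrence of a duplicated key, not just the later copies. A minimal sketch using only the two relevant columns of the employees dataframe:

```python
import pandas as pd

employees = pd.DataFrame({
    'Employee ID': ['E001', 'E002', 'E003', 'E003', 'E004', 'E005', 'E006', 'E007', 'E007'],
    'Last Name': ['Smith', 'Doe', 'Smith', 'Doe', 'Wang', 'Garcia', 'Garcia', 'Garcia', 'Garcia'],
})

# keep=False marks both E007/Garcia rows, not just the second one
mask = employees.duplicated(subset=['Employee ID', 'Last Name'], keep=False)
print(employees[mask])
```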

Similarly, we can use the drop_duplicates() method to drop the duplicate rows based on selected columns:

  # drop duplicate rows based on selected columns
  employees.drop_duplicates(subset=['Employee ID', 'Last Name'], inplace=True)
  print(employees)


The above code modifies the employees dataframe in place, dropping the duplicate rows based on the Employee ID and Last Name columns.
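drop_duplicates() accepts the same keep parameter as duplicated(), so we can also choose which occurrence survives. A small sketch (the 'id' and 'value' column names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B'],
                   'value': [1, 2, 3]})

# keep='last' retains the final occurrence of each duplicate key;
# assigning the result avoids inplace mutation
deduped = df.drop_duplicates(subset=['id'], keep='last')
print(deduped)
```

Here the first 'A' row is dropped and the second one survives.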

Method 3: Using groupby() and size()

The groupby() method groups the rows in the dataframe based on the specified columns and applies an aggregation function to each group. Here, we will group the rows based on all the columns and count the number of rows in each group using the size() method. If the size of any group is greater than 1, it means that the group has duplicate rows.

Let's apply this to a sample dataframe. We group the rows by all columns, count the rows in each group with size(), use reset_index(name='count') to turn the result back into a dataframe with a 'count' column, and finally filter for groups whose count is greater than 1:

  # create a new dataframe with duplicate rows
  data = {'Name': ['John', 'Sarah', 'John', 'Mark', 'Sarah'],
          'Age': [28, 22, 28, 25, 22],
          'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
  df = pd.DataFrame(data)

  # group the rows based on all columns and
  # count the number of rows in each group
  df_grouped = df.groupby(list(df.columns)).size().reset_index(name='count')

  # filter the groups with count greater than 1
  print(df_grouped[df_grouped['count'] > 1])

Output:

      Name  Age  Gender  count
  0   John   28    Male      2
  2  Sarah   22  Female      2

As we can see, this method has correctly identified the duplicate rows in our sample dataframe.
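As a side note, on recent pandas versions (1.1 and later) DataFrame.value_counts() is a convenient shorthand for this groupby/size pattern; it returns the per-row counts directly as a Series indexed by the row values:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Sarah', 'John', 'Mark', 'Sarah'],
                   'Age': [28, 22, 28, 25, 22],
                   'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']})

# count identical rows, then keep only the ones appearing more than once
counts = df.value_counts()
print(counts[counts > 1])
```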


Method 4: Using the drop_duplicates() Method

The drop_duplicates() method can be used to remove duplicate rows from a DataFrame based on specific columns. This method returns a new DataFrame with the duplicate rows removed.

Let’s use this method to remove duplicate rows based on the “name” and “age” columns:

  # create a new DataFrame
  data = {'name': ['John', 'Sarah', 'John', 'Mike', 'Mike'],
          'age': [25, 33, 25, 40, 40],
          'gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
  df = pd.DataFrame(data)

  # remove duplicate rows based on 'name' and 'age' columns
  df = df.drop_duplicates(['name', 'age'])

  # print the new DataFrame
  print(df)

Output:

      name  age  gender
  0   John   25    Male
  1  Sarah   33  Female
  3   Mike   40    Male

As we can see from the output, the duplicated rows based on the “name” and “age” columns have been removed.
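One detail worth noting: drop_duplicates() keeps the original index labels (0, 1, 3 in the output above). On pandas 1.0 and later, passing ignore_index=True renumbers the surviving rows from 0:

```python
import pandas as pd

data = {'name': ['John', 'Sarah', 'John', 'Mike', 'Mike'],
        'age': [25, 33, 25, 40, 40],
        'gender': ['Male', 'Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)

# ignore_index=True resets the index of the result to 0..n-1
deduped = df.drop_duplicates(['name', 'age'], ignore_index=True)
print(deduped)
```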

Last Updated on May 16, 2023 by admin
