How to select rows from a dataframe based on column values ?



How to select rows from a dataframe based on column values ?

The rows of a dataframe can be selected based on conditions as we do use the SQL queries. The various methods to achieve this is explained in this article with examples. To explain the method a dataset has been created which contains data of points scored by 10 people in various games. The dataset is loaded into the dataframe and visualized first. Ten people with unique player id(Pid) have played different games with different game id(game_id) and the points scored in each game is added as an entry to the table. Some of the player’s points are not recorded and thus NaN value appears in the table.

 

Note: To get the CSV file used, click here.

import pandas as pd
 
df=pd.read_csv(r"__your file path__\example2.csv")
print(df)

Output:

dataset example2.csv

Boolean Indexing method

In this method, for a specified column condition, each row is checked for true/false. The rows which yield True will be considered for the output. This can be achieved in various ways. The query used is Select rows where the column Pid=’p01′

Example 1: Checking condition while indexing

# Choose entries with id p01
df_new = df[df['Pid'] == 'p01']
 
print(df_new)

Output

Example 2: Specifying the condition ‘mask’ variable

 

The selected rows are assigned to a new dataframe with the index of rows from old dataframe as an index in the new one and the columns remaining the same.

# condition mask
mask = df['Pid'] == 'p01'
 
# new dataframe with selected rows
df_new = pd.DataFrame(df[mask])
 
print(df_new)

Output

Example 3: Combining mask and dataframes.values property

The query here is Select the rows with game_id ‘g21’.

# condition with df.values property
mask = df['game_id'].values == 'g21'
 
# new dataframe
df_new = df[mask]
 
print(df_new)

Output

Positional indexing

The methods loc() and iloc() can be used for slicing the dataframes in Python. Among the differences between loc() and iloc(), the important thing to be noted is iloc() takes only integer indices, while loc() can take up boolean indices also.

 

Example 1: Using loc()

The mask gives the boolean value as an index for each row and whichever rows evaluate to true will appear in the result. Here, the query is to select the rows where game_id is g21.

# for boolean indexing
mask = df['game_id'].values == 'g21'
 
# using loc() method
df_new = df.loc[mask]
 
print(df_new)

Output

Example 2: Using iloc()

The query is the same as the one taken above. The iloc() takes only integers as an argument and thus, the mask array is passed as a parameter to the numpy’s flatnonzero() function that returns the index in the list where the value is not zero (false)

# condition mask
mask = df['game_id'].values == 'g21'
print("Mask array :", mask)
 
# getting non zero indices
pos = np.flatnonzero(mask)
print("\nRows selected :", pos)
 
# selecting rows
df.iloc[pos]

Output

 

Using dataframe.query()

The query() method takes up the expression that returns a boolean value, processes all the rows in the dataframe, and returns the resultant dataframe with selected rows.

Example 1: Select  rows where name=”Albert”

df.query('name=="Albert"')

Output

Example 2: Select rows where points>50 and the player is not Albert.

This example is to demonstrate that logical operators like AND/OR can be used to check multiple conditions.

df.query('points>50 & name!="Albert"')

Output

 

Using isin()

This method of dataframe takes up an iterable or a series or another dataframe as a parameter and checks whether elements of the dataframe exists in it. The rows whichever evaluates to true are considered for the resultant.

Example 1: Select the rows where players are Albert, Louis, and John.

#Players to be selected
li=['Albert','Louis','John']
 
df[df.name.isin(li)]

Output

Example 2: Select rows where points>50 and players are not Albert, Louis and John.

The tiled symbol (~) provides the negation of the expression evaluated.

# values to be present in selected rows
li = ['Albert', 'Louis', 'John']
 
# selecting rows from dataframe
df[(df.points > 50) & (~df.name.isin(li))]

Output

 

Using np.where()

The numpy’s where() function can be combined with the pandas’ isin() function to produce a faster result. The numpy.where() is proved to produce results faster than the normal methods used above.

Example:

import numpy as np
 
df_new = df.iloc[np.where(df.name.isin(li))]

Output

Comparison with other methods

# to calculate timing
import numpy as np
% % timeit
 
 
# using mixture of numpy and pandas method
df_new = df.iloc[np.where(df.name.isin(li))]

Output:

756 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# to calculate time
%%timeit
 
li=['Albert','Louis','John']
 
# Pandas method only
df[(df.points>50)&(~df.name.isin(li))]

Output

1.7 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Last Updated on November 13, 2021 by admin

Leave a Reply

Your email address will not be published. Required fields are marked *

Recommended Blogs