Randomly select rows from Pandas DataFrame



In data analysis and machine learning tasks, it is often necessary to randomly select rows from a Pandas DataFrame. This can be useful for tasks such as creating train-test splits, generating random samples, or conducting statistical analysis on a subset of the data. In this article, we will explore different methods to achieve this.

Method 1: Using the sample() Method

The sample() method in Pandas allows us to randomly select rows from a DataFrame. We can specify the number of rows we want to select using the n parameter.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 70000, 80000, 90000]}

df = pd.DataFrame(data)

# Randomly select 2 rows
random_rows = df.sample(n=2)
print(random_rows)

The above code creates a DataFrame and then uses the sample() method to randomly select 2 rows from the DataFrame. The selected rows are stored in the random_rows variable and then printed.
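By default, sample() draws without replacement and keeps each row's original index label. The following is a minimal sketch of two common variations, assuming the same example data as above: passing replace=True so that n may exceed the number of rows, and calling reset_index(drop=True) to renumber the result. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Without replacement (the default), every selected row is distinct
# and keeps its original index label.
without_replacement = df.sample(n=2)

# With replace=True a row may be drawn more than once, so n can
# exceed the number of rows in the DataFrame.
with_replacement = df.sample(n=8, replace=True)

# reset_index(drop=True) renumbers the sampled rows starting from 0.
renumbered = without_replacement.reset_index(drop=True)

print(without_replacement)
print(with_replacement)
print(renumbered)
```

Requesting more rows than the DataFrame holds without replace=True raises a ValueError, so the flag is only needed when repeated rows are acceptable.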

Method 2: Using the sample() Method with frac Parameter

An alternative way to randomly select rows is by using the sample() method with the frac parameter. This allows us to specify the fraction of rows to be selected.

# Randomly select 50% of the rows
random_rows = df.sample(frac=0.5)
print(random_rows)

The above code randomly selects 50% of the rows from the DataFrame using the sample() method with the frac parameter. The selected rows are then printed.
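The frac parameter is also a convenient way to build a simple train-test split like the one mentioned in the introduction. The sketch below assumes the DataFrame has unique index labels, since it relies on drop() to remove the sampled rows; the 80/20 ratio and the names train and test are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Sample 80% of the rows for training.
train = df.sample(frac=0.8, random_state=0)

# Everything that was not sampled becomes the test set.
# This assumes the index labels are unique.
test = df.drop(train.index)

print(train)
print(test)
```

With five rows, 80% corresponds to four training rows and one test row.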

Method 3: Using the sample() Method with random_state Parameter

If we want to obtain the same set of randomly selected rows each time we run the code, we can use the random_state parameter.

# Randomly select 3 rows with a fixed random state
random_rows = df.sample(n=3, random_state=42)
print(random_rows)

In the above code, the random_state parameter is set to 42, which ensures that the same 3 rows are selected each time the code is executed.
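A quick way to confirm this behaviour is to sample twice with the same seed and compare the results. This is a small sketch using the same example data; the names first and second are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Two independent calls with the same seed.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# equals() checks that both results hold the same rows in the same order.
print(first.equals(second))
```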

Method 4: Using the numpy.random.choice() Function

An alternative approach is to use the numpy.random.choice() function along with the index values of the DataFrame to randomly select rows.

import numpy as np

# Get the index values of the DataFrame
indices = df.index.values

# Randomly select 4 rows
random_indices = np.random.choice(indices, size=4, replace=False)
random_rows = df.loc[random_indices]
print(random_rows)

In this code, we first obtain the index values of the DataFrame using df.index.values. Then, we use numpy.random.choice() to randomly select 4 unique indices from the DataFrame's index. Finally, we use the selected indices to retrieve the corresponding rows from the DataFrame.
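NumPy also provides a newer Generator interface, created with np.random.default_rng(), that offers an equivalent choice() method. The sketch below shows the same selection with that interface, assuming a NumPy version that includes it; the seed value 42 is an arbitrary choice added here for repeatability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Create a seeded random number generator.
rng = np.random.default_rng(seed=42)

# Pick 4 distinct index labels, then look up the matching rows.
random_indices = rng.choice(df.index.to_numpy(), size=4, replace=False)
random_rows = df.loc[random_indices]

print(random_rows)
```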

Method 5: Using the np.random.randint() Function

You can use the np.random.randint() function to generate random integer indices and select rows based on those indices.

import numpy as np

# Generate random integer indices
indices = np.random.randint(0, len(df), size=3)

# Select rows based on the indices
random_rows = df.iloc[indices]
print(random_rows)

In this method, we use np.random.randint() to generate 3 random integer positions between 0 (inclusive) and the DataFrame's length (exclusive). Then, we use df.iloc to select the rows at those positions. Because each position is drawn independently, the same row may be selected more than once.
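The sketch below illustrates how repeated rows can arise with np.random.randint() and shows one possible way to get distinct rows instead, using np.random.permutation(). The variable names are illustrative, and drawing 10 positions from 5 rows is chosen only to make repeats unavoidable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# randint draws each position independently, so positions can repeat.
# Drawing 10 positions from only 5 rows guarantees some repeats.
repeated_positions = np.random.randint(0, len(df), size=10)
rows_with_repeats = df.iloc[repeated_positions]

# A random permutation of all positions contains each one exactly once,
# so its first 3 entries point at 3 distinct rows.
distinct_positions = np.random.permutation(len(df))[:3]
distinct_rows = df.iloc[distinct_positions]

print(rows_with_repeats)
print(distinct_rows)
```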

Method 6: Using the pandas DataFrame.sample() Method with weights

The sample() method can also accept an optional weights parameter, which allows you to assign probabilities to each row and perform weighted random sampling.

# Assign weights to each row
weights = [0.1, 0.2, 0.3, 0.2, 0.2]

# Perform weighted random sampling
random_rows = df.sample(n=2, weights=weights, replace=False)
print(random_rows)

In this method, we assign a weight to each row using a list. The higher the weight, the more likely that row will be selected. The sample() method then performs random sampling based on the assigned weights.
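The weights do not have to be a hand-written list. As a sketch, the example below passes the name of an existing column as the weights, and also passes a list that does not sum to 1, which pandas normalizes before sampling. Using Salary as the weight column and the seed value 1 are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Use a column as the weights: rows with a higher Salary
# are more likely to be selected.
by_salary = df.sample(n=2, weights='Salary', random_state=1)

# Weights that do not sum to 1 are rescaled so that they do.
unnormalized = df.sample(n=2, weights=[1, 2, 3, 2, 2], random_state=1)

print(by_salary)
print(unnormalized)
```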

Method 7: Combining the query() Method with sample()

The query() method does not select rows at random by itself; it filters rows using a query expression. To randomly select rows that meet a condition, we can filter with query() first and then call sample() on the result.

# Randomly select 2 rows where Age is greater than 30
random_rows = df.query('Age > 30').sample(n=2)
print(random_rows)

In this method, we use the query() method to keep only the rows where the Age column is greater than 30, and then use sample() to randomly select 2 of those rows. This can be helpful when you want to randomly select rows based on certain criteria.

Last Updated on May 17, 2023 by admin
