In data analysis and machine learning tasks, it is often necessary to randomly select rows from a Pandas DataFrame. This can be useful for tasks such as creating train-test splits, generating random samples, or conducting statistical analysis on a subset of the data. In this article, we will explore different methods to achieve this.
Method 1: Using the sample() Method
The sample() method in Pandas allows us to randomly select rows from a DataFrame. We can specify the number of rows we want to select using the n parameter.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
}
df = pd.DataFrame(data)

# Randomly select 2 rows
random_rows = df.sample(n=2)
print(random_rows)

The above code creates a DataFrame and then uses the sample() method to randomly select 2 rows from it. The selected rows are stored in the random_rows variable and then printed.
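One detail worth knowing about n: by default sample() draws without replacement, so asking for more rows than the DataFrame holds raises a ValueError. The following is a minimal sketch of that behavior on a small illustrative DataFrame; the variable names oversample_failed and with_repeats are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Asking for more rows than exist fails when sampling without replacement
try:
    df.sample(n=10)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# With replace=True a row may be drawn more than once, so n can exceed len(df)
with_repeats = df.sample(n=10, replace=True)
print(oversample_failed, len(with_repeats))
```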
Method 2: Using the sample() Method with frac Parameter
An alternative way to randomly select rows is by using the sample() method with the frac parameter. This allows us to specify the fraction of rows to be selected.

# Randomly select 50% of the rows
random_rows = df.sample(frac=0.5)
print(random_rows)

The above code randomly selects 50% of the rows from the DataFrame using the sample() method with the frac parameter. The selected rows are then printed.
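The frac parameter lends itself to the train-test split mentioned in the introduction. Below is a minimal sketch, assuming the DataFrame has a unique index so that the sampled rows can be dropped by label; the 60/40 ratio and the seed value are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Draw 60% of the rows for training; the seed makes the split repeatable
train = df.sample(frac=0.6, random_state=0)

# Every row not drawn for training goes to the test set
# (relies on the index labels being unique)
test = df.drop(train.index)

print(len(train), len(test))
```

For larger projects a dedicated splitting utility may be more convenient, but this pattern needs nothing beyond pandas.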
Method 3: Using the sample() Method with random_state Parameter
If we want to obtain the same set of randomly selected rows each time we run the code, we can use the random_state parameter.

# Randomly select 3 rows with a fixed random state
random_rows = df.sample(n=3, random_state=42)
print(random_rows)

In the above code, the random_state parameter is set to 42, which ensures that the same 3 rows are selected each time the code is executed.
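A quick way to confirm this is to draw twice with the same seed and compare the results. The sketch below does that on a small illustrative DataFrame; the names first, second, and same_rows are introduced here only for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Two draws with the same seed return identical rows in identical order
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
same_rows = first.equals(second)

# Without a seed, the selection may change from run to run
unseeded = df.sample(n=3)

print(same_rows)
```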
Method 4: Using the numpy.random.choice() Function
An alternative approach is to use the numpy.random.choice() function along with the index values of the DataFrame to randomly select rows.

import numpy as np

# Get the index values of the DataFrame
indices = df.index.values

# Randomly select 4 rows
random_indices = np.random.choice(indices, size=4, replace=False)
random_rows = df.loc[random_indices]
print(random_rows)

In this code, we first obtain the index values of the DataFrame using df.index.values. Then, we use numpy.random.choice() to randomly select 4 unique indices from the DataFrame's index. Finally, we use the selected indices to retrieve the corresponding rows from the DataFrame.
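The same idea can be made reproducible with NumPy's Generator interface, which is a swapped-in alternative to the legacy numpy.random.choice() call used above. This is a minimal sketch; the seed value and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# A seeded generator returns the same selection on every run
rng = np.random.default_rng(seed=42)

# Pick 4 distinct index labels, then look the rows up by label
random_labels = rng.choice(df.index.to_numpy(), size=4, replace=False)
random_rows = df.loc[random_labels]
print(random_rows)
```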
Method 5: Using the np.random.randint() Function
You can use the np.random.randint() function to generate random integer positions and select rows based on those positions.

import numpy as np

# Generate random integer positions in the range [0, len(df))
indices = np.random.randint(0, len(df), size=3)

# Select rows based on the positions
random_rows = df.iloc[indices]
print(random_rows)

In this method, we use np.random.randint() to generate 3 random integers from 0 up to, but not including, the DataFrame's length. Then, we use df.iloc to select the rows at those positions. Note that np.random.randint() draws with replacement, so the same position can come up more than once and the result may contain duplicate rows.
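Because random integers are drawn independently, the same position can appear more than once. If distinct rows are required, one option is to shuffle all positions and keep the first few. The sketch below uses NumPy's Generator interface (rng.integers and rng.permutation) rather than the legacy function above; the seed and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

rng = np.random.default_rng(seed=0)

# Independent draws: a position may be repeated
positions_with_repeats = rng.integers(0, len(df), size=3)

# Shuffle positions 0..len(df)-1 and keep the first 3: all distinct
unique_positions = rng.permutation(len(df))[:3]
random_rows = df.iloc[unique_positions]
print(random_rows)
```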
Method 6: Using the pandas DataFrame.sample() Method with weights
The sample() method can also accept an optional weights parameter, which allows you to assign probabilities to each row and perform weighted random sampling.

# Assign weights to each row
weights = [0.1, 0.2, 0.3, 0.2, 0.2]

# Perform weighted random sampling
random_rows = df.sample(n=2, weights=weights, replace=False)
print(random_rows)

In this method, we assign a weight to each row using a list. The higher the weight, the more likely that row will be selected. The sample() method then performs random sampling based on the assigned weights.
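Weights do not have to be a hand-written list of probabilities. As a minimal sketch: when sampling rows of a DataFrame, sample() also accepts a column name as weights, and weights that do not sum to 1 are normalized. The zero-weight example and the seed are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Use an existing column as weights: higher salaries are more likely to be drawn
by_salary = df.sample(n=2, weights='Salary', random_state=0)

# Rows with a weight of zero are never selected
zero_for_first_two = [0, 0, 1, 1, 1]
never_first_two = df.sample(n=2, weights=zero_for_first_two, random_state=0)

print(by_salary)
print(never_first_two)
```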
Method 7: Using the pandas DataFrame.query() Method
The query() method filters rows using a query expression. It does not sample on its own, so to randomly select rows that meet a condition, chain it with sample().

# Keep rows where Age is greater than 30, then randomly select 2 of them
random_rows = df.query('Age > 30').sample(n=2)
print(random_rows)

In this method, we use query() to keep only the rows where the Age column is greater than 30, and then sample() to randomly select 2 of the matching rows. This can be helpful when you want to randomly select rows based on certain criteria.
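When a filter is combined with random selection, the condition may match fewer rows than requested. The sketch below guards against that with a hypothetical helper, sample_matching, which is not part of pandas and is shown only as one possible approach.

```python
import pandas as pd


def sample_matching(df, expr, n, random_state=None):
    """Hypothetical helper: filter with query(), then sample up to n rows."""
    matches = df.query(expr)
    # Cap n so sampling never asks for more rows than matched
    k = min(n, len(matches))
    return matches.sample(n=k, random_state=random_state)


df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
})

# Three rows have Age > 30, so asking for 2 returns 2
two_rows = sample_matching(df, 'Age > 30', n=2, random_state=0)

# Asking for 10 returns all 3 matching rows instead of raising an error
all_matches = sample_matching(df, 'Age > 30', n=10, random_state=0)

print(two_rows)
print(all_matches)
```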
Last Updated on May 17, 2023 by admin