Split Pandas Dataframe by Rows



If you are working with a large Pandas dataframe, it may be useful to split it into smaller chunks based on the rows. Doing this can make it easier to work with the data and perform specific tasks on the subsets. In this article, we will explore several methods to split a Pandas dataframe by rows.

Let’s see different approaches for splitting Dataframe to get the desired results, with help of different code examples.

Method 1: Split by Number of Rows

The first method involves splitting the dataframe into a specified number of rows. This can be useful if you want to split the dataframe into equal-sized chunks.

Code to split a dataframe into a specified number of rows:

import pandas as pd

def split_dataframe(df, chunk_size):
    return [df[i:i+chunk_size] for i in range(0, len(df), chunk_size)]

df = pd.read_csv("large_dataframe.csv")

chunk_size = 10000

chunks = split_dataframe(df, chunk_size)

In above code, we define a function called split_dataframe that takes two arguments: dataframe and desired chunk size. We then use list comprehension to create a list of dataframes, each with the specified number of rows.

Method 2: Split by Index

The second method involves splitting the dataframe based on the index. This can be useful if you want to split the dataframe based on specific rows or ranges of rows.

Code to split a dataframe based on the index:

df = pd.read_csv("large_dataframe.csv")

# Split dataframe into two parts
part1 = df.loc[:9999]
part2 = df.loc[10000:]

# Split dataframe into three parts
part1 = df.loc[:4999]
part2 = df.loc[5000:9999]
part3 = df.loc[10000:]

In above code, we use the loc method to select specific rows based on the index. We can split the dataframe into two or more parts by defining the appropriate ranges.

Method 3: Split by Value

The third method is splitting the dataframe based on the values in a specific column. This can be useful if you want to split the dataframe based on specific categories or ranges of values.

Code to split a dataframe based on a specific value:

df = pd.read_csv("large_dataframe.csv")

# Split dataframe into two parts based on category
part1 = df[df["Category"] == "A"]
part2 = df[df["Category"] == "B"]

In above code, we use boolean indexing to select rows based on a specific category value in the “Category” column. We can split the dataframe into as many parts as necessary by defining additional boolean expressions.

More code Examples

import pandas as pd
# create a sample DataFrame

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'age': [25, 30, 35, 40, 45],
'gender': ['F', 'M', 'M', 'M', 'F'],
'salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
# split the DataFrame into two equal parts

df1, df2 = df[:len(df)//2], df[len(df)//2:]
# split the DataFrame into three parts based on age

age_groups = pd.cut(df['age'], bins=[0, 30, 40, 50])
df_groups = df.groupby(age_groups)

df_young = df_groups.get_group('<30')
df_mid = df_groups.get_group('30-40')
df_old = df_groups.get_group('>40')
# split the DataFrame into four parts based on gender and salary

gender_groups = df.groupby('gender')
salary_groups = df.groupby(pd.cut(df['salary'], bins=[0, 60000, 80000, 100000]))

df_female_low = gender_groups.get_group('F').loc[salary_groups.get_group('(0, 60000]').index]
df_female_high = gender_groups.get_group('F').loc[salary_groups.get_group('(80000, 100000]').index]
df_male_low = gender_groups.get_group('M').loc[salary_groups.get_group('(0, 60000]').index]
df_male_mid = gender_groups.get_group('M').loc[salary_groups.get_group('(60000, 80000]').index]
# print the results

print("Original DataFrame:\n", df)
print("\nSplit into two parts:\n", df1, "\n", df2)
print("\nSplit into three parts based on age:\n", df_young, "\n", df_mid, "\n", df_old)
print("\nSplit into four parts based on gender and salary:\n", df_female_low, "\n", df_female_high, "\n", df_male_low, "\n", df_male_mid)

The first example splits the DataFrame into two equal parts, and the second example splits it into three parts based on age. The third example splits it into four parts based on gender and salary.

In first example, we are using slicing syntax to split the DataFrame in half.
In second example, we create bins for the age column using the cut() function, and then group the DataFrame by these bins using the groupby() function. Then use the get_group() method to retrieve each group based on its label.
In third example, we group the DataFrame by both the gender and salary columns using the groupby() function, and then use loc[] attribute to retrieve specific rows from each group based on their salary ranges.

Last Updated on May 13, 2023 by admin

Leave a Reply

Your email address will not be published. Required fields are marked *

Recommended Blogs