If you are working with a large Pandas dataframe, it may be useful to split it into smaller chunks based on the rows. Doing this can make it easier to work with the data and perform specific tasks on the subsets. In this article, we will explore several methods to split a Pandas dataframe by rows.
Let’s see different approaches for splitting Dataframe to get the desired results, with help of different code examples.
Method 1: Split by Number of Rows
The first method involves splitting the dataframe into a specified number of rows. This can be useful if you want to split the dataframe into equal-sized chunks.
Code to split a dataframe into a specified number of rows:
import pandas as pd def split_dataframe(df, chunk_size): return [df[i:i+chunk_size] for i in range(0, len(df), chunk_size)] df = pd.read_csv("large_dataframe.csv") chunk_size = 10000 chunks = split_dataframe(df, chunk_size)
In above code, we define a function called split_dataframe that takes two arguments: dataframe and desired chunk size. We then use list comprehension to create a list of dataframes, each with the specified number of rows.
Method 2: Split by Index
The second method involves splitting the dataframe based on the index. This can be useful if you want to split the dataframe based on specific rows or ranges of rows.
Code to split a dataframe based on the index:
df = pd.read_csv("large_dataframe.csv") # Split dataframe into two parts part1 = df.loc[:9999] part2 = df.loc[10000:] # Split dataframe into three parts part1 = df.loc[:4999] part2 = df.loc[5000:9999] part3 = df.loc[10000:]
In above code, we use the loc
method to select specific rows based on the index. We can split the dataframe into two or more parts by defining the appropriate ranges.
Method 3: Split by Value
The third method is splitting the dataframe based on the values in a specific column. This can be useful if you want to split the dataframe based on specific categories or ranges of values.
Code to split a dataframe based on a specific value:
df = pd.read_csv("large_dataframe.csv") # Split dataframe into two parts based on category part1 = df[df["Category"] == "A"] part2 = df[df["Category"] == "B"]
In above code, we use boolean indexing to select rows based on a specific category value in the “Category” column. We can split the dataframe into as many parts as necessary by defining additional boolean expressions.
More code Examples
import pandas as pd # create a sample DataFrame data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], 'age': [25, 30, 35, 40, 45], 'gender': ['F', 'M', 'M', 'M', 'F'], 'salary': [50000, 60000, 70000, 80000, 90000]} df = pd.DataFrame(data) # split the DataFrame into two equal parts df1, df2 = df[:len(df)//2], df[len(df)//2:] # split the DataFrame into three parts based on age age_groups = pd.cut(df['age'], bins=[0, 30, 40, 50]) df_groups = df.groupby(age_groups) df_young = df_groups.get_group('<30') df_mid = df_groups.get_group('30-40') df_old = df_groups.get_group('>40') # split the DataFrame into four parts based on gender and salary gender_groups = df.groupby('gender') salary_groups = df.groupby(pd.cut(df['salary'], bins=[0, 60000, 80000, 100000])) df_female_low = gender_groups.get_group('F').loc[salary_groups.get_group('(0, 60000]').index] df_female_high = gender_groups.get_group('F').loc[salary_groups.get_group('(80000, 100000]').index] df_male_low = gender_groups.get_group('M').loc[salary_groups.get_group('(0, 60000]').index] df_male_mid = gender_groups.get_group('M').loc[salary_groups.get_group('(60000, 80000]').index] # print the results print("Original DataFrame:\n", df) print("\nSplit into two parts:\n", df1, "\n", df2) print("\nSplit into three parts based on age:\n", df_young, "\n", df_mid, "\n", df_old) print("\nSplit into four parts based on gender and salary:\n", df_female_low, "\n", df_female_high, "\n", df_male_low, "\n", df_male_mid)
The first example splits the DataFrame into two equal parts, and the second example splits it into three parts based on age. The third example splits it into four parts based on gender and salary.
In first example, we are using slicing syntax to split the DataFrame in half.
In second example, we create bins for the age column using the cut() function, and then group the DataFrame by these bins using the groupby() function. Then use the get_group() method to retrieve each group based on its label.
In third example, we group the DataFrame by both the gender and salary columns using the groupby() function, and then use loc[] attribute to retrieve specific rows from each group based on their salary ranges.
Last Updated on May 13, 2023 by admin