Read a zipped file as a Pandas DataFrame




In this article, we will see how to read data from a zip file into a pandas DataFrame.

Why do we need a zip file?

People zip related groups of files together to keep them in one place and to make them compact, so they are easier and faster to share via the web. Zip files are ideal for archiving since they save storage space. They can also help secure data, since archives can be encrypted.

Requirement:

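zipfile module: This module is used to perform various operations on a zip file from a simple Python program. It is part of the Python standard library, and the examples below import it directly, so no extra installation is needed for it. The zipfile36 package on PyPI is a backport of this module for older Python 3 versions; if you need it, it can be installed using the below command: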

pip install zipfile36

Example 1: Reading a CSV file from a zipped folder using ZipFile.extract() method

In this example, we will extract a CSV file named “data.csv” from a zipped folder named “data.zip” and create a pandas DataFrame from it.

import zipfile
import pandas as pd

# open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # extract the CSV file into the current working directory
    zip_ref.extract('data.csv')

# create a pandas DataFrame from the extracted CSV file
df = pd.read_csv('data.csv')

# print the first 5 rows of the DataFrame
print(df.head())
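Extracting into the current working directory leaves a copy of the CSV behind. A minimal sketch of one alternative is shown below, assuming a temporary folder is acceptable for the extracted file. The sample archive, its contents, and the folder name "extracted" are made up for illustration so the snippet runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)

    # Build a small sample archive so the sketch is self-contained
    # (file names and contents are hypothetical).
    zip_path = tmp_path / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

    # Extract into a dedicated folder instead of the working directory.
    extract_dir = tmp_path / "extracted"
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # extract() returns the path of the file it wrote
        extracted_path = zip_ref.extract("data.csv", path=extract_dir)

    # Load the DataFrame before the temporary folder is cleaned up
    df = pd.read_csv(extracted_path)

print(df.head())
```

The DataFrame lives in memory, so it stays usable after the temporary folder and the extracted file are deleted.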

Example 2: Reading a CSV file from a zipped folder using ZipFile.read() method

In this example, we will read a CSV file named “data.csv” from a zipped folder named “data.zip” using the ZipFile.read() method and create a pandas DataFrame from it.

import zipfile
import pandas as pd
import io

# open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # read the CSV file as bytes
    data = zip_ref.read('data.csv')

# create a pandas DataFrame from the CSV data
df = pd.read_csv(io.BytesIO(data))

# print the first 5 rows of the DataFrame
print(df.head())
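When the archive holds exactly one CSV file, pandas can usually open the zip directly, without the zipfile module. The sketch below assumes such a single-file archive; the sample data is invented for illustration, and pandas raises an error if the archive contains more than one file.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a sample archive containing a single CSV (hypothetical data).
    zip_path = Path(tmp_dir) / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

    # compression="zip" tells pandas to decompress the archive itself;
    # the default compression="infer" would also detect the .zip suffix.
    df = pd.read_csv(zip_path, compression="zip")

print(df.head())
```

For archives with several members, the zipfile-based approaches in this article are still needed to pick out a specific file.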

Example 3: Reading multiple CSV files from a zipped folder using ZipFile.namelist() method

In this example, we will read multiple CSV files from a zipped folder named “data.zip” using the ZipFile.namelist() method and create a list of pandas DataFrames from them.

import io
import zipfile
import pandas as pd

# open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # get a list of CSV files in the zip folder
    csv_files = [f for f in zip_ref.namelist() if f.endswith('.csv')]

    # create a list of pandas DataFrames from the CSV files
    dfs = []
    for csv_file in csv_files:
        # read the CSV file and create a DataFrame
        csv_data = zip_ref.read(csv_file)
        df = pd.read_csv(io.BytesIO(csv_data))
        dfs.append(df)

# print the first 5 rows of the first DataFrame
print(dfs[0].head())
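A list of DataFrames loses track of which file each one came from. One possible variation, sketched below, keeps the frames in a dictionary keyed by file name and then stacks them with pd.concat. The member names (jan.csv, feb.csv, notes.txt), their contents, and the index level names are assumptions made up for this example, and the archive is built in memory so the snippet runs on its own.

```python
import io
import zipfile

import pandas as pd

# Build a sample archive in memory (hypothetical files and data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("jan.csv", "id,amount\n1,10\n2,20\n")
    zf.writestr("feb.csv", "id,amount\n3,30\n")
    zf.writestr("notes.txt", "not a CSV file")

frames = {}
with zipfile.ZipFile(buffer, "r") as zip_ref:
    for name in zip_ref.namelist():
        # Skip anything that is not a CSV file
        if name.lower().endswith(".csv"):
            with zip_ref.open(name) as f:
                frames[name] = pd.read_csv(f)

# The dictionary keys become the outer level of the resulting index,
# so every row stays labelled with its source file.
combined = pd.concat(frames, names=["source_file", "row"])
print(combined)
```

Stacking like this only makes sense when the CSV files share the same columns; otherwise keeping the dictionary of separate DataFrames is the safer choice.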

Method 4: Using ZipFile.extract() method to extract the CSV file and then read it using pandas.

In this method, the CSV file is first extracted from the zip file using the ZipFile.extract() method, and then the extracted file is read using the pandas.read_csv() method.

import pandas as pd
import zipfile

# Open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # Extract the csv file
    zip_ref.extract('data.csv')

# Read the csv file using pandas
df = pd.read_csv('data.csv')

# Display the dataframe
print(df.head())

Method 5: Using ZipFile.read() method to read the CSV file directly as bytes and then convert it to a pandas dataframe.

In this method, the CSV file is read directly from the zip file using the ZipFile.read() method, which returns the file as bytes. These bytes are then passed to the pandas.read_csv() method using the io.BytesIO() function, which converts the bytes to a file-like object that can be read by pandas.

import pandas as pd
import zipfile
import io

# Open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # Read the csv file as bytes
    csv_bytes = zip_ref.read('data.csv')

# Convert bytes to file-like object
csv_file = io.BytesIO(csv_bytes)

# Read the csv file using pandas
df = pd.read_csv(csv_file)

# Display the dataframe
print(df.head())
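Because ZipFile.read() returns raw bytes, pandas has to decode them, and it assumes UTF-8 unless told otherwise. The sketch below assumes a CSV that was saved as Latin-1; the sample names and the choice of encoding are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a sample archive whose CSV is encoded as Latin-1 (hypothetical data).
raw_csv = "name,city\nJosé,Málaga\n".encode("latin-1")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data.csv", raw_csv)

with zipfile.ZipFile(buffer, "r") as zip_ref:
    # read() hands back the member's content as a bytes object
    csv_bytes = zip_ref.read("data.csv")

# Pass the matching encoding so the accented characters decode correctly
df = pd.read_csv(io.BytesIO(csv_bytes), encoding="latin-1")
print(df.head())
```

If the encoding of the file is unknown, it has to be determined separately; pandas does not detect it automatically.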

Method 6: Using ZipFile.namelist() method to get the names of all files in the zip archive and then selecting the desired file to read using pandas.

In this method, the ZipFile.namelist() method is used to get a list of all the files in the zip archive. We can then select the desired file to read using pandas.read_csv() method.

import pandas as pd
import zipfile

# Open the zip file
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # Get the list of all files in the zip archive
    file_list = zip_ref.namelist()

    # Select the desired file to read
    file_name = 'data.csv'

    # Read the csv file using pandas while the archive is still open
    df = pd.read_csv(zip_ref.open(file_name))

# Display the dataframe
print(df.head())
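The list returned by namelist() can also be used to check that the desired file exists before reading it. The following is a minimal sketch of that idea; read_csv_from_zip is a hypothetical helper name, not part of pandas or zipfile, and the sample archive is built in memory with invented data.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(archive, member):
    """Read one CSV member of a zip archive into a DataFrame.

    Hypothetical helper for illustration: `archive` is a path or a
    file-like object, `member` is the file name inside the archive.
    """
    with zipfile.ZipFile(archive, "r") as zip_ref:
        available = zip_ref.namelist()
        if member not in available:
            raise FileNotFoundError(
                f"{member!r} not found in archive; available members: {available}"
            )
        # Open the member as a file-like object and let pandas parse it
        with zip_ref.open(member) as f:
            return pd.read_csv(f)


# Build a sample archive in memory (hypothetical data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

df = read_csv_from_zip(buffer, "data.csv")
print(df.head())
```

Checking the name first gives an error message that lists what the archive actually contains, which tends to be easier to act on than the KeyError that zipfile raises for a missing member.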

Last Updated on April 16, 2023 by admin
