Convert CSV to Pandas Dataframe


In this article, we will explore ways to convert CSV files into a Pandas Dataframe using read_csv. We will also discuss some best practices, data type handling, and more advanced options for irregular or very large files.

Reading a CSV File with read_csv

We can read a CSV file into a Pandas Dataframe using the read_csv function. The example below assumes a file named pythonpandas.csv in the current working directory:

# Import Pandas library
import pandas as pd

# Read the CSV file into a Pandas Dataframe
df = pd.read_csv("pythonpandas.csv")

# Display the Dataframe
print(df)
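
As a self-contained variant of the example above, the sketch below reads CSV text from an in-memory buffer instead of a file on disk, so it runs without creating pythonpandas.csv. The sample rows and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk
csv_text = "name,age,city\nAlice,34,Paris\nBob,23,Oslo\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type inferred for each column
```

Checking the shape and the inferred dtypes right after loading is a quick way to confirm the file was parsed as expected.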

Now, let's see more examples of converting CSV to a Pandas Dataframe.

Handling CSV with Missing Headers

If your CSV file does not have a header row, you can supply the column names with the names parameter while reading the file. Without it, read_csv treats the first data row as the header.

# Assuming the CSV file has no header row and four columns,
# which we name 'Column1' through 'Column4'
df = pd.read_csv("data.csv", names=['Column1', 'Column2', 'Column3', 'Column4'])

# Display the Dataframe
print(df)
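
To see the effect of names, here is a minimal sketch on in-memory data. The column names 'id', 'name', and 'age' are assumptions chosen for the example, and header=None is passed explicitly to make the intent clear.

```python
import io
import pandas as pd

# Hypothetical headerless data: every line is a data row
csv_text = "1,Alice,34\n2,Bob,23\n"

# header=None says there is no header row; names supplies the labels
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['id', 'name', 'age'])

print(df)
```

Both rows are kept as data, whereas a plain read_csv call would have consumed the first row as column names.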

Converting CSV with Custom Index

By default, Pandas generates a numeric index for the rows in the Dataframe. However, you can specify a custom column as the index using the index_col parameter.

# Assuming the 'ID' column should be used as the index
df = pd.read_csv("data.csv", index_col='ID')

# Display the Dataframe
print(df)
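
A short sketch of what the custom index gives you, using made-up rows: once 'ID' is the index, rows can be looked up by their ID with .loc.

```python
import io
import pandas as pd

# Hypothetical sample data with an 'ID' column
csv_text = "ID,name,amount\nA1,Alice,10.5\nB2,Bob,7.0\n"

# Use the 'ID' column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# Look up a single row by its ID
print(df.loc['B2'])
```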

Converting CSV with Text Qualifiers

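If your CSV file uses text qualifiers to enclose fields, you can tell read_csv which character to expect with the quotechar parameter. The double quote is already the default, so quotechar only needs to be set when the file uses a different character, such as a single quote.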

# Assuming the CSV file uses double quotes (") as text qualifiers
df = pd.read_csv("data.csv", quotechar='"')

# Display the Dataframe
print(df)
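
The following sketch shows a case where quotechar actually changes the result: fields enclosed in single quotes that contain commas. The data is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical data: fields are wrapped in single quotes and contain commas
csv_text = "name,city\n'Smith, John','Paris'\n'Doe, Jane','Oslo'\n"

# Tell the parser that the single quote is the text qualifier
df = pd.read_csv(io.StringIO(csv_text), quotechar="'")

print(df)
```

With the qualifier set, the comma inside each name is kept as part of the field instead of being treated as a separator.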

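Skipping Footer Rows

In some cases, a CSV file may contain footer rows (for example, totals or export notes) that you want to skip while reading the data. You can do this using the skipfooter parameter, which is only supported by the Python parsing engine.

# Skipping the last two rows in the CSV file
# (skipfooter requires engine='python')
df = pd.read_csv("data.csv", skipfooter=2, engine='python')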

# Display the Dataframe
print(df)
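
A minimal sketch of footer skipping on in-memory data. The two trailing summary rows are made up, and engine='python' is passed because the default C engine does not support skipfooter.

```python
import io
import pandas as pd

# Hypothetical export: two data rows followed by two summary rows
csv_text = "item,amount\npen,2\nbook,10\nTotal,12\nRows,2\n"

# Drop the last two lines before parsing
df = pd.read_csv(io.StringIO(csv_text), skipfooter=2, engine='python')

print(df)
```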

Specifying Data Types for Columns

To ensure specific data types for the columns, you can provide a dictionary to the dtype parameter.

# Assuming 'ID' should be treated as a string and 'Amount' as a floating-point number
df = pd.read_csv("data.csv", dtype={'ID': str, 'Amount': float})

# Display the Dataframe
print(df)
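
One common reason to set dtype is an identifier column with leading zeros. The sketch below, on invented data, compares the inferred types with the explicitly requested ones.

```python
import io
import pandas as pd

# Hypothetical data: IDs with leading zeros
csv_text = "ID,Amount\n007,100\n042,250\n"

# Default inference turns the IDs into integers and drops the zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes keep 'ID' as text and store 'Amount' as float
df = pd.read_csv(io.StringIO(csv_text), dtype={'ID': str, 'Amount': float})

print(inferred['ID'].tolist())
print(df['ID'].tolist())
```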

Handling CSV with Multiple Delimiters

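For CSV files with multiple delimiters or irregular formatting, you can pass a regular expression to the sep parameter to define the delimiter pattern. Regular-expression separators are only supported by the Python parsing engine, so engine='python' is set as well.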

# Assuming the CSV file uses semicolons (;) or tabs as delimiters
df = pd.read_csv("data.csv", sep=';|\t', engine='python')

# Display the Dataframe
print(df)
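
Here is a small runnable sketch of the same idea, assuming a file that mixes semicolons and tabs within the same row. The sample text is made up, and the pattern is written as a raw string so the backslash reaches the regex engine unchanged.

```python
import io
import pandas as pd

# Hypothetical data mixing ';' and tab as delimiters
csv_text = "a;b\tc\n1;2\t3\n4\t5;6\n"

# A regex separator needs the Python engine
df = pd.read_csv(io.StringIO(csv_text), sep=r';|\t', engine='python')

print(df)
```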

Converting CSV with Non-English Characters

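If your CSV file contains non-English characters (e.g., accented characters), the file must be decoded with the encoding it was saved in. read_csv assumes UTF-8 by default; for a file saved in another encoding, such as Latin-1, pass its name to the encoding parameter.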

# Assuming the CSV file uses UTF-8 encoding
df = pd.read_csv("data.csv", encoding='utf-8')

# Display the Dataframe
print(df)
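
The sketch below simulates a file saved in Latin-1 by encoding a string to bytes and reading it back through a BytesIO buffer. The names in the data are invented; the point is that the encoding passed to read_csv must match the one used to write the file.

```python
import io
import pandas as pd

# Hypothetical file content, stored as Latin-1 bytes rather than UTF-8
raw_bytes = "name,city\nJosé,Málaga\nZoë,Düsseldorf\n".encode("latin-1")

# Decode with the same encoding that was used to write the data
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(df)
```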

Handling Date and Time Formats

For CSV files with date and time columns in non-standard formats, you can pass the format string to the date_format parameter together with parse_dates. The older date_parser parameter, which took a custom parsing function, has been deprecated since pandas 2.0.

# Assuming the 'Date' column contains
# dates in the format 'dd/mm/yyyy' (requires pandas 2.0 or later)
df = pd.read_csv("data.csv", parse_dates=['Date'],
                 date_format='%d/%m/%Y')

# Display the Dataframe
print(df)
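
An alternative that does not depend on the pandas version is to read the column as text and convert it afterwards with pd.to_datetime. This is a sketch of that different approach, not the one shown above, and the sample dates are made up.

```python
import io
import pandas as pd

# Hypothetical data with day-first dates (dd/mm/yyyy)
csv_text = "Date,Amount\n03/04/2023,10\n25/12/2023,20\n"

# Read the dates as plain text first
df = pd.read_csv(io.StringIO(csv_text))

# Convert with an explicit format so 03/04 is read as 3 April, not 4 March
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

print(df.dtypes)
```

Stating the format explicitly avoids the day/month ambiguity that automatic inference can run into.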

Converting CSV with Thousand Separators

If your CSV file uses thousand separators (e.g., commas) in numeric fields, you can remove them using the thousands parameter.

# Assuming the 'Amount' column uses commas as thousand separators
df = pd.read_csv("data.csv", thousands=',')

# Display the Dataframe
print(df)
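
A minimal sketch with invented amounts. Because the comma is also the field separator here, the numbers have to be quoted in the file; thousands=',' then strips the commas so the column is parsed as numeric.

```python
import io
import pandas as pd

# Hypothetical data: quoted amounts that use ',' as a thousands separator
csv_text = 'item,Amount\nlaptop,"1,250"\nphone,"12,000"\n'

# Remove the thousands separators while parsing
df = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(df['Amount'].sum())
```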

Handling CSV with Multi-level Headers

If your CSV file has multi-level headers (also known as hierarchical headers), you can use the header parameter along with index_col to read and create a multi-level column index in the Dataframe.

# Assuming the CSV file has two header rows
# Example CSV (the first cell of each header row is empty
# because the first column holds the row labels):
# ,Score,Score
# ,Math,Physics
# John,82,91
# Alice,75,88

# Read the CSV file with multi-level headers
df = pd.read_csv("data.csv", header=[0, 1], index_col=0)

# Display the Dataframe
print(df)
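
To illustrate the resulting structure, the sketch below reads a small in-memory table with two header rows. The region and quarter labels are made up, and the first cell of each header row is left empty because the first column holds the row labels.

```python
import io
import pandas as pd

# Hypothetical data: top-level header (Sales/Costs), second level (Q1/Q2)
csv_text = (
    ",Sales,Sales,Costs\n"
    ",Q1,Q2,Q1\n"
    "North,100,120,80\n"
    "South,90,95,70\n"
)

# Rows 0 and 1 form the column MultiIndex; column 0 becomes the row index
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Select a single value with a (level 0, level 1) column tuple
print(df.loc['North', ('Sales', 'Q2')])

# Select every column under one top-level header
print(df['Sales'])
```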

Handling CSV with Non-Standard Missing Values

read_csv already treats common markers such as empty fields, 'NA', 'N/A', 'NaN', and 'NULL' as missing. If your CSV file uses other values to represent missing data, you can specify them using the na_values parameter.

# Assuming the CSV file uses 'missing' and
# '--' to represent missing values
df = pd.read_csv("data.csv", na_values=['missing', '--'])

# Display the Dataframe
print(df)
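
The following sketch shows the effect on a numeric column. The markers 'missing' and '--' are assumptions chosen for the example; without na_values they would force the whole column to be read as text.

```python
import io
import pandas as pd

# Hypothetical data using custom markers for missing scores
csv_text = "name,score\nAlice,90\nBob,missing\nCara,--\n"

# Treat the custom markers as NaN so 'score' stays numeric
df = pd.read_csv(io.StringIO(csv_text), na_values=['missing', '--'])

print(df['score'].isna().sum())
print(df['score'].dtype)
```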

Converting CSV with Custom Date Format and Timezone

For CSV files containing date and time columns in a custom format with a UTC offset, you can read the column as text and convert it with pd.to_datetime, using the %z directive to parse the offset into timezone-aware values.

# Assuming the 'Timestamp' column contains values such as
# '2023-08-06 14:30:00 +05:30' (date, time, and a UTC offset)
df = pd.read_csv("data.csv")

# %z parses offsets such as +05:30; utc=True converts every value to UTC
df['Timestamp'] = pd.to_datetime(df['Timestamp'],
                                 format='%Y-%m-%d %H:%M:%S %z', utc=True)

# Display the Dataframe
print(df)
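
To check how offsets are handled, the sketch below parses two made-up timestamps with different UTC offsets using pd.to_datetime and the %z directive, normalising both to UTC. The column names and values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical data: timestamps with different UTC offsets
csv_text = (
    "Timestamp,value\n"
    "2023-08-06 14:30:00 +05:30,1\n"
    "2023-08-06 09:00:00 -04:00,2\n"
)

# Read the timestamps as text first
df = pd.read_csv(io.StringIO(csv_text))

# %z parses the offset; utc=True converts every row to a common timezone
df['Timestamp'] = pd.to_datetime(df['Timestamp'],
                                 format='%Y-%m-%d %H:%M:%S %z', utc=True)

print(df['Timestamp'])
```

Converting to UTC gives the column a single timezone-aware dtype even when the rows carry different offsets.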

Converting CSV with Complex Separators and Text Qualifiers

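If your CSV file uses complex separators (e.g., a mix of commas and semicolons) along with text qualifiers, you can combine a regular-expression separator with the quotechar parameter. Be aware that regular-expression separators are prone to ignoring quoted data, so this works reliably only when the quoted fields do not themselves contain a separator character.

# Assuming the CSV file uses commas and
# semicolons as separators and double quotes
# as text qualifiers
df = pd.read_csv("data.csv", sep='[;,]', quotechar='"',
                 engine='python')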

# Display the Dataframe
print(df)
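
When quoted fields do contain separator characters, one possible workaround is to split the lines yourself and build the Dataframe from the result. The sketch below is an assumption-laden illustration, not a pandas feature: it uses a regular expression that splits on a comma or semicolon only when an even number of double quotes follows, it does not handle escaped quotes inside fields, and every value stays a string.

```python
import re
import pandas as pd

# Hypothetical data: mixed ';' and ',' separators, with separators inside quotes
raw = 'id;name,city\n1,"Smith, John";Paris\n2;"Doe; Jane",Oslo\n'

# Split on ';' or ',' only when it is outside a pair of double quotes
splitter = re.compile(r'[;,](?=(?:[^"]*"[^"]*")*[^"]*$)')

# Split each non-empty line and strip the surrounding quotes from each field
rows = [[field.strip('"') for field in splitter.split(line)]
        for line in raw.splitlines() if line]

# First row is the header, the rest are data rows
df = pd.DataFrame(rows[1:], columns=rows[0])

print(df)
```

Numeric columns can be converted afterwards, for example with pd.to_numeric.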

Handling CSV with Categorical Data

If your CSV file contains categorical data, you can explicitly specify the data type as a categorical variable using the dtype parameter to optimize memory usage.

# Assuming the 'Category' column contains categorical data
df = pd.read_csv("data.csv", dtype={'Category': 'category'})

# Display the Dataframe
print(df)
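
To get a feel for the memory effect, the sketch below loads the same made-up data twice, once with default inference and once with the 'category' dtype, and compares the memory used by the column. The exact byte counts depend on the pandas version, so none are shown.

```python
import io
import pandas as pd

# Hypothetical data: 3,000 rows but only three distinct categories
csv_text = "Category,Amount\n" + "electronics,10\nclothing,20\ngrocery,30\n" * 1000

# Default inference stores every value as a separate string
as_text = pd.read_csv(io.StringIO(csv_text))

# 'category' stores each distinct value once plus small integer codes
as_cat = pd.read_csv(io.StringIO(csv_text), dtype={'Category': 'category'})

print(as_text['Category'].memory_usage(deep=True))
print(as_cat['Category'].memory_usage(deep=True))
```

The saving is largest when a column has many rows and few distinct values.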

Converting CSV with Large Datasets and Memory Optimization

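For extremely large CSV files that may exceed available memory, you can use dask.dataframe, which splits the file into partitions and evaluates operations lazily and in parallel, so the whole file never has to be loaded at once.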

# Assuming you have Dask installed
import dask.dataframe as dd

# Reading a large CSV file using Dask
df = dd.read_csv("data.csv")

# Perform computations on the Dask Dataframe
result_df = df.groupby('Category').sum().compute()

# Display the result Dataframe
print(result_df)
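
If installing Dask is not an option, a similar result can be approximated with pandas alone by reading the file in chunks. This is a sketch of that alternative technique, not of Dask itself: the chunksize parameter makes read_csv return an iterator of Dataframes, and the per-chunk sums are combined manually. The data and the tiny chunk size are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data standing in for a file too large to load at once
csv_text = "Category,Amount\nA,1\nB,2\nA,3\nB,4\nA,5\n"

totals = {}

# Process two rows at a time; a real file would use a much larger chunk size
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial = chunk.groupby('Category')['Amount'].sum()
    # Fold this chunk's sums into the running totals
    for category, amount in partial.items():
        totals[category] = totals.get(category, 0) + amount

result = pd.Series(totals, name='Amount').sort_index()
print(result)
```

Only one chunk is held in memory at a time, which is why this pattern suits aggregations that can be combined across chunks, such as sums and counts.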


Last Updated on August 6, 2023 by admin
