Pandas Series.str.extract()
In Python's Pandas library, the Series.str.extract() function extracts specific parts of text data using a regular expression pattern. For each string in the Series, it takes the first match of the pattern and returns its capture groups as columns of a DataFrame.
The function takes three parameters: pat (a regex pattern with capturing groups), flags (an integer combining flags from the re module, such as re.IGNORECASE, to control regex behavior), and expand (a boolean indicating whether to return a DataFrame with one column per capture group).
Syntax: Series.str.extract(pat, flags=0, expand=True)
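Parameters:
- pat: str. Regular expression pattern with capturing groups.
- flags: int, default 0 (no flags). Flags from the re module, such as re.IGNORECASE.
- expand: bool, default True. If True, return a DataFrame with one column per capture group. If False, return a Series or Index when there is one capture group, or a DataFrame when there are several.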
Returns: DataFrame or Series or Index.
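The effect of flags and expand is easiest to see side by side. The following is a minimal sketch; the sample Series and the group names letter and digit are made up for illustration and are not part of the examples in this article.

```python
import re
import pandas as pd

# Hypothetical sample data for illustration only
s = pd.Series(["a1", "B2", "c3"])

# expand=True (the default) returns a DataFrame, one column per capture group.
# re.IGNORECASE lets [ab] match the uppercase "B" as well.
as_frame = s.str.extract(r"([ab])(\d)", flags=re.IGNORECASE)

# expand=False with a single capture group returns a Series instead.
as_series = s.str.extract(r"([ab])", flags=re.IGNORECASE, expand=False)

# Named groups (?P<name>...) become the column names of the result.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)", flags=re.IGNORECASE)

print(as_frame)
print(as_series)
print(named)
```

The string "c3" does not match the pattern, so its row holds NaN in every result.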
Example 1: Use the Series.str.extract() function to extract the first vowel, together with the character that follows it, from each string in the given Series.
import pandas as pd
# Create the Series
sr = pd.Series(['New_York', 'Lisbon', 'Tokyo', 'Paris', 'Munich'])
# Create the index
idx = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5']
# Set the index
sr.index = idx
# Print the Series
print(sr)
# Extract the first lowercase vowel followed by any character
result = sr.str.extract(pat='([aeiou].)')
# Print the result, which is a DataFrame
print(result)
Output:
City 1    New_York
City 2      Lisbon
City 3       Tokyo
City 4       Paris
City 5      Munich
dtype: object
         0
City 1  ew
City 2  is
City 3  ok
City 4  ar
City 5  un
As we can see in the output, the Series.str.extract() function has returned a DataFrame containing a column of the extracted group.
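Note that extract() keeps only the first match in each string. If every match is needed, pandas also provides Series.str.extractall(). The sketch below compares the two on a shortened version of the city data; the variable names are illustrative.

```python
import pandas as pd

sr = pd.Series(["New_York", "Lisbon"], index=["City 1", "City 2"])

# extract() returns only the first match per string.
first_only = sr.str.extract(r"([aeiou].)")

# extractall() returns every match, one row per match,
# indexed by the original label plus a "match" counter.
all_matches = sr.str.extractall(r"([aeiou].)")

print(first_only)
print(all_matches)
```

Here 'New_York' yields 'ew' from extract(), while extractall() returns both 'ew' and 'or'.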
Example 2: Use the Series.str.extract() function to extract a capital letter followed by 'i' and one more character from each string in the given Series.
import pandas as pd
# Create the Series
sr = pd.Series(['Mike', 'Alessa', 'Nick', 'Kim', 'Britney'])
# Create the index
idx = ['Name 1', 'Name 2', 'Name 3', 'Name 4', 'Name 5']
# Set the index
sr.index = idx
# Print the Series
print(sr)
# Extract any capital letter followed by 'i' and any other character
result = sr.str.extract(pat='([A-Z]i.)')
# Print the result, which is a DataFrame
print(result)
Output:
Name 1       Mike
Name 2     Alessa
Name 3       Nick
Name 4        Kim
Name 5    Britney
dtype: object
          0
Name 1  Mik
Name 2  NaN
Name 3  Nic
Name 4  Kim
Name 5  NaN
As we can see in the output, the Series.str.extract() function has returned a DataFrame containing a column of the extracted group. 'Alessa' and 'Britney' contain no capital letter immediately followed by 'i', so their rows are NaN.
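When a string has no match for the pattern, extract() returns NaN for that row. A minimal sketch of two common ways to handle this, using the same names as Example 2 (the variable names matched and has_match are illustrative):

```python
import pandas as pd

sr = pd.Series(["Mike", "Alessa", "Nick", "Kim", "Britney"])

# expand=False returns a Series because there is a single capture group.
result = sr.str.extract(r"([A-Z]i.)", expand=False)

# Keep only the rows where the pattern matched.
matched = result.dropna()

# Or build a boolean mask marking which strings matched.
has_match = result.notna()

print(matched)
print(has_match)
```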
Extracting Phone Numbers from a Pandas Series
The Series.str.extract() function can be used to extract phone numbers from a Pandas Series containing strings, which is useful for data cleaning and analysis.
import pandas as pd
# create a Pandas Series containing phone numbers
phone_numbers = pd.Series(['(123) 456-7890', '555-555-5555', '1234567890'])
# extract the area code, exchange code, and last four digits
extracted_numbers = phone_numbers.str.extract(r'(\d{3}).*(\d{3}).*(\d{4})', expand=True)
# print the extracted phone numbers
print(extracted_numbers)
Output:
     0    1     2
0  123  456  7890
1  555  555  5555
2  123  456  7890
In this example, we create a Pandas Series containing phone numbers in different formats. We then use the Series.str.extract() function with a regular expression pattern to extract the area code, exchange code, and the remaining four digits of each phone number. The resulting DataFrame contains these parts in separate columns.
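The extracted parts can be joined back into a single normalized format. The following sketch is one possible approach, not the only one: it swaps the .* separators for the stricter \D* (non-digits only) and uses named groups. The group names area, exchange, and line are assumptions chosen for readability.

```python
import pandas as pd

phone_numbers = pd.Series(["(123) 456-7890", "555-555-5555", "1234567890"])

# \D* allows only non-digit separators between the digit groups.
# The group names are illustrative; they become the column names.
pattern = r"(?P<area>\d{3})\D*(?P<exchange>\d{3})\D*(?P<line>\d{4})"
parts = phone_numbers.str.extract(pattern)

# Join the columns into one consistent format.
normalized = parts["area"] + "-" + parts["exchange"] + "-" + parts["line"]

print(normalized)
```

All three inputs end up in the form 123-456-7890, regardless of how they were originally written.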
Extracting Email Addresses from a Pandas Series
The Series.str.extract() function can also be used to extract email addresses from a Pandas Series containing strings.
import pandas as pd
# create a Pandas Series containing email addresses
email_addresses = pd.Series(['johndoe@example.com', 'jane_doe@example.co.uk', 'info@example.org'])
# extract the username and domain name from the email addresses
extracted_emails = email_addresses.str.extract(r'(\w+)@(.+)', expand=True)
# print the extracted usernames and domains
print(extracted_emails)
Output:
          0              1
0   johndoe    example.com
1  jane_doe  example.co.uk
2      info    example.org
In this example, we create a Pandas Series containing email addresses with different usernames and domains. We then use the Series.str.extract() function with a regular expression pattern to extract the username and domain name from each email address. The resulting DataFrame contains the username and the domain in separate columns.
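One caveat of the (\w+) group is that \w does not match dots, so a username such as john.doe is cut off at the dot. The sketch below contrasts the original pattern with a broader character class. The sample addresses and the group names user and domain are illustrative, and the broader pattern is still a simplification, not a full email validator.

```python
import pandas as pd

emails = pd.Series(["john.doe@example.com", "jane_doe@example.co.uk"])

# \w+ stops at the dot, so only the part after it is captured.
narrow = emails.str.extract(r"(\w+)@(.+)")

# A broader class also accepts dots, plus signs, and hyphens in the username.
broad = emails.str.extract(r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+)")

print(narrow)
print(broad)
```

With the original pattern the first username comes back as 'doe'; the broader pattern returns 'john.doe'.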
Extracting Dates from a Pandas Series
The Series.str.extract() function can also be used to extract dates from a Pandas Series containing strings.
import pandas as pd
# create a Pandas Series containing dates
dates = pd.Series(['January 1, 2022', '02/14/2023', 'Mar 31 2024'])
# extract the month, day, and year from the dates
extracted_dates = dates.str.extract(r'(\w+)\W+(\d+),?\W+(\d{4})', expand=True)
# print the extracted dates
print(extracted_dates)
Output:
         0   1     2
0  January   1  2022
1       02  14  2023
2      Mar  31  2024
In this example, we create a Pandas Series containing dates in different formats. We then use the Series.str.extract() function with a regular expression pattern to extract the month, day, and year from each date. The resulting DataFrame contains these parts in separate columns. The month is returned exactly as it is written in the source string ('January', '02', 'Mar'); extract() does not convert it to a full month name.
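Because extract() returns the month token as written, a further step is needed to get uniform dates. The sketch below is one possible way to do it, assuming English month names: it maps each token to a month number with a hand-built lookup table and then assembles datetime values. The names MONTH_NAMES, month_lookup, and the group names are illustrative.

```python
import pandas as pd

dates = pd.Series(["January 1, 2022", "02/14/2023", "Mar 31 2024"])
parts = dates.str.extract(r"(?P<month>\w+)\W+(?P<day>\d+),?\W+(?P<year>\d{4})")

# Lookup table: full name, three-letter abbreviation, and zero-padded number.
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
month_lookup = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    month_lookup[name] = number
    month_lookup[name[:3]] = number
    month_lookup[f"{number:02d}"] = number

# Convert every month token to its number.
month_numbers = parts["month"].str.lower().map(month_lookup).astype(int)

# pd.to_datetime can assemble dates from year, month, and day columns.
components = pd.DataFrame({
    "year": parts["year"].astype(int),
    "month": month_numbers,
    "day": parts["day"].astype(int),
})
parsed = pd.to_datetime(components)

print(parsed)
```

Tokens that are missing from the lookup table would map to NaN and make the astype(int) call fail, so real data may need extra entries or explicit handling of unmatched rows.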
Last Updated on April 16, 2023 by admin