Learning the Pandas Library: Series

Explore the Pandas Series data structure.

I purchased the book Learning the Pandas Library via a Humble Bundle.

In [19]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Dive deeper into Python learning with our latest bundle filled with ebooks, software, and videos!<a href="https://t.co/ACULekLqBx">https://t.co/ACULekLqBx</a></p>&mdash; Humble Bundle (@humble) <a href="https://twitter.com/humble/status/1171483506167255046?ref_src=twsrc%5Etfw">September 10, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

The book breaks the concepts down into digestible chunks. It is one of the best books on Pandas I have found so far.

In [2]:
import operator as op
from pprint import pprint
In [3]:
import faker
import pandas as pd
from IPython.display import display
In [4]:
fake = faker.Faker()

Generate sample data for a Series.

In [5]:
list(filter(lambda x: 'year' in x, dir(fake)))
Out[5]:
['date_this_year', 'date_time_this_year', 'year']
In [8]:
import random

YEAR_COUNT = 5

unique_years = set()
while len(unique_years) < YEAR_COUNT:
    unique_years.add(fake.year())  # a set silently ignores repeated years
UNIQUE_YEARS = tuple(unique_years)
In [12]:
DATA_LENGTH = 20
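# Pair each generated name with one of the few unique years,
# so the years are guaranteed to repeat.
years, names = zip(*((random.choice(UNIQUE_YEARS), fake.name())
                   for _ in range(DATA_LENGTH)))
assert len(set(years)) < len(names), '"years" should contain duplicates.'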

Create

In [17]:
birth_years = pd.Series(names, index=years, name="birth_years").sort_index()
In [18]:
display(birth_years)
1970     Rebecca Richardson
1970            Keith Smith
1970         Dawn Gutierrez
1974    Dr. Louis Hernandez
1976           Brandi Glenn
1981           Paul Meadows
1982          Hannah Turner
1984    Natalie Fitzpatrick
1987        Patty Schneider
1996           Brett Jacobs
2004       Matthew Fletcher
2004           Karen Holden
2006            Miguel Lynn
2006           Mary Harrell
2006            Roy Johnson
2007           Tami Higgins
2007            Susan Lopez
2012        Karina Reynolds
2012          Andrea Hughes
2014            Laura Hicks
Name: birth_years, dtype: object

Read

In [20]:
birth_years['2004']
Out[20]:
2004    Matthew Fletcher
2004        Karen Holden
Name: birth_years, dtype: object
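
The lookup above returns a Series rather than a single name because the label '2004' occurs more than once. As a minimal sketch, using made-up names instead of the Faker data, the return type appears to depend on whether the label is unique:

```python
import pandas as pd

# Hypothetical data: the label '2004' is duplicated, '1970' is not.
demo = pd.Series(
    ["Ada", "Grace", "Linus"],
    index=["1970", "2004", "2004"],
    name="birth_years",
)

single = demo["1970"]    # unique label: a plain scalar
several = demo["2004"]   # duplicated label: a Series holding both rows

print(type(single).__name__)
print(type(several).__name__, list(several))
print(demo.index.is_unique)
```

Checking `index.is_unique` before a lookup is one way to know which of the two shapes to expect.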

Update

Because an index operation either updates or appends, one must be aware of the data they are dealing with. Be careful if you intend to add a value with an index entry that already exists in the series. Assignment via an index operation to an existing index entry will overwrite previous entries.

— Matt Harrison. Learning the Pandas Library: Python Tools for Data Munging, Data Analysis, and Visualization (Kindle Locations 407-409).

In [22]:
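pprint(birth_years['2004'])
new_name = None
# Note: as written, this keeps drawing until Faker returns a name that
# already exists in `names`, so the replacement reuses an existing name.
while new_name not in names:
    new_name = fake.name()

# Assigning to a duplicated label overwrites every matching entry.
birth_years['2004'] = f'ad hoc update: {new_name}'
birth_years['2004']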
2004    Matthew Fletcher
2004        Karen Holden
Name: birth_years, dtype: object
Out[22]:
2004    ad hoc update: Susan Lopez
2004    ad hoc update: Susan Lopez
Name: birth_years, dtype: object
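
To make the "updates or appends" behaviour from the quote easier to see, here is a small deterministic sketch. The names and the label '2020' are invented for illustration and are not part of the data above.

```python
import pandas as pd

demo = pd.Series(["Ada", "Grace", "Linus"], index=["1970", "2004", "2004"])

# Existing, duplicated label: every matching entry is overwritten.
demo["2004"] = "overwritten"

# Label not yet in the index: the assignment appends a new entry.
demo["2020"] = "appended"

print(demo)
```

The same syntax does two different jobs here, which is why the book advises knowing your index before assigning.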

If you had to deal with data such as this… We can update values, based purely on position, by performing an index assignment on the .iloc attribute…

— Matt Harrison. Learning the Pandas Library: Python Tools for Data Munging, Data Analysis, and Visualization (Kindle Locations 414-417).

In [27]:
for index in (i for i, year in enumerate(birth_years.index) if year == '2004'):
    birth_years.iloc[index] = fake.name()
birth_years
Out[27]:
1970     Rebecca Richardson
1970            Keith Smith
1970         Dawn Gutierrez
1974    Dr. Louis Hernandez
1976           Brandi Glenn
1981           Paul Meadows
1982          Hannah Turner
1984    Natalie Fitzpatrick
1987        Patty Schneider
1996           Brett Jacobs
2004            Carol Wolfe
2004          Donna Roberts
2006            Miguel Lynn
2006           Mary Harrell
2006            Roy Johnson
2007           Tami Higgins
2007            Susan Lopez
2012        Karina Reynolds
2012          Andrea Hughes
2014            Laura Hicks
Name: birth_years, dtype: object
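
The loop above finds the matching positions with enumerate. As an alternative sketch (not the book's approach), numpy.flatnonzero can collect the positions in one step, and .iloc then accepts them all at once. The names are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.Series(["Ada", "Grace", "Linus"], index=["1970", "2004", "2004"])
replacements = ["Barbara", "Dennis"]  # hypothetical values, one per match

# Integer positions of every entry whose label equals '2004'.
positions = np.flatnonzero(demo.index == "2004")

# .iloc accepts an array of positions, so each duplicate gets its own value.
demo.iloc[positions] = replacements

print(demo)
```

Because the assignment is positional, each duplicate receives a distinct value instead of all of them being overwritten with the same one.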

Delete

In [41]:
del birth_years['2004']
pprint(birth_years)
1970     Rebecca Richardson
1970            Keith Smith
1970         Dawn Gutierrez
1974    Dr. Louis Hernandez
1976           Brandi Glenn
1981           Paul Meadows
1982          Hannah Turner
1984    Natalie Fitzpatrick
1987        Patty Schneider
1996           Brett Jacobs
2006            Miguel Lynn
2006           Mary Harrell
2006            Roy Johnson
2007           Tami Higgins
2007            Susan Lopez
2012        Karina Reynolds
2012          Andrea Hughes
2014            Laura Hicks
Name: birth_years, dtype: object
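
The del statement above modifies the Series in place and removes every entry with the given label. If the original should stay intact, Series.drop is a possible alternative. This is a sketch with invented names:

```python
import pandas as pd

demo = pd.Series(["Ada", "Grace", "Linus"], index=["1970", "2004", "2004"])

# drop returns a new Series without any entry labelled '2004';
# demo itself is left unchanged.
trimmed = demo.drop("2004")

print(trimmed)
print(len(demo))
```

The remaining cells switch to a fresh Series with the default integer index.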
In [33]:
people = pd.Series(
    [fake.name() for _ in range(20)],
    name='people'
)
In [34]:
display(people)
0     Stephanie Jacobson
1          Scott Fleming
2            Gary Castro
3          Anna Gonzales
4            Joel Hayden
5          Tamara Torres
6           Sarah Santos
7            Mary Rivera
8          Corey Estrada
9           Emily Harris
10            Amy Newman
11     Rebecca Contreras
12         Steven Turner
13          Jimmy Nguyen
14         Darren Sparks
15             Mark Koch
16         Chelsea Avila
17           Sarah Moore
18            Gary White
19          Nicole Short
Name: people, dtype: object

Search for all names that contain either a 'D' or a 'J'.

Use operator.or_, the functional form of the bitwise OR operator |, to combine one boolean mask per letter.

In [35]:
LETTERS = 'DJ'
masks = [pd.Series([letter in item for item in people]) 
         for letter in LETTERS]
people[op.or_(*masks)]
Out[35]:
0     Stephanie Jacobson
4            Joel Hayden
13          Jimmy Nguyen
14         Darren Sparks
Name: people, dtype: object
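
The same kind of search can probably be written more compactly with the vectorized string methods. The sketch below uses Series.str.contains with a regular-expression character class. The names are hypothetical, not the Faker output above.

```python
import pandas as pd

# Hypothetical names for illustration.
demo_people = pd.Series(
    ["Joel Hayden", "Sarah Moore", "Darren Sparks", "Mary Rivera"]
)

# The pattern is a regular-expression character class: 'D' or 'J'.
mask = demo_people.str.contains("[DJ]")

print(demo_people[mask])
```

This builds a single mask in one pass, at the cost of the letters being embedded in a regular expression.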

Search for all names that contain any letter from the first half of the uppercase alphabet.

Use operator.or_ again, this time with reduce from the functools module to fold any number of masks into one.

In [36]:
from string import ascii_uppercase
from functools import reduce
In [37]:
LETTERS = ascii_uppercase[:len(ascii_uppercase)//2]
LETTERS
Out[37]:
'ABCDEFGHIJKLM'
In [38]:
mask = reduce(op.or_, (pd.Series([letter in item for item in people]) 
                        for letter in LETTERS))
pprint(mask)
pprint(people[mask])
0      True
1      True
2      True
3      True
4      True
5     False
6     False
7      True
8      True
9      True
10     True
11     True
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19    False
dtype: bool
0     Stephanie Jacobson
1          Scott Fleming
2            Gary Castro
3          Anna Gonzales
4            Joel Hayden
7            Mary Rivera
8          Corey Estrada
9           Emily Harris
10            Amy Newman
11     Rebecca Contreras
13          Jimmy Nguyen
14         Darren Sparks
15             Mark Koch
16         Chelsea Avila
17           Sarah Moore
18            Gary White
Name: people, dtype: object
In [39]:
neg_mask = ~mask  # ~ inverts a boolean Series element-wise
In [40]:
people[neg_mask]
Out[40]:
5     Tamara Torres
6      Sarah Santos
12    Steven Turner
19     Nicole Short
Name: people, dtype: object
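
For a compact, deterministic view of the two steps above (folding masks with reduce, then negating the result), here is a sketch with invented names and an arbitrary letter set. It uses the ~ operator for the negation.

```python
from functools import reduce
import operator as op

import pandas as pd

demo_people = pd.Series(
    ["Anna Gonzales", "Tamara Torres", "Steven Turner", "Mark Koch"]
)
letters = "AK"  # hypothetical letter set

# One boolean mask per letter, folded together with element-wise OR.
mask = reduce(
    op.or_,
    (demo_people.str.contains(letter, regex=False) for letter in letters),
)

# ~ flips every element, selecting the names that match none of the letters.
print(demo_people[~mask])
```

The matching is case-sensitive, so "Tamara Torres" is not selected by 'A' even though it contains lowercase 'a'.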