Real Python Pandas Groupby Tutorial

In [1]:
import pandas as pd
In [2]:
# 3 decimal places in output display
pd.set_option("display.precision", 3)
In [3]:
# Don't wrap repr(DataFrame) accross additional lines

pd.set_option("display.expand_frame_repr", False)
In [4]:
# Set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)

Download datasets

In [5]:
import urllib.request
from pathlib import Path
import os

zipfilepath = Path(os.curdir, "groupby-data.zip")
with urllib.request.urlopen(
    urllib.request.Request(
        "https://github.com/realpython/materials/raw/master/pandas-groupby/groupby-data.zip",
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0"
        },
    )
) as fh:
    zipfilepath.write_bytes(fh.read())

Store on my Mino server.

In [6]:
from subprocess import check_output
import shlex
from pathlib import Path
import urllib.parse

minopath = Path("dokkuminio")
filename = Path("groupby-data.zip")
path = Path(minopath, "mymedia/realpython")
filepath = path.joinpath(filename)
command = f"mc cp {filename} {filepath}"
command
Out[6]:
'mc cp groupby-data.zip dokkuminio/mymedia/realpython/groupby-data.zip'
In [7]:
check_output(
    shlex.split(command), universal_newlines=True,
)
Out[7]:
'`groupby-data.zip` -> `dokkuminio/mymedia/realpython/groupby-data.zip`\nTotal: 28.24 MB, Transferred: 28.24 MB, Speed: 104.84 MB/s\n'
In [8]:
dataset_url = urllib.parse.urlunsplit(
    ("https", "minio.apps.selfip.com", Path(*filepath.parts[1:]).as_posix(), "", "")
)
dataset_url
Out[8]:
'https://minio.apps.selfip.com/mymedia/realpython/groupby-data.zip'

Remove local file.

In [9]:
zipfilepath.unlink()
assert not zipfilepath.exists()

Download remote file and store in temporary file.

In [10]:
tmp_zipfile, response = urllib.request.urlretrieve(dataset_url)
tmp_zipfile, response
Out[10]:
('/tmp/tmp77tpwj1z', <http.client.HTTPMessage at 0x7fa121cfdc30>)
In [11]:
response.items()
Out[11]:
[('Server', 'nginx'),
 ('Date', 'Sat, 23 Nov 2019 14:50:48 GMT'),
 ('Content-Type', 'application/zip'),
 ('Content-Length', '29612211'),
 ('Connection', 'close'),
 ('Accept-Ranges', 'bytes'),
 ('Content-Security-Policy', 'block-all-mixed-content'),
 ('ETag', '"c5e4045daa771f652d557c88a0f1cf7a-1"'),
 ('Last-Modified', 'Sat, 23 Nov 2019 14:50:48 GMT'),
 ('Vary', 'Origin'),
 ('X-Amz-Request-Id', '15D9D2301E5FD7DD'),
 ('X-Minio-Deployment-Id', '50591baa-6478-4283-9b56-a19476205422'),
 ('X-Xss-Protection', '1; mode=block')]

Unzip the tempfile.

In [12]:
import zipfile
import tempfile

destination = Path(tempfile.mkdtemp())
with zipfile.ZipFile(tmp_zipfile, "r") as zip_ref:
    zip_ref.extractall(destination)
In [13]:
import itertools as it

legislators_path, news_path, airqual_path = it.islice(destination.iterdir(), 0, 3)
legislators_path, news_path, airqual_path
Out[13]:
(PosixPath('/tmp/tmp08a67hrn/legislators-historical.csv'),
 PosixPath('/tmp/tmp08a67hrn/news.csv'),
 PosixPath('/tmp/tmp08a67hrn/airqual.csv'))

Dissecting a dataset of historical members of Congress

Load the data

In [14]:
import pandas as pd

CATEGORY = "category"
keys = "first_name gender type state party".split()
NON_CATEGORY_KEYS = BIRTHDAY_, LAST_NAME_ = "birthday", "last_name"
dtypes = dict(zip(keys, it.repeat(CATEGORY)))
dtypes
Out[14]:
{'first_name': 'category',
 'gender': 'category',
 'type': 'category',
 'state': 'category',
 'party': 'category'}

Because I am lazy, I am going to add column name names to the globals with an Enum.

I find repeating myself by typing and retyping commonly used strings error prone and tedious.

So I use the following to create variables programmatically.

For those who think updating globals() is risky, I added a conditional to prevent overwriting existing globals.

In [ ]:
from enum import Enum

ColumnNames = Enum(
    "ColumnNames",
    type=str,
    names=(
        zip(
            (
                *(key.upper() for key in keys),
                *(extra.upper() for extra in NON_CATEGORY_KEYS),
            ),
            (
                *keys,
                *NON_CATEGORY_KEYS
            ),
        )
    ),
)

Add the Enum names to the globals.

In [15]:
if not any(key.value in (GLOBALS := globals()) for key in ColumnNames):
    GLOBALS.update(ColumnNames.__members__)
assert BIRTHDAY == BIRTHDAY_
ColumnNames.__members__
Out[15]:
mappingproxy({'FIRST_NAME': <ColumnNames.FIRST_NAME: 'first_name'>,
              'GENDER': <ColumnNames.GENDER: 'gender'>,
              'TYPE': <ColumnNames.TYPE: 'type'>,
              'STATE': <ColumnNames.STATE: 'state'>,
              'PARTY': <ColumnNames.PARTY: 'party'>,
              'BIRTHDAY': <ColumnNames.BIRTHDAY: 'birthday'>,
              'LAST_NAME': <ColumnNames.LAST_NAME: 'last_name'>})
In [16]:
df = pd.read_csv(
    legislators_path,
    dtype=dtypes,
    usecols=[*dtypes, BIRTHDAY, LAST_NAME,],
    parse_dates=[BIRTHDAY],
)
df.tail()
Out[16]:
last_name first_name birthday gender type state party
11970 Garrett Thomas 1972-03-27 M rep VA Republican
11971 Handel Karen 1962-04-18 F rep GA Republican
11972 Jones Brenda 1959-10-24 F rep MI Democrat
11973 Marino Tom 1952-08-15 M rep PA Republican
11974 Jones Walter 1943-02-10 M rep NC Republican

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.

In [17]:
df.dtypes
Out[17]:
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

The “Hello, World!” of Pandas GroupBy

Because I added the ColumnNames names to the globals, I don't have to type strings.

STATE is str "state".

Select the last_name count by state.

In [20]:
n_by_state = df.groupby(STATE)[LAST_NAME].count()
n_by_state.head(10)
Out[20]:
ColumnNames.STATE
AK     16
AL    206
AR    117
AS      2
AZ     48
CA    361
CO     90
CT    240
DC      2
DE     97
Name: ColumnNames.LAST_NAME, dtype: int64

A list of multiple column names.

In [26]:
n_by_state_gender = df.groupby([STATE, GENDER])[LAST_NAME].count()
n_by_state_gender
Out[26]:
ColumnNames.STATE  ColumnNames.GENDER
AK                 M                      16
AL                 F                       3
                   M                     203
AR                 F                       5
                   M                     112
                                        ... 
WI                 M                     196
WV                 F                       1
                   M                     119
WY                 F                       2
                   M                      38
Name: ColumnNames.LAST_NAME, Length: 104, dtype: int64

In the Pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:

In [27]:
type(n_by_state_gender)
Out[27]:
pandas.core.series.Series
In [28]:
n_by_state_gender.index[:5]
Out[28]:
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M')],
           names=['ColumnNames.STATE', 'ColumnNames.GENDER'])

To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you an use as_index=False:

In [29]:
df.groupby([STATE, GENDER], as_index=False)[LAST_NAME].count()
Out[29]:
ColumnNames.STATE ColumnNames.GENDER ColumnNames.LAST_NAME
0 AK F NaN
1 AK M 16.0
2 AL F 3.0
3 AL M 203.0
4 AR F 5.0
... ... ... ...
111 WI M 196.0
112 WV F 1.0
113 WV M 119.0
114 WY F 2.0
115 WY M 38.0

116 rows × 3 columns

Note: In df.groupby(["state", "gender"])["last_name"].count(), you could also use .size() instead of .count(), since you know that there are no NaN last names. Using .count() excludes NaN values, while .size() includes everything, NaN or not.

Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() does this by default through its parameter sort, which is True unless you tell it otherwise

In [30]:
df.groupby(STATE, sort=False)[LAST_NAME].count()
Out[30]:
ColumnNames.STATE
DE      97
VA     432
SC     251
MD     305
PA    1053
      ... 
AK      16
PI      13
VI       4
GU       4
AS       2
Name: ColumnNames.LAST_NAME, Length: 58, dtype: int64

Next, you’ll dive into the object that .groupby() actually produces.

One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:

  1. Split a table into groups
  2. Apply some operations to each of those smaller tables
  3. Combine the results

One useful way to inspect a Pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__() and produces an iterator of (group, DataFrame) pairs for DataFrames

In [31]:
by_state = df.groupby(STATE)

Learn something new.

f"{state!r}" quotes the value. uses __repr__ instead of __str__

Resources

By default, f-strings will use __str__(), but you can make sure they use __repr__() if you include the conversion flag !r:

Tip

If you’re working on a challenging aggregation problem, then iterating over the Pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.

In [51]:
LINE, END = "-" * 30, "\n\n"


def endprint(item):
    print(item, end=END)


for state, frame in it.islice(by_state, 5):
    for _print, item in zip(
        (print, print, endprint,),
        (f"First 2 entries for {state!r}", LINE, frame.head(2)),
    ):
        _print(item)
First 2 entries for 'AK'
------------------------------
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
------------------------------
    last_name first_name   birthday gender type state       party
912   Crowell       John 1780-09-18      M  rep    AL  Republican
991    Walker       John 1783-08-12      M  sen    AL  Republican

First 2 entries for 'AR'
------------------------------
     last_name first_name   birthday gender type state party
1001     Bates      James 1788-08-25      M  rep    AR   NaN
1279    Conway      Henry 1793-03-18      M  rep    AR   NaN

First 2 entries for 'AS'
------------------------------
          last_name first_name   birthday gender type state     party
10797         Sunia       Fofó 1937-03-13      M  rep    AS  Democrat
11755  Faleomavaega        Eni 1943-08-15      M  rep    AS  Democrat

First 2 entries for 'AZ'
------------------------------
     last_name first_name   birthday gender type state       party
3674    Poston    Charles 1825-04-20      M  rep    AZ  Republican
3725   Goodwin       John 1824-10-18      M  rep    AZ  Republican

In [ ]:
 

The .groups attribute will give you a dictionary of {group name: group label} pairs.

In [41]:
by_state.groups["MT"]
Out[41]:
Int64Index([ 3756,  3941,  4094,  5033,  5342,  5536,  5537,  5581,  5968,
             6086,  6188,  6295,  6308,  6316,  6529,  6662,  6967,  7252,
             7344,  7615,  7642,  8113,  8167,  8231,  8298,  8387,  8458,
             8587,  8713,  8820,  8850,  9026,  9113,  9410,  9493,  9511,
             9618, 10037, 10080, 10156, 10261, 10282, 10473, 10683, 10939,
            11164, 11250, 11284, 11700, 11731, 11793, 11864],
           dtype='int64')
In [53]:
for i in by_state.groups["MT"][-5:]:
    endprint(df.iloc[i])
last_name                   Burns
first_name                 Conrad
birthday      1935-01-25 00:00:00
gender                          M
type                          sen
state                          MT
party                  Republican
Name: 11284, dtype: object

last_name                 Rehberg
first_name                 Dennis
birthday      1955-10-05 00:00:00
gender                          M
type                          rep
state                          MT
party                  Republican
Name: 11700, dtype: object

last_name                  Baucus
first_name                    Max
birthday      1941-12-11 00:00:00
gender                          M
type                          sen
state                          MT
party                    Democrat
Name: 11731, dtype: object

last_name                   Walsh
first_name                   John
birthday      1960-11-03 00:00:00
gender                          M
type                          sen
state                          MT
party                    Democrat
Name: 11793, dtype: object

last_name                   Zinke
first_name                   Ryan
birthday      1961-11-01 00:00:00
gender                          M
type                          rep
state                          MT
party                  Republican
Name: 11864, dtype: object

Note: I use the generic term Pandas GroupBy object to refer to both a DataFrameGroupBy object or a SeriesGroupBy object, which have a lot of commonalities between them.

Next, what about the apply part?

You can think of this step of the process as applying the same operation (or callable) to every “sub-table” that is produced by the splitting stage.

In [54]:
state, frame = next(iter(by_state))  # First tuple from iterator
state
Out[54]:
'AK'
In [55]:
frame.head(3)
Out[55]:
last_name first_name birthday gender type state party
6619 Waskey Frank 1875-04-20 M rep AK Democrat
6647 Cale Thomas 1848-09-17 M rep AK Independent
7442 Grigsby George 1874-12-02 M rep AK NaN
In [56]:
frame[LAST_NAME].count() # Count for state == 'AK'
Out[56]:
16

The last step, combine, is the most self-explanatory. It simply takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.

Next up

Example 2: Air Quality Dataset

to be continued…