[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np

Crosstab and Pivot Tables

If you are familiar with the “pivot table” functionality in spreadsheets, you’ll find pretty much the same functionality in pandas using pivot_table.

To demonstrate some features of pivot tables with pandas, we’ll load and populate the households data from the Jupiter study area:

[2]:
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'), index_col=0)
hh.set_index('hh_id', inplace=True)
[3]:
# Count of persons per HH
persons = pd.read_csv(tt.data('SERPM8-BASE2015-PERSONS'))
hh = hh.merge(
    persons.groupby('hh_id').size().rename('hhsize'),
    left_on=['hh_id'],
    right_index=True,
)
[4]:
# Count of trips per HH
trips = pd.read_csv(tt.data('SERPM8-BASE2015-TRIPS'))
hh = hh.merge(
    trips.groupby(['hh_id']).size().rename('n_trips'),
    left_on=['hh_id'],
    right_index=True,
)

Simple Pivot Tables

To generate a pivot table, we will use the pivot_table method of the households DataFrame. This method takes three principal arguments: index, columns, and values. The index and columns indicate the aggregation categories for the rows and columns of the resulting summary table, respectively. The values indicate which column of values should be aggregated.

[5]:
hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
)
[5]:
autos 0 1 2 3 4
hhsize
1 3.000000 3.825093 3.390000 3.142857 NaN
2 5.122137 6.108434 6.641809 6.690341 6.538462
3 7.200000 8.551613 9.376471 9.581549 9.172043
4 10.750000 10.922330 11.739943 12.326733 11.936364
5 13.230769 12.893617 13.583893 14.485714 13.821782
6 7.500000 12.868421 15.778351 15.783784 17.593407
7 NaN 20.500000 19.733333 18.725000 20.117647
8 11.000000 13.600000 16.791667 16.428571 20.200000
9 NaN NaN 22.631579 21.863636 19.687500
10 NaN NaN 18.000000 34.000000 18.000000

If you only want to evalate the pivot table on one dimension, you can omit either index or columns, which gives the results you might expect:

[6]:
hh.pivot_table(
    columns='autos',
    values='n_trips',
)
[6]:
autos 0 1 2 3 4
n_trips 4.600985 4.813088 8.114862 10.090541 13.364162
[7]:
hh.pivot_table(
    index='hhsize',
    values='n_trips',
)
[7]:
n_trips
hhsize
1 3.770963
2 6.516076
3 9.286667
4 11.827869
5 13.812303
6 15.790909
7 19.536585
8 15.851852
9 21.109589
10 23.333333

Marginals

The two outputs above are actually the marginals of the original pivot table. If you want to generate the entire pivot table and include the marginals on rows and columns, you can do that in one step with the margins argument:

[8]:
hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
    margins=True,
)
[8]:
autos 0 1 2 3 4 All
hhsize
1 3.000000 3.825093 3.390000 3.142857 NaN 3.770963
2 5.122137 6.108434 6.641809 6.690341 6.538462 6.516076
3 7.200000 8.551613 9.376471 9.581549 9.172043 9.286667
4 10.750000 10.922330 11.739943 12.326733 11.936364 11.827869
5 13.230769 12.893617 13.583893 14.485714 13.821782 13.812303
6 7.500000 12.868421 15.778351 15.783784 17.593407 15.790909
7 NaN 20.500000 19.733333 18.725000 20.117647 19.536585
8 11.000000 13.600000 16.791667 16.428571 20.200000 15.851852
9 NaN NaN 22.631579 21.863636 19.687500 21.109589
10 NaN NaN 18.000000 34.000000 18.000000 23.333333
All 4.600985 4.813088 8.114862 10.090541 13.364162 7.176941

Aggregation Functions

By default, pivot_table computes the mean (average) for the values column aggregated within the categories that define each cell. However, alternate aggregation functions can be used, such as 'count', 'std' (the standard deviation), or 'sem' (the standard error of the mean). One or more functions can be specified using the aggfunc argument:

[9]:
hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
    aggfunc=['mean', 'count', 'sem'],
)
[9]:
mean count sem
autos 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
hhsize
1 3.000000 3.825093 3.390000 3.142857 NaN 215.0 4551.0 200.0 7.0 NaN 0.114196 0.029032 0.117081 0.704698 NaN
2 5.122137 6.108434 6.641809 6.690341 6.538462 131.0 1411.0 5042.0 704.0 52.0 0.206478 0.068357 0.039829 0.108316 0.362725
3 7.200000 8.551613 9.376471 9.581549 9.172043 30.0 310.0 1360.0 607.0 93.0 0.523911 0.177461 0.084386 0.140620 0.375009
4 10.750000 10.922330 11.739943 12.326733 11.936364 8.0 103.0 696.0 303.0 110.0 2.024405 0.375504 0.134725 0.229000 0.396456
5 13.230769 12.893617 13.583893 14.485714 13.821782 13.0 47.0 298.0 175.0 101.0 1.387461 0.545975 0.243191 0.327388 0.444192
6 7.500000 12.868421 15.778351 15.783784 17.593407 6.0 38.0 194.0 111.0 91.0 1.962142 0.801111 0.390703 0.544559 0.552932
7 NaN 20.500000 19.733333 18.725000 20.117647 NaN 4.0 45.0 40.0 34.0 NaN 1.658312 0.819214 0.807011 0.840979
8 11.000000 13.600000 16.791667 16.428571 20.200000 3.0 15.0 24.0 7.0 5.0 3.605551 1.463199 1.367293 1.836886 2.457641
9 NaN NaN 22.631579 21.863636 19.687500 NaN NaN 19.0 22.0 32.0 NaN NaN 1.035088 1.211945 0.988276
10 NaN NaN 18.000000 34.000000 18.000000 NaN NaN 1.0 1.0 1.0 NaN NaN NaN NaN NaN

If all you want is to simply count the number of rows in each category, you can use aggfunc='size', which will do so without requiring a values argument.

[10]:
hh.pivot_table(
    index='hhsize',
    columns='autos',
    aggfunc='size',
)
[10]:
autos 0 1 2 3 4
hhsize
1 215.0 4551.0 200.0 7.0 NaN
2 131.0 1411.0 5042.0 704.0 52.0
3 30.0 310.0 1360.0 607.0 93.0
4 8.0 103.0 696.0 303.0 110.0
5 13.0 47.0 298.0 175.0 101.0
6 6.0 38.0 194.0 111.0 91.0
7 NaN 4.0 45.0 40.0 34.0
8 3.0 15.0 24.0 7.0 5.0
9 NaN NaN 19.0 22.0 32.0
10 NaN NaN 1.0 1.0 1.0

You can observe that the empty cells in the result above do not show zero, but rather NaN – and this results in the data type for the content being float instead of int, as there is no NaN representation in integers. To correct this, set the fill_value to zero as well:

[11]:
hh.pivot_table(
    index='hhsize',
    columns='autos',
    aggfunc='size',
    fill_value=0,
)
[11]:
autos 0 1 2 3 4
hhsize
1 215 4551 200 7 0
2 131 1411 5042 704 52
3 30 310 1360 607 93
4 8 103 696 303 110
5 13 47 298 175 101
6 6 38 194 111 91
7 0 4 45 40 34
8 3 15 24 7 5
9 0 0 19 22 32
10 0 0 1 1 1

Using Binning of Continuous Variables

As with the groupby function, using a non-categorical column for aggregation can be problematic, or may result in a pivot table larger than desired. For example, if we switch ‘hhsize’ for ‘income’ on this dataset, the pivot table will have over 2,000 rows.

[12]:
hh.pivot_table(
    index='income',
    columns='autos',
    values='n_trips',
)
[12]:
autos 0 1 2 3 4
income
-9499 NaN NaN 8.333333 NaN NaN
-2400 2.000000 NaN NaN NaN NaN
-2000 NaN NaN 12.000000 NaN NaN
0 3.641026 4.524064 4.956522 2.000000 NaN
1 NaN 4.000000 NaN NaN NaN
100 NaN 3.500000 NaN NaN NaN
200 NaN 5.000000 NaN NaN NaN
220 NaN 7.000000 NaN NaN NaN
230 NaN 4.500000 NaN NaN NaN
400 NaN NaN 10.000000 NaN NaN
480 3.000000 2.000000 NaN NaN NaN
500 NaN 5.250000 NaN NaN NaN
600 NaN NaN 6.000000 NaN NaN
650 NaN 2.000000 NaN NaN NaN
690 NaN NaN 12.000000 NaN NaN
800 NaN NaN 2.000000 NaN NaN
900 NaN 8.500000 NaN NaN NaN
950 NaN 4.000000 NaN NaN NaN
1000 NaN 3.500000 8.000000 NaN NaN
1100 4.000000 3.000000 NaN NaN NaN
1300 4.666667 3.812500 NaN NaN NaN
1400 3.000000 5.333333 NaN NaN NaN
1500 NaN 3.000000 NaN NaN NaN
1600 NaN 3.000000 9.000000 NaN NaN
1700 NaN 3.000000 NaN NaN NaN
1800 2.000000 5.000000 NaN NaN NaN
2000 NaN 3.000000 2.000000 NaN NaN
2100 NaN 4.000000 NaN NaN NaN
2200 NaN 2.500000 2.000000 NaN NaN
2400 NaN 2.500000 NaN NaN NaN
... ... ... ... ... ...
646300 NaN NaN 10.000000 14.000000 NaN
646700 NaN NaN 6.666667 NaN NaN
647000 NaN NaN 6.500000 NaN NaN
664000 NaN NaN 6.000000 NaN 2.0
667000 NaN NaN 13.333333 NaN NaN
668000 NaN NaN 7.750000 6.000000 NaN
671260 NaN NaN 9.000000 NaN 12.0
680000 NaN NaN 7.000000 NaN NaN
697000 NaN 12.000000 9.000000 14.250000 NaN
701100 NaN 4.000000 8.500000 NaN NaN
704000 NaN NaN 9.500000 NaN NaN
707000 NaN NaN 12.000000 NaN NaN
720600 NaN NaN 10.000000 10.000000 NaN
729000 NaN NaN 8.000000 NaN NaN
746000 NaN NaN 12.500000 NaN NaN
755000 NaN NaN 20.666667 25.000000 NaN
782000 NaN NaN 17.000000 NaN NaN
793000 NaN NaN 9.500000 8.000000 NaN
798000 NaN 4.800000 NaN NaN NaN
805000 NaN NaN 6.000000 NaN NaN
807600 NaN NaN 6.500000 NaN NaN
812000 NaN NaN 6.600000 9.000000 NaN
846000 NaN NaN 8.000000 NaN NaN
857000 NaN NaN NaN 7.000000 NaN
907000 NaN NaN 11.000000 9.666667 NaN
916000 NaN NaN 13.500000 19.800000 NaN
922500 NaN NaN 14.000000 7.500000 7.0
975000 NaN 12.500000 NaN 14.000000 NaN
985000 NaN NaN 10.333333 NaN NaN
1040000 NaN NaN 4.000000 NaN NaN

2048 rows × 5 columns

This problem can be addressed by reformatting the relevant variable into a categorical form, using a binning function. For pivot tables, the pandas.qcut function is often a good choice for this, as it will bin the variable so that each bin has a similar total number of observations from the source table.

[13]:
hh.pivot_table(
    index=pd.qcut(hh.income, 5),
    columns='autos',
    values='n_trips',
)
[13]:
autos 0 1 2 3 4
income
(-9499.001, 31600.0] 4.325806 4.516639 7.717423 9.283019 13.230769
(31600.0, 57000.0] 4.807018 4.940185 7.586078 9.391061 12.796296
(57000.0, 91000.0] 7.416667 4.740959 7.990544 9.912644 14.240741
(91000.0, 145800.0] 4.444444 5.722326 8.471237 9.681750 12.485915
(145800.0, 1040000.0] 5.833333 4.968872 8.328530 10.888717 13.673267

As you might notice above, the index argument (and also the columns) does not need to be the name of a column in the original DataFrame, but instead can be an indexed-alike Series or similar object than can be used to define the group-by categories for that axis of the resulting pivot.