[1]:

import transportation_tutorials as tt
import pandas as pd
import numpy as np

Crosstab and Pivot Tables¶

If you are familiar with the “pivot table” functionality in spreadsheets, you’ll find pretty much the same functionality in pandas using pivot_table.

To demonstrate some features of pivot tables with pandas, we’ll load and populate the households data from the Jupiter study area:

[2]:

hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'), index_col=0)
hh.set_index('hh_id', inplace=True)

[3]:

# Count of persons per HH
persons = pd.read_csv(tt.data('SERPM8-BASE2015-PERSONS'))
hh = hh.merge(
    persons.groupby('hh_id').size().rename('hhsize'),
    left_on=['hh_id'],
    right_index=True,
)

[4]:

# Count of trips per HH
trips = pd.read_csv(tt.data('SERPM8-BASE2015-TRIPS'))
hh = hh.merge(
    trips.groupby(['hh_id']).size().rename('n_trips'),
    left_on=['hh_id'],
    right_index=True,
)

Simple Pivot Tables¶

To generate a pivot table, we will use the pivot_table method of the households DataFrame. This method takes three principal arguments: index, columns, and values. The index and columns indicate the aggregation categories for the rows and columns of the resulting summary table, respectively. The values indicate which column of values should be aggregated.

[5]:

hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
)

[5]:

autos	0	1	2	3	4
hhsize
1	3.000000	3.825093	3.390000	3.142857	NaN
2	5.122137	6.108434	6.641809	6.690341	6.538462
3	7.200000	8.551613	9.376471	9.581549	9.172043
4	10.750000	10.922330	11.739943	12.326733	11.936364
5	13.230769	12.893617	13.583893	14.485714	13.821782
6	7.500000	12.868421	15.778351	15.783784	17.593407
7	NaN	20.500000	19.733333	18.725000	20.117647
8	11.000000	13.600000	16.791667	16.428571	20.200000
9	NaN	NaN	22.631579	21.863636	19.687500
10	NaN	NaN	18.000000	34.000000	18.000000

If you only want to evalate the pivot table on one dimension, you can omit either index or columns, which gives the results you might expect:

[6]:

hh.pivot_table(
    columns='autos',
    values='n_trips',
)

[6]:

autos	0	1	2	3	4
n_trips	4.600985	4.813088	8.114862	10.090541	13.364162

[7]:

hh.pivot_table(
    index='hhsize',
    values='n_trips',
)

[7]:

	n_trips
hhsize
1	3.770963
2	6.516076
3	9.286667
4	11.827869
5	13.812303
6	15.790909
7	19.536585
8	15.851852
9	21.109589
10	23.333333

Marginals¶

The two outputs above are actually the marginals of the original pivot table. If you want to generate the entire pivot table and include the marginals on rows and columns, you can do that in one step with the margins argument:

[8]:

hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
    margins=True,
)

[8]:

autos	0	1	2	3	4	All
hhsize
1	3.000000	3.825093	3.390000	3.142857	NaN	3.770963
2	5.122137	6.108434	6.641809	6.690341	6.538462	6.516076
3	7.200000	8.551613	9.376471	9.581549	9.172043	9.286667
4	10.750000	10.922330	11.739943	12.326733	11.936364	11.827869
5	13.230769	12.893617	13.583893	14.485714	13.821782	13.812303
6	7.500000	12.868421	15.778351	15.783784	17.593407	15.790909
7	NaN	20.500000	19.733333	18.725000	20.117647	19.536585
8	11.000000	13.600000	16.791667	16.428571	20.200000	15.851852
9	NaN	NaN	22.631579	21.863636	19.687500	21.109589
10	NaN	NaN	18.000000	34.000000	18.000000	23.333333
All	4.600985	4.813088	8.114862	10.090541	13.364162	7.176941

Aggregation Functions¶

By default, pivot_table computes the mean (average) for the values column aggregated within the categories that define each cell. However, alternate aggregation functions can be used, such as 'count', 'std' (the standard deviation), or 'sem' (the standard error of the mean). One or more functions can be specified using the aggfunc argument:

[9]:

hh.pivot_table(
    index='hhsize',
    columns='autos',
    values='n_trips',
    aggfunc=['mean', 'count', 'sem'],
)

[9]:

	mean					count					sem
autos	0	1	2	3	4	0	1	2	3	4	0	1	2	3	4
hhsize
1	3.000000	3.825093	3.390000	3.142857	NaN	215.0	4551.0	200.0	7.0	NaN	0.114196	0.029032	0.117081	0.704698	NaN
2	5.122137	6.108434	6.641809	6.690341	6.538462	131.0	1411.0	5042.0	704.0	52.0	0.206478	0.068357	0.039829	0.108316	0.362725
3	7.200000	8.551613	9.376471	9.581549	9.172043	30.0	310.0	1360.0	607.0	93.0	0.523911	0.177461	0.084386	0.140620	0.375009
4	10.750000	10.922330	11.739943	12.326733	11.936364	8.0	103.0	696.0	303.0	110.0	2.024405	0.375504	0.134725	0.229000	0.396456
5	13.230769	12.893617	13.583893	14.485714	13.821782	13.0	47.0	298.0	175.0	101.0	1.387461	0.545975	0.243191	0.327388	0.444192
6	7.500000	12.868421	15.778351	15.783784	17.593407	6.0	38.0	194.0	111.0	91.0	1.962142	0.801111	0.390703	0.544559	0.552932
7	NaN	20.500000	19.733333	18.725000	20.117647	NaN	4.0	45.0	40.0	34.0	NaN	1.658312	0.819214	0.807011	0.840979
8	11.000000	13.600000	16.791667	16.428571	20.200000	3.0	15.0	24.0	7.0	5.0	3.605551	1.463199	1.367293	1.836886	2.457641
9	NaN	NaN	22.631579	21.863636	19.687500	NaN	NaN	19.0	22.0	32.0	NaN	NaN	1.035088	1.211945	0.988276
10	NaN	NaN	18.000000	34.000000	18.000000	NaN	NaN	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN

If all you want is to simply count the number of rows in each category, you can use aggfunc='size', which will do so without requiring a values argument.

[10]:

hh.pivot_table(
    index='hhsize',
    columns='autos',
    aggfunc='size',
)

[10]:

autos	0	1	2	3	4
hhsize
1	215.0	4551.0	200.0	7.0	NaN
2	131.0	1411.0	5042.0	704.0	52.0
3	30.0	310.0	1360.0	607.0	93.0
4	8.0	103.0	696.0	303.0	110.0
5	13.0	47.0	298.0	175.0	101.0
6	6.0	38.0	194.0	111.0	91.0
7	NaN	4.0	45.0	40.0	34.0
8	3.0	15.0	24.0	7.0	5.0
9	NaN	NaN	19.0	22.0	32.0
10	NaN	NaN	1.0	1.0	1.0

You can observe that the empty cells in the result above do not show zero, but rather NaN – and this results in the data type for the content being float instead of int, as there is no NaN representation in integers. To correct this, set the fill_value to zero as well:

[11]:

hh.pivot_table(
    index='hhsize',
    columns='autos',
    aggfunc='size',
    fill_value=0,
)

[11]:

autos	0	1	2	3	4
hhsize
1	215	4551	200	7	0
2	131	1411	5042	704	52
3	30	310	1360	607	93
4	8	103	696	303	110
5	13	47	298	175	101
6	6	38	194	111	91
7	0	4	45	40	34
8	3	15	24	7	5
9	0	0	19	22	32
10	0	0	1	1	1

Using Binning of Continuous Variables¶

As with the groupby function, using a non-categorical column for aggregation can be problematic, or may result in a pivot table larger than desired. For example, if we switch ‘hhsize’ for ‘income’ on this dataset, the pivot table will have over 2,000 rows.

[12]:

hh.pivot_table(
    index='income',
    columns='autos',
    values='n_trips',
)

[12]:

autos	0	1	2	3	4
income
-9499	NaN	NaN	8.333333	NaN	NaN
-2400	2.000000	NaN	NaN	NaN	NaN
-2000	NaN	NaN	12.000000	NaN	NaN
0	3.641026	4.524064	4.956522	2.000000	NaN
1	NaN	4.000000	NaN	NaN	NaN
100	NaN	3.500000	NaN	NaN	NaN
200	NaN	5.000000	NaN	NaN	NaN
220	NaN	7.000000	NaN	NaN	NaN
230	NaN	4.500000	NaN	NaN	NaN
400	NaN	NaN	10.000000	NaN	NaN
480	3.000000	2.000000	NaN	NaN	NaN
500	NaN	5.250000	NaN	NaN	NaN
600	NaN	NaN	6.000000	NaN	NaN
650	NaN	2.000000	NaN	NaN	NaN
690	NaN	NaN	12.000000	NaN	NaN
800	NaN	NaN	2.000000	NaN	NaN
900	NaN	8.500000	NaN	NaN	NaN
950	NaN	4.000000	NaN	NaN	NaN
1000	NaN	3.500000	8.000000	NaN	NaN
1100	4.000000	3.000000	NaN	NaN	NaN
1300	4.666667	3.812500	NaN	NaN	NaN
1400	3.000000	5.333333	NaN	NaN	NaN
1500	NaN	3.000000	NaN	NaN	NaN
1600	NaN	3.000000	9.000000	NaN	NaN
1700	NaN	3.000000	NaN	NaN	NaN
1800	2.000000	5.000000	NaN	NaN	NaN
2000	NaN	3.000000	2.000000	NaN	NaN
2100	NaN	4.000000	NaN	NaN	NaN
2200	NaN	2.500000	2.000000	NaN	NaN
2400	NaN	2.500000	NaN	NaN	NaN
...	...	...	...	...	...
646300	NaN	NaN	10.000000	14.000000	NaN
646700	NaN	NaN	6.666667	NaN	NaN
647000	NaN	NaN	6.500000	NaN	NaN
664000	NaN	NaN	6.000000	NaN	2.0
667000	NaN	NaN	13.333333	NaN	NaN
668000	NaN	NaN	7.750000	6.000000	NaN
671260	NaN	NaN	9.000000	NaN	12.0
680000	NaN	NaN	7.000000	NaN	NaN
697000	NaN	12.000000	9.000000	14.250000	NaN
701100	NaN	4.000000	8.500000	NaN	NaN
704000	NaN	NaN	9.500000	NaN	NaN
707000	NaN	NaN	12.000000	NaN	NaN
720600	NaN	NaN	10.000000	10.000000	NaN
729000	NaN	NaN	8.000000	NaN	NaN
746000	NaN	NaN	12.500000	NaN	NaN
755000	NaN	NaN	20.666667	25.000000	NaN
782000	NaN	NaN	17.000000	NaN	NaN
793000	NaN	NaN	9.500000	8.000000	NaN
798000	NaN	4.800000	NaN	NaN	NaN
805000	NaN	NaN	6.000000	NaN	NaN
807600	NaN	NaN	6.500000	NaN	NaN
812000	NaN	NaN	6.600000	9.000000	NaN
846000	NaN	NaN	8.000000	NaN	NaN
857000	NaN	NaN	NaN	7.000000	NaN
907000	NaN	NaN	11.000000	9.666667	NaN
916000	NaN	NaN	13.500000	19.800000	NaN
922500	NaN	NaN	14.000000	7.500000	7.0
975000	NaN	12.500000	NaN	14.000000	NaN
985000	NaN	NaN	10.333333	NaN	NaN
1040000	NaN	NaN	4.000000	NaN	NaN

2048 rows × 5 columns

This problem can be addressed by reformatting the relevant variable into a categorical form, using a binning function. For pivot tables, the pandas.qcut function is often a good choice for this, as it will bin the variable so that each bin has a similar total number of observations from the source table.

[13]:

hh.pivot_table(
    index=pd.qcut(hh.income, 5),
    columns='autos',
    values='n_trips',
)

[13]:

autos	0	1	2	3	4
income
(-9499.001, 31600.0]	4.325806	4.516639	7.717423	9.283019	13.230769
(31600.0, 57000.0]	4.807018	4.940185	7.586078	9.391061	12.796296
(57000.0, 91000.0]	7.416667	4.740959	7.990544	9.912644	14.240741
(91000.0, 145800.0]	4.444444	5.722326	8.471237	9.681750	12.485915
(145800.0, 1040000.0]	5.833333	4.968872	8.328530	10.888717	13.673267

As you might notice above, the index argument (and also the columns) does not need to be the name of a column in the original DataFrame, but instead can be an indexed-alike Series or similar object than can be used to define the group-by categories for that axis of the resulting pivot.

Table Of Contents

Crosstab and Pivot Tables¶

Simple Pivot Tables¶

Marginals¶

Aggregation Functions¶

Using Binning of Continuous Variables¶