Using Pivot Tables

[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np

Questions

  1. Within the Jupiter study area, what is the average distance for bike tours to work? (Hint: It is 4.03 miles)
  2. What tour purpose has the highest average tour distance? (Hint: Work tours)
  3. What is the median distance of walking for all tour purposes? (Hint: 0.548 miles)

Data

To answer the questions, use the following data file:

[11]:
tour = pd.read_csv(tt.data('SERPM8-BASE2015-TOURS'))
tour.head()
[11]:
hh_id person_id person_num person_type tour_id tour_category tour_purpose orig_mgra dest_mgra start_period end_period tour_mode tour_distance tour_time atWork_freq num_ob_stops num_ib_stops out_btap out_atap in_btap in_atap util_1 util_2 util_3 util_4 util_5 util_6 util_7 util_8 util_9 util_10 util_11 util_12 util_13 util_14 util_15 util_16 util_17 util_18 util_19 util_20 prob_1 prob_2 prob_3 prob_4 prob_5 prob_6 prob_7 prob_8 prob_9 prob_10 prob_11 prob_12 prob_13 prob_14 prob_15 prob_16 prob_17 prob_18 prob_19 prob_20
0 1690841 4502948 1 1 0 MANDATORY Work 7736 9290 8 29 6 22.261 32.311001 6 0 0 0 0 0 0 -1.395076 -998.970825 -2.360061 -2.051275 -2.361161 -1.139561 -999.0 -999.0 -996.401001 -997.447021 -996.244019 -996.244019 -997.664978 -999.883789 -999.883789 -1001.335999 -999.883789 -999.883789 -1001.335999 -999.0 0.340576 0.0 0.000096 0.091587 0.000000 0.567742 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1690841 4502948 1 1 1 AT_WORK Work-Based 9290 7980 13 14 1 1.910 4.752000 0 0 0 0 0 0 0 -0.259837 -998.668518 -0.758937 -999.608398 -0.687137 -999.858276 -999.0 -999.0 -5.843162 -28.534241 -8.014107 -1024.000000 -1024.000000 -30.581329 -1024.000000 -32.709736 -30.581329 -1024.000000 -32.709736 -999.0 0.555677 0.0 0.204790 0.000000 0.236415 0.000000 0.0 0.0 0.002799 0.0 0.000319 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1690841 4502948 1 1 2 AT_WORK Work-Based 9290 10608 19 23 3 1.357 3.777000 0 0 0 0 0 0 0 -0.195235 -998.668518 -0.694335 -999.608398 -0.622535 -999.858276 -999.0 -999.0 -4.755019 -28.262205 -8.655456 -1024.000000 -1024.000000 -33.426655 -1024.000000 -31.728464 -33.426655 -1024.000000 -31.728464 -999.0 0.553007 0.0 0.203806 0.000000 0.235279 0.000000 0.0 0.0 0.007751 0.0 0.000157 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1690841 4502949 2 2 0 MANDATORY Work 7736 8289 27 30 4 30.930 55.431000 1 1 0 0 0 0 0 -2.548089 -998.970825 -3.130183 -1.918921 -3.237083 -1.397364 -999.0 -999.0 -996.401001 -997.447021 -996.244019 -996.244019 -997.664978 -999.883789 -999.883789 -1001.335999 -999.883789 -999.883789 -1001.335999 -999.0 0.068926 0.0 0.000000 0.242591 0.000000 0.688483 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1690841 4502949 2 2 1 MANDATORY Work 7736 8289 31 36 4 30.930 55.431000 1 0 0 0 0 0 0 -2.493357 -998.970825 -3.075933 -1.860832 -3.182833 -1.339275 -999.0 -999.0 -996.401001 -997.447021 -996.244019 -996.244019 -997.664978 -999.883789 -999.883789 -1001.335999 -999.883789 -999.883789 -1001.335999 -999.0 0.068497 0.0 0.000000 0.242703 0.000000 0.688801 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[13]:
tour_mode_dict = {
    1: "DRIVEALONEFREE",
    2: "DRIVEALONEPAY",
    3: "SHARED2GP",
    4: "SHARED2PAY",
    5: "SHARED3GP",
    6: "SHARED3PAY",
    7: "TNCALONE",
    8: "TNCSHARED",
    9: "WALK",
    10: "BIKE",
    11: "WALK_MIX",
    12: "WALK_PRMW",
    13: "WALK_PRMD",
    14: "PNR_MIX",
    15: "PNR_PRMW",
    16: "PNR_PRMD",
    17: "KNR_MIX",
    18: "KNR_PRMW",
    19: "KNR_PRMD",
    20: "SCHBUS",
}

Solution

To find the answers in this exercise, we need to deal with three variables: tour_mode, tour_purpose and tour_distance. Let’s look at the data type of these three variables. We use .info() to view the data types of dataframe columns.

[6]:
tour[['tour_mode', 'tour_purpose', 'tour_distance']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47485 entries, 0 to 47484
Data columns (total 3 columns):
tour_mode        47485 non-null int64
tour_purpose     47485 non-null object
tour_distance    47485 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB

As we can see, tour_mode column in the data is categorized as integers. It is often helpful to have string category for easier exploration. Given that we know the notations of integer categories, we can convert the integer categories to string categories using a dictionary like tour_mode_dict.

[14]:
tour['tour_mode'] = tour['tour_mode'].map(tour_mode_dict)
[15]:
tour.tour_mode.value_counts()
[15]:
DRIVEALONEFREE    25911
SHARED2GP         10013
SHARED3GP          4857
DRIVEALONEPAY      2057
SCHBUS             1590
WALK               1213
BIKE                674
SHARED2PAY          500
SHARED3PAY          465
WALK_MIX            171
PNR_PRMW             11
KNR_PRMD              9
KNR_MIX               5
KNR_PRMW              3
PNR_PRMD              3
PNR_MIX               3
Name: tour_mode, dtype: int64
[16]:
tour.tour_purpose.value_counts()
[16]:
Work             16238
Shop              6291
School            5605
Discretionary     4980
Maintenance       4787
Eating Out        3064
Visiting          2641
Work-Based        2260
University         820
Escort             799
Name: tour_purpose, dtype: int64

Now that we have the perfect data types for required variables, we create a pivot table using .pivot_table() method and set the categorical columns (tour_mode and tour_purpose) in tour dataframe as index and columns. Then, we can set the continuous variable column, tour_distance as values in the .pivot_table() method.

Please note that .pivot_table() method by default calculates the average values of the column that is set as values.

[17]:
tour.pivot_table(index = 'tour_mode', columns = 'tour_purpose', values = 'tour_distance')
[17]:
tour_purpose Discretionary Eating Out Escort Maintenance School Shop University Visiting Work Work-Based
tour_mode
BIKE 2.764216 3.739237 NaN 2.872279 3.100353 4.323302 4.953091 2.448940 4.034763 NaN
DRIVEALONEFREE 5.681107 6.299282 3.848886 4.993209 4.396832 8.658878 7.974719 6.533719 12.089045 3.387014
DRIVEALONEPAY 38.526133 21.259400 23.726999 22.798727 NaN 59.314263 NaN 34.439682 38.741760 19.173934
KNR_MIX NaN NaN NaN NaN NaN NaN NaN NaN 33.410000 NaN
KNR_PRMD NaN NaN NaN NaN NaN NaN NaN NaN 37.558556 NaN
KNR_PRMW NaN NaN NaN NaN NaN NaN NaN NaN 39.154332 NaN
PNR_MIX NaN NaN NaN NaN NaN NaN NaN NaN 21.311667 NaN
PNR_PRMD NaN NaN NaN NaN NaN NaN NaN NaN 61.115000 NaN
PNR_PRMW NaN NaN NaN NaN NaN NaN NaN NaN 58.980637 NaN
SCHBUS NaN NaN NaN NaN 5.816528 NaN NaN NaN NaN NaN
SHARED2GP 7.469143 6.521744 4.544895 6.026752 4.176316 9.935834 8.809288 8.371339 13.427365 4.131103
SHARED2PAY 26.186025 22.228364 23.677999 28.893666 NaN 59.844746 21.773001 33.488765 39.481301 13.410200
SHARED3GP 7.622825 6.677294 6.378542 6.842151 4.308441 10.631094 8.269645 8.709989 13.986657 4.055721
SHARED3PAY 21.648273 22.139000 25.274000 20.237000 NaN 59.383289 16.806000 38.686500 37.353821 13.419333
WALK 0.637523 0.906817 0.491000 0.782691 0.973146 0.810000 1.343474 0.612707 0.714053 0.385786
WALK_MIX 3.486778 3.483111 5.693667 4.597381 2.434500 8.946885 3.589333 3.416000 9.022067 0.754857

From the pivot table that we just created, we can see that the average distance of biking to work is 4.034763 mile, which is the answer to our first question listed above.


To answer the second question, we just need a one-dimensional pivot table that we can create by ignoring either index or columns in .pivot_table() method.

[18]:
tour.pivot_table(columns = 'tour_purpose', values = 'tour_distance')
[18]:
tour_purpose Discretionary Eating Out Escort Maintenance School Shop University Visiting Work Work-Based
tour_distance 6.355063 6.341543 4.266534 5.413463 4.53435 13.421542 7.796431 7.534867 15.804213 3.841985

It is pretty clear that the work tours have the highest average tour distance as shown in the table above.


If we want to create a pivot table with median values of the column assigned in values, we can use aggfunc argument to specify that requirement. Again, we can use a one-dimensional pivot table to find the answer of third question.

[11]:
tour.pivot_table(index = 'tour_mode', values = 'tour_distance', aggfunc = ['median'])
[11]:
median
tour_distance
tour_mode
BIKE 2.890500
DRIVEALONEFREE 6.098000
DRIVEALONEPAY 31.275000
KNR_MIX 29.497999
KNR_PRMD 36.278000
KNR_PRMW 18.084999
PNR_MIX 16.518000
PNR_PRMD 55.566002
PNR_PRMW 56.125999
SCHBUS 2.624000
SHARED2GP 5.912000
SHARED2PAY 31.558000
SHARED3GP 5.085000
SHARED3PAY 30.006001
WALK 0.548000
WALK_MIX 3.294000

Looking at the table, it is obvious that the median walking distance for all kinds of tour is almost half a mile (0.548 mile).