[1]:
import os
import gzip
import numpy as np
import pandas as pd
import transportation_tutorials as tt

Basic Data Analysis with Pandas

Pandas is the de facto standard for statistical analysis of tabular data using Python.

The basic data structure in pandas is a DataFrame. A DataFrame is a two-dimensional table of data, with optional row and column labels.

You can construct a DataFrame from raw data in a few different ways. One simple way is to pass a dictionary, as here:

[2]:
raw = {'A': [7,8,9], 'B': [1,2,3], 'C': [4.1,5.2,6.3]}
df = pd.DataFrame(raw)
df
[2]:
A B C
0 7 1 4.1
1 8 2 5.2
2 9 3 6.3

Row and Column Labels

In addition to the data in the table, DataFrames are defined with row and column labels, identified as index and columns, respectively. You might notice that in the first DataFrame we created, the keys of the raw dictionary were adopted automatically as the column labels. The row labels are not drawn from the raw dictionary as there’s nothing in it to use for that purpose. But every DataFrame must have some kind of row labels, so some are created automatically. If we provide more info to the DataFrame constructor so that it can see what the row labels should be, then it will use those:

[3]:
raw_los = {
    'Speed': {'Car': 60, 'Bus': 20, 'Walk':3},
    'Cost': {'Car': 3.25, 'Bus': 1.75, 'Walk':0},
}
los = pd.DataFrame(raw_los)
los
[3]:
Speed Cost
Car 60 3.25
Bus 20 1.75
Walk 3 0.00
[4]:
los.index
[4]:
Index(['Car', 'Bus', 'Walk'], dtype='object')
[5]:
los.columns
[5]:
Index(['Speed', 'Cost'], dtype='object')

The values for either index or columns can be changed, simply by overwriting the appropriate attribute of the DataFrame.

[6]:
los.index = [2,1,3]
los
[6]:
Speed Cost
2 60 3.25
1 20 1.75
3 3 0.00

You can also choose an existing column of the DataFrame to be the index, using the set_index method.

[7]:
los.set_index('Speed')
[7]:
Cost
Speed
60 3.25
20 1.75
3 0.00

This probably isn’t a good thing to do with measured attributes like ‘Speed’, but it can come in handy to set the index as an identifier code (e.g. a TAZ code), especially after reading data from a file.

As you might notice, the index or columns can be made up of text, or numbers – or, in fact, any Python object.

[8]:
los.index = [{'key':'value'},set([None]),bytes(2)]
los
[8]:
Speed Cost
{'key': 'value'} 60 3.25
{None} 20 1.75
b'\x00\x00' 3 0.00

But you’ll find things much simpler if you stick to strings or integers for these labels, unless there is some very compelling reason to deviate from that plan.

Every DataFrame has both index and columns, even if one or both of these sets of labels are not defined when creating the DataFrame. For example, the example df created from raw above only has column names, but the index with row names was created by default.

[9]:
df.index
[9]:
RangeIndex(start=0, stop=3, step=1)

This default index is a RangeIndex, which is a memory-saving feature (at least when the DataFrame is large), as the actual index values are not stored, just the start, stop, and step values (i.e., start at 0, stop before you get to 3, counting by 1’s).
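
For most purposes, a RangeIndex behaves exactly like a fully materialized index; only the storage differs. A quick sketch to confirm the equivalence:

# the lazily generated labels compare equal to an explicit Index
pd.RangeIndex(start=0, stop=3, step=1).equals(pd.Index([0, 1, 2]))  # True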

Data Types

Within the DataFrame, each column of data has a uniform data type, but the data types are permitted to vary across the columns. In the example DataFrame we created here, the data type was inferred as int64 for columns 'A' and 'B', and float64 for 'C' (because the values in that column are not all whole numbers). Note that all the values displayed in column 'C' have decimal values. We can also check explicitly on the data types for each column:

[10]:
df
[10]:
A B C
0 7 1 4.1
1 8 2 5.2
2 9 3 6.3
[11]:
df.dtypes
[11]:
A      int64
B      int64
C    float64
dtype: object

If we try to force data into a DataFrame that doesn't follow this pattern of maintaining one common data type in each column, the underlying data will be upcast to preserve this one-data-type-per-column rule. For example, if we take the transpose of the example DataFrame (swapping the rows and columns), the integer data gets upcast to floats:

[12]:
df.T
[12]:
0 1 2
A 7.0 8.0 9.0
B 1.0 2.0 3.0
C 4.1 5.2 6.3

The numerical values of each data cell are preserved, but now all three columns are represented as floating point numbers.

[13]:
df.T.dtypes
[13]:
0    float64
1    float64
2    float64
dtype: object

Reading from a File

For virtually all transportation data analysis, the data used will be loaded from a file, instead of being entered directly in Python commands. Fortunately, pandas comes with a host of data reading functions (including read_csv, read_excel, read_json, read_html, and read_sql, among others) that can be used to read data in a variety of formats. You can click through to the documentation for each of these.

In this tutorial, we'll cover the most basic reading function in the pandas library, which is read_csv. Despite the name, this function isn't just for reading comma-separated values, but also pretty much any delimited text file, including tab- and space-delimited files.
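
For example, reading a tab-delimited or whitespace-delimited file only requires passing the delimiter explicitly; a sketch, with hypothetical file names:

# tab-delimited: give the sep argument
skims = pd.read_csv('skims.tsv', sep='\t')
# runs of whitespace can be matched with a regular expression delimiter
counts = pd.read_csv('counts.txt', sep=r'\s+')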

A very small example csv data file is included in the transportation_tutorials package:

[14]:
os.path.basename(tt.data('FL-COUNTY-POP'))
[14]:
'FL-COUNTY-POP.csv.gz'

For most csv files, you can just pass the name of the file to the read_csv function, and pandas will figure out the rest. This includes, as in this example, files that are gzipped on disk, which are transparently decompressed while reading.

[15]:
fl = pd.read_csv(tt.data('FL-COUNTY-POP'))
[16]:
fl
[16]:
Name 2019 Population Growth Since 2010 Land Area
0 Miami-Dade County 2751796 9.754518 1897.72
1 Broward County 1935878 10.438146 1209.79
2 Palm Beach County 1471150 11.134781 1969.76
3 Hillsborough County 1408566 14.187346 1020.21
4 Orange County 1348975 17.435335 903.43
5 Pinellas County 970637 5.908084 273.80
6 Duval County 937934 8.349131 762.19
7 Lee County 739224 19.139938 784.51
8 Polk County 686483 13.819128 1797.84
9 Brevard County 589162 8.306616 1015.66
10 Volusia County 538692 8.949042 1101.03
11 Pasco County 525643 12.922512 746.89
12 Seminole County 462659 9.358846 309.22
13 Sarasota County 419119 10.310730 555.87
14 Manatee County 385571 19.209805 742.93
15 Collier County 372880 15.585507 1998.32
16 Marion County 354353 6.944143 1584.55
17 Osceola County 352180 30.508575 1327.45
18 Lake County 346017 16.228536 938.38
19 Escambia County 313512 5.192661 656.46
20 St. Lucie County 313506 12.659688 571.93
21 Leon County 290292 5.189313 666.85
22 Alachua County 266944 7.809988 875.02
23 St. Johns County 243812 27.492731 600.66
24 Clay County 212230 10.863853 604.36
25 Okaloosa County 202970 12.315593 930.25
26 Hernando County 186553 7.857172 472.54
27 Bay County 183563 8.484283 758.46
28 Charlotte County 182033 13.863851 680.28
29 Santa Rosa County 174272 13.969564 1011.61
... ... ... ... ...
37 Monroe County 77013 5.181715 983.28
38 Putnam County 73464 -0.997251 727.62
39 Columbia County 69612 3.043401 797.57
40 Walton County 68376 23.840400 1037.63
41 Jackson County 48330 -2.650767 917.76
42 Gadsden County 46071 -3.607072 516.33
43 Suwannee County 44190 4.406379 688.55
44 Okeechobee County 41605 3.934549 768.91
45 Levy County 40355 -0.906100 1118.21
46 Hendry County 40347 3.414072 1152.75
47 DeSoto County 36862 5.512938 637.06
48 Wakulla County 32120 4.201135 606.42
49 Baker County 28283 4.511862 585.23
50 Hardee County 27411 -1.132552 637.78
51 Bradford County 27038 -5.249509 293.96
52 Washington County 24567 -0.622952 582.80
53 Taylor County 21833 -3.338203 1043.31
54 Holmes County 19558 -1.446208 478.78
55 Madison County 18449 -4.151081 695.95
56 Gilchrist County 17743 4.364449 349.68
57 Dixie County 16673 1.658435 705.05
58 Gulf County 16160 2.123357 564.01
59 Union County 15517 -0.180122 243.56
60 Calhoun County 14483 -1.112932 567.33
61 Hamilton County 14184 -3.398488 513.79
62 Jefferson County 14144 -4.095471 598.10
63 Glades County 13754 6.843782 806.01
64 Franklin County 11727 1.779205 534.73
65 Lafayette County 8451 -4.074915 543.41
66 Liberty County 8242 -1.269765 835.56

67 rows × 4 columns

Within a Jupyter notebook, by default, if you try to view a large DataFrame, the middle section is not displayed. If you look carefully above, you'll see a row of ellipses marking the place where some rows are not shown. Still, we get the first 30 and last 30 rows of the table. If you don't want to review so large a section of a DataFrame, you can use the head method of DataFrames to see just the first few rows.

[17]:
fl.head()
[17]:
Name 2019 Population Growth Since 2010 Land Area
0 Miami-Dade County 2751796 9.754518 1897.72
1 Broward County 1935878 10.438146 1209.79
2 Palm Beach County 1471150 11.134781 1969.76
3 Hillsborough County 1408566 14.187346 1020.21
4 Orange County 1348975 17.435335 903.43

One notable tabular file format you may encounter in transportation planning that is not directly readable by pandas is the DBF file. Fortunately, there are other related tools that can read this kind of file, including geopandas. Geopandas includes just one main reading function: read_file. When you read a DBF file with geopandas.read_file, you’ll get a GeoDataFrame object instead of a regular pandas.DataFrame – but don’t worry, everything you can do with a DataFrame you can also do with a GeoDataFrame.

[18]:
import os
import geopandas as gpd
[19]:
os.path.basename(tt.data('US-STATES'))
[19]:
'US-STATES.dbf'
[20]:
states = gpd.read_file(tt.data('US-STATES'))
states.head()
[20]:
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER geometry
0 01 01779775 0400000US01 01 AL Alabama 00 131173688951 4593686489 None
1 02 01785533 0400000US02 02 AK Alaska 00 1477946266785 245390495931 None
2 04 01779777 0400000US04 04 AZ Arizona 00 294198560125 1027346486 None
3 05 00068085 0400000US05 05 AR Arkansas 00 134771517596 2960191698 None
4 06 01779778 0400000US06 06 CA California 00 403501101370 20466718403 None
[21]:
import larch

An alternate DBF reading implementation is included in Larch. This optimized DBF reader can currently only handle text and numeric data types (i.e., types 'C', 'F', and 'N'; no dates, memos, or logical fields). However, it can read data quite fast even for large files, and will extract the basic contents of the file (headers, number of rows) even without actually reading the data.

[22]:
f = larch.DBF(tt.data('US-STATES'))

To actually load the data into a DataFrame, use the load_dataframe method.

[23]:
states = f.load_dataframe()
states.set_index('STUSPS', inplace=True)
states.head()
[23]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
CA 06 01779778 0400000US06 06 California 00 403501101370 20466718403

Data Cleaning

Sometimes, data that is read in from raw files will require some cleaning to be usable. Consider this example of data on bridges in Florida:

[24]:
with gzip.open(tt.data('FL-BRIDGES'), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(6)))
County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area
 ALACHUA (001),111,64,47,-,-,64767,55794,8973,,
 BAKER (003),89,30,52,7,8,32162,19369,12282,510,623
 BAY (005),122,49,63,10,11,210039,98834,109628,1577,10120
 BRADFORD (007),62,23,37,2,2,9330,5492,3217,620,620
 BREVARD (009),241,160,81,-,-,364138,204179,159959,,

The data is loaded by pandas in the usual way without errors.

[25]:
bridges = pd.read_csv(tt.data('FL-BRIDGES'))

However, not all is well, and problems will appear during analysis.

[26]:
bridges['SD #'].sum()
[26]:
'-8112-711-324152218128--6-1--22913622-286111-136313-3-7-3331352-13917696256'

That is clearly not the number of structurally deficient bridges in Florida. And it’s not even a number at all, it’s a string. If you look carefully and compare against the first few rows of the DataFrame shown below, you may figure out that the string is actually a concatenation of the numbers in the column, not the sum.
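
This happens because pandas applies the + operator across the column, and for Python strings + concatenates. A minimal sketch of the same effect:

# summing an object-dtype (string) column concatenates rather than adds
pd.Series(['8', '112', '7']).sum()  # returns '81127', not 127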

[27]:
bridges.head()
[27]:
County Total # Good # Fair # Poor # SD # Total Area Good Area Fair Area Poor Area SD Area
0 ALACHUA (001) 111 64 47 - - 64767 55794 8973 NaN NaN
1 BAKER (003) 89 30 52 7 8 32162 19369 12282 510.0 623.0
2 BAY (005) 122 49 63 10 11 210039 98834 109628 1577.0 10120.0
3 BRADFORD (007) 62 23 37 2 2 9330 5492 3217 620.0 620.0
4 BREVARD (009) 241 160 81 - - 364138 204179 159959 NaN NaN

This is happening because the values in that column are read by pandas as strings, not numbers, because the zero values are given as '-' instead of 0. We can see the problem clearly if we look at the info for the bridges DataFrame.

[28]:
bridges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 11 columns):
County        68 non-null object
Total #       68 non-null int64
Good #        68 non-null int64
Fair #        68 non-null int64
Poor #        68 non-null object
SD #          68 non-null object
Total Area    68 non-null int64
Good Area     68 non-null int64
Fair Area     68 non-null int64
Poor Area     47 non-null float64
SD Area       49 non-null float64
dtypes: float64(2), int64(6), object(3)
memory usage: 6.0+ KB

To fix this, we can use the replace method to fill all the ‘-’ values with zeros, and then change the dtype of the affected columns back to integer.

[29]:
bridges = bridges.replace('-', 0)
bridges[['Poor #', 'SD #']] = bridges[['Poor #', 'SD #']].astype(int)

Then we are able to compute the total number of structurally deficient bridges in Florida.

[30]:
bridges['SD #'].sum()
[30]:
512

Whoa, that seems high. Let’s check on the values in the table and see where all those structurally deficient bridges are. We can sort on the ‘SD #’ column and take a look at the highest few rows:

[31]:
bridges.sort_values('SD #', ascending=False).head()
[31]:
County Total # Good # Fair # Poor # SD # Total Area Good Area Fair Area Poor Area SD Area
67 TOTALS 12313 8534 3545 233 256 16759416 11232342 5149876 376033.0 402201.0
14 DUVAL (031) 763 479 262 22 22 1795247 932820 811836 50590.0 50590.0
15 ESCAMBIA (033) 243 116 111 16 18 585435 259541 233423 92471.0 92585.0
42 MIAMI-DADE (086) 957 804 146 7 13 1743024 1315449 404966 22609.0 33739.0
45 OKALOOSA (091) 216 111 92 13 13 269312 77988 171330 19993.0 19993.0

Oops! The data we loaded includes a ‘TOTALS’ row, but we don’t want that row included in our DataFrame for analysis. We can drop it (using the index for that row, 67), to fix the problem:

[32]:
bridges.drop(67, inplace=True)
bridges['SD #'].sum()
[32]:
256

A slightly different problem occurs in the data for ‘SD Area’. In this column, the zero values were not given as ‘-’ values, but instead were omitted entirely. When read in by pandas, omitted values are valid input and are different from zero values.

If the intention is that missing values are indeed missing, this behavior is expected and useful. If, however, the missing values should be interpreted as zeros, then that instruction needs to be applied to the DataFrame explicitly, or problems may arise. For example, missing values don’t count in the denominator when computing the mean for a column:

[33]:
bridges['SD Area'].mean()
[33]:
8379.208333333334

If we compute the mean manually using the actual number of counties, we get a different value:

[34]:
bridges['SD Area'].sum() / len(bridges['SD Area'])
[34]:
6003.014925373134

We can use the fillna method to set all the missing values to zero, and then we get the correct answer:

[35]:
bridges.fillna(0, inplace=True)
bridges['SD Area'].mean()
[35]:
6003.014925373134
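
As an aside, much of this cleaning could have been handled at read time instead. A sketch of that approach, using read_csv arguments that treat ‘-’ as missing and skip the trailing TOTALS row:

bridges_alt = pd.read_csv(
    tt.data('FL-BRIDGES'),
    na_values='-',     # read the '-' placeholders as missing
    skipfooter=1,      # omit the TOTALS row at the bottom of the file
    engine='python',   # skipfooter requires the python parser engine
).fillna(0)
# note the '#' columns come back as float64 here, since NaN forces a
# floating point dtype before fillna; cast with astype(int) if needed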

Expanding

New columns can be added to an existing DataFrame simply by assigning a value to a new column label, using plain square bracket notation. This makes it easy to add derived values to a DataFrame.

[36]:
fl['Population Density'] = fl['2019 Population'] / fl['Land Area']
[37]:
fl.head()
[37]:
Name 2019 Population Growth Since 2010 Land Area Population Density
0 Miami-Dade County 2751796 9.754518 1897.72 1450.053749
1 Broward County 1935878 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150 11.134781 1969.76 746.867639
3 Hillsborough County 1408566 14.187346 1020.21 1380.662805
4 Orange County 1348975 17.435335 903.43 1493.170473

Adding rows to a DataFrame can be done as well, using the loc indexer (see the next section on slicing for more about loc).

[38]:
fl.loc[999,['Name', '2019 Population']] = ('Dry Tortugas', 0)
[39]:
fl.tail()
[39]:
Name 2019 Population Growth Since 2010 Land Area Population Density
63 Glades County 13754.0 6.843782 806.01 17.064304
64 Franklin County 11727.0 1.779205 534.73 21.930694
65 Lafayette County 8451.0 -4.074915 543.41 15.551793
66 Liberty County 8242.0 -1.269765 835.56 9.864043
999 Dry Tortugas 0.0 NaN NaN NaN

Slicing

There are two main ways to slice a DataFrame, by label and by position.

Slicing by Position

Slicing by position is done using the iloc indexer. This makes the DataFrame operate in a manner similar to a numpy array.

Giving only one index or set of indexes selects from the rows only.

[40]:
fl.iloc[1:3]
[40]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639

Giving two sets of indexes selects from the rows and columns.

[41]:
states.iloc[1:3, 2:5]
[41]:
AFFGEOID GEOID NAME
STUSPS
AK 0400000US02 02 Alaska
AZ 0400000US04 04 Arizona

See more in Selection by Position in the Pandas documentation.

Slicing by Label

Slicing by label is done using the loc indexer. Instead of finding rows or columns based on the raw position in the DataFrame, the loc indexer finds rows or columns based upon the values in the index or columns attributes of the DataFrame.

[42]:
states.loc['AK':'CA']
[42]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
CA 06 01779778 0400000US06 06 California 00 403501101370 20466718403

An important feature of the loc selector is that the resulting selection includes both the starting and the ending label. This is different from the usual Python slicing process, where the slice runs from the starting position up to but not including the ending position.

It can also be confusing to use the loc selector when the index being used has integer values. Care must be taken to select the rows you want, even when the index is a RangeIndex starting from zero. Consider these two selections, which are very similar but yield different results:

[43]:
fl.loc[1:3]
[43]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805
[44]:
fl.iloc[1:3]
[44]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639

See more in Selection by Label in the Pandas documentation.

Sorting

Sorting must be done explicitly, using the index or one or more columns of values.

[45]:
states.sort_index().head()
[45]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
AS 60 01802701 0400000US60 60 American Samoa 00 197759069 1307243753
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
[46]:
states.sort_values('NAME').head()
[46]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AS 60 01802701 0400000US60 60 American Samoa 00 197759069 1307243753
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
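
Sorting can also be done on multiple columns at once, or in descending order, by passing lists to sort_values. A sketch using the land area column:

# largest land area first, breaking any ties alphabetically by name
states.sort_values(['ALAND', 'NAME'], ascending=[False, True]).head()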

Merging

If we have two DataFrames that contain related data, we may want to merge the contents into a single DataFrame for some kinds of analysis. For example, suppose we have a DataFrame that enumerates counties in Florida by FDOT District, like this one:

[47]:
districts = pd.read_csv(tt.data('FL-COUNTY-BY-DISTRICT'))
districts.head()
[47]:
County District
0 Charlotte 1
1 Collier 1
2 DeSoto 1
3 Glades 1
4 Hardee 1

This is useful information if we want to compute statistics by FDOT District instead of by county, but it will need to be merged with the other available Florida county data we have.

Pandas provides a merge function, which provides the core functionality for all standard database-style join operations between DataFrames. These operations let the analyst connect two DataFrames that are related in some systematic way.

The principal arguments to the merge function include:

  • left: The left-side DataFrame to merge.
  • right: The right-side DataFrame to merge (can also be a named pandas.Series instead of a DataFrame).
  • how: Type of merge to be performed, defaults to ‘inner’.
    • ‘left’ or ‘right’: use only keys from the left or right frame, similar to a SQL left or right outer join; rows from the other DataFrame that do not match one of these keys are dropped.
    • ‘outer’: use the union of keys from both frames, similar to a SQL full outer join; keeps all rows from both DataFrames.
    • ‘inner’: use the intersection of keys from both frames, similar to a SQL inner join; rows in either DataFrame that do not match one of the keys in the other are dropped.

In addition, some combination of selector arguments must be given, to indicate the keys to use for the join:

  • on: A label or list, giving column or index level names to join on, which must be found in both DataFrames.
  • left_on, right_on: A label or list, giving column or index level names to join on in the left or right DataFrame, respectively.
  • left_index, right_index: A bool (defaults to False) indicating if the index should be used instead of data columns, for the left or right DataFrame, respectively.

Advanced users will find details on a variety of other merge arguments in the function documentation. There are also merge and join shorthand methods available as methods on a DataFrame, although each represents an application of the more general merge method illustrated here.
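
To see how the how argument changes the result, consider this small sketch with toy DataFrames (the frames and key values here are invented for illustration):

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
pd.merge(left, right, on='key', how='inner')  # only 'b', the common key
pd.merge(left, right, on='key', how='outer')  # 'a', 'b', and 'c', with NaNs filling the gaps
pd.merge(left, right, on='key', how='left')   # 'a' and 'b'; the unmatched 'c' is dropped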

To merge the county data with the District numbers, we can merge the two datasets using the merge function. In the population data file we previously loaded, the county names are given in a column named ‘Name’, while in the district membership table, the county names are given in a column named ‘County’, so we'll use the left_on and right_on arguments to identify that the merge keys have different names.

[48]:
pd.merge(fl, districts, left_on='Name', right_on='County')
[48]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District

In this instance, the ‘inner’ merge resulted in a new DataFrame, but with zero rows of data. This is because, as it turns out, the merge keys are not actually the same:

[49]:
fl.head()
[49]:
Name 2019 Population Growth Since 2010 Land Area Population Density
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805
4 Orange County 1348975.0 17.435335 903.43 1493.170473
[50]:
districts.head()
[50]:
County District
0 Charlotte 1
1 Collier 1
2 DeSoto 1
3 Glades 1
4 Hardee 1

The names of the counties in the first table are given with the appellation “Name County” (e.g., “Broward County”), while in the second table they are identified merely as “Name” (e.g., “Broward”). In order to use the merge function, the keys must match exactly across the two tables. To make this work, we'll need to strip the ‘ County’ suffix from the names in the first table. We can do so with the str.replace method, like this:

[51]:
fl['County'] = fl['Name'].str.replace(' County', '')

Then, if we attempt the merge again, we’ll get the results we expect.

[52]:
fl_2 = pd.merge(fl, districts, on='County')
fl_2.head()
[52]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5

The merge here, adding the FDOT district to the County table, is a one-to-one merge (i.e., every row in the left table matches one and only one row in the right table). However, a one-to-one relationship is not required to undertake a merge. We can also merge data that has one-to-many and many-to-one relationships, and even many-to-many, although that merge type is less common in most transportation applications.

As an example, consider this table that gives the FDOT District name and headquarters location for each district:

[53]:
district_info = pd.read_csv(tt.data('FL-DISTRICTS'), index_col='District')
district_info
[53]:
Name Headquarters
District
1 Southwest Florida Bartow
2 Northeast Florida Lake City
3 Northwest Florida Chipley
4 Southeast Florida Fort Lauderdale
5 Central Florida DeLand
6 South Florida Miami
7 West Central Florida Tampa

If we want to add a column that gives the district name to the counties table we already have, we can use a many-to-one merge. This is actually just detected automatically, so there is no extra argument to give.

[54]:
pd.merge(fl_2, district_info, on='District').head()
[54]:
Name_x 2019 Population Growth Since 2010 Land Area Population Density County District Name_y Headquarters
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 South Florida Miami
1 Monroe County 77013.0 5.181715 983.28 78.322553 Monroe 6 South Florida Miami
2 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 Southeast Florida Fort Lauderdale
3 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 Southeast Florida Fort Lauderdale
4 St. Lucie County 313506.0 12.659688 571.93 548.154494 St. Lucie 4 Southeast Florida Fort Lauderdale

Because both the left and right DataFrames have a column named ‘Name’, in the merged result the column from the left DataFrame has a ‘_x’ appended, while the column from the right DataFrame has a ‘_y’ appended. This is configurable using the suffixes argument, or by just changing the column names before merging. In fact, many merge details can be controlled easily by manipulating the left and right arguments before merging. For example, if we only want to get the district name, not the headquarters, we can slice the right DataFrame to only include what we want:

[55]:
pd.merge(fl_2, district_info[['Name']].add_prefix('District_'), on='District').head()
[55]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District_Name
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 South Florida
1 Monroe County 77013.0 5.181715 983.28 78.322553 Monroe 6 South Florida
2 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 Southeast Florida
3 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 Southeast Florida
4 St. Lucie County 313506.0 12.659688 571.93 548.154494 St. Lucie 4 Southeast Florida
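
Alternatively, if we want to keep both ‘Name’ columns but with more meaningful labels, the suffixes argument can be given directly; a sketch (the suffix strings are our own choice):

# control the labels given to overlapping columns explicitly
pd.merge(fl_2, district_info, on='District', suffixes=('_county', '_district')).head()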

Multi-level Column and Index Values

Often the data we are working with has multiple categories of columns. For example, the bridge data we worked with above has two kinds of measures: the number of bridges by condition, and the deck area by condition. We can explicitly gather together those groups of columns using a multi-level index.

To demonstrate this, we'll make a copy of the bridges DataFrame with the county name as the index.

[56]:
bridges2 = bridges.set_index('County')

This DataFrame now has ten columns: five counts and five areas.

[57]:
bridges2.columns
[57]:
Index(['Total #', 'Good #', 'Fair #', 'Poor #', 'SD #', 'Total Area',
       'Good Area', 'Fair Area', 'Poor Area', 'SD Area'],
      dtype='object')

We can use a list of tuples to create a MultiIndex that gives both the column’s category and a name. Notice that we can duplicate the condition names without trouble, because the duplicate values have different categories.

[58]:
bridges2.columns = pd.MultiIndex.from_tuples(
    [('Count','Total'), ('Count','Good'), ('Count','Fair'), ('Count','Poor'), ('Count','SD'),
     ('Area','Total'), ('Area','Good'), ('Area','Fair'), ('Area','Poor'), ('Area','SD'),     ],
    names=['measure', 'condition'],
)
[59]:
bridges2.head()
[59]:
measure Count Area
condition Total Good Fair Poor SD Total Good Fair Poor SD
County
ALACHUA (001) 111 64 47 0 0 64767 55794 8973 0.0 0.0
BAKER (003) 89 30 52 7 8 32162 19369 12282 510.0 623.0
BAY (005) 122 49 63 10 11 210039 98834 109628 1577.0 10120.0
BRADFORD (007) 62 23 37 2 2 9330 5492 3217 620.0 620.0
BREVARD (009) 241 160 81 0 0 364138 204179 159959 0.0 0.0

MultiIndexes can be used for both the columns and the index of a DataFrame.

Stacking and Unstacking

One advantage of using this multi-index approach to grouping similar data is that it makes it much easier to reshape the DataFrame for different types of analysis. For example, we can transform the bridges DataFrame to have a row for each county and condition, and just two columns (count and area), by stacking the data.

[60]:
bridges3 = bridges2.stack()
bridges3.head(12)
[60]:
measure Area Count
County condition
ALACHUA (001) Fair 8973.0 47
Good 55794.0 64
Poor 0.0 0
SD 0.0 0
Total 64767.0 111
BAKER (003) Fair 12282.0 52
Good 19369.0 30
Poor 510.0 7
SD 623.0 8
Total 32162.0 89
BAY (005) Fair 109628.0 63
Good 98834.0 49

One thing you might notice here is that the conditions have been reordered. The new order isn’t consistent with the categorical meanings, but instead is simply alphabetic. We can preserve the ordering by converting these values into ordered categorical values like this:

[61]:
conditions = pd.CategoricalIndex(
    bridges2.columns.levels[1],
    ordered=True,
    categories=['Total', 'Good', 'Fair', 'Poor', 'SD'],
)
bridges2.columns = bridges2.columns.set_levels(conditions, level=1)
bridges2.stack().head(12)
[61]:
measure Area Count
County condition
ALACHUA (001) Total 64767.0 111
Good 55794.0 64
Fair 8973.0 47
Poor 0.0 0
SD 0.0 0
BAKER (003) Total 32162.0 89
Good 19369.0 30
Fair 12282.0 52
Poor 510.0 7
SD 623.0 8
BAY (005) Total 210039.0 122
Good 98834.0 49

Admittedly, that's a bit complicated, and it may be easier to simply rename the conditions in a way that makes alphabetic sorting work for us.
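
For example, prefixing each condition label with a digit makes plain alphabetic sorting match the intended order. A sketch (the prefixed labels are our own invention):

# hypothetical renaming so alphabetic order gives Total, Good, Fair, Poor, SD
renames = {'Total': '0-Total', 'Good': '1-Good', 'Fair': '2-Fair',
           'Poor': '3-Poor', 'SD': '4-SD'}
bridges2b = bridges2.copy()
bridges2b.columns = pd.MultiIndex.from_tuples(
    [(measure, renames[str(condition)]) for measure, condition in bridges2.columns],
    names=bridges2.columns.names,
)
bridges2b.stack().head()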

Now, for example, we can easily get the average deck area of bridges by condition.

[62]:
bridges3['AvgArea'] = bridges3['Area'] / bridges3['Count']
bridges3.head(12)
[62]:
measure Area Count AvgArea
County condition
ALACHUA (001) Fair 8973.0 47 190.914894
Good 55794.0 64 871.781250
Poor 0.0 0 NaN
SD 0.0 0 NaN
Total 64767.0 111 583.486486
BAKER (003) Fair 12282.0 52 236.192308
Good 19369.0 30 645.633333
Poor 510.0 7 72.857143
SD 623.0 8 77.875000
Total 32162.0 89 361.370787
BAY (005) Fair 109628.0 63 1740.126984
Good 98834.0 49 2017.020408

We can also reverse this transformation, returning to a wider format, with the unstack method.

[63]:
bridges4 = bridges3.unstack()
bridges4.head()
[63]:
measure Area Count AvgArea
condition Fair Good Poor SD Total Fair Good Poor SD Total Fair Good Poor SD Total
County
ALACHUA (001) 8973.0 55794.0 0.0 0.0 64767.0 47 64 0 0 111 190.914894 871.781250 NaN NaN 583.486486
BAKER (003) 12282.0 19369.0 510.0 623.0 32162.0 52 30 7 8 89 236.192308 645.633333 72.857143 77.875 361.370787
BAY (005) 109628.0 98834.0 1577.0 10120.0 210039.0 63 49 10 11 122 1740.126984 2017.020408 157.700000 920.000 1721.631148
BRADFORD (007) 3217.0 5492.0 620.0 620.0 9330.0 37 23 2 2 62 86.945946 238.782609 310.000000 310.000 150.483871
BREVARD (009) 159959.0 204179.0 0.0 0.0 364138.0 81 160 0 0 241 1974.802469 1276.118750 NaN NaN 1510.946058

If you unstack a DataFrame whose index doesn't have at least two levels, you'll get a one-dimensional Series instead of a DataFrame.

[64]:
bridges4.unstack()
[64]:
measure  condition  County
Area     Fair       ALACHUA (001)         8973.000000
                    BAKER (003)          12282.000000
                    BAY (005)           109628.000000
                    BRADFORD (007)        3217.000000
                    BREVARD (009)       159959.000000
                                            ...
AvgArea  Total      UNION (125)            447.107143
                    VOLUSIA (127)         1439.052863
                    WAKULLA (129)          355.660000
                    WALTON (131)           475.378995
                    WASHINGTON (133)       751.208333
Length: 1005, dtype: float64

The pandas documentation includes a lot more detail about reshaping data, including a variety of other tools for reshaping in other ways.

Grouping

A common task in data analysis is to conduct group-wise analysis. This involves splitting a DataFrame into groups of related rows, then computing some sort of measure on each group. The results are then usually combined into a final coherent data structure, generally having one row for each original row (a transformation), or one row for each group (an aggregation). It is possible to introduce some filtering in the process so the results do not conform to one of these patterns, but that is a more advanced topic that will not be discussed here.

For example, we might group counties by district, and then total the population values, to get to total population by district:

[65]:
fl_2.head()
[65]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5
[66]:
fl_2.groupby('District')[['2019 Population']].sum()
[66]:
2019 Population
District
1 3048172.0
2 2139433.0
3 1468387.0
4 4034840.0
5 4227713.0
6 2828809.0
7 3237046.0

The splitting into groups is done using the groupby method of DataFrames, which is fairly flexible: it can be used to group rows or columns, and to group on an index or on values. Pandas provides an extensive tutorial that covers groupby in more depth. In this tutorial, we'll focus on the most common tasks for data analysis in the transportation context (and, indeed, most data analysis): grouping rows using one or more columns, or a function thereof, as the grouping keys.

In the example above, the groupby method is called with a single argument ‘District’, which indicates that the groups are delineated by common values of the column named ‘District’. Grouping by an identifier like this is convenient, because the identifiers are naturally well defined and the groups are inherently meaningful.

As with merging, groupby keys need to be exactly identical in order to match. Attempting to use groupby directly with continuous data that is not categorical in nature will not give good results – you can expect that more or less every row will be its own group. Instead, if you want to group rows in this way, you'll need to convert the data to categories first. This can be done using pandas.cut or pandas.qcut.

  • Using cut defines categorical bins either by dividing the entire range of values into a number of equal sized bins (by giving the number of bins as an integer) or by giving a list of explicit breakpoints for the bins.
  • Using qcut defines categorical bins either by setting the bin edges so that each bin contains a roughly equal number of observations (by giving the number of bins as an integer) or by giving a list of explicit quantile levels for the bin breakpoints (e.g., [0, 10, 25, 50, 75, 90, 100]).
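
For comparison, here is a quick sketch of qcut applied to the same population data used below (the choice of quartiles is our own):

# four bins, each containing roughly the same number of counties
population_quartiles = pd.qcut(fl_2['2019 Population'], q=4)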

Let’s use cut to create five categorical values of county population:

[67]:
population_binned = pd.cut(fl_2['2019 Population'], bins=5)
population_binned.head()
[67]:
0    (2203085.2, 2751796.0]
1    (1654374.4, 2203085.2]
2    (1105663.6, 1654374.4]
3    (1105663.6, 1654374.4]
4    (1105663.6, 1654374.4]
Name: 2019 Population, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]
[68]:
population_binned
[68]:
0     (2203085.2, 2751796.0]
1     (1654374.4, 2203085.2]
2     (1105663.6, 1654374.4]
3     (1105663.6, 1654374.4]
4     (1105663.6, 1654374.4]
               ...
62      (5498.446, 556952.8]
63      (5498.446, 556952.8]
64      (5498.446, 556952.8]
65      (5498.446, 556952.8]
66      (5498.446, 556952.8]
Name: 2019 Population, Length: 67, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]

Those bins divide the range of population values evenly into 5 bins, but the resulting breakpoints appear a bit arbitrary. We can set more reasonable bins by explicitly giving boundaries that are round numbers of our choosing:

[69]:
bins=[5_000, 500_000, 1_000_000, 1_500_000, 2_000_000, 3_000_000]
population_binned = pd.cut(fl_2['2019 Population'], bins=bins)
population_binned.head()
[69]:
0    (2000000, 3000000]
1    (1500000, 2000000]
2    (1000000, 1500000]
3    (1000000, 1500000]
4    (1000000, 1500000]
Name: 2019 Population, dtype: category
Categories (5, interval[int64]): [(5000, 500000] < (500000, 1000000] < (1000000, 1500000] < (1500000, 2000000] < (2000000, 3000000]]

We can then pass this categorical data directly to the groupby method, instead of naming existing columns in the DataFrame to use as the groupby keys.

[70]:
fl_2.groupby(population_binned)[['2019 Population']].sum()
[70]:
2019 Population
2019 Population
(5000, 500000] 7080260.0
(500000, 1000000] 4987775.0
(1000000, 1500000] 4228691.0
(1500000, 2000000] 1935878.0
(2000000, 3000000] 2751796.0

Aggregation

The examples above all use the sum method, which is an aggregate function: it takes a set of values drawn from a group of rows, and returns a single output value. Other common aggregation functions include:

function   description
mean       within-group means
std        within-group standard deviation
min        within-group minimum
max        within-group maximum
first      first value in group
last       last value in group
size       number of group members, including NaN
count      number of group members, excluding NaN

When aggregation functions are applied, the result is generally one row per group, instead of one row per row.

Multiple aggregation functions can be applied simultaneously using the agg method, like this:

[71]:
fl_2.groupby(population_binned)[['2019 Population']].agg([np.mean, np.std, np.size])
[71]:
2019 Population
mean std size
2019 Population
(5000, 500000] 1.287320e+05 130660.887798 55.0
(500000, 1000000] 7.125393e+05 182201.018266 7.0
(1000000, 1500000] 1.409564e+06 61093.609816 3.0
(1500000, 2000000] 1.935878e+06 NaN 1.0
(2000000, 3000000] 2.751796e+06 NaN 1.0

Transformation

In contrast with aggregation, which returns one value for each group, transform returns a value for each row in the original DataFrame. Many of the same functions used for aggregation can also be used with transform; the results are simply broadcast back to the original rows (i.e., a copy of the relevant aggregated value is attached to each original row, so that if there are multiple rows in a group, then there are multiple copies of the aggregate output). The return value is a new Series or DataFrame, indexed the same as the original, which makes it easy to use transform to create new derived columns based on group-wise aggregations or transforms.

For example, to add a new column that contains the FDOT District total population for each county, we can give the sum function to transform, and write the result as a new column in the original DataFrame:

[72]:
pop_sum = fl_2.groupby('District')['2019 Population'].transform(sum)
pop_sum.head()
[72]:
0    2828809.0
1    4034840.0
2    4034840.0
3    3237046.0
4    4227713.0
Name: 2019 Population, dtype: float64
[73]:
fl_2.head()
[73]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5

The second and third counties in the fl_2 DataFrame are both in District 4, so both rows of this output give the total population in District 4. Generating the output like this makes it super easy to attach the result as a new column in the DataFrame:

[74]:
fl_2['District Population'] = fl_2.groupby('District')['2019 Population'].transform(sum)
[75]:
fl_2.head()
[75]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District Population
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 2828809.0
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 4034840.0
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 4034840.0
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7 3237046.0
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5 4227713.0

In addition to expanding aggregate measures to match with the rows of the original DataFrame, the transform function can also generate unique row-wise values, which can be computed relative to the group instead of relative to the entire DataFrame. For example, to compute the fraction of the population of each district living in each county:

[76]:
fl_2['Fraction of Pop'] = fl_2.groupby('District')['2019 Population'].transform(lambda x: (x / x.sum()))
fl_2.head()
[76]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District Population Fraction of Pop
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 2828809.0 0.972775
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 4034840.0 0.479791
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 4034840.0 0.364612
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7 3237046.0 0.435139
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5 4227713.0 0.319079