[1]:
import os
import gzip
import numpy as np
import pandas as pd
import transportation_tutorials as tt

Basic Data Analysis with Pandas

Pandas is the de facto standard for statistical analysis of tabular data using Python.

The basic data structure in pandas is a DataFrame. A DataFrame is a two-dimensional table of data, with optional row and column labels.

You can construct a DataFrame from raw data in a few different ways. One simple way is to pass a dictionary, as here:

[2]:
raw = {'A': [7,8,9], 'B': [1,2,3], 'C': [4.1,5.2,6.3]}
df = pd.DataFrame(raw)
df
[2]:
A B C
0 7 1 4.1
1 8 2 5.2
2 9 3 6.3

Row and Column Labels

In addition to the data in the table, DataFrames are defined with row and column labels, identified as index and columns, respectively. You might notice that in the first DataFrame we created, the keys of the raw dictionary were adopted automatically as the column labels. The row labels are not drawn from the raw dictionary as there’s nothing in it to use for that purpose. But every DataFrame must have some kind of row labels, so some are created automatically. If we provide more info to the DataFrame constructor so that it can see what the row labels should be, then it will use those:

[3]:
raw_los = {
    'Speed': {'Car': 60, 'Bus': 20, 'Walk':3},
    'Cost': {'Car': 3.25, 'Bus': 1.75, 'Walk':0},
}
los = pd.DataFrame(raw_los)
los
[3]:
Speed Cost
Car 60 3.25
Bus 20 1.75
Walk 3 0.00
[4]:
los.index
[4]:
Index(['Car', 'Bus', 'Walk'], dtype='object')
[5]:
los.columns
[5]:
Index(['Speed', 'Cost'], dtype='object')

The values for either index or columns can be changed, simply by overwriting the appropriate attribute of the DataFrame.

[6]:
los.index = [2,1,3]
los
[6]:
Speed Cost
2 60 3.25
1 20 1.75
3 3 0.00

You can also choose an existing column of the DataFrame to be the index, using the set_index method.

[7]:
los.set_index('Speed')
[7]:
Cost
Speed
60 3.25
20 1.75
3 0.00

This probably isn’t a good thing to do with measured attributes like ‘Speed’, but it can come in handy to set the index as an identifier code (e.g. a TAZ code), especially after reading data from a file.

As you might notice, the index or columns can be made up of text, or numbers – or, in fact, any Python object.

[8]:
los.index = [{'key':'value'},set([None]),bytes(2)]
los
[8]:
Speed Cost
{'key': 'value'} 60 3.25
{None} 20 1.75
b'\x00\x00' 3 0.00

But you’ll find things much simpler if you stick to strings or integers for these labels, unless there is some very compelling reason to deviate from that plan.

Every DataFrame has both index and columns, even if one or both of these sets of labels are not defined when creating the DataFrame. For example, the example df created from raw above only has column names, but the index with row names was created by default.

[9]:
df.index
[9]:
RangeIndex(start=0, stop=3, step=1)

This default index is a RangeIndex, which is a memory-saving feature (at least when the DataFrame is large), as the actual index values are not stored, just the start, stop, and step values (i.e., start at 0, stop before you get to 3, counting by 1’s).
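
For most purposes, a RangeIndex behaves exactly like a fully materialized index; only the storage differs. A quick sketch to confirm the equivalence:

# the lazily generated labels compare equal to an explicit Index
pd.RangeIndex(start=0, stop=3, step=1).equals(pd.Index([0, 1, 2]))  # True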

Data Types

Within the DataFrame, each column of data has a uniform data type, but the data types are permitted to vary across the columns. In the example DataFrame we created here, the data type was inferred as int64 for columns 'A' and 'B', and float64 for 'C' (because the values in that column are not all whole numbers). Note that all the values displayed in column 'C' have decimal values. We can also check explicitly on the data types for each column:

[10]:
df
[10]:
A B C
0 7 1 4.1
1 8 2 5.2
2 9 3 6.3
[11]:
df.dtypes
[11]:
A      int64
B      int64
C    float64
dtype: object

If we try to force data into a DataFrame that doesn't follow this pattern of maintaining one common data type in each column, the underlying data will be upcast to preserve this one-data-type-per-column rule. For example, if we take the transpose of the example DataFrame (swapping the rows and columns), the integer data gets upcast to floats:

[12]:
df.T
[12]:
0 1 2
A 7.0 8.0 9.0
B 1.0 2.0 3.0
C 4.1 5.2 6.3

The numerical values of each data cell are preserved, but now all three columns are represented as floating point numbers.

[13]:
df.T.dtypes
[13]:
0    float64
1    float64
2    float64
dtype: object

Reading from a File

For virtually all transportation data analysis, the data used will be loaded from a file, instead of being entered directly in Python commands. Fortunately, pandas comes with a host of data reading functions (including read_csv, read_excel, read_json, read_html, and read_sql, among others) that can be used to read data in a variety of formats. You can click through to the documentation for each of these.

In this tutorial, we'll cover the most basic reading function in the pandas library, which is read_csv. Despite the name, this function isn't just for reading comma-separated values, but also pretty much any delimited text file, including tab- and space-delimited files.
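
For example, reading a tab-delimited or whitespace-delimited file only requires passing the delimiter explicitly; a sketch, with hypothetical file names:

# tab-delimited: give the sep argument
skims = pd.read_csv('skims.tsv', sep='\t')
# runs of whitespace can be matched with a regular expression delimiter
counts = pd.read_csv('counts.txt', sep=r'\s+')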

A very small example csv data file is included in the transportation_tutorials package:

[14]:
os.path.basename(tt.data('FL-COUNTY-POP'))
[14]:
'FL-COUNTY-POP.csv.gz'

For most csv files, you can just pass the name of the file to the read_csv function, and pandas will figure out the rest. This includes, as in this example, files that are gzipped on disk, which are transparently decompressed while reading.

[15]:
fl = pd.read_csv(tt.data('FL-COUNTY-POP'))
[16]:
fl
[16]:
Name 2019 Population Growth Since 2010 Land Area
0 Miami-Dade County 2751796 9.754518 1897.72
1 Broward County 1935878 10.438146 1209.79
2 Palm Beach County 1471150 11.134781 1969.76
3 Hillsborough County 1408566 14.187346 1020.21
4 Orange County 1348975 17.435335 903.43
5 Pinellas County 970637 5.908084 273.80
6 Duval County 937934 8.349131 762.19
7 Lee County 739224 19.139938 784.51
8 Polk County 686483 13.819128 1797.84
9 Brevard County 589162 8.306616 1015.66
10 Volusia County 538692 8.949042 1101.03
11 Pasco County 525643 12.922512 746.89
12 Seminole County 462659 9.358846 309.22
13 Sarasota County 419119 10.310730 555.87
14 Manatee County 385571 19.209805 742.93
15 Collier County 372880 15.585507 1998.32
16 Marion County 354353 6.944143 1584.55
17 Osceola County 352180 30.508575 1327.45
18 Lake County 346017 16.228536 938.38
19 Escambia County 313512 5.192661 656.46
20 St. Lucie County 313506 12.659688 571.93
21 Leon County 290292 5.189313 666.85
22 Alachua County 266944 7.809988 875.02
23 St. Johns County 243812 27.492731 600.66
24 Clay County 212230 10.863853 604.36
25 Okaloosa County 202970 12.315593 930.25
26 Hernando County 186553 7.857172 472.54
27 Bay County 183563 8.484283 758.46
28 Charlotte County 182033 13.863851 680.28
29 Santa Rosa County 174272 13.969564 1011.61
... ... ... ... ...
37 Monroe County 77013 5.181715 983.28
38 Putnam County 73464 -0.997251 727.62
39 Columbia County 69612 3.043401 797.57
40 Walton County 68376 23.840400 1037.63
41 Jackson County 48330 -2.650767 917.76
42 Gadsden County 46071 -3.607072 516.33
43 Suwannee County 44190 4.406379 688.55
44 Okeechobee County 41605 3.934549 768.91
45 Levy County 40355 -0.906100 1118.21
46 Hendry County 40347 3.414072 1152.75
47 DeSoto County 36862 5.512938 637.06
48 Wakulla County 32120 4.201135 606.42
49 Baker County 28283 4.511862 585.23
50 Hardee County 27411 -1.132552 637.78
51 Bradford County 27038 -5.249509 293.96
52 Washington County 24567 -0.622952 582.80
53 Taylor County 21833 -3.338203 1043.31
54 Holmes County 19558 -1.446208 478.78
55 Madison County 18449 -4.151081 695.95
56 Gilchrist County 17743 4.364449 349.68
57 Dixie County 16673 1.658435 705.05
58 Gulf County 16160 2.123357 564.01
59 Union County 15517 -0.180122 243.56
60 Calhoun County 14483 -1.112932 567.33
61 Hamilton County 14184 -3.398488 513.79
62 Jefferson County 14144 -4.095471 598.10
63 Glades County 13754 6.843782 806.01
64 Franklin County 11727 1.779205 534.73
65 Lafayette County 8451 -4.074915 543.41
66 Liberty County 8242 -1.269765 835.56

67 rows × 4 columns

Within a Jupyter notebook, by default, if you try to view a large DataFrame, the middle section is not displayed. If you look carefully above, you'll see a row of ellipses marking the place where some rows are not shown. Still, we get the first 30 and last 30 rows of the table. If you don't want to review so large a section of a DataFrame, you can use the head method of DataFrames to see just the first few rows.

[17]:
fl.head()
[17]:
Name 2019 Population Growth Since 2010 Land Area
0 Miami-Dade County 2751796 9.754518 1897.72
1 Broward County 1935878 10.438146 1209.79
2 Palm Beach County 1471150 11.134781 1969.76
3 Hillsborough County 1408566 14.187346 1020.21
4 Orange County 1348975 17.435335 903.43

One notable tabular file format you may encounter in transportation planning that is not directly readable by pandas is the DBF file. Fortunately, there are other related tools that can read this kind of file, including geopandas. Geopandas includes just one main reading function: read_file. When you read a DBF file with geopandas.read_file, you’ll get a GeoDataFrame object instead of a regular pandas.DataFrame – but don’t worry, everything you can do with a DataFrame you can also do with a GeoDataFrame.

[18]:
import os
import geopandas as gpd
[19]:
os.path.basename(tt.data('US-STATES'))
[19]:
'US-STATES.dbf'
[20]:
states = gpd.read_file(tt.data('US-STATES'))
states.head()
[20]:
STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD ALAND AWATER geometry
0 01 01779775 0400000US01 01 AL Alabama 00 131173688951 4593686489 None
1 02 01785533 0400000US02 02 AK Alaska 00 1477946266785 245390495931 None
2 04 01779777 0400000US04 04 AZ Arizona 00 294198560125 1027346486 None
3 05 00068085 0400000US05 05 AR Arkansas 00 134771517596 2960191698 None
4 06 01779778 0400000US06 06 CA California 00 403501101370 20466718403 None
[21]:
import larch

An alternate DBF reading implementation is included in Larch. This optimized DBF reader can currently only handle text and numeric data types (i.e., types 'C', 'F', and 'N'; no dates, memos, or logical fields). However, it can read data quite fast even for large files, and will extract the basic contents of the file (headers, number of rows) even without actually reading the data.

[22]:
f = larch.DBF(tt.data('US-STATES'))

To actually load the data into a DataFrame, use the load_dataframe method.

[23]:
states = f.load_dataframe()
states.set_index('STUSPS', inplace=True)
states.head()
[23]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
CA 06 01779778 0400000US06 06 California 00 403501101370 20466718403

Data Cleaning

Sometimes, data that is read in from raw files will require some cleaning to be usable. Consider this example of data on bridges in Florida:

[24]:
with gzip.open(tt.data('FL-BRIDGES'), 'rt') as previewfile:
    print(*(next(previewfile) for x in range(6)))
County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area
 ALACHUA (001),111,64,47,-,-,64767,55794,8973,,
 BAKER (003),89,30,52,7,8,32162,19369,12282,510,623
 BAY (005),122,49,63,10,11,210039,98834,109628,1577,10120
 BRADFORD (007),62,23,37,2,2,9330,5492,3217,620,620
 BREVARD (009),241,160,81,-,-,364138,204179,159959,,

The data is loaded by pandas in the usual way without errors.

[25]:
bridges = pd.read_csv(tt.data('FL-BRIDGES'))

However, not all is well, and problems will appear during analysis.

[26]:
bridges['SD #'].sum()
[26]:
'-8112-711-324152218128--6-1--22913622-286111-136313-3-7-3331352-13917696256'

That is clearly not the number of structurally deficient bridges in Florida. And it’s not even a number at all, it’s a string. If you look carefully and compare against the first few rows of the DataFrame shown below, you may figure out that the string is actually a concatenation of the numbers in the column, not the sum.
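
This happens because pandas applies the + operator across the column, and for Python strings + concatenates. A minimal sketch of the same effect:

# summing an object-dtype (string) column concatenates rather than adds
pd.Series(['8', '112', '7']).sum()  # returns '81127', not 127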

[27]:
bridges.head()
[27]:
County Total # Good # Fair # Poor # SD # Total Area Good Area Fair Area Poor Area SD Area
0 ALACHUA (001) 111 64 47 - - 64767 55794 8973 NaN NaN
1 BAKER (003) 89 30 52 7 8 32162 19369 12282 510.0 623.0
2 BAY (005) 122 49 63 10 11 210039 98834 109628 1577.0 10120.0
3 BRADFORD (007) 62 23 37 2 2 9330 5492 3217 620.0 620.0
4 BREVARD (009) 241 160 81 - - 364138 204179 159959 NaN NaN

This is happening because the values in that column are read by pandas as strings, not numbers, because the zero values are given as '-' instead of 0. We can see the problem clearly if we look at the info for the bridges DataFrame.

[28]:
bridges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 11 columns):
County        68 non-null object
Total #       68 non-null int64
Good #        68 non-null int64
Fair #        68 non-null int64
Poor #        68 non-null object
SD #          68 non-null object
Total Area    68 non-null int64
Good Area     68 non-null int64
Fair Area     68 non-null int64
Poor Area     47 non-null float64
SD Area       49 non-null float64
dtypes: float64(2), int64(6), object(3)
memory usage: 6.0+ KB

To fix this, we can use the replace method to fill all the ‘-’ values with zeros, and then change the dtype of the affected columns back to integer.

[29]:
bridges = bridges.replace('-', 0)
bridges[['Poor #', 'SD #']] = bridges[['Poor #', 'SD #']].astype(int)

Then we are able to compute the total number of structurally deficient bridges in Florida.

[30]:
bridges['SD #'].sum()
[30]:
512

Whoa, that seems high. Let’s check on the values in the table and see where all those structurally deficient bridges are. We can sort on the ‘SD #’ column and take a look at the highest few rows:

[31]:
bridges.sort_values('SD #', ascending=False).head()
[31]:
County Total # Good # Fair # Poor # SD # Total Area Good Area Fair Area Poor Area SD Area
67 TOTALS 12313 8534 3545 233 256 16759416 11232342 5149876 376033.0 402201.0
14 DUVAL (031) 763 479 262 22 22 1795247 932820 811836 50590.0 50590.0
15 ESCAMBIA (033) 243 116 111 16 18 585435 259541 233423 92471.0 92585.0
42 MIAMI-DADE (086) 957 804 146 7 13 1743024 1315449 404966 22609.0 33739.0
45 OKALOOSA (091) 216 111 92 13 13 269312 77988 171330 19993.0 19993.0

Oops! The data we loaded includes a ‘TOTALS’ row, but we don’t want that row included in our DataFrame for analysis. We can drop it (using the index for that row, 67), to fix the problem:

[32]:
bridges.drop(67, inplace=True)
bridges['SD #'].sum()
[32]:
256

A slightly different problem occurs in the data for ‘SD Area’. In this column, the zero values were not given as ‘-’ values, but instead were omitted entirely. When read in by pandas, omitted values are valid input and are different from zero values.

If the intention is that missing values are indeed missing, this behavior is expected and useful. If, however, the missing values should be interpreted as zeros, then that instruction needs to be applied to the DataFrame explicitly, or problems may arise. For example, missing values don’t count in the denominator when computing the mean for a column:

[33]:
bridges['SD Area'].mean()
[33]:
8379.208333333334

If we compute the mean manually using the actual number of counties, we get a different value:

[34]:
bridges['SD Area'].sum() / len(bridges['SD Area'])
[34]:
6003.014925373134

We can use the fillna method to set all the missing values to zero, and then we get the correct answer:

[35]:
bridges.fillna(0, inplace=True)
bridges['SD Area'].mean()
[35]:
6003.014925373134
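
As an aside, much of this cleaning could have been handled at read time instead. A sketch of that approach, using read_csv arguments that treat ‘-’ as missing and skip the trailing TOTALS row:

bridges_alt = pd.read_csv(
    tt.data('FL-BRIDGES'),
    na_values='-',     # read the '-' placeholders as missing
    skipfooter=1,      # omit the TOTALS row at the bottom of the file
    engine='python',   # skipfooter requires the python parser engine
).fillna(0)
# note the '#' columns come back as float64 here, since NaN forces a
# floating point dtype before fillna; cast with astype(int) if needed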

Expanding

New columns can be added to an existing DataFrame simply by assigning a value to a new column label, using plain square bracket notation. This makes it easy to add derived values to a DataFrame.

[36]:
fl['Population Density'] = fl['2019 Population'] / fl['Land Area']
[37]:
fl.head()
[37]:
Name 2019 Population Growth Since 2010 Land Area Population Density
0 Miami-Dade County 2751796 9.754518 1897.72 1450.053749
1 Broward County 1935878 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150 11.134781 1969.76 746.867639
3 Hillsborough County 1408566 14.187346 1020.21 1380.662805
4 Orange County 1348975 17.435335 903.43 1493.170473

Adding rows to a DataFrame can be done as well, using the loc indexer (see the next section on slicing for more about loc).

[38]:
fl.loc[999,['Name', '2019 Population']] = ('Dry Tortugas', 0)
[39]:
fl.tail()
[39]:
Name 2019 Population Growth Since 2010 Land Area Population Density
63 Glades County 13754.0 6.843782 806.01 17.064304
64 Franklin County 11727.0 1.779205 534.73 21.930694
65 Lafayette County 8451.0 -4.074915 543.41 15.551793
66 Liberty County 8242.0 -1.269765 835.56 9.864043
999 Dry Tortugas 0.0 NaN NaN NaN

Slicing

There are two main ways to slice a DataFrame, by label and by position.

Slicing by Position

Slicing by position is done using the iloc indexer. This makes the DataFrame operate in a manner similar to a numpy array.

Giving only one index or set of indexes selects from the rows only.

[40]:
fl.iloc[1:3]
[40]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639

Giving two sets of indexes selects from the rows and columns.

[41]:
states.iloc[1:3, 2:5]
[41]:
AFFGEOID GEOID NAME
STUSPS
AK 0400000US02 02 Alaska
AZ 0400000US04 04 Arizona

See more in Selection by Position in the Pandas documentation.

Slicing by Label

Slicing by label is done using the loc indexer. Instead of finding rows or columns based on the raw position in the DataFrame, the loc indexer finds rows or columns based upon the values in the index or columns attributes of the DataFrame.

[42]:
states.loc['AK':'CA']
[42]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
CA 06 01779778 0400000US06 06 California 00 403501101370 20466718403

An important feature of the loc selector is that the resulting selection includes both the starting and the ending label. This is different from the usual Python slicing process, where the slice runs from the starting position up to but not including the ending position.

It can also be confusing to use the loc selector when the index being used has integer values. Care must be taken to select the rows you want, even when the index is a RangeIndex starting from zero. Consider these two selections, which are very similar but yield different results:

[43]:
fl.loc[1:3]
[43]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805
[44]:
fl.iloc[1:3]
[44]:
Name 2019 Population Growth Since 2010 Land Area Population Density
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639

See more in Selection by Label in the Pandas documentation.

Sorting

Sorting must be done explicitly, using the index or one or more columns of values.

[45]:
states.sort_index().head()
[45]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
AS 60 01802701 0400000US60 60 American Samoa 00 197759069 1307243753
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
[46]:
states.sort_values('NAME').head()
[46]:
STATEFP STATENS AFFGEOID GEOID NAME LSAD ALAND AWATER
STUSPS
AL 01 01779775 0400000US01 01 Alabama 00 131173688951 4593686489
AK 02 01785533 0400000US02 02 Alaska 00 1477946266785 245390495931
AS 60 01802701 0400000US60 60 American Samoa 00 197759069 1307243753
AZ 04 01779777 0400000US04 04 Arizona 00 294198560125 1027346486
AR 05 00068085 0400000US05 05 Arkansas 00 134771517596 2960191698
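
Sorting can also be done on multiple columns at once, or in descending order, by passing lists to sort_values. A sketch using the land area column:

# largest land area first, breaking any ties alphabetically by name
states.sort_values(['ALAND', 'NAME'], ascending=[False, True]).head()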

Merging

If we have two DataFrames that contain related data, we may want to merge the contents into a single DataFrame for some kinds of analysis. For example, suppose we have a DataFrame that enumerates counties in Florida by FDOT District, like this one:

[47]:
districts = pd.read_csv(tt.data('FL-COUNTY-BY-DISTRICT'))
districts.head()
[47]:
County District
0 Charlotte 1
1 Collier 1
2 DeSoto 1
3 Glades 1
4 Hardee 1

This is useful information if we want to compute statistics by FDOT District instead of by county, but it will need to be merged with the other available Florida county data we have.

Pandas provides a merge function, which provides the core functionality for all standard database-style join operations between DataFrames. These operations let the analyst connect two DataFrames that are related in some systematic way.

The principal arguments to the merge function include:

  • left: The left-side DataFrame to merge.
  • right: The right-side DataFrame to merge (can also be a named pandas.Series instead of a DataFrame).
  • how: Type of merge to be performed, defaults to ‘inner’.
    • ‘left’ or ‘right’: use only keys from the left or right frame, similar to a SQL left or right outer join; rows from the other DataFrame that do not match one of these keys are dropped.
    • ‘outer’: use the union of keys from both frames, similar to a SQL full outer join; keeps all rows from both DataFrames.
    • ‘inner’: use the intersection of keys from both frames, similar to a SQL inner join; rows in either DataFrame that do not match one of the keys in the other are dropped.

In addition, some combination of selector arguments must be given, to indicate the keys to use for the join:

  • on: A label or list, giving column or index level names to join on, which must be found in both DataFrames.
  • left_on, right_on: A label or list, giving column or index level names to join on in the left or right DataFrame, respectively.
  • left_index, right_index: A bool (defaults to False) indicating if the index should be used instead of data columns, for the left or right DataFrame, respectively.

Advanced users will find details on a variety of other merge arguments in the function documentation. There are also merge and join shorthand methods available as methods on a DataFrame, although each represents an application of the more general merge method illustrated here.
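
To see how the how argument changes the result, consider this small sketch with toy DataFrames (the frames and key values here are invented for illustration):

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
pd.merge(left, right, on='key', how='inner')  # only 'b', the common key
pd.merge(left, right, on='key', how='outer')  # 'a', 'b', and 'c', with NaNs filling the gaps
pd.merge(left, right, on='key', how='left')   # 'a' and 'b'; the unmatched 'c' is dropped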

To merge the county data with the District numbers, we can merge the two datasets using the merge function. In the population data file we previously loaded, the county names are given in a column named ‘Name’, while in the district membership table, the county names are given in a column named ‘County’, so we'll use the left_on and right_on arguments to identify that the merge keys have different names.

[48]:
pd.merge(fl, districts, left_on='Name', right_on='County')
[48]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District

In this instance, the ‘inner’ merge resulted in a new DataFrame, but with zero rows of data. This is because, as it turns out, the merge keys are not actually the same:

[49]:
fl.head()
[49]:
Name 2019 Population Growth Since 2010 Land Area Population Density
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749
1 Broward County 1935878.0 10.438146 1209.79 1600.176890
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805
4 Orange County 1348975.0 17.435335 903.43 1493.170473
[50]:
districts.head()
[50]:
County District
0 Charlotte 1
1 Collier 1
2 DeSoto 1
3 Glades 1
4 Hardee 1

The names of the counties in the first table are given with the appellation “Name County” (e.g., “Broward County”), while in the second table they are identified merely as “Name” (e.g., “Broward”). In order to use the merge function, the keys must match exactly across the two tables. To make this work, we'll need to strip the ‘ County’ suffix from the names in the first table. We can do so with the str.replace method, like this:

[51]:
fl['County'] = fl['Name'].str.replace(' County', '')

Then, if we attempt the merge again, we’ll get the results we expect.

[52]:
fl_2 = pd.merge(fl, districts, on='County')
fl_2.head()
[52]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5

The merge here, adding the FDOT district to the County table, is a one-to-one merge (i.e., every row in the left table matches one and only one row in the right table). However, a one-to-one relationship is not required to undertake a merge. We can also merge data that has one-to-many and many-to-one relationships, and even many-to-many, although that merge type is less common in most transportation applications.

As an example, consider this table that gives the FDOT District name and headquarters location for each district:

[53]:
district_info = pd.read_csv(tt.data('FL-DISTRICTS'), index_col='District')
district_info
[53]:
Name Headquarters
District
1 Southwest Florida Bartow
2 Northeast Florida Lake City
3 Northwest Florida Chipley
4 Southeast Florida Fort Lauderdale
5 Central Florida DeLand
6 South Florida Miami
7 West Central Florida Tampa

If we want to add a column that gives the district name to the counties table we already have, we can use a many-to-one merge. This is actually just detected automatically, so there is no extra argument to give.

[54]:
pd.merge(fl_2, district_info, on='District').head()
[54]:
Name_x 2019 Population Growth Since 2010 Land Area Population Density County District Name_y Headquarters
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 South Florida Miami
1 Monroe County 77013.0 5.181715 983.28 78.322553 Monroe 6 South Florida Miami
2 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 Southeast Florida Fort Lauderdale
3 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 Southeast Florida Fort Lauderdale
4 St. Lucie County 313506.0 12.659688 571.93 548.154494 St. Lucie 4 Southeast Florida Fort Lauderdale

Because both the left and right DataFrames have a column named ‘Name’, in the merged result the column from the left DataFrame has a ‘_x’ appended, while the column from the right DataFrame has a ‘_y’ appended. This is configurable using the suffixes argument, or by just changing the column names before merging. In fact, many merge details can be controlled easily by manipulating the left and right arguments before merging. For example, if we only want to get the district name, not the headquarters, we can slice the right DataFrame to only include what we want:

[55]:
pd.merge(fl_2, district_info[['Name']].add_prefix('District_'), on='District').head()
[55]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District_Name
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 South Florida
1 Monroe County 77013.0 5.181715 983.28 78.322553 Monroe 6 South Florida
2 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 Southeast Florida
3 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 Southeast Florida
4 St. Lucie County 313506.0 12.659688 571.93 548.154494 St. Lucie 4 Southeast Florida
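
Alternatively, if we want to keep both ‘Name’ columns but with more meaningful labels, the suffixes argument can be given directly; a sketch (the suffix strings are our own choice):

# control the labels given to overlapping columns explicitly
pd.merge(fl_2, district_info, on='District', suffixes=('_county', '_district')).head()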

Multi-level Column and Index Values

Often the data we are working with has multiple categories of columns. For example, the bridge data we worked with above has two kinds of measures: the number of bridges by condition, and the deck area by condition. We can explicitly gather together those groups of columns using a multi-level index.

To demonstrate this, we'll make a copy of the bridges DataFrame with the county name as the index.

[56]:
bridges2 = bridges.set_index('County')

This DataFrame now has ten columns: five counts and five areas.

[57]:
bridges2.columns
[57]:
Index(['Total #', 'Good #', 'Fair #', 'Poor #', 'SD #', 'Total Area',
       'Good Area', 'Fair Area', 'Poor Area', 'SD Area'],
      dtype='object')

We can use a list of tuples to create a MultiIndex that gives both the column’s category and a name. Notice that we can duplicate the condition names without trouble, because the duplicate values have different categories.

[58]:
bridges2.columns = pd.MultiIndex.from_tuples(
    [('Count','Total'), ('Count','Good'), ('Count','Fair'), ('Count','Poor'), ('Count','SD'),
     ('Area','Total'), ('Area','Good'), ('Area','Fair'), ('Area','Poor'), ('Area','SD'),     ],
    names=['measure', 'condition'],
)
[59]:
bridges2.head()
[59]:
measure Count Area
condition Total Good Fair Poor SD Total Good Fair Poor SD
County
ALACHUA (001) 111 64 47 0 0 64767 55794 8973 0.0 0.0
BAKER (003) 89 30 52 7 8 32162 19369 12282 510.0 623.0
BAY (005) 122 49 63 10 11 210039 98834 109628 1577.0 10120.0
BRADFORD (007) 62 23 37 2 2 9330 5492 3217 620.0 620.0
BREVARD (009) 241 160 81 0 0 364138 204179 159959 0.0 0.0

MultiIndexes can be used for both the columns and the index of a DataFrame.

Stacking and Unstacking

One advantage of using this multi-index approach to grouping similar data is that it makes it much easier to reshape the DataFrame for different types of analysis. For example, we can transform the bridges DataFrame to have a row for each county and condition, and just two columns (count and area), by stacking the data.

[60]:
bridges3 = bridges2.stack()
bridges3.head(12)
[60]:
measure Area Count
County condition
ALACHUA (001) Fair 8973.0 47
Good 55794.0 64
Poor 0.0 0
SD 0.0 0
Total 64767.0 111
BAKER (003) Fair 12282.0 52
Good 19369.0 30
Poor 510.0 7
SD 623.0 8
Total 32162.0 89
BAY (005) Fair 109628.0 63
Good 98834.0 49

One thing you might notice here is that the conditions have been reordered. The new order isn’t consistent with the categorical meanings, but instead is simply alphabetic. We can preserve the ordering by converting these values into ordered categorical values like this:

[61]:
conditions = pd.CategoricalIndex(
    bridges2.columns.levels[1],
    ordered=True,
    categories=['Total', 'Good', 'Fair', 'Poor', 'SD'],
)
bridges2.columns = bridges2.columns.set_levels(conditions, level=1)
bridges2.stack().head(12)
[61]:
measure Area Count
County condition
ALACHUA (001) Total 64767.0 111
Good 55794.0 64
Fair 8973.0 47
Poor 0.0 0
SD 0.0 0
BAKER (003) Total 32162.0 89
Good 19369.0 30
Fair 12282.0 52
Poor 510.0 7
SD 623.0 8
BAY (005) Total 210039.0 122
Good 98834.0 49

Admittedly, that's a bit complicated, and it may be easier to simply rename the conditions in a way that makes alphabetic sorting work for us.
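
For example, prefixing each condition label with a digit makes plain alphabetic sorting match the intended order. A sketch (the prefixed labels are our own invention):

# hypothetical renaming so alphabetic order gives Total, Good, Fair, Poor, SD
renames = {'Total': '0-Total', 'Good': '1-Good', 'Fair': '2-Fair',
           'Poor': '3-Poor', 'SD': '4-SD'}
bridges2b = bridges2.copy()
bridges2b.columns = pd.MultiIndex.from_tuples(
    [(measure, renames[str(condition)]) for measure, condition in bridges2.columns],
    names=bridges2.columns.names,
)
bridges2b.stack().head()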

Now, for example, we can easily get the average deck area of bridges by condition.

[62]:
bridges3['AvgArea'] = bridges3['Area'] / bridges3['Count']
bridges3.head(12)
[62]:
measure Area Count AvgArea
County condition
ALACHUA (001) Fair 8973.0 47 190.914894
Good 55794.0 64 871.781250
Poor 0.0 0 NaN
SD 0.0 0 NaN
Total 64767.0 111 583.486486
BAKER (003) Fair 12282.0 52 236.192308
Good 19369.0 30 645.633333
Poor 510.0 7 72.857143
SD 623.0 8 77.875000
Total 32162.0 89 361.370787
BAY (005) Fair 109628.0 63 1740.126984
Good 98834.0 49 2017.020408

We can also reverse this transformation, returning to a wider format, with the unstack method.

[63]:
bridges4 = bridges3.unstack()
bridges4.head()
[63]:
measure Area Count AvgArea
condition Fair Good Poor SD Total Fair Good Poor SD Total Fair Good Poor SD Total
County
ALACHUA (001) 8973.0 55794.0 0.0 0.0 64767.0 47 64 0 0 111 190.914894 871.781250 NaN NaN 583.486486
BAKER (003) 12282.0 19369.0 510.0 623.0 32162.0 52 30 7 8 89 236.192308 645.633333 72.857143 77.875 361.370787
BAY (005) 109628.0 98834.0 1577.0 10120.0 210039.0 63 49 10 11 122 1740.126984 2017.020408 157.700000 920.000 1721.631148
BRADFORD (007) 3217.0 5492.0 620.0 620.0 9330.0 37 23 2 2 62 86.945946 238.782609 310.000000 310.000 150.483871
BREVARD (009) 159959.0 204179.0 0.0 0.0 364138.0 81 160 0 0 241 1974.802469 1276.118750 NaN NaN 1510.946058

If you unstack a DataFrame whose index doesn't have at least two levels, you'll get a one-dimensional Series instead of a DataFrame.

[64]:
bridges4.unstack()
[64]:
measure  condition  County
Area     Fair       ALACHUA (001)         8973.000000
                    BAKER (003)          12282.000000
                    BAY (005)           109628.000000
                    BRADFORD (007)        3217.000000
                    BREVARD (009)       159959.000000
                                            ...
AvgArea  Total      UNION (125)            447.107143
                    VOLUSIA (127)         1439.052863
                    WAKULLA (129)          355.660000
                    WALTON (131)           475.378995
                    WASHINGTON (133)       751.208333
Length: 1005, dtype: float64

The pandas documentation includes a lot more detail about reshaping data, including a variety of other tools for reshaping in other ways.

Grouping

A common task in data analysis is to conduct group-wise analysis. This involves splitting a DataFrame into groups of related rows, then computing some sort of measure on each group. The results are then usually combined into a final coherent data structure, generally having one row for each original row (a transformation), or one row for each group (an aggregation). It is possible to introduce some filtering in the process so the results do not conform to one of these patterns, but that is a more advanced topic that will not be discussed here.

For example, we might group counties by district, and then total the population values, to get to total population by district:

[65]:
fl_2.head()
[65]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5
[66]:
fl_2.groupby('District')[['2019 Population']].sum()
[66]:
2019 Population
District
1 3048172.0
2 2139433.0
3 1468387.0
4 4034840.0
5 4227713.0
6 2828809.0
7 3237046.0

The splitting into groups is done using the groupby method of DataFrames, which is fairly flexible: it can be used to group rows or columns, and to group on an index or on values. Pandas provides an extensive tutorial that covers groupby in more depth. In this tutorial, we'll focus on the most common tasks for data analysis in the transportation context (and, indeed, most data analysis): grouping rows using one or more columns, or a function thereof, as the grouping keys.

In the example above, the groupby method is called with a single argument ‘District’, which indicates that the groups are delineated by common values of the column named ‘District’. Grouping by an identifier like this is convenient, because the identifiers are naturally well defined and the groups are inherently meaningful.

As with merging, groupby keys need to be exactly identical in order to match. Attempting to use groupby directly with continuous data that is not categorical in nature will not give good results – you can expect that more or less every row will be its own group. Instead, if you want to group rows in this way, you'll need to convert the data to categories first. This can be done using pandas.cut or pandas.qcut.

  • Using cut defines categorical bins either by dividing the entire range of values into a number of equal sized bins (by giving the number of bins as an integer) or by giving a list of explicit breakpoints for the bins.
  • Using qcut defines categorical bins either by setting the bin edges so that each bin contains a roughly equal number of observations (by giving the number of bins as an integer) or by giving a list of explicit quantile levels for the bin breakpoints (e.g., [0, 10, 25, 50, 75, 90, 100]).
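
For comparison, here is a quick sketch of qcut applied to the same population data used below (the choice of quartiles is our own):

# four bins, each containing roughly the same number of counties
population_quartiles = pd.qcut(fl_2['2019 Population'], q=4)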

Let’s use cut to create five categorical values of county population:

[67]:
population_binned = pd.cut(fl_2['2019 Population'], bins=5)
population_binned.head()
[67]:
0    (2203085.2, 2751796.0]
1    (1654374.4, 2203085.2]
2    (1105663.6, 1654374.4]
3    (1105663.6, 1654374.4]
4    (1105663.6, 1654374.4]
Name: 2019 Population, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]
[68]:
population_binned
[68]:
0     (2203085.2, 2751796.0]
1     (1654374.4, 2203085.2]
2     (1105663.6, 1654374.4]
3     (1105663.6, 1654374.4]
4     (1105663.6, 1654374.4]
               ...
62      (5498.446, 556952.8]
63      (5498.446, 556952.8]
64      (5498.446, 556952.8]
65      (5498.446, 556952.8]
66      (5498.446, 556952.8]
Name: 2019 Population, Length: 67, dtype: category
Categories (5, interval[float64]): [(5498.446, 556952.8] < (556952.8, 1105663.6] < (1105663.6, 1654374.4] < (1654374.4, 2203085.2] < (2203085.2, 2751796.0]]

Those bins divide the range of population values evenly into 5 bins, but the resulting breakpoints appear a bit arbitrary. We can set more reasonable bins by explicitly giving boundaries that are round numbers of our choosing:

[69]:
bins=[5_000, 500_000, 1_000_000, 1_500_000, 2_000_000, 3_000_000]
population_binned = pd.cut(fl_2['2019 Population'], bins=bins)
population_binned.head()
[69]:
0    (2000000, 3000000]
1    (1500000, 2000000]
2    (1000000, 1500000]
3    (1000000, 1500000]
4    (1000000, 1500000]
Name: 2019 Population, dtype: category
Categories (5, interval[int64]): [(5000, 500000] < (500000, 1000000] < (1000000, 1500000] < (1500000, 2000000] < (2000000, 3000000]]

We can then pass this categorical data directly to the groupby method, instead of naming existing columns in the DataFrame to use as the groupby keys.

[70]:
fl_2.groupby(population_binned)[['2019 Population']].sum()
[70]:
2019 Population
2019 Population
(5000, 500000] 7080260.0
(500000, 1000000] 4987775.0
(1000000, 1500000] 4228691.0
(1500000, 2000000] 1935878.0
(2000000, 3000000] 2751796.0

Aggregation

The examples above all use the sum method, which is an aggregate function: it takes a set of values drawn from a group of rows, and returns a single output value. Other common aggregation functions include:

function   description
mean       within-group means
std        within-group standard deviation
min        within-group minimum
max        within-group maximum
first      first value in group
last       last value in group
size       number of group members, including NaN
count      number of group members, excluding NaN

When aggregation functions are applied, the result is generally one row per group, instead of one row per row.

Multiple aggregation functions can be applied simultaneously using the agg method, like this:

[71]:
fl_2.groupby(population_binned)[['2019 Population']].agg([np.mean, np.std, np.size])
[71]:
2019 Population
mean std size
2019 Population
(5000, 500000] 1.287320e+05 130660.887798 55.0
(500000, 1000000] 7.125393e+05 182201.018266 7.0
(1000000, 1500000] 1.409564e+06 61093.609816 3.0
(1500000, 2000000] 1.935878e+06 NaN 1.0
(2000000, 3000000] 2.751796e+06 NaN 1.0

Transformation

In contrast with aggregation, which returns one value for each group, transform returns a value for each row in the original DataFrame. Many of the same functions used for aggregation can also be used with transform; the results are simply broadcast back to the original rows (i.e., a copy of the relevant aggregated value is attached to each original row, so that if there are multiple rows in a group, then there are multiple copies of the aggregate output). The return value is a new Series or DataFrame, indexed the same as the original, which makes it easy to use transform to create new derived columns based on group-wise aggregations or transforms.

For example, to add a new column that contains the FDOT District total population for each county, we can give the sum function to transform, and write the result as a new column in the original DataFrame:

[72]:
pop_sum = fl_2.groupby('District')['2019 Population'].transform(sum)
pop_sum.head()
[72]:
0    2828809.0
1    4034840.0
2    4034840.0
3    3237046.0
4    4227713.0
Name: 2019 Population, dtype: float64
[73]:
fl_2.head()
[73]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5

The second and third counties in the fl_2 DataFrame are both in District 4, so both rows of this output give the total population in District 4. Generating the output like this makes it super easy to attach the result as a new column in the DataFrame:

[74]:
fl_2['District Population'] = fl_2.groupby('District')['2019 Population'].transform(sum)
[75]:
fl_2.head()
[75]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District Population
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 2828809.0
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 4034840.0
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 4034840.0
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7 3237046.0
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5 4227713.0

In addition to expanding aggregate measures to match with the rows of the original DataFrame, the transform function can also generate unique row-wise values, which can be computed relative to the group instead of relative to the entire DataFrame. For example, to compute the fraction of the population of each district living in each county:

[76]:
fl_2['Fraction of Pop'] = fl_2.groupby('District')['2019 Population'].transform(lambda x: (x / x.sum()))
fl_2.head()
[76]:
Name 2019 Population Growth Since 2010 Land Area Population Density County District District Population Fraction of Pop
0 Miami-Dade County 2751796.0 9.754518 1897.72 1450.053749 Miami-Dade 6 2828809.0 0.972775
1 Broward County 1935878.0 10.438146 1209.79 1600.176890 Broward 4 4034840.0 0.479791
2 Palm Beach County 1471150.0 11.134781 1969.76 746.867639 Palm Beach 4 4034840.0 0.364612
3 Hillsborough County 1408566.0 14.187346 1020.21 1380.662805 Hillsborough 7 3237046.0 0.435139
4 Orange County 1348975.0 17.435335 903.43 1493.170473 Orange 5 4227713.0 0.319079