[1]:

import transportation_tutorials as tt

Manipulating Geographic Data¶

A shapefile is a data format that is actually a set of files containing geospatial information. All of the related files have the same filename (except for the file extension) and must reside in the same directory. Generally, there will be files with the shp extension containing geographic data, the dbf extension containing tabular data, and a few others. The files can be read into python as a group and manipulated using the GeoPandas package.

[2]:

import geopandas as gpd

Loading a Shapefile¶

Let’s load the SERPM8 TAZ shapefile. This shapefile (actually a set of files) is included in the data of the tuturials package and can be accessed as shown. Loading a shapefile into Python using geopandas is a simple task using the read_file function.

[3]:

shapefile_filename = tt.data('SERPM8-TAZSHAPE')
taz = gpd.read_file(shapefile_filename)

This will load the shapefile into a GeoDataFrame object, which is similar to a pandas DataFrame, but with a special geometry column that contains, as you might guess, geometric shape data contained in the shp part of the shapefile. It also includes all the other various tabular data contained in the dbf part of the shapefile. We can inspect the first few lines of the GeoDataFrame to see the contents, just like a regular DataFrame:

[4]:

taz.head()

[4]:

	OBJECTID	TAZ_REG	TAZ_OLD05	TAZ_MPO	COUNTY	CENSUSTAZ	AREA	F_NETAREA	Shape_Leng	Shape_Area	geometry
0	1	1122.0	1122	1122	1.0	None	4442490.0	0.8153	10592.846522	4.442490e+06	POLYGON ((936374.6744969971 959539.5675094873,...
1	2	17.0	17	17	1.0	None	15689400.0	0.8571	17396.297932	1.568938e+07	POLYGON ((942254.5000076629 952920.9373740703,...
2	3	1123.0	1123	1123	1.0	None	17396100.0	0.8663	23585.421941	1.739613e+07	POLYGON ((940953.5610084943 952985.0688074902,...
3	4	1120.0	1120	1120	1.0	None	1303420.0	0.8536	7202.864864	1.303422e+06	POLYGON ((953118.9999321625 951985.3749407381,...
4	5	1121.0	1121	1121	1.0	None	31477500.0	0.8787	24940.959492	3.147748e+07	POLYGON ((934328.2825924121 951600.5853559896,...

Because this is the TAZ shapefile, we expect that the geometry column includes a set of Polygon values, as we can observe in the first few rows. We can also easily check how many zones are included in the file by checking the length of the GeoDataFrame.

[5]:

len(taz)

[5]:

Displaying the shapefile in a jupyter notebook can be accomplished by calling the plot method of the GeoDataFrame.

[6]:

ax = taz.plot()

../_images/geographic-analysis_working-with-geodata_12_0.png

Coordinate Reference Systems¶

You might notice that the axis labels don’t show latitude and longitude. That’s because the TAZ file uses a different coordinate reference system (crs) for map projection. We can find what crs is being used by accessing the crs attribute.

[7]:

taz.crs

[7]:

{'init': 'epsg:2236'}

Drop that code into Google and you’ll find that it’s a special coordinate system for mapping and engineering in eastern Florida, which for this application may be perfectly reasonable. However, suppose we want to generate maps using plain old latitude and longitude. We can convert between CRS systems easily, like this:

[8]:

taz1 = taz.to_crs(epsg=4326)

[9]:

ax = taz1.plot()

../_images/geographic-analysis_working-with-geodata_17_0.png

The map projection generated here is subtly different from the previous one. If you look carefully, you’ll note the shape of the second map is a bit wider than first. This distortion is a natural part of the mapping projection, using plain latitude and longitude, as one degree of longitude is a different distance than one degree of latitude. In Florida, the difference is relatively small, but it grows as you move north.

Selecting by Geography¶

The SERPM8 region is quite large. Let’s suppose we only want to study a small portion of this region. Since we are working with Jupyter notebooks, we’ll select an area to study around Jupiter, Florida.

Selection by Rectangular Envelope¶

We can explicitly define some bounds for our study area:

[10]:

xmin = 905712.145924
ymin = 905343.94408855
xmax = 983346.68922847
ymax = 981695.93140023

Then, to select only the TAZ’s that are (at least partially) contained in the study area, we can use the cx indexer on GeoDataFrames. This indexer works similarly to the loc and iloc indexers on regular DataFrames, but selects not based on labels or tabular position, but instead selects based on geographic position.

[11]:

taz_jupiter = taz.cx[xmin:xmax, ymin:ymax]

The result of the selection is another GeoDataFrame, just as slicing using the locand iloc indexers on regular DataFrames creates a new DataFrame. Thus, we can review how many TAZ’s are in the selection in the same manner as we checked the number of TAZ’s in the original shapefile.

[12]:

len(taz_jupiter)

[12]:

And we can plot a map of the selected area, just as we did with the original data. Here, we’ll draw the borders of the TAZ’s in white for better contrast.

[13]:

ax = taz_jupiter.plot(edgecolor='w')

../_images/geographic-analysis_working-with-geodata_28_0.png

Because Jupiter is in the northwest corner of the SERPM region, only the south and west boundaries of the study area are relevant in the selection definition. Convenietly, just like other DataFrame indexers, the cx indexer allows for one-sided slices.

[14]:

taz_jupiter2 = taz.cx[xmin:, ymin:]
taz_jupiter2.equals(taz_jupiter)

[14]:

True

One important note regarding the cx indexer is that it will select all features that are contained fully or partially in the selection area. To illustrate, we can construct a GeoDataFrame consisting of the study area box, and draw it on a map of the selected zones.

[15]:

from shapely.geometry import box
study_area = gpd.GeoDataFrame(geometry=[box(xmin, ymin, xmax, ymax)], crs={'init': 'epsg:2236'})
ax = taz_jupiter.plot(edgecolor='w')
transparent = (0,0,0,0)
ax = study_area.plot(ax=ax, edgecolor='red', facecolor=transparent)

../_images/geographic-analysis_working-with-geodata_32_0.png

Selection by Polygon¶

If the study area is not neatly defined by a rectangle but instead by some defined irregular polygon, it is still possible to filter the GeoDataFrame, but slightly more complicated. We’ll demonstrate by first constructing an irregular polygon.

[16]:

from shapely.geometry import Polygon
irregular_polygon = Polygon([
    (973346, 935343),
    (973346, 971695),
    (912812, 973695),
    (915812, 943695),
    (928812, 943695),
    (935712, 905343),
])
ax = taz_jupiter.plot(edgecolor='w')
lines = ax.plot(*irregular_polygon.exterior.xy, color='r')

../_images/geographic-analysis_working-with-geodata_34_0.png

To indentify those TAZ polygons that touch the polygon, we can use the intersects method of GeoDataFrames.

[17]:

taz_jupiter.intersects(irregular_polygon)

[17]:

      True
      True
      True
      True
      True
      True
      True
      True
      True
      True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
     True
        ...
   False
   False
  False
   True
   True
   True
   True
   True
  False
  False
  False
  False
   True
   True
   True
  False
  False
   True
  False
  False
   True
   True
   True
   True
   True
   True
   True
   True
   True
   True
Length: 220, dtype: bool

This gives us a series that matches the GeoDataFrame, and identifies whether each row’s geometry intersects the the polygon. We can use that to filter the rows, to retain only those TAZ’s of interest.

[18]:

taz_jupiter_irregular = taz_jupiter[taz_jupiter.intersects(irregular_polygon)]

[19]:

len(taz_jupiter_irregular)

[19]:

This gives us a new GeoDataFrame that is potentially just a filtered view of the original, which can potentially lead to problems later (see here). To ensure we have a bonafide new GeoDataFrame, we can make sure it is a copy.

[20]:

taz_jupiter_irregular = taz_jupiter_irregular.copy()

[21]:

ax = taz_jupiter_irregular.plot(edgecolor='w')
lines = ax.plot(*irregular_polygon.exterior.xy, color='r')

../_images/geographic-analysis_working-with-geodata_42_0.png

You may note, as with the rectangular selection, we still select any zone that has any part of the zone inside the polygon.

Clipping by Geography¶

If instead we want to actually clip the geographic data at the study area frame, we can use the overlay function to get the intersection of the TAZ’s and the study area.

[22]:

taz_jupiter_clip = gpd.overlay(taz, study_area, how='intersection')

By taking the intersection, we discard portions of the zone shapes that are outside the study area, as is illustrated when we map the results with the study area box.

[23]:

ax = taz_jupiter_clip.plot(edgecolor='w')
transparent = (0,0,0,0)
ax = study_area.plot(ax=ax, edgecolor='red', facecolor=transparent)

../_images/geographic-analysis_working-with-geodata_48_0.png

Alternatively, if we have a polygon, we can use intersection to trim the GeoDataFrame geometry to only the overlapping areas.

[24]:

taz_jupiter_irregular.geometry = taz_jupiter_irregular.intersection(irregular_polygon)

[25]:

ax = taz_jupiter_irregular.plot(edgecolor='w')
transparent = (0,0,0,0)
lines = ax.plot(*irregular_polygon.exterior.xy, color='r')

../_images/geographic-analysis_working-with-geodata_51_0.png

An important feature to consider if using this method is that the operation does not actually change the set of rows or any other data included in the GeoDataFrame. If, for example, we apply it to the set of TAZ’s from the square clipped region, we can get a map that looks identical, but the GeoDataFrame will still retain all the data rows from the previous version, just with some geometry values set as empty.

[26]:

taz_jupiter_clip.geometry = taz_jupiter_clip.intersection(irregular_polygon)

ax = taz_jupiter_clip.plot(edgecolor='w')
transparent = (0,0,0,0)
lines = ax.plot(*irregular_polygon.exterior.xy, color='r')

../_images/geographic-analysis_working-with-geodata_53_0.png

[27]:

len(taz_jupiter_irregular), len(taz_jupiter_clip)

[27]:

(146, 220)

[28]:

sum(taz_jupiter_clip.geometry.area == 0)

[28]:

Joining by Geography¶

A common task in geographic analysis is joining two sets of geographic data together. We actually did a simple version of this just above, when we joined the TAZ data with the study area, although the study area had only a single polygon and no other associated data, so the results are not particularly inspiring.

As a more sophisticated example, let’s consider the micro-area zone (MAZ) system from SERPM8. The MAZ are smaller zones nested inside TAZ’s, used to provide a higher resolution of data, particularly for transit travel. Using the TAZ system is sometimes too coarse for this purpose, as it may be convenient to walk to a bus stop located on a TAZ boundary only for some parts of the zone, while the walking from the far side of the same TAZ to the bus stop might be unreasonably far.

Like the TAZ system, the MAZ system for SERPM8 is given by a shape file, and is included in the tutorial data, so we can load it in the same way.

[29]:

mazshape_filename = tt.data('SERPM8-MAZSHAPE')
maz = gpd.read_file(mazshape_filename)

[30]:

len(maz)

[30]:

[31]:

maz.head()

[31]:

	OBJECTID	MAZ	SHAPE_LENG	SHAPE_AREA	ACRES	POINT_X	POINT_Y	geometry
0	1	5347	8589.393674	3.111034e+06	71	953130	724165	POLYGON ((953970.4660769962 723936.0810402408,...
1	2	5348	11974.067469	7.628753e+06	175	907018	634551	POLYGON ((908505.2801046632 635081.7738410756,...
2	3	5349	9446.131753	4.007041e+06	92	923725	707062	POLYGON ((922736.6374686621 708387.6918614879,...
3	4	5350	21773.153739	2.487397e+07	571	908988	713484	POLYGON ((908334.2374677472 715692.2628822401,...
4	5	5351	17882.701416	1.963139e+07	451	909221	717493	POLYGON ((911883.0187559947 719309.3261861578,...

[ ]:

We can select only MAZ’s in the study area in the same manner as we did for TAZ’s.

[32]:

maz_jupiter = maz.cx[xmin:xmax, ymin:ymax]

[33]:

len(maz_jupiter)

[33]:

To create a joined GeoDataFrame, we can use the same `overlay <http://geopandas.org/reference/geopandas.overlay.html>`__ function we used to clip the TAZ’s, although we will intersect the TAZ and MAZ geometries.

[34]:

maz_taz = gpd.overlay(maz_jupiter, taz_jupiter, how='intersection')

The joined GeoDataFrame has all the columns from both the MAZ and TAZ source data. In cases where both source tables have an identically named column (for example, “OBJECTID”) then both columns are retained, and they are labeled with “_1” and “_2” suffixes.

[35]:

maz_taz.head()

[35]:

	OBJECTID_1	MAZ	SHAPE_LENG	SHAPE_AREA	ACRES	POINT_X	POINT_Y	OBJECTID_2	TAZ_REG	TAZ_OLD05	...	CENSUSTAZ	AREA	F_NETAREA	Shape_Leng	Shape_Area	geometry
0	2389	7736	10592.846522	4.442490e+06	102	936259	957308	1	1122.0	1122	...	None	4442490.0	0.8153	10592.846522	4.442490e+06	POLYGON ((936374.6744969971 959539.5675094873,...
1	2390	7737	17396.297932	1.568938e+07	360	940102	950911	2	17.0	17	...	None	15689400.0	0.8571	17396.297932	1.568938e+07	POLYGON ((942254.5000076629 952920.9373740703,...
2	2390	7737	17396.297932	1.568938e+07	360	940102	950911	3	1123.0	1123	...	None	17396100.0	0.8663	23585.421941	1.739613e+07	POLYGON ((938316.2017819136 951630.5626582429,...
3	2391	7738	18511.806088	1.405994e+07	323	937715	953796	3	1123.0	1123	...	None	17396100.0	0.8663	23585.421941	1.739613e+07	POLYGON ((938309.9170175791 952656.4848189875,...
4	6329	11676	10422.047410	3.336191e+06	77	937781	950989	3	1123.0	1123	...	None	17396100.0	0.8663	23585.421941	1.739613e+07	POLYGON ((938309.9170175791 952656.4848189875,...

5 rows × 22 columns

Because we know that the MAZ’s are supposed to be nested within the TAZ’s, we would normally expect that the joined data would have exactly one row for each of the 361 MAZ’s, and that some TAZ’s would be duplicated. However, that’s not what has happened.

[36]:

len(maz_taz)

[36]:

Managing Errors in Joining Spatial Data¶

The joined data has 564 new zones, 203 more than we expected. There are two reasons for this discrepancy, although both are a result of the two zone systems being misaligned.

The first reason is related to the limits computer precision: some of the new zones are merely slivers of space that arise due to rounding errors on the computer placement of vertices. If we evaluate the area of the first 5 zones in the joined data table, we see that one of the zones is only 1.878×10-3 square feet, or about a quarter of a square inch. Clearly, this isn’t a real zone!

[37]:

maz_taz.head().area

[37]:

  4.442490e+06
  1.568938e+07
  1.878210e-03
  1.405994e+07
  3.336191e+06
dtype: float64

The seconds reason that extra zones are found in this join is because the two zone systems are not always carefully aligned by the analysts who originally drew the zones, particularly in areas that are not functionally important. For example, here are the boundaries for the TAZ and MAZ systems, superimposed over a map of the The Loxahatchee Club golf course in Jupiter.

[38]:

ax = tt.mapping.make_basemap(xlim=(938500,945500), ylim=(937500,944500), zoom=15, tiles='OSM_A', epsg=2236, axis='off')
ax = taz_jupiter.plot(ax=ax, edgecolor='blue', linewidth=2, facecolor=transparent )
ax = maz_jupiter.plot(ax=ax, edgecolor='red', linewidth=1, facecolor=transparent )

../_images/geographic-analysis_working-with-geodata_74_0.png

When the intersection of the two zone systems is computed, half a dozen extraneous small zones are generated along the lake that forms the western edge of the golf course, but these are not actually meaningful activity locations and are not relevant for transortation analysis.

The possible solutions to both problems are the same, given that we know that the “correct” join solution will have exactly one zone in the result for each MAZ.

Dropping Extra Pieces¶

One possible solution is to take the joined data we already have, and remove all the extra pieces of zones that were created, retaining only the largest main section of each MAZ. To do this, we can sort zones in the joined GeoDataFrame by area, and retain only the largest zone from each group of duplicated MAZ codes.

First, we’ll use the argsort method, to get the order of the joined zones, if ordered by area. Then we feed that into the iloc indexer to actually sort the rows as indicated. The argsort returns values in ascending order by default, so all the small slivers and bits of zones we want to discard will appear first.

[39]:

maz_taz = maz_taz.iloc[maz_taz.area.argsort()]

Then, we can drop duplicate MAZ values from the table, keeping only the last instance of each MAZ, which is the one we want. In theory, if the two zones systems are very poorly aligned, this process could drop the wrong part of the zones, although in almost all cases it will work fine.

[40]:

maz_taz = maz_taz.drop_duplicates('MAZ', keep='last')

[41]:

len(maz_taz)

[41]:

One drawback of this method is that some small gaps may be left in the resulting joined GeoDataFrame zones, in the spaces that are removed, as illustrated below.

This may or may not be problematic, depending on the application.

[42]:

ax = tt.mapping.make_basemap(xlim=(938500,945500), ylim=(937500,944500), zoom=15, tiles='OSM_A', epsg=2236, axis='off')
ax = maz_taz.plot(ax=ax, edgecolor='k', linewidth=2, facecolor='r', alpha=0.5 )

../_images/geographic-analysis_working-with-geodata_83_0.png

Joining on Centroids¶

An alternate solution is to join the TAZ and MAZ data in a entirely different manner. Rather than joining the two GeoDataFrames based on the area polygons in each, we can join based on the area polygons for TAZ’s and centroids for MAZ’s. To do this, we’ll use the sjoin method.

First, we’ll need a version of the MAZ GeoDataFrame that has only centroid points in the geometry column, instead of polygons. To create this, start by making a copy of the original data (assuming we want to save it for later).

[43]:

maz_points = maz_jupiter.copy()

Then we want to change the geometry to be points instead of polygons. (Both Point and Polygon are specific object types in the shapely package that is used by GeoPandas.) We could simply use the ‘centroid’ attribute of the geometry column to find new centroid for each zone, but in this data we already have X and Y values for a centroid, so we’ll prefer to use those. We can uses the apply method of DataFrames to create Point values, and assign those values into the geometry column.

[44]:

from shapely.geometry import Point
maz_points.geometry = maz_points.apply(lambda x: Point(x.POINT_X, x.POINT_Y), axis=1)

Now we are ready to conduct a spatial join using the sjoin function. Like the overlay function, this function accepts two GeoDataFrames and returns one, with general data columns from both inputs.

Unlike the overlay function, the sjoin does not create any new geometries; instead, the resulting GeoDataFrame contains geometry from either the left (first argument) or right (second argument) input GeoDataFrames.

[45]:

sjoined = gpd.sjoin(maz_points, taz_jupiter, how='left', op='within')

[46]:

len(sjoined)

[46]:

[47]:

sjoined.head()

[47]:

	OBJECTID_left	MAZ	SHAPE_LENG	SHAPE_AREA	ACRES	POINT_X	POINT_Y	geometry	index_right	OBJECTID_right	...	COUNTY	CENSUSTAZ	AREA	F_NETAREA	Shape_Leng	Shape_Area
2388	2389	7736	10592.846522	4.442490e+06	102	936259	957308	POINT (936259 957308)	0	1	...	1.0	None	4442490.0	0.8153	10592.846522	4.442490e+06
2389	2390	7737	17396.297932	1.568938e+07	360	940102	950911	POINT (940102 950911)	1	2	...	1.0	None	15689400.0	0.8571	17396.297932	1.568938e+07
2390	2391	7738	18511.806088	1.405994e+07	323	937715	953796	POINT (937715 953796)	2	3	...	1.0	None	17396100.0	0.8663	23585.421941	1.739613e+07
2391	2392	7739	7202.864864	1.303422e+06	30	953218	953704	POINT (953218 953704)	3	4	...	1.0	None	1303420.0	0.8536	7202.864864	1.303422e+06
2392	2393	7740	24940.959492	3.147748e+07	723	931973	952018	POINT (931973 952018)	4	5	...	1.0	None	31477500.0	0.8787	24940.959492	3.147748e+07

5 rows × 23 columns

The resulting GeoDataFrame has the geometry from the left input, which in this case is just the centroid points.

[48]:

ax = sjoined.plot()

../_images/geographic-analysis_working-with-geodata_93_0.png

If we want to re-attach the polygon zones, we can do so easily if we didn’t overwrite the original MAZ data, because this joined data is in the same order as the original.

[49]:

sjoined.MAZ.equals(maz_jupiter.MAZ)

[49]:

True

[50]:

sjoined.geometry = maz_jupiter.geometry

[51]:

ax = sjoined.plot(edgecolor='w')

../_images/geographic-analysis_working-with-geodata_97_0.png

Creating a Crosswalk Dictionary from the Joined Data¶

We will sometimes be given a MAZ id but want to get the TAZ id instead. As this is a one-to-many mapping (every MAZ is in one and only one TAZ, but each TAZ can contain multiple MAZ’s), it is convenient to express this mapping as a Python dictionary. This can be done easily if the index of the GeoDataFrame is set to be the MAZ.

[52]:

sjoined.index=sjoined.MAZ

[53]:

maz_taz_crosswalk = dict(sjoined.TAZ_MPO)

Then, given a MAZ number we can access the TAZ simply like this:

[54]:

maz_number = 7755
taz_number = maz_taz_crosswalk[maz_number]

Table Of Contents