[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

Linear Regression

A popular package for developing linear regression models in Python is statsmodels. This package includes an extensive set of tools for statistical modeling, but in this tutorial we will focus on linear regression models.

Generally, linear regression models will be developed using a pandas.DataFrame containing both independent (explanatory) and dependent (target) variables. We’ll work with data from the households and trips tables for the Jupiter study area.

[2]:
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'), index_col=0)
hh.set_index('hh_id', inplace=True)
[3]:
trips = pd.read_csv(tt.data('SERPM8-BASE2015-TRIPS'))

If we want to develop a linear regression model to predict trip generation by households, we’ll need to merge these two data tables, tabulating the number of trips taken by each household. (See the tutorial on grouping for more details on how to do this).

[4]:
# count the trips made by each household and attach the total
# to the household table as a new 'n_trips' column
hh = hh.merge(
    trips.groupby(['hh_id']).size().rename('n_trips'),
    left_on=['hh_id'],
    right_index=True,
)

We can review what variables we now have in the hh DataFrame:

[5]:
hh.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17260 entries, 1690841 to 1726370
Data columns (total 9 columns):
home_mgra       17260 non-null int64
income          17260 non-null int64
autos           17260 non-null int64
transponder     17260 non-null int64
cdap_pattern    17260 non-null object
jtf_choice      17260 non-null int64
autotech        17260 non-null int64
tncmemb         17260 non-null int64
n_trips         17260 non-null int64
dtypes: int64(8), object(1)
memory usage: 1.3+ MB

If we suppose that the number of trips made by a household is a function of income and the number of automobiles owned, we can create an ordinary least squares regression model, and find the best fitting parameters like this:

[6]:
mod = sm.OLS(
    hh.n_trips,
    sm.add_constant(hh[['autos','income']])
)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                n_trips   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     2563.
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        13:45:17   Log-Likelihood:                -48167.
No. Observations:               17260   AIC:                         9.634e+04
Df Residuals:                   17257   BIC:                         9.636e+04
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4460      0.073     33.694      0.000       2.304       2.588
autos          2.5804      0.039     65.547      0.000       2.503       2.658
income       1.97e-06   2.79e-07      7.049      0.000    1.42e-06    2.52e-06
==============================================================================
Omnibus:                     4116.620   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11185.853
Skew:                           1.273   Prob(JB):                         0.00
Kurtosis:                       6.011   Cond. No.                     4.14e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.14e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Note that the hh dataframe contains a variety of other columns of data, but since we’re not interested in using them for this model, they can be implicitly omitted by creating a dataframe view that includes only the variables we do want.

Also, we use sm.add_constant, which adds a constant term to the regression function. By default, statsmodels does not include a constant in an ordinary least squares (OLS) model, so you must explicitly add one to the explanatory variables if you want it included.
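
For example, applying sm.add_constant to the two-column view and looking at the first few rows shows the effect directly: a const column of 1.0 is prepended alongside the original explanatory variables.

sm.add_constant(hh[['autos', 'income']]).head()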

The output of the model summary() is relatively extensive and includes a large number of statistical measures and tests that may or may not interest you. The most important of these measures include the coefficient estimates shown in the center panel of this report, as well as the R-squared measure at the upper right.
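
If you want to work with these measures directly rather than reading them off the printed report, the fitted results object exposes them as attributes, for example:

res.params      # estimated coefficients, indexed by variable name
res.bse         # standard errors of the coefficients
res.tvalues     # t-statistics
res.rsquared    # R-squared of the fit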

One other item in this report that may be concerning is the second warning at the bottom, which reports that there may be a numerical problem with the model. This problem is also reflected in the coefficients themselves, as the coefficient for income is many orders of magnitude smaller than the others. This is reasonable and intuitive: the impact of a unit (single dollar) change in annual household income is tiny compared to a unit (single car) change in automobile ownership. If we review the standard deviations of these explanatory variables, we can also see that they differ greatly in scale.
[7]:
sm.add_constant(hh[['autos','income']]).std()
[7]:
const          0.000000
autos          0.801841
income    112974.383573
dtype: float64

A difference in variance this large is not a problem in statistical theory, but it can introduce numerical stability problems when these models are represented on a computer. To solve this issue, we can simply rescale one or more variables so that the variances are more consistent:

[8]:
hh['income_100k'] = hh.income / 100_000
[9]:
mod = sm.OLS(
    hh.n_trips,
    sm.add_constant(hh[['autos','income_100k']])
)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                n_trips   R-squared:                       0.229
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     2563.
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        13:45:17   Log-Likelihood:                -48167.
No. Observations:               17260   AIC:                         9.634e+04
Df Residuals:                   17257   BIC:                         9.636e+04
Df Model:                           2
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.4460      0.073     33.694      0.000       2.304       2.588
autos           2.5804      0.039     65.547      0.000       2.503       2.658
income_100k     0.1970      0.028      7.049      0.000       0.142       0.252
==============================================================================
Omnibus:                     4116.620   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11185.853
Skew:                           1.273   Prob(JB):                         0.00
Kurtosis:                       6.011   Cond. No.                         6.60
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The R-squared and t-statistics of this model are the same as before, as this is in effect the same model as above. But in this revised model, the magnitude of the income coefficient is now much closer to that of the other coefficients, and the “condition number” warning is not present in the summary.
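
The rescaling also explains the new income coefficient: dividing the variable by 100,000 multiplies its estimated coefficient by the same factor, so 1.97e-06 trips per dollar of income becomes roughly 0.197 trips per $100,000 of income. We can confirm this from the fitted results:

# converting the per-$100k coefficient back to a per-dollar basis
# recovers the estimate from the unscaled model (about 1.97e-06)
res.params['income_100k'] / 100_000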

Piecewise Linear Functions

OLS linear regression models are by design written as linear-in-parameters models, but that does not mean that the explanatory data cannot be transformed first, for example by using a piecewise linear expansion. The larch package includes a piecewise_expansion function that can expand a single column of data in a pandas.DataFrame into multiple columns based on defined breakpoints.

[10]:
from larch.util.data_expansion import piecewise_expansion

For example, to create a piecewise linear version of household income, with breakpoints at $25,000 and $75,000, we could write:

[11]:
piecewise_expansion(hh.income, [25_000, 75_000]).head()
[11]:
         piece(income,None,25000)  piece(income,25000,75000)  piece(income,75000,None)
hh_id
1690841                      25000                      50000                    437000
1690961                      25000                       2500                         0
1690866                      25000                      50000                     75000
1690895                      25000                      50000                     29000
1690933                      25000                      50000                     20000

The result is three columns of data instead of one, with the first giving income up to the lower breakpoint, the next giving income between the two breakpoints, and the last giving the amount of income above the top breakpoint.
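
If larch is not available, an equivalent expansion can be built with ordinary pandas operations. The sketch below is one way to do it, assuming the variable is non-negative (as household income is here); the simple_piecewise name and the column labels it generates are purely illustrative.

def simple_piecewise(s, breakpoints):
    # expand a non-negative Series into piecewise-linear columns, where each
    # column holds the portion of the variable falling within its interval
    edges = [0] + list(breakpoints) + [np.inf]
    pieces = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        pieces[f'piece({s.name},{lo},{hi})'] = (s - lo).clip(lower=0, upper=hi - lo)
    return pd.DataFrame(pieces, index=s.index)

simple_piecewise(hh.income, [25_000, 75_000]).head()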

We can readily concatenate this expanded data with any other explanatory variables by using pandas.concat. Note that by default this function concatenates dataframes vertically (aligning on columns and stacking rows), but in this case we want to concatenate horizontally (aligning on the index and placing columns side by side). We can achieve this by passing axis=1 to the function in addition to the list of dataframes to concatenate.

[12]:
hh_edited = pd.concat([
    hh.autos,
    piecewise_expansion(hh.income_100k, [.25, .75]),
], axis=1)

hh_edited.head()
[12]:
         autos  piece(income_100k,None,0.25)  piece(income_100k,0.25,0.75)  piece(income_100k,0.75,None)
hh_id
1690841      2                          0.25                         0.500                          4.37
1690961      1                          0.25                         0.025                          0.00
1690866      2                          0.25                         0.500                          0.75
1690895      2                          0.25                         0.500                          0.29
1690933      2                          0.25                         0.500                          0.20

Then we can use this modified dataframe to construct a piecewise linear OLS regression model. Because the original and modified dataframes share the same index (i.e., the same number and order of rows), we can mix them in the OLS definition, using the n_trips column from the original dataframe as the dependent variable and the explanatory data from the modified dataframe.

[13]:
mod = sm.OLS(
    hh.n_trips,
    sm.add_constant(hh_edited)
)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                n_trips   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.231
Method:                 Least Squares   F-statistic:                     1297.
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        13:45:17   Log-Likelihood:                -48143.
No. Observations:               17260   AIC:                         9.630e+04
Df Residuals:                   17255   BIC:                         9.633e+04
Df Model:                           4
Covariance Type:            nonrobust
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const                            1.9177      0.153     12.542      0.000       1.618       2.217
autos                            2.4769      0.042     58.560      0.000       2.394       2.560
piece(income_100k,None,0.25)     2.2983      0.722      3.181      0.001       0.882       3.714
piece(income_100k,0.25,0.75)     1.0124      0.204      4.972      0.000       0.613       1.412
piece(income_100k,0.75,None)     0.0977      0.033      3.002      0.003       0.034       0.162
==============================================================================
Omnibus:                     4132.769   Durbin-Watson:                   1.925
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11235.615
Skew:                           1.278   Prob(JB):                         0.00
Kurtosis:                       6.015   Cond. No.                         56.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Polynomial Functions

In addition to piecewise linear terms in the regression equation, standard OLS allows any arbitrary non-linear transformation of the explanatory data. Students of statistics will be familiar with fitting a polynomial function with OLS coefficients, and this can be done in statsmodels, for example, by explicitly computing the desired polynomial terms before estimating the model parameters.

[14]:
hh['autos^2'] = hh['autos'] ** 2
hh['income^2'] = hh['income_100k'] ** 2
[15]:
mod = sm.OLS(
    hh.n_trips,
    sm.add_constant(hh[['autos','income_100k', 'autos^2', 'income^2']])
)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                n_trips   R-squared:                       0.230
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     1286.
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        13:45:17   Log-Likelihood:                -48160.
No. Observations:               17260   AIC:                         9.633e+04
Df Residuals:                   17255   BIC:                         9.637e+04
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.6596      0.127     20.926      0.000       2.411       2.909
autos           2.2161      0.139     15.970      0.000       1.944       2.488
income_100k     0.3748      0.067      5.578      0.000       0.243       0.507
autos^2         0.0839      0.033      2.545      0.011       0.019       0.148
income^2       -0.0314      0.011     -2.788      0.005      -0.054      -0.009
==============================================================================
Omnibus:                     4098.383   Durbin-Watson:                   1.925
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11134.434
Skew:                           1.268   Prob(JB):                         0.00
Kurtosis:                       6.009   Cond. No.                         46.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Alternatively, polynomial terms can be created automatically for every column in the source data, as well as for interactions, using the PolynomialFeatures preprocessor from the sklearn package. This tool doesn’t automatically maintain the DataFrame formatting when applied (instead it outputs a simple array of values), but it is simple to write a small function that will do so.

[16]:
def polynomial(x, **kwargs):
    # expand a DataFrame with polynomial and interaction terms, keeping
    # the original column names and index in the result
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(**kwargs)
    arr = poly.fit_transform(x)
    return pd.DataFrame(arr, columns=poly.get_feature_names(x.columns), index=x.index)

Then we can use the function to calculate polynomial terms automatically. In this example, by setting the degree to 3, we not only get squared and cubed versions of the two variables, but also all of their interactions up to total degree 3.

[17]:
hh_poly = polynomial(hh[['autos','income_100k']], degree=3)
hh_poly.head()
[17]:
           1  autos  income_100k  autos^2  autos income_100k  income_100k^2  autos^3  autos^2 income_100k  autos income_100k^2  income_100k^3
hh_id
1690841  1.0    2.0        5.120      4.0             10.240      26.214400      8.0               20.480            52.428800     134.217728
1690961  1.0    1.0        0.275      1.0              0.275       0.075625      1.0                0.275             0.075625       0.020797
1690866  1.0    2.0        1.500      4.0              3.000       2.250000      8.0                6.000             4.500000       3.375000
1690895  1.0    2.0        1.040      4.0              2.080       1.081600      8.0                4.160             2.163200       1.124864
1690933  1.0    2.0        0.950      4.0              1.900       0.902500      8.0                3.800             1.805000       0.857375

Great care should be taken with this automatic polynomial expansion of the data, as it is easy to end up with an overfitted model, especially when using a tool like OLS that does not attempt to self-correct to limit overfitting. (One simple out-of-sample check is sketched at the end of this section.)

[18]:
mod = sm.OLS(
    hh.n_trips,
    hh_poly
)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                n_trips   R-squared:                       0.236
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     590.5
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        13:45:17   Log-Likelihood:                -48093.
No. Observations:               17260   AIC:                         9.621e+04
Df Residuals:                   17250   BIC:                         9.628e+04
Df Model:                           9
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
1                       3.6606      0.192     19.086      0.000       3.285       4.037
autos                  -0.1241      0.321     -0.386      0.699      -0.754       0.506
income_100k             0.6579      0.223      2.955      0.003       0.222       1.094
autos^2                 1.2151      0.178      6.828      0.000       0.866       1.564
autos income_100k       0.5617      0.182      3.088      0.002       0.205       0.918
income_100k^2          -0.3633      0.052     -6.984      0.000      -0.465      -0.261
autos^3                -0.1641      0.030     -5.515      0.000      -0.222      -0.106
autos^2 income_100k    -0.1079      0.039     -2.790      0.005      -0.184      -0.032
autos income_100k^2    -0.0216      0.016     -1.337      0.181      -0.053       0.010
income_100k^3           0.0349      0.004      8.257      0.000       0.027       0.043
==============================================================================
Omnibus:                     4055.002   Durbin-Watson:                   1.926
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            10960.315
Skew:                           1.257   Prob(JB):                         0.00
Kurtosis:                       5.987   Cond. No.                         659.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
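
One way to guard against the overfitting risk noted above is to hold out part of the data, fit the model on the remainder, and compare in-sample and out-of-sample fit. Below is a minimal sketch of such a check, assuming the sklearn package is available for the split; the variable names are illustrative.

from sklearn.model_selection import train_test_split

# split the households into an estimation set and a holdout set
train_idx, test_idx = train_test_split(hh_poly.index, test_size=0.2, random_state=0)

# fit the degree-3 polynomial model on the estimation set only
res_train = sm.OLS(hh.n_trips.loc[train_idx], hh_poly.loc[train_idx]).fit()

# compute R-squared on the holdout set and compare with the in-sample value
y_test = hh.n_trips.loc[test_idx]
pred = res_train.predict(hh_poly.loc[test_idx])
r2_holdout = 1 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print(res_train.rsquared, r2_holdout)

A holdout R-squared substantially below the in-sample value would be a sign that the model is overfit.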