Using Linear Regression

[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

Question

Construct an ordinary least squares linear regression model to predict the given value of time for each individual in the Jupiter study area data as a function of:

  - age,
  - gender,
  - full-time employment status, and
  - household income.

Evaluate this model to answer the questions:

  1. What are the coefficients on this model?
  2. Do any of these factors appear to be irrelevant in determining an individual’s value of time? (Hint: Gender)
  3. If other variables from the household and person datasets could also be included in the OLS model specification, are there any that are also significant? (Hint: Yes, there is at least one other relevant factor in this data.)

Data

To answer the questions, use the following data files:

[2]:
per = pd.read_csv(tt.data('SERPM8-BASE2015-PERSONS'))
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'))
[3]:
per.head()
[3]:
hh_id person_id person_num age gender type value_of_time activity_pattern imf_choice inmf_choice fp_choice reimb_pct wrkr_type
0 1690841 4502948 1 46 m Full-time worker 5.072472 M 1 1 -1 0.0 0
1 1690841 4502949 2 47 f Part-time worker 5.072472 M 2 37 -1 0.0 0
2 1690841 4502950 3 11 f Student of non-driving age 3.381665 M 3 1 -1 0.0 0
3 1690841 4502951 4 8 m Student of non-driving age 3.381665 M 3 1 -1 0.0 0
4 1690961 4503286 1 52 m Part-time worker 2.447870 M 1 2 -1 0.0 0
[4]:
hh.head()
[4]:
Unnamed: 0 hh_id home_mgra income autos transponder cdap_pattern jtf_choice autotech tncmemb
0 426629 1690841 7736 512000 2 1 MMMM0 0 0 0
1 426630 1690961 7736 27500 1 0 MNMM0 0 0 0
2 426631 1690866 7736 150000 2 0 HMM0 0 0 0
3 426632 1690895 7736 104000 2 1 MMMM0 0 0 0
4 426633 1690933 7736 95000 2 1 MNM0 0 0 0

Solution

First, we take household income from the hh dataframe and merge it into the per dataframe, so that each individual record carries the income of his or her household.

[5]:
per = pd.merge(per, hh[['hh_id', 'income', 'autos', 'transponder']], on='hh_id', how='inner')
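Since several persons typically share one household, this is a many-to-one merge; pandas can verify that assumption with the validate argument of pd.merge. A minimal sketch on toy frames (the values below are hypothetical, not from the SERPM data):

```python
import pandas as pd

# Toy stand-ins for the person and household tables (hypothetical values).
per_toy = pd.DataFrame({'hh_id': [1, 1, 2], 'age': [46, 47, 52]})
hh_toy = pd.DataFrame({'hh_id': [1, 2], 'income': [512000, 27500]})

# validate='m:1' raises MergeError if hh_id is not unique on the household side,
# guarding against accidental row duplication.
merged = pd.merge(per_toy, hh_toy, on='hh_id', how='inner', validate='m:1')
print(merged)
```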

Then, we create a couple of dummy variables with binary values to include as explanatory variables in the model estimation. We create the female and full_time variables to capture the categorical effect of gender and full-time employment status on the model outcome. We can also scale the income variable to ensure more reasonable variance in the estimation; for example, we can simply scale the numbers down by 100,000.

[6]:
per['female'] = np.where((per.gender == 'f'), 1, 0)
per['full_time'] = np.where((per.type == 'Full-time worker'), 1, 0)
per['hh_income(100k)'] = per['income'] / 100000
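As an aside, an equivalent way to build such 0/1 indicators is to cast a boolean comparison to int, which avoids np.where entirely. A small self-contained sketch (the frame below uses hypothetical values in place of per):

```python
import pandas as pd

# A small synthetic frame standing in for `per` (hypothetical values).
df = pd.DataFrame({
    'gender': ['m', 'f', 'f', 'm'],
    'type': ['Full-time worker', 'Part-time worker',
             'Full-time worker', 'Student of non-driving age'],
    'income': [512000, 27500, 150000, 95000],
})

# A boolean comparison cast to int yields the same 0/1 dummies as np.where.
df['female'] = (df['gender'] == 'f').astype(int)
df['full_time'] = (df['type'] == 'Full-time worker').astype(int)
df['hh_income(100k)'] = df['income'] / 100000

print(df[['female', 'full_time', 'hh_income(100k)']])
```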
[7]:
per.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40683 entries, 0 to 40682
Data columns (total 19 columns):
hh_id               40683 non-null int64
person_id           40683 non-null int64
person_num          40683 non-null int64
age                 40683 non-null int64
gender              40683 non-null object
type                40683 non-null object
value_of_time       40683 non-null float64
activity_pattern    40683 non-null object
imf_choice          40683 non-null int64
inmf_choice         40683 non-null int64
fp_choice           40683 non-null int64
reimb_pct           40683 non-null float64
wrkr_type           40683 non-null int64
income              40683 non-null int64
autos               40683 non-null int64
transponder         40683 non-null int64
female              40683 non-null int64
full_time           40683 non-null int64
hh_income(100k)     40683 non-null float64
dtypes: float64(3), int64(13), object(3)
memory usage: 6.2+ MB

At this point, we have the dataframe ready with all explanatory variables (age, female, full_time and hh_income(100k)) and the response variable (value_of_time). We check the data types of all variables and confirm there are no null values. If everything looks appropriate, we create a model object using the sm.OLS() method. Inside this call, we add a constant to the explanatory variables using the sm.add_constant() method. Then, we fit the model with the .fit() method and store the estimation results in a variable.

[8]:
model = sm.OLS(per['value_of_time'], sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)']]))
result = model.fit()
/Users/jpn/anaconda/envs/tt/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

We can print the summary of the model estimation using the .summary() method to review a number of statistical outputs from the model, including the model coefficients.

[9]:
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:          value_of_time   R-squared:                       0.036
Model:                            OLS   Adj. R-squared:                  0.036
Method:                 Least Squares   F-statistic:                     384.5
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        15:02:14   Log-Likelihood:            -1.4546e+05
No. Observations:               40683   AIC:                         2.909e+05
Df Residuals:                   40678   BIC:                         2.910e+05
Df Model:                           4
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               6.9651      0.116     60.046      0.000       6.738       7.192
age                 0.0361      0.002     20.006      0.000       0.033       0.040
female             -0.0345      0.087     -0.399      0.690      -0.204       0.135
full_time           1.7770      0.089     19.855      0.000       1.602       1.952
hh_income(100k)     0.9476      0.037     25.937      0.000       0.876       1.019
==============================================================================
Omnibus:                    17005.353   Durbin-Watson:                   0.353
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            77628.651
Skew:                           2.042   Prob(JB):                         0.00
Kurtosis:                       8.395   Cond. No.                         155.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the model estimation summary, we see that full-time employment has a substantial positive effect on value of time. Increases in household income and age also contribute to an increase in value of time. The t-statistics for all these parameters are much larger than 1.96, indicating they are almost certainly significant in determining a person’s value of time. The coefficient on gender, however, has a t-statistic with magnitude of only about 0.4, which suggests that gender is not a statistically significant factor in determining the value of time.

If we estimate the same model, but add the number of automobiles owned by the person’s household as an additional explanatory factor, we can see that automobile ownership is also a relevant and statistically significant factor (with a t-statistic of 11.8).

[10]:
model = sm.OLS(
    per['value_of_time'],
    sm.add_constant(per[['age', 'female', 'full_time', 'hh_income(100k)', 'autos']])
)
result = model.fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:          value_of_time   R-squared:                       0.040
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     336.2
Date:                Thu, 08 Aug 2019   Prob (F-statistic):               0.00
Time:                        15:02:14   Log-Likelihood:            -1.4539e+05
No. Observations:               40683   AIC:                         2.908e+05
Df Residuals:                   40677   BIC:                         2.909e+05
Df Model:                           5
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               5.6471      0.161     35.013      0.000       5.331       5.963
age                 0.0407      0.002     22.068      0.000       0.037       0.044
female              0.0060      0.086      0.069      0.945      -0.163       0.175
full_time           1.7309      0.089     19.354      0.000       1.556       1.906
hh_income(100k)     0.8326      0.038     22.047      0.000       0.759       0.907
autos               0.6224      0.053     11.740      0.000       0.518       0.726
==============================================================================
Omnibus:                    17013.391   Durbin-Watson:                   0.348
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            77754.288
Skew:                           2.043   Prob(JB):                         0.00
Kurtosis:                       8.401   Cond. No.                         203.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.