Using Linear Regression

[1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

Question

Construct an ordinary least squares (OLS) linear regression model to predict the given value of time for each individual in the Jupiter study area data as a function of:

  - age,
  - gender,
  - full-time employment status, and
  - household income.

Evaluate this model to answer the questions:

  1. What are the coefficients on this model?
  2. Do any of these factors appear not to be relevant in determining an individual’s value of time? (Hint: Gender)
  3. If other variables from the household and person datasets could also be included in the OLS model specification, are any of them also significant? (Hint: Yes, there is at least one other relevant factor in this data.)

Data

To answer the questions, use the following data files:

[2]:
per = pd.read_csv(tt.data('SERPM8-BASE2015-PERSONS'))
hh = pd.read_csv(tt.data('SERPM8-BASE2015-HOUSEHOLDS'))
[3]:
per.head()
[3]:
     hh_id  person_id  person_num  age  gender                        type  value_of_time  activity_pattern  imf_choice  inmf_choice  fp_choice  reimb_pct  wrkr_type
0  1690841    4502948           1   46       m            Full-time worker       5.072472                 M           1            1         -1        0.0          0
1  1690841    4502949           2   47       f            Part-time worker       5.072472                 M           2           37         -1        0.0          0
2  1690841    4502950           3   11       f  Student of non-driving age       3.381665                 M           3            1         -1        0.0          0
3  1690841    4502951           4    8       m  Student of non-driving age       3.381665                 M           3            1         -1        0.0          0
4  1690961    4503286           1   52       m            Part-time worker       2.447870                 M           1            2         -1        0.0          0
[4]:
hh.head()
[4]:
   Unnamed: 0    hh_id  home_mgra  income  autos  transponder  cdap_pattern  jtf_choice  autotech  tncmemb
0      426629  1690841       7736  512000      2            1         MMMM0           0         0        0
1      426630  1690961       7736   27500      1            0         MNMM0           0         0        0
2      426631  1690866       7736  150000      2            0          HMM0           0         0        0
3      426632  1690895       7736  104000      2            1         MMMM0           0         0        0
4      426633  1690933       7736   95000      2            1          MNM0           0         0        0
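
Household income lives in the households table, while value of time is recorded per person, so the two tables need to be joined on `hh_id` before a model can use both. A minimal sketch of that join follows, using only a few columns of the rows shown in the previews above; the names `per_head`, `hh_head`, and `merged` are illustrative, and a left join is one reasonable choice here rather than the only one.

```python
import pandas as pd

# A few columns of the five person rows shown in per.head()
per_head = pd.DataFrame({
    'hh_id': [1690841, 1690841, 1690841, 1690841, 1690961],
    'person_id': [4502948, 4502949, 4502950, 4502951, 4503286],
    'age': [46, 47, 11, 8, 52],
    'value_of_time': [5.072472, 5.072472, 3.381665, 3.381665, 2.447870],
})

# The first three household rows shown in hh.head()
hh_head = pd.DataFrame({
    'hh_id': [1690841, 1690961, 1690866],
    'income': [512000, 27500, 150000],
})

# Left join: keep every person, attach that person's household income.
merged = per_head.merge(hh_head[['hh_id', 'income']], on='hh_id', how='left')
print(merged)
```

With the full datasets, the same pattern would apply to `per` and `hh` directly, for example `per.merge(hh[['hh_id', 'income']], on='hh_id', how='left')`.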