Processing math: 100%

Discrete Choice Models

Discrete Choice Models

Link to Notebook GitHub

Fair's Affair data

A survey of women only was conducted in 1974 by Redbook asking about extramarital affairs.

In [1]:
1
2
3
4
5
6
<span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">print_function</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">statsmodels.api</span> <span class="kn">as</span> <span class="nn">sm</span>
<span class="kn">from</span> <span class="nn">statsmodels.formula.api</span> <span class="kn">import</span> <span class="n">logit</span><span class="p">,</span> <span class="n">probit</span><span class="p">,</span> <span class="n">poisson</span><span class="p">,</span> <span class="n">ols</span>
In [2]:
1
<span class="k">print</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fair</span><span class="o">.</span><span class="n">SOURCE</span><span class="p">)</span>
Fair, Ray. 1978. "A Theory of Extramarital Affairs," `Journal of Political
    Economy`, February, 45-61.

The data is available at http://fairmodel.econ.yale.edu/rayfair/pdf/2011b.htm


In [3]:
1
<span class="k">print</span><span class="p">(</span> <span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fair</span><span class="o">.</span><span class="n">NOTE</span><span class="p">)</span>
::

    Number of observations: 6366
    Number of variables: 9
    Variable name definitions:

        rate_marriage   : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
                        4 = good, 5 = very good
        age             : Age
        yrs_married     : No. years married. Interval approximations. See
                        original paper for detailed explanation.
        children        : No. children
        religious       : How relgious, 1 = not, 2 = mildly, 3 = fairly,
                        4 = strongly
        educ            : Level of education, 9 = grade school, 12 = high
                        school, 14 = some college, 16 = college graduate,
                        17 = some graduate school, 20 = advanced degree
        occupation      : 1 = student, 2 = farming, agriculture; semi-skilled,
                        or unskilled worker; 3 = white-colloar; 4 = teacher
                        counselor social worker, nurse; artist, writers;
                        technician, skilled worker, 5 = managerial,
                        administrative, business, 6 = professional with
                        advanced degree
        occupation_husb : Husband's occupation. Same as occupation.
        affairs         : measure of time spent in extramarital affairs

    See the original paper for more details.


In [4]:
1
<span class="n">dta</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">fair</span><span class="o">.</span><span class="n">load_pandas</span><span class="p">()</span><span class="o">.</span><span class="n">data</span>
In [5]:
1
2
<span class="n">dta</span><span class="p">[</span><span class="s">'affair'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">dta</span><span class="p">[</span><span class="s">'affairs'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">dta</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
   rate_marriage  age  yrs_married  children  religious  educ  occupation  occupation_husb  \
0              3   32          9.0       3.0          3    17           2                5
1              3   27         13.0       3.0          1    14           3                4
2              4   22          2.5       0.0          1    16           3                5
3              4   37         16.5       4.0          3    16           5                5
4              5   27          9.0       1.0          1    14           3                4
5              4   27          9.0       0.0          2    14           3                4
6              5   37         23.0       5.5          2    12           5                4
7              5   37         23.0       5.5          2    12           2                3
8              3   22          2.5       0.0          2    12           3                3
9              3   27          6.0       0.0          1    16           3                5

    affairs  affair
0  0.111111       1
1  3.230769       1
2  1.400000       1
3  0.727273       1
4  4.666666       1
5  4.666666       1
6  0.852174       1
7  1.826086       1
8  4.799999       1
9  1.333333       1

In [6]:
1
<span class="k">print</span><span class="p">(</span><span class="n">dta</span><span class="o">.</span><span class="n">describe</span><span class="p">())</span>
       rate_marriage          age  yrs_married     children    religious         educ  \
count    6366.000000  6366.000000  6366.000000  6366.000000  6366.000000  6366.000000
mean        4.109645    29.082862     9.009425     1.396874     2.426170    14.209865
std         0.961430     6.847882     7.280120     1.433471     0.878369     2.178003
min         1.000000    17.500000     0.500000     0.000000     1.000000     9.000000
25%         4.000000    22.000000     2.500000     0.000000     2.000000    12.000000
50%         4.000000    27.000000     6.000000     1.000000     2.000000    14.000000
75%         5.000000    32.000000    16.500000     2.000000     3.000000    16.000000
max         5.000000    42.000000    23.000000     5.500000     4.000000    20.000000

        occupation  occupation_husb      affairs       affair
count  6366.000000      6366.000000  6366.000000  6366.000000
mean      3.424128         3.850141     0.705374     0.322495
std       0.942399         1.346435     2.203374     0.467468
min       1.000000         1.000000     0.000000     0.000000
25%       3.000000         3.000000     0.000000     0.000000
50%       3.000000         4.000000     0.000000     0.000000
75%       4.000000         5.000000     0.484848     1.000000
max       6.000000         6.000000    57.599991     1.000000

In [7]:
1
2
3
<span class="n">affair_mod</span> <span class="o">=</span> <span class="n">logit</span><span class="p">(</span><span class="s">"affair ~ occupation + educ + occupation_husb"</span>
                   <span class="s">"+ rate_marriage + age + yrs_married + children"</span>
                   <span class="s">" + religious"</span><span class="p">,</span> <span class="n">dta</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
Optimization terminated successfully.
         Current function value: 0.545314
         Iterations 6

In [8]:
1
<span class="k">print</span><span class="p">(</span><span class="n">affair_mod</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
                           Logit Regression Results
==============================================================================
Dep. Variable:                 affair   No. Observations:                 6366
Model:                          Logit   Df Residuals:                     6357
Method:                           MLE   Df Model:                            8
Date:                Tue, 02 Dec 2014   Pseudo R-squ.:                  0.1327
Time:                        12:54:00   Log-Likelihood:                -3471.5
converged:                       True   LL-Null:                       -4002.5
                                        LLR p-value:                5.807e-224
===================================================================================
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept           3.7257      0.299     12.470      0.000         3.140     4.311
occupation          0.1602      0.034      4.717      0.000         0.094     0.227
educ               -0.0392      0.015     -2.533      0.011        -0.070    -0.009
occupation_husb     0.0124      0.023      0.541      0.589        -0.033     0.057
rate_marriage      -0.7161      0.031    -22.784      0.000        -0.778    -0.655
age                -0.0605      0.010     -5.885      0.000        -0.081    -0.040
yrs_married         0.1100      0.011     10.054      0.000         0.089     0.131
children           -0.0042      0.032     -0.134      0.893        -0.066     0.058
religious          -0.3752      0.035    -10.792      0.000        -0.443    -0.307
===================================================================================

How well are we predicting?

In [9]:
1
<span class="n">affair_mod</span><span class="o">.</span><span class="n">pred_table</span><span class="p">()</span>
Out[9]:
array([[ 3882.,   431.],
       [ 1326.,   727.]])

The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects.

In [10]:
1
2
<span class="n">mfx</span> <span class="o">=</span> <span class="n">affair_mod</span><span class="o">.</span><span class="n">get_margeff</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">mfx</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
        Logit Marginal Effects
=====================================
Dep. Variable:                 affair
Method:                          dydx
At:                           overall
===================================================================================
                     dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
occupation          0.0293      0.006      4.744      0.000         0.017     0.041
educ               -0.0072      0.003     -2.538      0.011        -0.013    -0.002
occupation_husb     0.0023      0.004      0.541      0.589        -0.006     0.010
rate_marriage      -0.1308      0.005    -26.891      0.000        -0.140    -0.121
age                -0.0110      0.002     -5.937      0.000        -0.015    -0.007
yrs_married         0.0201      0.002     10.327      0.000         0.016     0.024
children           -0.0008      0.006     -0.134      0.893        -0.012     0.011
religious          -0.0685      0.006    -11.119      0.000        -0.081    -0.056
===================================================================================

In [11]:
1
2
<span class="n">respondent1000</span> <span class="o">=</span> <span class="n">dta</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="mi">1000</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">respondent1000</span><span class="p">)</span>
rate_marriage       4.000000
age                37.000000
yrs_married        23.000000
children            3.000000
religious           3.000000
educ               12.000000
occupation          3.000000
occupation_husb     4.000000
affairs             0.521739
affair              1.000000
Name: 1000, dtype: float64

In [12]:
1
2
3
4
5
6
<span class="n">resp</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">9</span><span class="p">),</span> <span class="n">respondent1000</span><span class="p">[[</span><span class="s">"occupation"</span><span class="p">,</span> <span class="s">"educ"</span><span class="p">,</span>
                                            <span class="s">"occupation_husb"</span><span class="p">,</span> <span class="s">"rate_marriage"</span><span class="p">,</span>
                                            <span class="s">"age"</span><span class="p">,</span> <span class="s">"yrs_married"</span><span class="p">,</span> <span class="s">"children"</span><span class="p">,</span>
                                            <span class="s">"religious"</span><span class="p">]]</span><span class="o">.</span><span class="n">tolist</span><span class="p">()))</span>
<span class="n">resp</span><span class="o">.</span><span class="n">update</span><span class="p">({</span><span class="mi">0</span> <span class="p">:</span> <span class="mi">1</span><span class="p">})</span>
<span class="k">print</span><span class="p">(</span><span class="n">resp</span><span class="p">)</span>
{0: 1, 1: 3.0, 2: 12.0, 3: 4.0, 4: 4.0, 5: 37.0, 6: 23.0, 7: 3.0, 8: 3.0}

In [13]:
1
2
<span class="n">mfx</span> <span class="o">=</span> <span class="n">affair_mod</span><span class="o">.</span><span class="n">get_margeff</span><span class="p">(</span><span class="n">atexog</span><span class="o">=</span><span class="n">resp</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">mfx</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
        Logit Marginal Effects
=====================================
Dep. Variable:                 affair
Method:                          dydx
At:                           overall
===================================================================================
                     dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
occupation          0.0400      0.008      4.711      0.000         0.023     0.057
educ               -0.0098      0.004     -2.537      0.011        -0.017    -0.002
occupation_husb     0.0031      0.006      0.541      0.589        -0.008     0.014
rate_marriage      -0.1788      0.008    -22.743      0.000        -0.194    -0.163
age                -0.0151      0.003     -5.928      0.000        -0.020    -0.010
yrs_married         0.0275      0.003     10.256      0.000         0.022     0.033
children           -0.0011      0.008     -0.134      0.893        -0.017     0.014
religious          -0.0937      0.009    -10.722      0.000        -0.111    -0.077
===================================================================================

In [14]:
1
<span class="n">affair_mod</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">respondent1000</span><span class="p">)</span>
Out[14]:
array([ 0.5188])
In [15]:
1
<span class="n">affair_mod</span><span class="o">.</span><span class="n">fittedvalues</span><span class="p">[</span><span class="mi">1000</span><span class="p">]</span>
Out[15]:
0.075161592850574888
In [16]:
1
<span class="n">affair_mod</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">cdf</span><span class="p">(</span><span class="n">affair_mod</span><span class="o">.</span><span class="n">fittedvalues</span><span class="p">[</span><span class="mi">1000</span><span class="p">])</span>
Out[16]:
0.51878155721215002

The "correct" model here is likely the Tobit model. We have an work in progress branch "tobit-model" on github, if anyone is interested in censored regression models.

Exercise: Logit vs Probit

In [17]:
1
2
3
4
5
6
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">)</span>
<span class="n">support</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">support</span><span class="p">,</span> <span class="n">stats</span><span class="o">.</span><span class="n">logistic</span><span class="o">.</span><span class="n">cdf</span><span class="p">(</span><span class="n">support</span><span class="p">),</span> <span class="s">'r-'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Logistic'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">support</span><span class="p">,</span> <span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="o">.</span><span class="n">cdf</span><span class="p">(</span><span class="n">support</span><span class="p">),</span> <span class="n">label</span><span class="o">=</span><span class="s">'Probit'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">();</span>
In [18]:
1
2
3
4
5
6
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">)</span>
<span class="n">support</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">support</span><span class="p">,</span> <span class="n">stats</span><span class="o">.</span><span class="n">logistic</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">support</span><span class="p">),</span> <span class="s">'r-'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Logistic'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">support</span><span class="p">,</span> <span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">support</span><span class="p">),</span> <span class="n">label</span><span class="o">=</span><span class="s">'Probit'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">legend</span><span class="p">();</span>

Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Much difference in marginal effects?

Genarlized Linear Model Example

In [19]:
1
<span class="k">print</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">star98</span><span class="o">.</span><span class="n">SOURCE</span><span class="p">)</span>
Jeff Gill's `Generalized Linear Models: A Unified Approach`

http://jgill.wustl.edu/research/books.html


In [20]:
1
<span class="k">print</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">star98</span><span class="o">.</span><span class="n">DESCRLONG</span><span class="p">)</span>
This data is on the California education policy and outcomes (STAR program
results for 1998.  The data measured standardized testing by the California
Department of Education that required evaluation of 2nd - 11th grade students
by the the Stanford 9 test on a variety of subjects.  This dataset is at
the level of the unified school district and consists of 303 cases.  The
binary response variable represents the number of 9th graders scoring
over the national median value on the mathematics exam.

The data used in this example is only a subset of the original source.


In [21]:
1
<span class="k">print</span><span class="p">(</span><span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">star98</span><span class="o">.</span><span class="n">NOTE</span><span class="p">)</span>
::

    Number of Observations - 303 (counties in California).

    Number of Variables - 13 and 8 interaction terms.

    Definition of variables names::

        NABOVE   - Total number of students above the national median for the
                   math section.
        NBELOW   - Total number of students below the national median for the
                   math section.
        LOWINC   - Percentage of low income students
        PERASIAN - Percentage of Asian student
        PERBLACK - Percentage of black students
        PERHISP  - Percentage of Hispanic students
        PERMINTE - Percentage of minority teachers
        AVYRSEXP - Sum of teachers' years in educational service divided by the
                number of teachers.
        AVSALK   - Total salary budget including benefits divided by the number
                   of full-time teachers (in thousands)
        PERSPENK - Per-pupil spending (in thousands)
        PTRATIO  - Pupil-teacher ratio.
        PCTAF    - Percentage of students taking UC/CSU prep courses
        PCTCHRT  - Percentage of charter schools
        PCTYRRND - Percentage of year-round schools

        The below variables are interaction terms of the variables defined
        above.

        PERMINTE_AVYRSEXP
        PEMINTE_AVSAL
        AVYRSEXP_AVSAL
        PERSPEN_PTRATIO
        PERSPEN_PCTAF
        PTRATIO_PCTAF
        PERMINTE_AVTRSEXP_AVSAL
        PERSPEN_PTRATIO_PCTAF


In [22]:
1
2
<span class="n">dta</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">datasets</span><span class="o">.</span><span class="n">star98</span><span class="o">.</span><span class="n">load_pandas</span><span class="p">()</span><span class="o">.</span><span class="n">data</span>
<span class="k">print</span><span class="p">(</span><span class="n">dta</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
Index([u'NABOVE', u'NBELOW', u'LOWINC', u'PERASIAN', u'PERBLACK', u'PERHISP', u'PERMINTE', u'AVYRSEXP', u'AVSALK', u'PERSPENK', u'PTRATIO', u'PCTAF', u'PCTCHRT', u'PCTYRRND', u'PERMINTE_AVYRSEXP', u'PERMINTE_AVSAL', u'AVYRSEXP_AVSAL', u'PERSPEN_PTRATIO', u'PERSPEN_PCTAF', u'PTRATIO_PCTAF', u'PERMINTE_AVYRSEXP_AVSAL', u'PERSPEN_PTRATIO_PCTAF'], dtype='object')

In [23]:
1
<span class="k">print</span><span class="p">(</span><span class="n">dta</span><span class="p">[[</span><span class="s">'NABOVE'</span><span class="p">,</span> <span class="s">'NBELOW'</span><span class="p">,</span> <span class="s">'LOWINC'</span><span class="p">,</span> <span class="s">'PERASIAN'</span><span class="p">,</span> <span class="s">'PERBLACK'</span><span class="p">,</span> <span class="s">'PERHISP'</span><span class="p">,</span> <span class="s">'PERMINTE'</span><span class="p">]]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
   NABOVE  NBELOW    LOWINC   PERASIAN   PERBLACK    PERHISP   PERMINTE
0     452     355  34.39730  23.299300  14.235280  11.411120  15.918370
1     144      40  17.36507  29.328380   8.234897   9.314884  13.636360
2     337     234  32.64324   9.226386  42.406310  13.543720  28.834360
3     395     178  11.90953  13.883090   3.796973  11.443110  11.111110
4       8      57  36.88889  12.187500  76.875000   7.604167  43.589740
5    1348     899  20.93149  28.023510   4.643221  13.808160  15.378490
6     477     887  53.26898   8.447858  19.374830  37.905330  25.525530
7     565     347  15.19009   3.665781   2.649680  13.092070   6.203008
8     205     320  28.21582  10.430420   6.786374  32.334300  13.461540
9     469     598  32.77897  17.178310  12.484930  28.323290  27.259890

In [24]:
1
<span class="k">print</span><span class="p">(</span><span class="n">dta</span><span class="p">[[</span><span class="s">'AVYRSEXP'</span><span class="p">,</span> <span class="s">'AVSALK'</span><span class="p">,</span> <span class="s">'PERSPENK'</span><span class="p">,</span> <span class="s">'PTRATIO'</span><span class="p">,</span> <span class="s">'PCTAF'</span><span class="p">,</span> <span class="s">'PCTCHRT'</span><span class="p">,</span> <span class="s">'PCTYRRND'</span><span class="p">]]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
   AVYRSEXP    AVSALK  PERSPENK   PTRATIO     PCTAF  PCTCHRT   PCTYRRND
0  14.70646  59.15732  4.445207  21.71025  57.03276        0  22.222220
1  16.08324  59.50397  5.267598  20.44278  64.62264        0   0.000000
2  14.59559  60.56992  5.482922  18.95419  53.94191        0   0.000000
3  14.38939  58.33411  4.165093  21.63539  49.06103        0   7.142857
4  13.90568  63.15364  4.324902  18.77984  52.38095        0   0.000000
5  14.97755  66.97055  3.916104  24.51914  44.91578        0   2.380952
6  14.67829  57.62195  4.270903  22.21278  32.28916        0  12.121210
7  13.66197  63.44740  4.309734  24.59026  30.45267        0   0.000000
8  16.41760  57.84564  4.527603  21.74138  22.64574        0   0.000000
9  12.51864  57.80141  4.648917  20.26010  26.07099        0   0.000000

In [25]:
1
2
<span class="n">formula</span> <span class="o">=</span> <span class="s">'NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT '</span>
<span class="n">formula</span> <span class="o">+=</span> <span class="s">'+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'</span>

Aside: Binomial distribution

Toss a six-sided die 5 times, what's the probability of exactly 2 fours?

In [26]:
1
<span class="n">stats</span><span class="o">.</span><span class="n">binom</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mf">1.</span><span class="o">/</span><span class="mi">6</span><span class="p">)</span><span class="o">.</span><span class="n">pmf</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
Out[26]:
0.16075102880658435
In [27]:
1
2
<span class="kn">from</span> <span class="nn">scipy.misc</span> <span class="kn">import</span> <span class="n">comb</span>
<span class="n">comb</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="mf">6.</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="mi">5</span><span class="o">/</span><span class="mf">6.</span><span class="p">)</span><span class="o">**</span><span class="mi">3</span>
Out[27]:
0.1607510288065844
In [28]:
1
2
<span class="kn">from</span> <span class="nn">statsmodels.formula.api</span> <span class="kn">import</span> <span class="n">glm</span>
<span class="n">glm_mod</span> <span class="o">=</span> <span class="n">glm</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span> <span class="n">dta</span><span class="p">,</span> <span class="n">family</span><span class="o">=</span><span class="n">sm</span><span class="o">.</span><span class="n">families</span><span class="o">.</span><span class="n">Binomial</span><span class="p">())</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
In [29]:
1
<span class="k">print</span><span class="p">(</span><span class="n">glm_mod</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
                  Generalized Linear Model Regression Results
================================================================================
Dep. Variable:     ['NABOVE', 'NBELOW']   No. Observations:                  303
Model:                              GLM   Df Residuals:                      282
Model Family:                  Binomial   Df Model:                           20
Link Function:                    logit   Scale:                             1.0
Method:                            IRLS   Log-Likelihood:                -2998.6
Date:                  Tue, 02 Dec 2014   Deviance:                       4078.8
Time:                          12:54:05   Pearson chi2:                 4.05e+03
No. Iterations:                       7
============================================================================================
                               coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------
Intercept                    2.9589      1.547      1.913      0.056        -0.073     5.990
LOWINC                      -0.0168      0.000    -38.749      0.000        -0.018    -0.016
PERASIAN                     0.0099      0.001     16.505      0.000         0.009     0.011
PERBLACK                    -0.0187      0.001    -25.182      0.000        -0.020    -0.017
PERHISP                     -0.0142      0.000    -32.818      0.000        -0.015    -0.013
PCTCHRT                      0.0049      0.001      3.921      0.000         0.002     0.007
PCTYRRND                    -0.0036      0.000    -15.878      0.000        -0.004    -0.003
PERMINTE                     0.2545      0.030      8.498      0.000         0.196     0.313
AVYRSEXP                     0.2407      0.057      4.212      0.000         0.129     0.353
PERMINTE:AVYRSEXP           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
AVSALK                       0.0804      0.014      5.775      0.000         0.053     0.108
PERMINTE:AVSALK             -0.0040      0.000     -8.450      0.000        -0.005    -0.003
AVYRSEXP:AVSALK             -0.0039      0.001     -4.059      0.000        -0.006    -0.002
PERMINTE:AVYRSEXP:AVSALK     0.0002   2.99e-05      7.428      0.000         0.000     0.000
PERSPENK                    -1.9522      0.317     -6.162      0.000        -2.573    -1.331
PTRATIO                     -0.3341      0.061     -5.453      0.000        -0.454    -0.214
PERSPENK:PTRATIO             0.0917      0.015      6.321      0.000         0.063     0.120
PCTAF                       -0.1690      0.033     -5.169      0.000        -0.233    -0.105
PERSPENK:PCTAF               0.0490      0.007      6.574      0.000         0.034     0.064
PTRATIO:PCTAF                0.0080      0.001      5.362      0.000         0.005     0.011
PERSPENK:PTRATIO:PCTAF      -0.0022      0.000     -6.445      0.000        -0.003    -0.002
============================================================================================

The number of trials

In [30]:
1
<span class="n">glm_mod</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">orig_endog</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
Out[30]:
0      807
1      184
2      571
3      573
4       65
5     2247
6     1364
7      912
8      525
9     1067
10    3016
11     235
12     556
13     688
14     252
...
288      53
289     266
290     304
291    1338
292    1170
293    1431
294     248
295     516
296     591
297      59
298     342
299     154
300     595
301     709
302     156
Length: 303, dtype: float64
In [31]:
1
<span class="n">glm_mod</span><span class="o">.</span><span class="n">fittedvalues</span> <span class="o">*</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">orig_endog</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
Out[31]:
0      470.732584
1      138.266178
2      285.832629
3      392.702917
4       20.963146
5     1543.545102
6      454.209651
7      598.497867
8      261.720305
9      540.687237
10     722.479333
11     203.583934
12     258.167040
13     303.902616
14     168.330747
...
288     33.470295
289     68.855461
290    174.264199
291    827.377548
292    506.242734
293    958.896993
294    187.988967
295    259.823500
296    379.553974
297     17.656181
298    111.464708
299     61.037884
300    235.517446
301    290.952508
302     53.312851
Length: 303, dtype: float64

First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact on the response variables:

In [32]:
1
<span class="n">exog</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">orig_exog</span> <span class="c"># get the dataframe</span>
In [33]:
1
2
<span class="n">means25</span> <span class="o">=</span> <span class="n">exog</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">means25</span><span class="p">)</span>
Intercept                       1.000000
LOWINC                         41.409877
PERASIAN                        5.896335
PERBLACK                        5.636808
PERHISP                        34.398080
PCTCHRT                         1.175909
PCTYRRND                       11.611905
PERMINTE                       14.694747
AVYRSEXP                       14.253875
PERMINTE:AVYRSEXP             209.018700
AVSALK                         58.640258
PERMINTE:AVSALK               879.979883
AVYRSEXP:AVSALK               839.718173
PERMINTE:AVYRSEXP:AVSALK    12585.266464
PERSPENK                        4.320310
PTRATIO                        22.464250
PERSPENK:PTRATIO               96.295756
PCTAF                          33.630593
PERSPENK:PCTAF                147.235740
PTRATIO:PCTAF                 747.445536
PERSPENK:PTRATIO:PCTAF       3243.607568
dtype: float64

In [34]:
1
2
<span class="n">means25</span><span class="p">[</span><span class="s">'LOWINC'</span><span class="p">]</span> <span class="o">=</span> <span class="n">exog</span><span class="p">[</span><span class="s">'LOWINC'</span><span class="p">]</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">25</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">means25</span><span class="p">)</span>
Intercept                       1.000000
LOWINC                         26.683040
PERASIAN                        5.896335
PERBLACK                        5.636808
PERHISP                        34.398080
PCTCHRT                         1.175909
PCTYRRND                       11.611905
PERMINTE                       14.694747
AVYRSEXP                       14.253875
PERMINTE:AVYRSEXP             209.018700
AVSALK                         58.640258
PERMINTE:AVSALK               879.979883
AVYRSEXP:AVSALK               839.718173
PERMINTE:AVYRSEXP:AVSALK    12585.266464
PERSPENK                        4.320310
PTRATIO                        22.464250
PERSPENK:PTRATIO               96.295756
PCTAF                          33.630593
PERSPENK:PCTAF                147.235740
PTRATIO:PCTAF                 747.445536
PERSPENK:PTRATIO:PCTAF       3243.607568
dtype: float64

In [35]:
1
2
3
<span class="n">means75</span> <span class="o">=</span> <span class="n">exog</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">means75</span><span class="p">[</span><span class="s">'LOWINC'</span><span class="p">]</span> <span class="o">=</span> <span class="n">exog</span><span class="p">[</span><span class="s">'LOWINC'</span><span class="p">]</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">75</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">means75</span><span class="p">)</span>
Intercept                       1.000000
LOWINC                         55.460075
PERASIAN                        5.896335
PERBLACK                        5.636808
PERHISP                        34.398080
PCTCHRT                         1.175909
PCTYRRND                       11.611905
PERMINTE                       14.694747
AVYRSEXP                       14.253875
PERMINTE:AVYRSEXP             209.018700
AVSALK                         58.640258
PERMINTE:AVSALK               879.979883
AVYRSEXP:AVSALK               839.718173
PERMINTE:AVYRSEXP:AVSALK    12585.266464
PERSPENK                        4.320310
PTRATIO                        22.464250
PERSPENK:PTRATIO               96.295756
PCTAF                          33.630593
PERSPENK:PCTAF                147.235740
PTRATIO:PCTAF                 747.445536
PERSPENK:PTRATIO:PCTAF       3243.607568
dtype: float64

In [36]:
1
2
3
<span class="n">resp25</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">means25</span><span class="p">)</span>
<span class="n">resp75</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">means75</span><span class="p">)</span>
<span class="n">diff</span> <span class="o">=</span> <span class="n">resp75</span> <span class="o">-</span> <span class="n">resp25</span>

The interquartile first difference for the percentage of low income households in a school district is:

In [37]:
1
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="si">%2.4f%%</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">diff</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">100</span><span class="p">))</span>
-11.8863%

In [38]:
1
2
3
<span class="n">nobs</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">nobs</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">endog</span>
<span class="n">yhat</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">mu</span>
In [39]:
1
2
3
4
5
6
<span class="kn">from</span> <span class="nn">statsmodels.graphics.api</span> <span class="kn">import</span> <span class="n">abline_plot</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">'Observed Values'</span><span class="p">,</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">'Fitted Values'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">yhat</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">y_vs_yhat</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">OLS</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">sm</span><span class="o">.</span><span class="n">add_constant</span><span class="p">(</span><span class="n">yhat</span><span class="p">,</span> <span class="n">prepend</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">abline_plot</span><span class="p">(</span><span class="n">model_results</span><span class="o">=</span><span class="n">y_vs_yhat</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>

Plot fitted values vs Pearson residuals

Pearson residuals are defined to be

(yμ)(var(μ))

where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$

In [40]:
1
2
3
4
5
6
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Residual Dependence Plot'</span><span class="p">,</span> <span class="n">xlabel</span><span class="o">=</span><span class="s">'Fitted Values'</span><span class="p">,</span>
                          <span class="n">ylabel</span><span class="o">=</span><span class="s">'Pearson Residuals'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">yhat</span><span class="p">,</span> <span class="n">stats</span><span class="o">.</span><span class="n">zscore</span><span class="p">(</span><span class="n">glm_mod</span><span class="o">.</span><span class="n">resid_pearson</span><span class="p">))</span>
<span class="n">ax</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'tight'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">],[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span> <span class="s">'k-'</span><span class="p">);</span>

Histogram of standardized deviance residuals with Kernel Density Estimate overlayed

The definition of the deviance residuals depends on the family. For the Binomial distribution this is

rdev=sign(Yμ)2n(YlogYμ+(1Y)log(1Y)(1μ)

They can be used to detect ill-fitting covariates

In [41]:
1
2
3
4
<span class="n">resid</span> <span class="o">=</span> <span class="n">glm_mod</span><span class="o">.</span><span class="n">resid_deviance</span>
<span class="n">resid_std</span> <span class="o">=</span> <span class="n">stats</span><span class="o">.</span><span class="n">zscore</span><span class="p">(</span><span class="n">resid</span><span class="p">)</span>
<span class="n">kde_resid</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">nonparametric</span><span class="o">.</span><span class="n">KDEUnivariate</span><span class="p">(</span><span class="n">resid_std</span><span class="p">)</span>
<span class="n">kde_resid</span><span class="o">.</span><span class="n">fit</span><span class="p">()</span>
In [42]:
1
2
3
4
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">"Standardized Deviance Residuals"</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">resid_std</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="bp">True</span><span class="p">);</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kde_resid</span><span class="o">.</span><span class="n">support</span><span class="p">,</span> <span class="n">kde_resid</span><span class="o">.</span><span class="n">density</span><span class="p">,</span> <span class="s">'r'</span><span class="p">);</span>

QQ-plot of deviance residuals

In [43]:
1
2
3
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">sm</span><span class="o">.</span><span class="n">graphics</span><span class="o">.</span><span class="n">qqplot</span><span class="p">(</span><span class="n">resid</span><span class="p">,</span> <span class="n">line</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
doc_statsmodels
2025-01-10 15:47:30
Comments
Leave a Comment

Please login to continue.