Generalized Linear Models (Formula)

This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.

To begin, we load the Star98 dataset, construct a formula, and pre-process the data:

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the Star98 data as a pandas DataFrame
star98 = sm.datasets.star98.load_pandas().data
formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
# Copy the selected columns so the edits below do not modify star98
dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
              'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
              'PERSPENK', 'PTRATIO', 'PCTAF']].copy()
# SUCCESS is the proportion NABOVE / (NABOVE + NBELOW)
endog = dta['NABOVE'] / (dta['NABOVE'] + dta.pop('NBELOW'))
del dta['NABOVE']
dta['SUCCESS'] = endog
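The `*` operator in the formula stands for main effects plus all of their interactions. The following is a rough sketch of that expansion in plain Python, not the formula parser's own code; `expand_star` is a hypothetical helper introduced only for illustration, and the term order it produces may differ from the order shown in the model summary.

```python
from itertools import combinations

def expand_star(*names):
    """Return the main effects and all interactions implied by a*b*c...

    Interactions are written 'A:B', mirroring the row labels in the summary.
    """
    terms = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            terms.append(":".join(combo))
    return terms

# Terms that enter the formula on their own
main_effects = ["LOWINC", "PERASIAN", "PERBLACK", "PERHISP", "PCTCHRT", "PCTYRRND"]
# The two three-way products from the formula
teacher_terms = expand_star("PERMINTE", "AVYRSEXP", "AVSALK")
spending_terms = expand_star("PERSPENK", "PTRATIO", "PCTAF")

all_terms = main_effects + teacher_terms + spending_terms
print(teacher_terms)
print(len(all_terms))  # 20 terms, excluding the intercept
```

Each three-way product contributes seven terms, so the formula yields 6 + 7 + 7 = 20 terms besides the intercept.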

Then, we fit the GLM:

In [2]:
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod1.summary()
Out[2]:
Generalized Linear Model Regression Results
Dep. Variable:   SUCCESS            No. Observations:   303
Model:           GLM                Df Residuals:       282
Model Family:    Binomial           Df Model:           20
Link Function:   logit              Scale:              1.0
Method:          IRLS               Log-Likelihood:     -189.70
Date:            Tue, 02 Dec 2014   Deviance:           380.66
Time:            12:53:02           Pearson chi2:       8.48
No. Iterations:  7

                                coef   std err         z     P>|z|  [95.0% Conf. Int.]
Intercept                     0.4037    25.036     0.016     0.987   -48.665    49.472
LOWINC                       -0.0204     0.010    -1.982     0.048    -0.041    -0.000
PERASIAN                      0.0159     0.017     0.910     0.363    -0.018     0.050
PERBLACK                     -0.0198     0.020    -1.004     0.316    -0.058     0.019
PERHISP                      -0.0096     0.010    -0.951     0.341    -0.029     0.010
PCTCHRT                      -0.0022     0.022    -0.103     0.918    -0.045     0.040
PCTYRRND                     -0.0022     0.006    -0.348     0.728    -0.014     0.010
PERMINTE                      0.1068     0.787     0.136     0.892    -1.436     1.650
AVYRSEXP                     -0.0411     1.176    -0.035     0.972    -2.346     2.264
PERMINTE:AVYRSEXP            -0.0031     0.054    -0.057     0.954    -0.108     0.102
AVSALK                        0.0131     0.295     0.044     0.965    -0.566     0.592
PERMINTE:AVSALK              -0.0019     0.013    -0.145     0.885    -0.028     0.024
AVYRSEXP:AVSALK               0.0008     0.020     0.038     0.970    -0.039     0.041
PERMINTE:AVYRSEXP:AVSALK   5.978e-05     0.001     0.068     0.946    -0.002     0.002
PERSPENK                     -0.3097     4.233    -0.073     0.942    -8.606     7.987
PTRATIO                       0.0096     0.919     0.010     0.992    -1.792     1.811
PERSPENK:PTRATIO              0.0066     0.206     0.032     0.974    -0.397     0.410
PCTAF                        -0.0143     0.474    -0.030     0.976    -0.944     0.916
PERSPENK:PCTAF                0.0105     0.098     0.107     0.915    -0.182     0.203
PTRATIO:PCTAF                -0.0001     0.022    -0.005     0.996    -0.044     0.044
PERSPENK:PTRATIO:PCTAF       -0.0002     0.005    -0.051     0.959    -0.010     0.009
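The summary reports a logit link fitted by IRLS (iteratively reweighted least squares). To make those two labels concrete, here is a minimal sketch of IRLS for a one-predictor logit model on made-up proportion data. It is not the statsmodels implementation; the function `fit_logit_irls` and the toy values of `x` and `y` are assumptions introduced for illustration. With the logit link, each IRLS step coincides with a Newton step on the binomial log-likelihood.

```python
import math

def fit_logit_irls(x, y, n_iter=25, tol=1e-10):
    """Fit y ~ intercept + x with a logit link by IRLS.

    x: list of floats; y: list of proportions strictly between 0 and 1.
    Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate X'WX (2x2, symmetric) and the score X'(y - mu)
        s00 = s01 = s11 = g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # inverse logit
            w = mu * (1.0 - mu)                            # binomial variance weight
            s00 += w
            s01 += w * xi
            s11 += w * xi * xi
            g0 += yi - mu
            g1 += (yi - mu) * xi
        # Solve the 2x2 system (X'WX) d = score in closed form
        det = s00 * s11 - s01 * s01
        d0 = (s11 * g0 - s01 * g1) / det
        d1 = (s00 * g1 - s01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data (hypothetical): success proportions that fall as x rises
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

b0, b1 = fit_logit_irls(x, y)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
print(b0, b1)
```

At convergence the residuals `y - fitted` sum to approximately zero, which is the score equation for the intercept.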

Finally, we define a function that applies a custom data transformation inside the formula:

In [3]:
# Any function visible in the calling namespace can be used inside the formula
def double_it(x):
    return 2 * x

formula = 'SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
mod2 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod2.summary()
Out[3]:
Generalized Linear Model Regression Results
Dep. Variable:   SUCCESS            No. Observations:   303
Model:           GLM                Df Residuals:       282
Model Family:    Binomial           Df Model:           20
Link Function:   logit              Scale:              1.0
Method:          IRLS               Log-Likelihood:     -189.70
Date:            Tue, 02 Dec 2014   Deviance:           380.66
Time:            12:53:02           Pearson chi2:       8.48
No. Iterations:  7

                                coef   std err         z     P>|z|  [95.0% Conf. Int.]
Intercept                     0.4037    25.036     0.016     0.987   -48.665    49.472
double_it(LOWINC)            -0.0102     0.005    -1.982     0.048    -0.020    -0.000
PERASIAN                      0.0159     0.017     0.910     0.363    -0.018     0.050
PERBLACK                     -0.0198     0.020    -1.004     0.316    -0.058     0.019
PERHISP                      -0.0096     0.010    -0.951     0.341    -0.029     0.010
PCTCHRT                      -0.0022     0.022    -0.103     0.918    -0.045     0.040
PCTYRRND                     -0.0022     0.006    -0.348     0.728    -0.014     0.010
PERMINTE                      0.1068     0.787     0.136     0.892    -1.436     1.650
AVYRSEXP                     -0.0411     1.176    -0.035     0.972    -2.346     2.264
PERMINTE:AVYRSEXP            -0.0031     0.054    -0.057     0.954    -0.108     0.102
AVSALK                        0.0131     0.295     0.044     0.965    -0.566     0.592
PERMINTE:AVSALK              -0.0019     0.013    -0.145     0.885    -0.028     0.024
AVYRSEXP:AVSALK               0.0008     0.020     0.038     0.970    -0.039     0.041
PERMINTE:AVYRSEXP:AVSALK   5.978e-05     0.001     0.068     0.946    -0.002     0.002
PERSPENK                     -0.3097     4.233    -0.073     0.942    -8.606     7.987
PTRATIO                       0.0096     0.919     0.010     0.992    -1.792     1.811
PERSPENK:PTRATIO              0.0066     0.206     0.032     0.974    -0.397     0.410
PCTAF                        -0.0143     0.474    -0.030     0.976    -0.944     0.916
PERSPENK:PCTAF                0.0105     0.098     0.107     0.915    -0.182     0.203
PTRATIO:PCTAF                -0.0001     0.022    -0.005     0.996    -0.044     0.044
PERSPENK:PTRATIO:PCTAF       -0.0002     0.005    -0.051     0.959    -0.010     0.009

As expected, the coefficient for double_it(LOWINC) in the second model is half the LOWINC coefficient from the first model, so multiplying it by 2 recovers the original value:

In [4]:
# Position 0 is the intercept; position 1 is the LOWINC term in both models
print(mod1.params.iloc[1])
print(mod2.params.iloc[1] * 2)
-0.0203959871548
-0.0203959871548
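The factor of two can be read off the linear predictor. A short derivation follows, where the symbols \eta, \beta_0, and \beta_1 are notation introduced here rather than taken from the notebook:

```latex
\begin{aligned}
\eta &= \beta_0 + \beta_1\,\mathrm{LOWINC} + \dots \\
     &= \beta_0 + \frac{\beta_1}{2}\,\bigl(2\,\mathrm{LOWINC}\bigr) + \dots
\end{aligned}
```

Both forms give the same fitted values, so the model with the doubled regressor reaches the same likelihood with a coefficient of \beta_1 / 2. The standard error halves as well (0.010 versus 0.005 in the two summaries), which leaves the z statistic and p-value unchanged.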
