Generalized Linear Models (Formula)
This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.
To begin, we load the Star98 dataset, construct a formula, and pre-process the data:
In [1]:
from __future__ import print_function
import statsmodels.api as sm
import statsmodels.formula.api as smf

star98 = sm.datasets.star98.load_pandas().data
formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE * AVYRSEXP * AVSALK + PERSPENK * PTRATIO * PCTAF'
# Copy the selected columns so the in-place edits below do not touch star98.
dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
              'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
              'PERSPENK', 'PTRATIO', 'PCTAF']].copy()
# SUCCESS = NABOVE / (NABOVE + NBELOW); pop() also removes NBELOW from dta.
endog = dta['NABOVE'] / (dta['NABOVE'] + dta.pop('NBELOW'))
del dta['NABOVE']
dta['SUCCESS'] = endog
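The `*` operator in the formula is shorthand in R-style formula syntax: `a * b` stands for `a + b + a:b`, so each three-way product above contributes its main effects plus every two-way and three-way interaction. The sketch below enumerates those terms in plain Python. `expand_star` is a hypothetical helper written for illustration and is not part of statsmodels or its formula machinery.

```python
from itertools import combinations


def expand_star(variables):
    """List the terms implied by `a * b * c` in an R-style formula.

    Hypothetical helper for illustration only: it mimics the expansion
    rule (main effects plus all interactions), not any library's parser.
    """
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms


terms = expand_star(["PERMINTE", "AVYRSEXP", "AVSALK"])
print(terms)
```

Each product of three variables therefore adds seven columns to the design matrix, which is why the fitted model has many more coefficients than the formula has variable names.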
Then, we fit the GLM with a binomial family:
In [2]:
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod1.summary()
Out[2]:
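For reference, the model being fitted can be written out explicitly. Assuming the binomial family is used with its default logit link, and writing $\mu_i$ for the expected value of SUCCESS for observation $i$, $x_i$ for the row of the design matrix built from the formula, and $\beta$ for the coefficient vector (these symbols are introduced here and do not appear in the notebook), the relationship is:

```latex
\log\frac{\mu_i}{1-\mu_i} = x_i^\top \beta,
\qquad
\mu_i = \mathbb{E}[\mathrm{SUCCESS}_i] = \frac{1}{1 + e^{-x_i^\top \beta}}
```

The coefficients reported in the summary are therefore on the log-odds scale, not on the scale of the proportion itself.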
Finally, we define a function to apply a custom data transformation within the formula framework:
In [3]:
def double_it(x):
    return 2 * x

formula = 'SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE * AVYRSEXP * AVSALK + PERSPENK * PTRATIO * PCTAF'
mod2 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
mod2.summary()
Out[3]:
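Conceptually, the formula framework can call `double_it` because each term is evaluated as a Python expression in which both column names and names from the calling scope are visible. The sketch below imitates that lookup with `eval` and plain dictionaries. It is a rough analogy under that assumption, not the library's actual implementation, and `data` and `namespace` are hypothetical stand-ins.

```python
def double_it(x):
    # Element-wise doubling for a plain list (stand-in for a pandas column).
    return [2 * value for value in x]


# Hypothetical stand-ins: a dict playing the role of the DataFrame, and a
# namespace dict playing the role of the caller's environment.
data = {"LOWINC": [10.0, 25.5, 40.0]}
namespace = {"double_it": double_it}

# Evaluate the term text as a Python expression: column names resolve
# against the data, everything else against the namespace.
term = "double_it(LOWINC)"
column = eval(term, namespace, data)
print(column)  # [20.0, 51.0, 80.0]
```

Because the term text is kept as the column label, the transformed variable shows up in the summary as `double_it(LOWINC)` and not as `LOWINC`.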
As expected, the coefficient for double_it(LOWINC) in the second model is half the size of the LOWINC coefficient from the first model:
In [4]:
# Position 1 is the first term after the intercept; use .iloc for positional access.
print(mod1.params.iloc[1])
print(mod2.params.iloc[1] * 2)
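The halving follows from a scaling argument that holds for any linear predictor: if a regressor is multiplied by 2, the fitted values stay the same only when its coefficient is divided by 2. A minimal sketch of the same effect, using ordinary least squares in place of a GLM, made-up numbers, and a hypothetical `ols_slope` helper:

```python
def ols_slope(x, y):
    """Slope of a simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Made-up numbers, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

slope_original = ols_slope(x, y)
slope_doubled = ols_slope([2 * xi for xi in x], y)

# The two printed values agree: doubling x halves the slope.
print(slope_original)
print(slope_doubled * 2)
```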