Generalized Linear Models
1 2 3 4 5 | <span class = "kn" > from < / span> <span class = "nn" >__future__< / span> <span class = "kn" > import < / span> <span class = "n" >print_function< / span> <span class = "kn" > import < / span> <span class = "nn" >numpy< / span> <span class = "kn" >as< / span> <span class = "nn" >np< / span> <span class = "kn" > import < / span> <span class = "nn" >statsmodels.api< / span> <span class = "kn" >as< / span> <span class = "nn" >sm< / span> <span class = "kn" > from < / span> <span class = "nn" >scipy< / span> <span class = "kn" > import < / span> <span class = "n" >stats< / span> <span class = "kn" > from < / span> <span class = "nn" >matplotlib< / span> <span class = "kn" > import < / span> <span class = "n" >pyplot< / span> <span class = "k" >as< / span> <span class = "n" >plt< / span> |
GLM: Binomial response data
Load data
In this example, we use the Star98 dataset which was taken with permission from Jeff Gill (2000) Generalized linear models: A unified approach. Codebook information can be obtained by typing:
1 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >datasets< / span><span class = "o" >.< / span><span class = "n" >star98< / span><span class = "o" >.< / span><span class = "n" >NOTE< / span><span class = "p" >)< / span> |
Load the data and add a constant to the exogenous (independent) variables:
1 2 | <span class = "n" >data< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >datasets< / span><span class = "o" >.< / span><span class = "n" >star98< / span><span class = "o" >.< / span><span class = "n" >load< / span><span class = "p" >()< / span> <span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >add_constant< / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >,< / span> <span class = "n" >prepend< / span><span class = "o" > = < / span><span class = "bp" > False < / span><span class = "p" >)< / span> |
The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW):
1 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >[:< / span><span class = "mi" > 5 < / span><span class = "p" >,:])< / span> |
The independent variables include all the other variables described above, as well as the interaction terms:
1 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >[:< / span><span class = "mi" > 2 < / span><span class = "p" >,:])< / span> |
Fit and summary
1 2 3 | <span class = "n" >glm_binom< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >GLM< / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >,< / span> <span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >,< / span> <span class = "n" >family< / span><span class = "o" > = < / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >families< / span><span class = "o" >.< / span><span class = "n" >Binomial< / span><span class = "p" >())< / span> <span class = "n" >res< / span> <span class = "o" > = < / span> <span class = "n" >glm_binom< / span><span class = "o" >.< / span><span class = "n" >fit< / span><span class = "p" >()< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >summary< / span><span class = "p" >())< / span> |
Quantities of interest
1 2 3 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "s" > 'Total number of trials:' < / span><span class = "p" >,< / span> <span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >[< / span><span class = "mi" > 0 < / span><span class = "p" >]< / span><span class = "o" >.< / span><span class = "n" > sum < / span><span class = "p" >())< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "s" > 'Parameters: ' < / span><span class = "p" >,< / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >params< / span><span class = "p" >)< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "s" > 'T-values: ' < / span><span class = "p" >,< / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >tvalues< / span><span class = "p" >)< / span> |
First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact on the response variables:
1 2 3 4 5 6 7 8 | <span class = "n" >means< / span> <span class = "o" > = < / span> <span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "o" >.< / span><span class = "n" >mean< / span><span class = "p" >(< / span><span class = "n" >axis< / span><span class = "o" > = < / span><span class = "mi" > 0 < / span><span class = "p" >)< / span> <span class = "n" >means25< / span> <span class = "o" > = < / span> <span class = "n" >means< / span><span class = "o" >.< / span><span class = "n" >copy< / span><span class = "p" >()< / span> <span class = "n" >means25< / span><span class = "p" >[< / span><span class = "mi" > 0 < / span><span class = "p" >]< / span> <span class = "o" > = < / span> <span class = "n" >stats< / span><span class = "o" >.< / span><span class = "n" >scoreatpercentile< / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >[:,< / span><span class = "mi" > 0 < / span><span class = "p" >],< / span> <span class = "mi" > 25 < / span><span class = "p" >)< / span> <span class = "n" >means75< / span> <span class = "o" > = < / span> <span class = "n" >means< / span><span class = "o" >.< / span><span class = "n" >copy< / span><span class = "p" >()< / span> <span class = "n" >means75< / span><span class = "p" >[< / span><span class = "mi" > 0 < / span><span class = "p" >]< / span> <span class = "o" > = < / span> <span class = "n" >lowinc_75per< / span> <span class = "o" > = < / span> <span class = "n" >stats< / span><span class = "o" >.< / span><span class = "n" >scoreatpercentile< / span><span class = "p" >(< / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >[:,< / span><span class = "mi" > 0 < / span><span class = "p" >],< / span> <span class = "mi" > 75 < / span><span class = "p" >)< / span> <span class = "n" >resp_25< / span> <span class = "o" > = < / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >predict< / span><span class = "p" >(< / span><span class = "n" >means25< / span><span class = "p" >)< / span> <span class = "n" >resp_75< / span> <span class = "o" > = < / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >predict< / span><span class = "p" >(< / span><span class = "n" >means75< / span><span class = "p" >)< / span> <span class = "n" >diff< / span> <span class = "o" > = < / span> <span class = "n" >resp_75< / span> <span class = "o" > - < / span> <span class = "n" >resp_25< / span> |
The interquartile first difference for the percentage of low income households in a school district is:
1 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "s" > "</span><span class=" si ">%2.4f%%</span><span class=" s ">" < / span> <span class = "o" > % < / span> <span class = "p" >(< / span><span class = "n" >diff< / span><span class = "o" > * < / span><span class = "mi" > 100 < / span><span class = "p" >))< / span> |
Plots
We extract information that will be used to draw some interesting plots:
1 2 3 | <span class = "n" >nobs< / span> <span class = "o" > = < / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >nobs< / span> <span class = "n" >y< / span> <span class = "o" > = < / span> <span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >[:,< / span><span class = "mi" > 0 < / span><span class = "p" >]< / span><span class = "o" > / < / span><span class = "n" >data< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "o" >.< / span><span class = "n" > sum < / span><span class = "p" >(< / span><span class = "mi" > 1 < / span><span class = "p" >)< / span> <span class = "n" >yhat< / span> <span class = "o" > = < / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >mu< / span> |
Plot yhat vs y:
1 | <span class = "kn" > from < / span> <span class = "nn" >statsmodels.graphics.api< / span> <span class = "kn" > import < / span> <span class = "n" >abline_plot< / span> |
1 2 3 4 5 6 7 8 9 | <span class = "n" >fig< / span><span class = "p" >,< / span> <span class = "n" >ax< / span> <span class = "o" > = < / span> <span class = "n" >plt< / span><span class = "o" >.< / span><span class = "n" >subplots< / span><span class = "p" >()< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >scatter< / span><span class = "p" >(< / span><span class = "n" >yhat< / span><span class = "p" >,< / span> <span class = "n" >y< / span><span class = "p" >)< / span> <span class = "n" >line_fit< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >OLS< / span><span class = "p" >(< / span><span class = "n" >y< / span><span class = "p" >,< / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >add_constant< / span><span class = "p" >(< / span><span class = "n" >yhat< / span><span class = "p" >,< / span> <span class = "n" >prepend< / span><span class = "o" > = < / span><span class = "bp" > True < / span><span class = "p" >))< / span><span class = "o" >.< / span><span class = "n" >fit< / span><span class = "p" >()< / span> <span class = "n" >abline_plot< / span><span class = "p" >(< / span><span class = "n" >model_results< / span><span class = "o" > = < / span><span class = "n" >line_fit< / span><span class = "p" >,< / span> <span class = "n" >ax< / span><span class = "o" > = < / span><span class = "n" >ax< / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_title< / span><span class = "p" >(< / span><span class = "s" > 'Model Fit Plot' < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_ylabel< / span><span class = "p" >(< / span><span class = "s" > 'Observed values' < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_xlabel< / span><span class = "p" >(< / span><span class = "s" > 'Fitted values' < / span><span class = "p" >);< / span> |
Plot yhat vs. Pearson residuals:
1 2 3 4 5 6 7 8 | <span class = "n" >fig< / span><span class = "p" >,< / span> <span class = "n" >ax< / span> <span class = "o" > = < / span> <span class = "n" >plt< / span><span class = "o" >.< / span><span class = "n" >subplots< / span><span class = "p" >()< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >scatter< / span><span class = "p" >(< / span><span class = "n" >yhat< / span><span class = "p" >,< / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >resid_pearson< / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >hlines< / span><span class = "p" >(< / span><span class = "mi" > 0 < / span><span class = "p" >,< / span> <span class = "mi" > 0 < / span><span class = "p" >,< / span> <span class = "mi" > 1 < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_xlim< / span><span class = "p" >(< / span><span class = "mi" > 0 < / span><span class = "p" >,< / span> <span class = "mi" > 1 < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_title< / span><span class = "p" >(< / span><span class = "s" > 'Residual Dependence Plot' < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_ylabel< / span><span class = "p" >(< / span><span class = "s" > 'Pearson Residuals' < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_xlabel< / span><span class = "p" >(< / span><span class = "s" > 'Fitted values' < / span><span class = "p" >)< / span> |
Histogram of standardized deviance residuals:
1 2 3 4 5 6 7 8 | <span class = "kn" > from < / span> <span class = "nn" >scipy< / span> <span class = "kn" > import < / span> <span class = "n" >stats< / span> <span class = "n" >fig< / span><span class = "p" >,< / span> <span class = "n" >ax< / span> <span class = "o" > = < / span> <span class = "n" >plt< / span><span class = "o" >.< / span><span class = "n" >subplots< / span><span class = "p" >()< / span> <span class = "n" >resid< / span> <span class = "o" > = < / span> <span class = "n" >res< / span><span class = "o" >.< / span><span class = "n" >resid_deviance< / span><span class = "o" >.< / span><span class = "n" >copy< / span><span class = "p" >()< / span> <span class = "n" >resid_std< / span> <span class = "o" > = < / span> <span class = "n" >stats< / span><span class = "o" >.< / span><span class = "n" >zscore< / span><span class = "p" >(< / span><span class = "n" >resid< / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >hist< / span><span class = "p" >(< / span><span class = "n" >resid_std< / span><span class = "p" >,< / span> <span class = "n" >bins< / span><span class = "o" > = < / span><span class = "mi" > 25 < / span><span class = "p" >)< / span> <span class = "n" >ax< / span><span class = "o" >.< / span><span class = "n" >set_title< / span><span class = "p" >(< / span><span class = "s" > 'Histogram of standardized deviance residuals' < / span><span class = "p" >);< / span> |
QQ Plot of Deviance Residuals:
1 2 | <span class = "kn" > from < / span> <span class = "nn" >statsmodels< / span> <span class = "kn" > import < / span> <span class = "n" >graphics< / span> <span class = "n" >graphics< / span><span class = "o" >.< / span><span class = "n" >gofplots< / span><span class = "o" >.< / span><span class = "n" >qqplot< / span><span class = "p" >(< / span><span class = "n" >resid< / span><span class = "p" >,< / span> <span class = "n" >line< / span><span class = "o" > = < / span><span class = "s" > 'r' < / span><span class = "p" >)< / span> |
GLM: Gamma for proportional count response
Load data
In the example above, we printed the NOTE
attribute to learn about the Star98 dataset. Statsmodels datasets ships with other useful information. For example:
1 | <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >datasets< / span><span class = "o" >.< / span><span class = "n" >scotland< / span><span class = "o" >.< / span><span class = "n" >DESCRLONG< / span><span class = "p" >)< / span> |
Load the data and add a constant to the exogenous variables:
1 2 3 4 | <span class = "n" >data2< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >datasets< / span><span class = "o" >.< / span><span class = "n" >scotland< / span><span class = "o" >.< / span><span class = "n" >load< / span><span class = "p" >()< / span> <span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >exog< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >add_constant< / span><span class = "p" >(< / span><span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >,< / span> <span class = "n" >prepend< / span><span class = "o" > = < / span><span class = "bp" > False < / span><span class = "p" >)< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >[:< / span><span class = "mi" > 5 < / span><span class = "p" >,:])< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >[:< / span><span class = "mi" > 5 < / span><span class = "p" >])< / span> |
Fit and summary
1 2 3 | <span class = "n" >glm_gamma< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >GLM< / span><span class = "p" >(< / span><span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >endog< / span><span class = "p" >,< / span> <span class = "n" >data2< / span><span class = "o" >.< / span><span class = "n" >exog< / span><span class = "p" >,< / span> <span class = "n" >family< / span><span class = "o" > = < / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >families< / span><span class = "o" >.< / span><span class = "n" >Gamma< / span><span class = "p" >())< / span> <span class = "n" >glm_results< / span> <span class = "o" > = < / span> <span class = "n" >glm_gamma< / span><span class = "o" >.< / span><span class = "n" >fit< / span><span class = "p" >()< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >glm_results< / span><span class = "o" >.< / span><span class = "n" >summary< / span><span class = "p" >())< / span> |
GLM: Gaussian distribution with a noncanonical link
Artificial data
1 2 3 4 5 6 | <span class = "n" >nobs2< / span> <span class = "o" > = < / span> <span class = "mi" > 100 < / span> <span class = "n" >x< / span> <span class = "o" > = < / span> <span class = "n" >np< / span><span class = "o" >.< / span><span class = "n" >arange< / span><span class = "p" >(< / span><span class = "n" >nobs2< / span><span class = "p" >)< / span> <span class = "n" >np< / span><span class = "o" >.< / span><span class = "n" >random< / span><span class = "o" >.< / span><span class = "n" >seed< / span><span class = "p" >(< / span><span class = "mi" > 54321 < / span><span class = "p" >)< / span> <span class = "n" >X< / span> <span class = "o" > = < / span> <span class = "n" >np< / span><span class = "o" >.< / span><span class = "n" >column_stack< / span><span class = "p" >((< / span><span class = "n" >x< / span><span class = "p" >,< / span><span class = "n" >x< / span><span class = "o" > * * < / span><span class = "mi" > 2 < / span><span class = "p" >))< / span> <span class = "n" >X< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >add_constant< / span><span class = "p" >(< / span><span class = "n" >X< / span><span class = "p" >,< / span> <span class = "n" >prepend< / span><span class = "o" > = < / span><span class = "bp" > False < / span><span class = "p" >)< / span> <span class = "n" >lny< / span> <span class = "o" > = < / span> <span class = "n" >np< / span><span class = "o" >.< / span><span class = "n" >exp< / span><span class = "p" >(< / span><span class = "o" > - < / span><span class = "p" >(< / span><span class = "o" >.< / span><span class = "mo" > 03 < / span><span class = "o" > * < / span><span class = "n" >x< / span> <span class = "o" > + < / span> <span class = "o" >.< / span><span class = "mo" > 0001 < / span><span class = "o" > * < / span><span class = "n" >x< / span><span class = "o" > * * < / span><span class = "mi" > 2 < / span> <span class = "o" > - < / span> <span class = "mf" > 1.0 < / span><span class = "p" >))< / span> <span class = "o" > + < / span> <span class = "o" >.< / span><span class = "mo" > 001 < / span> <span class = "o" > * < / span> <span class = "n" >np< / span><span class = "o" >.< / span><span class = "n" >random< / span><span class = "o" >.< / span><span class = "n" >rand< / span><span class = "p" >(< / span><span class = "n" >nobs2< / span><span class = "p" >)< / span> |
Fit and summary
1 2 3 | <span class = "n" >gauss_log< / span> <span class = "o" > = < / span> <span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >GLM< / span><span class = "p" >(< / span><span class = "n" >lny< / span><span class = "p" >,< / span> <span class = "n" >X< / span><span class = "p" >,< / span> <span class = "n" >family< / span><span class = "o" > = < / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >families< / span><span class = "o" >.< / span><span class = "n" >Gaussian< / span><span class = "p" >(< / span><span class = "n" >sm< / span><span class = "o" >.< / span><span class = "n" >families< / span><span class = "o" >.< / span><span class = "n" >links< / span><span class = "o" >.< / span><span class = "n" >log< / span><span class = "p" >))< / span> <span class = "n" >gauss_log_results< / span> <span class = "o" > = < / span> <span class = "n" >gauss_log< / span><span class = "o" >.< / span><span class = "n" >fit< / span><span class = "p" >()< / span> <span class = "k" > print < / span><span class = "p" >(< / span><span class = "n" >gauss_log_results< / span><span class = "o" >.< / span><span class = "n" >summary< / span><span class = "p" >())< / span> |
Please login to continue.