Formulas: Fitting models using R-style formulas
Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs.
Loading modules and functions
    from __future__ import print_function
    import numpy as np
    import statsmodels.api as sm
Import convention
You can import explicitly from statsmodels.formula.api:

    from statsmodels.formula.api import ols
Alternatively, you can just use the formula namespace of the main statsmodels.api:

    sm.formula.ols
Or you can use the following convention:

    import statsmodels.formula.api as smf
These names are just a convenient way to get access to each model's from_formula classmethod. See, for instance:

    sm.OLS.from_formula
All of the lower case models accept formula and data arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. data takes a pandas data frame or any other data structure that defines a __getitem__ for variable names, like a structured array or a dictionary of variables.
dir(sm.formula) will print a list of available models.
Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)
OLS regression using formulas
To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:
    dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

    df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
    df.head()
Fit the model:
    mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
    res = mod.fit()
    print(res.summary())
Categorical variables
Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy's default is also to include an intercept, so we automatically dropped one of the Region categories.
If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:

    res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
    print(res.params)
Patsy's more advanced features for categorical variables are discussed in: Patsy: Contrast Coding Systems for categorical variables
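To see what C() changes, it can help to look at the design matrix directly. The following is a small sketch using patsy on its own; the data frame `toy` and its integer column `zone` are hypothetical and stand in for an integer-coded category.

```python
import pandas as pd
import patsy

# Hypothetical toy data: `zone` is an integer code, not a measurement.
toy = pd.DataFrame({"zone": [1, 2, 3, 1, 2, 3]})

# Without C(), an integer column enters as a single numeric regressor.
numeric = patsy.dmatrix("zone", toy, return_type="dataframe")
print(list(numeric.columns))      # ['Intercept', 'zone']

# With C(), every level except the reference level gets an indicator column.
categorical = patsy.dmatrix("C(zone)", toy, return_type="dataframe")
print(list(categorical.columns))  # ['Intercept', 'C(zone)[T.2]', 'C(zone)[T.3]']
```

The level 1 has no column of its own because the intercept already absorbs it, which is the same dropping of one category described above for Region.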
Operators
We have already seen that "~" separates the left-hand side of the model from the right-hand side, and that "+" adds new columns to the design matrix.
Removing variables
The "-" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:
    res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
    print(res.params)
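Removing the intercept also changes how a categorical variable is coded. The sketch below, which uses patsy directly on a hypothetical string column `region`, compares the column names of the design matrix with and without the intercept.

```python
import pandas as pd
import patsy

# Hypothetical toy data with a string-valued (categorical) column.
toy = pd.DataFrame({"region": ["N", "S", "E", "N", "S", "E"]})

# Default: intercept plus indicators for all levels but the reference one.
with_intercept = patsy.dmatrix("region", toy, return_type="dataframe")
print(list(with_intercept.columns))  # ['Intercept', 'region[T.N]', 'region[T.S]']

# "- 1" drops the intercept, so every level gets its own indicator column.
no_intercept = patsy.dmatrix("region - 1", toy, return_type="dataframe")
print(list(no_intercept.columns))    # ['region[E]', 'region[N]', 'region[S]']
```

Both matrices have three columns and span the same space; only the parameterization differs.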
Multiplicative interactions
":" adds a new column to the design matrix with the interaction of the other two columns. "*" will also include the individual columns that were multiplied together:
    res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
    res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
    print(res1.params, '\n')
    print(res2.params)
Many other things are possible with operators. Please consult the patsy docs to learn more.
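The difference between ":" and "*" is easiest to see in the columns they generate. This sketch builds both design matrices with patsy for two hypothetical numeric columns `a` and `b`.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# ":" produces only the product column.
only_product = patsy.dmatrix("a:b - 1", toy, return_type="dataframe")
print(list(only_product.columns))  # ['a:b']

# "*" is shorthand for a + b + a:b.
full = patsy.dmatrix("a*b - 1", toy, return_type="dataframe")
print(list(full.columns))          # ['a', 'b', 'a:b']

# The interaction column is the elementwise product of the two inputs.
print(np.allclose(full["a:b"], toy["a"] * toy["b"]))
```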
Functions
You can apply vectorized functions to the variables in your model:
    res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
    print(res.params)
Define a custom function:
    def log_plus_1(x):
        return np.log(x) + 1.

    res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
    print(res.params)
Any function that is in the calling namespace is available to the formula.
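To check what such a function contributes to the model, one can inspect the transformed columns before fitting. The sketch below uses patsy directly; the data frame `toy` and its column `x` are hypothetical, chosen so that the logarithms come out as round numbers.

```python
import numpy as np
import pandas as pd
import patsy

def log_plus_1(x):
    # Custom transformation, looked up in the namespace that calls patsy.
    return np.log(x) + 1.0

# Hypothetical toy data: 1, e, e^2, so the logs are 0, 1, 2.
toy = pd.DataFrame({"x": [1.0, np.e, np.e ** 2]})

X = patsy.dmatrix("np.log(x) + log_plus_1(x) - 1", toy, return_type="dataframe")
print(list(X.columns))  # ['np.log(x)', 'log_plus_1(x)']
print(X.to_numpy())
```

Each function call becomes its own column, named after the expression as written in the formula.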
Using formulas with models that do not (yet) support them
Even if a given statsmodels function does not support formulas, you can still use patsy's formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments.
To generate numpy arrays:

    import patsy
    f = 'Lottery ~ Literacy * Wealth'
    # return_type='matrix' yields numpy-based design matrices
    y, X = patsy.dmatrices(f, df, return_type='matrix')
    print(y[:5])
    print(X[:5])
To generate pandas data frames:
    f = 'Lottery ~ Literacy * Wealth'
    y, X = patsy.dmatrices(f, df, return_type='dataframe')
    print(y[:5])
    print(X[:5])

    print(sm.OLS(y, X).fit().summary())
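The same matrices can be passed to any routine that expects plain arrays. As an illustrative sketch, the snippet below hands them to numpy's least-squares solver, which knows nothing about formulas. The data frame `toy` is synthetic and noise-free, with coefficients chosen by hand so the recovered values can be checked; none of it comes from the Guerry data.

```python
import numpy as np
import pandas as pd
import patsy

# Noise-free hypothetical data, so the coefficients are recovered up to rounding.
toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [1.0, 0.0, 2.0, 1.0, 3.0, 0.5]})
toy["y"] = 2.0 + 0.5 * toy["a"] - 1.0 * toy["b"] + 0.25 * toy["a"] * toy["b"]

# Build endog (y) and exog (X) from the formula as numpy-based matrices.
y, X = patsy.dmatrices("y ~ a * b", toy, return_type="matrix")
names = X.design_info.column_names

# Fit with a solver that only accepts arrays.
coef, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y).ravel(), rcond=None)
for name, value in zip(names, coef):
    print(name, round(float(value), 6))
```

The column names stored in `design_info` make it possible to label the coefficients even though the solver itself only sees arrays.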