Prediction (out of sample)

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     808.7
Date:                Tue, 02 Dec 2014   Prob (F-statistic):           8.70e-40
Time:                        12:53:22   Log-Likelihood:                -3.7832
No. Observations:                  50   AIC:                             15.57
Df Residuals:                      46   BIC:                             23.21
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.8950      0.093     52.780      0.000         4.708     5.082
x1             0.5222      0.014     36.510      0.000         0.493     0.551
x2             0.3781      0.056      6.725      0.000         0.265     0.491
x3            -0.0216      0.001    -17.196      0.000        -0.024    -0.019
==============================================================================
Omnibus:                        8.517   Durbin-Watson:                   1.657
Prob(Omnibus):                  0.014   Jarque-Bera (JB):                7.697
Skew:                          -0.900   Prob(JB):                       0.0213
Kurtosis:                       3.677   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.3551   4.8029   5.2188   5.5823   5.8801   6.1088   6.2748   6.394
   6.4886   6.5836   6.703    6.8653   7.081    7.3506   7.6644   8.0046
   8.3474   8.6679   8.9433   9.1572   9.3021   9.3806   9.4052   9.3957
   9.3768   9.3732   9.4061   9.4894   9.6276   9.8149  10.0365  10.2706
  10.492   10.6768  10.8056  10.8672  10.8601  10.7931  10.6835  10.5545
  10.4313  10.3369  10.2884  10.2944  10.3529  10.4526  10.5739  10.6925
  10.7836  10.8255]

Create a new sample of explanatory variables, Xnew, then predict and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.789   10.6494  10.4216  10.1393   9.847    9.589    9.3986   9.2898
   9.254    9.2621]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we don't want any formula expansion magic applied to **2.
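To see what that expansion magic would do, compare the design matrices built with and without I(). In a patsy formula, x1**2 is term expansion, the interaction of x1 with itself, and a term is a set of factors, so it collapses back to x1 and the squared column silently disappears (a sketch of that behavior, assuming the default patsy formula engine):

```python
import numpy as np
from statsmodels.formula.api import ols

np.random.seed(3)  # assumed seed; the data here are only for illustration
x1 = np.linspace(0, 20, 50)
data = {"x1": x1, "y": x1 + np.random.normal(size=50)}

with_I = ols("y ~ x1 + I(x1**2)", data=data).fit()
without_I = ols("y ~ x1 + x1**2", data=data).fit()

# with I():  Intercept, x1, I(x1 ** 2)      -> 3 columns
# without:   x1**2 expands to x1:x1 == x1   -> only Intercept and x1 survive
print(with_I.model.exog.shape)
print(without_I.model.exog.shape)
```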

In [8]:
res.params
Out[8]:
Intercept           4.895017
x1                  0.522221
np.sin(x1)          0.378110
I((x1 - 5) ** 2)   -0.021596
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 10.789 ,  10.6494,  10.4216,  10.1393,   9.847 ,   9.589 ,
         9.3986,   9.2898,   9.254 ,   9.2621])