
Sunday, March 1, 2015

Regression models from Pandas and "Statistical Data Modeling" (part 4-2)

Here are the examples accompanying the videos from the previous post, starting with the topic of regression and including nonlinear regression and bootstrapping.
A general, primary goal of many statistical data analysis tasks is to relate the influence of one variable on another. For example, we may wish to know how different medical interventions influence the incidence or duration of disease, or perhaps how a baseball player's performance varies as a function of age.

In [36]:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2.2, 4.3, 5.1, 5.8, 6.4, 8.0])
y = np.array([0.4, 10.1, 14.0, 10.9, 15.4, 18.5])
plt.plot(x, y, 'ro')
Out[36]:
[<matplotlib.lines.Line2D at 0x1076a07d0>]

We can build a model to characterize the relationship between $X$ and $Y$, recognizing that factors other than $X$ (the variable we have measured and are interested in) may also influence the response variable $Y$.


$y_i = f(x_i) + \epsilon_i$

where $f$ is some function, for example a linear function:


$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

and $\epsilon_i$ accounts for the difference between the observed response $y_i$ and its prediction from the model $\hat{y_i} = \beta_0 + \beta_1 x_i$. This is sometimes referred to as process uncertainty.
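
Read generatively, the model says that each observation is a deterministic part plus process noise. As a small illustration (with purely hypothetical values for $\beta_0$, $\beta_1$ and the noise scale, not estimates from any data), data of exactly this form can be simulated:

import numpy as np

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = -4.0, 3.0, 2.0

rng = np.random.RandomState(1)
x_sim = np.linspace(2, 8, 6)
# y_i = beta0 + beta1*x_i + epsilon_i, with epsilon_i ~ Normal(0, sigma)
y_sim = beta0 + beta1*x_sim + rng.normal(scale=sigma, size=len(x_sim))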

We would like to select $\beta_0, \beta_1$ so that the difference between the predictions and the observations is zero, but this is not usually possible. Instead, we choose a reasonable criterion: the smallest sum of the squared differences between $\hat{y}$ and $y$.


$$\text{RSS} = \sum_i (y_i - [\beta_0 + \beta_1 x_i])^2 = \sum_i \epsilon_i^2$$

Squaring serves two purposes: (1) to prevent positive and negative values from cancelling each other out and (2) to strongly penalize large deviations. Whether the latter is a good thing or not depends on the goals of the analysis.

In other words, we will select the parameters that minimize the squared error of the model.

In [37]:
ss = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x) ** 2)
In [38]:
ss([0,1],x,y)
Out[38]:
333.35000000000002
In [39]:
from scipy.optimize import fmin

b0,b1 = fmin(ss, [0,1], args=(x,y))
b0,b1
Optimization terminated successfully.
         Current function value: 21.375000
         Iterations: 79
         Function evaluations: 153

Out[39]:
(-4.3500136038870876, 3.0000002915386412)
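
As a quick cross-check (not part of the original notebook), the same least-squares estimates can be obtained in closed form, for example with np.linalg.lstsq applied to the x and y arrays defined above; the result should agree with the fmin estimates (about -4.35 and 3.00).

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
coeffs, resid, rank, sv = np.linalg.lstsq(X, y)
coeffs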
In [40]:
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
Out[40]:
[<matplotlib.lines.Line2D at 0x107671cd0>]
In [41]:
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
for xi, yi in zip(x,y):
    plt.plot([xi]*2, [yi, b0+b1*xi], 'k:')
plt.xlim(2, 9); plt.ylim(0, 20)
Out[41]:
(0, 20)

Minimizing the sum of squares is not the only criterion we can use; it is just a very popular (and successful) one. For example, we can try to minimize the sum of absolute differences:

In [42]:
sabs = lambda theta, x, y: np.sum(np.abs(y - theta[0] - theta[1]*x))
b0,b1 = fmin(sabs, [0,1], args=(x,y))
print b0,b1
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
Optimization terminated successfully.
         Current function value: 10.162463
         Iterations: 39
         Function evaluations: 77
0.00157170444494 2.31231743181

Out[42]:
[<matplotlib.lines.Line2D at 0x1077cd890>]
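
One practical difference between the two criteria, not covered in the notebook itself: because squaring penalizes large deviations so heavily, the least-squares fit is far more sensitive to outliers than the least-absolute-deviations fit. A quick sketch, reusing the ss and sabs functions above and adding a single hypothetical outlying point:

x_out = np.append(x, 9.0)
y_out = np.append(y, 2.0)   # an artificially low response, for illustration
b_sq = fmin(ss, [0,1], args=(x_out, y_out))     # pulled strongly toward the outlier
b_abs = fmin(sabs, [0,1], args=(x_out, y_out))  # changes comparatively little
b_sq, b_abs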

We are not restricted to a straight-line regression model; we can represent a curved relationship between our variables by introducing polynomial terms. For example, a quadratic model:


$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$
In [43]:
ss2 = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2)) ** 2)
b0,b1,b2 = fmin(ss2, [1,1,-1], args=(x,y))
print b0,b1,b2
plt.plot(x, y, 'ro')
xvals = np.linspace(0, 10, 100)
plt.plot(xvals, b0 + b1*xvals + b2*(xvals**2))
Optimization terminated successfully.
         Current function value: 14.001110
         Iterations: 198
         Function evaluations: 372
-11.0748186039 6.0576975948 -0.302681057088

Out[43]:
[<matplotlib.lines.Line2D at 0x10772f4d0>]

Although a polynomial model characterizes a nonlinear relationship, it is a linear problem in terms of estimation; that is, the regression model $f(y | x)$ is linear in the parameters.

For some data, it may be reasonable to consider polynomials of order > 2. For example, consider the relationship between the number of home runs a baseball player hits and the number of runs batted in (RBI) they accumulate: the relationship is clearly positive, but we may not expect it to be linear.

In [44]:
import pandas as pd

ss3 = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2) 
                                  - theta[3]*(x**3)) ** 2)

bb = pd.read_csv("data/baseball.csv", index_col=0)
plt.plot(bb.hr, bb.rbi, 'r.')
b0,b1,b2,b3 = fmin(ss3, [0,1,-1,0], args=(bb.hr, bb.rbi))
xvals = np.arange(40)
plt.plot(xvals, b0 + b1*xvals + b2*(xvals**2) + b3*(xvals**3))
Optimization terminated successfully.
         Current function value: 4274.128398
         Iterations: 230
         Function evaluations: 407

Out[44]:
[<matplotlib.lines.Line2D at 0x107865190>]

Of course, we need not fit least squares models by hand. The statsmodels package implements least squares models that allow for model fitting in a single line:

In [45]:
import statsmodels.api as sm

straight_line = sm.OLS(y, sm.add_constant(x)).fit()
straight_line.summary()
/usr/local/lib/python2.7/site-packages/statsmodels/stats/stattools.py:72: UserWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n))

Out[45]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.864
Method:                 Least Squares   F-statistic:                     32.67
Date:                Wed, 18 Jun 2014   Prob (F-statistic):            0.00463
Time:                        20:56:03   Log-Likelihood:                -12.325
No. Observations:                   6   AIC:                             28.65
Df Residuals:                       4   BIC:                             28.23
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -4.3500      2.937     -1.481      0.213       -12.505    3.805
x1             3.0000      0.525      5.716      0.005         1.543    4.457
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.387
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.570
Skew:                           0.359   Prob(JB):                        0.752
Kurtosis:                       1.671   Cond. No.                         17.9
==============================================================================
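
As an aside (not shown in the notebook), the quantities in this summary are also available programmatically as attributes of the fitted results object:

print(straight_line.params)     # approximately [-4.35, 3.0]
print(straight_line.rsquared)   # approximately 0.891
print(straight_line.aic)        # approximately 28.65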
In [46]:
from statsmodels.formula.api import ols as OLS

data = pd.DataFrame(dict(x=x, y=y))
quadratic_fit = OLS('y ~ x + I(x**2)', data).fit()

quadratic_fit.summary()
Out[46]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.881
Method:                 Least Squares   F-statistic:                     19.50
Date:                Wed, 18 Jun 2014   Prob (F-statistic):             0.0191
Time:                        20:56:03   Log-Likelihood:                -11.056
No. Observations:                   6   AIC:                             28.11
Df Residuals:                       3   BIC:                             27.49
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -11.0748      6.013     -1.842      0.163       -30.211    8.062
x              6.0577      2.482      2.441      0.092        -1.840   13.955
I(x ** 2)     -0.3027      0.241     -1.257      0.298        -1.069    0.464
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.711
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.655
Skew:                          -0.809   Prob(JB):                        0.721
Kurtosis:                       2.961   Cond. No.                         270.
==============================================================================

Exercise: Polynomial function

Write a function that specifies a polynomial of arbitrary degree.

In [47]:
# Write your answer here
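
One possible answer, sketched below; the calc_poly function in the next section takes essentially the same approach.

def poly(params, x):
    """Evaluate sum_i params[i] * x**i for scalar or array x."""
    return sum(p * x**i for i, p in enumerate(params))

# e.g. the quadratic fitted above, evaluated at the observed x values
poly([-11.07, 6.06, -0.30], np.array([2.2, 4.3, 5.1, 5.8, 6.4, 8.0]))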

Model Selection

How do we choose among competing models for a given dataset? More parameters do not necessarily give a better model: adding terms always reduces the in-sample error, but at the risk of overfitting. For example, fitting a 9th-order polynomial to the sample data from the example above certainly results in an overfit.

In [48]:
def calc_poly(params, data):
    x = np.c_[[data**i for i in range(len(params))]]
    return np.dot(params, x)

ssp = lambda theta, x, y: np.sum((y - calc_poly(theta, x)) ** 2)
betas = fmin(ssp, np.zeros(10), args=(x,y), maxiter=1e6)
plt.plot(x, y, 'ro')
xvals = np.linspace(0, max(x), 100)
plt.plot(xvals, calc_poly(betas, xvals))
Optimization terminated successfully.
         Current function value: 7.015262
         Iterations: 663
         Function evaluations: 983

Out[48]:
[<matplotlib.lines.Line2D at 0x1074ad290>]

One approach is to use an information-theoretic criterion to select the most appropriate model. For example, Akaike's Information Criterion (AIC) balances the fit of the model (in terms of the likelihood) with the number of parameters required to achieve that fit. We can easily calculate AIC as:

$$AIC = n \log(\hat{\sigma}^2) + 2p$$

where $p$ is the number of parameters in the model and $\hat{\sigma}^2 = RSS/(n-p-1)$.

Notice that as the number of parameters increases, the residual sum of squares goes down, but the second term (a penalty) increases.

To apply AIC to model selection, we choose the model that has the lowest AIC value.

In [49]:
n = len(x)

aic = lambda rss, p, n: n * np.log(rss/(n-p-1)) + 2*p

RSS1 = ss(fmin(ss, [0,1], args=(x,y)), x, y)
RSS2 = ss2(fmin(ss2, [1,1,-1], args=(x,y)), x, y)

print aic(RSS1, 2, n), aic(RSS2, 3, n)
Optimization terminated successfully.
         Current function value: 21.375000
         Iterations: 79
         Function evaluations: 153
Optimization terminated successfully.
         Current function value: 14.001110
         Iterations: 198
         Function evaluations: 372
15.7816583572 17.6759368019

Hence, we would select the 2-parameter (linear) model.
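
For reference (not part of the original notebook), statsmodels also reports AIC in its summary tables; it is computed from the log-likelihood rather than from the RSS-based formula above, so with only six observations the values, and even the ranking, need not agree:

print(straight_line.aic)   # about 28.65, as in the first summary above
print(quadratic_fit.aic)   # about 28.11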

Logistic Regression

Fitting a line to the relationship between two variables using the least squares approach is sensible when the variable we are trying to predict is continuous, but what about when the data are dichotomous?

  • male/female
  • pass/fail
  • died/survived

Let's consider the problem of predicting survival in the Titanic disaster, based on our available information. For example, let's say that we want to predict survival as a function of the fare paid for the journey.

In [50]:
titanic = pd.read_excel("data/titanic.xls", "titanic")
titanic.name
Out[50]:
0                      Allen, Miss. Elisabeth Walton
1                     Allison, Master. Hudson Trevor
2                       Allison, Miss. Helen Loraine
3               Allison, Mr. Hudson Joshua Creighton
4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5                                Anderson, Mr. Harry
6                  Andrews, Miss. Kornelia Theodosia
7                             Andrews, Mr. Thomas Jr
8      Appleton, Mrs. Edward Dale (Charlotte Lamson)
9                            Artagaveytia, Mr. Ramon
...
1298                  Wittevrongel, Mr. Camille
1299                        Yasbeck, Mr. Antoni
1300    Yasbeck, Mrs. Antoni (Selini Alexander)
1301                       Youseff, Mr. Gerious
1302                          Yousif, Mr. Wazli
1303                      Yousseff, Mr. Gerious
1304                       Zabour, Miss. Hileni
1305                      Zabour, Miss. Thamine
1306                  Zakarian, Mr. Mapriededer
1307                        Zakarian, Mr. Ortin
1308                         Zimmerman, Mr. Leo
Name: name, Length: 1309, dtype: object
In [51]:
jitter = np.random.normal(scale=0.02, size=len(titanic))
plt.scatter(np.log(titanic.fare), titanic.survived + jitter, alpha=0.3)
plt.yticks([0,1])
plt.ylabel("survived")
plt.xlabel("log(fare)")
Out[51]:
<matplotlib.text.Text at 0x1082b5510>

I have added random jitter on the y-axis to help visualize the density of the points, and have plotted fare on the log scale.

Clearly, fitting a line through this data makes little sense, for several reasons. First, for most values of the predictor variable, the line would predict values that are not zero or one. Second, it would seem odd to choose least squares (or similar) as a criterion for selecting the best line.

In [52]:
x = np.log(titanic.fare[titanic.fare>0])
y = titanic.survived[titanic.fare>0]
betas_titanic = fmin(ss, [1,1], args=(x,y))
Optimization terminated successfully.
         Current function value: 277.621917
         Iterations: 55
         Function evaluations: 103

In [53]:
jitter = np.random.normal(scale=0.02, size=len(titanic))
plt.scatter(np.log(titanic.fare), titanic.survived + jitter, alpha=0.3)
plt.yticks([0,1])
plt.ylabel("survived")
plt.xlabel("log(fare)")
plt.plot([0,7], [betas_titanic[0], betas_titanic[0] + betas_titanic[1]*7.])
Out[53]:
[<matplotlib.lines.Line2D at 0x1083988d0>]

If we look at this data, we can see that for most values of fare, there are some individuals that survived and some that did not. However, notice that the cloud of points is denser on the "survived" (y=1) side for larger values of fare than on the "died" (y=0) side.
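
This asymmetry is exactly what a logistic regression describes: the probability of survival as a function of log(fare). As a minimal sketch of fitting such a model (mine, not the notebook's), statsmodels' Logit can be applied directly to the x and y defined in In [52]:

import statsmodels.api as sm

logit_model = sm.Logit(y, sm.add_constant(x)).fit()
print(logit_model.params)   # intercept and slope on the log-odds scale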


