
Sunday, March 1, 2015

Stochastic model from Pandas and "Statistical Data Modeling" (part 4-3)

Rather than model the binary outcome explicitly, it makes sense instead to model the probability of death or survival in a stochastic model. Probabilities are measured on a continuous [0,1] scale, which may be more amenable to prediction using a regression line. We need to consider a different probability model for this exercise, however; let's consider the Bernoulli distribution as a generative model for our data:


$$f(y|p) = p^{y} (1-p)^{1-y}$$

where $y = \{0,1\}$ and $p \in [0,1]$. So, this model predicts whether $y$ is zero or one as a function of the probability $p$. Notice that when $y=1$, the $1-p$ term disappears, and when $y=0$, the $p$ term disappears.
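As a quick numeric check of this behaviour, we can evaluate the mass function at both outcomes (a minimal sketch; the bernoulli_pmf helper is just for illustration):

bernoulli_pmf = lambda y, p: p**y * (1 - p)**(1 - y)

p = 0.7
bernoulli_pmf(1, p)   # 0.7 -> the (1 - p) factor drops out when y = 1
bernoulli_pmf(0, p)   # 0.3 -> the p factor drops out when y = 0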

So, the model we want to fit should look something like this:


$$p_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

However, since $p$ is constrained to be between zero and one, it is easy to see where a linear (or polynomial) model might predict values outside of this range. We can modify this model slightly by using a link function to transform the probability to have an unbounded range on a new scale. Specifically, we can use a logit transformation as our link function:


$$\text{logit}(p) = \log\left[\frac{p}{1-p}\right] = x$$

Here's a plot of $p/(1-p)$:

In [54]:
# Log-odds (logit) transformation
logit = lambda p: np.log(p/(1.-p))
unit_interval = np.linspace(0, 1)
# Plot the odds p/(1-p) against p (the p = 1 endpoint yields inf)
plt.plot(unit_interval/(1-unit_interval), unit_interval)

And here's the logit function:

In [55]:
plt.plot(logit(unit_interval), unit_interval)

The inverse of the logit transformation is:


$$p = \frac{1}{1 + \exp(-x)}$$

So, now our model is:


$$\text{logit}(p_i) = \beta_0 + \beta_1 x_i + \epsilon_i$$

We can fit this model using maximum likelihood. Our likelihood, again based on the Bernoulli model, is:


$$L(y|p) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}$$

which, on the log scale, is:


$$l(y|p) = \sum_{i=1}^n y_i \log(p_i) + (1-y_i)\log(1-p_i)$$

We can easily implement this in Python, keeping in mind that fmin (from scipy.optimize) minimizes, rather than maximizes, functions:

In [56]:
# Inverse logit (logistic function)
invlogit = lambda x: 1. / (1 + np.exp(-x))

def logistic_like(theta, x, y):
    p = invlogit(theta[0] + theta[1] * x)
    # Return negative of log-likelihood
    return -np.sum(y * np.log(p) + (1-y) * np.log(1 - p))

Remove null values from variables

In [57]:
x, y = titanic[titanic.fare.notnull()][['fare', 'survived']].values.T

... and fit the model.

In [58]:
b0,b1 = fmin(logistic_like, [0.5,0], args=(x,y))
b0, b1
Optimization terminated successfully.
         Current function value: 827.015955
         Iterations: 47
         Function evaluations: 93

Out[58]:
(-0.88238984528338194, 0.012452067664164127)
In [59]:
# Add a little vertical jitter to the binary outcomes so overlapping points are visible
jitter = np.random.normal(scale=0.01, size=len(x))
plt.plot(x, y+jitter, 'r.', alpha=0.3)
plt.yticks([0, .25, .5, .75, 1])
# Overlay the fitted logistic curve across the range of observed fares
xvals = np.linspace(0, 600)
plt.plot(xvals, invlogit(b0 + b1*xvals))
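As a quick sanity check on these estimates, we can plug a few fares into the fitted curve (a small illustrative snippet; the probabilities follow directly from the b0, b1 values above):

# Predicted probability of survival at fares of 10, 50 and 200
invlogit(b0 + b1 * np.array([10, 50, 200]))
# roughly array([0.32, 0.44, 0.83]) with the coefficients estimated above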

As with our least squares model, we can easily fit logistic regression models in statsmodels, in this case using the GLM (generalized linear model) class with a binomial error distribution specified.

In [60]:
logistic = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
logistic.summary()
Out[60]:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1308
Model:                            GLM   Df Residuals:                     1306
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -827.02
Date:                Wed, 18 Jun 2014   Deviance:                       1654.0
Time:                        20:56:06   Pearson chi2:                 1.33e+03
No. Iterations:                     6
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.8824      0.076    -11.684      0.000        -1.030    -0.734
x1             0.0125      0.002      7.762      0.000         0.009     0.016
==============================================================================
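The fitted coefficients can also be extracted programmatically for comparison with the fmin estimates above (a brief sketch; params is the standard attribute on statsmodels results objects):

logistic.params
# approximately array([-0.8824, 0.0125]), matching the maximum likelihood fit above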

Exercise: multivariate logistic regression

Which other variables might be relevant for predicting the probability of surviving the Titanic? Generalize the model likelihood to include 2 or 3 other covariates from the dataset.

In [61]:
# Write your answer here
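For example, here is a minimal sketch of one possible generalization, assuming the dataset also has pclass and sex columns (as in the standard titanic3 data); the likelihood simply gains one coefficient per extra covariate:

def logistic_like_multi(theta, X, y):
    # theta[0] is the intercept; theta[1:] are slopes for the columns of X
    p = invlogit(theta[0] + np.dot(X, theta[1:]))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Candidate covariates: fare, passenger class, and an indicator for female passengers
# (pclass and sex are assumed column names)
subset = titanic[titanic.fare.notnull()]
X = np.column_stack([subset.fare, subset.pclass, (subset.sex == 'female').astype(int)])
betas = fmin(logistic_like_multi, [0., 0., 0., 0.], args=(X, subset.survived.values),
             maxiter=1000, maxfun=2000)
betas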

Bootstrapping

Parametric inference can be non-robust:

  • inaccurate if parametric assumptions are violated
  • if we rely on asymptotic results, we may not achieve an acceptable level of accuracy

Parametric inference can be difficult:

  • derivation of sampling distribution may not be possible

An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.

We have seen this already with the kernel density estimate.

Non-parametric Bootstrap

The bootstrap is a resampling method introduced by Brad Efron that allows one to approximate the sampling distribution of a statistic empirically, and thereby obtain estimates of the mean and variance of that distribution.

Bootstrap sample:


$$S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$$

$S_i^*$ is a sample of size $n$, drawn with replacement from the original sample.

In Python, we have already seen the NumPy function permutation that can be used in conjunction with Pandas' take method to generate a random sample of some data without replacement:

In [62]:
np.random.permutation(titanic.name)[:5]
Out[62]:
array([u'Meek, Mrs. Thomas (Annie Louise Rowley)',
       u'Thorneycroft, Mr. Percival', u'Williams, Mr. Leslie',
       u'Graham, Mr. George Edward', u'Petroff, Mr. Nedelio'], dtype=object)

Similarly, we can use np.random.randint to generate a sample with replacement, which we can use when bootstrapping.

In [63]:
random_ind = np.random.randint(0, len(titanic), 5)
titanic.name[random_ind]
Out[63]:
41      Brown, Mrs. James Joseph (Margaret Tobin)
1061              Nilsson, Miss. Helmina Josefina
937                  Klasen, Miss. Gertrud Emilia
426                            Hale, Mr. Reginald
831                Goodwin, Mr. Charles Frederick
Name: name, dtype: object

We regard $S$ as an "estimate" of the population $P$:

population : sample :: sample : bootstrap sample

The idea is to generate replicate bootstrap samples:


$$S^* = \{S_1^*, S_2^*, \ldots, S_R^*\}$$

Compute statistic $t$ (estimate) for each bootstrap sample:


$$T_i^* = t(S_i^*)$$
In [64]:
n = 10
R = 1000
# Original sample (n=10)
x = np.random.normal(size=n)
# 1000 bootstrap samples of size 10
s = [x[np.random.randint(0,n,n)].mean() for i in range(R)]
_ = plt.hist(s, bins=30)

Bootstrap Estimates

From our bootstrapped samples, we can extract estimates of the expectation and its variance:

$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_i T_i^*}{R}$$

$$\hat{\text{Var}}(T^*) = \frac{\sum_i (T_i^* - \bar{T}^*)^2}{R-1}$$

In [65]:
boot_mean = np.sum(s)/R
boot_mean
Out[65]:
0.087385394806476724
In [66]:
boot_var = ((np.array(s) - boot_mean) ** 2).sum() / (R-1)
boot_var
Out[66]:
0.10590407752057245

Since we have estimated the expectation of the bootstrapped statistics, we can estimate the bias of T:

$$\hat{B}^* = \bar{T}^* - T$$

In [67]:
boot_mean - np.mean(x)
Out[67]:
-0.00528084680355842

Bootstrap error

There are two sources of error in bootstrap estimates:

  1. Sampling error from the selection of $S$.
  2. Bootstrap error from failing to enumerate all possible bootstrap samples.

For the sake of accuracy, it is prudent to choose at least $R=1000$ bootstrap replicates.

Bootstrap Percentile Intervals

An attractive feature of bootstrap statistics is the ease with which you can obtain an estimate of uncertainty for a given statistic. We simply use the empirical quantiles of the bootstrapped statistics to obtain percentiles corresponding to a confidence interval of interest.

This employs the ordered bootstrap replicates:

$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$

Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:

$$T_{[(R+1)\alpha/2]}^* \lt \theta \lt T_{[(R+1)(1-\alpha/2)]}^*$$

In [68]:
s_sorted = np.sort(s)
s_sorted[:10]
Out[68]:
array([-0.82890714, -0.77634577, -0.76588512, -0.76230089, -0.75578488,
       -0.73850118, -0.72869116, -0.72862786, -0.72840095, -0.71831374])
In [69]:
s_sorted[-10:]
Out[69]:
array([ 0.81823418,  0.86179331,  0.92314175,  0.93496722,  0.9358216 ,
        1.02058937,  1.03085586,  1.03121927,  1.22699691,  1.3599996 ])
In [70]:
alpha = 0.05
s_sorted[[int((R+1)*alpha/2), int((R+1)*(1-alpha/2))]]
Out[70]:
array([-0.5684053 ,  0.68682205])

Exercise: Cervical dystonia bootstrap estimates

Use bootstrapping to estimate the mean of one of the treatment groups, and calculate percentile intervals for the mean.

In [71]:
# Write your answer here
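For reference, a minimal sketch of one approach, assuming the cdystonia DataFrame from earlier parts of this series, with treat and twstrs columns and a treatment group labelled '10000U' (adjust the names to whatever the loaded data actually uses):

# TWSTRS scores for one treatment group (column and group names are assumptions)
scores = cdystonia[cdystonia.treat == '10000U'].twstrs.values

R = 1000
n = len(scores)
boot_means = np.sort([scores[np.random.randint(0, n, n)].mean() for i in range(R)])

boot_means.mean()                                  # bootstrap estimate of the mean
boot_means[[int((R+1)*0.025), int((R+1)*0.975)]]   # 95% percentile interval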

