
Sunday, March 1, 2015

Stochastic model from Pandas and "Statistical Data Modeling" (part 4-3)

Rather than model the binary outcome explicitly, it makes sense instead to model the probability of death or survival in a stochastic model. Probabilities are measured on a continuous [0,1] scale, which may be more amenable to prediction using a regression line. We need to consider a different probability model for this exercise, however; let's consider the Bernoulli distribution as a generative model for our data:


$$f(y|p) = p^{y} (1-p)^{1-y}$$

where $y = \{0,1\}$ and $p \in [0,1]$. So, this model predicts whether $y$ is zero or one as a function of the probability $p$. Notice that when $y=1$, the $1-p$ term disappears, and when $y=0$, the $p$ term disappears.
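As a quick numeric check of this behaviour, we can evaluate the mass function at both outcomes (a minimal sketch; the bernoulli_pmf helper is just for illustration):

bernoulli_pmf = lambda y, p: p**y * (1 - p)**(1 - y)

p = 0.7
bernoulli_pmf(1, p)   # 0.7 -> the (1 - p) factor drops out when y = 1
bernoulli_pmf(0, p)   # 0.3 -> the p factor drops out when y = 0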

So, the model we want to fit should look something like this:


$$p_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

However, since $p$ is constrained to be between zero and one, it is easy to see where a linear (or polynomial) model might predict values outside of this range. We can modify this model slightly by using a link function to transform the probability to have an unbounded range on a new scale. Specifically, we can use a logit transformation as our link function:


$$\text{logit}(p) = \log\left[\frac{p}{1-p}\right] = x$$

Here's a plot of $p/(1-p)$:

In [54]:
# Log-odds (logit) transformation
logit = lambda p: np.log(p/(1.-p))
unit_interval = np.linspace(0, 1)
# Plot the odds p/(1-p) against p (the p = 1 endpoint yields inf)
plt.plot(unit_interval/(1-unit_interval), unit_interval)

And here's the logit function:

In [55]:
plt.plot(logit(unit_interval), unit_interval)

The inverse of the logit transformation is:


$$p = \frac{1}{1 + \exp(-x)}$$

So, now our model is:


$$\text{logit}(p_i) = \beta_0 + \beta_1 x_i + \epsilon_i$$

We can fit this model using maximum likelihood. Our likelihood, again based on the Bernoulli model, is:


$$L(y|p) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}$$

which, on the log scale, is:


$$l(y|p) = \sum_{i=1}^n y_i \log(p_i) + (1-y_i)\log(1-p_i)$$

We can easily implement this in Python, keeping in mind that fmin (from scipy.optimize) minimizes, rather than maximizes, functions:

In [56]:
# Inverse logit (logistic function)
invlogit = lambda x: 1. / (1 + np.exp(-x))

def logistic_like(theta, x, y):
    p = invlogit(theta[0] + theta[1] * x)
    # Return negative of log-likelihood
    return -np.sum(y * np.log(p) + (1-y) * np.log(1 - p))

Remove null values from variables

In [57]:
x, y = titanic[titanic.fare.notnull()][['fare', 'survived']].values.T

... and fit the model.

In [58]:
b0,b1 = fmin(logistic_like, [0.5,0], args=(x,y))
b0, b1
Optimization terminated successfully.
         Current function value: 827.015955
         Iterations: 47
         Function evaluations: 93

Out[58]:
(-0.88238984528338194, 0.012452067664164127)
In [59]:
# Add a little vertical jitter to the binary outcomes so overlapping points are visible
jitter = np.random.normal(scale=0.01, size=len(x))
plt.plot(x, y+jitter, 'r.', alpha=0.3)
plt.yticks([0, .25, .5, .75, 1])
# Overlay the fitted logistic curve across the range of observed fares
xvals = np.linspace(0, 600)
plt.plot(xvals, invlogit(b0 + b1*xvals))
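As a quick sanity check on these estimates, we can plug a few fares into the fitted curve (a small illustrative snippet; the probabilities follow directly from the b0, b1 values above):

# Predicted probability of survival at fares of 10, 50 and 200
invlogit(b0 + b1 * np.array([10, 50, 200]))
# roughly array([0.32, 0.44, 0.83]) with the coefficients estimated above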

As with our least squares model, we can easily fit logistic regression models in statsmodels, in this case using the GLM (generalized linear model) class with a binomial error distribution specified.

In [60]:
logistic = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
logistic.summary()
Out[60]:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1308
Model:                            GLM   Df Residuals:                     1306
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -827.02
Date:                Wed, 18 Jun 2014   Deviance:                       1654.0
Time:                        20:56:06   Pearson chi2:                 1.33e+03
No. Iterations:                     6
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.8824      0.076    -11.684      0.000        -1.030    -0.734
x1             0.0125      0.002      7.762      0.000         0.009     0.016
==============================================================================
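The fitted coefficients can also be extracted programmatically for comparison with the fmin estimates above (a brief sketch; params is the standard attribute on statsmodels results objects):

logistic.params
# approximately array([-0.8824, 0.0125]), matching the maximum likelihood fit above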

Exercise: multivariate logistic regression

Which other variables might be relevant for predicting the probability of surviving the Titanic? Generalize the model likelihood to include 2 or 3 other covariates from the dataset.

In [61]:
# Write your answer here
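For example, here is a minimal sketch of one possible generalization, assuming the dataset also has pclass and sex columns (as in the standard titanic3 data); the likelihood simply gains one coefficient per extra covariate:

def logistic_like_multi(theta, X, y):
    # theta[0] is the intercept; theta[1:] are slopes for the columns of X
    p = invlogit(theta[0] + np.dot(X, theta[1:]))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Candidate covariates: fare, passenger class, and an indicator for female passengers
# (pclass and sex are assumed column names)
subset = titanic[titanic.fare.notnull()]
X = np.column_stack([subset.fare, subset.pclass, (subset.sex == 'female').astype(int)])
betas = fmin(logistic_like_multi, [0., 0., 0., 0.], args=(X, subset.survived.values),
             maxiter=1000, maxfun=2000)
betas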

Bootstrapping

Parametric inference can be non-robust:

  • inaccurate if parametric assumptions are violated
  • if we rely on asymptotic results, we may not achieve an acceptable level of accuracy

Parametric inference can be difficult:

  • derivation of sampling distribution may not be possible

An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.

We have seen this already with the kernel density estimate.

Non-parametric Bootstrap

The bootstrap is a resampling method introduced by Brad Efron that allows one to approximate the sampling distribution of a statistic empirically, and thereby obtain estimates of the mean and variance of that distribution.

Bootstrap sample:


$$S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$$

$S_i^*$ is a sample of size $n$, drawn with replacement from the original sample.

In Python, we have already seen the NumPy function permutation that can be used in conjunction with Pandas' take method to generate a random sample of some data without replacement:

In [62]:
np.random.permutation(titanic.name)[:5]
Out[62]:
array([u'Meek, Mrs. Thomas (Annie Louise Rowley)',
       u'Thorneycroft, Mr. Percival', u'Williams, Mr. Leslie',
       u'Graham, Mr. George Edward', u'Petroff, Mr. Nedelio'], dtype=object)

Similarly, we can use np.random.randint to generate a sample with replacement, which we can use when bootstrapping.

In [63]:
random_ind = np.random.randint(0, len(titanic), 5)
titanic.name[random_ind]
Out[63]:
41      Brown, Mrs. James Joseph (Margaret Tobin)
1061              Nilsson, Miss. Helmina Josefina
937                  Klasen, Miss. Gertrud Emilia
426                            Hale, Mr. Reginald
831                Goodwin, Mr. Charles Frederick
Name: name, dtype: object

We regard $S$ as an "estimate" of the population $P$:

population : sample :: sample : bootstrap sample

The idea is to generate replicate bootstrap samples:


$$S^* = \{S_1^*, S_2^*, \ldots, S_R^*\}$$

Compute statistic $t$ (estimate) for each bootstrap sample:


$$T_i^* = t(S_i^*)$$
In [64]:
n = 10
R = 1000
# Original sample (n=10)
x = np.random.normal(size=n)
# 1000 bootstrap samples of size 10
s = [x[np.random.randint(0,n,n)].mean() for i in range(R)]
_ = plt.hist(s, bins=30)

Bootstrap Estimates

From our bootstrapped samples, we can extract estimates of the expectation and its variance:

$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_i T_i^*}{R}$$

$$\hat{\text{Var}}(T^*) = \frac{\sum_i (T_i^* - \bar{T}^*)^2}{R-1}$$

In [65]:
boot_mean = np.sum(s)/R
boot_mean
Out[65]:
0.087385394806476724
In [66]:
boot_var = ((np.array(s) - boot_mean) ** 2).sum() / (R-1)
boot_var
Out[66]:
0.10590407752057245

Since we have estimated the expectation of the bootstrapped statistics, we can estimate the bias of T:

$$\hat{B}^* = \bar{T}^* - T$$

In [67]:
boot_mean - np.mean(x)
Out[67]:
-0.00528084680355842

Bootstrap error

There are two sources of error in bootstrap estimates:

  1. Sampling error from the selection of $S$.
  2. Bootstrap error from failing to enumerate all possible bootstrap samples.

For the sake of accuracy, it is prudent to choose at least $R=1000$ bootstrap replicates.

Bootstrap Percentile Intervals

An attractive feature of bootstrap statistics is the ease with which you can obtain an estimate of uncertainty for a given statistic. We simply use the empirical quantiles of the bootstrapped statistics to obtain percentiles corresponding to a confidence interval of interest.

This employs the ordered bootstrap replicates:

$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$

Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:

$$T_{[(R+1)\alpha/2]}^* \lt \theta \lt T_{[(R+1)(1-\alpha/2)]}^*$$

In [68]:
s_sorted = np.sort(s)
s_sorted[:10]
Out[68]:
array([-0.82890714, -0.77634577, -0.76588512, -0.76230089, -0.75578488,
       -0.73850118, -0.72869116, -0.72862786, -0.72840095, -0.71831374])
In [69]:
s_sorted[-10:]
Out[69]:
array([ 0.81823418,  0.86179331,  0.92314175,  0.93496722,  0.9358216 ,
        1.02058937,  1.03085586,  1.03121927,  1.22699691,  1.3599996 ])
In [70]:
alpha = 0.05
s_sorted[[int((R+1)*alpha/2), int((R+1)*(1-alpha/2))]]
Out[70]:
array([-0.5684053 ,  0.68682205])

Exercise: Cervical dystonia bootstrap estimates

Use bootstrapping to estimate the mean of one of the treatment groups, and calculate percentile intervals for the mean.

In [71]:
# Write your answer here
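For reference, a minimal sketch of one approach, assuming the cdystonia DataFrame from earlier parts of this series, with treat and twstrs columns and a treatment group labelled '10000U' (adjust the names to whatever the loaded data actually uses):

# TWSTRS scores for one treatment group (column and group names are assumptions)
scores = cdystonia[cdystonia.treat == '10000U'].twstrs.values

R = 1000
n = len(scores)
boot_means = np.sort([scores[np.random.randint(0, n, n)].mean() for i in range(R)])

boot_means.mean()                                  # bootstrap estimate of the mean
boot_means[[int((R+1)*0.025), int((R+1)*0.975)]]   # 95% percentile interval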

