Rather than model the binary outcome explicitly, it makes sense to model the probability of death or survival in a stochastic model. Probabilities are measured on a continuous [0,1] scale, which may be more amenable to prediction using a regression line. We need a different probability model for this exercise, however; let's consider the Bernoulli distribution as a generative model for our data:
$$f(y|p) = p^{y} (1-p)^{1-y}$$
where $y \in \{0,1\}$ and $p \in [0,1]$. So, this model predicts whether $y$ is zero or one as a function of the probability $p$. Notice that when $y=1$, the $(1-p)$ term disappears, and when $y=0$, the $p$ term disappears.
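To make this concrete, here is a minimal sketch of the Bernoulli mass function (the name bernoulli_like is our own, for illustration), confirming that it reduces to $p$ when $y=1$ and to $1-p$ when $y=0$:
bernoulli_like = lambda y, p: p**y * (1-p)**(1-y)
# With p = 0.7: f(1|p) reduces to p = 0.7, and f(0|p) reduces to 1-p = 0.3
bernoulli_like(1, 0.7), bernoulli_like(0, 0.7)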
So, the model we want to fit should look something like this:
$$p_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
However, since $p$ is constrained to be between zero and one, it is easy to see where a linear (or polynomial) model might predict values outside of this range. We can modify this model slightly by using a link function to transform the probability to have an unbounded range on a new scale. Specifically, we can use a logit transformation as our link function:
$$\text{logit}(p) = \log\left[\frac{p}{1-p}\right] = x$$
Here's a plot of the odds $p/(1-p)$:
import numpy as np
import matplotlib.pyplot as plt

logit = lambda p: np.log(p/(1.-p))
# Stay strictly inside (0, 1): the odds and logit are undefined at 0 and 1
unit_interval = np.linspace(0.01, 0.99)
plt.plot(unit_interval/(1-unit_interval), unit_interval)
And here's the logit function:
plt.plot(logit(unit_interval), unit_interval)
The inverse of the logit transformation is:
$$p = \frac{1}{1 + \exp(-x)}$$
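As a quick numerical check (a sketch using the unit_interval values defined above), applying this inverse to the logit should recover the original probabilities:
# Round-trip: logit followed by its inverse returns p, up to float error
np.allclose(1. / (1 + np.exp(-logit(unit_interval))), unit_interval)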
So, now our model is:
$$\text{logit}(p_i) = \beta_0 + \beta_1 x_i$$
Note that there is no additive error term on this scale; the randomness in the data is captured by the Bernoulli distribution itself.
We can fit this model using maximum likelihood. Our likelihood, again based on the Bernoulli model, is:
$$L(y|p) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}$$
which, on the log scale is:
$$l(y|p) = \sum_{i=1}^n y_i \log(p_i) + (1-y_i)\log(1-p_i)$$
We can easily implement this in Python, keeping in mind that scipy.optimize.fmin minimizes, rather than maximizes, functions:
from scipy.optimize import fmin

invlogit = lambda x: 1. / (1 + np.exp(-x))

def logistic_like(theta, x, y):
    # Linear predictor passed through the inverse-logit link
    p = invlogit(theta[0] + theta[1] * x)
    # Return negative of log-likelihood, since fmin minimizes
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
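As a quick sanity check (a toy example of our own, not from the dataset): with both coefficients set to zero, $p = 0.5$ for every observation, so the negative log-likelihood should be $n \log 2$:
# For n = 3 observations, expect 3 * log(2) ≈ 2.079
logistic_like([0, 0], np.array([1., 2., 3.]), np.array([0, 1, 1]))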
Remove null values from variables
x, y = titanic[titanic.fare.notnull()][['fare', 'survived']].values.T
... and fit the model.
b0, b1 = fmin(logistic_like, [0.5, 0], args=(x, y))
b0, b1
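With fitted coefficients in hand, we can predict the probability of survival at any fare; the fare of 50 below is an arbitrary illustrative value:
# Predicted probability of survival for a passenger paying a fare of 50
invlogit(b0 + b1 * 50)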
# Add vertical jitter so the overlapping 0/1 outcomes are visible
jitter = np.random.normal(scale=0.01, size=len(x))
plt.plot(x, y + jitter, 'r.', alpha=0.3)
plt.yticks([0, .25, .5, .75, 1])
# Overlay the fitted logistic curve
xvals = np.linspace(0, 600)
plt.plot(xvals, invlogit(b0 + b1 * xvals))
As with our least squares model, we can easily fit logistic regression models in statsmodels, in this case using the GLM (generalized linear model) class with a binomial error distribution specified.
import statsmodels.api as sm

logistic = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
logistic.summary()
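The fitted coefficients should agree closely with our fmin estimates:
# Parameter estimates from the GLM fit (intercept, fare coefficient)
logistic.params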
Exercise: multivariate logistic regression
Which other variables might be relevant for predicting the probability of surviving the Titanic? Generalize the model likelihood to include 2 or 3 other covariates from the dataset.
# Write your answer here
Bootstrapping
Parametric inference can be non-robust:
- inaccurate if parametric assumptions are violated
- if we rely on asymptotic results, we may not achieve an acceptable level of accuracy
Parametric inference can be difficult:
- derivation of sampling distribution may not be possible
An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.
We have seen this already with the kernel density estimate.
Non-parametric Bootstrap
The bootstrap is a resampling method introduced by Brad Efron that allows one to approximate the true sampling distribution of a statistic, and thereby obtain estimates of its mean and variance.
Bootstrap sample:
$$S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$$
$S_i^*$ is a sample of size $n$, drawn with replacement from the original sample.
In Python, we have already seen the NumPy function permutation, which can be used in conjunction with Pandas' take method to generate a random sample of some data without replacement:
np.random.permutation(titanic.name)[:5]
Similarly, we can use the np.random.randint function to generate a sample with replacement, which is what we need for bootstrapping.
random_ind = np.random.randint(0, len(titanic), 5)
titanic.name[random_ind]
We regard $S$ as an "estimate" of the population $P$:
population : sample :: sample : bootstrap sample
The idea is to generate replicate bootstrap samples:
$$S^* = \{S_1^*, S_2^*, \ldots, S_R^*\}$$
Compute statistic $t$ (estimate) for each bootstrap sample:
$$T_i^* = t(S_i^*)$$
n = 10
R = 1000
# Original sample (n=10)
x = np.random.normal(size=n)
# 1000 bootstrap samples of size 10
s = [x[np.random.randint(0,n,n)].mean() for i in range(R)]
_ = plt.hist(s, bins=30)
Bootstrap Estimates
From our bootstrapped samples, we can extract estimates of the expectation and its variance:
$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_i T_i^*}{R}$$
$$\hat{\text{Var}}(T^*) = \frac{\sum_i (T_i^* - \bar{T}^*)^2}{R-1}$$
boot_mean = np.sum(s)/R
boot_mean
boot_var = ((np.array(s) - boot_mean) ** 2).sum() / (R-1)
boot_var
Since we have estimated the expectation of the bootstrapped statistics, we can estimate the bias of T:
$$\hat{B}^* = \bar{T}^* - T$$
boot_mean - np.mean(x)
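Subtracting this estimated bias from the original statistic yields a bias-corrected estimate; a one-line sketch:
# Bias-corrected estimate: T - (T-bar* - T) = 2T - T-bar*
2 * np.mean(x) - boot_mean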
Bootstrap error
There are two sources of error in bootstrap estimates:
- Sampling error from the selection of $S$.
- Bootstrap error from failing to enumerate all possible bootstrap samples.
For the sake of accuracy, it is prudent to choose at least R = 1000 bootstrap samples.
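To get a feel for the second source of error, we can repeat the entire bootstrap procedure several times at different values of R and compare the spread of the resulting estimates. This is an illustrative sketch using the x and n defined above; the choice of 20 replications is arbitrary:
# Repeat the full bootstrap 20 times for each R; the estimates become
# less variable as R grows
for R_trial in (10, 100, 1000):
    estimates = [np.mean([x[np.random.randint(0, n, n)].mean()
                          for i in range(R_trial)])
                 for j in range(20)]
    print(R_trial, np.std(estimates))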
Bootstrap Percentile Intervals
An attractive feature of bootstrap statistics is the ease with which you can obtain an estimate of uncertainty for a given statistic. We simply use the empirical quantiles of the bootstrapped statistics to obtain percentiles corresponding to a confidence interval of interest.
This employs the ordered bootstrap replicates:
$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$
Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:
$$T_{[(R+1)\alpha/2]}^* < \theta < T_{[(R+1)(1-\alpha/2)]}^*$$
s_sorted = np.sort(s)
s_sorted[:10]
s_sorted[-10:]
alpha = 0.05
# Indices must be integers
s_sorted[[int((R+1)*alpha/2), int((R+1)*(1-alpha/2))]]
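Equivalently, NumPy's percentile function computes these limits directly from the unsorted replicates:
# 95% percentile interval via np.percentile
np.percentile(s, [100 * alpha / 2, 100 * (1 - alpha / 2)])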
Exercise: Cervical dystonia bootstrap estimates
Use bootstrapping to estimate the mean of one of the treatment groups, and calculate percentile intervals for the mean.
# Write your answer here