Rather than model the binary outcome explicitly, it makes sense instead to model the probability of death or survival in a stochastic model. Probabilities are measured on a continuous [0,1] scale, which may be more amenable to prediction using a regression line. We need to consider a different probability model for this exercise, however; let's consider the Bernoulli distribution as a generative model for our data:
$$f(y|p) = p^y (1-p)^{1-y}$$
where $y = \{0,1\}$ and $p \in [0,1]$. So, this model predicts whether $y$ is zero or one as a function of the probability $p$. Notice that when $y=1$, the $1-p$ term disappears, and when $y=0$, the $p$ term disappears.
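To see how the mass function behaves, here is a minimal sketch (the bernoulli_like name is just illustrative) that evaluates it for both outcomes at $p = 0.7$:
# Bernoulli mass function: p**y * (1 - p)**(1 - y)
bernoulli_like = lambda y, p: p**y * (1 - p)**(1 - y)
bernoulli_like(1, 0.7), bernoulli_like(0, 0.7)   # approximately (0.7, 0.3)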
So, the model we want to fit should look something like this:
$$p_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
However, since $p$ is constrained to be between zero and one, it is easy to see where a linear (or polynomial) model might predict values outside of this range. We can modify this model slightly by using a link function to transform the probability to have an unbounded range on a new scale. Specifically, we can use a logit transformation as our link function:
$$\text{logit}(p) = \log\left[\frac{p}{1-p}\right] = x$$
Here's a plot of the odds, $p/(1-p)$:
logit = lambda p: np.log(p/(1.-p))
unit_interval = np.linspace(0,1)
plt.plot(unit_interval/(1-unit_interval), unit_interval)
And here's the logit function:
plt.plot(logit(unit_interval), unit_interval)
The inverse of the logit transformation is:
$$p = \frac{1}{1 + \exp(-x)}$$
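As a quick sanity check (reusing the logit function defined above), applying this inverse to logit(p) should recover p:
# Round trip: the inverse transformation applied to logit(p) recovers p
p = np.linspace(0.01, 0.99)                    # avoid 0 and 1, where the logit is infinite
np.allclose(1. / (1 + np.exp(-logit(p))), p)   # True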
So, now our model is:
$$\text{logit}(p_i) = \beta_0 + \beta_1 x_i + \epsilon_i$$
We can fit this model using maximum likelihood. Our likelihood, again based on the Bernoulli model, is:
$$L(y|p) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}$$
which, on the log scale is:
$$l(y|p) = \sum_{i=1}^n y_i \log(p_i) + (1-y_i)\log(1-p_i)$$
We can easily implement this in Python, keeping in mind that fmin minimizes, rather than maximizes, functions:
invlogit = lambda x: 1. / (1 + np.exp(-x))
def logistic_like(theta, x, y):
p = invlogit(theta[0] + theta[1] * x)
# Return negative of log-likelihood
return -np.sum(y * np.log(p) + (1-y) * np.log(1 - p))
Remove null values from variables
x, y = titanic[titanic.fare.notnull()][['fare', 'survived']].values.T
... and fit the model.
b0,b1 = fmin(logistic_like, [0.5,0], args=(x,y))
b0, b1
jitter = np.random.normal(scale=0.01, size=len(x))
plt.plot(x, y+jitter, 'r.', alpha=0.3)
plt.yticks([0,.25,.5,.75,1])
xvals = np.linspace(0, 600)
plt.plot(xvals, invlogit(b0+b1*xvals))
As with our least squares model, we can easily fit logistic regression models in statsmodels, in this case using the GLM (generalized linear model) class with a binomial error distribution specified.
logistic = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
logistic.summary()
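The fitted GLM coefficients should agree closely with the estimates we obtained from fmin above; as a quick check (assuming b0 and b1 are still in scope):
# Compare the GLM coefficients with the fmin estimates
logistic.params, (b0, b1)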
Exercise: multivariate logistic regression
Which other variables might be relevant for predicting the probability of surviving the Titanic? Generalize the model likelihood to include 2 or 3 other covariates from the dataset.
# Write your answer here
Bootstrapping
Parametric inference can be non-robust:
- inaccurate if parametric assumptions are violated
- if we rely on asymptotic results, we may not achieve an acceptable level of accuracy
Parametric inference can be difficult:
- derivation of sampling distribution may not be possible
An alternative is to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population.
We have seen this already with the kernel density estimate.
Non-parametric Bootstrap
The bootstrap is a resampling method introduced by Brad Efron that allows one to approximate the sampling distribution of a statistic, and thereby obtain estimates of its mean and variance.
Bootstrap sample:
$$S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$$
$S_i^*$ is a sample of size $n$, with replacement.
In Python, we have already seen the NumPy function permutation, which can be used in conjunction with Pandas' take method to generate a random sample of some data without replacement:
np.random.permutation(titanic.name)[:5]
Similarly, we can use the np.random.randint function to generate a sample with replacement, which we can use when bootstrapping.
random_ind = np.random.randint(0, len(titanic), 5)
titanic.name[random_ind]
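Equivalently, NumPy's choice function draws a sample with replacement directly; a small sketch (assuming the same titanic data):
# Draw 5 names with replacement in a single call
np.random.choice(titanic.name, size=5, replace=True)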
We regard S as an "estimate" of population P
population : sample :: sample : bootstrap sample
The idea is to generate replicate bootstrap samples:
$$S^* = \{S_1^*, S_2^*, \ldots, S_R^*\}$$
Compute statistic $t$ (estimate) for each bootstrap sample:
$$T_i^* = t(S_i^*)$$
n = 10
R = 1000
# Original sample (n=10)
x = np.random.normal(size=n)
# 1000 bootstrap samples of size 10
s = [x[np.random.randint(0,n,n)].mean() for i in range(R)]
_ = plt.hist(s, bins=30)
Bootstrap Estimates
From our bootstrapped samples, we can extract estimates of the expectation and its variance:
$$\bar{T}^* = \hat{E}(T^*) = \frac{\sum_i T_i^*}{R}$$
$$\widehat{\text{Var}}(T^*) = \frac{\sum_i (T_i^* - \bar{T}^*)^2}{R-1}$$
boot_mean = np.sum(s)/R
boot_mean
boot_var = ((np.array(s) - boot_mean) ** 2).sum() / (R-1)
boot_var
Since we have estimated the expectation of the bootstrapped statistics, we can estimate the bias of T:
$$\hat{B}^* = \bar{T}^* - T$$
boot_mean - np.mean(x)
Bootstrap error
There are two sources of error in bootstrap estimates:
- Sampling error from the selection of $S$.
- Bootstrap error from failing to enumerate all possible bootstrap samples.
For the sake of accuracy, it is prudent to choose at least $R=1000$ bootstrap samples.
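To get a feel for this second source of error, here is a small sketch (using a hypothetical bootstrap_mean helper and the sample x from above) comparing how much the bootstrap estimate of the mean fluctuates across repeated runs for small versus large R:
# Monte Carlo (bootstrap) error: the estimate fluctuates less as R grows
bootstrap_mean = lambda x, R: np.mean([x[np.random.randint(0, len(x), len(x))].mean()
                                       for i in range(R)])
np.std([bootstrap_mean(x, 100) for i in range(50)]), np.std([bootstrap_mean(x, 2000) for i in range(50)])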
Bootstrap Percentile Intervals
An attractive feature of bootstrap statistics is the ease with which you can obtain an estimate of uncertainty for a given statistic. We simply use the empirical quantiles of the bootstrapped statistics to obtain percentiles corresponding to a confidence interval of interest.
This employs the ordered bootstrap replicates:
$$T_{(1)}^*, T_{(2)}^*, \ldots, T_{(R)}^*$$
Simply extract the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles:
$$T_{[(R+1)\alpha/2]}^* < \theta < T_{[(R+1)(1-\alpha/2)]}^*$$
s_sorted = np.sort(s)
s_sorted[:10]
s_sorted[-10:]
alpha = 0.05
s_sorted[[int((R+1)*alpha/2), int((R+1)*(1-alpha/2))]]
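A similar interval can be obtained directly (up to the interpolation rule used) with NumPy's percentile function:
# Empirical 95% percentile interval of the bootstrapped means
np.percentile(s, [100 * alpha / 2, 100 * (1 - alpha / 2)])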
Exercise: Cervical dystonia bootstrap estimates
Use bootstrapping to estimate the mean of one of the treatment groups, and calculate percentile intervals for the mean.
# Write your answer here