Here are plots and formulas at the level of "graduate-level statistics courses": fitting distributions, the maximum likelihood principle, ... Some of the approaches in the distribution examples are debatable, and you are unlikely to need to recall these formulas in the next few years... But there is a large number of code examples for building plots, and (LaTeX) examples of the equations.
Some or most of you have probably taken some undergraduate- or graduate-level statistics courses. Unfortunately, the curricula for most introductory statistics courses are mostly focused on conducting statistical hypothesis tests as the primary means of inference: t-tests, chi-squared tests, analysis of variance, etc. Such tests seek to estimate whether groups or effects are "statistically significant", a concept that is poorly understood, and hence often misused, by most practitioners. Even when interpreted correctly, statistical significance is a questionable goal for statistical inference, as it is of limited utility.
A far more powerful approach to statistical analysis involves building flexible models with the overarching aim of estimating quantities of interest. This section of the tutorial illustrates how to use Python to build statistical models of low to moderate difficulty from scratch, and use them to extract estimates and associated measures of uncertainty.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
Estimation
A recurring statistical problem is finding estimates of the relevant parameters that correspond to the distribution that best represents our data. In parametric inference, we specify a suitable distribution a priori, then choose the parameters that best fit the data.
- e.g. $\mu$ and $\sigma^2$ in the case of the normal distribution
In [2]:
x = np.array([ 1.00201077, 1.58251956, 0.94515919, 6.48778002, 1.47764604,
5.18847071, 4.21988095, 2.85971522, 3.40044437, 3.74907745,
1.18065796, 3.74748775, 3.27328568, 3.19374927, 8.0726155 ,
0.90326139, 2.34460034, 2.14199217, 3.27446744, 3.58872357,
1.20611533, 2.16594393, 5.56610242, 4.66479977, 2.3573932 ])
_ = plt.hist(x, bins=8)
Fitting data to probability distributions
We start with the problem of finding values for the parameters that provide the best fit between the model and the data, called point estimates. First, we need to define what we mean by ‘best fit’. There are two commonly used criteria (a quick sketch contrasting them follows this list):
- Method of moments chooses the parameters so that the sample moments (typically the sample mean and variance) match the theoretical moments of our chosen distribution.
- Maximum likelihood chooses the parameters to maximize the likelihood, which measures how likely it is to observe our given sample.
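As a quick preview (a minimal sketch; both approaches are developed in detail below), here is a normal fit to the sample x above by each criterion. For the normal the two coincide; for other distributions, such as the gamma used later, they generally differ.
In [ ]:
# A minimal sketch: fit a normal to the sample x defined above using both criteria
from scipy.stats import norm
# Method of moments: match the sample mean and standard deviation
mu_mom, sigma_mom = x.mean(), x.std()
# Maximum likelihood: norm.fit chooses the parameters that maximize the likelihood
mu_mle, sigma_mle = norm.fit(x)
(mu_mom, sigma_mom), (mu_mle, sigma_mle)  # for the normal the two coincide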
Discrete Random Variables
$$X = \{0,1\}$$
$$Y = \{\ldots,-2,-1,0,1,2,\ldots\}$$
Probability Mass Function:
For discrete $X$,
$$Pr(X=x) = f(x|\theta)$$
e.g. Poisson distribution
The Poisson distribution models unbounded counts:
$$Pr(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}$$
- $X=\{0,1,2,\ldots\}$
- $\lambda > 0$
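As a quick check (a minimal sketch), we can evaluate this mass function by hand and compare the result with scipy.stats.poisson.pmf:
In [ ]:
from math import factorial
from scipy.stats import poisson
# Evaluate Pr(X=k) for an arbitrary rate and count
lam, k = 4.0, 3
np.exp(-lam) * lam**k / factorial(k), poisson.pmf(k, lam)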
Continuous Random Variables
$$X \in [0,1]$$
$$Y \in (-\infty, \infty)$$
Probability Density Function:
For continuous $X$,
$$Pr(x \le X \le x + dx) = f(x|\theta)dx \, \text{ as } \, dx \rightarrow 0$$
e.g. normal distribution
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
- $X \in \mathbf{R}$
- $\mu \in \mathbf{R}$
- $\sigma>0$
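Again as a quick check (a minimal sketch), we can evaluate this density by hand and compare it with scipy.stats.norm.pdf:
In [ ]:
from scipy.stats import norm
# Evaluate f(x) for arbitrary mu, sigma and a test point
mu, sigma, xval = 1.0, 2.0, 0.5
by_hand = np.exp(-(xval - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
by_hand, norm.pdf(xval, mu, sigma)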
Example: Nashville Precipitation
The dataset nashville_precip.txt contains NOAA precipitation data for Nashville measured since 1871. The gamma distribution is often a good fit to aggregated rainfall data, and will be our candidate distribution in this case.
In [3]:
precip = pd.read_table("data/nashville_precip.txt", index_col=0, na_values='NA', delim_whitespace=True)
precip.head()
Out[3]:
In [4]:
_ = precip.hist(sharex=True, sharey=True, grid=False)
plt.tight_layout()
The first step is recognizing what sort of distribution to fit our data to. A couple of observations:
- The data are skewed, with a longer tail to the right than to the left
- The data are positive-valued, since they are measuring rainfall
- The data are continuous
The gamma distribution fits all of these characteristics (written here with shape $\alpha$ and scale $\beta$):
$$x \sim \text{Gamma}(\alpha, \beta) = \frac{x^{\alpha-1}e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}$$
The method of moments simply assigns the empirical mean and variance to their theoretical counterparts, so that we can solve for the parameters.
So, for the gamma distribution, the mean and variance are:
$$ \hat{\mu} = \bar{X} = \alpha \beta $$ $$ \hat{\sigma}^2 = S^2 = \alpha \beta^2 $$
So, if we solve for these parameters, we can use a gamma distribution to describe our data:
$$ \alpha = \frac{\bar{X}^2}{S^2}, \, \beta = \frac{S^2}{\bar{X}} $$
Let's deal with the missing value in the October data. Given what we are trying to do, it is most sensible to fill in the missing value with the average of the available values.
In [5]:
precip.fillna(value={'Oct': precip.Oct.mean()}, inplace=True)
Out[5]:
Now, let's calculate the sample moments of interest, the means and variances by month:
In [6]:
precip_mean = precip.mean()
precip_mean
Out[6]:
In [7]:
precip_var = precip.var()
precip_var
Out[7]:
We then use these moments to estimate $\alpha$ and $\beta$ for each month:
In [8]:
alpha_mom = precip_mean ** 2 / precip_var
beta_mom = precip_var / precip_mean
In [9]:
alpha_mom, beta_mom
Out[9]:
We can use the gamma.pdf function in scipy.stats.distributions to plot the distributions implied by the calculated alphas and betas. For example, here is January:
In [10]:
from scipy.stats.distributions import gamma
precip.Jan.hist(normed=True, bins=20)
# beta_mom is a scale parameter, so pass it via scale= (a bare third argument would be taken as loc)
plt.plot(np.linspace(0, 10), gamma.pdf(np.linspace(0, 10), alpha_mom[0], scale=beta_mom[0]))
Out[10]:
Looping over all months, we can create a grid of plots for the distribution of rainfall, using the gamma distribution:
In [11]:
axs = precip.hist(normed=True, figsize=(12, 8), sharex=True, sharey=True, bins=15, grid=False)
for ax in axs.ravel():
    # Get month
    m = ax.get_title()
    # Plot fitted distribution (beta_mom is a scale parameter)
    x = np.linspace(*ax.get_xlim())
    ax.plot(x, gamma.pdf(x, alpha_mom[m], scale=beta_mom[m]))
    # Annotate with parameter estimates
    label = 'alpha = {0:.2f}\nbeta = {1:.2f}'.format(alpha_mom[m], beta_mom[m])
    ax.annotate(label, xy=(10, 0.2))
plt.tight_layout()
Maximum Likelihood
Maximum likelihood (ML) fitting is usually more work than the method of moments, but it is preferred as the resulting estimator is known to have good theoretical properties. There is a ton of theory regarding ML. We will restrict ourselves to the mechanics here.
Say we have some data $y = y_1,y_2,\ldots,y_n$ that is distributed according to some distribution:
$$Pr(Y_i=y_i | \theta)$$
Here, for example, is a Poisson distribution that describes the distribution of some discrete variables, typically counts:
In [12]:
y = np.random.poisson(5, size=100)
plt.hist(y, bins=12, normed=True)
plt.xlabel('y'); plt.ylabel('Pr(y)')
Out[12]:
The product $\prod_{i=1}^n Pr(y_i | \theta)$ gives us a measure of how likely it is to observe values $y_1,\ldots,y_n$ given the parameters $\theta$. Maximum likelihood fitting consists of choosing the appropriate function $l= Pr(Y|\theta)$ to maximize for a given set of observations. We call this function the likelihood function, because it is a measure of how likely the observations are if the model is true.
Given these data, how likely is this model?
In the above model, the data were drawn from a Poisson distribution with parameter $\lambda =5$.
$$L(y|\lambda=5) = \frac{e^{-5} 5^y}{y!}$$
So, for any given value of $y$, we can calculate its likelihood:
In [13]:
# Poisson likelihood of a single observation x given rate lam
# ((np.arange(x)+1).prod() computes x!)
poisson_like = lambda x, lam: np.exp(-lam) * (lam**x) / (np.arange(x)+1).prod()
lam = 6
value = 10
poisson_like(value, lam)
Out[13]:
In [14]:
np.sum(poisson_like(yi, lam) for yi in y)
Out[14]:
In [15]:
lam = 8
np.sum(poisson_like(yi, lam) for yi in y)
Out[15]:
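The likelihood of the whole sample is the product of the individual likelihoods, so in practice we usually work with its logarithm, which turns the product into a sum. As a minimal sketch, here is the full-sample log-likelihood at the same two candidate values of $\lambda$, using scipy.stats.poisson.logpmf:
In [ ]:
from scipy.stats import poisson
# Sum of log-probabilities = log of the product of likelihoods
poisson.logpmf(y, 6).sum(), poisson.logpmf(y, 8).sum()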
We can plot the likelihood function for any value of the parameter(s):
In [16]:
lambdas = np.linspace(0,15)
x = 5
plt.plot(lambdas, [poisson_like(x, l) for l in lambdas])
plt.xlabel('$\lambda$')
plt.ylabel('L($\lambda$|x={0})'.format(x))
Out[16]:
How is the likelihood function different from the probability distribution function (PDF)? The likelihood is a function of the parameter(s) given the data, whereas the PDF returns the probability of data given a particular parameter value. Here is the PDF of the Poisson for $\lambda=5$.
In [17]:
lam = 5
xvals = np.arange(15)
plt.bar(xvals, [poisson_like(x, lam) for x in xvals])
plt.xlabel('x')
plt.ylabel('Pr(X|$\lambda$=5)')
Out[17]:
Why are we interested in the likelihood function?
A reasonable estimate of the true, unknown value for the parameter is one which maximizes the likelihood function. So, inference is reduced to an optimization problem.
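As a minimal sketch of this idea, using the Poisson sample y generated above, we can numerically find the $\lambda$ that maximizes the log-likelihood by minimizing its negative; for the Poisson the answer should be close to the sample mean:
In [ ]:
from scipy.optimize import minimize_scalar
from scipy.stats import poisson
# Negative log-likelihood of the sample as a function of lambda
neg_loglike = lambda lam: -poisson.logpmf(y, lam).sum()
res = minimize_scalar(neg_loglike, bounds=(0.1, 20), method='bounded')
res.x, y.mean()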
Going back to the rainfall data, if we are using a gamma distribution we need to maximize:
$$\begin{align}l(\alpha,\beta) &= \sum_{i=1}^n \log[\beta^{\alpha} x_i^{\alpha-1} e^{-\beta x_i}\Gamma(\alpha)^{-1}] \cr &= n[(\alpha-1)\overline{\log(x)} - \bar{x}\beta + \alpha\log(\beta) - \log\Gamma(\alpha)]\end{align}$$
(It's usually easier to work on the log scale. Note that here $\beta$ denotes the rate parameter, the reciprocal of the scale used in the method-of-moments section above.)
where $n = 2012 - 1871 = 141$ and the bar indicates an average over all $i$. We choose $\alpha$ and $\beta$ to maximize $l(\alpha,\beta)$.
Notice $l$ is infinite if any $x$ is zero. We do not have any zeros, but we do have an NA value for one of the October data, which we dealt with above.
Finding the MLE
To find the maximum of any function, we typically take the derivative with respect to the variable to be maximized, set it to zero and solve for that variable.
$$\frac{\partial l(\alpha,\beta)}{\partial \beta} = n\left(\frac{\alpha}{\beta} - \bar{x}\right) = 0$$
Which can be solved as $\beta = \alpha/\bar{x}$. However, plugging this into the derivative with respect to $\alpha$ yields:
$$\frac{\partial l(\alpha,\beta)}{\partial \alpha} = \log(\alpha) + \overline{\log(x)} - \log(\bar{x}) - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = 0$$
This has no closed form solution. We must use numerical optimization!
Numerical optimization algorithms take an initial "guess" at the solution, and iteratively improve the guess until it gets "close enough" to the answer.
Here, we will use the Newton-Raphson algorithm:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
which is available to us via SciPy:
In [18]:
from scipy.optimize import newton
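Before using SciPy's implementation, here is a minimal hand-rolled sketch of the update rule above, applied to an arbitrary toy problem (finding the root of $f(x) = x^2 - 2$, i.e. $\sqrt{2}$):
In [ ]:
# Toy function and its derivative
f = lambda x: x**2 - 2
fprime = lambda x: 2*x
x_n = 1.0                              # initial guess
for _ in range(6):
    x_n = x_n - f(x_n) / fprime(x_n)   # Newton-Raphson update
x_n, np.sqrt(2)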
Here is a graphical example of how Newton-Raphson converges on a solution, using an arbitrary function:
In [19]:
# some function
func = lambda x: 3./(1 + 400*np.exp(-2*x)) - 1
xvals = np.linspace(0, 6)
plt.plot(xvals, func(xvals))
plt.text(5.3, 2.1, '$f(x)$', fontsize=16)
# zero line
plt.plot([0,6], [0,0], 'k-')
# value at step n
plt.plot([4,4], [0,func(4)], 'k:')
plt.text(4, -.2, '$x_n$', fontsize=16)
# tangent line
tanline = lambda x: -0.858 + 0.626*x
plt.plot(xvals, tanline(xvals), 'r--')
# point at step n+1
xprime = 0.858/0.626
plt.plot([xprime, xprime], [tanline(xprime), func(xprime)], 'k:')
plt.text(xprime+.1, -.2, '$x_{n+1}$', fontsize=16)
Out[19]:
To apply the Newton-Raphson algorithm, we need the first derivative of the log-likelihood with respect to the parameter of interest (the function whose root we want), along with its derivative, the second derivative of the log-likelihood. In our case, these are:
In [20]:
from scipy.special import psi, polygamma
# First derivative of the gamma log-likelihood with respect to alpha (m), with beta = alpha/x-bar substituted in
dlgamma = lambda m, log_mean, mean_log: np.log(m) - psi(m) - log_mean + mean_log
# Its derivative (the second derivative of the log-likelihood)
dl2gamma = lambda m, *args: 1./m - polygamma(1, m)
where log_mean and mean_log are $\log{\bar{x}}$ and $\overline{\log(x)}$, respectively. psi (the digamma function) and polygamma arise as the first and second derivatives of $\log\Gamma(\alpha)$.
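As a quick numerical check (a minimal sketch using central finite differences), psi should match the derivative of $\log\Gamma$, and polygamma(1, ·) its second derivative:
In [ ]:
from scipy.special import gammaln, psi, polygamma
a, h = 2.5, 1e-5
# Finite-difference derivative of log-Gamma vs psi
print((gammaln(a + h) - gammaln(a - h)) / (2 * h), psi(a))
# Finite-difference derivative of psi vs polygamma(1, .)
print((psi(a + h) - psi(a - h)) / (2 * h), polygamma(1, a))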
In [21]:
# Calculate statistics
log_mean = precip.mean().apply(np.log)
mean_log = precip.apply(np.log).mean()
Time to optimize!
In [22]:
# Alpha MLE for December
alpha_mle = newton(dlgamma, 2, dl2gamma, args=(log_mean[-1], mean_log[-1]))
alpha_mle
Out[22]:
And now plug this back into the solution for beta:
$$ \beta = \frac{\alpha}{\bar{X}} $$
In [23]:
beta_mle = alpha_mle/precip.mean()[-1]
beta_mle
Out[23]:
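We only did this for December; as a minimal sketch (reusing the helper functions and the monthly statistics computed above), the same Newton-Raphson step can be applied to every month:
In [ ]:
# MLE of alpha for each month, then beta (rate) = alpha / x-bar
alpha_mle_all = pd.Series(
    [newton(dlgamma, 2, dl2gamma, args=(log_mean[m], mean_log[m])) for m in precip.columns],
    index=precip.columns)
beta_mle_all = alpha_mle_all / precip.mean()
alpha_mle_all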
We can compare the fit of the estimates derived from MLE to those from the method of moments:
In [24]:
dec = precip.Dec
dec.hist(normed=True, bins=10, grid=False)
x = np.linspace(0, dec.max())
# beta_mom is a scale parameter and beta_mle is a rate, so pass scale= explicitly in both cases
plt.plot(x, gamma.pdf(x, alpha_mom[-1], scale=beta_mom[-1]), 'm-')
plt.plot(x, gamma.pdf(x, alpha_mle, scale=1./beta_mle), 'r--')
Out[24]:
For some common distributions, SciPy includes methods for fitting via MLE:
In [25]:
from scipy.stats import gamma
gamma.fit(precip.Dec)
Out[25]:
This fit is not directly comparable to our estimates, however, because SciPy's gamma.fit method fits a three-parameter (shape, location, scale) version of the gamma distribution.
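One way to make the results more comparable (a minimal sketch) is to freeze the location parameter at zero with the floc keyword, which reduces the fit to the two-parameter gamma used above; the returned shape should then be close to alpha_mle, and the reciprocal of the returned scale close to the rate beta_mle:
In [ ]:
# Fix loc=0 so only shape and scale are estimated
shape, loc, scale = gamma.fit(precip.Dec, floc=0)
shape, 1. / scale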
Example: truncated distribution
Suppose that we observe $Y$ truncated below at $a$ (where $a$ is known). If $X$ is the distribution of our observation, then:
$$ P(X \le x) = P(Y \le x|Y \gt a) = \frac{P(a \lt Y \le x)}{P(Y \gt a)}$$
(so, $Y$ is the original variable and $X$ is the truncated variable)
Then X has the density:
$$f_X(x) = \frac{f_Y(x)}{1 - F_Y(a)} \, \text{for} \, x \gt a$$
Suppose $Y \sim N(\mu, \sigma^2)$ and $x_1,\ldots,x_n$ are independent observations of $X$. We can use maximum likelihood to find $\mu$ and $\sigma$.
First, we can simulate a truncated distribution using a while statement to eliminate samples that are outside the support of the truncated distribution.
In [26]:
x = np.random.normal(size=10000)
a = -1
x_small = x < a
while x_small.sum():
    x[x_small] = np.random.normal(size=x_small.sum())
    x_small = x < a
_ = plt.hist(x, bins=100)
We can construct a log likelihood for this function using the conditional form:
$$f_X(x) = \frac{f_Y(x)}{1 - F_Y(a)} \, \text{for} \, x \gt a$$
In [27]:
from scipy.stats.distributions import norm
# Negative log-likelihood of the truncated normal: log f_Y(x) minus the log of the
# normalizing constant 1 - F_Y(a), summed over the sample (theta = [mu, sigma])
trunc_norm = lambda theta, a, x: -(np.log(norm.pdf(x, theta[0], theta[1])) -
                                   np.log(1 - norm.cdf(a, theta[0], theta[1]))).sum()
For this example, we will use another optimization algorithm, the Nelder-Mead simplex algorithm. It has a couple of advantages:
- it does not require derivatives
- it can optimize (minimize) a vector of parameters
In SciPy, it is available as the fmin function:
In [28]:
from scipy.optimize import fmin
fmin(trunc_norm, np.array([1,2]), args=(-1, x))
Out[28]:
In general, simulating data is a terrific way of testing your model before using it with real data.
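For instance, here is a minimal sketch of that workflow, with arbitrarily chosen "true" values: simulate truncated-normal data with known parameters and check that the optimization recovers them.
In [ ]:
# Arbitrary "true" parameters and truncation point for illustration
mu_true, sigma_true, a_true = 2., 3., 0.
sim = np.random.normal(mu_true, sigma_true, size=10000)
too_small = sim < a_true
while too_small.sum():
    sim[too_small] = np.random.normal(mu_true, sigma_true, size=too_small.sum())
    too_small = sim < a_true
# Should come out close to (2, 3)
fmin(trunc_norm, np.array([sim.mean(), sim.std()]), args=(a_true, sim))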
Kernel density estimates
In some instances, we may not be interested in the parameters of a particular distribution of data, but just a smoothed representation of the data at hand. In this case, we can estimate the distribution non-parametrically (i.e. making no assumptions about the form of the underlying distribution) using kernel density estimation.
In [29]:
# Some random data
y = np.random.random(15) * 10
y
Out[29]:
In [30]:
x = np.linspace(0, 10, 100)
# Smoothing parameter
s = 0.4
# Calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1))
plt.plot(y, np.zeros(len(y)), 'ro', ms=10)
Out[30]:
SciPy implements a Gaussian KDE that automatically chooses an appropriate bandwidth. Let's create a bi-modal distribution of data that is not easily summarized by a parametric distribution:
In [31]:
# Create a bi-modal distribution with a mixture of Normals.
x1 = np.random.normal(0, 3, 50)
x2 = np.random.normal(4, 1, 50)
# Append by row
x = np.r_[x1, x2]
In [32]:
plt.hist(x, bins=8, normed=True)
Out[32]:
In [33]:
from scipy.stats import kde
density = kde.gaussian_kde(x)
xgrid = np.linspace(x.min(), x.max(), 100)
plt.hist(x, bins=8, normed=True)
plt.plot(xgrid, density(xgrid), 'r-')
Out[33]:
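gaussian_kde chooses the bandwidth automatically (Scott's rule by default), but the bw_method argument lets us override it. Here is a minimal sketch of how that choice affects the estimate:
In [ ]:
plt.hist(x, bins=8, normed=True, alpha=0.3)
# Smaller bandwidth: bumpier estimate; larger bandwidth: smoother estimate
for bw in (0.1, 0.5, 1.0):
    plt.plot(xgrid, kde.gaussian_kde(x, bw_method=bw)(xgrid), label='bw={0}'.format(bw))
plt.legend()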
Exercise: Cervical dystonia analysis
Recall the cervical dystonia database, which is a clinical trial of botulinum toxin type B (BotB) for patients with cervical dystonia from nine U.S. sites. The response variable is measurements on the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and disability of cervical dystonia (high scores mean more impairment). One way to check the efficacy of the treatment is to compare the distribution of TWSTRS for control and treatment patients at the end of the study.
Use the method of moments or MLE to calculate the mean and variance of TWSTRS at week 16 for one of the treatments and the control group. Assume that the distribution of the twstrs variable is normal:
$$f(x \mid \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \right\}$$
In [34]:
cdystonia = pd.read_csv("data/cdystonia.csv")
cdystonia[cdystonia.obs==6].hist(column='twstrs', by=cdystonia.treat, bins=8)
Out[34]:
In [35]:
# Write your answer here