Statistics

From Biostatistics, by Paulson (2008)

Normal Distribution

  • 68% of the area lies within one standard deviation of the mean
  • 95% lies within two standard deviations
  • 99.7% lies within three standard deviations
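
A quick numerical check of these coverage figures, using the standard normal CDF (a minimal sketch; SciPy is an assumption, not part of the original notes):

<syntaxhighlight lang="python">
# Check the 68-95-99.7 rule with the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations
    print(f"within {k} sd: {coverage:.4f}")
# within 1 sd: 0.6827, within 2 sd: 0.9545, within 3 sd: 0.9973
</syntaxhighlight>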

Mean

  • population mean <math>\mu</math> is the average over the entire population of size N: <math>\mu = \frac{1}{N}\sum_{i=1}^N x_i</math>
  • sample mean <math>\overline{x}</math> is the average over a sample of size n (usually n << N): <math>\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i</math>

Variance and Standard Deviation

  • population variance <math>\sigma^2 = \frac{1}{N}\sum (x_i-\mu)^2</math>
  • sample variance <math>s^2 = \frac{1}{n-1}\sum (x_i-\overline{x})^2 = \frac{\sum x_i^2 -n\overline{x}^2}{n-1} </math>
  • standard deviation is the square root of variance
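
As a sanity check, here is a short sketch (NumPy assumed, data made up) computing the sample variance both from the definition and from the shortcut form above:

<syntaxhighlight lang="python">
import numpy as np

x = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
n = len(x)
xbar = x.mean()

s2_def = ((x - xbar) ** 2).sum() / (n - 1)               # definition
s2_short = (np.sum(x ** 2) - n * xbar ** 2) / (n - 1)    # shortcut form
s = np.sqrt(s2_def)                                      # standard deviation

print(s2_def, s2_short, np.var(x, ddof=1))               # all three agree
</syntaxhighlight>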

Mode, Median

  • The mode of a sample is simply the value that occurs most often
  • The median is the value that has an equal number of values above and below it. If there is an even number of values, average the two middle ones.

Z and Student's t Distribution

  • The z transformation normalizes each sample value: <math>z_i = \frac{x_i-\overline{x}}{s}</math>
  • Student's t distribution approximates the z distribution but compensates for smaller samples by fattening the tails as n gets smaller. It is essentially standard normal for n > 100. Its parameter, called "degrees of freedom", is n - 1 for a single sample.
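
To see how the fattened tails shrink toward the standard normal as n grows, compare two-sided 95% critical values (a sketch; SciPy is an assumption, not part of the notes):

<syntaxhighlight lang="python">
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)                     # about 1.96
for n in (5, 15, 30, 100, 1000):
    print(n, round(t.ppf(0.975, df=n - 1), 3))
# The t critical value approaches the z critical value as n grows.
print(round(z_crit, 3))
</syntaxhighlight>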

Standard Error of the Mean (SEM)

  • <math>s_{\overline{x}} = \frac{s}{\sqrt{n}}</math>
  • standard deviation of the sample: 95% of sample values lie within the interval <math>\overline{x} \pm 2s</math>
  • standard deviation of the mean: the true mean <math>\mu</math> lies within the interval <math>\overline{x} \pm 2s_{\overline{x}}</math> 95% of the time
  • In statistical tests, the means of samples are compared, not the data points themselves

Confidence Intervals

The interval <math>\overline{x} \pm t(\alpha/2, n-1) \frac{s}{\sqrt{n}}</math> contains the mean <math>\mu</math> with a confidence level of <math>1-\alpha</math>. For example, <math>\alpha</math> would be 0.05 for a 95% confidence interval.
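
A minimal sketch of this interval for a small made-up sample (NumPy/SciPy assumed):

<syntaxhighlight lang="python">
# 95% confidence interval for the mean of a small sample.
import numpy as np
from scipy.stats import t

x = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
sem = s / np.sqrt(n)                          # standard error of the mean
alpha = 0.05
half_width = t.ppf(1 - alpha / 2, df=n - 1) * sem
print(f"{xbar:.3f} +/- {half_width:.3f}")
</syntaxhighlight>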

Hypothesis testing

  • Upper-tail test: Does sample A have higher values than sample B? We look at A - B > 0
  • Lower-tail test: Does sample A have lower values than sample B? We look at A - B < 0
  • Two-tail test: Is sample A different from sample B? We look at A - B not equal to zero
  • In all cases, the null hypothesis is that A - B is equivalent to zero within the accuracy of our test.
  • A - B is represented by z or t values, and tails refer to the outlying values of the standard normal or t distribution.
  • The two-tail test is more general, but the upper- and lower-tail tests are more powerful because they are more specific.
  • Rejecting the null hypothesis is significant, but failing to reject it is not.

Two types of errors

  • Type 1 error: alpha is the probability (the "significance level") of rejecting the null hypothesis when you shouldn't. So if <math>\alpha = 0.05</math>, we will falsely claim a significant difference 5 times out of 100.
  • Type 2 error: beta is the probability of sticking with the null hypothesis when you shouldn't. So if <math>\beta = 0.20</math>, we will wrongly overlook a significant difference 20 times out of 100.
  • The power of the statistic is defined as <math>1-\beta</math>

Compare one sample to a known standard value

  • Calculated t value is

<math>t_c = \frac{\overline{x}-c}{s/\sqrt{n}}</math>

where c is the known standard value. Note that <math>t_c</math> is close to zero when the sample mean is close to the standard value.

  • Two-tail test: Null hypothesis is that <math>t(-\alpha/2, n-1) < t_c < t(\alpha/2, n-1)</math>. Rejection means that our mean is significantly different from the standard value.
  • Lower-tail test: Null hypothesis is that <math>t_c > t(-\alpha, n-1)</math>. Rejection means that our mean is significantly less than the standard value.
  • Upper-tail test: Null hypothesis is that <math>t_c < t(\alpha, n-1)</math>. Rejection means that our mean is significantly greater than the standard value.
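
A worked sketch of the one-sample test (NumPy/SciPy assumed; the data and standard value are made up):

<syntaxhighlight lang="python">
# One-sample t test against a known standard value c.
import numpy as np
from scipy.stats import t, ttest_1samp

x = np.array([50.3, 49.8, 50.9, 51.2, 50.1, 50.6])
c = 50.0
n = len(x)
t_c = (x.mean() - c) / (x.std(ddof=1) / np.sqrt(n))

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)       # two-tail critical value
print("reject" if abs(t_c) > t_crit else "fail to reject")

# scipy's built-in version gives the same statistic plus a two-tail p-value
print(ttest_1samp(x, popmean=c))
</syntaxhighlight>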

Determining adequate sample size for one-sample test

  • Rough estimate: <math>n \ge \frac{z_{\alpha/2}^2 s^2}{d^2}</math>, where s is an estimate of the standard deviation and d is the "detection level", the minimum difference from the standard value that needs to be detected
  • Iterative method: <math>n \ge \frac{t^2(\alpha/2, n-1) s^2}{d^2}</math>. Here, n appears on both sides of the equation, so start with a large estimate of n and recalculate until it converges (see the sketch below).
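
A sketch of the iterative method (SciPy assumed; the values of s, d, and alpha are made-up illustration values):

<syntaxhighlight lang="python">
# Iterative sample-size estimate for a one-sample test.
from math import ceil
from scipy.stats import t

s, d, alpha = 2.5, 1.0, 0.05

n = 100                                       # start with a large guess
for _ in range(50):
    n_new = ceil(t.ppf(1 - alpha / 2, df=n - 1) ** 2 * s ** 2 / d ** 2)
    if n_new == n:                            # stop once the estimate converges
        break
    n = n_new
print(n)
</syntaxhighlight>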

Two-sample independent t test

  • Assume that each sample comes from a normal distribution
  • We can make no assumptions about the variances of the two samples
  • Calculated test statistic is <math>t_c = (\overline{x}_A-\overline{x}_B)/\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}</math> with <math>n_A+n_B-2</math> degrees of freedom
  • Sample size determination: <math>n \ge \frac{(s_A^2+s_B^2)(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>
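
A minimal sketch of this statistic (NumPy/SciPy assumed; the data are made up). Note that SciPy's ttest_ind with equal_var=False computes the same statistic but uses a Welch-corrected degrees of freedom rather than n_A + n_B - 2:

<syntaxhighlight lang="python">
# Two-sample independent t statistic as defined above.
import numpy as np
from scipy.stats import ttest_ind

a = np.array([12.1, 13.4, 12.8, 13.0, 12.5])
b = np.array([11.2, 11.9, 12.0, 11.5, 11.7, 11.4])

t_c = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2                      # degrees of freedom used in the notes
print(t_c, df)

print(ttest_ind(a, b, equal_var=False))       # same statistic, Welch degrees of freedom
</syntaxhighlight>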

Two-sample pooled t test

  • Assume that each sample comes from a normal distribution
  • Assume that variances of the two samples are close to equal
  • Pooled standard deviation is <math>s_{pooled} = \sqrt{\frac{(n_A-1)s_A^2+(n_B-1)s_B^2}{n_A+n_B-2}}\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}</math>
  • Calculated test statistic is <math>t_c = \frac{\overline{x}_A - \overline{x}_B}{s_{pooled}}</math>
  • Sample size determination: <math>n \ge \frac{s_{pooled}^2(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>
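
A sketch of the pooled test (NumPy/SciPy assumed; made-up data). SciPy's equal-variance ttest_ind produces the same statistic:

<syntaxhighlight lang="python">
# Pooled (equal-variance) two-sample t test.
import numpy as np
from scipy.stats import ttest_ind

a = np.array([12.1, 13.4, 12.8, 13.0, 12.5])
b = np.array([11.2, 11.9, 12.0, 11.5, 11.7, 11.4])
na, nb = len(a), len(b)

# As defined above, s_pooled already includes the sqrt(1/n_A + 1/n_B) factor.
s_pooled = (np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
            * np.sqrt(1 / na + 1 / nb))
t_c = (a.mean() - b.mean()) / s_pooled
print(t_c)

print(ttest_ind(a, b, equal_var=True))        # same statistic, plus a p-value
</syntaxhighlight>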

Paired t test

  • Assume that each sample comes from a normal distribution
  • Samples come in pairs (e.g. each test subject is matched to an otherwise identical control)
  • We do statistics on <math>d = x_A-x_B</math>
  • Mean (n is the number of pairs): <math>\overline{d} = \frac{1}{n}\sum{(x_A-x_B)} = \frac{1}{n}\sum{d}</math>
  • Standard deviation of pair differences is <math>s_{paired} = \sqrt{\frac{\sum{(d_i-\overline{d})^2}}{n-1}}</math>
  • Test statistic: <math>t_c = \frac{\overline{d}\sqrt{n}}{s_{paired}}</math>
  • Sample size determination: <math>n \ge \frac{s_{paired}^2(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>, where d here is the detection level, not the pair difference
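
A sketch of the paired test (NumPy/SciPy assumed; the before/after measurements are made up):

<syntaxhighlight lang="python">
# Paired t test on pair differences d = x_A - x_B.
import numpy as np
from scipy.stats import ttest_rel

before = np.array([140, 152, 147, 160, 155])
after  = np.array([135, 150, 141, 158, 149])

d = before - after
n = len(d)
t_c = d.mean() * np.sqrt(n) / d.std(ddof=1)
print(t_c)

print(ttest_rel(before, after))               # same statistic, plus a two-tail p-value
</syntaxhighlight>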

Two boolean sample proportion t test

  • If each trial is a boolean "success" or "failure", let P denote the proportion of successes for the sample.
  • The combined proportion of samples A and B is <math>P_c = \frac{n_A P_A+n_B P_B}{n_A+n_B}</math>
  • The pooled standard error of the difference in proportions is <math>s = \sqrt{P_c(1-P_c)\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}</math>
  • The test statistic is <math>t_c = \frac{P_A - P_B}{s}</math>
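
A sketch of the two-proportion comparison using the pooled standard error above (SciPy assumed; the counts are illustrative):

<syntaxhighlight lang="python">
# Two-sample proportion test.
from math import sqrt
from scipy.stats import norm

nA, successesA = 200, 124
nB, successesB = 180, 96
pA, pB = successesA / nA, successesB / nB

pc = (nA * pA + nB * pB) / (nA + nB)          # combined proportion
se = sqrt(pc * (1 - pc) * (1 / nA + 1 / nB))  # pooled standard error of the difference
z_c = (pA - pB) / se

p_two_tail = 2 * (1 - norm.cdf(abs(z_c)))
print(z_c, p_two_tail)
</syntaxhighlight>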

One-factor ANOVA

  • This is analogous to a two-sample independent t-test, but with more than two samples.
  • We assume that the sampled populations are normally distributed with different means but the same variance.
  • Let <math>j = 1 ... m</math> denote the m different populations, and <math>i = 1 ... n</math> denote the n members <math>x_{ij}</math> in each sample.
  • Let <math>\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}</math> denote the mean of the jth sample, and let <math>\overline{\overline{x}} = \frac{1}{nm}\sum_i \sum_j x_{ij} = \frac{1}{m}\sum_{j=1}^m\overline{x}_j</math> denote the mean of all the samples.
  • The mean square treatment (n times the sample variance of the population means) is <math>MST = \frac{n}{m-1}\sum_j(\overline{x}_j - \overline{\overline{x}})^2</math>
  • The mean of the sample variances (also "mean square error") is<math>MSE = \frac{1}{m(n-1)}\sum_i \sum_j(x_{ij} - \overline{x}_j)^2</math>
  • The test statistic is <math>F_c = \frac{MST}{MSE}</math>
  • The null hypothesis is that the populations are not significantly different from each other, so that <math>F_c \approx 1</math>.
  • We reject the null hypothesis at level <math>\alpha</math> if <math>F_c > F_t = F(\alpha, m-1, m(n-1))</math>, where <math>F(\alpha, d_1, d_2)</math> is the critical value of the Fisher F distribution. See http://en.wikipedia.org/wiki/F-test
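
A sketch of the one-factor ANOVA with equal group sizes, following the MST/MSE definitions above (NumPy/SciPy assumed; the three groups are made-up data):

<syntaxhighlight lang="python">
# One-factor ANOVA with equal group sizes.
import numpy as np
from scipy.stats import f, f_oneway

groups = np.array([[6.1, 5.8, 6.4, 6.0],      # each row is one population's sample
                   [6.9, 7.2, 6.8, 7.1],
                   [6.3, 6.5, 6.2, 6.6]])
m, n = groups.shape

group_means = groups.mean(axis=1)
grand_mean = groups.mean()

MST = n * ((group_means - grand_mean) ** 2).sum() / (m - 1)
MSE = ((groups - group_means[:, None]) ** 2).sum() / (m * (n - 1))
F_c = MST / MSE

alpha = 0.05
F_t = f.ppf(1 - alpha, m - 1, m * (n - 1))    # critical value F(alpha, m-1, m(n-1))
print(F_c, F_t, F_c > F_t)

print(f_oneway(*groups))                      # scipy agrees on the F statistic
</syntaxhighlight>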

Contrasts (Tukey method)

  • http://en.wikipedia.org/wiki/Tukey%27s_test
  • Make the same assumptions as in one-factor ANOVA above, but instead of testing everything all at once, we compare each pair of populations independently.
  • For each pair of means, we reject the null hypothesis (that they are the same) if <math>|\overline{x}_i - \overline{x}_j| > q(\alpha, m, m(n-1))\sqrt{\frac{MSE}{n}}</math>, where <math>q(\alpha, d_1, d_2)</math> is the critical value of the studentized range (Tukey-Kramer) distribution.

Confidence Intervals

  • We can also compute a confidence interval for each population mean: <math>\mu_i = \overline{x}_i \pm t(\alpha/2, m(n-1))\sqrt{\frac{MSE}{n}}</math>

Sample size estimate

<math>n \ge \frac{m\cdot MSE(z_{\alpha/2}+z_\beta)^2}{\delta^2}</math>, where the MSE is estimated beforehand and <math>\delta</math> is the desired detection level

Blocked ANOVA

  • This is like the paired t-test for more than two populations.
  • We assume that the sampled populations are normally distributed with different means but the same variance.
  • We further assume that samples come in blocks, with each block containing one member from each population.
  • Let <math>j = 1 ... m</math> denote the m different populations, and <math>i = 1 ... n</math> denote the n blocks, so <math>x_{ij}</math> is the member of block i sampled from population j.
  • In addition to the notation above, let <math>\overline{x}_i = \frac{1}{m}\sum_{j=1}^m x_{ij}</math> denote the mean of the ith block.
  • The mean square treatment (n times the sample variance of the population means) is <math>MST = \frac{n}{m-1}\sum_j(\overline{x}_j - \overline{\overline{x}})^2</math>
  • The mean square block (m times the sample variance of the block means) is <math>MSB = \frac{m}{n-1}\sum_i(\overline{x}_i - \overline{\overline{x}})^2</math>
  • The residual mean square (also "mean square error") is <math>MSE = \frac{1}{(m-1)(n-1)}\sum_i \sum_j(x_{ij} - \overline{x}_i - \overline{x}_j + \overline{\overline{x}})^2</math>
  • There are two null hypotheses and two statistics:
  • Reject the hypothesis that the populations share the same mean if <math>\frac{MST}{MSE} > F(\alpha, m-1, (m-1)(n-1))</math>
  • Reject the hypothesis that the blocks share the same mean if <math>\frac{MSB}{MSE} > F(\alpha, n-1, (m-1)(n-1))</math>
  • If the blocks turn out to be the same, you gain no power by choosing the blocked ANOVA over the standard one-way ANOVA.
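
A sketch of the blocked (randomized block) ANOVA following the definitions above (NumPy/SciPy assumed; made-up data, with rows as blocks and columns as populations):

<syntaxhighlight lang="python">
# Blocked (randomized block) ANOVA.
import numpy as np
from scipy.stats import f

x = np.array([[8.2, 8.9, 9.4],
              [7.6, 8.1, 8.8],
              [8.0, 8.6, 9.1],
              [7.9, 8.4, 9.0]])
n, m = x.shape                                # n blocks, m populations

grand = x.mean()
col_means = x.mean(axis=0)                    # population means xbar_j
row_means = x.mean(axis=1)                    # block means xbar_i

MST = n * ((col_means - grand) ** 2).sum() / (m - 1)
MSB = m * ((row_means - grand) ** 2).sum() / (n - 1)
resid = x - col_means[None, :] - row_means[:, None] + grand
MSE = (resid ** 2).sum() / ((m - 1) * (n - 1))

alpha = 0.05
print(MST / MSE, f.ppf(1 - alpha, m - 1, (m - 1) * (n - 1)))   # treatment test
print(MSB / MSE, f.ppf(1 - alpha, n - 1, (m - 1) * (n - 1)))   # block test
</syntaxhighlight>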

Contrasts (Tukey method)

  • Make the same assumptions as in blocked ANOVA above, but instead of testing everything all at once, we compare each pair of populations independently.
  • For each pair of means, we reject the null hypothesis (that they are the same) if <math>|\overline{x}_i - \overline{x}_j| > q(\alpha, m, (m-1)(n-1))\sqrt{\frac{MSE}{n}}</math>, where <math>q(\alpha, d_1, d_2)</math> is the critical value of the studentized range (Tukey-Kramer) distribution.

Confidence Intervals

  • We can also compute a confidence interval for each population mean: <math>\mu_i = \overline{x}_i \pm t(\alpha/2, (m-1)(n-1))\sqrt{\frac{MSE}{n}}</math>

Sample size estimate

<math>n \ge \frac{m\cdot MSE(z_{\alpha/2}+z_\beta)^2}{\delta^2}</math>, where the MSE is estimated beforehand and <math>\delta</math> is the desired detection level

Fitting data with least-squares regression

  • We have a set of n (x,y) pairs, and want to fit these points to a line <math>\hat{y} = a + bx</math>
  • Estimate the slope with <math>b = \frac{\sum{xy} - n\overline{x} \overline{y}}{\sum{x^2} - n\overline{x}^2}</math>
  • Estimate the y-intercept with <math>a = \overline{y}-b\overline{x}</math>
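
A sketch of these estimates on a small made-up data set (NumPy/SciPy assumed):

<syntaxhighlight lang="python">
# Least-squares slope and intercept from the formulas above.
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()
print(a, b)

print(linregress(x, y))                       # slope/intercept agree, plus r and p-value
</syntaxhighlight>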

Linearizing Data

  • Linearize by increasing or decreasing the power of the x or y values, or both. Linearizing y only is most common.
  • Here is the sequence of transformations to try: <math>..., y^{-3}, y^{-2}, y^{-1}, y^{-1/2}, \log{y}, y^{1/2}, y, y^2, y^3, ...</math>

Interpolating from the Linear Regression

  • Extrapolating beyond the data range is risky.
  • After solving for a and b, you can generate a <math>(1-\alpha)</math>-level confidence interval for the average <math>\hat{y}</math> at a given x: <math>\hat{y} \pm t(\alpha/2, n-2)\sqrt{\frac{\sum(y-\hat{y})^2}{n-2}\left[\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum(x-\overline{x})^2}\right]}</math>

Correlation

  • For a set of n (x,y) pairs, we can talk about how well the samples are correlated.
  • The correlation coefficient is denoted r: <math>r = \frac{\sum{xy}-\frac{1}{n}\sum{x}\sum{y}}{\sqrt{\left[\sum{x^2}-\frac{1}{n}\left(\sum{x}\right)^2\right]\left[\sum{y^2}-\frac{1}{n}\left(\sum{y}\right)^2\right]}}</math>
  • r = 1 means a perfectly linear relationship with positive slope, r = -1 means a perfectly linear relationship with negative slope, and r = 0 means no linear relationship.
  • Coefficient of determination is <math>r^2</math> and has a direct interpretation. If <math>r^2 = 0.9</math>, that means 90% of the data variability can be explained by the regression equation. The rest is random noise.
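
A sketch of r computed from the formula above (NumPy/SciPy assumed; same toy data as the regression example):

<syntaxhighlight lang="python">
# Correlation coefficient r and coefficient of determination r^2.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

num = np.sum(x * y) - np.sum(x) * np.sum(y) / n
den = np.sqrt((np.sum(x**2) - np.sum(x)**2 / n) * (np.sum(y**2) - np.sum(y)**2 / n))
r = num / den
print(r, r**2)

print(pearsonr(x, y))                         # scipy agrees on r
</syntaxhighlight>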

Testing boolean data

  • We have a sample of size n from a population of boolean trials, and we know that proportion p of the trials resulted in success.
  • The sample mean is just p, and for the sake of discussion, assume <math>p < 1-p</math>.
  • If <math>np > 5</math>, we model the sample as binomial and take the sample variance as <math>s^2 = p(1-p)</math>
  • If <math>np \not> 5</math>, we model the sample as Poisson and take the sample variance as <math>s^2 = p</math>

Confidence interval for mean of boolean sample

  • The confidence interval for the sample mean is <math>p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}</math> or <math>p \pm z_{\alpha/2}\sqrt{\frac{p}{n}}</math>
  • If p is anywhere near 0.5, we add the "Yates factor" (a continuity correction of <math>\frac{1}{2n}</math>) to expand the confidence interval: <math>p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} - \frac{1}{2n} < \pi < p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} + \frac{1}{2n}</math>
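
A sketch of the binomial-case interval, with and without the Yates factor (SciPy assumed; the counts are made up):

<syntaxhighlight lang="python">
# Confidence interval for a sample proportion.
from math import sqrt
from scipy.stats import norm

n, successes = 150, 63
p = successes / n
z = norm.ppf(0.975)                           # 95% two-sided critical value

half = z * sqrt(p * (1 - p) / n)
print(p - half, p + half)                                 # plain interval
print(p - half - 1 / (2 * n), p + half + 1 / (2 * n))     # with Yates factor
</syntaxhighlight>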

Comparing the mean of a boolean sample to a standard value

  • To compare the mean p to a standard value c, we use as test statistic <math>z_c = \frac{p-c}{\sqrt{\frac{p(1-p)}{n}}}</math> or <math>z_c = \frac{p-c}{\sqrt{\frac{p}{n}}}</math>. Note that this is close to zero when p is close to c.
  • Two-tail test: Null hypothesis is that <math>-z_{\alpha/2} < z_c < z_{\alpha/2}</math>. Rejection means that our mean is significantly different from the standard value.
  • Lower-tail test: Null hypothesis is that <math>z_c > -z_{\alpha}</math>. Rejection means that our mean is significantly less than the standard value.
  • Upper-tail test: Null hypothesis is that <math>z_c < z_\alpha</math>. Rejection means that our mean is significantly greater than the standard value.
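
A sketch of the two-tail version of this test (SciPy assumed; the counts and standard value are made up):

<syntaxhighlight lang="python">
# Comparing a sample proportion p to a standard value c (binomial case).
from math import sqrt
from scipy.stats import norm

n, successes, c = 200, 88, 0.50
p = successes / n
z_c = (p - c) / sqrt(p * (1 - p) / n)

alpha = 0.05
print("two-tail reject" if abs(z_c) > norm.ppf(1 - alpha / 2) else "fail to reject")
</syntaxhighlight>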