Statistics

From Biostatistics, by Paulson (2008)

Normal Distribution

  • 68% of the area lies within one standard deviation of the mean
  • 95% lies within two standard deviations
  • 99.7% lies within three standard deviations
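
A quick numerical check of these coverage figures, using the standard normal CDF (a minimal sketch; SciPy is an assumption, not part of the original notes):

<syntaxhighlight lang="python">
# Check the 68-95-99.7 rule with the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations
    print(f"within {k} sd: {coverage:.4f}")
# within 1 sd: 0.6827, within 2 sd: 0.9545, within 3 sd: 0.9973
</syntaxhighlight>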

Mean

  • population mean <math>\mu</math> is the average over the entire population of size N: <math>\mu = \frac{1}{N}\sum_{i=1}^N x_i</math>
  • sample mean <math>\overline{x}</math> is the average over a sample of size n (usually n << N): <math>\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i</math>

Variance and Standard Deviation

  • population variance <math>\sigma^2 = \frac{1}{N}\sum (x_i-\mu)^2</math>
  • sample variance <math>s^2 = \frac{1}{n-1}\sum (x_i-\overline{x})^2 = \frac{\sum x_i^2 -n\overline{x}^2}{n-1} </math>
  • standard deviation is the square root of variance
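
As a sanity check, here is a short sketch (NumPy assumed, data made up) computing the sample variance both from the definition and from the shortcut form above:

<syntaxhighlight lang="python">
import numpy as np

x = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
n = len(x)
xbar = x.mean()

s2_def = ((x - xbar) ** 2).sum() / (n - 1)               # definition
s2_short = (np.sum(x ** 2) - n * xbar ** 2) / (n - 1)    # shortcut form
s = np.sqrt(s2_def)                                      # standard deviation

print(s2_def, s2_short, np.var(x, ddof=1))               # all three agree
</syntaxhighlight>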

Mode, Median

  • The mode of a sample is simply the value that occurs most often
  • The median is the value that has an equal number of values above and below it. If there is an even number of values, average the two middle ones.

Z and Student's t Distribution

  • The z transformation normalizes each sample value: <math>z_i = \frac{x_i-\overline{x}}{s}</math>
  • Student's t distribution approximates the z distribution but compensates for smaller samples by fattening the tails as n gets smaller. It is essentially standard normal for n > 100. Its parameter, called "degrees of freedom", is n - 1 for a single sample.
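
To see how the fattened tails shrink toward the standard normal as n grows, compare two-sided 95% critical values (a sketch; SciPy is an assumption, not part of the notes):

<syntaxhighlight lang="python">
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)                     # about 1.96
for n in (5, 15, 30, 100, 1000):
    print(n, round(t.ppf(0.975, df=n - 1), 3))
# The t critical value approaches the z critical value as n grows.
print(round(z_crit, 3))
</syntaxhighlight>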

Standard Error of the Mean (SEM)

  • <math>s_{\overline{x}} = \frac{s}{\sqrt{n}}</math>
  • standard deviation of the sample: 95% of sample values lie within the interval <math>\overline{x} \pm 2s</math>
  • standard deviation of the mean: the true mean <math>\mu</math> lies within the interval <math>\overline{x} \pm 2s_{\overline{x}}</math> 95% of the time
  • In statistical tests, the means of samples are compared, not the data points themselves

Confidence Intervals

The interval <math>\overline{x} \pm t(\alpha/2, n-1) \frac{s}{\sqrt{n}}</math> contains the mean <math>\mu</math> with a confidence level of <math>1-\alpha</math>. For example, <math>\alpha</math> would be 0.05 for a 95% confidence interval.
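
A minimal sketch of this interval for a small made-up sample (NumPy/SciPy assumed):

<syntaxhighlight lang="python">
# 95% confidence interval for the mean of a small sample.
import numpy as np
from scipy.stats import t

x = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.6])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
sem = s / np.sqrt(n)                          # standard error of the mean
alpha = 0.05
half_width = t.ppf(1 - alpha / 2, df=n - 1) * sem
print(f"{xbar:.3f} +/- {half_width:.3f}")
</syntaxhighlight>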

Hypothesis testing

  • Upper-tail test: Does sample A have higher values than sample B? We look at A - B > 0
  • Lower-tail test: Does sample A have lower values than sample B? We look at A - B < 0
  • Two-tail test: Is sample A different from sample B? We look at A - B not equal to zero
  • In all cases, the null hypothesis is that A - B is equivalent to zero within the accuracy of our test.
  • A - B is represented by z or t values, and tails refer to the outlying values of the standard normal or t distribution.
  • The two-tail test is more general, but the upper- and lower-tail tests are more powerful because they are more specific.
  • Rejecting the null hypothesis is significant, but failing to reject it is not.

Two types of errors

  • Type 1 error: alpha is the probability (the "significance level") of rejecting the null hypothesis when you shouldn't. So if <math>\alpha = 0.05</math>, we will falsely claim a significant difference 5 times out of 100.
  • Type 2 error: beta is the probability of sticking with the null hypothesis when you shouldn't. So if <math>\beta = 0.20</math>, we will wrongly overlook a significant difference 20 times out of 100.
  • The power of the statistic is defined as <math>1-\beta</math>

Compare one sample to a known standard value

  • Calculated t value is

<math>t_c = \frac{\overline{x}-c}{s/\sqrt{n}}</math>

where c is the known standard value. Note that <math>t_c</math> is close to zero when the sample mean is close to the standard value.

  • Two-tail test: Null hypothesis is that <math>t(-\alpha/2, n-1) < t_c < t(\alpha/2, n-1)</math>. Rejection means that our mean is significantly different from the standard value.
  • Lower-tail test: Null hypothesis is that <math>t_c > t(-\alpha, n-1)</math>. Rejection means that our mean is significantly less than the standard value.
  • Upper-tail test: Null hypothesis is that <math>t_c < t(\alpha, n-1)</math>. Rejection means that our mean is significantly greater than the standard value.
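
A worked sketch of the one-sample test (NumPy/SciPy assumed; the data and standard value are made up):

<syntaxhighlight lang="python">
# One-sample t test against a known standard value c.
import numpy as np
from scipy.stats import t, ttest_1samp

x = np.array([50.3, 49.8, 50.9, 51.2, 50.1, 50.6])
c = 50.0
n = len(x)
t_c = (x.mean() - c) / (x.std(ddof=1) / np.sqrt(n))

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)       # two-tail critical value
print("reject" if abs(t_c) > t_crit else "fail to reject")

# scipy's built-in version gives the same statistic plus a two-tail p-value
print(ttest_1samp(x, popmean=c))
</syntaxhighlight>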

Determining adequate sample size for one-sample test

  • Rough estimate: <math>n \ge \frac{z_{\alpha/2}^2 s^2}{d^2}</math>, where s is an estimate of the standard deviation and d is the "detection level", the minimum difference from the standard value that needs to be detected
  • Iterative method: <math>n \ge \frac{t^2(\alpha/2, n-1) s^2}{d^2}</math>. Here, n appears on both sides of the equation, so start with a large estimate of n and recalculate until it converges (see the sketch below).
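
A sketch of the iterative method (SciPy assumed; the values of s, d, and alpha are made-up illustration values):

<syntaxhighlight lang="python">
# Iterative sample-size estimate for a one-sample test.
from math import ceil
from scipy.stats import t

s, d, alpha = 2.5, 1.0, 0.05

n = 100                                       # start with a large guess
for _ in range(50):
    n_new = ceil(t.ppf(1 - alpha / 2, df=n - 1) ** 2 * s ** 2 / d ** 2)
    if n_new == n:                            # stop once the estimate converges
        break
    n = n_new
print(n)
</syntaxhighlight>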

Two-sample independent t test

  • Assume that each sample comes from a normal distribution
  • We can make no assumptions about the variances of the two samples
  • Calculated test statistic is <math>t_c = (\overline{x}_A-\overline{x}_B)/\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}</math> with <math>n_A+n_B-2</math> degrees of freedom
  • Sample size determination: <math>n \ge \frac{(s_A^2+s_B^2)(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>
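
A minimal sketch of this statistic (NumPy/SciPy assumed; the data are made up). Note that SciPy's ttest_ind with equal_var=False computes the same statistic but uses a Welch-corrected degrees of freedom rather than n_A + n_B - 2:

<syntaxhighlight lang="python">
# Two-sample independent t statistic as defined above.
import numpy as np
from scipy.stats import ttest_ind

a = np.array([12.1, 13.4, 12.8, 13.0, 12.5])
b = np.array([11.2, 11.9, 12.0, 11.5, 11.7, 11.4])

t_c = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2                      # degrees of freedom used in the notes
print(t_c, df)

print(ttest_ind(a, b, equal_var=False))       # same statistic, Welch degrees of freedom
</syntaxhighlight>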

Two-sample pooled t test

  • Assume that each sample comes from a normal distribution
  • Assume that variances of the two samples are close to equal
  • Pooled standard deviation is <math>s_{pooled} = \sqrt{\frac{(n_A-1)s_A^2+(n_B-1)s_B^2}{n_A+n_B-2}}\sqrt{\frac{1}{n_A}+\frac{1}{n_B}}</math>
  • Calculated test statistic is <math>t_c = \frac{\overline{x}_A - \overline{x}_B}{s_{pooled}}</math>
  • Sample size determination: <math>n \ge \frac{s_{pooled}^2(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>
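
A sketch of the pooled test (NumPy/SciPy assumed; made-up data). SciPy's equal-variance ttest_ind produces the same statistic:

<syntaxhighlight lang="python">
# Pooled (equal-variance) two-sample t test.
import numpy as np
from scipy.stats import ttest_ind

a = np.array([12.1, 13.4, 12.8, 13.0, 12.5])
b = np.array([11.2, 11.9, 12.0, 11.5, 11.7, 11.4])
na, nb = len(a), len(b)

# As defined above, s_pooled already includes the sqrt(1/n_A + 1/n_B) factor.
s_pooled = (np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
            * np.sqrt(1 / na + 1 / nb))
t_c = (a.mean() - b.mean()) / s_pooled
print(t_c)

print(ttest_ind(a, b, equal_var=True))        # same statistic, plus a p-value
</syntaxhighlight>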

Paired t test

  • Assume that each sample comes from a normal distribution
  • Samples come in pairs (e.g. each test subject is matched to an otherwise identical control)
  • We do statistics on <math>d = x_A-x_B</math>
  • Mean (n is the number of pairs): <math>\overline{d} = \frac{1}{n}\sum{(x_A-x_B)} = \frac{1}{n}\sum{d}</math>
  • Standard deviation of pair differences is <math>s_{paired} = \sqrt{\frac{\sum{(d_i-\overline{d})^2}}{n-1}}</math>
  • Test statistic: <math>t_c = \frac{\overline{d}\sqrt{n}}{s_{paired}}</math>
  • Sample size determination: <math>n \ge \frac{s_{paired}^2(z_{\alpha/2}+z_{\beta})^2}{d^2}</math>, where d here is the detection level, not the pair difference
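
A sketch of the paired test (NumPy/SciPy assumed; the before/after measurements are made up):

<syntaxhighlight lang="python">
# Paired t test on pair differences d = x_A - x_B.
import numpy as np
from scipy.stats import ttest_rel

before = np.array([140, 152, 147, 160, 155])
after  = np.array([135, 150, 141, 158, 149])

d = before - after
n = len(d)
t_c = d.mean() * np.sqrt(n) / d.std(ddof=1)
print(t_c)

print(ttest_rel(before, after))               # same statistic, plus a two-tail p-value
</syntaxhighlight>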

Two boolean sample proportion t test

  • If each trial is a boolean "success" or "failure", let P denote the proportion of successes for the sample.
  • The combined proportion of samples A and B is <math>P_c = \frac{n_A P_A+n_B P_B}{n_A+n_B}</math>
  • The pooled standard error of the difference in proportions is <math>s = \sqrt{P_c(1-P_c)\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}</math>
  • The test statistic is <math>t_c = \frac{P_A - P_B}{s}</math>
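
A sketch of the two-proportion comparison using the pooled standard error above (SciPy assumed; the counts are illustrative):

<syntaxhighlight lang="python">
# Two-sample proportion test.
from math import sqrt
from scipy.stats import norm

nA, successesA = 200, 124
nB, successesB = 180, 96
pA, pB = successesA / nA, successesB / nB

pc = (nA * pA + nB * pB) / (nA + nB)          # combined proportion
se = sqrt(pc * (1 - pc) * (1 / nA + 1 / nB))  # pooled standard error of the difference
z_c = (pA - pB) / se

p_two_tail = 2 * (1 - norm.cdf(abs(z_c)))
print(z_c, p_two_tail)
</syntaxhighlight>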

One-factor ANOVA

  • This is analogous to a two-sample independent t-test, but with more than two samples.
  • We assume that the sampled populations are normally distributed with different means but the same variance.
  • Let <math>j = 1 ... m</math> denote the m different populations, and <math>i = 1 ... n</math> denote the n members <math>x_{ij}</math> in each sample.
  • Let <math>\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}</math> denote the mean of the jth sample, and let <math>\overline{\overline{x}} = \frac{1}{nm}\sum_i \sum_j x_{ij} = \frac{1}{m}\sum_{j=1}^m\overline{x}_j</math> denote the mean of all the samples.
  • The mean square treatment (n times the sample variance of the population means) is <math>MST = \frac{n}{m-1}\sum_j(\overline{x}_j - \overline{\overline{x}})^2</math>
  • The mean of the sample variances (also "mean square error") is<math>MSE = \frac{1}{m(n-1)}\sum_i \sum_j(x_{ij} - \overline{x}_j)^2</math>
  • The test statistic is <math>F_c = \frac{MST}{MSE}</math>
  • The null hypothesis is that the populations are not significantly different from each other, so that <math>F_c \approx 1</math>.
  • We reject the null hypothesis at level <math>\alpha</math> if <math>F_c > F_t = F(\alpha, m-1, m(n-1))</math>, where <math>F(\alpha, d_1, d_2)</math> is the critical value of the Fisher F distribution. See http://en.wikipedia.org/wiki/F-test
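
A sketch of the one-factor ANOVA with equal group sizes, following the MST/MSE definitions above (NumPy/SciPy assumed; the three groups are made-up data):

<syntaxhighlight lang="python">
# One-factor ANOVA with equal group sizes.
import numpy as np
from scipy.stats import f, f_oneway

groups = np.array([[6.1, 5.8, 6.4, 6.0],      # each row is one population's sample
                   [6.9, 7.2, 6.8, 7.1],
                   [6.3, 6.5, 6.2, 6.6]])
m, n = groups.shape

group_means = groups.mean(axis=1)
grand_mean = groups.mean()

MST = n * ((group_means - grand_mean) ** 2).sum() / (m - 1)
MSE = ((groups - group_means[:, None]) ** 2).sum() / (m * (n - 1))
F_c = MST / MSE

alpha = 0.05
F_t = f.ppf(1 - alpha, m - 1, m * (n - 1))    # critical value F(alpha, m-1, m(n-1))
print(F_c, F_t, F_c > F_t)

print(f_oneway(*groups))                      # scipy agrees on the F statistic
</syntaxhighlight>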

Contrasts (Tukey method)

  • http://en.wikipedia.org/wiki/Tukey%27s_test
  • Make the same assumptions as in one-factor ANOVA above, but instead of testing everything all at once, we compare each pair of populations independently.
  • For each pair of means, we reject the null hypothesis (that they are the same) if <math>|\overline{x}_i - \overline{x}_j| > q(\alpha, m, m(n-1))\sqrt{\frac{MSE}{n}}</math>, where <math>q(\alpha, d_1, d_2)</math> is the critical value of the studentized range (Tukey-Kramer) distribution.

Confidence Intervals

  • We can also compute a confidence interval for each population mean: <math>\mu_i = \overline{x}_i \pm t(\alpha/2, m(n-1))\sqrt{\frac{MSE}{n}}</math>

Sample size estimate

<math>n \ge \frac{m\cdot MSE(z_{\alpha/2}+z_\beta)^2}{\delta^2}</math>, where the MSE is estimated beforehand and <math>\delta</math> is the desired detection level

Blocked ANOVA

  • This is like the paired t-test for more than two populations.
  • We assume that the sampled populations are normally distributed with different means but the same variance.
  • We further assume that samples come in blocks, with each block containing one member from each population.
  • Let <math>j = 1 ... m</math> denote the m different populations, and <math>i = 1 ... n</math> denote the n blocks, so <math>x_{ij}</math> is the member of block i sampled from population j.
  • In addition to the notation above, let <math>\overline{x}_i = \frac{1}{m}\sum_{j=1}^m x_{ij}</math> denote the mean of the ith block.
  • The mean square treatment (n times the sample variance of the population means) is <math>MST = \frac{n}{m-1}\sum_j(\overline{x}_j - \overline{\overline{x}})^2</math>
  • The mean square block (m times the sample variance of the block means) is <math>MSB = \frac{m}{n-1}\sum_i(\overline{x}_i - \overline{\overline{x}})^2</math>
  • The residual mean square (also "mean square error") is <math>MSE = \frac{1}{(m-1)(n-1)}\sum_i \sum_j(x_{ij} - \overline{x}_i - \overline{x}_j + \overline{\overline{x}})^2</math>
  • There are two null hypotheses and two statistics:
  • Reject the hypothesis that the populations share the same mean if <math>\frac{MST}{MSE} > F(\alpha, m-1, (m-1)(n-1))</math>
  • Reject the hypothesis that the blocks share the same mean if <math>\frac{MSB}{MSE} > F(\alpha, n-1, (m-1)(n-1))</math>
  • If the blocks turn out to be the same, you gain no power by choosing the blocked ANOVA over the standard one-way ANOVA.
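
A sketch of the blocked (randomized block) ANOVA following the definitions above (NumPy/SciPy assumed; made-up data, with rows as blocks and columns as populations):

<syntaxhighlight lang="python">
# Blocked (randomized block) ANOVA.
import numpy as np
from scipy.stats import f

x = np.array([[8.2, 8.9, 9.4],
              [7.6, 8.1, 8.8],
              [8.0, 8.6, 9.1],
              [7.9, 8.4, 9.0]])
n, m = x.shape                                # n blocks, m populations

grand = x.mean()
col_means = x.mean(axis=0)                    # population means xbar_j
row_means = x.mean(axis=1)                    # block means xbar_i

MST = n * ((col_means - grand) ** 2).sum() / (m - 1)
MSB = m * ((row_means - grand) ** 2).sum() / (n - 1)
resid = x - col_means[None, :] - row_means[:, None] + grand
MSE = (resid ** 2).sum() / ((m - 1) * (n - 1))

alpha = 0.05
print(MST / MSE, f.ppf(1 - alpha, m - 1, (m - 1) * (n - 1)))   # treatment test
print(MSB / MSE, f.ppf(1 - alpha, n - 1, (m - 1) * (n - 1)))   # block test
</syntaxhighlight>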

Contrasts (Tukey method)

  • Make the same assumptions as in blocked ANOVA above, but instead of testing everything all at once, we compare each pair of populations independently.
  • For each pair of means, we reject the null hypothesis (that they are the same) if <math>|\overline{x}_i - \overline{x}_j| > q(\alpha, m, (m-1)(n-1))\sqrt{\frac{MSE}{n}}</math>, where <math>q(\alpha, d_1, d_2)</math> is the critical value of the studentized range (Tukey-Kramer) distribution.

Confidence Intervals

  • We can also compute a confidence interval for each population mean: <math>\mu_i = \overline{x}_i \pm t(\alpha/2, (m-1)(n-1))\sqrt{\frac{MSE}{n}}</math>

Sample size estimate

<math>n \ge \frac{m\cdot MSE(z_{\alpha/2}+z_\beta)^2}{\delta^2}</math>, where the MSE is estimated beforehand and <math>\delta</math> is the desired detection level

Fitting data with least-squares regression

  • We have a set of n (x,y) pairs, and want to fit these points to a line <math>\hat{y} = a + bx</math>
  • Estimate the slope with <math>b = \frac{\sum{xy} - n\overline{x} \overline{y}}{\sum{x^2} - n\overline{x}^2}</math>
  • Estimate the y-intercept with <math>a = \overline{y}-b\overline{x}</math>
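
A sketch of these estimates on a small made-up data set (NumPy/SciPy assumed):

<syntaxhighlight lang="python">
# Least-squares slope and intercept from the formulas above.
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()
print(a, b)

print(linregress(x, y))                       # slope/intercept agree, plus r and p-value
</syntaxhighlight>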

Linearizing Data

  • Linearize by increasing or decreasing the power of the x or y values, or both. Linearizing y only is most common.
  • Here is the sequence of transformations to try: <math>..., y^{-3}, y^{-2}, y^{-1}, y^{-1/2}, \log{y}, y^{1/2}, y, y^2, y^3, ...</math>

Interpolating from the Linear Regression

  • Extrapolating beyond the data range is risky.
  • After solving for a and b, you can generate a <math>(1-\alpha)</math>-level confidence interval for the average <math>\hat{y}</math> at a given x: <math>\hat{y} \pm t(\alpha/2, n-2)\sqrt{\frac{\sum(y-\hat{y})^2}{n-2}\left[\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum(x-\overline{x})^2}\right]}</math>

Correlation

  • For a set of n (x,y) pairs, we can talk about how well the samples are correlated.
  • The correlation coefficient is denoted r: <math>r = \frac{\sum{xy}-\frac{1}{n}\sum{x}\sum{y}}{\sqrt{\left[\sum{x^2}-\frac{1}{n}\left(\sum{x}\right)^2\right]\left[\sum{y^2}-\frac{1}{n}\left(\sum{y}\right)^2\right]}}</math>
  • r = 1 means a perfectly linear relationship with positive slope, r = -1 means a perfectly linear relationship with negative slope, and r = 0 means no linear relationship.
  • Coefficient of determination is <math>r^2</math> and has a direct interpretation. If <math>r^2 = 0.9</math>, that means 90% of the data variability can be explained by the regression equation. The rest is random noise.
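
A sketch of r computed from the formula above (NumPy/SciPy assumed; same toy data as the regression example):

<syntaxhighlight lang="python">
# Correlation coefficient r and coefficient of determination r^2.
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

num = np.sum(x * y) - np.sum(x) * np.sum(y) / n
den = np.sqrt((np.sum(x**2) - np.sum(x)**2 / n) * (np.sum(y**2) - np.sum(y)**2 / n))
r = num / den
print(r, r**2)

print(pearsonr(x, y))                         # scipy agrees on r
</syntaxhighlight>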

Testing boolean data

  • We have a sample of size n from a population of boolean trials, and we know that proportion p of the trials resulted in success.
  • The sample mean is just p, and for the sake of discussion, assume <math>p < 1-p</math>.
  • If <math>np > 5</math>, we model the sample as binomial and take the sample variance as <math>s^2 = p(1-p)</math>
  • If <math>np \not> 5</math>, we model the sample as Poisson and take the sample variance as <math>s^2 = p</math>

Confidence interval for mean of boolean sample

  • The confidence interval for the sample mean is <math>p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}</math> or <math>p \pm z_{\alpha/2}\sqrt{\frac{p}{n}}</math>
  • If p is anywhere near 0.5, we add the "Yates factor" (a continuity correction of <math>\frac{1}{2n}</math>) to expand the confidence interval: <math>p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} - \frac{1}{2n} < \pi < p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} + \frac{1}{2n}</math>
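
A sketch of the binomial-case interval, with and without the Yates factor (SciPy assumed; the counts are made up):

<syntaxhighlight lang="python">
# Confidence interval for a sample proportion.
from math import sqrt
from scipy.stats import norm

n, successes = 150, 63
p = successes / n
z = norm.ppf(0.975)                           # 95% two-sided critical value

half = z * sqrt(p * (1 - p) / n)
print(p - half, p + half)                                 # plain interval
print(p - half - 1 / (2 * n), p + half + 1 / (2 * n))     # with Yates factor
</syntaxhighlight>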

Comparing the mean of a boolean sample to a standard value

  • To compare the mean p to a standard value c, we use as test statistic <math>z_c = \frac{p-c}{\sqrt{\frac{p(1-p)}{n}}}</math> or <math>z_c = \frac{p-c}{\sqrt{\frac{p}{n}}}</math>. Note that this is close to zero when p is close to c.
  • Two-tail test: Null hypothesis is that <math>-z_{\alpha/2} < z_c < z_{\alpha/2}</math>. Rejection means that our mean is significantly different from the standard value.
  • Lower-tail test: Null hypothesis is that <math>z_c > -z_{\alpha}</math>. Rejection means that our mean is significantly less than the standard value.
  • Upper-tail test: Null hypothesis is that <math>z_c < z_\alpha</math>. Rejection means that our mean is significantly greater than the standard value.
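
A sketch of the two-tail version of this test (SciPy assumed; the counts and standard value are made up):

<syntaxhighlight lang="python">
# Comparing a sample proportion p to a standard value c (binomial case).
from math import sqrt
from scipy.stats import norm

n, successes, c = 200, 88, 0.50
p = successes / n
z_c = (p - c) / sqrt(p * (1 - p) / n)

alpha = 0.05
print("two-tail reject" if abs(z_c) > norm.ppf(1 - alpha / 2) else "fail to reject")
</syntaxhighlight>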