R Statistics

Basic stats

mean, min, max, range (range(x) is the same as c(min(x), max(x)))

mean(c(1,2,3,4,5,NA),na.rm=TRUE)   # 3, ignore NA's
mean(c(-1,0:100,2000),trim=0.1)    # 50, drops the lowest and highest 10% of values before averaging

quantile, fivenum, IQR, summary give quantile-related stats
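
A short sketch of these functions on a small vector. The data values are made up for illustration and are not from any real data set.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)   # illustrative data (an assumption)

range(x)      # c(min(x), max(x))
median(x)     # middle of the sorted values
quantile(x)   # 0%, 25%, 50%, 75% and 100% quantiles (type 7 by default)
fivenum(x)    # Tukey's five-number summary, based on hinges
IQR(x)        # 75% quantile minus 25% quantile
summary(x)    # min, quartiles, median, mean and max in one call
```

fivenum uses hinges rather than interpolated quantiles, so its quartile values can differ slightly from those returned by quantile and summary.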

Correlation and covariance

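Pearson correlation measures linear association; significance tests on it (cor.test) assume approximately normally distributed data

Spearman correlation is nonparametric: it is computed on ranks, measures monotonic association, and makes no assumptions about the underlying distribution: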

cor(x, y, method="spearman")   # correlation
cov(x, y, method="pearson")    # covariance
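
To see the difference between the two methods, here is a minimal sketch using made-up data where y increases with x but not linearly. The variable values are assumptions chosen for illustration.

```r
x <- 1:10
y <- x^3    # monotonic but nonlinear relationship (illustrative)

r_pearson  <- cor(x, y, method = "pearson")    # below 1: relationship is not a straight line
r_spearman <- cor(x, y, method = "spearman")   # 1: the ranks of x and y agree perfectly
r_pearson
r_spearman

cov(x, y)   # covariance; method defaults to "pearson"
```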

Principal Components Analysis

http://en.wikipedia.org/wiki/Principal_component_analysis

http://www.youtube.com/watch?v=BfTMmoDFXyE

You have a data set in N dimensions. The first principal component is the linear combination of these dimensions that explains the most variance in the data. The second principal component is orthogonal to the first and explains as much of the remaining variance as possible, and so on. PCA is useful for exploring a large multi-dimensional data set.

princomp computes the eigenvalue decomposition of the data covariance (or correlation) matrix.

prcomp uses a singular value decomposition of the centered data matrix, which is generally preferred for numerical accuracy.
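
A minimal sketch of both functions on the built-in iris data set, which is chosen here only as a convenient example. Scaling the columns is an assumption that suits variables measured on different scales.

```r
X <- iris[, 1:4]   # four numeric columns, so N = 4 dimensions

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # SVD of the centered, scaled data
summary(pca)      # proportion of variance explained by each component
pca$rotation      # loadings: each component as a linear combination of the columns
head(pca$x)       # the data projected onto the components

pc2 <- princomp(X, cor = TRUE)   # eigendecomposition of the correlation matrix
pc2$sdev                         # should match pca$sdev up to rounding
```

The components come out ordered by decreasing standard deviation, so the first few columns of pca$x are usually the ones worth plotting.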

Probability Distributions

dnorm(x, mean = 0, sd = 1, log = FALSE)                         # density function, dnorm(0) = 0.3989423
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)    # distribution function: pnorm(0) = 0.5
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)    # quantile function: qnorm(0.5) = 0 (aka inverse dist fn)
rnorm(n, mean = 0, sd = 1)                                      # generate n random values from norm dist

The same d/p/q/r functions are available for other distributions such as beta, binomial and Cauchy (for example dbeta, pbinom, qcauchy).
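
As an illustration of the naming pattern, a small sketch with the binomial distribution. The parameters (10 fair coin flips) are arbitrary.

```r
d <- dbinom(3, size = 10, prob = 0.5)     # P(X = 3) for 10 fair coin flips
p <- pbinom(3, size = 10, prob = 0.5)     # P(X <= 3)
q <- qbinom(0.5, size = 10, prob = 0.5)   # smallest k with P(X <= k) >= 0.5
r <- rbinom(5, size = 10, prob = 0.5)     # five random draws

pnorm(qnorm(0.975))   # the q and p functions are inverses of each other
```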

Compare data set to a distribution function

shapiro.test(x)   # Shapiro-Wilk test for normality, small p-value is evidence AGAINST normality
ks.test(x, dist)  # Kolmogorov-Smirnov test of whether x came from dist (a CDF such as "pnorm"), small p-value means poor match
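
A sketch of how the p-values behave, using simulated samples. The seed, sample sizes and distribution parameters are arbitrary assumptions.

```r
set.seed(1)              # arbitrary seed so the run is repeatable
x_norm <- rnorm(100)     # sample from a normal distribution
x_exp  <- rexp(100)      # sample from a skewed, clearly non-normal distribution

p_norm <- shapiro.test(x_norm)$p.value   # expected to be large: no evidence against normality
p_exp  <- shapiro.test(x_exp)$p.value    # expected to be very small: normality rejected
p_norm
p_exp

ks.test(x_norm, "pnorm", mean = 0, sd = 1)   # compare with a standard normal
ks.test(x_exp, "pexp", rate = 1)             # compare with an exponential, rate 1
```

In both tests the null hypothesis is that the data come from the stated distribution, so a large p-value means only that the test found no evidence of a mismatch.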