R Statistics
Basic stats
mean, min, max, range (range returns c(min, max))
mean(c(1,2,3,4,5,NA), na.rm=TRUE) # 3, ignore NAs
mean(c(-1,0:100,2000), trim=0.1)  # 50, trim 10% of values from each end (drops the outliers)
quantile, fivenum, IQR, summary
give quantile-related stats
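For example, on a made-up sample vector:
x <- c(2, 4, 4, 5, 7, 9, 12)
quantile(x)             # 0%, 25%, 50%, 75%, 100% percentiles
quantile(x, probs=0.9)  # any specific quantile
fivenum(x)              # Tukey's five numbers: min, lower hinge, median, upper hinge, max
IQR(x)                  # interquartile range = 75th percentile - 25th percentile
summary(x)              # min, 1st quartile, median, mean, 3rd quartile, max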
Correlation and covariance
Pearson correlation measures linear association and assumes approximately normally distributed data.
Spearman correlation is nonparametric and doesn't make assumptions about the underlying distribution:
cor(x, y, method="spearman") # correlation
cov(x, y, method="pearson")  # covariance
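A quick sketch with made-up data, showing how the two methods can disagree when the relationship is monotone but not linear:
x <- 1:20
y <- x^3 + rnorm(20, sd=50)        # monotone but nonlinear in x
cor(x, y, method="pearson")        # somewhat below 1: measures linear fit
cor(x, y, method="spearman")       # close to 1: only the ranks matter
cor.test(x, y, method="spearman")  # same idea, plus a p-value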
Principal Components Analysis
http://en.wikipedia.org/wiki/Principal_component_analysis
http://www.youtube.com/watch?v=BfTMmoDFXyE
You have a data set in N dimensions. The first principal component is a linear combination of these dimensions that best explains the variance in the data. The second principal component is orthogonal to the first and best explains the remaining variance, and so on. It is useful for exploring a large multi-dimensional data set.
princomp
computes the eigenvalue decomposition of the data covariance matrix.
prcomp
uses singular value decomposition of the data matrix, which gives better numerical accuracy.
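A minimal sketch using the built-in USArrests data set (scale. = TRUE standardizes the variables so none dominates just because of its units):
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)    # proportion of variance explained by each component
pca$rotation    # loadings: how each original variable contributes to each component
head(pca$x)     # the data projected onto the principal components
biplot(pca)     # observations and loadings on one plot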
Probability Distributions
dnorm(x, mean = 0, sd = 1, log = FALSE) # density function, dnorm(0) = 0.3989423
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # distribution function: pnorm(0) = 0.5
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) # quantile function: qnorm(0.5) = 0 (aka inverse dist fn)
rnorm(n, mean = 0, sd = 1) # generate n random values from norm dist
The same functions are available for other distributions (beta, binomial, Cauchy, etc.), with the same d/p/q/r prefixes.
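For example, the binomial equivalents (10 trials, success probability 0.5):
dbinom(3, size=10, prob=0.5)    # P(X = 3), about 0.117
pbinom(3, size=10, prob=0.5)    # P(X <= 3)
qbinom(0.5, size=10, prob=0.5)  # median number of successes
rbinom(5, size=10, prob=0.5)    # 5 random draws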
Compare data set to a distribution function
shapiro.test(x) # Shapiro-Wilk test for normality; a small p-value means the data are probably NOT normal
ks.test(x, dist) # Kolmogorov-Smirnov test of whether x came from the distribution dist (a CDF name such as "pnorm", or a CDF function); a small p-value suggests it did not
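A quick sketch with simulated data (exact p-values depend on the random draw):
x <- rnorm(100)                      # data that really is normal
shapiro.test(x)                      # expect a large p-value: no evidence against normality
ks.test(x, "pnorm")                  # compare x to the standard normal CDF
ks.test(x, "pnorm", mean(x), sd(x))  # compare to a normal fitted to x (approximate, since the parameters were estimated from x)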