Basic statistics using R
The following page is a quick guide for using R to do most statistics necessary in an introductory statistics class. It should be especially useful for the students in my class! If you notice any errors or find any parts that could use clarifications please let me know and I will try and improve it.
Getting started with R
If you need to install R, see my main R resources page.
When you start R you will see the >
prompt. After the prompt you can type commands. A few things that you should be aware of:

Variable names are case sensitive,
a
andA
are different! 
You must always make sure you have matching opening
(
and closing)
parenthesis and quotations" "
. 
R observes the standard order of mathematical operations, so use parenthesis as needed.

Be careful to include implicit multiplication signs:
> x(2y) # This is WRONG
Will result in the error because you have forgotten the multiplication sign! You should type:
> x*(2y) # This is CORRECT

You can insert a comment in R by typing
#
, anything following the pound sign will be ignored.> # This is ignored!

Function arguments must be comma separated.

Function optional arguments don’t have to be specified unless you want to change the default value. You must use the
name=value
format to change their value.For example, the normal CDF is defined as
pnorm(x, mean=0, sd=1)
. If you are happy with the values for the mean and sd you don’t have to enter them, you can just typepnorm(x)
. However, if you want to change them you must use thename=value
format. So to change the mean to 10 and the sd to 3 you would type:pnorm(x, mean=10, sd=3)
.
Getting help in R
If need help on how to use a function, or more information about what it does, just type a question mark infront of the function’s name and press enter. To get help on the mean
function type:
> ?mean
A help window will open with more information. Don’t be intimidated by what is displayed. After reading a few of the help screens you will learn what to look for.
Math operations
Most math operators are obvious, a few that may not be as obvious are:
sqrt(x)
square root of x.sum(x)
sum up values in vector x:.x^n
raise x to the power n.
Entering data
To enter data in R, just use the basic form variableName = Data
, you store the data in the variable you name on the left hand side of the equal sign. You will make frequent use of the c()
function to enter vectors (sets) of data.
Entering a scalar
To store the value 3 in x
type (and hit enter):
> x=3
You can call your variables anything you wish, just make sure the names consist only of alpha numeric characters, begin with a letter, and have no spaces.
To see the value stored in a variable, just type it’s name and hit enter:
> x
[1] 3
If you want to store a string (eg your name) in a variable, then you must use parenthesis:
> name = "Anthony"
If you try and enter name=Anthony
R will look for a variable called Anthony
, if it does not exist you will get an error.
Entering a vector
To store 5 students ages (in years) in the variable ages
:
> ages = c(25, 22, 18, 20, 22)
You use the c()
function to store multiple values in a variable. The values are comma separated.
Entering a seqence of numbers
Since we frequently need a sequence of numbers such as x={1, 2, 3, 4, 5} R makes this easy:
> x=1:5
> x
[1] 1 2 3 4 5
To get the sequence {3, 4, 5, 6, 7, 8, 9} you would type 3:9
If you need to generate sequences of numbers that don’t increment by 1 take a look at the seq(...)
function.
Loading a table (Excel data > CSV )
If you need to enter more than a few values into a vector, or have an existing table of data, it’s much easier to enter the data using Excel or any other spreadsheet program that can export CSV files. When you load a table into R it generally expects that each column is a vector representing a measurement (such as heights, weights, ages) and each row represents an observation (an individual in a study).
Once you have entered your data into a spreadsheet, you can load it into R using the following steps:

Export your data as a CSV file (Comma Separated Values).
In Excel open your spreadsheet and select File –> Save As:, then in the format dropdown field select CSV. Pick a name for the file such as myTable.CSV and press save. You may be warned that some formatting will be lost, that’s ok.

To load the CSV file data in R, just type:
my.table=read.csv(file.choose())
R will present a file open dialog, just pick the CSV file you created. Your data will now be stored in
my.table
.
Working with tables (dataframes)
If you have loaded a bunch of data into R and you are not sure what variables exist, just type ls()
and R will list all the defined variables.
> ls()
[1] "my.table" "women" "x" "y"
We can see that 4 variables exist.
Viewing and summarizing
If you have a variable that contains a table of data called women
you could just type the variable name to see what stored inside:
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Above we can see that there are 2 columns height
and weight
and 15 observations. If your table is huge, looking at the raw data may not be particularly useful. To view the first few observations type:
> head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
Finally, a descriptive statistics summary of all columns can be computed by:
> summary(women)
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
Getting a column of data
In general we can get a column of data using: tableName$columnName

Determine the column names for your table:
> names(women) [1] "height" "weight"
In this case the
women
variable has two columns:height
andweight
. 
To get out a column, use the
tableName$columnName
dollar sign notation:> women$weight [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
In this case, we got the vector of
weight
data from thewomen
table. We could easily store the vector in the variablex
and find the mean by:> x=women$weight > mean(x) [1] 136.7333
If you are constantly accessing data in a table, you can directly type the names of the columns if you first attach(tableName)
all their names to the workspace:
> attach(women)
To find the mean you can now just type:
> mean(weight)
[1] 136.7333
When you are done working with the table, you can remove the direct access to the column names by typing detach(tableName)
.
Descriptive statistics: Numerical
If we store the ages of 5 students in a variable called ages
we can find some descriptive statistics about the ages.
> ages = c(25, 22, 18, 20, 22)
Note: if your data is stored in a different variable — perhaps x
— then just substitute your variable’s name for ages
below.
Length, min, max, range
The length is the number of elements (n) in a vector
> length(ages)
[1] 5
The min and max functions are obvious:
> min(ages)
[1] 18
> max(ages)
[1] 25
Using the min and max, we can easily find the range:
> range=max(ages)min(ages)
> range
[1] 7
Mean, median, mode
The mean and median are:
> mean(ages)
[1] 21.4
> median(ages)
[1] 22
To find the mode for a discrete or categorical variable, just make a sorted table, below you can see that the mode is 22:
> sort(table(ages))
ages
18 20 25 22
1 1 1 2
Standard deviation, variance
The sample standard deviation and variance are:
> sd(ages)
[1] 2.607681
> var(ages)
[1] 6.8
Six number summary (quartiles)
> summary(ages)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.0 20.0 22.0 21.4 22.0 25.0
Graphs
For the following examples I will use the built in iris
data set in R. This famous (Fisher’s or Anderson’s) data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
First, lets take a look at the first few entries in the table and view a summary:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
>
Now let’s attach
the data set so we can directly type the column names without using the tableName$columnName
notation.
> attach(iris)
Optional arguments for most graphing functions
Most plotting functions have optional arguments. Below is a partial list of all the optional argument names. Just specify them at the end of the function arguments as a comma separated list of name=value
pairs (see the example below). If the value is a string, make sure it is in quotes.
type
 what type of plot should be drawn. Possible types are:

 ”p” for points,
 ”l” for lines,
 ”b” for both,
 ”c” for the lines part alone of “b”,
 ”o” for both ‘overplotted’,
 ”h” for ‘histogram’ like (or ‘highdensity’) vertical lines,
 ”s” for stair steps,
 ”S” for other steps,
 ”n” for no plotting.
 an overall title for the plot.
sub
 a sub title for the plot.
xlab
 a title for the x axis.
ylab
 a title for the y axis.
asp
 the y/x aspect ratio.
xlim
 sets the range of the graph, use
xlim=c(min.x, max.x)
wheremin.x
andmax.x
are the min and max values you want. ylim
 sets the range of the graph, use
ylim=c(min.y, max.y)
wheremin.y
andmax.y
are the min and max values you want.
Example: to make a scatter plot of the vectors x
and y
and set the plot type to lines, specify a title, and label the axis x, and set the x limits to 010:
> plot(x, y, type="l", main="My plot", xlab="x (meters)", xlim=c(0, 10))
Histogram Like Graphs

Stem and leaf plot:
> stem(Sepal.Width) The decimal point is 1 digit(s) to the left of the  20  0 21  22  000 23  0000 24  000 25  00000000 26  00000 27  000000000 28  00000000000000 29  0000000000 30  00000000000000000000000000 31  00000000000 32  0000000000000 33  000000 34  000000000000 35  000000 36  0000 37  000 38  000000 39  00 40  0 41  0 42  0 43  44  0

Histogram:
> hist(Sepal.Width)
Pareto chart
R does not have a built in Pareto chart function, but to make one all we need to do is make a table of our categorical data, sort it in descending order, and then make a boxplot. If your categorical data is in the vector x
type:
> t=sort(table(Species), descending=T)
> barplot(t)
Density Estimate
You can estimate the population density from a sample using:
> plot(density(Sepal.Width))
Boxpots
Boxplots are great for comparing a groups of data. Let’s compare the sepal widths to the species. The key is that the first variable is an ordered vector of quantitative data Sepal.Width
and the second variable is a vector of categorical data Species
. We model the relationship as Sepal.Width~Species
meaning that the Sepal.Width
depends on the type of Species
.
> boxplot(Sepal.Width~Species, ylab="sepal width (cm)", main="Iris sepal width by species")
If want to make a simple boxplot without grouping into categories and your quantitative data is in x
, just type boxplot(x)
.
Normal quantile (QQ) plot
To make a normal quantile plot:
> qqnorm(Sepal.Width); qqline(Sepal.Width)
Scatter plots, time series plots, plotting functions
Scatter Plots & Time Series Plots
To make a scatter plot of the following 10 xy data points
x y
1 60 123
2 65 129
3 72 143
4 65 131
5 48 95
6 50 99
7 37 72
8 37 70
9 56 109
10 46 92
First store the data in the vectors x
and y
maintaining the order and then use the plot
function.
> x = c(60, 65, 72, 65, 48, 50, 37, 37, 56, 46)
> y = c(123, 129, 143, 131, 95, 99, 72, 70, 109, 92)
> plot(x, y, xlab="xaxis label", ylab="yaxis lable", main="Plot title")
Time series plots are done the same way, just use plot(t, y, ...)
. If you want connected points use the optional argument type="b"
for both lines and points:
> plot(t, y, type="b")
Plotting Functions
If you want to plot mathematical (or R) functions, use the curve(expression, xmin, xmax)
function and input an expression involving x
and the min and max values to plot (the domain).
Example: plot the sin(x) function from 2pi to 2pi.
> curve(sin(x), 2*pi, 2*pi)
Example: plot 3rd degree polynomial y=x^3 + 2x^2  4x + 3 from 10 to 10
> curve(x^3 + 2*x^2  4*x + 3, 10, 10, main="Polynomial")
Example R function: plot standard normal density
> curve(dnorm(x), 3, 3)
Probability
Factorials
To find 8!
> factorial(8)
[1] 40320
Combinations
Combinations of n choose r are done with choose(n,r)
. To find 10 choose 7:
> choose(10, 7)
[1] 120
Distributions
For discrete variables, probability is height on the distribution.
 Distributions: p=P(x)
 Binomial:
pnorm(x,n,p)
For continuous variables, probability is area so we use the CDFs to find area on the densities.
 CDFs: p=F(x), gives area to the left of x.
 Normal:
pnorm(x, mean=0, sd=1)
 Inverse CDFs: x=F^1(p), finds x with are p to the left.
 Normal:
qnorm(p, mean=0, sd=1)
Binomial distribution
Use dbinom(x,n,p)
. The binomomial distribution finds the probability of x
successes in n
trials with p
probability of individual success (the arguments must be in that order).
If the probability of guessing a correct answer on a multiple choice question is 0.25 (4 choices), what is the probability of randomly guessing 0 out of 5 questions correct? The parameters are: x=0, n=5, p=0.25. The probability is about 23.7%:
> dbinom(0, 5, 0.25)
[1] 0.2373047
To find the probability that we answer 4 or 5 question correct, we must find P(x=4 OR x=5) = P(x=4) + P(x=5):
> dbinom(4, 5, 0.25) + dbinom(4, 5, 0.25)
[1] 0.02929688
Normal distribution
The normal density is dnorm(x, mean=0, sd=1)
, the cumulative density function CDF is pnorm(x, mean=0, sd=1)
and the inverse CDF is qnorm(p, mean=0, sd=1)
. By default pnorm
returns the area to the left of x
and qnorm
expects that p
refers to the left tail’s area.
If men’s heights are normally distributed with a mean of 69 inches and a standard deviation of 2.8 inches:
Find the probability that a randomly selected man has a height less than 75 inches. Use the CDF:
> pnorm(75, mean=69, sd=2.8)
[1] 0.9839377
Find the height that the top 1.6% of men are taller than. Use the inverse CDF:
> qnorm(10.016, mean=69, sd=2.8)
[1] 75.00435
Note that in the above question, we had to use the compliment because the inverse CDF expects that p
refers to the left tail area.
Student’s t Distribution
The t density is dt(t, df)
, the cumulative density function CDF is pt(t, df)
and the inverse CDF is qt(p, df)
. The df
parameter is the degrees of freedom for the t distribution. By default pt
returns the area to the left of t
and qt
expects that p
refers to the left tail’s area. These functions work just like the normal distribution functions.
Hypothesis Tests
Specifying the alternative hypothesis. Most of the hypothesis test functions listed below have the optional argument alternative
to specify your alternative hypothesis:
 if Ha is “not equal to” set
alternative="two.sided"
(this is the default so you don’t have to set this)  if Ha is “less than” set
alternative="less"
 if Ha is “greater than” set
alternative="greater"
Specifying a confidence interval level Most of the hypothesis test functions will output a confidence interval for the population parameter being estimated. By default the confidence level is 95%. If you want a 99% confidence level then set the optional argument conf.level=0.99
.
One sample proportion test
Use: prop.test(x, n, p=0.5, alternative="two.sided", conf.level=0.95)
Where x
is the observed number of successes with a sample size n
. The null hypothesis is p
(default of 0.5).
Example: A study found that 10 out of 100 people were left handed. Test the alternative hypothesis that the true proportion of left handed people is greater than 15%.
> prop.test(10, 100, p=0.15, alternative="greater")
1sample proportions test with continuity correction
data: 10 out of 100, null probability 0.15
Xsquared = 1.5882, df = 1, pvalue = 0.8962
alternative hypothesis: true p is greater than 0.15
95 percent confidence interval:
0.05689751 1.00000000
sample estimates:
p
0.1
Two sample proportion test
Use: prop.test(x, n, p=0, alternative="two.sided", conf.level=0.95)
Where x
is an ordered vector of the observed number of successes (x=c(x1, x2)
) in both samples and n
is an ordered vector of the sample sizes (n=c(n1, n2)
). The null hypothesis is p
refers to the difference p1p2
in the two population proportions (default is 0).
Example: Test the hypothesis that the proportion of left handed females is greater than the left handed males. A study found that 15 of 80 females were left handed while 8 out of 112 males were left handed.
> x=c(15,8)
> n=c(80, 112) #Note that the order of n is the same as for x. Females listed first!
> prop.test(x, n, alternative='greater')
2sample test for equality of proportions with continuity correction
data: x out of n
Xsquared = 4.9127, df = 1, pvalue = 0.01333
alternative hypothesis: greater
95 percent confidence interval:
0.02317208 1.00000000
sample estimates:
prop 1 prop 2
0.18750000 0.07142857
Test of homogeneity (Many sample proportion test)
Todo
One sample mean test
This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).
Use: t.test(x, mu=0, alternative="two.sided", conf.level=0.95)
Where x
is a vector of data and mu
is the null hypothesis for the population mean (default 0).
Example: A random study of 5 adult’s heights (in inches) were 58, 62, 54, 51, 60. Test the hypothesis that the mean height of adults is 66.3 inches.
> x=c(58, 62, 54, 51, 60)
> t.test(x, mu=66.3, alternative="two.sided")
One Sample ttest
data: x
t = 4.65, df = 4, pvalue = 0.009661
alternative hypothesis: true mean is not equal to 66.3
95 percent confidence interval:
51.44711 62.55289
sample estimates:
mean of x
57
Two sample mean test
This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).
Use: t.test(x1, x2, mu=0, alternative="two.sided", conf.level=0.95)
Where x1
and x2
are vectors of data for the two samples and mu
is the null hypothesis for the difference in the mean of the two populations mu1mu2
(default 0).
Example: Test the hypothesis that the mean height of adults males is greater than adult females. A study measured (in inches) 5 female heights as 58, 49, 52, 60, 55 and 7 males as 49, 64, 58, 60, 62, 55, 59.
> x1=c(58, 49, 52, 60, 55)
> x2=c(49, 64, 58, 60, 62, 55, 59)
> t.test(x1, x2, alternative="less")
Welch Two Sample ttest
data: x1 and x2
t = 1.2258, df = 9.344, pvalue = 0.1251
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
Inf 1.635198
sample estimates:
mean of x mean of y
54.80000 58.14286
Note that since x1
was the females and x2
was the males, our alternative should be less
since we believe that the females mean height is less than the males. If we would have put males in first then the alternative would have been greater
.
Oneway ANOVA (Many sample mean test)
Todo
Test of independence
Todo
Correlation & Regression
Todo