Basic statistics using R
The following page is a quick guide for using R to do most statistics necessary in an introductory statistics class. It should be especially useful for the students in my class! If you notice any errors or find any parts that could use clarifications please let me know and I will try and improve it.
Getting started with R
If you need to install R, see my main R resources page.
When you start R you will see the > prompt. After the prompt you can type commands. A few things that you should be aware of:
-
Variable names are case sensitive,
aandAare different! -
You must always make sure you have matching opening
(and closing)parenthesis and quotations" ". -
R observes the standard order of mathematical operations, so use parenthesis as needed.
-
Be careful to include implicit multiplication signs:
> x(2-y) # This is WRONGWill result in the error because you have forgotten the multiplication sign! You should type:
> x*(2-y) # This is CORRECT -
You can insert a comment in R by typing
#, anything following the pound sign will be ignored.> # This is ignored! -
Function arguments must be comma separated.
-
Function optional arguments don’t have to be specified unless you want to change the default value. You must use the
name=valueformat to change their value.For example, the normal CDF is defined as
pnorm(x, mean=0, sd=1). If you are happy with the values for the mean and sd you don’t have to enter them, you can just typepnorm(x). However, if you want to change them you must use thename=valueformat. So to change the mean to 10 and the sd to 3 you would type:pnorm(x, mean=10, sd=3).
Getting help in R
If need help on how to use a function, or more information about what it does, just type a question mark in-front of the function’s name and press enter. To get help on the mean function type:
> ?mean
A help window will open with more information. Don’t be intimidated by what is displayed. After reading a few of the help screens you will learn what to look for.
Math operations
Most math operators are obvious, a few that may not be as obvious are:
sqrt(x)square root of x.sum(x)sum up values in vector x:.x^nraise x to the power n.
Entering data
To enter data in R, just use the basic form variableName = Data, you store the data in the variable you name on the left hand side of the equal sign. You will make frequent use of the c() function to enter vectors (sets) of data.
Entering a scalar
To store the value 3 in x type (and hit enter):
> x=3
You can call your variables anything you wish, just make sure the names consist only of alpha numeric characters, begin with a letter, and have no spaces.
To see the value stored in a variable, just type it’s name and hit enter:
> x
[1] 3
If you want to store a string (eg your name) in a variable, then you must use parenthesis:
> name = "Anthony"
If you try and enter name=Anthony R will look for a variable called Anthony, if it does not exist you will get an error.
Entering a vector
To store 5 students ages (in years) in the variable ages:
> ages = c(25, 22, 18, 20, 22)
You use the c() function to store multiple values in a variable. The values are comma separated.
Entering a seqence of numbers
Since we frequently need a sequence of numbers such as x={1, 2, 3, 4, 5} R makes this easy:
> x=1:5
> x
[1] 1 2 3 4 5
To get the sequence {3, 4, 5, 6, 7, 8, 9} you would type 3:9
If you need to generate sequences of numbers that don’t increment by 1 take a look at the seq(...) function.
Loading a table (Excel data -> CSV )
If you need to enter more than a few values into a vector, or have an existing table of data, it’s much easier to enter the data using Excel or any other spreadsheet program that can export CSV files. When you load a table into R it generally expects that each column is a vector representing a measurement (such as heights, weights, ages) and each row represents an observation (an individual in a study).
Once you have entered your data into a spreadsheet, you can load it into R using the following steps:
-
Export your data as a CSV file (Comma Separated Values).
In Excel open your spreadsheet and select File –> Save As:, then in the format drop-down field select CSV. Pick a name for the file such as myTable.CSV and press save. You may be warned that some formatting will be lost, that’s ok.
-
To load the CSV file data in R, just type:
my.table=read.csv(file.choose())R will present a file open dialog, just pick the CSV file you created. Your data will now be stored in
my.table.
Working with tables (dataframes)
If you have loaded a bunch of data into R and you are not sure what variables exist, just type ls() and R will list all the defined variables.
> ls()
[1] "my.table" "women" "x" "y"
We can see that 4 variables exist.
Viewing and summarizing
If you have a variable that contains a table of data called women you could just type the variable name to see what stored inside:
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Above we can see that there are 2 columns height and weight and 15 observations. If your table is huge, looking at the raw data may not be particularly useful. To view the first few observations type:
> head(women)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
Finally, a descriptive statistics summary of all columns can be computed by:
> summary(women)
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
Getting a column of data
In general we can get a column of data using: tableName$columnName
-
Determine the column names for your table:
> names(women) [1] "height" "weight"In this case the
womenvariable has two columns:heightandweight. -
To get out a column, use the
tableName$columnNamedollar sign notation:> women$weight [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164In this case, we got the vector of
weightdata from thewomentable. We could easily store the vector in the variablexand find the mean by:> x=women$weight > mean(x) [1] 136.7333
If you are constantly accessing data in a table, you can directly type the names of the columns if you first attach(tableName) all their names to the workspace:
> attach(women)
To find the mean you can now just type:
> mean(weight)
[1] 136.7333
When you are done working with the table, you can remove the direct access to the column names by typing detach(tableName).
Descriptive statistics: Numerical
If we store the ages of 5 students in a variable called ages we can find some descriptive statistics about the ages.
> ages = c(25, 22, 18, 20, 22)
Note: if your data is stored in a different variable — perhaps x — then just substitute your variable’s name for ages below.
Length, min, max, range
The length is the number of elements (n) in a vector
> length(ages)
[1] 5
The min and max functions are obvious:
> min(ages)
[1] 18
> max(ages)
[1] 25
Using the min and max, we can easily find the range:
> range=max(ages)-min(ages)
> range
[1] 7
Mean, median, mode
The mean and median are:
> mean(ages)
[1] 21.4
> median(ages)
[1] 22
To find the mode for a discrete or categorical variable, just make a sorted table, below you can see that the mode is 22:
> sort(table(ages))
ages
18 20 25 22
1 1 1 2
Standard deviation, variance
The sample standard deviation and variance are:
> sd(ages)
[1] 2.607681
> var(ages)
[1] 6.8
Six number summary (quartiles)
> summary(ages)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.0 20.0 22.0 21.4 22.0 25.0
Graphs
For the following examples I will use the built in iris data set in R. This famous (Fisher’s or Anderson’s) data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
First, lets take a look at the first few entries in the table and view a summary:
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
>
Now let’s attach the data set so we can directly type the column names without using the tableName$columnName notation.
> attach(iris)
Optional arguments for most graphing functions
Most plotting functions have optional arguments. Below is a partial list of all the optional argument names. Just specify them at the end of the function arguments as a comma separated list of name=value pairs (see the example below). If the value is a string, make sure it is in quotes.
type- what type of plot should be drawn. Possible types are:
-
- ”p” for points,
- ”l” for lines,
- ”b” for both,
- ”c” for the lines part alone of “b”,
- ”o” for both ‘overplotted’,
- ”h” for ‘histogram’ like (or ‘high-density’) vertical lines,
- ”s” for stair steps,
- ”S” for other steps,
- ”n” for no plotting.
- an overall title for the plot.
sub- a sub title for the plot.
xlab- a title for the x axis.
ylab- a title for the y axis.
asp- the y/x aspect ratio.
xlim- sets the range of the graph, use
xlim=c(min.x, max.x)wheremin.xandmax.xare the min and max values you want. ylim- sets the range of the graph, use
ylim=c(min.y, max.y)wheremin.yandmax.yare the min and max values you want.
Example: to make a scatter plot of the vectors x and y and set the plot type to lines, specify a title, and label the axis x, and set the x limits to 0-10:
> plot(x, y, type="l", main="My plot", xlab="x (meters)", xlim=c(0, 10))
Histogram Like Graphs
-
Stem and leaf plot:
> stem(Sepal.Width) The decimal point is 1 digit(s) to the left of the | 20 | 0 21 | 22 | 000 23 | 0000 24 | 000 25 | 00000000 26 | 00000 27 | 000000000 28 | 00000000000000 29 | 0000000000 30 | 00000000000000000000000000 31 | 00000000000 32 | 0000000000000 33 | 000000 34 | 000000000000 35 | 000000 36 | 0000 37 | 000 38 | 000000 39 | 00 40 | 0 41 | 0 42 | 0 43 | 44 | 0 -
Histogram:
> hist(Sepal.Width)
Pareto chart
R does not have a built in Pareto chart function, but to make one all we need to do is make a table of our categorical data, sort it in descending order, and then make a boxplot. If your categorical data is in the vector x type:
> t=sort(table(Species), descending=T)
> barplot(t)
Density Estimate
You can estimate the population density from a sample using:
> plot(density(Sepal.Width))

Boxpots
Boxplots are great for comparing a groups of data. Let’s compare the sepal widths to the species. The key is that the first variable is an ordered vector of quantitative data Sepal.Width and the second variable is a vector of categorical data Species. We model the relationship as Sepal.Width~Species meaning that the Sepal.Width depends on the type of Species.
> boxplot(Sepal.Width~Species, ylab="sepal width (cm)", main="Iris sepal width by species")

If want to make a simple boxplot without grouping into categories and your quantitative data is in x, just type boxplot(x).
Normal quantile (Q-Q) plot
To make a normal quantile plot:
> qqnorm(Sepal.Width); qqline(Sepal.Width)

Scatter plots, time series plots, plotting functions
Scatter Plots & Time Series Plots
To make a scatter plot of the following 10 x-y data points
x y
1 60 123
2 65 129
3 72 143
4 65 131
5 48 95
6 50 99
7 37 72
8 37 70
9 56 109
10 46 92
First store the data in the vectors x and y maintaining the order and then use the plot function.
> x = c(60, 65, 72, 65, 48, 50, 37, 37, 56, 46)
> y = c(123, 129, 143, 131, 95, 99, 72, 70, 109, 92)
> plot(x, y, xlab="x-axis label", ylab="y-axis lable", main="Plot title")

Time series plots are done the same way, just use plot(t, y, ...). If you want connected points use the optional argument type="b" for both lines and points:
> plot(t, y, type="b")
Plotting Functions
If you want to plot mathematical (or R) functions, use the curve(expression, xmin, xmax) function and input an expression involving x and the min and max values to plot (the domain).
Example: plot the sin(x) function from -2pi to 2pi.
> curve(sin(x), -2*pi, 2*pi)

Example: plot 3rd degree polynomial y=x^3 + 2x^2 - 4x + 3 from -10 to 10
> curve(x^3 + 2*x^2 - 4*x + 3, -10, 10, main="Polynomial")

Example R function: plot standard normal density
> curve(dnorm(x), -3, 3)

Probability
Factorials
To find 8!
> factorial(8)
[1] 40320
Combinations
Combinations of n choose r are done with choose(n,r). To find 10 choose 7:
> choose(10, 7)
[1] 120
Distributions
For discrete variables, probability is height on the distribution.
- Distributions: p=P(x)
- Binomial:
pnorm(x,n,p)
For continuous variables, probability is area so we use the CDFs to find area on the densities.
- CDFs: p=F(x), gives area to the left of x.
- Normal:
pnorm(x, mean=0, sd=1) - Inverse CDFs: x=F^-1(p), finds x with are p to the left.
- Normal:
qnorm(p, mean=0, sd=1)
Binomial distribution
Use dbinom(x,n,p). The binomomial distribution finds the probability of x successes in n trials with p probability of individual success (the arguments must be in that order).
If the probability of guessing a correct answer on a multiple choice question is 0.25 (4 choices), what is the probability of randomly guessing 0 out of 5 questions correct? The parameters are: x=0, n=5, p=0.25. The probability is about 23.7%:
> dbinom(0, 5, 0.25)
[1] 0.2373047
To find the probability that we answer 4 or 5 question correct, we must find P(x=4 OR x=5) = P(x=4) + P(x=5):
> dbinom(4, 5, 0.25) + dbinom(4, 5, 0.25)
[1] 0.02929688
Normal distribution
The normal density is dnorm(x, mean=0, sd=1), the cumulative density function CDF is pnorm(x, mean=0, sd=1) and the inverse CDF is qnorm(p, mean=0, sd=1). By default pnorm returns the area to the left of x and qnorm expects that p refers to the left tail’s area.
If men’s heights are normally distributed with a mean of 69 inches and a standard deviation of 2.8 inches:
Find the probability that a randomly selected man has a height less than 75 inches. Use the CDF:
> pnorm(75, mean=69, sd=2.8)
[1] 0.9839377
Find the height that the top 1.6% of men are taller than. Use the inverse CDF:
> qnorm(1-0.016, mean=69, sd=2.8)
[1] 75.00435
Note that in the above question, we had to use the compliment because the inverse CDF expects that p refers to the left tail area.
Student’s t Distribution
The t density is dt(t, df), the cumulative density function CDF is pt(t, df) and the inverse CDF is qt(p, df). The df parameter is the degrees of freedom for the t distribution. By default pt returns the area to the left of t and qt expects that p refers to the left tail’s area. These functions work just like the normal distribution functions.
Hypothesis Tests
Specifying the alternative hypothesis. Most of the hypothesis test functions listed below have the optional argument alternative to specify your alternative hypothesis:
- if Ha is “not equal to” set
alternative="two.sided"(this is the default so you don’t have to set this) - if Ha is “less than” set
alternative="less" - if Ha is “greater than” set
alternative="greater"
Specifying a confidence interval level Most of the hypothesis test functions will output a confidence interval for the population parameter being estimated. By default the confidence level is 95%. If you want a 99% confidence level then set the optional argument conf.level=0.99.
One sample proportion test
Use: prop.test(x, n, p=0.5, alternative="two.sided", conf.level=0.95)
Where x is the observed number of successes with a sample size n. The null hypothesis is p (default of 0.5).
Example: A study found that 10 out of 100 people were left handed. Test the alternative hypothesis that the true proportion of left handed people is greater than 15%.
> prop.test(10, 100, p=0.15, alternative="greater")
1-sample proportions test with continuity correction
data: 10 out of 100, null probability 0.15
X-squared = 1.5882, df = 1, p-value = 0.8962
alternative hypothesis: true p is greater than 0.15
95 percent confidence interval:
0.05689751 1.00000000
sample estimates:
p
0.1
Two sample proportion test
Use: prop.test(x, n, p=0, alternative="two.sided", conf.level=0.95)
Where x is an ordered vector of the observed number of successes (x=c(x1, x2)) in both samples and n is an ordered vector of the sample sizes (n=c(n1, n2)). The null hypothesis is p refers to the difference p1-p2 in the two population proportions (default is 0).
Example: Test the hypothesis that the proportion of left handed females is greater than the left handed males. A study found that 15 of 80 females were left handed while 8 out of 112 males were left handed.
> x=c(15,8)
> n=c(80, 112) #Note that the order of n is the same as for x. Females listed first!
> prop.test(x, n, alternative='greater')
2-sample test for equality of proportions with continuity correction
data: x out of n
X-squared = 4.9127, df = 1, p-value = 0.01333
alternative hypothesis: greater
95 percent confidence interval:
0.02317208 1.00000000
sample estimates:
prop 1 prop 2
0.18750000 0.07142857
Test of homogeneity (Many sample proportion test)
Todo
One sample mean test
This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).
Use: t.test(x, mu=0, alternative="two.sided", conf.level=0.95)
Where x is a vector of data and mu is the null hypothesis for the population mean (default 0).
Example: A random study of 5 adult’s heights (in inches) were 58, 62, 54, 51, 60. Test the hypothesis that the mean height of adults is 66.3 inches.
> x=c(58, 62, 54, 51, 60)
> t.test(x, mu=66.3, alternative="two.sided")
One Sample t-test
data: x
t = -4.65, df = 4, p-value = 0.009661
alternative hypothesis: true mean is not equal to 66.3
95 percent confidence interval:
51.44711 62.55289
sample estimates:
mean of x
57
Two sample mean test
This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).
Use: t.test(x1, x2, mu=0, alternative="two.sided", conf.level=0.95)
Where x1 and x2 are vectors of data for the two samples and mu is the null hypothesis for the difference in the mean of the two populations mu1-mu2 (default 0).
Example: Test the hypothesis that the mean height of adults males is greater than adult females. A study measured (in inches) 5 female heights as 58, 49, 52, 60, 55 and 7 males as 49, 64, 58, 60, 62, 55, 59.
> x1=c(58, 49, 52, 60, 55)
> x2=c(49, 64, 58, 60, 62, 55, 59)
> t.test(x1, x2, alternative="less")
Welch Two Sample t-test
data: x1 and x2
t = -1.2258, df = 9.344, p-value = 0.1251
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 1.635198
sample estimates:
mean of x mean of y
54.80000 58.14286
Note that since x1 was the females and x2 was the males, our alternative should be less since we believe that the females mean height is less than the males. If we would have put males in first then the alternative would have been greater.
One-way ANOVA (Many sample mean test)
Todo
Test of independence
Todo
Correlation & Regression
Todo