Location: Anthony Tanbakuchi / Resources / R Resources / Basic statistics using R

Basic statistics using R

The following page is a quick guide for using R to do most statistics necessary in an introductory statistics class. It should be especially useful for the students in my class! If you notice any errors or find any parts that could use clarifications please let me know and I will try and improve it.

Getting started with R

If you need to install R, see my main R resources page.

When you start R you will see the > prompt. After the prompt you can type commands. A few things that you should be aware of:

Getting help in R

If need help on how to use a function, or more information about what it does, just type a question mark in-front of the function’s name and press enter. To get help on the mean function type:

> ?mean

A help window will open with more information. Don’t be intimidated by what is displayed. After reading a few of the help screens you will learn what to look for.

Math operations

Most math operators are obvious, a few that may not be as obvious are:

Entering data

To enter data in R, just use the basic form variableName = Data, you store the data in the variable you name on the left hand side of the equal sign. You will make frequent use of the c() function to enter vectors (sets) of data.

Entering a scalar

To store the value 3 in x type (and hit enter):

> x=3

You can call your variables anything you wish, just make sure the names consist only of alpha numeric characters, begin with a letter, and have no spaces.

To see the value stored in a variable, just type it’s name and hit enter:

> x
[1] 3

If you want to store a string (eg your name) in a variable, then you must use parenthesis:

> name = "Anthony"

If you try and enter name=Anthony R will look for a variable called Anthony, if it does not exist you will get an error.

Entering a vector

To store 5 students ages (in years) in the variable ages:

> ages = c(25, 22, 18, 20, 22)

You use the c() function to store multiple values in a variable. The values are comma separated.

Entering a seqence of numbers

Since we frequently need a sequence of numbers such as x={1, 2, 3, 4, 5} R makes this easy:

> x=1:5
> x
[1] 1 2 3 4 5

To get the sequence {3, 4, 5, 6, 7, 8, 9} you would type 3:9

If you need to generate sequences of numbers that don’t increment by 1 take a look at the seq(...) function.

Loading a table (Excel data -> CSV )

If you need to enter more than a few values into a vector, or have an existing table of data, it’s much easier to enter the data using Excel or any other spreadsheet program that can export CSV files. When you load a table into R it generally expects that each column is a vector representing a measurement (such as heights, weights, ages) and each row represents an observation (an individual in a study).

Once you have entered your data into a spreadsheet, you can load it into R using the following steps:

  1. Export your data as a CSV file (Comma Separated Values).

    In Excel open your spreadsheet and select File –> Save As:, then in the format drop-down field select CSV. Pick a name for the file such as myTable.CSV and press save. You may be warned that some formatting will be lost, that’s ok.

  2. To load the CSV file data in R, just type:

    my.table=read.csv(file.choose())

    R will present a file open dialog, just pick the CSV file you created. Your data will now be stored in my.table.

Working with tables (dataframes)

If you have loaded a bunch of data into R and you are not sure what variables exist, just type ls() and R will list all the defined variables.

> ls()
[1] "my.table" "women"    "x"        "y"

We can see that 4 variables exist.

Viewing and summarizing

If you have a variable that contains a table of data called women you could just type the variable name to see what stored inside:

> women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Above we can see that there are 2 columns height and weight and 15 observations. If your table is huge, looking at the raw data may not be particularly useful. To view the first few observations type:

> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

Finally, a descriptive statistics summary of all columns can be computed by:

> summary(women)
     height         weight     
 Min.   :58.0   Min.   :115.0  
 1st Qu.:61.5   1st Qu.:124.5  
 Median :65.0   Median :135.0  
 Mean   :65.0   Mean   :136.7  
 3rd Qu.:68.5   3rd Qu.:148.0  
 Max.   :72.0   Max.   :164.0

Getting a column of data

In general we can get a column of data using: tableName$columnName

  1. Determine the column names for your table:

    > names(women)
    [1] "height" "weight"

    In this case the women variable has two columns: height and weight.

  2. To get out a column, use the tableName$columnName dollar sign notation:

    > women$weight
     [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

    In this case, we got the vector of weight data from the women table. We could easily store the vector in the variable x and find the mean by:

    > x=women$weight
    > mean(x)
    [1] 136.7333

If you are constantly accessing data in a table, you can directly type the names of the columns if you first attach(tableName) all their names to the workspace:

	> attach(women)

To find the mean you can now just type:

	> mean(weight)
	[1] 136.7333

When you are done working with the table, you can remove the direct access to the column names by typing detach(tableName).

Descriptive statistics: Numerical

If we store the ages of 5 students in a variable called ages we can find some descriptive statistics about the ages.

> ages = c(25, 22, 18, 20, 22)

Note: if your data is stored in a different variable — perhaps x — then just substitute your variable’s name for ages below.

Length, min, max, range

The length is the number of elements (n) in a vector

> length(ages)
[1] 5

The min and max functions are obvious:

> min(ages)
[1] 18
> max(ages)
[1] 25

Using the min and max, we can easily find the range:

> range=max(ages)-min(ages)
> range
[1] 7

Mean, median, mode

The mean and median are:

> mean(ages)
[1] 21.4
> median(ages)
[1] 22

To find the mode for a discrete or categorical variable, just make a sorted table, below you can see that the mode is 22:

>  sort(table(ages))
ages
18 20 25 22 
 1  1  1  2 

Standard deviation, variance

The sample standard deviation and variance are:

> sd(ages)
[1] 2.607681
> var(ages)
[1] 6.8

Six number summary (quartiles)

> summary(ages)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   18.0    20.0    22.0    21.4    22.0    25.0 

Graphs

For the following examples I will use the built in iris data set in R. This famous (Fisher’s or Anderson’s) data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

First, lets take a look at the first few entries in the table and view a summary:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
> 

Now let’s attach the data set so we can directly type the column names without using the tableName$columnName notation.

> attach(iris)

Optional arguments for most graphing functions

Most plotting functions have optional arguments. Below is a partial list of all the optional argument names. Just specify them at the end of the function arguments as a comma separated list of name=value pairs (see the example below). If the value is a string, make sure it is in quotes.

type
what type of plot should be drawn. Possible types are:
  • ”p” for points,
  • ”l” for lines,
  • ”b” for both,
  • ”c” for the lines part alone of “b”,
  • ”o” for both ‘overplotted’,
  • ”h” for ‘histogram’ like (or ‘high-density’) vertical lines,
  • ”s” for stair steps,
  • ”S” for other steps,
  • ”n” for no plotting.
  • an overall title for the plot.
    sub
    a sub title for the plot.
    xlab
    a title for the x axis.
    ylab
    a title for the y axis.
    asp
    the y/x aspect ratio.
    xlim
    sets the range of the graph, use xlim=c(min.x, max.x) where min.x and max.x are the min and max values you want.
    ylim
    sets the range of the graph, use ylim=c(min.y, max.y) where min.y and max.y are the min and max values you want.

    Example: to make a scatter plot of the vectors x and y and set the plot type to lines, specify a title, and label the axis x, and set the x limits to 0-10:

    > plot(x, y, type="l", main="My plot", xlab="x (meters)", xlim=c(0, 10))

    Histogram Like Graphs

    Pareto chart

    R does not have a built in Pareto chart function, but to make one all we need to do is make a table of our categorical data, sort it in descending order, and then make a boxplot. If your categorical data is in the vector x type:

    > t=sort(table(Species), descending=T)
    > barplot(t)

    Density Estimate

    You can estimate the population density from a sample using:

    > plot(density(Sepal.Width))

    Density plot

    Boxpots

    Boxplots are great for comparing a groups of data. Let’s compare the sepal widths to the species. The key is that the first variable is an ordered vector of quantitative data Sepal.Width and the second variable is a vector of categorical data Species. We model the relationship as Sepal.Width~Species meaning that the Sepal.Width depends on the type of Species.

    > boxplot(Sepal.Width~Species, ylab="sepal width (cm)", main="Iris sepal width by species")

    Boxplot plot

    If want to make a simple boxplot without grouping into categories and your quantitative data is in x, just type boxplot(x).

    Normal quantile (Q-Q) plot

    To make a normal quantile plot:

    > qqnorm(Sepal.Width); qqline(Sepal.Width)

    Q-Q plot

    Scatter plots, time series plots, plotting functions

    Scatter Plots & Time Series Plots

    To make a scatter plot of the following 10 x-y data points

        x   y
    1  60 123
    2  65 129
    3  72 143
    4  65 131
    5  48  95
    6  50  99
    7  37  72
    8  37  70
    9  56 109
    10 46  92

    First store the data in the vectors x and y maintaining the order and then use the plot function.

    > x = c(60, 65, 72, 65, 48, 50, 37, 37, 56, 46)
    > y = c(123, 129, 143, 131, 95, 99, 72, 70, 109, 92)
    > plot(x, y, xlab="x-axis label", ylab="y-axis lable", main="Plot title")

    Scatter plot

    Time series plots are done the same way, just use plot(t, y, ...). If you want connected points use the optional argument type="b" for both lines and points:

    > plot(t, y, type="b")

    Plotting Functions

    If you want to plot mathematical (or R) functions, use the curve(expression, xmin, xmax) function and input an expression involving x and the min and max values to plot (the domain).

    Example: plot the sin(x) function from -2pi to 2pi.

    > curve(sin(x), -2*pi, 2*pi)

    Plot of sin(x)

    Example: plot 3rd degree polynomial y=x^3 + 2x^2 - 4x + 3 from -10 to 10

    > curve(x^3 + 2*x^2 - 4*x + 3, -10, 10, main="Polynomial") 

    Plot of polynomial

    Example R function: plot standard normal density

    > curve(dnorm(x), -3, 3)

    Plot of standard normal dist

    Probability

    Factorials

    To find 8!

    > factorial(8)
    [1] 40320

    Combinations

    Combinations of n choose r are done with choose(n,r). To find 10 choose 7:

    > choose(10, 7)
    [1] 120

    Distributions

    For discrete variables, probability is height on the distribution.

    Distributions: p=P(x)
    Binomial: pnorm(x,n,p)

    For continuous variables, probability is area so we use the CDFs to find area on the densities.

    CDFs: p=F(x), gives area to the left of x.
    Normal: pnorm(x, mean=0, sd=1)
    Inverse CDFs: x=F^-1(p), finds x with are p to the left.
    Normal: qnorm(p, mean=0, sd=1)

    Binomial distribution

    Use dbinom(x,n,p). The binomomial distribution finds the probability of x successes in n trials with p probability of individual success (the arguments must be in that order).

    If the probability of guessing a correct answer on a multiple choice question is 0.25 (4 choices), what is the probability of randomly guessing 0 out of 5 questions correct? The parameters are: x=0, n=5, p=0.25. The probability is about 23.7%:

    > dbinom(0, 5, 0.25)
    [1] 0.2373047

    To find the probability that we answer 4 or 5 question correct, we must find P(x=4 OR x=5) = P(x=4) + P(x=5):

    > dbinom(4, 5, 0.25) + dbinom(4, 5, 0.25)
    [1] 0.02929688

    Normal distribution

    The normal density is dnorm(x, mean=0, sd=1), the cumulative density function CDF is pnorm(x, mean=0, sd=1) and the inverse CDF is qnorm(p, mean=0, sd=1). By default pnorm returns the area to the left of x and qnorm expects that p refers to the left tail’s area.

    If men’s heights are normally distributed with a mean of 69 inches and a standard deviation of 2.8 inches:

    Find the probability that a randomly selected man has a height less than 75 inches. Use the CDF:

    > pnorm(75, mean=69, sd=2.8)
    [1] 0.9839377

    Find the height that the top 1.6% of men are taller than. Use the inverse CDF:

    > qnorm(1-0.016, mean=69, sd=2.8)
    [1] 75.00435

    Note that in the above question, we had to use the compliment because the inverse CDF expects that p refers to the left tail area.

    Student’s t Distribution

    The t density is dt(t, df), the cumulative density function CDF is pt(t, df) and the inverse CDF is qt(p, df). The df parameter is the degrees of freedom for the t distribution. By default pt returns the area to the left of t and qt expects that p refers to the left tail’s area. These functions work just like the normal distribution functions.

    Hypothesis Tests

    Specifying the alternative hypothesis. Most of the hypothesis test functions listed below have the optional argument alternative to specify your alternative hypothesis:

    Specifying a confidence interval level Most of the hypothesis test functions will output a confidence interval for the population parameter being estimated. By default the confidence level is 95%. If you want a 99% confidence level then set the optional argument conf.level=0.99.

    One sample proportion test

    Use: prop.test(x, n, p=0.5, alternative="two.sided", conf.level=0.95)

    Where x is the observed number of successes with a sample size n. The null hypothesis is p (default of 0.5).

    Example: A study found that 10 out of 100 people were left handed. Test the alternative hypothesis that the true proportion of left handed people is greater than 15%.

    > prop.test(10, 100, p=0.15, alternative="greater")
    
    	1-sample proportions test with continuity correction
    
    data:  10 out of 100, null probability 0.15 
    X-squared = 1.5882, df = 1, p-value = 0.8962
    alternative hypothesis: true p is greater than 0.15 
    95 percent confidence interval:
     0.05689751 1.00000000 
    sample estimates:
      p 
    0.1

    Two sample proportion test

    Use: prop.test(x, n, p=0, alternative="two.sided", conf.level=0.95)

    Where x is an ordered vector of the observed number of successes (x=c(x1, x2)) in both samples and n is an ordered vector of the sample sizes (n=c(n1, n2)). The null hypothesis is p refers to the difference p1-p2 in the two population proportions (default is 0).

    Example: Test the hypothesis that the proportion of left handed females is greater than the left handed males. A study found that 15 of 80 females were left handed while 8 out of 112 males were left handed.

    > x=c(15,8)
    > n=c(80, 112)  #Note that the order of n is the same as for x. Females listed first!
    > prop.test(x, n, alternative='greater')
    
    	2-sample test for equality of proportions with continuity correction
    
    data:  x out of n 
    X-squared = 4.9127, df = 1, p-value = 0.01333
    alternative hypothesis: greater 
    95 percent confidence interval:
     0.02317208 1.00000000 
    sample estimates:
        prop 1     prop 2 
    0.18750000 0.07142857

    Test of homogeneity (Many sample proportion test)

    Todo

    One sample mean test

    This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).

    Use: t.test(x, mu=0, alternative="two.sided", conf.level=0.95)

    Where x is a vector of data and mu is the null hypothesis for the population mean (default 0).

    Example: A random study of 5 adult’s heights (in inches) were 58, 62, 54, 51, 60. Test the hypothesis that the mean height of adults is 66.3 inches.

    > x=c(58, 62, 54, 51, 60)
    > t.test(x, mu=66.3, alternative="two.sided")
    
    	One Sample t-test
    
    data:  x 
    t = -4.65, df = 4, p-value = 0.009661
    alternative hypothesis: true mean is not equal to 66.3 
    95 percent confidence interval:
     51.44711 62.55289 
    sample estimates:
    mean of x 
           57

    Two sample mean test

    This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).

    Use: t.test(x1, x2, mu=0, alternative="two.sided", conf.level=0.95)

    Where x1 and x2 are vectors of data for the two samples and mu is the null hypothesis for the difference in the mean of the two populations mu1-mu2 (default 0).

    Example: Test the hypothesis that the mean height of adults males is greater than adult females. A study measured (in inches) 5 female heights as 58, 49, 52, 60, 55 and 7 males as 49, 64, 58, 60, 62, 55, 59.

    > x1=c(58, 49, 52, 60, 55)
    > x2=c(49, 64, 58, 60, 62, 55, 59)
    > t.test(x1, x2, alternative="less")
    
    	Welch Two Sample t-test
    
    data:  x1 and x2 
    t = -1.2258, df = 9.344, p-value = 0.1251
    alternative hypothesis: true difference in means is less than 0 
    95 percent confidence interval:
         -Inf 1.635198 
    sample estimates:
    mean of x mean of y 
     54.80000  58.14286

    Note that since x1 was the females and x2 was the males, our alternative should be less since we believe that the females mean height is less than the males. If we would have put males in first then the alternative would have been greater.

    One-way ANOVA (Many sample mean test)

    Todo

    Test of independence

    Todo

    Correlation & Regression

    Todo

    Updated on:
    Thu Sep 04 2008 at 06 PM

    Sub Pages

    Page Contents