Location: Anthony Tanbakuchi / Resources / R Resources / Basic statistics using R

# Basic statistics using R

The following page is a quick guide for using R to do most statistics necessary in an introductory statistics class. It should be especially useful for the students in my class! If you notice any errors or find any parts that could use clarifications please let me know and I will try and improve it.

## Getting started with R

If you need to install R, see my main R resources page.

When you start R you will see the `>` prompt. After the prompt you can type commands. A few things that you should be aware of:

• Variable names are case sensitive, `a` and `A` are different!

• You must always make sure you have matching opening `(` and closing `)` parenthesis and quotations `" "`.

• R observes the standard order of mathematical operations, so use parenthesis as needed.

• Be careful to include implicit multiplication signs:

``> x(2-y)	# This is WRONG``

Will result in the error because you have forgotten the multiplication sign! You should type:

``> x*(2-y)	# This is CORRECT``
• You can insert a comment in R by typing `#`, anything following the pound sign will be ignored.

``> # This is ignored!``
• Function arguments must be comma separated.

• Function optional arguments don’t have to be specified unless you want to change the default value. You must use the `name=value` format to change their value.

For example, the normal CDF is defined as `pnorm(x, mean=0, sd=1)`. If you are happy with the values for the mean and sd you don’t have to enter them, you can just type `pnorm(x)`. However, if you want to change them you must use the `name=value` format. So to change the mean to 10 and the sd to 3 you would type: `pnorm(x, mean=10, sd=3)`.

## Getting help in R

If need help on how to use a function, or more information about what it does, just type a question mark in-front of the function’s name and press enter. To get help on the `mean` function type:

``> ?mean``

A help window will open with more information. Don’t be intimidated by what is displayed. After reading a few of the help screens you will learn what to look for.

## Math operations

Most math operators are obvious, a few that may not be as obvious are:

• `sqrt(x)` square root of x.
• `sum(x)` sum up values in vector x:.
• `x^n` raise x to the power n.

## Entering data

To enter data in R, just use the basic form `variableName = Data`, you store the data in the variable you name on the left hand side of the equal sign. You will make frequent use of the `c()` function to enter vectors (sets) of data.

### Entering a scalar

To store the value 3 in `x` type (and hit enter):

``> x=3``

You can call your variables anything you wish, just make sure the names consist only of alpha numeric characters, begin with a letter, and have no spaces.

To see the value stored in a variable, just type it’s name and hit enter:

``````> x
 3``````

If you want to store a string (eg your name) in a variable, then you must use parenthesis:

``> name = "Anthony"``

If you try and enter `name=Anthony` R will look for a variable called `Anthony`, if it does not exist you will get an error.

### Entering a vector

To store 5 students ages (in years) in the variable `ages`:

``> ages = c(25, 22, 18, 20, 22)``

You use the `c()` function to store multiple values in a variable. The values are comma separated.

### Entering a seqence of numbers

Since we frequently need a sequence of numbers such as x={1, 2, 3, 4, 5} R makes this easy:

``````> x=1:5
> x
 1 2 3 4 5``````

To get the sequence {3, 4, 5, 6, 7, 8, 9} you would type `3:9`

If you need to generate sequences of numbers that don’t increment by 1 take a look at the `seq(...)` function.

### Loading a table (Excel data -> CSV )

If you need to enter more than a few values into a vector, or have an existing table of data, it’s much easier to enter the data using Excel or any other spreadsheet program that can export CSV files. When you load a table into R it generally expects that each column is a vector representing a measurement (such as heights, weights, ages) and each row represents an observation (an individual in a study).

Once you have entered your data into a spreadsheet, you can load it into R using the following steps:

1. Export your data as a CSV file (Comma Separated Values).

In Excel open your spreadsheet and select File –> Save As:, then in the format drop-down field select CSV. Pick a name for the file such as myTable.CSV and press save. You may be warned that some formatting will be lost, that’s ok.

2. To load the CSV file data in R, just type:

``my.table=read.csv(file.choose())``

R will present a file open dialog, just pick the CSV file you created. Your data will now be stored in `my.table`.

## Working with tables (dataframes)

If you have loaded a bunch of data into R and you are not sure what variables exist, just type `ls()` and R will list all the defined variables.

``````> ls()
 "my.table" "women"    "x"        "y"``````

We can see that 4 variables exist.

### Viewing and summarizing

If you have a variable that contains a table of data called `women` you could just type the variable name to see what stored inside:

``````> women
height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164``````

Above we can see that there are 2 columns `height` and `weight` and 15 observations. If your table is huge, looking at the raw data may not be particularly useful. To view the first few observations type:

``````> head(women)
height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129``````

Finally, a descriptive statistics summary of all columns can be computed by:

``````> summary(women)
height         weight
Min.   :58.0   Min.   :115.0
1st Qu.:61.5   1st Qu.:124.5
Median :65.0   Median :135.0
Mean   :65.0   Mean   :136.7
3rd Qu.:68.5   3rd Qu.:148.0
Max.   :72.0   Max.   :164.0``````

### Getting a column of data

In general we can get a column of data using: `tableName\$columnName`

1. Determine the column names for your table:

``````> names(women)
 "height" "weight"``````

In this case the `women` variable has two columns: `height` and `weight`.

2. To get out a column, use the `tableName\$columnName` dollar sign notation:

``````> women\$weight
 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164``````

In this case, we got the vector of `weight` data from the `women` table. We could easily store the vector in the variable `x` and find the mean by:

``````> x=women\$weight
> mean(x)
 136.7333``````

If you are constantly accessing data in a table, you can directly type the names of the columns if you first `attach(tableName)` all their names to the workspace:

``	> attach(women)``

To find the mean you can now just type:

``````	> mean(weight)
 136.7333``````

When you are done working with the table, you can remove the direct access to the column names by typing `detach(tableName)`.

## Descriptive statistics: Numerical

If we store the ages of 5 students in a variable called `ages` we can find some descriptive statistics about the ages.

``> ages = c(25, 22, 18, 20, 22)``

Note: if your data is stored in a different variable — perhaps `x` — then just substitute your variable’s name for `ages` below.

### Length, min, max, range

The length is the number of elements (n) in a vector

``````> length(ages)
 5``````

The min and max functions are obvious:

``````> min(ages)
 18
> max(ages)
 25``````

Using the min and max, we can easily find the range:

``````> range=max(ages)-min(ages)
> range
 7``````

### Mean, median, mode

The mean and median are:

``````> mean(ages)
 21.4
> median(ages)
 22``````

To find the mode for a discrete or categorical variable, just make a sorted table, below you can see that the mode is 22:

``````>  sort(table(ages))
ages
18 20 25 22
1  1  1  2 ``````

### Standard deviation, variance

The sample standard deviation and variance are:

``````> sd(ages)
 2.607681
> var(ages)
 6.8``````

### Six number summary (quartiles)

``````> summary(ages)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
18.0    20.0    22.0    21.4    22.0    25.0 ``````

## Graphs

For the following examples I will use the built in `iris` data set in R. This famous (Fisher’s or Anderson’s) data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

First, lets take a look at the first few entries in the table and view a summary:

``````> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> summary(iris)
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
> ``````

Now let’s `attach` the data set so we can directly type the column names without using the `tableName\$columnName` notation.

``> attach(iris)``

### Optional arguments for most graphing functions

Most plotting functions have optional arguments. Below is a partial list of all the optional argument names. Just specify them at the end of the function arguments as a comma separated list of `name=value` pairs (see the example below). If the value is a string, make sure it is in quotes.

`type`
what type of plot should be drawn. Possible types are:
• ”p” for points,
• ”l” for lines,
• ”b” for both,
• ”c” for the lines part alone of “b”,
• ”o” for both ‘overplotted’,
• ”h” for ‘histogram’ like (or ‘high-density’) vertical lines,
• ”s” for stair steps,
• ”S” for other steps,
• ”n” for no plotting.
• an overall title for the plot.
`sub`
a sub title for the plot.
`xlab`
a title for the x axis.
`ylab`
a title for the y axis.
`asp`
the y/x aspect ratio.
`xlim`
sets the range of the graph, use `xlim=c(min.x, max.x)` where `min.x` and `max.x` are the min and max values you want.
`ylim`
sets the range of the graph, use `ylim=c(min.y, max.y)` where `min.y` and `max.y` are the min and max values you want.

Example: to make a scatter plot of the vectors `x` and `y` and set the plot type to lines, specify a title, and label the axis x, and set the x limits to 0-10:

``> plot(x, y, type="l", main="My plot", xlab="x (meters)", xlim=c(0, 10))``

### Histogram Like Graphs

• Stem and leaf plot:

``````> stem(Sepal.Width)

The decimal point is 1 digit(s) to the left of the |

20 | 0
21 |
22 | 000
23 | 0000
24 | 000
25 | 00000000
26 | 00000
27 | 000000000
28 | 00000000000000
29 | 0000000000
30 | 00000000000000000000000000
31 | 00000000000
32 | 0000000000000
33 | 000000
34 | 000000000000
35 | 000000
36 | 0000
37 | 000
38 | 000000
39 | 00
40 | 0
41 | 0
42 | 0
43 |
44 | 0``````
• Histogram:

``> hist(Sepal.Width)`` ### Pareto chart

R does not have a built in Pareto chart function, but to make one all we need to do is make a table of our categorical data, sort it in descending order, and then make a boxplot. If your categorical data is in the vector `x` type:

``````> t=sort(table(Species), descending=T)
> barplot(t)``````

### Density Estimate

You can estimate the population density from a sample using:

``> plot(density(Sepal.Width))`` ### Boxpots

Boxplots are great for comparing a groups of data. Let’s compare the sepal widths to the species. The key is that the first variable is an ordered vector of quantitative data `Sepal.Width` and the second variable is a vector of categorical data `Species`. We model the relationship as `Sepal.Width~Species` meaning that the `Sepal.Width` depends on the type of `Species`.

``> boxplot(Sepal.Width~Species, ylab="sepal width (cm)", main="Iris sepal width by species")`` If want to make a simple boxplot without grouping into categories and your quantitative data is in `x`, just type `boxplot(x)`.

### Normal quantile (Q-Q) plot

To make a normal quantile plot:

``> qqnorm(Sepal.Width); qqline(Sepal.Width)`` ### Scatter plots, time series plots, plotting functions

Scatter Plots & Time Series Plots

To make a scatter plot of the following 10 x-y data points

``````    x   y
1  60 123
2  65 129
3  72 143
4  65 131
5  48  95
6  50  99
7  37  72
8  37  70
9  56 109
10 46  92``````

First store the data in the vectors `x` and `y` maintaining the order and then use the `plot` function.

``````> x = c(60, 65, 72, 65, 48, 50, 37, 37, 56, 46)
> y = c(123, 129, 143, 131, 95, 99, 72, 70, 109, 92)
> plot(x, y, xlab="x-axis label", ylab="y-axis lable", main="Plot title")`````` Time series plots are done the same way, just use `plot(t, y, ...)`. If you want connected points use the optional argument `type="b"` for both lines and points:

``> plot(t, y, type="b")``

Plotting Functions

If you want to plot mathematical (or R) functions, use the `curve(expression, xmin, xmax)` function and input an expression involving `x` and the min and max values to plot (the domain).

Example: plot the sin(x) function from -2pi to 2pi.

``> curve(sin(x), -2*pi, 2*pi)`` Example: plot 3rd degree polynomial y=x^3 + 2x^2 - 4x + 3 from -10 to 10

``> curve(x^3 + 2*x^2 - 4*x + 3, -10, 10, main="Polynomial") `` Example R function: plot standard normal density

``> curve(dnorm(x), -3, 3)`` ## Probability

### Factorials

To find 8!

``````> factorial(8)
 40320``````

### Combinations

Combinations of n choose r are done with `choose(n,r)`. To find 10 choose 7:

``````> choose(10, 7)
 120``````

## Distributions

For discrete variables, probability is height on the distribution.

Distributions: p=P(x)
Binomial: `pnorm(x,n,p)`

For continuous variables, probability is area so we use the CDFs to find area on the densities.

CDFs: p=F(x), gives area to the left of x.
Normal: `pnorm(x, mean=0, sd=1)`
Inverse CDFs: x=F^-1(p), finds x with are p to the left.
Normal: `qnorm(p, mean=0, sd=1)`

### Binomial distribution

Use `dbinom(x,n,p)`. The binomomial distribution finds the probability of `x` successes in `n` trials with `p` probability of individual success (the arguments must be in that order).

If the probability of guessing a correct answer on a multiple choice question is 0.25 (4 choices), what is the probability of randomly guessing 0 out of 5 questions correct? The parameters are: x=0, n=5, p=0.25. The probability is about 23.7%:

``````> dbinom(0, 5, 0.25)
 0.2373047``````

To find the probability that we answer 4 or 5 question correct, we must find P(x=4 OR x=5) = P(x=4) + P(x=5):

``````> dbinom(4, 5, 0.25) + dbinom(4, 5, 0.25)
 0.02929688``````

### Normal distribution

The normal density is `dnorm(x, mean=0, sd=1)`, the cumulative density function CDF is `pnorm(x, mean=0, sd=1)` and the inverse CDF is `qnorm(p, mean=0, sd=1)`. By default `pnorm` returns the area to the left of `x` and `qnorm` expects that `p` refers to the left tail’s area.

If men’s heights are normally distributed with a mean of 69 inches and a standard deviation of 2.8 inches:

Find the probability that a randomly selected man has a height less than 75 inches. Use the CDF:

``````> pnorm(75, mean=69, sd=2.8)
 0.9839377``````

Find the height that the top 1.6% of men are taller than. Use the inverse CDF:

``````> qnorm(1-0.016, mean=69, sd=2.8)
 75.00435``````

Note that in the above question, we had to use the compliment because the inverse CDF expects that `p` refers to the left tail area.

### Student’s t Distribution

The t density is `dt(t, df)`, the cumulative density function CDF is `pt(t, df)` and the inverse CDF is `qt(p, df)`. The `df` parameter is the degrees of freedom for the t distribution. By default `pt` returns the area to the left of `t` and `qt` expects that `p` refers to the left tail’s area. These functions work just like the normal distribution functions.

## Hypothesis Tests

Specifying the alternative hypothesis. Most of the hypothesis test functions listed below have the optional argument `alternative` to specify your alternative hypothesis:

• if Ha is “not equal to” set `alternative="two.sided"` (this is the default so you don’t have to set this)
• if Ha is “less than” set `alternative="less"`
• if Ha is “greater than” set `alternative="greater"`

Specifying a confidence interval level Most of the hypothesis test functions will output a confidence interval for the population parameter being estimated. By default the confidence level is 95%. If you want a 99% confidence level then set the optional argument `conf.level=0.99`.

### One sample proportion test

Use: `prop.test(x, n, p=0.5, alternative="two.sided", conf.level=0.95)`

Where `x` is the observed number of successes with a sample size `n`. The null hypothesis is `p` (default of 0.5).

Example: A study found that 10 out of 100 people were left handed. Test the alternative hypothesis that the true proportion of left handed people is greater than 15%.

``````> prop.test(10, 100, p=0.15, alternative="greater")

1-sample proportions test with continuity correction

data:  10 out of 100, null probability 0.15
X-squared = 1.5882, df = 1, p-value = 0.8962
alternative hypothesis: true p is greater than 0.15
95 percent confidence interval:
0.05689751 1.00000000
sample estimates:
p
0.1``````

### Two sample proportion test

Use: `prop.test(x, n, p=0, alternative="two.sided", conf.level=0.95)`

Where `x` is an ordered vector of the observed number of successes (`x=c(x1, x2)`) in both samples and `n` is an ordered vector of the sample sizes (`n=c(n1, n2)`). The null hypothesis is `p` refers to the difference `p1-p2` in the two population proportions (default is 0).

Example: Test the hypothesis that the proportion of left handed females is greater than the left handed males. A study found that 15 of 80 females were left handed while 8 out of 112 males were left handed.

``````> x=c(15,8)
> n=c(80, 112)  #Note that the order of n is the same as for x. Females listed first!
> prop.test(x, n, alternative='greater')

2-sample test for equality of proportions with continuity correction

data:  x out of n
X-squared = 4.9127, df = 1, p-value = 0.01333
alternative hypothesis: greater
95 percent confidence interval:
0.02317208 1.00000000
sample estimates:
prop 1     prop 2
0.18750000 0.07142857``````

Todo

### One sample mean test

This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).

Use: `t.test(x, mu=0, alternative="two.sided", conf.level=0.95)`

Where `x` is a vector of data and `mu` is the null hypothesis for the population mean (default 0).

Example: A random study of 5 adult’s heights (in inches) were 58, 62, 54, 51, 60. Test the hypothesis that the mean height of adults is 66.3 inches.

``````> x=c(58, 62, 54, 51, 60)
> t.test(x, mu=66.3, alternative="two.sided")

One Sample t-test

data:  x
t = -4.65, df = 4, p-value = 0.009661
alternative hypothesis: true mean is not equal to 66.3
95 percent confidence interval:
51.44711 62.55289
sample estimates:
mean of x
57``````

### Two sample mean test

This test is designed for analyzing raw data when the population standard deviation is unknown (hence the t distribution).

Use: `t.test(x1, x2, mu=0, alternative="two.sided", conf.level=0.95)`

Where `x1` and `x2` are vectors of data for the two samples and `mu` is the null hypothesis for the difference in the mean of the two populations `mu1-mu2` (default 0).

Example: Test the hypothesis that the mean height of adults males is greater than adult females. A study measured (in inches) 5 female heights as 58, 49, 52, 60, 55 and 7 males as 49, 64, 58, 60, 62, 55, 59.

``````> x1=c(58, 49, 52, 60, 55)
> x2=c(49, 64, 58, 60, 62, 55, 59)
> t.test(x1, x2, alternative="less")

Welch Two Sample t-test

data:  x1 and x2
t = -1.2258, df = 9.344, p-value = 0.1251
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 1.635198
sample estimates:
mean of x mean of y
54.80000  58.14286``````

Note that since `x1` was the females and `x2` was the males, our alternative should be `less` since we believe that the females mean height is less than the males. If we would have put males in first then the alternative would have been `greater`.

Todo

Todo

## Correlation & Regression

Todo

Updated on:
Thu Sep 04 2008 at 06 PM