When we analyze a dataset, we often care about two things:
1. Where the “center” value is located. We often measure the “center” using the mean and median.
2. How “spread out” the values are. We measure “spread” using range, interquartile range, variance, and standard deviation.
Range
The range is the difference between the largest and smallest values in a dataset.
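Suppose we have this dataset of final math exam scores for 20 students (listed here from smallest to largest):
58, 66, 71, 73, 74, 77, 78, 82, 84, 85, 88, 88, 88, 90, 90, 92, 92, 94, 96, 98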
The largest value is 98. The smallest value is 58. Thus, the range is 98 – 58 = 40.
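As a quick check of the arithmetic above, here is a minimal Python sketch that computes the range for the 20 exam scores used in this article. The variable names `scores` and `data_range` are just illustrative choices.

```python
# The 20 final math exam scores used throughout this article.
scores = [58, 66, 71, 73, 74, 77, 78, 82, 84, 85,
          88, 88, 88, 90, 90, 92, 92, 94, 96, 98]

# Range = largest value minus smallest value.
data_range = max(scores) - min(scores)
print(data_range)  # 40
```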
Interquartile Range
The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1) in a dataset. It measures the spread of the middle 50% of the values.
Quartiles are values that split up a dataset into four equal parts. Here is how to find the interquartile range of the following dataset of exam scores:
1. Arrange the values from smallest to largest.
58, 66, 71, 73, 74, 77, 78, 82, 84, 85, 88, 88, 88, 90, 90, 92, 92, 94, 96, 98
2. Find the median. In this case there are 20 values, so it’s the average of the middle two values: (85 + 88) / 2 = 86.5.
58, 66, 71, 73, 74, 77, 78, 82, 84, 85 (MEDIAN) 88, 88, 88, 90, 90, 92, 92, 94, 96, 98
3. The median splits the dataset into two halves. The median of the lower half is the lower quartile (Q1) and the median of the upper half is the upper quartile (Q3).
58, 66, 71, 73, 74 (Q1) 77, 78, 82, 84, 85 (MEDIAN) 88, 88, 88, 90, 90 (Q3) 92, 92, 94, 96, 98
4. The interquartile range is equal to Q3 – Q1.
In this case, Q1 is the average of the middle two values in the lower half of the dataset, (74 + 77) / 2 = 75.5, and Q3 is the average of the middle two values in the upper half of the dataset, (90 + 92) / 2 = 91.
Thus, the interquartile range is 91 – 75.5 = 15.5.
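The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `iqr_median_of_halves` is hypothetical, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumption, since the example in this article only covers an even count. Library routines that interpolate between values, such as NumPy's percentile function with its default settings, may return slightly different quartiles.

```python
from statistics import median

def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)          # step 1: arrange smallest to largest
    n = len(ordered)
    lower = ordered[: n // 2]         # step 2-3: values below the median
    # Assumption: for an odd count, the middle value goes in neither half.
    upper = ordered[n // 2 + n % 2 :]
    q1 = median(lower)                # median of the lower half
    q3 = median(upper)                # median of the upper half
    return q1, q3, q3 - q1            # step 4: IQR = Q3 - Q1

scores = [58, 66, 71, 73, 74, 77, 78, 82, 84, 85,
          88, 88, 88, 90, 90, 92, 92, 94, 96, 98]

q1, q3, iqr = iqr_median_of_halves(scores)
print(q1, q3, iqr)  # 75.5 91.0 15.5
```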
Interquartile Range vs. Range
The interquartile range is more resistant to outliers than the range, which can make it a better metric for measuring “spread.”
For example, suppose we have the following dataset with incomes for ten people:
The range is $2,468,000, but the interquartile range is $34,000, which is a much better indication of how spread out the incomes actually are.
In this case, the outlier income of person J causes the range to be extremely large and makes it a poor indicator of “spread” for these incomes.
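To see this effect in code, here is a small sketch using made-up numbers. The incomes below are hypothetical values invented purely for illustration; they are not the incomes from the dataset above, so the results differ from the figures quoted in this section. The helper name `spread` is also hypothetical, and it finds quartiles with the same median-of-halves method used earlier (for an odd count, the middle value is assumed to belong to neither half).

```python
from statistics import median

def spread(values):
    """Return (range, IQR), with quartiles found by the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[n // 2 + n % 2 :]  # skip the middle value when n is odd
    return ordered[-1] - ordered[0], median(upper) - median(lower)

# Hypothetical incomes, invented for illustration (NOT the article's dataset).
typical = [28_000, 35_000, 40_000, 44_000, 50_000,
           53_000, 58_000, 62_000, 70_000]
with_outlier = typical + [1_900_000]   # add one extreme income

print(spread(typical))       # (42000, 22500.0)
print(spread(with_outlier))  # (1872000, 22000)
```

With these made-up values, a single extreme income inflates the range from 42,000 to 1,872,000, while the interquartile range barely moves.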
Variance
The variance is a common way to measure how spread out data values are.
The formula to find the variance of a population (denoted as $\sigma^2$) is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where $\mu$ is the population mean, $x_i$ is the $i$th element from the population, $N$ is the population size, and $\Sigma$ is just a fancy symbol that means “sum.”
Usually we work with samples, not populations. The formula to find the variance of a sample (denoted as $s^2$) is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean and $n$ is the sample size.
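The two variance formulas differ only in the divisor, which the following Python sketch tries to make concrete. It reuses the exam scores from earlier in this article, treating them first as a whole population and then as a sample; that dual treatment is an assumption made purely for illustration. `pvariance` and `variance` come from Python's standard `statistics` module and are used here only to cross-check the hand-written formulas.

```python
from statistics import mean, pvariance, variance

scores = [58, 66, 71, 73, 74, 77, 78, 82, 84, 85,
          88, 88, 88, 90, 90, 92, 92, 94, 96, 98]

n = len(scores)
avg = mean(scores)                                # 83.2

# Sum of squared deviations from the mean.
squared_devs = sum((x - avg) ** 2 for x in scores)

pop_var = squared_devs / n          # population formula: divide by N
samp_var = squared_devs / (n - 1)   # sample formula: divide by n - 1

print(round(pop_var, 2), round(samp_var, 2))      # 107.76 113.43

# The standard library should agree with the manual calculation.
print(round(pvariance(scores), 2), round(variance(scores), 2))
```

Dividing by n - 1 instead of n always makes the sample variance slightly larger than the population variance computed on the same values.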
Standard Deviation
The standard deviation is the square root of the variance. It’s the most common way to measure how “spread out” data values are.
The formula to find the standard deviation of a population (denoted as $\sigma$) is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
And the formula to find the standard deviation of a sample (denoted as $s$) is:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
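Since the standard deviation is just the square root of the variance, it can be sketched in a few lines of Python. As before, this is an illustrative example that reuses the exam scores from this article and treats them both as a population and as a sample; `pstdev` and `stdev` from the standard `statistics` module serve as a cross-check.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

scores = [58, 66, 71, 73, 74, 77, 78, 82, 84, 85,
          88, 88, 88, 90, 90, 92, 92, 94, 96, 98]

# Standard deviation = square root of the corresponding variance.
pop_sd = sqrt(pvariance(scores))   # population standard deviation
samp_sd = sqrt(variance(scores))   # sample standard deviation

print(round(pop_sd, 2), round(samp_sd, 2))   # 10.38 10.65

# The dedicated library functions should give the same values.
print(round(pstdev(scores), 2), round(stdev(scores), 2))
```

Unlike the variance, the standard deviation is expressed in the same units as the data (here, exam points), which makes it easier to interpret.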