Variance, often represented by $\sigma^2$ (for a population) or $s^2$ (for a sample), measures dispersion: how far the values in a dataset spread out from the mean. The sample standard deviation, $s$, is the square root of the sample variance. In the context of estimators, the variance of an estimator describes how uncertain the estimator is.
The figure above shows two samples with the same sample mean but different sample variances; the red graph has a smaller variance than the black graph. We can see that the sample with the higher sample variance has a much wider spread.
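Sample variance is the average of the squared distances of the data points from the sample mean $\bar{x}$, and can be calculated as follows, where $n$ is the sample size:
$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Sample variance can also be used to estimate the variance of the population. In this case, we need to use the unbiased sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$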
But why do we divide by $n-1$ instead of $n$ to estimate the variance of the population?
Simply put, because a sample is a subset of the population, the range of values (maximum value – minimum value) in a sample is always the same as or smaller than the range of values in the population. If we calculated the variances of multiple samples using the first formula, the sample variances would mostly be less than the actual population variance.
How far these sample variances fall to the left of (that is, below) the actual population variance depends on the sample size. With larger sample sizes, the sample variance falls short by smaller amounts. From this example, we can see that the first formula is biased: it tends to give a smaller variance than the actual population variance, which makes it unsuitable as an estimator.
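The underestimation described above can be checked exactly on a tiny population. The sketch below is a plain-Kotlin illustration that does not use NM Dev. It assumes a hypothetical population (the six faces of a fair die) and samples drawn with replacement, and the helper names `sumSquaredDistances` and `averageSampleVariances` are made up for this example. It enumerates every possible sample of size n and averages both versions of the sample variance.

```kotlin
// Sum of squared distances of each value from the mean of xs.
fun sumSquaredDistances(xs: DoubleArray): Double {
    val mean = xs.average()
    return xs.sumOf { (it - mean) * (it - mean) }
}

// Average the biased (divide by n) and unbiased (divide by n - 1) sample
// variances over every ordered sample of size n drawn with replacement.
// Returns Pair(average biased variance, average unbiased variance).
fun averageSampleVariances(population: DoubleArray, n: Int): Pair<Double, Double> {
    require(n >= 2) { "need at least two observations per sample" }
    var biasedTotal = 0.0
    var unbiasedTotal = 0.0
    var count = 0L
    val sample = DoubleArray(n)

    // Recursively fill the sample one position at a time.
    fun fill(position: Int) {
        if (position == n) {
            val ss = sumSquaredDistances(sample)
            biasedTotal += ss / n
            unbiasedTotal += ss / (n - 1)
            count++
            return
        }
        for (value in population) {
            sample[position] = value
            fill(position + 1)
        }
    }
    fill(0)
    return Pair(biasedTotal / count, unbiasedTotal / count)
}

fun main() {
    // Hypothetical population: the six faces of a fair die.
    val population = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val populationVariance = sumSquaredDistances(population) / population.size
    println("Population variance: $populationVariance")
    for (n in listOf(2, 3, 4)) {
        val (biased, unbiased) = averageSampleVariances(population, n)
        println("n = $n: average biased = $biased, average unbiased = $unbiased")
    }
}
```

Under these assumptions, the average of the biased variances comes out at (n-1)/n times the population variance, so the shortfall shrinks as n grows, while the average of the variances divided by n-1 matches the population variance up to floating-point rounding.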
In order to use the sample variance as an estimator for the population variance, we have to remove the bias by correcting the underestimation. Moreover, as seen in the figure above, the amount of underestimation depends on the sample size. Fortunately, dividing by $n-1$ instead of $n$ resolves both issues.
| n / n | n / (n-1) | % increase |
|---|---|---|
| 2 / 2 = 1 | 2 / 1 = 2 | 100 |
| 10 / 10 = 1 | 10 / 9 = 1.111 | 11.1 |
| 500 / 500 = 1 | 500 / 499 = 1.002 | 0.2 |
From the table, we can see that dividing by $n-1$ instead of $n$ makes the sample variance larger. Moreover, as $n$ increases, the percentage increase shrinks, which resolves both issues regarding underestimation. Hence, the unbiased sample variance equation
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
would make a suitable estimator for population variance.
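The factor in the table is exactly the ratio between the two formulas. As a short sketch, write $s_n^2$ for the variance divided by $n$ and $s^2$ for the variance divided by $n-1$ (notation assumed here), and assume independent draws from a population with variance $\sigma^2$. The standard result is then:

$$s^2 = \frac{n}{n-1}\, s_n^2, \qquad E\left[s_n^2\right] = \frac{n-1}{n}\,\sigma^2 \;\Longrightarrow\; E\left[s^2\right] = \sigma^2$$

The relative increase is $\frac{n}{n-1} - 1 = \frac{1}{n-1}$, which gives 100%, 11.1% and 0.2% for $n = 2, 10, 500$, in agreement with the table.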
Example
Following the previous example on test scores, the teacher now wants to find the spread of the test scores, i.e. the variance.
From earlier, we have already found the mean of the sample, $\bar{x}$.
To find the variance, we first have to find the sum of the squared distances of the data points from the mean.
Sum of squared distances $= \sum_{i=1}^{n}(x_i - \bar{x})^2$
Following that, we can find the biased and unbiased variance by dividing the sum of squared distances by $n$ and $n-1$ respectively.
Biased: $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Unbiased: $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Code
In code, we can compute sample variance (both biased and unbiased) as follows.
import kotlin.math.pow

// Mean is NM Dev's sample mean class; it must be in scope (imported from NM Dev)
// create an array of doubles for our dataset
val values = doubleArrayOf(3.0, 8.0, 21.0, 9.0)
// get the sample size
val n = values.size
// compute the sample mean
val mean = Mean(values).value()
// add up the squared distance of each value from the sample mean
var total = 0.0
for (x in values) {
    total += (x - mean).pow(2)
}
// divide by n for the biased variance and by n - 1 for the unbiased variance
val sample_biased_variance = total / n
val sample_unbiased_variance = total / (n - 1)
println("Sample variance (biased): " + sample_biased_variance)
println("Sample variance (unbiased): " + sample_unbiased_variance)
Sample variance (biased): 43.6875
Sample variance (unbiased): 58.25
This can also be simplified through the class Variance in NM Dev.
// create an array of doubles for our dataset
val values = doubleArrayOf(3.0, 8.0, 21.0, 9.0)
// create the Variance object for biased sample variance
val sample_biased_variance = Variance(values, false)
// create the Variance object for unbiased sample variance
val sample_unbiased_variance = Variance(values, true)
println("Sample variance (biased): " + sample_biased_variance.value())
println("Sample variance (unbiased): " + sample_unbiased_variance.value())
Sample variance (biased): 43.6875
Sample variance (unbiased): 58.25
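Both printed values can be verified by hand. The arithmetic below is a worked check for the dataset used in the code, $3, 8, 21, 9$ with $n = 4$:

$$\bar{x} = \frac{3 + 8 + 21 + 9}{4} = 10.25$$

$$\sum_{i=1}^{4}(x_i - \bar{x})^2 = 52.5625 + 5.0625 + 115.5625 + 1.5625 = 174.75$$

$$\frac{174.75}{4} = 43.6875, \qquad \frac{174.75}{3} = 58.25$$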
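Similar to the weighted mean, the weighted variance is the sum of the squared distances of the data points from the weighted mean $\bar{x}_w$, each multiplied by its respective weight $w_i$, divided by the sum of all the weights.
$$s_{w,\text{biased}}^2 = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{n} w_i}$$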
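The unbiased version is
$$s_{w,\text{unbiased}}^2 = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2}{V_1 - V_2 / V_1}$$
where $\bar{x}_w$ is the weighted mean, $V_1 = \sum_{i=1}^{n} w_i$ is the sum of the weights and $V_2 = \sum_{i=1}^{n} w_i^2$ is the sum of the squared weights.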
Using NM Dev, we can use the class WeightedVariance to compute the weighted sample variance of a dataset.
Example
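Following the example from weighted mean, where the student’s grades are $88, 94, 69, 66, 80$ and the respective weights are $2, 3, 4, 1, 5$, we now want to find out the weighted variance.
The weighted mean is $80$.
We can find the weighted variance by first finding the sum of the squared distances of the data points from the weighted mean, each multiplied by its respective weight.
$$2(88-80)^2 + 3(94-80)^2 + 4(69-80)^2 + 1(66-80)^2 + 5(80-80)^2 = 128 + 588 + 484 + 196 + 0 = 1396$$
To find the biased weighted variance, we can divide this sum by the total weight, $2 + 3 + 4 + 1 + 5 = 15$.
Biased: $\frac{1396}{15} \approx 93.07$
Likewise, we can find the unbiased version by following its formula, where the sum of the squared weights is $4 + 9 + 16 + 1 + 25 = 55$.
Unbiased: $\frac{1396}{15 - 55/15} \approx 123.18$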
Code
// create arrays for our dataset and our weights
val values = doubleArrayOf(88.0, 94.0, 69.0, 66.0, 80.0)
val weights = doubleArrayOf(2.0, 3.0, 4.0, 1.0, 5.0)
// create the WeightedVariance object for biased sample variance
val weighted_biased_variance = WeightedVariance(values, weights, false)
// create the WeightedVariance object for unbiased sample variance
val weighted_unbiased_variance = WeightedVariance(values, weights, true)
println("Weighted variance (biased): " + weighted_biased_variance.value())
println("Weighted variance (unbiased): " + weighted_unbiased_variance.value())
Weighted variance (biased): 93.06666666666666
Weighted variance (unbiased): 123.17647058823529
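The printed values can be cross-checked without NM Dev. The sketch below is a plain-Kotlin illustration, not NM Dev's implementation: `weightedVariances` is a hypothetical helper name, and the unbiased denominator is assumed to be the sum of the weights minus the sum of the squared weights divided by the sum of the weights, which is the choice that reproduces the output above.

```kotlin
// Plain-Kotlin cross-check of the weighted variance; does not use NM Dev.
// weightedVariances is a hypothetical helper name, not an NM Dev API.
// Returns Pair(biased weighted variance, unbiased weighted variance).
fun weightedVariances(values: DoubleArray, weights: DoubleArray): Pair<Double, Double> {
    require(values.size == weights.size) { "values and weights must have the same length" }
    val v1 = weights.sum()               // sum of the weights
    val v2 = weights.sumOf { it * it }   // sum of the squared weights
    // weighted mean: sum of w_i * x_i divided by the sum of the weights
    val weightedMean = values.indices.sumOf { weights[it] * values[it] } / v1
    // weighted sum of squared distances from the weighted mean
    val weightedSS = values.indices.sumOf {
        val d = values[it] - weightedMean
        weights[it] * d * d
    }
    val biased = weightedSS / v1
    val unbiased = weightedSS / (v1 - v2 / v1)   // assumed unbiased denominator
    return Pair(biased, unbiased)
}

fun main() {
    val values = doubleArrayOf(88.0, 94.0, 69.0, 66.0, 80.0)
    val weights = doubleArrayOf(2.0, 3.0, 4.0, 1.0, 5.0)
    val (biased, unbiased) = weightedVariances(values, weights)
    println("Weighted variance (biased): $biased")
    println("Weighted variance (unbiased): $unbiased")
}
```

For the grades and weights above, this should agree with the NM Dev output up to floating-point rounding.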