Variance

Variance, often written as \sigma^{2} or s^{2}, measures dispersion: how far the values in a dataset spread out from the mean. The sample standard deviation, s, is the square root of the sample variance. In the context of estimators, the variance of an estimator describes how much its value fluctuates from one sample to the next, that is, how uncertain the estimate is.

The figure above shows two samples with the same sample mean but different sample variances: the red graph has a smaller variance than the black graph. The sample with the higher sample variance has a much wider spread.

Sample variance is the average of the squared distances of the data points from the sample mean, and is calculated as follows.

\sigma^{2}=\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})^{2}}}{n}

Sample variance can also be used to estimate the variance of the population. In this case, we need to use the unbiased sample variance.

s^{2}=\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x})^{2}}}{n-1}

But why do we divide by (n-1) instead of n to estimate the variance of the population?

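Simply put, because a sample is a subset of the population, its range of values (maximum value minus minimum value) is never larger than the range of values of the population. In addition, the distances are measured from the sample mean, which is the point that minimizes the sum of squared distances to the sample's own values, so this sum is never larger than the one measured from the true population mean. If we calculated the variances of many samples using the first formula, the sample variances would, on average, be smaller than the actual population variance.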

How far these sample variances fall to the left of (that is, below) the actual population variance depends on the sample size: the larger the sample, the smaller the shortfall. From this example, we can see that the first formula is biased, as it systematically gives a smaller variance than the actual population variance, which makes it unsuitable as an estimator.
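The underestimation can be checked empirically. The following is a minimal sketch in plain Kotlin, not NM Dev code. It assumes a population of uniform random numbers on [0, 1), whose variance is known to be 1/12; the helper names (`sumOfSquares`, `averageVariances`), the seed and the trial count are illustrative choices. It draws many samples of a given size and averages the variance obtained with each divisor.

```kotlin
import kotlin.random.Random

// Sum of squared distances of each value from the sample mean.
fun sumOfSquares(sample: DoubleArray): Double {
    val mean = sample.average()
    var total = 0.0
    for (x in sample) total += (x - mean) * (x - mean)
    return total
}

// Average the biased (divide by n) and unbiased (divide by n-1) sample
// variances over many samples drawn from Uniform[0, 1).
// Returns Pair(average biased variance, average unbiased variance).
fun averageVariances(n: Int, trials: Int, rng: Random): Pair<Double, Double> {
    var biasedTotal = 0.0
    var unbiasedTotal = 0.0
    repeat(trials) {
        val sample = DoubleArray(n) { rng.nextDouble() }
        val ss = sumOfSquares(sample)
        biasedTotal += ss / n
        unbiasedTotal += ss / (n - 1)
    }
    return Pair(biasedTotal / trials, unbiasedTotal / trials)
}

fun main() {
    val populationVariance = 1.0 / 12.0 // variance of Uniform[0, 1)
    val rng = Random(42)                // fixed seed, an arbitrary choice
    for (n in intArrayOf(2, 10, 500)) {
        val (biased, unbiased) = averageVariances(n, 20_000, rng)
        println("n=$n  biased=$biased  unbiased=$unbiased  population=$populationVariance")
    }
}
```

With a setup like this, the biased average should come out clearly too low for n = 2 and almost exact for n = 500, while the unbiased average should stay close to 1/12 for every n.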

To use the sample variance as an estimator of the population variance, we have to remove the bias by correcting the underestimation. Moreover, as seen in the figure above, the amount of underestimation depends on the sample size, so the correction has to depend on it as well. Fortunately, dividing by (n-1) instead of n resolves both issues.
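Why (n-1) is exactly the right divisor can be sketched with a short derivation. It assumes the x_{i} are independent draws from a population with mean \mu and variance \sigma_{\text{pop}}^{2}; both symbols are introduced only for this sketch and are not used elsewhere in this section.

```latex
\begin{align*}
\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}
  &= \sum_{i=1}^{n}(x_{i}-\mu)^{2} - n(\bar{x}-\mu)^{2} \\
E\left[\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right]
  &= n\,\sigma_{\text{pop}}^{2},
  \qquad
  E\left[n(\bar{x}-\mu)^{2}\right]
   = n\cdot\frac{\sigma_{\text{pop}}^{2}}{n}
   = \sigma_{\text{pop}}^{2} \\
E\left[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right]
  &= (n-1)\,\sigma_{\text{pop}}^{2}
\end{align*}
```

Dividing the sum by n therefore gives, on average, \frac{n-1}{n}\sigma_{\text{pop}}^{2}, which is too small by a factor that depends on n, whereas dividing by (n-1) gives \sigma_{\text{pop}}^{2} on average.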

| n / n | n / (n-1) | % increase |
|---|---|---|
| 2 / 2 = 1 | 2 / 1 = 2 | 100 |
| 10 / 10 = 1 | 10 / 9 = 1.111 | 11.1 |
| 500 / 500 = 1 | 500 / 499 = 1.002 | 0.2 |

From the table, we can see that dividing by (n-1) instead of n makes the sample variance larger. Moreover, the percentage increase shrinks as n grows, matching the smaller underestimation in larger samples. This resolves both issues regarding underestimation. Hence, the unbiased sample variance equation

s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}

would make a suitable estimator for population variance.
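The correction factors in the table can be reproduced with a few lines of plain Kotlin. This is only an illustrative sketch; the function names `correctionFactor` and `percentIncrease` are made up for this example and are not part of NM Dev.

```kotlin
// Ratio between the unbiased and the biased sample variance: n / (n - 1).
fun correctionFactor(n: Int): Double {
    require(n > 1) { "need at least two data points" }
    return n.toDouble() / (n - 1)
}

// Percentage by which dividing by (n - 1) instead of n enlarges the variance.
fun percentIncrease(n: Int): Double = (correctionFactor(n) - 1.0) * 100.0

fun main() {
    for (n in intArrayOf(2, 10, 500)) {
        println("n=$n  n/(n-1)=${correctionFactor(n)}  increase=${percentIncrease(n)}%")
    }
}
```

For n = 2, 10 and 500 the increase works out to 100%, about 11.1% and about 0.2%, in line with the table.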

In code, we can compute sample variance (both biased and unbiased) as follows.

import kotlin.math.pow

// create an array of doubles for our dataset
val values = doubleArrayOf(3.0, 8.0, 21.0, 9.0)

// get the array size
val n = values.size

// compute the sample mean with the Mean class
val mean = Mean(values).value()

// sum the squared distance of each value from the sample mean
var total = 0.0
for (x in values) {
    total += (x - mean).pow(2)
}

// divide by n for the biased version and by (n-1) for the unbiased version
val sample_biased_variance = total / n
val sample_unbiased_variance = total / (n - 1)

println("Sample variance (biased): " + sample_biased_variance)
println("Sample variance (unbiased): " + sample_unbiased_variance)

Output:

Sample variance (biased): 43.6875
Sample variance (unbiased): 58.25

The same computation is simpler with the class Variance in NM Dev.

// create an array of doubles for our dataset
val values = doubleArrayOf(3.0, 8.0, 21.0, 9.0)

// create the Variance object for biased sample variance (second argument false)
val sample_biased_variance = Variance(values, false)

// create the Variance object for unbiased sample variance (second argument true)
val sample_unbiased_variance = Variance(values, true)

println("Sample variance (biased): " + sample_biased_variance.value())
println("Sample variance (unbiased): " + sample_unbiased_variance.value())

Output:

Sample variance (biased): 43.6875
Sample variance (unbiased): 58.25

Weighted Variance

Similar to the weighted mean, the weighted variance is the sum of the squared distances of the data points from the weighted mean \bar{x}^{*}, each multiplied by its respective weight, divided by the sum of all the weights.

\sigma^{2}=\frac{\sum_{i=1}^{n}{(x_{i}-\bar{x}^{*})^{2}w_{i}}}{\sum_{i=1}^{n}{w_{i}}}

The unbiased version is

s^{2}=\frac{V_{1}}{V_{1}^{2}-V_{2}}\times\sum_{i=1}^{n}{(x_{i}-\bar{x}^{*})^{2}w_{i}}

where

V_{1}=\sum_{i=1}^{n}{w_{i}}, \quad V_{2}=\sum_{i=1}^{n}{w_{i}^{2}}
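To make the two formulas concrete, here is a minimal sketch in plain Kotlin that evaluates them directly. It is not NM Dev's implementation: the function names `weightedMean` and `weightedVariance` are hypothetical, and V_{1} is taken to be the sum of the weights, consistent with the denominator of the biased formula. The values and weights are the same ones used in the NM Dev example in this section, so the results can be compared.

```kotlin
// Weighted mean: sum of w_i * x_i divided by the sum of the weights.
fun weightedMean(values: DoubleArray, weights: DoubleArray): Double {
    require(values.size == weights.size) { "values and weights must have the same length" }
    var weightedSum = 0.0
    for (i in values.indices) weightedSum += weights[i] * values[i]
    return weightedSum / weights.sum()
}

// Weighted variance following the two formulas above.
// unbiased = false: divide the weighted sum of squares by V1.
// unbiased = true:  multiply it by V1 / (V1^2 - V2).
fun weightedVariance(values: DoubleArray, weights: DoubleArray, unbiased: Boolean): Double {
    val mean = weightedMean(values, weights)
    val v1 = weights.sum() // V1: sum of the weights (assumed definition)
    var v2 = 0.0           // V2: sum of the squared weights
    var weightedSquares = 0.0
    for (i in values.indices) {
        val d = values[i] - mean
        weightedSquares += weights[i] * d * d
        v2 += weights[i] * weights[i]
    }
    return if (unbiased) v1 / (v1 * v1 - v2) * weightedSquares else weightedSquares / v1
}

fun main() {
    val values = doubleArrayOf(88.0, 94.0, 69.0, 66.0, 80.0)
    val weights = doubleArrayOf(2.0, 3.0, 4.0, 1.0, 5.0)
    println("Weighted variance (biased): " + weightedVariance(values, weights, false))
    println("Weighted variance (unbiased): " + weightedVariance(values, weights, true))
}
```

For this data the weighted mean is 1200 / 15 = 80, the weighted sum of squares is 1396, V_{1} = 15 and V_{2} = 55, so the two results are 1396 / 15, about 93.07, and 15 / 170 × 1396, about 123.18.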

Using NM Dev, we can use the class WeightedVariance to compute the weighted sample variance of a dataset.

// create arrays for our dataset and our weights
val values = doubleArrayOf(88.0, 94.0, 69.0, 66.0, 80.0)
val weights = doubleArrayOf(2.0, 3.0, 4.0, 1.0, 5.0)

// create the WeightedVariance object for biased sample variance
val weighted_biased_variance = WeightedVariance(values, weights, false)

// create the WeightedVariance object for unbiased sample variance
val weighted_unbiased_variance = WeightedVariance(values, weights, true)

println("Weighted variance (biased): " + weighted_biased_variance.value())
println("Weighted variance (unbiased): " + weighted_unbiased_variance.value())

Output:

Weighted variance (biased): 93.06666666666666
Weighted variance (unbiased): 123.17647058823529