Sample vs population variance with multivariate distributions

Introduction

For a normal distribution, sample variance does not perform better than population variance as a point estimator, as evaluated by Mean Squared Error (MSE). Similarly, for a Bernoulli distribution, sample variance does not perform better than population variance, as evaluated by MSE and by Probability of Impossible Estimand [1].

However, this underperformance of sample variance does not extend to the sum of variances of a multivariate distribution. Depending on the multivariate distribution, below a certain number of samples and above a certain number of variates (or dimensions), sample variance achieves lower MSE than population variance. This document provides simple conditions for when lower MSE is achieved.

Formal Setup

In this document, a \(d\)-dimensional random vector \(D = (D_1, D_2, ..., D_d)\) defines the multivariate distribution under consideration. All \(D_c\) must be jointly independent and must have finite mean and variance.

\(S_c\) denotes the population variance at variate \(D_c\) computed from \(n\) samples of the multivariate distribution: \[\begin{eqnarray*} S_c & := & \frac{1}{n} \sum_{i=1}^n \left( X_{i,c} - \frac{1}{n} \sum_{j=1}^n X_{j,c} \right)^2 \end{eqnarray*}\] where \(X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})\) is the \(i\)-th sample from the \(d\)-dimensional multivariate distribution.

Let \(s_*\) denote the estimator expectation averaged across dimensions \[\begin{eqnarray*} s_* & := & \frac{1}{d} \sum_{c=1}^d \operatorname{E}\!\left[{ S_c}\right] \end{eqnarray*}\] and \(v_*\) the estimator variance averaged across dimensions. \[\begin{eqnarray*} v_* & := & \frac{1}{d} \sum_{c=1}^d \operatorname{Var}\!\left[{ S_c}\right] \end{eqnarray*}\]
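As an illustrative sketch (not part of the original derivation; the distribution, \(n\), and \(d\) are arbitrary choices), the \(S_c\) can be computed from an \(n \times d\) sample matrix with numpy, whose default `ddof=0` matches the \(1/n\) divisor above. Note that \(s_*\) and \(v_*\) are expectations over repeated sampling, so a single sample matrix yields only the \(S_c\) themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50                  # arbitrary sample count and dimension
X = rng.normal(size=(n, d))    # X[i, c] is the i-th sample at variate c

# Population variance S_c at each variate c: mean squared deviation from
# the per-variate mean, with the 1/n divisor (numpy's default ddof=0).
S = X.var(axis=0, ddof=0)      # shape (d,)
print(S[:5])
```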

Conditions favoring sample variance

By Corollary 2, the sum of sample variances has lower MSE than the sum of population variances if and only if \[\begin{eqnarray*} \frac{s_*^2}{v_*} d & > & 2n-1 \end{eqnarray*}\]

This result applies to any distribution for which \(s_*\) and \(v_*\) exist. For a large number of dimensions, provided the average estimate \(s_*\) is not particularly small relative to \(v_*\), the condition is met and sample variance outperforms population variance.
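The following Monte Carlo sketch checks the condition against directly estimated MSEs; the standard normal multivariate distribution and the values of \(n\), \(d\), and the trial count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 4, 40, 20_000
theta = float(d)                          # true sum of variances (unit variances)

X = rng.normal(size=(trials, n, d))
S_c = X.var(axis=1, ddof=0)               # population variance per trial and variate
sum_pop = S_c.sum(axis=1)                 # sum of population variances
sum_samp = X.var(axis=1, ddof=1).sum(axis=1)  # sum of sample variances

mse_pop = np.mean((sum_pop - theta) ** 2)
mse_samp = np.mean((sum_samp - theta) ** 2)

s_star = S_c.mean()                       # Monte Carlo estimate of s_*
v_star = S_c.var(axis=0).mean()           # Monte Carlo estimate of v_*

# Both sides of Corollary 2 agree: the condition holds and the sum of
# sample variances achieves the lower MSE.
print(s_star**2 / v_star * d > 2 * n - 1, mse_samp < mse_pop)
```

With these choices the left side of the condition is \(\frac{n-1}{2} d = 60\), well above \(2n-1 = 7\), so both printed values should be True.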

More precise conditions can be derived for particular families of distributions, such as the Bernoulli and normal distributions.

Multivariate Bernoulli distributions

By Theorem 1, the simple condition \[\begin{eqnarray*} \sum_{c=1}^d p_c & \ge & n + 1 \\ \end{eqnarray*}\] implies that the sum of sample variances achieves lower MSE than the sum of population variances. The \(p_c\) are the probabilities of the Bernoulli distributions, after flipping so that all probabilities are at most \(1/2\): if a distribution has probability \(p_c > 1/2\), the corresponding random variable \(B_c\) can be replaced by the flipped random variable \(B'_c := 1 - B_c\).
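For example, with \(n = 4\) samples and all probabilities equal to \(1/2\), the condition reads \(d/2 \ge 5\), so \(d \ge 10\) dimensions suffice.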

Multivariate normal distributions

For a multivariate normal distribution with equal variances across dimensions, Theorem 4 shows that sample variance achieves lower MSE than population variance if and only if \[\begin{eqnarray*} d & > & 4 + \frac{2}{n-1} \\ \end{eqnarray*}\]
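For example, \(n = 2\) requires \(d > 6\) (that is, \(d \ge 7\)), while for any \(n \ge 4\) the condition reduces to \(d \ge 5\).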

If the variances are not exactly equal but all \(\operatorname{E}\!\left[{ S_c}\right]\) are close to \(s_*\) and all \(\operatorname{Var}\!\left[{ S_c}\right]\) are close to \(v_*\), then the condition of Theorem 4 holds approximately \[\begin{eqnarray*} d & \gtrapprox & 4 + \frac{2}{n-1} \end{eqnarray*}\]

Conclusion

As long as there exist finite averages across dimensions of the variances to be estimated and of the variances of the estimates, there is some number of dimensions large enough for the sum of sample variances to be a better estimator than the sum of population variances, as measured by MSE.

In the general case, Corollary 2 provides a simple condition based on averages across dimensions. For multivariate Bernoulli distributions, Theorem 1 shows that if the sum of distribution probabilities (after flipping so that all probabilities are at most \(1/2\)) is at least \(n+1\), then sample variance performs better. For multivariate normal distributions whose variates are independent with equal variances, Theorem 4 shows sample variance achieves lower MSE exactly when \(d > 4 + 2/(n-1)\).

Proofs

Theorem 1

Given \(n \ge 2\) independent samples from a multivariate distribution of \(d\) jointly independent Bernoulli distributions with probabilities \(p_c \le 1/2\) for \(c \in \{1, ..., d\}\), if \[\begin{eqnarray*} \sum_{c=1}^d p_c & \ge & n + 1 \\ \end{eqnarray*}\] then the sum of sample variances has lower MSE than the sum of population variances.

Proof

First, establish an algebraic identity: \[\begin{eqnarray*} 2n + 1 + \frac{1}{n-1} & = & 2n + \frac{n}{n-1} \\ & = & \frac{2n(n-1) + n}{n-1} \\ & = & \frac{n(2n-1)}{n-1} \\ & = & \frac{n}{n-1}(2n-1) \\ \end{eqnarray*}\]

Let \[\begin{eqnarray*} p_* & := & \frac{1}{d} \sum_{c=1}^d p_c \\ \end{eqnarray*}\]

\[\begin{eqnarray*} \sum_{c=1}^d p_c & \ge & n + 1 \\ 2 p_* d & \ge & 2n + 2 \\ & \ge & 2n + 1 + \frac{1}{n-1} \\ & = & \frac{n}{n-1} (2n - 1) \\ 2 p_* d \frac{n-1}{n} & \ge & 2n - 1 \\ \end{eqnarray*}\] The third line uses \(n \ge 2\), so that \(\frac{1}{n-1} \le 1\); the fourth line applies the identity above.

By Theorem 3 (multiplying its inequality by \(d\)) combined with the result above, \[\begin{eqnarray*} \frac{s_*^2}{v_*} d & \ge & 2 p_* d \frac{n-1}{n} \\ & \ge & 2n - 1 \\ \end{eqnarray*}\]

Since the components are jointly independent, so are the \(S_c\), and the variance of \(S := \sum_{c=1}^d S_c\) is the sum of the variances. Hence \[ \frac{\operatorname{E}\!\left[{ S}\right]^2}{\operatorname{Var}\!\left[{ S}\right]} = \frac{\left(\sum_{c=1}^d \operatorname{E}\!\left[{ S_c}\right]\right)^2}{\sum_{c=1}^d \operatorname{Var}\!\left[{ S_c}\right]} = \frac{\left(s_* d\right)^2}{v_* d} = \frac{s_*^2}{v_*} d \]

From Theorem 2, it follows that the sum of sample variances has smaller MSE than the sum of population variances. QED

Theorem 2

Consider any estimator \(F\) of a parameter \(\theta\) of a given distribution such that, for some \(n \ge 2\), \[\begin{eqnarray*} \operatorname{E}\!\left[{ F}\right] & = & \frac{n-1}{n} \theta \end{eqnarray*}\] It follows that \[\begin{eqnarray*} \operatorname{MSE}\!\left[{ F}\right] & > & \operatorname{MSE}\!\left[{ \frac{n}{n-1} F}\right] \end{eqnarray*}\] if and only if \[\begin{eqnarray*} \frac{\operatorname{E}\!\left[{ F}\right]^2}{\operatorname{Var}\!\left[{ F}\right]} & > & 2n-1 \end{eqnarray*}\]

Proof

\[\begin{eqnarray*} \operatorname{E}\!\left[{ F}\right] & = & \frac{n-1}{n} \theta \\ \left( 1 + \frac{1}{n-1} \right) \operatorname{E}\!\left[{ F}\right] & = & \theta \\ \operatorname{E}\!\left[{ F - \theta}\right] & = & - \frac{\operatorname{E}\!\left[{ F}\right]}{n-1} \\ \end{eqnarray*}\]

Since \(\operatorname{E}\!\left[{ \frac{n}{n-1} F}\right] = \theta\), the rescaled estimator \(\frac{n}{n-1} F\) is unbiased and its MSE equals its variance. Then \[\begin{eqnarray*} \operatorname{MSE}\!\left[{ F}\right] & > & \operatorname{MSE}\!\left[{ \frac{n}{n-1} F}\right] \\ \operatorname{Var}\!\left[{ F}\right] + \operatorname{E}\!\left[{ F-\theta}\right]^2 & > & \operatorname{Var}\!\left[{ \frac{n}{n-1}F}\right] + 0^2 \\ \operatorname{E}\!\left[{ F-\theta}\right]^2 & > & \left(\frac{n^2}{(n-1)^2} - 1 \right) \operatorname{Var}\!\left[{ F}\right] \\ \left(\frac{\operatorname{E}\!\left[{ F}\right]}{n-1}\right)^2 & > & \frac{2n-1}{(n-1)^2} \operatorname{Var}\!\left[{ F}\right] \\ \frac{\operatorname{E}\!\left[{ F}\right]^2}{\operatorname{Var}\!\left[{ F}\right]} & > & 2n-1 \\ \end{eqnarray*}\] Each step is reversible, establishing the equivalence. QED
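To sanity-check the equivalence, this sketch evaluates both sides analytically for a hypothetical estimator with the assumed bias; the values of \(n\), \(\theta\), and the two variances are arbitrary.

```python
n, theta = 5, 2.0                      # arbitrary sample count and parameter
for var_f in (0.01, 0.5):              # small vs large estimator variance
    e_f = (n - 1) / n * theta          # E[F] as assumed by the theorem
    mse_f = var_f + (e_f - theta) ** 2          # MSE = variance + squared bias
    mse_rescaled = (n / (n - 1)) ** 2 * var_f   # rescaled estimator is unbiased
    # The two sides of the equivalence agree: (True, True) then (False, False).
    print(e_f ** 2 / var_f > 2 * n - 1, mse_f > mse_rescaled)
```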

Corollary 2

Given \(n \ge 2\) independent samples from a multivariate distribution of \(d\) jointly independent distributions, the sum of sample variances has lower MSE than the sum of population variances if and only if \[\begin{eqnarray*} \frac{s_*^2}{v_*} d & > & 2n-1 \end{eqnarray*}\] where \(s_*\) and \(v_*\) are defined in the section ‘Formal Setup’.

Proof

The population parameter to be estimated is the sum of variances \[\begin{eqnarray*} \theta & = & \sum_{c=1}^d \operatorname{Var}\!\left[{ D_c}\right] \end{eqnarray*}\]

\(S\) denotes the sum of population variances of the components. \[\begin{eqnarray*} S & := & \sum_{c=1}^d S_c \end{eqnarray*}\]

The expectation of population variance [2] at component \(c\) is \[\begin{eqnarray*} \operatorname{E}\!\left[{ S_c}\right] & = & \frac{n-1}{n} \operatorname{Var}\!\left[{ D_c}\right] \end{eqnarray*}\] and thus the expectation of the sum of population variances is \[\begin{eqnarray*} \operatorname{E}\!\left[{ S}\right] & = & \frac{n-1}{n} \theta \end{eqnarray*}\]

Since the components are jointly independent, the variance of the sum is the sum of the component variances, so the condition of Theorem 2 can be expressed in terms of the averages \(s_*\) and \(v_*\). \[ \frac{\operatorname{E}\!\left[{ S}\right]^2}{\operatorname{Var}\!\left[{ S}\right]} = \frac{\left(\sum_{c=1}^d \operatorname{E}\!\left[{ S_c}\right]\right)^2}{\sum_{c=1}^d \operatorname{Var}\!\left[{ S_c}\right]} = \frac{s_*^2}{v_*} d \]

Thus by Theorem 2 the sum of sample variances has lower MSE than the sum of population variances if and only if \[\begin{eqnarray*} \frac{s_*^2}{v_*} d & > & 2n-1 \end{eqnarray*}\]

QED

Theorem 3

Given \(n \ge 2\) independent samples \(X_1\), …, \(X_n\) from a multivariate distribution of \(d\) Bernoulli distributions with probabilities \(p_c \le 1/2\) for \(c \in \{1, ..., d\}\), \[\begin{eqnarray*} \frac{s_*^2}{v_*} & \ge & 2 p_* \frac{n-1}{n} \\ \end{eqnarray*}\] where \(s_*\) and \(v_*\) are defined in the section ‘Formal Setup’ and \(p_*\) denotes the average of the Bernoulli probabilities across components. More formally, \[\begin{eqnarray*} p_* & := & \frac{1}{d} \sum_{c=1}^d p_c \\ \end{eqnarray*}\]

Proof

Define \(S_c\) as in the section ‘Formal Setup’ and \[\begin{eqnarray*} \hat{p}_c & := & \frac{1}{n} \sum_{i=1}^n X_{i,c} \\ \end{eqnarray*}\] where \(X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})\) is the \(i\)-th sample from the \(d\)-dimensional multivariate distribution. By Lemma 1, \(S_c = \hat{p}_c (1-\hat{p}_c)\): the population variance at component \(c\) equals \(\hat{p}_c (1-\hat{p}_c)\).

By Lemma 2 (averaging its inequality across components, and assuming \(v_* > 0\) so that the ratio exists), \[\begin{eqnarray*} \frac{1}{d} \sum_{c=1}^d \frac{\operatorname{E}\!\left[{ S_c}\right]}{4} & \ge & \frac{1}{d} \sum_{c=1}^d \operatorname{Var}\!\left[{ S_c}\right] \\ \frac{1}{4} s_* & \ge & v_* \\ \frac{s_*^2}{v_*} & \ge & 4 s_* \\ \end{eqnarray*}\]

From the expectation of the population variance, the variance of a Bernoulli distribution, and \(p_c \le 1/2\) (so \(1 - p_c \ge 1/2\)), for all \(c\) we have \[\begin{eqnarray*} \operatorname{E}\!\left[{ S_c}\right] & = & \frac{n-1}{n} p_c(1-p_c) \\ & \ge & \frac{n-1}{n} \frac{p_c}{2} \\ \end{eqnarray*}\]

Combining these results gives \[\begin{eqnarray*} \frac{1}{d} \sum_{c=1}^d \operatorname{E}\!\left[{ S_c}\right] & \ge & \frac{1}{d} \sum_{c=1}^d \frac{n-1}{n} \frac{p_c}{2} \\ s_* & \ge & \frac{n-1}{2n} p_* \\ 4 s_* & \ge & 2 \frac{n-1}{n} p_* \\ \frac{s_*^2}{v_*} & \ge & 2 p_* \frac{n-1}{n} \\ \end{eqnarray*}\] QED

Theorem 4

Given \(n \ge 2\) samples from a multivariate distribution of \(d\) jointly independent normal distributions, all with variance \(\sigma^2\), the sum of sample variances has lower MSE than the sum of population variances if and only if \[\begin{eqnarray*} d & > & 4 + \frac{2}{n-1} \\ \end{eqnarray*}\]

Proof

Let \(S_c\) be the population variance of the \(n\) samples at component \(c\). Then \(n S_c / \sigma^2\) has a chi-squared distribution with \(n-1\) degrees of freedom [2] and thus: \[\begin{eqnarray*} \operatorname{E}\!\left[{ n S_c / \sigma^2}\right] & = & n-1 \\ \operatorname{E}\!\left[{ S_c}\right] & = & \frac{(n-1) \sigma^2}{n} \\ \operatorname{Var}\!\left[{ n S_c / \sigma^2}\right] & = & 2(n-1) \\ \operatorname{Var}\!\left[{ S_c}\right] & = & \frac{2(n-1) \sigma^4}{n^2} \\ \end{eqnarray*}\] and thus \[ \frac{s_*^2}{v_*} = \frac{\left( \frac{(n-1) \sigma^2}{n} \right)^2}{ \frac{2(n-1) \sigma^4}{n^2}} = \frac{n-1}{2} \] which means the condition of Corollary 2 becomes \[\begin{eqnarray*} \frac{n-1}{2} d & > & 2n-1 \end{eqnarray*}\] and further simplifies to \[\begin{eqnarray*} d & > & 2 \left( \frac{n}{n-1} + \frac{n-1}{n-1} \right) \\ d & > & 4 + \frac{2}{n-1} \\ \end{eqnarray*}\]

QED
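As a numerical cross-check of the chi-squared moments used in the proof, here is a Monte Carlo sketch; \(n\), \(\sigma\), and the trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, trials = 6, 1.5, 500_000     # arbitrary illustrative choices
X = rng.normal(scale=sigma, size=(trials, n))
S = X.var(axis=1, ddof=0)              # population variance per trial

print(S.mean(), (n - 1) * sigma**2 / n)          # empirical vs exact E[S_c]
print(S.var(), 2 * (n - 1) * sigma**4 / n**2)    # empirical vs exact Var[S_c]
print(S.mean()**2 / S.var(), (n - 1) / 2)        # ratio approaches (n-1)/2
```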

Lemma 1

\(\hat{p} (1-\hat{p})\) equals the population variance of samples \(X_1\), …, \(X_n\) from a Bernoulli distribution (taking values \(0\) or \(1\)) where \[\begin{eqnarray*} \hat{p} & := & \frac{1}{n} \sum_{i=1}^n X_i \end{eqnarray*}\]

Proof

\[\begin{eqnarray*} \frac{1}{n} \sum_{i=1}^n \left(X_i - \hat{p} \right)^2 & = & \frac{1}{n} \sum_{i=1}^n X_i^2 - 2 \hat{p}\frac{1}{n} \sum_{i=1}^n X_i + \hat{p}^2 \\ & = & \frac{1}{n} \sum_{i=1}^n X_i - \hat{p}^2 \\ & = & \hat{p} - \hat{p}^2 \\ & = & \hat{p} (1-\hat{p}) \\ \end{eqnarray*}\] where the second equality uses \(X_i^2 = X_i\) (since each \(X_i \in \{0, 1\}\)) and \(\frac{1}{n}\sum_{i=1}^n X_i = \hat{p}\). QED
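A quick numeric check of this identity (the 0/1 sample below is arbitrary):

```python
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])    # arbitrary Bernoulli samples
p_hat = x.mean()
print(np.isclose(x.var(ddof=0), p_hat * (1 - p_hat)))  # True
```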

Lemma 2

Let \(S\) denote the population variance of \(n\) samples drawn from a Bernoulli distribution. Then \[\begin{eqnarray*} \operatorname{Var}\!\left[{ S}\right] & \le & \frac{\operatorname{E}\!\left[{ S}\right]}{4} \\ \end{eqnarray*}\]

Proof

Define \(\hat{p}\) as in Lemma 1, so that \(S = \hat{p}(1-\hat{p})\). Since \(\hat{p} \in [0, 1]\), the following inequalities must hold. \[\begin{eqnarray*} \hat{p} (1-\hat{p}) & \le & \frac{1}{4} \\ \hat{p}^2 (1-\hat{p})^2 & \le & \frac{\hat{p} (1-\hat{p})}{4} \\ \operatorname{E}\!\left[{ S^2}\right] & \le & \frac{\operatorname{E}\!\left[{ S}\right]}{4} \\ \operatorname{E}\!\left[{ S^2}\right] - \operatorname{E}\!\left[{ S}\right]^2 & \le & \operatorname{E}\!\left[{ S}\right] \left(\frac{1}{4} - \operatorname{E}\!\left[{ S}\right]\right) \\ \operatorname{Var}\!\left[{ S}\right] & \le & \frac{\operatorname{E}\!\left[{ S}\right]}{4} \\ \end{eqnarray*}\] The third line takes expectations of the second; the last line uses \(\operatorname{E}\!\left[{ S}\right]^2 \ge 0\). QED
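A Monte Carlo sketch of the inequality; the values of \(p\), \(n\), and the trial count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 5, 0.3, 200_000             # arbitrary illustrative choices
X = rng.binomial(1, p, size=(trials, n))
S = X.var(axis=1, ddof=0)                  # population variance per trial
print(S.var() <= S.mean() / 4)             # True, per Lemma 2
```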

References

1. Ellerman EC. Sample vs population variance with Bernoulli distributions. https://castedo.com/osa/138/

2. DeGroot MH, Schervish MJ (2002) Probability and statistics, 3rd ed. Addison-Wesley, Boston