Introduction

For a normal distribution, sample variance does not perform better than population variance as a point estimator, evaluated by Mean Squared Error (MSE). Similarly, for a Bernoulli distribution, sample variance does not perform better than population variance as a point estimator, whether evaluated by MSE or by Probability of Impossible Estimand [1].

However, this underperformance of sample variance does not extend to estimating the sum of the variances of a multivariate distribution. Depending on the multivariate distribution, below a certain number of samples and above a certain number of variates (or dimensions), the sum of sample variances achieves lower MSE than the sum of population variances. This document provides simple conditions for when the lower MSE is achieved.

Formal Setup

In this document, a $d$-dimensional random vector $D = (D_1, D_2, ..., D_d)$ defines the multivariate distribution under consideration. All $D_c$ must be jointly independent and must have finite mean and variance. $S_c$ denotes the population variance at variate $D_c$ with $n$ samples from the multivariate distribution:
$$ S_c := \frac{1}{n} \sum_{i=1}^n \left( X_{i,c} - \frac{1}{n} \sum_{j=1}^n X_{j,c} \right)^2 $$
where $X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})$ is the $i$-th sample from the $d$-dimensional multivariate distribution. Let $s_*$ equal the estimator expectation averaged across dimensions,
$$ s_* := \frac{1}{d} \sum_{c=1}^d \operatorname{E}\!\left[S_c\right] $$
and $v_*$ equal the estimator variance averaged across dimensions,
$$ v_* := \frac{1}{d} \sum_{c=1}^d \operatorname{Var}\!\left[S_c\right] $$
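As a concrete illustration of these definitions, here is a minimal numpy sketch (not part of the original derivation; the normal example distribution, seed, and sizes are arbitrary assumptions). It computes each $S_c$ with the biased estimator and estimates $s_*$ and $v_*$ by Monte Carlo.

```python
import numpy as np

# Illustration of the setup: S_c is the biased ("population") variance
# estimator of component c; s_* and v_* average its expectation and variance
# across the d components (estimated here by Monte Carlo over many trials).
rng = np.random.default_rng(0)
n, d, trials = 5, 8, 20_000
true_var = rng.uniform(0.5, 2.0, size=d)            # arbitrary example variances

S = np.empty((trials, d))
for t in range(trials):
    X = rng.normal(0.0, np.sqrt(true_var), size=(n, d))  # n samples of a d-vector
    S[t] = np.var(X, axis=0, ddof=0)                 # S_c for each component c

s_star = S.mean(axis=0).mean()                       # average of E[S_c] over c
v_star = S.var(axis=0, ddof=1).mean()                # average of Var[S_c] over c
print(s_star, v_star)
```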

Conditions favoring sample variance

By Corollary 2, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$ \frac{s_*^2}{v_*} d > 2n-1 $$

This result applies to any distribution as long as $s_*$ and $v_*$ exist. For a large number of dimensions, and an average estimand that is not particularly small relative to the estimator variance, the condition is met and the sum of sample variances outperforms the sum of population variances.
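The condition can also be checked empirically. The sketch below is a Monte Carlo illustration with numpy; the independent normal distribution, per-component variances, and sizes are arbitrary assumptions, not taken from the source. It evaluates the condition and compares the two MSEs on the same simulated data.

```python
import numpy as np

# Monte Carlo comparison of the sum of population variances (ddof=0) and the
# sum of sample variances (ddof=1) against the condition (s_*^2 / v_*) d > 2n-1.
rng = np.random.default_rng(1)
n, d, trials = 4, 50, 20_000
true_var = rng.uniform(0.5, 1.5, size=d)
theta = true_var.sum()                               # estimand: sum of variances

X = rng.normal(0.0, np.sqrt(true_var), size=(trials, n, d))
sum_pop = np.var(X, axis=1, ddof=0).sum(axis=1)      # sum of population variances
sum_sam = np.var(X, axis=1, ddof=1).sum(axis=1)      # sum of sample variances
mse_pop = np.mean((sum_pop - theta) ** 2)
mse_sam = np.mean((sum_sam - theta) ** 2)

S_c = np.var(X, axis=1, ddof=0)                      # per-component estimates
s_star = S_c.mean(axis=0).mean()
v_star = S_c.var(axis=0, ddof=1).mean()
print("condition met:  ", s_star**2 / v_star * d > 2 * n - 1)
print("sample sum wins:", mse_sam < mse_pop)
```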

More precise conditions can be found assuming particular distributions such as Bernoulli and normal distributions.

Multivariate Bernoulli distributions

By Theorem 1, the simple condition
$$ \sum_{c=1}^d p_c \ge n + 1 $$
implies that the sum of sample variances achieves lower MSE than the sum of population variances. The $p_c$ are the probabilities of the Bernoulli distributions, after flipping any Bernoulli distributions so that all probabilities are at most $1/2$. In other words, if any distribution $B_c$ has probability $p_c > 1/2$, the corresponding random variable $B_c$ is replaced by the flipped random variable $B'_c := 1 - B_c$; flipping does not change the variance.
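A quick numeric illustration (numpy; the probabilities, $n$, and trial count below are arbitrary assumptions) flips the probabilities, checks the condition, and confirms the MSE ordering by simulation.

```python
import numpy as np

# Check sum(p_c) >= n + 1 after flipping p_c > 1/2, then compare MSEs of the
# two variance sums by Monte Carlo for a multivariate Bernoulli distribution.
rng = np.random.default_rng(2)
n = 3
p = rng.uniform(0.2, 0.9, size=40)                   # example Bernoulli probabilities
p_flipped = np.minimum(p, 1.0 - p)                   # flipping leaves each variance unchanged
print("condition met:", p_flipped.sum() >= n + 1)

theta = np.sum(p * (1.0 - p))                        # sum of the true variances
X = rng.binomial(1, p, size=(20_000, n, p.size))
mse_pop = np.mean((np.var(X, axis=1, ddof=0).sum(axis=1) - theta) ** 2)
mse_sam = np.mean((np.var(X, axis=1, ddof=1).sum(axis=1) - theta) ** 2)
print("sample sum wins:", mse_sam < mse_pop)
```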

Multivariate normal distributions

For a multivariate normal distribution with jointly independent components and equal variances across dimensions, Theorem 4 shows that the sum of sample variances achieves lower MSE than the sum of population variances if and only if
$$ d > 4 + \frac{2}{n-1} $$

If the variances are not exactly equal but all $\operatorname{E}\!\left[S_c\right]$ are close to $s_*$ and all $\operatorname{Var}\!\left[S_c\right]$ are close to $v_*$, then the inequality condition of Theorem 4 holds only approximately:
$$ d \gtrapprox 4 + \frac{2}{n-1} $$
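The threshold can be probed numerically. The sketch below (numpy; i.i.d. standard-normal components and the particular $n$ are assumptions made only for illustration) compares the two MSEs for dimensions just below and above $4 + \frac{2}{n-1}$.

```python
import numpy as np

# With n = 4 the threshold is 4 + 2/3, so the sum of sample variances should
# start winning at d = 5.  Trial counts may need to grow near the threshold.
rng = np.random.default_rng(3)
n, trials = 4, 100_000
threshold = 4 + 2 / (n - 1)

for d in (4, 5, 6):
    X = rng.normal(0.0, 1.0, size=(trials, n, d))
    theta = float(d)                                 # each component has variance 1
    mse_pop = np.mean((np.var(X, axis=1, ddof=0).sum(axis=1) - theta) ** 2)
    mse_sam = np.mean((np.var(X, axis=1, ddof=1).sum(axis=1) - theta) ** 2)
    print(d, d > threshold, mse_sam < mse_pop)
```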

Conclusion

As long as there exist finite averages across dimensions of the variances to be estimated and of the variances of the estimates, there is some number of dimensions large enough for the sum of sample variances to be a better estimator than the sum of population variances, as measured by MSE.

In the general case, Corollary 2 provides a simple condition based on averages across dimensions. In the case of multivariate Bernoulli distributions, if the sum of the distribution probabilities (after flipping so that each is at most $1/2$) is at least $n+1$, then the sum of sample variances performs better. For multivariate normal distributions whose variates are independent and approximately identical, the sum of sample variances achieves lower MSE once the number of dimensions exceeds $4 + \frac{2}{n-1}$, that is, beyond roughly $4$ dimensions.

Proofs

Theorem 1

Given $n \ge 2$ independent samples from a multivariate distribution of $d$ jointly independent Bernoulli distributions with probabilities $p_c \le 1/2$ for $c \in \{1, ..., d\}$, if
$$ \sum_{c=1}^d p_c \ge n + 1 $$
then the sum of sample variances has lower MSE than the sum of population variances.

Proof

$$ \begin{aligned} 1 + \frac{1}{n-1} & = \frac{n}{n-1} \\ 2n + 2\left(1 + \frac{1}{n-1}\right) & = 2n + 2\,\frac{n}{n-1} \\ 2n + \left(1 + \frac{1}{n-1}\right) & = 2n \left(1 + \frac{1}{n-1}\right) - \left(1 + \frac{1}{n-1}\right) \\ 2n + 1 + \frac{1}{n-1} & = \frac{n}{n-1}(2n-1) \end{aligned} $$

Let
$$ p_* := \frac{1}{d} \sum_{c=1}^d p_c $$

$$ \begin{aligned} \sum_{c=1}^d p_c & \ge n + 1 \\ 2 p_* d & \ge 2(n + 1) \\ & \ge 2n + 1 + \frac{1}{n-1} \\ & \ge \frac{n}{n-1} (2n - 1) \\ 2 p_* d\, \frac{n-1}{n} & \ge 2n - 1 \end{aligned} $$

By Theorem 3,
$$ \begin{aligned} \frac{s_*^2}{v_*} d & \ge 2 p_* d\, \frac{n-1}{n} \\ & \ge 2n - 1 \end{aligned} $$

Given that, with $S := \sum_{c=1}^d S_c$ denoting the sum of the population variances,
$$ \frac{\operatorname{E}\!\left[S\right]^2}{\operatorname{Var}\!\left[S\right]} = \frac{\left(\sum_{c=1}^d \operatorname{E}\!\left[S_c\right]\right)^2}{\sum_{c=1}^d \operatorname{Var}\!\left[S_c\right]} = \frac{\left(s_* d\right)^2}{v_* d} = \frac{s_*^2}{v_*} d $$

From Theorem 2, it follows that the sum of sample variances has lower MSE than the sum of population variances.

QED

Theorem 2

Consider any estimator $F$ of parameter $\theta$ from a specific distribution where
$$ \operatorname{E}\!\left[F\right] = \frac{n-1}{n} \theta $$
It follows that
$$ \operatorname{MSE}\!\left[F\right] > \operatorname{MSE}\!\left[\frac{n}{n-1} F\right] $$
if and only if
$$ \frac{\operatorname{E}\!\left[F\right]^2}{\operatorname{Var}\!\left[F\right]} > 2n-1 $$

Proof

$$ \begin{aligned} \operatorname{E}\!\left[F\right] & = \frac{n-1}{n} \theta \\ \left( 1 + \frac{1}{n-1} \right) \operatorname{E}\!\left[F\right] & = \theta \\ \operatorname{E}\!\left[F - \theta\right] & = - \frac{\operatorname{E}\!\left[F\right]}{n-1} \end{aligned} $$

Each inequality below is equivalent to the one that follows, which establishes the stated equivalence.
$$ \begin{aligned} \operatorname{MSE}\!\left[F\right] & > \operatorname{MSE}\!\left[\frac{n}{n-1} F\right] \\ \operatorname{Var}\!\left[F\right] + \operatorname{E}\!\left[F-\theta\right]^2 & > \operatorname{Var}\!\left[\frac{n}{n-1}F\right] + 0^2 \\ \operatorname{E}\!\left[F-\theta\right]^2 & > \left(\frac{n^2}{(n-1)^2} - 1 \right) \operatorname{Var}\!\left[F\right] \\ \left(\frac{\operatorname{E}\!\left[F\right]}{n-1}\right)^2 & > \frac{2n-1}{(n-1)^2} \operatorname{Var}\!\left[F\right] \\ \frac{\operatorname{E}\!\left[F\right]^2}{\operatorname{Var}\!\left[F\right]} & > 2n-1 \end{aligned} $$

QED
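A numeric sanity check of the equivalence, taking $F$ to be the one-dimensional population variance of normal samples (an illustrative assumption; any estimator with $\operatorname{E}[F] = \frac{n-1}{n}\theta$ would do):

```python
import numpy as np

# For normal samples, E[F]^2 / Var[F] = (n-1)/2, which is below 2n - 1, so
# Theorem 2 predicts MSE[F] <= MSE[n/(n-1) F]; both checks should agree.
rng = np.random.default_rng(4)
n, sigma2, trials = 5, 1.0, 200_000
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
F = np.var(X, axis=1, ddof=0)                        # biased variance estimator

ratio = F.mean() ** 2 / F.var(ddof=1)                # estimates E[F]^2 / Var[F]
mse_F = np.mean((F - sigma2) ** 2)
mse_scaled = np.mean((n / (n - 1) * F - sigma2) ** 2)
print(ratio > 2 * n - 1, mse_F > mse_scaled)         # both False in this example
```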

Corollary 2

Given $n \ge 2$ independent samples from a multivariate distribution of $d$ jointly independent distributions, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$ \frac{s_*^2}{v_*} d > 2n-1 $$
where $s_*$ and $v_*$ are defined in the section 'Formal Setup'.

Proof

The population parameter to be estimated is the sum of variances
$$ \theta = \sum_{c=1}^d \operatorname{Var}\!\left[D_c\right] $$

$S$ denotes the sum of the population variances of the components:
$$ S := \sum_{c=1}^d S_c $$

The expectation of the population variance [2] at component $c$ is
$$ \operatorname{E}\!\left[S_c\right] = \frac{n-1}{n} \operatorname{Var}\!\left[D_c\right] $$
and thus the expectation of the sum of population variances is
$$ \operatorname{E}\!\left[S\right] = \frac{n-1}{n} \theta $$

Because the components are jointly independent, $\operatorname{Var}\!\left[S\right] = \sum_{c=1}^d \operatorname{Var}\!\left[S_c\right]$, so the expected value and variance of the sum of population variances can be expressed in terms of the averages:
$$ \frac{\operatorname{E}\!\left[S\right]^2}{\operatorname{Var}\!\left[S\right]} = \frac{\left(\sum_{c=1}^d \operatorname{E}\!\left[S_c\right]\right)^2}{\sum_{c=1}^d \operatorname{Var}\!\left[S_c\right]} = \frac{\left(s_* d\right)^2}{v_* d} = \frac{s_*^2}{v_*} d $$

Since the sample variance at each component equals $\frac{n}{n-1} S_c$, the sum of sample variances equals $\frac{n}{n-1} S$. Thus, by Theorem 2, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$ \frac{s_*^2}{v_*} d > 2n-1 $$

QED
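The averaging identity in the last step is elementary; a tiny stand-in check (numpy, with made-up values playing the roles of $\operatorname{E}[S_c]$ and $\operatorname{Var}[S_c]$) makes the bookkeeping explicit.

```python
import numpy as np

# (sum of expectations)^2 / (sum of variances) == (s_*^2 / v_*) * d.
rng = np.random.default_rng(5)
d = 7
e = rng.uniform(0.1, 1.0, size=d)      # stand-ins for E[S_c]
v = rng.uniform(0.1, 1.0, size=d)      # stand-ins for Var[S_c]
print(np.isclose(e.sum() ** 2 / v.sum(), e.mean() ** 2 / v.mean() * d))  # True
```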

Theorem 3

Given independent samples $X_1$, …, $X_n$ from a multivariate distribution of $d$ Bernoulli distributions with probabilities $p_c \le 1/2$ for $c \in \{1, ..., d\}$,
$$ \frac{s_*^2}{v_*} \ge 2 p_* \frac{n-1}{n} $$
where $s_*$ and $v_*$ are defined in the section 'Formal Setup' and $p_*$ denotes the average across components of the Bernoulli distribution probabilities. More formally,
$$ p_* := \frac{1}{d} \sum_{c=1}^d p_c $$

Proof

Define $S_c$ as in the section 'Formal Setup' and
$$ \hat{p}_c := \frac{1}{n} \sum_{i=1}^n X_{i,c} $$
where $X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})$ is the $i$-th sample from the $d$-dimensional multivariate distribution. By Lemma 1, $S_c = \hat{p}_c (1-\hat{p}_c)$, so it is accurate to call $\hat{p}_c (1-\hat{p}_c)$ the population variance at component $c$.

By Lemma 2,
$$ \begin{aligned} \frac{1}{d} \sum_{c=1}^d \frac{\operatorname{E}\!\left[S_c\right]}{4} & \ge \frac{1}{d} \sum_{c=1}^d \operatorname{Var}\!\left[S_c\right] \\ \frac{1}{4} s_* & \ge v_* \\ \frac{s_*^2}{v_*} & \ge 4 s_* \end{aligned} $$

From the expectation of the population variance, the variance of a Bernoulli distribution, and $p_c \le 1/2$, for all $c$ we have
$$ \begin{aligned} \operatorname{E}\!\left[S_c\right] & = \frac{n-1}{n} p_c(1-p_c) \\ & \ge \frac{n-1}{n} \frac{p_c}{2} \end{aligned} $$

Combining results gives
$$ \begin{aligned} \frac{1}{d} \sum_{c=1}^d \operatorname{E}\!\left[S_c\right] & \ge \frac{1}{d} \sum_{c=1}^d \frac{n-1}{n} \frac{p_c}{2} \\ s_* & \ge \frac{n-1}{2n} p_* \\ 4 s_* & \ge 2\,\frac{n-1}{n} p_* \\ \frac{s_*^2}{v_*} & \ge 2 p_* \frac{n-1}{n} \end{aligned} $$

QED
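A Monte Carlo spot-check of the inequality (numpy; the particular $n$, $d$, and probabilities are arbitrary assumptions):

```python
import numpy as np

# Estimate s_* and v_* by simulation for random p_c <= 1/2 and compare
# s_*^2 / v_* against the bound 2 p_* (n-1)/n from Theorem 3.
rng = np.random.default_rng(6)
n, d, trials = 4, 10, 100_000
p = rng.uniform(0.05, 0.5, size=d)

X = rng.binomial(1, p, size=(trials, n, d))
S = np.var(X, axis=1, ddof=0)                        # per-component population variances
s_star = S.mean(axis=0).mean()
v_star = S.var(axis=0, ddof=1).mean()
print(s_star**2 / v_star >= 2 * p.mean() * (n - 1) / n)   # expected True
```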

Theorem 4

Given $n \ge 2$ samples from a multivariate distribution of $d$ jointly independent normal distributions, all with variance $\sigma^2$, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$ d > 4 + \frac{2}{n-1} $$

Proof

Consider any $S_c$ as the population variance of the $n$ samples at component $c$. $n S_c / \sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom [2], and thus
$$ \begin{aligned} \operatorname{E}\!\left[n S_c / \sigma^2\right] & = n-1 \\ \operatorname{E}\!\left[S_c\right] & = \frac{n-1}{n} \sigma^2 \\ \operatorname{Var}\!\left[n S_c / \sigma^2\right] & = 2(n-1) \\ \operatorname{Var}\!\left[S_c\right] & = \frac{2(n-1)}{n^2} \sigma^4 \end{aligned} $$
and thus
$$ \frac{s_*^2}{v_*} = \frac{\left( \frac{n-1}{n} \sigma^2 \right)^2}{ \frac{2(n-1)}{n^2} \sigma^4} = \frac{n-1}{2} $$
which means the inequality of Corollary 2 can be rewritten as
$$ \frac{n-1}{2} d > 2n-1 $$
and further simplified to
$$ \begin{aligned} d & > 2 \left( \frac{n}{n-1} + \frac{n-1}{n-1} \right) \\ d & > 4 + \frac{2}{n-1} \end{aligned} $$

QED
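The chi-squared moments used above can be checked by simulation (numpy; $n$, $\sigma^2$, and the trial count are illustrative assumptions).

```python
import numpy as np

# n S_c / sigma^2 should have mean n-1 and variance 2(n-1), so
# s_*^2 / v_* should come out near (n-1)/2.
rng = np.random.default_rng(7)
n, sigma2, trials = 6, 2.0, 300_000
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
S = np.var(X, axis=1, ddof=0)
Q = n * S / sigma2                                   # ~ chi-squared, n-1 dof
print(Q.mean(), Q.var(ddof=1))                       # approx 5 and 10 for n = 6
print(S.mean() ** 2 / S.var(ddof=1), (n - 1) / 2)    # both approx 2.5
```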

Lemma 1

$\hat{p} (1-\hat{p})$ equals the population variance of samples $X_1$, …, $X_n$ from a Bernoulli distribution (taking values $0$ or $1$), where
$$ \hat{p} := \frac{1}{n} \sum_{i=1}^n X_i $$

Proof

Since each $X_i$ is $0$ or $1$, $X_i^2 = X_i$, and therefore
$$ \begin{aligned} \frac{1}{n} \sum_{i=1}^n \left(X_i - \hat{p} \right)^2 & = \frac{1}{n} \sum_{i=1}^n X_i^2 - 2 \hat{p}\,\frac{1}{n} \sum_{i=1}^n X_i + \hat{p}^2 \\ & = \frac{1}{n} \sum_{i=1}^n X_i - \hat{p}^2 \\ & = \hat{p} - \hat{p}^2 \\ & = \hat{p} (1-\hat{p}) \end{aligned} $$

QED
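Lemma 1 in code (numpy; the probability and sample size are arbitrary):

```python
import numpy as np

# For 0/1 samples, p_hat * (1 - p_hat) equals the ddof=0 ("population") variance.
rng = np.random.default_rng(8)
x = rng.binomial(1, 0.3, size=10)
p_hat = x.mean()
print(np.isclose(p_hat * (1 - p_hat), np.var(x)))    # True
```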

Lemma 2

Let $S$ denote the population variance of samples drawn from a Bernoulli distribution. Then
$$ \operatorname{Var}\!\left[S\right] \le \frac{\operatorname{E}\!\left[S\right]}{4} $$

Proof

Define $\hat{p}$ as in Lemma 1, so that $S = \hat{p}(1-\hat{p})$. Since $\hat{p} \in [0, 1]$, the following inequalities must hold.
$$ \begin{aligned} \hat{p} (1-\hat{p}) & \le \frac{1}{4} \\ \hat{p}^2 (1-\hat{p})^2 & \le \frac{\hat{p} (1-\hat{p})}{4} \\ \operatorname{E}\!\left[S^2\right] & \le \frac{\operatorname{E}\!\left[S\right]}{4} \\ \operatorname{E}\!\left[S^2\right] - \operatorname{E}\!\left[S\right]^2 & \le \operatorname{E}\!\left[S\right] \left(\frac{1}{4} - \operatorname{E}\!\left[S\right]\right) \\ \operatorname{Var}\!\left[S\right] & \le \frac{\operatorname{E}\!\left[S\right]}{4} \end{aligned} $$

QED
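A Monte Carlo spot-check of the bound over a few illustrative values of $n$ and $p$ (numpy; the grid and the small slack for simulation noise are assumptions of this sketch):

```python
import numpy as np

# Var[S] <= E[S]/4 for the population variance S of Bernoulli samples.
rng = np.random.default_rng(9)
trials = 200_000
for n in (2, 5, 20):
    for p in (0.05, 0.3, 0.5):
        X = rng.binomial(1, p, size=(trials, n))
        S = np.var(X, axis=1, ddof=0)
        assert S.var(ddof=1) <= S.mean() / 4 + 1e-3  # slack for Monte Carlo noise
print("bound held on every tested (n, p)")
```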

References

1. Ellerman EC. Sample vs population variance with Bernoulli distributions. Available: https://castedo.com/osa/138/
2. DeGroot MH, Schervish MJ. Probability and statistics. 3rd ed. Boston: Addison-Wesley; 2002.