Introduction
On a normal distribution, sample variance does not perform better than population variance as a point estimator, evaluated by Mean Squared Error (MSE). Similarly, on a Bernoulli distribution, sample variance does not perform better than population variance [1], evaluated by MSE and by the Probability of Impossible Estimand [1].
However, this underperformance of sample variance does not
extend to the sum of variances of a multivariate distribution.
Depending on the multivariate distribution, below a certain number of samples
and above a certain number of variates (or dimensions), sample variance
achieves lower MSE than population variance. This document provides
simple conditions for when lower MSE is achieved.
In this document, a $d$-dimensional random vector $D=(D_1,D_2,\ldots,D_d)$ defines the multivariate distribution under consideration. All $D_c$ must be jointly independent and must have finite mean and variance. $S_c$ denotes the population variance at variate $D_c$ computed from $n$ samples of the multivariate distribution,
$$
S_c := \frac{1}{n}\sum_{i=1}^{n}\left(X_{i,c}-\frac{1}{n}\sum_{j=1}^{n}X_{j,c}\right)^2
$$
where $X_i=(X_{i,1},X_{i,2},\ldots,X_{i,d})$ is the $i$-th sample from the $d$-dimensional multivariate distribution.
Let $s_*$ equal the estimator expectation averaged across dimensions,
$$
s_* := \frac{1}{d}\sum_{c=1}^{d}E[S_c]
$$
and $v_*$ equal the estimator variance averaged across dimensions,
$$
v_* := \frac{1}{d}\sum_{c=1}^{d}Var[S_c]
$$
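As a rough illustration of these definitions (not part of the original setup), the sketch below estimates $s_*$ and $v_*$ by Monte Carlo. It assumes NumPy; the function names and the Bernoulli example are illustrative choices, not fixed by this document.

```python
import numpy as np

def population_variances(X):
    """Per-component population variance S_c (the 1/n estimator) of an (n, d) sample matrix."""
    return X.var(axis=0, ddof=0)

def s_star_v_star(sample_fn, n, d, reps=20_000, seed=0):
    """Monte Carlo estimates of s_* = average of E[S_c] and v_* = average of Var[S_c]."""
    rng = np.random.default_rng(seed)
    S = np.array([population_variances(sample_fn(rng, n, d)) for _ in range(reps)])
    return S.mean(), S.var(axis=0, ddof=1).mean()

# Example: d = 10 independent Bernoulli(0.3) components observed with n = 5 samples each.
bernoulli = lambda rng, n, d: rng.binomial(1, 0.3, size=(n, d))
print(s_star_v_star(bernoulli, n=5, d=10))
```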
Conditions favoring sample variance
By Corollary 2, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$
\frac{s_*^2\, d}{v_*} > 2n-1
$$
This result applies to any distribution as long as $s_*$ and $v_*$ exist. For a large number of dimensions $d$, provided the ratio $s_*^2/v_*$ is not particularly small, the condition is met and sample variance outperforms population variance.
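A quick way to see the condition in action is a Monte Carlo comparison of the two estimators of the summed variance. The sketch below is only a sanity check under assumed example parameters; the helper name is hypothetical and the NumPy-based setup mirrors the sketch in the Introduction.

```python
import numpy as np

def compare_mse(sample_fn, true_var_sum, n, d, reps=50_000, seed=0):
    """Estimate the MSE of the summed population- and sample-variance estimators
    and evaluate the Corollary 2 condition d * s_*^2 / v_* > 2n - 1."""
    rng = np.random.default_rng(seed)
    S = np.array([sample_fn(rng, n, d).var(axis=0, ddof=0) for _ in range(reps)])
    pop_sum = S.sum(axis=1)            # sum of population variances (1/n estimators)
    samp_sum = pop_sum * n / (n - 1)   # sum of sample variances (1/(n-1) estimators)
    mse_pop = np.mean((pop_sum - true_var_sum) ** 2)
    mse_samp = np.mean((samp_sum - true_var_sum) ** 2)
    s_star, v_star = S.mean(), S.var(axis=0, ddof=1).mean()
    return mse_pop, mse_samp, d * s_star**2 / v_star > 2 * n - 1

# Example: d = 20 independent N(0, 1) components, n = 5 samples; the true variance sum is 20.
normal = lambda rng, n, d: rng.normal(0.0, 1.0, size=(n, d))
print(compare_mse(normal, true_var_sum=20.0, n=5, d=20))
```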
More precise conditions can be found assuming particular distributions such
as Bernoulli and normal distributions.
Multivariate Bernoulli distributions
By Theorem 1, the simple condition
$$
\sum_{c=1}^{d} p_c \ge n+1
$$
implies that sample variance achieves lower MSE than population variance. The $p_c$ are the probabilities of the Bernoulli distributions, after flipping any Bernoulli distribution so that its probability is at most $1/2$. In other words, if any distribution $B_c$ has probability $p_c > 1/2$, the corresponding random variable $B_c$ can be replaced by the flipped random variable $B_c' := 1-B_c$, which has the same variance.
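To make the flipping step concrete, here is a small sketch (illustrative only) that flips any probability above $1/2$ and then tests the condition of Theorem 1. The function name is hypothetical.

```python
import numpy as np

def theorem_1_condition_holds(p, n):
    """After flipping components with p_c > 1/2 (which leaves each variance unchanged),
    check whether the summed probabilities reach n + 1."""
    p = np.asarray(p, dtype=float)
    p_flipped = np.minimum(p, 1.0 - p)   # replace B_c by 1 - B_c when p_c > 1/2
    return p_flipped.sum() >= n + 1

# Example: 30 components with p_c = 0.6 each and n = 10 samples.
# After flipping, the probabilities become 0.4 and sum to 12 >= 11, so the condition holds.
print(theorem_1_condition_holds([0.6] * 30, n=10))
```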
Multivariate normal distributions
For a multivariate normal distribution with equal variances across dimensions, Theorem 4 shows that sample variance achieves lower MSE than population variance if and only if
$$
d > 4+\frac{2}{n-1}
$$
If the variances are only approximately equal, so that all $E[S_c]$ are close to $s_*$ and all $Var[S_c]$ are close to $v_*$, then the condition of Theorem 4 holds only approximately:
$$
d \gtrsim 4+\frac{2}{n-1}
$$
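As a small worked example of this threshold (an illustration, not part of the derivation), the snippet below prints the smallest integer dimension satisfying $d > 4 + 2/(n-1)$ for a few sample sizes.

```python
# Smallest integer dimension d that satisfies d > 4 + 2 / (n - 1).
for n in (2, 3, 5, 10, 100):
    threshold = 4 + 2 / (n - 1)
    d_min = int(threshold) + 1   # smallest integer strictly above the threshold
    print(f"n = {n:3d}: d > {threshold:.3f}, so d >= {d_min}")
```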
Conclusion
As long as there exists a finite average across dimensions of the variances to be estimated and of the variances of the estimates, there is some number of dimensions large enough for the sum of sample variances to be a better estimator than the sum of population variances, as measured by MSE.
In the general case, Corollary 2 provides a simple condition based on averages across dimensions. In the case of multivariate Bernoulli distributions, if the sum of distribution probabilities is at least $n+1$, then sample variance performs better.
For multivariate normal distributions whose variates are independent and approximately identical, sample variance achieves lower MSE once the number of dimensions exceeds $4+\frac{2}{n-1}$, i.e., slightly more than 4 dimensions.
Proofs
Theorem 1
Given $n\ge 2$ independent samples from a multivariate distribution of $d$ jointly independent Bernoulli distributions with probabilities $p_c \le 1/2$ for $c\in\{1,\ldots,d\}$, if
$$
\sum_{c=1}^{d} p_c \ge n+1
$$
then the sum of sample variances has lower MSE than the sum of population variances.
Proof
$$
\begin{aligned}
1+\frac{1}{n-1} &= \frac{n}{n-1} \\
\frac{n}{n-1}(2n-1) &= 2n\left(1+\frac{1}{n-1}\right)-\left(1+\frac{1}{n-1}\right) \\
&= 2n+\frac{2n}{n-1}-\frac{n}{n-1} \\
&= 2n+\frac{n}{n-1} \\
&= 2n+1+\frac{1}{n-1}
\end{aligned}
$$
Let
$$
p_* := \frac{1}{d}\sum_{c=1}^{d}p_c
$$
$$
\begin{aligned}
\sum_{c=1}^{d} p_c &\ge n+1 \\
2 p_*\, d &\ge 2(n+1) \\
&\ge 2n+1+\frac{1}{n-1} \\
&= \frac{n}{n-1}(2n-1) \\
2 p_*\, d\,\frac{n-1}{n} &\ge 2n-1
\end{aligned}
$$
The third line uses $n\ge 2$ (so that $1/(n-1)\le 1$), and the equality that follows is the identity established above.
By Theorem 3,
$$
\frac{s_*^2\, d}{v_*} \ge 2 p_*\, d\,\frac{n-1}{n} \ge 2n-1
$$
Writing $S := \sum_{c=1}^{d} S_c$ for the sum of the population variances and using the joint independence of the components,
$$
\frac{E[S]^2}{Var[S]} = \frac{\left(\sum_{c=1}^{d}E[S_c]\right)^2}{\sum_{c=1}^{d}Var[S_c]} = \frac{(s_*\, d)^2}{v_*\, d} = \frac{s_*^2\, d}{v_*}
$$
From Theorem 2, it follows that the sum of sample variances has lower MSE than the sum of population variances.
QED
Theorem 2
Consider any estimator $F$ of parameter $\theta$ from a specific distribution where
$$
E[F] = \frac{n-1}{n}\theta
$$
It follows that
$$
MSE[F] > MSE\!\left[\frac{n}{n-1}F\right]
$$
if and only if
$$
\frac{E[F]^2}{Var[F]} > 2n-1
$$
Proof
$$
\begin{aligned}
E[F] &= \frac{n-1}{n}\theta \\
\left(1+\frac{1}{n-1}\right)E[F] &= \theta \\
E[F-\theta] &= -\frac{E[F]}{n-1}
\end{aligned}
$$
Each of the following inequalities is equivalent to the one above it:
$$
\begin{aligned}
MSE[F] &> MSE\!\left[\frac{n}{n-1}F\right] \\
Var[F]+E[F-\theta]^2 &> Var\!\left[\frac{n}{n-1}F\right]+0^2 \\
E[F-\theta]^2 &> \left(\frac{n^2}{(n-1)^2}-1\right)Var[F] \\
\left(\frac{E[F]}{n-1}\right)^2 &> \frac{2n-1}{(n-1)^2}Var[F] \\
\frac{E[F]^2}{Var[F]} &> 2n-1
\end{aligned}
$$
QED
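The following Monte Carlo sanity check (not part of the proof) illustrates Theorem 2 with $F$ taken as the population variance of $n$ draws from a standard normal distribution, so $\theta = 1$ and $E[F] = (n-1)/n$. The specific parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 5, 1.0, 200_000
F = rng.normal(size=(reps, n)).var(axis=1, ddof=0)       # F = population variance, E[F] = (n-1)/n
mse_F = np.mean((F - theta) ** 2)
mse_scaled = np.mean((n / (n - 1) * F - theta) ** 2)     # MSE of the sample variance n/(n-1) * F
condition = F.mean() ** 2 / F.var(ddof=1) > 2 * n - 1    # E[F]^2 / Var[F] > 2n - 1
print(mse_F, mse_scaled, condition)
# For a single normal component the condition is False ((n-1)/2 < 2n-1),
# and correspondingly mse_F comes out below mse_scaled.
```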
Corollary 2
Given $n\ge 2$ independent samples from a multivariate distribution of $d$ jointly independent distributions, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$
\frac{s_*^2\, d}{v_*} > 2n-1
$$
where $s_*$ and $v_*$ are defined in the Introduction.
Proof
The population parameter to be estimated is the sum of variances
$$
\theta = \sum_{c=1}^{d}Var[D_c]
$$
$S$ denotes the sum of the population variances of the components,
$$
S := \sum_{c=1}^{d}S_c
$$
The expectation of the population variance [2] at component $c$ is
$$
E[S_c] = \frac{n-1}{n}Var[D_c]
$$
and thus the expectation of the sum of population variances is
$$
E[S] = \frac{n-1}{n}\theta
$$
The expected value and variance of the sum of population variances can be expressed in terms of the averages $s_*$ and $v_*$; the variance of the sum equals the sum of the variances because the components are jointly independent.
$$
\frac{E[S]^2}{Var[S]} = \frac{\left(\sum_{c=1}^{d}E[S_c]\right)^2}{\sum_{c=1}^{d}Var[S_c]} = \frac{(s_*\, d)^2}{v_*\, d} = \frac{s_*^2\, d}{v_*}
$$
Thus by Theorem 2 the sum of sample variances has lower MSE than the sum of population variances if and only if
$$
\frac{s_*^2\, d}{v_*} > 2n-1
$$
QED
Theorem 3
Given independent samples $X_1,\ldots,X_n$ from a multivariate distribution of $d$ Bernoulli distributions with probabilities $p_c \le 1/2$ for $c\in\{1,\ldots,d\}$,
$$
\frac{s_*^2}{v_*} \ge 2 p_*\,\frac{n-1}{n}
$$
where $s_*$ and $v_*$ are defined in the Introduction and $p_*$ denotes the average of the Bernoulli probabilities across components. More formally,
$$
p_* := \frac{1}{d}\sum_{c=1}^{d}p_c
$$
Proof
Define $S_c$ as in the Introduction and
$$
\hat{p}_c := \frac{1}{n}\sum_{i=1}^{n}X_{i,c}
$$
where $X_i=(X_{i,1},X_{i,2},\ldots,X_{i,d})$ is the $i$-th sample from the $d$-dimensional multivariate distribution.
By Lemma 1, the population variance $S_c$ equals $\hat{p}_c(1-\hat{p}_c)$.
By Lemma 2,
$$
\begin{aligned}
\frac{1}{d}\sum_{c=1}^{d}\frac{E[S_c]}{4} &\ge \frac{1}{d}\sum_{c=1}^{d}Var[S_c] \\
\frac{1}{4}s_* &\ge v_* \\
\frac{s_*^2}{v_*} &\ge 4 s_*
\end{aligned}
$$
From the expectation of the population variance, the variance of a Bernoulli distribution, and $p_c \le 1/2$ (so that $1-p_c \ge 1/2$), for all $c$ we have
$$
E[S_c] = \frac{n-1}{n}p_c(1-p_c) \ge \frac{n-1}{n}\frac{p_c}{2}
$$
Combining these results gives
$$
\begin{aligned}
\frac{1}{d}\sum_{c=1}^{d}E[S_c] &\ge \frac{1}{d}\sum_{c=1}^{d}\frac{n-1}{n}\frac{p_c}{2} \\
s_* &\ge \frac{n-1}{n}\frac{p_*}{2} \\
4 s_* &\ge 2\,\frac{n-1}{n}\,p_* \\
\frac{s_*^2}{v_*} &\ge 2 p_*\,\frac{n-1}{n}
\end{aligned}
$$
QED
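A brief Monte Carlo spot check of the bound in Theorem 3 (illustrative only, with arbitrarily chosen probabilities all at most $1/2$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000
p = np.array([0.1, 0.25, 0.4, 0.5])                                  # d = 4 components, all p_c <= 1/2
S = rng.binomial(1, p, size=(reps, n, p.size)).var(axis=1, ddof=0)   # S_c for each repetition
s_star, v_star = S.mean(), S.var(axis=0, ddof=1).mean()
print(s_star**2 / v_star, ">=", 2 * p.mean() * (n - 1) / n)
```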
Theorem 4
Given $n\ge 2$ independent samples from a multivariate distribution of $d$ jointly independent normal distributions, all with variance $\sigma^2$, the sum of sample variances has lower MSE than the sum of population variances if and only if
$$
d > 4+\frac{2}{n-1}
$$
Proof
Consider any $S_c$, the population variance of the $n$ samples at component $c$. $nS_c/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom [2], and thus
$$
\begin{aligned}
E[nS_c/\sigma^2] &= n-1 \\
E[S_c] &= \frac{n-1}{n}\sigma^2 \\
Var[nS_c/\sigma^2] &= 2(n-1) \\
Var[S_c] &= \frac{2(n-1)}{n^2}\sigma^4
\end{aligned}
$$
and thus
$$
\frac{s_*^2}{v_*} = \frac{\left(\frac{n-1}{n}\sigma^2\right)^2}{\frac{2(n-1)}{n^2}\sigma^4} = \frac{n-1}{2}
$$
which means the inequality of Corollary 2 can be rewritten as
$$
\frac{n-1}{2}\, d > 2n-1
$$
and further simplified to
$$
\begin{aligned}
d &> 2\left(\frac{n}{n-1}+\frac{n-1}{n-1}\right) \\
d &> 4+\frac{2}{n-1}
\end{aligned}
$$
QED
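The chi-squared moments used above can be checked numerically; the sketch below compares Monte Carlo estimates against $E[S_c]$, $Var[S_c]$, and the ratio $(n-1)/2$. The parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 6, 2.0, 300_000
S = rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=0)   # population variance S_c
print(S.mean(), (n - 1) / n * sigma**2)                 # E[S_c] = (n-1)/n * sigma^2
print(S.var(ddof=1), 2 * (n - 1) / n**2 * sigma**4)     # Var[S_c] = 2(n-1)/n^2 * sigma^4
print(S.mean()**2 / S.var(ddof=1), (n - 1) / 2)         # s_*^2 / v_* = (n-1)/2
```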
Lemma 1
$\hat{p}(1-\hat{p})$ equals the population variance of samples $X_1,\ldots,X_n$ from a Bernoulli distribution (taking values 0 or 1) where
$$
\hat{p} := \frac{1}{n}\sum_{i=1}^{n}X_i
$$
Proof
$$
\frac{1}{n}\sum_{i=1}^{n}(X_i-\hat{p})^2
= \frac{1}{n}\sum_{i=1}^{n}X_i^2 - 2\hat{p}\,\frac{1}{n}\sum_{i=1}^{n}X_i + \hat{p}^2
= \frac{1}{n}\sum_{i=1}^{n}X_i - \hat{p}^2
= \hat{p}-\hat{p}^2
= \hat{p}(1-\hat{p})
$$
where the second equality uses $X_i^2 = X_i$ for $X_i\in\{0,1\}$.
QED
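Lemma 1 is an exact algebraic identity, so it can be verified on any 0/1 sample; a one-off check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=20)                      # 0/1 samples from a Bernoulli distribution
p_hat = x.mean()
assert np.isclose(x.var(ddof=0), p_hat * (1 - p_hat))  # population variance equals p_hat * (1 - p_hat)
```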
Lemma 2
Let $S$ denote the population variance of samples drawn from a Bernoulli distribution. Then
$$
Var[S] \le \frac{E[S]}{4}
$$
Proof
Define $\hat{p}$ as in Lemma 1, so that $S=\hat{p}(1-\hat{p})$ by Lemma 1. Since $\hat{p}\in[0,1]$, the following inequalities must hold.
$$
\begin{aligned}
\hat{p}(1-\hat{p}) &\le \frac{1}{4} \\
\hat{p}^2(1-\hat{p})^2 &\le \frac{\hat{p}(1-\hat{p})}{4} \\
E[S^2] &\le \frac{E[S]}{4} \\
E[S^2]-E[S]^2 &\le E[S]\left(\frac{1}{4}-E[S]\right) \\
Var[S] &\le \frac{E[S]}{4}
\end{aligned}
$$
QED
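A quick Monte Carlo spot check of Lemma 2 across several Bernoulli probabilities (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 5, 200_000
for p in (0.1, 0.3, 0.5, 0.8):
    S = rng.binomial(1, p, size=(reps, n)).var(axis=1, ddof=0)   # population variance S
    print(p, S.var(ddof=1) <= S.mean() / 4)                      # Var[S] <= E[S] / 4
```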
References
2. DeGroot MH, Schervish MJ. Probability and statistics. 3rd ed. Boston: Addison-Wesley; 2002.