NOTE: A newly published article on FST [1] looks like
a promising source of a better formal definition.
The name FST was used as early as [2] for a certain
measure between populations. In the many decades since, the name has been
used with slightly different meanings. The recent publication
[3] covers a long list of different FST meanings in
many papers over the decades.
Most of these papers cover additional concepts that are not required to define
FST or to gain intuition about it, and the mathematical details behind a formal
definition are spread across many papers.
This document gives a simple formal definition of FST, equivalent to
[3] and [4]. This simple definition
also makes clear how FST is precisely a ratio of variances.
The Definition
Random variables AS, AT and D model uncertainty for FST:
AS and AT for the allele found in a random gamete
from the “Sub-population” and “Top” population, respectively
D for the random descent, drift, or divergence
of the “sub-population” from the “top” population
“Top” population can mean ancestral population (as in
[3]), or it can mean “total” population
(the original meaning in [2]).
Given assumptions
AS and AT take values of 0 or 1
AT and D are independent
E(AS)=E(AT)
the definition follows
$$F_{ST} := \frac{\operatorname{Var}(E(A_S \mid D))}{\operatorname{Var}(A_T)}$$
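As a minimal sketch of the definition, consider a toy model that is an assumption of this example and not taken from the cited papers: D is uniform over a short list of outcomes, and the hypothetical argument `sub_freqs` holds E(AS∣D) for each outcome. The sketch uses exact rational arithmetic and the fact that a 0/1 variable with mean p has variance p(1−p).

```python
from fractions import Fraction


def fst_from_sub_freqs(sub_freqs):
    """Exact FST for a toy model (an assumption of this sketch):
    D is uniform over len(sub_freqs) outcomes and sub_freqs[k] = E(AS | D = k)."""
    qs = [Fraction(q) for q in sub_freqs]
    k = len(qs)
    # p = E(E(AS | D)) = E(AS), which the assumptions set equal to E(AT).
    p = sum(qs) / k
    # Numerator of the definition: Var(E(AS | D)).
    var_between = sum((q - p) ** 2 for q in qs) / k
    # Denominator: Var(AT) = p(1 - p) because AT takes values 0 or 1.
    var_top = p * (1 - p)
    # Raises ZeroDivisionError when p is 0 or 1, where FST is undefined.
    return var_between / var_top


fst = fst_from_sub_freqs([Fraction(1, 4), Fraction(3, 4)])
print(fst)  # prints 1/4
```

When every entry of `sub_freqs` is the same, the numerator vanishes and the sketch returns 0, matching the intuition that no divergence means no variance explained.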
Convenient Expectations
Conveniently, expectations of allele variables are allele frequencies.
The following variable definition will be convenient
p:=E(AT)
Due to the assumptions of FST the following are conveniently true
$$\begin{aligned}
p &= E(A_S) & p &= E(A_T^2) \\
E(A_S) &= E(A_S^2) & E(A_S \mid D) &= E(A_S^2 \mid D)
\end{aligned}$$
and
$$\operatorname{Var}(A_T) = E(A_T^2) - E(A_T)^2 = p - p^2 = p(1 - p)$$
FST as variance explained or uncertainty reduced
In light of the following theorem, FST can be interpreted as the fraction of
allele variance explained by random descent/drift/divergence. Alternatively,
FST can be interpreted as the fraction of allele uncertainty removed by knowing
the descent/drift/divergence.
Theorem 1
$$\operatorname{Var}(A_T) = \operatorname{Var}(E(A_S \mid D)) + E(\operatorname{Var}(A_S \mid D))$$
Proof
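$$\begin{aligned}
\operatorname{Var}(E(A_S \mid D)) &= E(E(A_S \mid D)^2) - E(E(A_S \mid D))^2 = E(E(A_S \mid D)^2) - p^2 \\
E(\operatorname{Var}(A_S \mid D)) &= E(E(A_S^2 \mid D) - E(A_S \mid D)^2) = p - E(E(A_S \mid D)^2)
\end{aligned}$$
Adding the two lines gives $p - p^2 = p(1 - p) = \operatorname{Var}(A_T)$.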
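The decomposition in Theorem 1 can be checked exactly on a small example. The sketch below assumes a toy model that is not part of the source: D is uniform over a few outcomes, and the hypothetical list `sub_freqs` gives E(AS∣D) for each outcome, so that Var(AS∣D) is q(1−q) for the corresponding frequency q.

```python
from fractions import Fraction


def variance_decomposition(sub_freqs):
    """Return (Var(AT), Var(E(AS|D)), E(Var(AS|D))) for a toy model
    (an assumption of this sketch) where D is uniform over the entries
    of sub_freqs and each entry is E(AS | D) for one outcome of D."""
    qs = [Fraction(q) for q in sub_freqs]
    k = len(qs)
    p = sum(qs) / k                                    # p = E(AS) = E(AT)
    var_top = p * (1 - p)                              # Var(AT)
    var_of_cond_mean = sum((q - p) ** 2 for q in qs) / k   # Var(E(AS|D))
    mean_of_cond_var = sum(q * (1 - q) for q in qs) / k    # E(Var(AS|D))
    return var_top, var_of_cond_mean, mean_of_cond_var


total, between, within = variance_decomposition(
    [Fraction(1, 10), Fraction(1, 2), Fraction(3, 5)]
)
print(total, between, within)  # prints 6/25 7/150 29/150
assert total == between + within
```

In this example FST is the "between" share of the total, (7/150)/(6/25) = 7/36.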
Unbiased Estimators
Consider observing two independent random descents/drifts/divergences D1
and D2 under the assumptions for D of FST.
Furthermore, for each j∈{1,2},
consider observing nj independent random gametes within each resulting
sub-population. Define n1+n2 independent observed alleles AS,j,i
with i indexing sampled gametes within each sampled sub-population resulting
from the independent descents/drifts/divergences. For convenience define the
following:
$$\hat p_1 := \frac{1}{n_1} \sum_{i=1}^{n_1} A_{S,1,i} \qquad \hat p_2 := \frac{1}{n_2} \sum_{i=1}^{n_2} A_{S,2,i}$$
The “Hudson” estimator of FST is defined in [3]
as
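$$\frac{(\hat p_1 - \hat p_2)^2 - \dfrac{\hat p_1 (1 - \hat p_1)}{n_1 - 1} - \dfrac{\hat p_2 (1 - \hat p_2)}{n_2 - 1}}{\hat p_1 (1 - \hat p_2) + \hat p_2 (1 - \hat p_1)}$$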
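The estimator can be computed directly from two lists of observed 0/1 alleles. The following is a minimal sketch; the function name `hudson_fst` and the choice to raise an error in degenerate cases are assumptions of this example, not part of [3].

```python
from fractions import Fraction


def hudson_fst(alleles_1, alleles_2):
    """Hudson estimator of FST from two lists of 0/1 alleles,
    one list per sampled sub-population (illustrative sketch)."""
    n1, n2 = len(alleles_1), len(alleles_2)
    if n1 < 2 or n2 < 2:
        # The n - 1 terms in the numerator need at least two gametes each.
        raise ValueError("each sub-population needs at least two sampled gametes")
    # Sample allele frequencies, kept as exact fractions.
    p1 = Fraction(sum(alleles_1), n1)
    p2 = Fraction(sum(alleles_2), n2)
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        # Happens when both samples are fixed for the same allele.
        raise ValueError("estimator undefined: denominator is zero")
    return numerator / denominator


print(hudson_fst([1, 1, 1, 0], [0, 0, 0, 1]))  # prints 1/5
```

Exact fractions are used only to keep the example easy to verify by hand; floating point works equally well for real data.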
We first show that the denominator is an unbiased estimator of
2Var(AT).
With $\hat p_1$ and $\hat p_2$ independent it follows:
$$E(\hat p_1 (1 - \hat p_2) + \hat p_2 (1 - \hat p_1)) = E(\hat p_1) E(1 - \hat p_2) + E(\hat p_2) E(1 - \hat p_1) = 2p(1 - p) = 2\operatorname{Var}(A_T)$$
We now show the “Hudson” numerator is an unbiased estimator of
2Var(E(AS∣D)). For each j∈{1,2} define:
$$\hat v_j := \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (A_{S,j,i} - \hat p_j)^2$$
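Conditional on $D_j$, $\hat v_j$ is the classic unbiased estimator of variance
[5], and $D_j$ has the same distribution as $D$, so:
$$E(\hat v_j) = E(\operatorname{Var}(A_S \mid D_j)) = E(\operatorname{Var}(A_S \mid D))$$
Since $\operatorname{Var}(E(A_S \mid D)) = \operatorname{Var}(A_T) - E(\operatorname{Var}(A_S \mid D))$ by Theorem 1, it follows
that an unbiased estimator of $2\operatorname{Var}(E(A_S \mid D))$ is
$$\hat p_1 (1 - \hat p_2) + \hat p_2 (1 - \hat p_1) - \hat v_1 - \hat v_2$$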
We now show this is equivalent to the “Hudson” numerator.
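Note that, for each j∈{1,2},
$$\begin{aligned}
\hat p_1 (1 - \hat p_2) + \hat p_2 (1 - \hat p_1) &= \hat p_1 + \hat p_2 - 2 \hat p_1 \hat p_2 \\
\hat v_j &= \hat p_j - \hat p_j^2 + \frac{\hat p_j (1 - \hat p_j)}{n_j - 1} \\
(\hat p_1 - \hat p_2)^2 &= \hat p_1^2 + \hat p_2^2 - 2 \hat p_1 \hat p_2
\end{aligned}$$
so it follows that
$$\hat p_1 (1 - \hat p_2) + \hat p_2 (1 - \hat p_1) - \hat v_1 - \hat v_2 = (\hat p_1 - \hat p_2)^2 - \frac{\hat p_1 (1 - \hat p_1)}{n_1 - 1} - \frac{\hat p_2 (1 - \hat p_2)}{n_2 - 1}$$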
which is the numerator in the “Hudson” estimator in [3].
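Both unbiasedness claims can be checked exactly on a small example by enumerating every possible sample. The sketch below assumes a toy model that is not from the source: D1 and D2 are independent and uniform over the hypothetical list `sub_freqs` of sub-population allele frequencies, and the allele counts are binomial given those frequencies.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n, k, q):
    """Probability of k ones among n independent 0/1 draws with mean q."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)


def expected_hudson_parts(sub_freqs, n1, n2):
    """Exact expectations of the Hudson numerator and denominator under a
    toy model (an assumption of this sketch): D1 and D2 are independent and
    uniform over sub_freqs, with n1 and n2 gametes sampled respectively."""
    qs = [Fraction(q) for q in sub_freqs]
    weight_d = Fraction(1, len(qs)) ** 2  # probability of each (D1, D2) pair
    exp_num = Fraction(0)
    exp_den = Fraction(0)
    for q1, q2 in product(qs, repeat=2):
        for k1, k2 in product(range(n1 + 1), range(n2 + 1)):
            prob = weight_d * binomial_pmf(n1, k1, q1) * binomial_pmf(n2, k2, q2)
            p1, p2 = Fraction(k1, n1), Fraction(k2, n2)
            exp_num += prob * (
                (p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1)
            )
            exp_den += prob * (p1 * (1 - p2) + p2 * (1 - p1))
    return exp_num, exp_den


# With frequencies 1/4 and 3/4: p = 1/2, Var(AT) = 1/4, Var(E(AS|D)) = 1/16.
exp_num, exp_den = expected_hudson_parts([Fraction(1, 4), Fraction(3, 4)], 3, 4)
assert exp_num == 2 * Fraction(1, 16)
assert exp_den == 2 * Fraction(1, 4)
```

The ratio of the two expectations equals FST, here 1/4. The ratio of two unbiased estimators is not itself unbiased in general, so this check says nothing about the bias of the "Hudson" estimator as a whole.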
References
1. Ochoa A, Storey JD. Estimating FST and kinship for arbitrary population structures. Feldman MW, editor. PLOS Genetics. 2021;17: e1009241. doi:10.1371/journal.pgen.1009241
3. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: The impact of rare variants. Genome Research. 2013;23: 1514–1521. doi:10.1101/gr.154831.113
5. DeGroot MH, Schervish MJ. Probability and statistics. 3rd ed. Boston: Addison-Wesley; 2002.