NOTE: A newly published article on $F_{ST}$ [1] looks like a promising source of a better formal definition.

The name $F_{ST}$ was used as early as [2] for a certain measure between populations. In the decades since, the name has taken on slightly different meanings. The recent publication [3] covers a long list of different $F_{ST}$ meanings used in papers over those decades. Most of these papers cover additional concepts that are not required to define $F_{ST}$ or to gain intuition for it, and the mathematical details behind a formal definition are spread across many papers.

This document gives a simple formal definition of $F_{ST}$, equivalent to those in [3] and [4]. The definition also makes clear that $F_{ST}$ is precisely a ratio of variances.

The Definition

Random variables $A_S$, $A_T$ and $D$ model uncertainty for $F_{ST}$:

  • $A_S$ and $A_T$ for the allele found in a random gamete from the “Sub-population” and “Top” population, respectively

  • $D$ for the random descent, drift, or divergence of the “sub-population” from the “top” population

“Top” population can mean ancestral population (as in [3]), or it can mean “total” population (the original meaning in [2]).

Given assumptions

  • $A_S$ and $A_T$ take values of $0$ or $1$

  • $A_T$ and $D$ are independent

  • $\operatorname{E}\!\left(A_S\right) = \operatorname{E}\!\left(A_T\right)$

the definition follows

$$ F_{ST} := \frac{ \operatorname{Var}(\operatorname{E}\!\left(A_S|D\right)) }{ \operatorname{Var}(A_T) } $$
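The definition can be made concrete with a minimal sketch. It assumes a toy model, not taken from the cited papers, in which $D$ takes finitely many values and each value fixes a sub-population allele frequency $\operatorname{E}\!\left(A_S|D=d\right)$. The names `fst_from_drift`, `freqs`, and `weights` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

def fst_from_drift(freqs, weights):
    """F_ST = Var(E(A_S|D)) / Var(A_T) for a finite toy model of D.

    freqs[k]   : E(A_S | D = d_k), the allele frequency after outcome d_k
    weights[k] : P(D = d_k); must sum to 1 (use Fractions for exact sums)
    """
    if sum(weights) != 1:
        raise ValueError("weights must sum to 1")
    # p = E(A_S) = E(A_T) by the third assumption
    p = sum(w * q for q, w in zip(freqs, weights))
    if p in (0, 1):
        raise ValueError("Var(A_T) is zero; F_ST is undefined")
    # Var(E(A_S|D)) = E(E(A_S|D)^2) - p^2
    var_between = sum(w * q * q for q, w in zip(freqs, weights)) - p * p
    # Var(A_T) = p(1 - p) because A_T takes values 0 or 1
    return var_between / (p * (1 - p))

half = Fraction(1, 2)
print(fst_from_drift([Fraction(1, 4), Fraction(3, 4)], [half, half]))  # 1/4
```

In this toy model, identical frequencies for every outcome of $D$ give $F_{ST} = 0$, and outcomes that always fix one allele or the other give $F_{ST} = 1$.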

Convenient Expectations

Expectations of allele variables are allele frequencies, so it is convenient to define

$$ p := \operatorname{E}\!\left(A_T\right) $$

Under the assumptions of $F_{ST}$, the following identities hold:

$$ \begin{aligned} p &= \operatorname{E}\!\left(A_S\right) & p &= \operatorname{E}\!\left(A_T^2\right) \\ \operatorname{E}\!\left(A_S\right) &= \operatorname{E}\!\left(A_S^2\right) & \operatorname{E}\!\left(A_S|D\right) &= \operatorname{E}\!\left(A_S^2|D\right) \end{aligned} $$

and

$$ \operatorname{Var}(A_T) = \operatorname{E}\!\left(A_T^2\right) - \operatorname{E}\!\left(A_T\right)^2 = p - p^2 = p(1-p) $$

$F_{ST}$ as variance explained or uncertainty reduced

In light of the following theorem, $F_{ST}$ can be interpreted as the proportion of allele variance explained by random descent/drift/divergence. Alternatively, it can be interpreted as the proportion of allele uncertainty removed by knowing the descent/drift/divergence.

Theorem 1

$$ \operatorname{Var}(A_T) = \operatorname{Var}(\operatorname{E}\!\left(A_S|D\right)) + \operatorname{E}\!\left(\operatorname{Var}(A_S|D)\right) $$

Proof

$$ \begin{aligned} \operatorname{Var}(\operatorname{E}\!\left(A_S|D\right)) & = \operatorname{E}\!\left(\operatorname{E}\!\left(A_S|D\right)^2\right) - \operatorname{E}\!\left(\operatorname{E}\!\left(A_S|D\right)\right)^2 \\ & = \operatorname{E}\!\left(\operatorname{E}\!\left(A_S|D\right)^2\right) - p^2 \\ \operatorname{E}\!\left(\operatorname{Var}(A_S|D)\right) & = \operatorname{E}\!\left(\operatorname{E}\!\left(A_S^2|D\right) - \operatorname{E}\!\left(A_S|D\right)^2\right) \\ & = p - \operatorname{E}\!\left(\operatorname{E}\!\left(A_S|D\right)^2\right) \end{aligned} $$

Adding the two results, the $\operatorname{E}\!\left(\operatorname{E}\!\left(A_S|D\right)^2\right)$ terms cancel, leaving $p - p^2 = p(1-p) = \operatorname{Var}(A_T)$.
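Theorem 1 can also be checked with exact arithmetic. The sketch below assumes a toy model, introduced only for illustration, in which $D$ has three outcomes with sub-population allele frequencies `freqs` and probabilities `weights`. The function name `variance_decomposition` is hypothetical.

```python
from fractions import Fraction

def variance_decomposition(freqs, weights):
    """Return (Var(A_T), Var(E(A_S|D)), E(Var(A_S|D))) for a finite toy model.

    freqs[k] is E(A_S | D = d_k) and weights[k] is P(D = d_k).
    """
    p = sum(w * q for q, w in zip(freqs, weights))
    total = p * (1 - p)  # Var(A_T) = p(1 - p)
    # Variance of the conditional mean across outcomes of D
    between = sum(w * (q - p) ** 2 for q, w in zip(freqs, weights))
    # Mean of the conditional variance q(1 - q) of a 0/1 variable
    within = sum(w * q * (1 - q) for q, w in zip(freqs, weights))
    return total, between, within

freqs = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
weights = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
total, between, within = variance_decomposition(freqs, weights)
print(total, between, within)  # 1/4 2/25 17/100
assert total == between + within
```

Here `between / total` is $F_{ST}$ for the toy model, which makes the "variance explained" reading explicit.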

Unbiased Estimators

Consider observing two independent random descents/drifts/divergences $D_1$ and $D_2$ under the assumptions for $D$ of $F_{ST}$. Furthermore, for each $j \in \{1, 2\}$, consider observing $n_j$ independent random gametes within the resulting sub-population. Define $n_1 + n_2$ independent observed alleles $A_{S,j,i}$, with $i$ indexing the sampled gametes within the sub-population resulting from $D_j$. For convenience define the following:

$$ \begin{aligned} \hat{p}_1 & := \frac{1}{n_1} \sum_{i=1}^{n_1} A_{S,1,i} & \hat{p}_2 & := \frac{1}{n_2} \sum_{i=1}^{n_2} A_{S,2,i} \end{aligned} $$

The “Hudson” estimator of $F_{ST}$ is defined in [3] as

$$ \frac{ (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} }{ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) } $$
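The estimator above translates directly into code. The following is a minimal sketch, not the implementation from [3]. It assumes the data arrive as allele counts, and the function name `hudson_fst` and its parameters are hypothetical.

```python
from fractions import Fraction

def hudson_fst(k1, n1, k2, n2):
    """Hudson-style F_ST estimate from allele counts.

    k_j : number of sampled gametes in sub-population j carrying allele 1
    n_j : number of sampled gametes in sub-population j (needs n_j >= 2)
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two gametes")
    p1, p2 = Fraction(k1, n1), Fraction(k2, n2)  # sample allele frequencies
    numerator = ((p1 - p2) ** 2
                 - p1 * (1 - p1) / (n1 - 1)
                 - p2 * (1 - p2) / (n2 - 1))
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("both samples are fixed for the same allele")
    return numerator / denominator

print(hudson_fst(2, 10, 8, 10))  # 73/153
```

Exact `Fraction` arithmetic is used so the result can be compared against hand calculations. Floating point would serve equally well for real data.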

We first show that the denominator is an unbiased estimator of $2 \operatorname{Var}(A_T)$. With $\hat{p}_1$ and $\hat{p}_2$ independent it follows:

$$ \begin{aligned} \operatorname{E}\!\left(\hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1)\right) & = \operatorname{E}\!\left(\hat{p}_1\right) \operatorname{E}\!\left(1-\hat{p}_2\right) + \operatorname{E}\!\left(\hat{p}_2\right) \operatorname{E}\!\left(1-\hat{p}_1\right) \\ & = 2 p (1-p) \\ & = 2 \operatorname{Var}(A_T) \end{aligned} $$

We now show the “Hudson” numerator is an unbiased estimator of $2 \operatorname{Var}(\operatorname{E}\!\left(A_S|D\right))$. For each $j \in \{1,2\}$ define:

$$ \hat{v}_j := \frac{1}{n_j-1} \sum_{i=1}^{n_j} (A_{S,j,i} - \hat{p}_j)^2 $$

It follows as a classic unbiased estimator of variance [5] that:

$$ \operatorname{E}\!\left(\hat{v}_j\right) = \operatorname{E}\!\left(\operatorname{Var}(A_S|D_j)\right) $$

Since $\operatorname{Var}(\operatorname{E}\!\left(A_S|D\right)) = \operatorname{Var}(A_T) - \operatorname{E}\!\left(\operatorname{Var}(A_S|D)\right)$ by Theorem 1, and the denominator is an unbiased estimator of $2 \operatorname{Var}(A_T)$, it follows that an unbiased estimator of $2 \operatorname{Var}(\operatorname{E}\!\left(A_S|D\right))$ is

$$ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2 $$

We now show this is equivalent to the “Hudson” numerator. Note that

$$ \begin{aligned} \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) & = \hat{p}_1 + \hat{p}_2 - 2 \hat{p}_1 \hat{p}_2 \\ \hat{v}_j & = \hat{p}_j - \hat{p}_j^2 + \frac{\hat{p}_j (1 - \hat{p}_j)}{n_j-1} \\ (\hat{p}_1 - \hat{p}_2)^2 & = \hat{p}_1^2 + \hat{p}_2^2 - 2 \hat{p}_1 \hat{p}_2 \end{aligned} $$

so it follows that

$$ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2 = (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} $$

which is the numerator in the “Hudson” estimator in [3].
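Both unbiasedness claims can be verified exactly on a small case by enumerating every possible sample. The sketch below assumes a toy model, not taken from [3], in which $D$ has three outcomes and the gametes in each sub-population are independent draws given $D_j$, so the allele counts are binomial. The names `binom_pmf` and `expected_hudson_parts` are hypothetical.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, q):
    """P(K = k) for K ~ Binomial(n, q); exact when q is a Fraction."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

def expected_hudson_parts(freqs, weights, n1, n2):
    """Exact (E(numerator), E(denominator)) of the Hudson estimator."""
    e_num = Fraction(0)
    e_den = Fraction(0)
    for q1, w1 in zip(freqs, weights):        # outcome of D_1
        for q2, w2 in zip(freqs, weights):    # outcome of D_2, independent of D_1
            for k1 in range(n1 + 1):          # allele count in sample 1
                for k2 in range(n2 + 1):      # allele count in sample 2
                    prob = (w1 * w2
                            * binom_pmf(k1, n1, q1)
                            * binom_pmf(k2, n2, q2))
                    p1, p2 = Fraction(k1, n1), Fraction(k2, n2)
                    e_num += prob * ((p1 - p2) ** 2
                                     - p1 * (1 - p1) / (n1 - 1)
                                     - p2 * (1 - p2) / (n2 - 1))
                    e_den += prob * (p1 * (1 - p2) + p2 * (1 - p1))
    return e_num, e_den

freqs = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
weights = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
p = sum(w * q for q, w in zip(freqs, weights))
var_between = sum(w * (q - p) ** 2 for q, w in zip(freqs, weights))

e_num, e_den = expected_hudson_parts(freqs, weights, n1=3, n2=4)
assert e_num == 2 * var_between   # numerator targets 2 Var(E(A_S|D))
assert e_den == 2 * p * (1 - p)   # denominator targets 2 Var(A_T)
```

The check covers the numerator and denominator separately, matching the argument above. It says nothing about the expectation of their ratio, which in general differs from the ratio of expectations.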

References

1. Ochoa A, Storey JD. Estimating FST and kinship for arbitrary population structures. Feldman MW, editor. PLOS Genetics. 2021;17: e1009241. doi:10.1371/journal.pgen.1009241
2. Wright S. The genetical structure of populations. Annals of Eugenics. 1949;15: 323–354. doi:10.1111/j.1469-1809.1949.tb02451.x
3. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: The impact of rare variants. Genome Research. 2013;23: 1514–1521. doi:10.1101/gr.154831.113
4.
5. DeGroot MH, Schervish MJ. Probability and statistics. 3rd ed. Boston: Addison-Wesley; 2002.