WORKING DRAFT

Copredictor Definition

Given any discrete random variables $X$ and $Y$, define a copredictor of $Y$ with $X$ as any random variable $W$ such that
$$\begin{aligned} & \text{$X$ and $W$ are independent, and} \\ & Y = f(X,W) \text{ for some function $f$} \end{aligned}$$
For example, if $X$ and $W$ are independent fair bits and $Y = X \oplus W$, then $W$ is a copredictor of $Y$ with $X$, with $f$ being exclusive-or.

Copredictor Existence

Given any discrete random variables $X$ and $Y$, there exists a copredictor $W$. The range of $W$ is $(\operatorname{rng} X) \mapsto (\operatorname{rng} Y)$, that is, all functions from the finite range of $X$ to the range of $Y$. Consider any $x$, $y$, $w$ from $\operatorname{rng} X$, $\operatorname{rng} Y$, and $(\operatorname{rng} X) \mapsto (\operatorname{rng} Y)$ respectively. Denote
$$S = \{\omega:W(\omega)=w\} \cap \{\omega:X(\omega)=x\} \cap \{\omega:Y(\omega)=y\}$$
Define $W$ such that
$$\operatorname{P}(S) = \operatorname{P}(X=x) \prod_{x' \in \operatorname{rng} X} \operatorname{P}(Y=w(x') \mid X=x')$$
when $w(x) = y$, and otherwise $\operatorname{P}(S) = 0$.
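A minimal numerical sketch of this construction, with toy distributions for $P(X)$ and $P(Y|X)$ assumed purely for illustration: each function $w$ receives probability $\prod_{x'} \operatorname{P}(Y=w(x')|X=x')$, which makes $W$ independent of $X$ while recovering $P(Y=y|X=x)$.

```python
import itertools

# Toy distributions, assumed purely for illustration.
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1},   # P(Y=y | X=0)
               1: {0: 0.2, 1: 0.8}}   # P(Y=y | X=1)
xs, ys = [0, 1], [0, 1]

# Each w : rng(X) -> rng(Y) is encoded as a tuple of outputs indexed by x.
# The construction gives P(W=w) = prod over x' of P(Y=w(x') | X=x').
q = {}
for w in itertools.product(ys, repeat=len(xs)):
    q[w] = 1.0
    for i, xp in enumerate(xs):
        q[w] *= p_y_given_x[xp][w[i]]

# q is a probability distribution over the functions ...
assert abs(sum(q.values()) - 1.0) < 1e-12

# ... and summing q over {w : w(x) = y} recovers P(Y=y | X=x),
# so Y = w(X) has the right conditional law while W is independent of X.
for i, x in enumerate(xs):
    for y in ys:
        total = sum(qw for w, qw in q.items() if w[i] == y)
        assert abs(total - p_y_given_x[x][y]) < 1e-12
```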

Proof TO DO

A Functional Representation Lemma can be found on page 626 of [1]; it is a stronger result than the mere existence of a copredictor. A further Strong Functional Representation Lemma can be found in [2].

Optimal Copredictor

Given any copredictor $W$ of $Y$ with $X$, it follows that
$$\begin{aligned} H(Y) & = I(X,W;Y) + H(Y|X,W) \\ & = I(X,W;Y) + 0 \\ & = I(X;Y) + I(W;Y) - I(X;W;Y) \\ & = I(X;Y) + I(W;Y) + I(X;W|Y) \end{aligned}$$
since $I(X;W;Y) + I(X;W|Y) = I(X;W) = 0$ by the independence of $X$ and $W$.
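A quick numeric check of this decomposition on the XOR example from the definition section (a sketch; the entropy helper and index encoding are my own):

```python
import itertools
import math

# Verify H(Y) = I(X;Y) + I(W;Y) + I(X;W|Y) on a toy copredictor:
# X, W independent fair bits, Y = X XOR W.
joint = {(x, w, x ^ w): 0.25 for x, w in itertools.product([0, 1], repeat=2)}

def H(margin):
    """Entropy of a marginal, selected by a tuple of coordinate indices."""
    probs = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in margin)
        probs[key] = probs.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

X, W, Y = 0, 1, 2  # coordinate indices into the joint outcomes
I_xy = H((X,)) + H((Y,)) - H((X, Y))
I_wy = H((W,)) + H((Y,)) - H((W, Y))
I_xw_given_y = H((X, Y)) + H((W, Y)) - H((X, W, Y)) - H((Y,))
assert abs(H((Y,)) - (I_xy + I_wy + I_xw_given_y)) < 1e-12  # identity holds
print(I_xy, I_wy, I_xw_given_y)  # 0.0 0.0 1.0 for the XOR example
```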

We seek a copredictor that maximizes how much predictive information is provided ‘exclusively’ by the independent variables $X$ and $W$ and not by their interaction $I(X;W|Y)$.

To this end we want the minimum $I(X;W|Y)$ across all possible copredictors $W$ of $Y$ with $X$. This minimum is defined as the excess functional information in [2]. In this document, any copredictor achieving this minimum is considered an optimal copredictor.

The sum of the mutual information and the excess functional information is the conditional entropy of $Y$ given an optimal copredictor $W$ of $Y$ with $X$:

$$\begin{aligned} H(Y|W) & = I(X;Y|W) + H(Y|X,W) \\ & = I(X;Y) - I(X;W;Y) + 0 \\ & = I(X;Y) + I(X;W|Y) \end{aligned}$$

Thus minimizing $I(X;W|Y)$ is equivalent to minimizing $H(Y|W)$: an optimal copredictor is equivalently a copredictor that achieves the minimum $H(Y|W)$.

Linear Subspace of Copredictors

Consider any finite discrete random variables $X$ and $Y$. Let $F$ denote $(\operatorname{rng} X) \mapsto (\operatorname{rng} Y)$, the set of all functions from the range of $X$ to the range of $Y$.

The following linear subspace of copredictors captures all of the possible values of $H(Y|W)$ and $I(X;W|Y)$. Consider all copredictors $W$ of $Y$ with $X$ where the range of $W$ is $F$, and for each $w \in F$, $x \in \operatorname{rng} X$, $y \in \operatorname{rng} Y$ with
$$S = \{\omega:W(\omega)=w\} \cap \{\omega:X(\omega)=x\} \cap \{\omega:Y(\omega)=y\}$$
the following holds
$$\operatorname{P}(S) = \begin{cases} \operatorname{P}(W=w) \operatorname{P}(X=x) & \text{if } w(x) = y \\ 0 & \text{otherwise} \end{cases}$$
This factorization makes $W$ independent of $X$ with $Y = w(X)$, matching the derivation of $\operatorname{P}(Y=y|W=w)$ below.

Each possible copredictor $W$ defines a point in the space $F \mapsto \mathbb{R}$ where $q_w := P(W=w)$ is the coordinate along the dimension $w \in F$.

This is the linear subspace of possible $q$ defined by the linear constraints:
$$P(Y=y|X=x) = \sum_{\substack{w \in F \\ w(x)=y}} q_w$$
for all $y \in \operatorname{rng} Y$ and $x \in \operatorname{rng} X$, and
$$q_w \ge 0$$
for all $w \in F$. (Normalization $\sum_{w \in F} q_w = 1$ follows by summing the first set of constraints over $y$ for any fixed $x$.)
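These constraints can be written as a matrix system $Aq = b$. A sketch with the same assumed toy distributions as above, checking that the product construction from the existence section is one feasible point of the subspace:

```python
import itertools
import numpy as np

# Toy distributions, assumed purely for illustration (as above).
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
xs, ys = [0, 1], [0, 1]

# F: all functions rng(X) -> rng(Y), encoded as output tuples indexed by x.
F = list(itertools.product(ys, repeat=len(xs)))

# One equality constraint per pair (x, y): sum of q_w over {w : w(x) = y}.
pairs = [(i, x, y) for i, x in enumerate(xs) for y in ys]
A = np.array([[1.0 if w[i] == y else 0.0 for w in F] for i, x, y in pairs])
b = np.array([p_y_given_x[x][y] for i, x, y in pairs])

# The product construction from the existence proof is one feasible q.
q = np.array([np.prod([p_y_given_x[x][w[i]] for i, x in enumerate(xs)]) for w in F])
assert np.allclose(A @ q, b) and (q >= 0).all()
```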

We show that $H(Y|W)$ is a linear function in terms of $q_w$ across $w \in F$. Because of this, minimizing $H(Y|W)$ is minimizing a linear objective function in the subspace.

For any $w \in F$ and $y \in \operatorname{rng} Y$,
$$\begin{aligned} \operatorname{P}(Y=y|W=w) & = \frac{\operatorname{P}(Y=y,W=w)}{\operatorname{P}(W=w)} \\ & = \sum_{x \in \operatorname{rng} X} \frac{\operatorname{P}(Y=y,X=x,W=w)}{\operatorname{P}(W=w)} \\ & = \sum_{\substack{x \in \operatorname{rng} X \\ w(x)=y}} \frac{\operatorname{P}(X=x) \operatorname{P}(W=w)}{\operatorname{P}(W=w)} \\ & = \sum_{\substack{x \in \operatorname{rng} X \\ w(x)=y}} \operatorname{P}(X=x) \\ & = \operatorname{P}(X \in w^{-1}(y)) \end{aligned}$$
For any $w \in F$ let
$$m(w) = \sum_{y \in \operatorname{rng} Y} \operatorname{P}(X \in w^{-1}(y)) \log_2 \frac{1}{\operatorname{P}(X \in w^{-1}(y))}$$
Note that $m(w)$ does not depend on the probabilities $\operatorname{P}(W=w)$ of a specific copredictor $W$; the function $m$ depends only on the functions $w \in F$. Combining the previous results shows that $H(Y|W)$ is a linear objective function:
$$\begin{aligned} H(Y|W) & = \sum_{w \in F} P(W=w) \sum_{y \in \operatorname{rng} Y} P(Y=y|W=w) \log_2 \frac{1}{P(Y=y|W=w)} \\ & = \sum_{w \in F} q_w \, m(w) \end{aligned}$$
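In code, $m(w)$ is just the entropy of the pushforward of $P(X)$ through $w$. A sketch with the same assumed toy distributions, evaluating the linear objective at the product-construction copredictor:

```python
import itertools
import math

# Toy distributions, assumed purely for illustration (as above).
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
xs, ys = [0, 1], [0, 1]

def m(w):
    """Entropy of the pushforward of P(X) through w; depends only on w."""
    total = 0.0
    for y in ys:
        p = sum(p_x[x] for i, x in enumerate(xs) if w[i] == y)  # P(X in w^{-1}(y))
        if p > 0:
            total += p * math.log2(1 / p)
    return total

# H(Y|W) = sum_w q_w * m(w), here for the product-construction copredictor.
F = list(itertools.product(ys, repeat=len(xs)))
q = {w: math.prod(p_y_given_x[x][w[i]] for i, x in enumerate(xs)) for w in F}
print(sum(q[w] * m(w) for w in F))  # 0.74 for this example
```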

Any copredictor $Z$ can be simplified such that $H(Y|Z) = H(Y|W)$ for some $W$ in the above linear subspace. Every value $z$ in the range of $Z$ corresponds to some function $w \in F = (\operatorname{rng} X \mapsto \operatorname{rng} Y)$, namely $w(x) = f(x,z)$, since $H(Y|X,Z) = 0$. Let $W$ be a random variable with range $F$ such that $P(W=w)$ equals the sum of the probabilities $\operatorname{P}(Z=z)$ over those $z$ that correspond to $w$.
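A sketch of this collapsing step; `p_z`, `f`, and `xs` are assumed inputs describing an arbitrary copredictor $Z$:

```python
from collections import defaultdict

def collapse(p_z, f, xs):
    """Group the values z of a copredictor Z by the function w they induce
    (w(x) = f(x, z), encoded as an output tuple), accumulating probability."""
    q = defaultdict(float)
    for z, p in p_z.items():
        q[tuple(f(x, z) for x in xs)] += p
    return dict(q)

# Example: Z a fair bit with Y = X XOR Z; z=0 induces the identity function
# on {0, 1} and z=1 induces negation.
print(collapse({0: 0.5, 1: 0.5}, lambda x, z: x ^ z, [0, 1]))
# {(0, 1): 0.5, (1, 0): 0.5}
```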

Linear Optimization

Minimization of the previously identified linear objective function within the linear subspace of copredictors is a standard form linear programming problem [3]. It is a primal problem for which a dual problem exists. Although the primal problem is an optimization over $|Y|^{|X|}$ variables, the dual problem is an optimization over only $|Y| \cdot |X|$ variables.

In matrix notation, the dual linear program is to maximize $b^T r$ under the constraint $A^T r \le m$. The resulting maximum $b^T r$ equals the minimum possible $H(Y|W)$.

The vector $b$ consists of the values $P(Y=y|X=x)$ across values $x \in \operatorname{rng} X$ and $y \in \operatorname{rng} Y$. The vector $m$ consists of the values $m(w)$ across $w \in F$. The matrix expression $A^T r \le m$ represents the linear constraints
$$\sum_{x \in \operatorname{rng} X} r_{(x,w(x))} \le m(w)$$
across all $w \in F$. The maximized $b^T r$ represents
$$\sum_{\substack{x \in \operatorname{rng} X \\ y \in \operatorname{rng} Y}} \operatorname{P}(Y=y|X=x) \, r_{(x,y)}$$
and will equal the minimum possible $H(Y|W)$ across all copredictors $W$.
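A sketch of the whole optimization with the same assumed toy distributions, using SciPy's HiGHS-backed `linprog` (SciPy ≥ 1.7): it solves the primal directly and reads the dual vector $r$ off the equality-constraint marginals; by strong duality the two optimal values coincide.

```python
import itertools
import math
import numpy as np
from scipy.optimize import linprog

# Toy distributions, assumed purely for illustration (as above).
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
xs, ys = [0, 1], [0, 1]

F = list(itertools.product(ys, repeat=len(xs)))
pairs = [(i, x, y) for i, x in enumerate(xs) for y in ys]

# Primal: minimize m^T q subject to A q = b, q >= 0 (|Y|^|X| variables).
A = np.array([[1.0 if w[i] == y else 0.0 for w in F] for i, x, y in pairs])
b = np.array([p_y_given_x[x][y] for i, x, y in pairs])

def m(w):
    probs = [sum(p_x[x] for i, x in enumerate(xs) if w[i] == y) for y in ys]
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

res = linprog(c=[m(w) for w in F], A_eq=A, b_eq=b, bounds=(0, None))
print("min H(Y|W) =", res.fun)              # 0.7 for this example
print("dual r     =", res.eqlin.marginals)  # one r_(x,y) per constraint
```

Note that the product-construction copredictor gave $H(Y|W) = 0.74$ above, strictly above this minimum of $0.7$, so the existence construction is feasible but not optimal here.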

References

1. El Gamal A, Kim Y-H. Network information theory. Cambridge; New York: Cambridge University Press; 2011.
2. Li CT, El Gamal A. Strong functional representation lemma and applications to coding theorems. 2017 IEEE International Symposium on Information Theory (ISIT). Aachen, Germany: IEEE; 2017. pp. 589–593. doi:10.1109/ISIT.2017.8006596
3. Cornuéjols G, Tütüncü R. Optimization methods in finance. Cambridge, UK; New York: Cambridge University Press; 2007.