WORKING DRAFT
Copredictor Definition
Given any discrete random variables X and Y, define a copredictor of Y
with X as any random variable W such that
X and W are independent and Y = f(X, W) for some function f.
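As a minimal numeric illustration of the definition (the distribution here is my own example, not from the draft): let X and W be independent fair bits and Y = f(X, W) = X XOR W. Then W is a copredictor of Y with X, even though X alone carries no information about Y.

```python
from itertools import product

# Joint distribution P(X=x, W=w): independent fair bits.
p = {(x, w): 0.25 for x, w in product([0, 1], repeat=2)}

def f(x, w):
    return x ^ w  # the function f in Y = f(X, W)

# Marginals of X and W.
px = {x: sum(v for (x2, _), v in p.items() if x2 == x) for x in [0, 1]}
pw = {w: sum(v for (_, w2), v in p.items() if w2 == w) for w in [0, 1]}

# Check independence: P(X=x, W=w) = P(X=x) * P(W=w) for every pair.
assert all(abs(p[x, w] - px[x] * pw[w]) < 1e-12 for (x, w) in p)

# Y = f(X, W) is itself a fair bit.
py = {y: sum(v for (x, w), v in p.items() if f(x, w) == y) for y in [0, 1]}
print(py)
```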
Copredictor Existence
Given any discrete random variables X and Y, there exists a
copredictor W. The range of W is (rngX)↦(rngY), that is
all functions from the finite range of X to the range of Y. Consider any
x, y, w
from rngX, rngY, and (rngX)↦(rngY) respectively.
Denote
S={ω:W(ω)=w}∩{ω:X(ω)=x}∩{ω:Y(ω)=y}
Define W such that
P(S) = P(X=x) ∏_{x′∈rngX} P(Y=w(x′) ∣ X=x′)
when w(x)=y, and P(S)=0 otherwise.
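The construction can be checked numerically. Below is a sketch (the joint P(X, Y) is an arbitrary example of my own): W ranges over all functions rngX ↦ rngY, encoded as tuples (w(0), w(1)), with P(W=w, X=x, Y=y) = P(X=x) ∏_{x′} P(Y=w(x′)∣X=x′) when w(x)=y and 0 otherwise. The (X, Y) marginal of this joint recovers the original distribution.

```python
from itertools import product

# Example joint P(X=x, Y=y); any discrete joint works here.
pxy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
xs, ys = [0, 1], [0, 1]
px = {x: sum(pxy[x, y] for y in ys) for x in xs}
py_given_x = {(y, x): pxy[x, y] / px[x] for x in xs for y in ys}

joint = {}  # P(W=w, X=x, Y=y)
for w in product(ys, repeat=len(xs)):   # w encodes a function rngX -> rngY
    pw = 1.0
    for i, x in enumerate(xs):
        pw *= py_given_x[w[i], x]       # P(W=w) = prod_x' P(Y=w(x')|X=x')
    for x in xs:
        joint[w, x, w[x]] = px[x] * pw  # nonzero only when y = w(x)

# The (X, Y) marginal matches the original joint, so Y = w(X) holds with
# W independent of X by construction.
for x in xs:
    for y in ys:
        marg = sum(v for (w, x2, y2), v in joint.items() if x2 == x and y2 == y)
        assert abs(marg - pxy[x, y]) < 1e-12
```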
Proof TO DO
A Functional Representation Lemma can be found on page 626 of
[1]; it is a stronger result than the mere existence of a copredictor.
A further Strong Functional Representation Lemma can be found in [2].
Optimal Copredictor
Given any copredictor W of Y with X, it follows that
H(Y) = I(X,W;Y) + H(Y∣X,W)
     = I(X,W;Y) + 0
     = I(X;Y) + I(W;Y) − I(X;W;Y)
     = I(X;Y) + I(W;Y) + I(X;W∣Y)
since I(X;W;Y) + I(X;W∣Y) = I(X;W) = 0.
We seek a copredictor that maximizes how much predictive information is
provided ‘exclusively’ by independent variables X and W and not their
interaction I(X;W∣Y).
To this end we want the minimum I(X;W∣Y) across all possible copredictors W
of Y with X. This minimum is defined as the excess functional information
in [2]. In this document, any copredictor achieving this minimum
is considered an optimal copredictor.
For any copredictor W of Y with X, the conditional entropy of Y given W decomposes as
H(Y∣W) = I(X;Y∣W) + H(Y∣X,W) = I(X;Y) − I(X;W;Y) + 0 = I(X;Y) + I(X;W∣Y)
so for an optimal copredictor, H(Y∣W) is the sum of the mutual information I(X;Y) and the excess functional information.
Thus minimizing I(X;W∣Y) is equivalent to minimizing H(Y∣W);
an optimal copredictor is equivalently one that achieves the minimum H(Y∣W).
Linear Subspace of Copredictors
Consider any finite discrete random variables X and Y.
Let F denote (rngX)↦(rngY), the set of all functions
from the range of X to the range of Y.
The following linear subspace of copredictors captures all
of the possible values of H(Y∣W) and I(X;W∣Y).
Consider all copredictors W of Y with X where the range of W is F,
and for each w∈F, x∈rngX, y∈rngY with
S={ω:W(ω)=w}∩{ω:X(ω)=x}∩{ω:Y(ω)=y}
the following holds
P(S) = P(W=w) P(X=x) if w(x)=y, and P(S) = 0 otherwise.
Each possible copredictor W defines a point in the subspace F↦R
where q_w := P(W=w) is the coordinate along the dimension w∈F.
This is the linear subspace of possible q defined by the linear constraints:
P(Y=y∣X=x) = ∑_{w∈F, w(x)=y} q_w
for all y∈rngY and x∈rngX, and
q_w ≥ 0
for all w∈F.
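One point of this subspace is always available: the independent-sampling copredictor from the existence construction, whose coordinates are the product probabilities q_w = ∏_{x′} P(Y=w(x′)∣X=x′). The sketch below (with an example P(Y∣X) of my own) checks that this q satisfies the linear constraints.

```python
from itertools import product

# Example conditional distribution P(Y=y | X=x).
py_given_x = {0: {0: 0.6, 1: 0.4},   # P(Y=y | X=0)
              1: {0: 0.2, 1: 0.8}}   # P(Y=y | X=1)
xs, ys = [0, 1], [0, 1]

# q_w = prod_x' P(Y=w(x') | X=x'), with w encoded as (w(0), w(1)).
q = {}
for w in product(ys, repeat=len(xs)):
    qw = 1.0
    for x in xs:
        qw *= py_given_x[x][w[x]]
    q[w] = qw

# Constraint check: sum_{w: w(x)=y} q_w = P(Y=y | X=x), with q_w >= 0.
for x in xs:
    for y in ys:
        lhs = sum(qw for w, qw in q.items() if w[x] == y)
        assert abs(lhs - py_given_x[x][y]) < 1e-12
```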
We show that H(Y∣W) is linear in the coordinates q_w across w∈F.
Because of this, minimizing H(Y∣W) is minimizing a linear objective function
in the subspace.
For any w∈F and y∈rngY,
P(Y=y∣W=w) = P(Y=y, W=w) / P(W=w)
           = ∑_{x∈rngX} P(Y=y, X=x, W=w) / P(W=w)
           = ∑_{x∈rngX, w(x)=y} P(X=x) P(W=w) / P(W=w)
           = ∑_{x∈rngX, w(x)=y} P(X=x)
           = P(X ∈ w⁻¹(y))
For any w∈F let
m(w) = ∑_{y∈rngY} P(X ∈ w⁻¹(y)) log₂ ( 1 / P(X ∈ w⁻¹(y)) )
Note that m(w) does not depend on any probabilities P(W=w) of a specific copredictor W.
The function m only depends on functions w∈F.
Combining the previous results shows that H(Y∣W) is a linear objective function.
H(Y∣W) = ∑_{w∈F} P(W=w) ∑_{y∈rngY} P(Y=y∣W=w) log₂ ( 1 / P(Y=y∣W=w) ) = ∑_{w∈F} q_w m(w)
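This identity can be verified numerically. The sketch below (the feasible q and P(X) are my own example, checked by hand against P(Y=0∣X=0)=0.6 and P(Y=0∣X=1)=0.2) builds the full joint P(W=w, X=x, Y=y) = P(W=w) P(X=x) 1[w(x)=y], computes H(Y∣W) directly from it, and compares with the closed form ∑_w q_w m(w).

```python
import math

# P(X) and a feasible copredictor q (w encoded as (w(0), w(1))).
px = {0: 0.5, 1: 0.5}
xs, ys = [0, 1], [0, 1]
q = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.0, (1, 1): 0.4}

# Joint P(W=w, X=x, Y=y), nonzero only when y = w(x).
joint = {(w, x, w[x]): qw * px[x] for w, qw in q.items() for x in xs}

# Direct computation: H(Y|W) = sum_{w,y} P(W=w, Y=y) log2( P(W=w) / P(W=w, Y=y) ).
h = 0.0
for w, qw in q.items():
    for y in ys:
        pwy = sum(v for (w2, _, y2), v in joint.items() if w2 == w and y2 == y)
        if pwy > 0:
            h += pwy * math.log2(qw / pwy)

def m(w):
    # m(w) = sum_y P(X in w^-1(y)) log2( 1 / P(X in w^-1(y)) )
    total = 0.0
    for y in ys:
        mass = sum(px[x] for x in xs if w[x] == y)
        if mass > 0:
            total -= mass * math.log2(mass)
    return total

# Closed form: the linear objective sum_w q_w m(w).
linear = sum(qw * m(w) for w, qw in q.items())
assert abs(h - linear) < 1e-12
```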
Any copredictor Z can be simplified such that H(Y∣Z)=H(Y∣W)
for some W in the above linear subspace.
Every value z in the range of Z must correspond to some function
w∈F=(rngX↦rngY), since H(Y∣X,Z)=0 means each pair (x, z) determines a unique y.
Let W be a random variable with range F such that P(W=w) equals
the sum of probabilities P(Z=z) for which z corresponds to w.
Linear Optimization
Minimization of the previously identified linear objective function
within the linear subspace of copredictors is a standard form linear
programming problem [3].
It is a primal problem for which a dual problem exists.
Although the primal problem is an optimization over ∣Y∣^∣X∣ variables,
the dual problem is an optimization over only ∣Y∣⋅∣X∣ variables.
In matrix notation, the dual linear program is to maximize bᵀr under
the constraint Aᵀr ≤ m. The resulting maximum bᵀr equals the minimum
possible H(Y∣W).
The vector b consists of the values P(Y=y∣X=x)
across values x∈rngX and y∈rngY.
The vector m consists of the values m(w) across w∈F.
The matrix expression Aᵀr ≤ m represents the linear constraints
∑_{x∈rngX} r(x, w(x)) ≤ m(w)
across all w∈F.
The maximized bᵀr represents
∑_{x∈rngX} ∑_{y∈rngY} P(Y=y∣X=x) r(x, y)
and will equal the minimum possible H(Y∣W) across all copredictors W.
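The primal problem is small enough to solve directly on a toy instance. The sketch below (the joint P(X, Y) is my own example) minimizes ∑_w q_w m(w) subject to the subspace constraints using `scipy.optimize.linprog`; for this instance the minimum H(Y∣W) works out to 0.4 bits.

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog

# Example joint P(X=x, Y=y).
pxy = np.array([[0.3, 0.2],    # P(X=0, Y=0), P(X=0, Y=1)
                [0.1, 0.4]])   # P(X=1, Y=0), P(X=1, Y=1)
px = pxy.sum(axis=1)
py_given_x = pxy / px[:, None]

xs, ys = range(2), range(2)
F = list(product(ys, repeat=len(list(xs))))  # all functions w: rngX -> rngY

def m(w):
    # m(w) = sum_y P(X in w^-1(y)) log2( 1 / P(X in w^-1(y)) )
    total = 0.0
    for y in ys:
        mass = sum(px[x] for x in xs if w[x] == y)
        if mass > 0:
            total -= mass * np.log2(mass)
    return total

# Objective: minimize sum_w q_w m(w).
c = np.array([m(w) for w in F])

# Equality constraints: sum_{w: w(x)=y} q_w = P(Y=y | X=x), one row per (x, y).
A_eq = np.array([[1.0 if w[x] == y else 0.0 for w in F]
                 for x in xs for y in ys])
b_eq = np.array([py_given_x[x, y] for x in xs for y in ys])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("min H(Y|W) =", res.fun)
```

The dual program described above would instead maximize bᵀr over the ∣Y∣⋅∣X∣ variables r(x, y) under Aᵀr ≤ m, reaching the same optimum by LP duality.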
References
1. El Gamal AA, Kim Y-H. Network information theory. Cambridge ; New York: Cambridge University Press; 2011.
2. Li CT, El Gamal A. Strong functional representation lemma and applications to coding theorems. 2017 IEEE International Symposium on Information Theory (ISIT). Aachen, Germany: IEEE; 2017 Jun. pp. 589–593. doi:10.1109/ISIT.2017.8006596
3. Cornuejols G, Tütüncü R. Optimization methods in finance. Cambridge, UK ; New York: Cambridge University Press; 2007.