7 Appendix
Through an empirical exploration of the training behavior of DNNs in the information plane, Shwartz-Ziv and Tishby [18] claimed that (i) an information compression phase is a general feature of DNN training and (ii) this compression is induced by SGD. In this section, we demonstrate how the F-Principle can be used to understand the compression phase.
7.1 Computation of Information
For any random variables U and V with a joint distribution P(u, v): the entropy of U is defined as \(I(U)=-\sum _{u}P(u)\log P(u)\); their mutual information is defined as \(I(U,V)=\sum _{u,v}P(u,v)\log \frac{P(u,v)}{P(u)P(v)}\); the conditional entropy of U given V is defined as
$$ I(U|V)=\sum _{u,v}P(u,v)\log \frac{P(v)}{P(u,v)}=I(U)-I(U,V). $$
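For concreteness, the following sketch (an illustration added here, not part of the original computation) evaluates these three quantities for a small, arbitrary discrete joint distribution and checks the identity \(I(U|V)=I(U)-I(U,V)\):

```python
import numpy as np

# Example joint distribution P(u, v); rows index u, columns index v (arbitrary values).
P_uv = np.array([[0.2, 0.1],
                 [0.1, 0.6]])

P_u = P_uv.sum(axis=1)  # marginal P(u)
P_v = P_uv.sum(axis=0)  # marginal P(v)

# Entropy I(U) = -sum_u P(u) log P(u)
I_U = -np.sum(P_u * np.log(P_u))

# Mutual information I(U, V) = sum_{u,v} P(u,v) log[ P(u,v) / (P(u) P(v)) ]
I_UV = np.sum(P_uv * np.log(P_uv / np.outer(P_u, P_v)))

# Conditional entropy I(U|V) = sum_{u,v} P(u,v) log[ P(v) / P(u,v) ]
I_U_given_V = np.sum(P_uv * np.log(P_v[None, :] / P_uv))

assert np.isclose(I_U_given_V, I_U - I_UV)  # the identity above
```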
By the construction of the DNN, its output T is a deterministic function of its input X; thus, \(I(T|X)=0\) and \(I(X,T)=I(T)\). To compute entropy numerically, we evenly bin X, Y, T to \(X_{b}\), \(Y_{b}\), \(T_{b}\) with bin size b as follows. For any value v, its binned value is defined as \(v_{b}=\mathrm{Round}(v/b)\times b\). In our work, I(T) and I(Y, T) are approximated by \(I(T_{b})\) and \(I(Y_{b},T_{b})\), respectively, with \(b=0.05\). Note that, after binning, one value of \(X_{b}\) may map to multiple values of \(T_{b}\). Thus, \(I(T_{b}|X_{b})\ne 0\) and \(I(X_{b},T_{b})\ne I(T_{b})\). The difference vanishes as the bin size shrinks. Therefore, with a small bin size, \(I(T_{b})\) is a good approximation of I(X, T). In experiments, we also find that \(I(X_{b},T_{b})\) and \(I(T_{b})\) behave almost the same in the information plane for the default bin size \(b=0.05\).
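A minimal sketch of this binning procedure follows, where the entropies are estimated from the empirical frequencies of the binned values; the names `x`, `y`, `t`, and `dnn` in the usage comment are hypothetical placeholders for the input samples, target values, DNN outputs, and trained network:

```python
import numpy as np
from collections import Counter

def binned(v, b=0.05):
    """Bin values as v_b = Round(v / b) * b."""
    return np.round(np.asarray(v) / b) * b

def entropy(samples):
    """Empirical entropy; each row of `samples` is treated as one value."""
    a = np.asarray(samples).reshape(len(samples), -1)
    counts = Counter(map(tuple, a))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def mutual_information(u, v):
    """I(U, V) = I(U) + I(V) - I((U, V)), with the joint built by concatenation."""
    u = np.asarray(u).reshape(len(u), -1)
    v = np.asarray(v).reshape(len(v), -1)
    return entropy(u) + entropy(v) - entropy(np.hstack([u, v]))

# Hypothetical usage with samples x, y and DNN outputs t = dnn(x):
# t_b, y_b = binned(t), binned(y)
# I_T  = entropy(t_b)                  # approximates I(X, T)
# I_YT = mutual_information(y_b, t_b)  # approximates I(Y, T)
```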
7.2 Compression vs. No Compression in the Information Plane
We demonstrate how compression can appear or disappear by tuning the parameter \(\alpha \) in Eq. (2) with \(f_{0}(x)=x\) for \(x\in [-1,1]\), using full-batch gradient descent (GD) without stochasticity. In our simulations, the DNN fits f(x) well for both \(\alpha =0\) and \(\alpha =0.5\) after training (see Fig. 5a and c). In the information plane, there is no compression phase for I(T) when \(\alpha =0\) (see Fig. 5b). By increasing \(\alpha \) in Eq. (2), we observe that: (i) the fitted function is discretized with only a few possible outputs (see Fig. 5c); (ii) the compression of I(T) appears (see Fig. 5d). For \(\alpha >0\), the behavior in the information plane is similar to previously reported results [18]. To understand why compression happens for \(\alpha >0\), we next focus on the training courses for different \(\alpha \) in the frequency domain.
A key feature of the class of functions described by Eq. (2) is that the dominant low-frequency components of f(x) are the same for different \(\alpha \). By the F-Principle, the DNN first captures those dominant low-frequency components; thus, the training courses for different \(\alpha \) are similar at the beginning, i.e., (i) the DNN output is close to \(f_{0}(x)\) at certain training epochs (blue lines in Fig. 5a and c); (ii) I(T) in the information plane increases rapidly until it reaches a value close to the entropy of \(f_{0}(x)\), i.e., \(I(f_{0}(x))\) (see Fig. 5b and d). For \(\alpha =0\), the target function is \(f_{0}(x)\); therefore, I(T) moves closer and closer to \(I(f_{0}(x))\) during training. For \(\alpha >0\), the entropy of the target function, I(f(x)), is much less than \(I(f_{0}(x))\). In the later stage, when the DNN captures the high-frequency components, its output T converges to the discretized function f(x). Therefore, I(T) decreases from \(I(f_{0}(x))\) to I(f(x)).
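To make this entropy gap concrete, the following sketch compares the binned entropy of the smooth target \(f_{0}(x)=x\) with that of a coarsely discretized version. The rounding-based discretization here is a hypothetical stand-in, since Eq. (2) is not reproduced in this appendix; the point is only that a target with few distinct output values has much lower entropy:

```python
import numpy as np

x = np.linspace(-1, 1, 10001)
f0 = x                             # smooth target f_0(x) = x

# Hypothetical discretization (a stand-in for Eq. (2), not the paper's exact form):
f_disc = np.round(f0 / 0.5) * 0.5  # only a few possible output values

def binned_entropy(v, b=0.05):
    vb = np.round(v / b) * b       # bin as in Sec. 7.1
    _, counts = np.unique(vb, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

print(binned_entropy(f0))      # roughly log(41), the number of occupied bins
print(binned_entropy(f_disc))  # much smaller: only 5 distinct output values
```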
This analysis is also applicable to other functions. As discretization is in general inevitable for classification problems with discrete labels, information compression can often be observed in practice, as described in the previous study [18].